Linux NFS development
From: jehan procaccia <jehan.procaccia@int-evry.fr>
To: Neil Brown <neilb@cse.unsw.edu.au>
Cc: "Lever, Charles" <Charles.Lever@netapp.com>, nfs@lists.sourceforge.net
Subject: Re: async vs. sync
Date: Thu, 25 Nov 2004 00:14:36 +0100
Message-ID: <41A515DC.7010408@int-evry.fr>
In-Reply-To: <16805.2572.79895.275921@cse.unsw.edu.au>

Neil Brown wrote:

>On Wednesday November 24, jehan.procaccia@int-evry.fr wrote:
>  
>
>>However, the tar extraction now goes very fast, but it stops for 1 or 2
>>seconds and then restarts fast -> there are some hangs. Here, with a 16 MB
>>journal, I got 15 hangs of 1-2 seconds; with a 128 MB journal I get only 3
>>hangs, but they last 4 or 5 seconds. During a hang I checked the NFS server
>>with iostat, and disk utilisation goes from a few % to 316 % in the example
>>below (for the 128 MB journal, within the 4-second hangs it goes to 4700 %!)
>>Device:         rrqm/s wrqm/s   r/s    w/s rsec/s  wsec/s  rkB/s   wkB/s avgrq-sz avgqu-sz await svctm  %util
>>/dev/emcpowerl2   0.00 150.67 97.33 224.00 768.00 3018.67 384.00 1509.33    11.78    33.33 19.79  9.83 316.00
>>
>>Maybe it hangs because the journal commits on the SP?
>>
>>    
>>
>
>It hangs because of some clumsy code in ext3 that no-one has bothered
>to fix yet - I had a look once but it was a little beyond the time I
>had to spare.
>
>When information is written to the journal, it stays in memory as well
>and is eventually written out to the main filesystem using normal
>lazy-flushing mechanisms (data is pushed out either due to memory
>pressure or because it has been idle for too long).
>When ext3 wants to add information to the head of the journal, it
>needs to clean up the tail to make space.
>If it finds that the data that was written to the tail is already
>safe in the main filesystem, it just frees up some of the tail and
>starts using it for a new head.
>HOWEVER, if it finds that the data in the tail hasn't made it to the
>main filesystem, it flushes *ALL* of the data in the journal out to
>the main filesystem. (It should only flush some fraction or fixed
>number of blocks or something).  This flushing causes a very
>noticeable pause.  The larger the journal, the less often the flush is
>needed, but the longer the flush lasts for.
>
>There are two ways to avoid this pause.  One I have tested and works
>well.  The other only just occurred to me and I haven't tried.
>
>The untested one involves making the journal larger than main memory.
>If it is that large, then memory pressure should flush out journal
>blocks before the journal wraps back to them, and so the flush should
>never happen.  However such a large journal may cause other problems
>(slow replay) as mentioned in my other email.
>
>The way that works is to adjust the "bdflush" parameters so that data
>is flushed to disk more quickly.  The default is to flush data once it
>is 30 seconds old.  If you reduce that to 5 seconds, the problem goes
>away. 
>
>For 2.4, I put
>vm.bdflush =  30 500 0 0 100 500 60 20 0
>
>in my /etc/sysctl.conf, which is equivalent to running
>  echo   30 500 0 0 100 500 60 20 0 > /proc/sys/vm/bdflush
>  
>

$ uname -r
2.4.21-4.ELsmp
Here is what I had before setting the above:
$ cat /proc/sys/vm/bdflush
50      500     0       0       500     3000    80      50      0

Now the pauses do indeed seem shorter (I saw 12 instead of 15, and
they last less than 1 s):
[root@arvouin Nfs-test]# time tar xvfz /usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real    1m22.504s
user    0m0.898s
sys     0m2.846s

With a 128 MB journal it's even better: I don't see any pauses (before, I had
at least 3, each lasting 4-5 seconds).
[root@arvouin Nfs-test]# time tar xvfz /usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real    0m25.038s
user    0m0.914s
sys     0m2.477s

Very good :-)

Just for the record, so that I am sure how I got that performance, here
are the server's export options (data=journal in /etc/fstab for that FS!):
$ cat /var/lib/nfs/xtab
/mnt/emcpowerm1 arvouin.int-evry.fr(rw,sync,no_wdelay,hide,nocrossmnt,secure,no_root_squash,no_all_squash,subtree_check,secure_locks,no_acl,mapping=identity,anonuid=-2,anongid=-2)
and the client mount options:
[root@arvouin Nfs-test]# cat /proc/mounts
cobra3:/mnt/emcpowerm1 /mnt/cobra3extjournal nfs rw,v3,rsize=8192,wsize=8192,hard,tcp,lock,addr=cobra3 0 0
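
To pull a single option such as rsize or wsize out of a /proc/mounts line
like the one above, something along these lines could be used. This is only
a sketch: `nfs_opt` is a hypothetical helper name, and it assumes the options
field is comma-separated exactly as shown here.

```shell
# Hypothetical helper: print the value of one mount option.
#   $1: option name (e.g. rsize)
#   $2: one line in /proc/mounts format
nfs_opt() {
    # Put every space- or comma-separated token on its own line,
    # then print what follows "name=" for the requested option.
    echo "$2" | tr ' ,' '\n\n' | sed -n "s/^$1=//p"
}

line='cobra3:/mnt/emcpowerm1 /mnt/cobra3extjournal nfs rw,v3,rsize=8192,wsize=8192,hard,tcp,lock,addr=cobra3 0 0'
nfs_opt rsize "$line"   # prints 8192
nfs_opt wsize "$line"   # prints 8192
```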

To be sure of the improvement from the "hack" on /proc/sys/vm/bdflush, I
set it back to the original values:
$ echo 50 500 0 0 500 3000 80 50 0 > /proc/sys/vm/bdflush

and tested again dynamically (without unmounting or remounting anything on either side):

[root@arvouin Nfs-test]# time tar xvfz /usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real    1m19.655s
user    0m0.860s
sys     0m2.612s

The time is longer and the pauses are worse than I thought -> 3 pauses of
approximately 10 to 15 seconds each!
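
The set / test / restore sequence above could be wrapped in a small helper so
that the original values always come back after a run. A minimal sketch:
`with_bdflush` is a hypothetical name, and the parameter file is passed as an
argument so the helper can be tried on an ordinary file before pointing it at
/proc/sys/vm/bdflush (which needs root and a 2.4 kernel).

```shell
# Hypothetical helper: run a command with temporary bdflush-style values,
# then restore whatever the file contained before.
#   $1: parameter file (e.g. /proc/sys/vm/bdflush)
#   $2: temporary values, as one quoted string
#   remaining arguments: the command to run
with_bdflush() {
    file=$1
    new=$2
    shift 2
    old=$(cat "$file")      # remember the current values
    echo "$new" > "$file"   # apply the temporary values
    "$@"                    # run the test command
    status=$?
    echo "$old" > "$file"   # restore the original values
    return $status
}

# Intended use, as root on the server (not executed here):
#   with_bdflush /proc/sys/vm/bdflush "30 500 0 0 100 500 60 20 0" some_test_command
```

Since the tar extraction runs on the client while bdflush lives on the server,
the command passed in would have to be something that waits for, or triggers,
the client-side test.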

So it seems to be very good advice to run
  echo 30 500 0 0 100 500 60 20 0 > /proc/sys/vm/bdflush  :-)
However, this is a system-wide setting: will it disturb other devices?
What does each figure here mean? Why were they set to non-optimal
values in the first place?
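
On the meaning of the figures, at least the two time-related ones can be
read off with a sketch like the following. It rests on assumptions that
should be checked against Documentation/sysctl/vm.txt in the 2.4 kernel tree:
that field 5 is the interval between periodic flushes and field 6 is the age
at which a dirty buffer gets flushed (the "30 seconds" quoted above), both
counted in timer ticks, and that the kernel runs at HZ=100. `bdflush_times`
is a hypothetical helper name.

```shell
# Hypothetical helper: show fields 5 and 6 of a bdflush line in seconds.
#   $1: nine integers, in the format of /proc/sys/vm/bdflush
#   $2: timer ticks per second (HZ); 100 is assumed if omitted
bdflush_times() {
    hz=${2:-100}
    set -- $1   # split the line into positional parameters $1..$9
    # Integer division: fractions of a second are dropped.
    echo "interval=$(( $5 / hz ))s age_buffer=$(( $6 / hz ))s"
}

bdflush_times "50 500 0 0 500 3000 80 50 0"   # prints interval=5s age_buffer=30s
bdflush_times "30 500 0 0 100 500 60 20 0"    # prints interval=1s age_buffer=5s
```

Under those assumptions, the suggested values shorten the buffer age from 30
seconds to 5, which matches the description of the fix given above.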

PS: a different optimisation:
I've read this: "the maximum block size is
defined by the value of the kernel constant *NFSSVC_MAXBLKSIZE*,
found in the Linux kernel source file ./include/linux/nfsd/const.h".
Is there a way to change my current 8K buffer size to 32K without
recompiling the kernel?

thanks.


>For 2.6, I assume you would
>   echo 500 > /proc/sys/vm/dirty_expire_centisecs
>but I haven't tested this.
>
>
>  
>
>>Well, finally, is it safer in terms of performance to externalize the
>>journal than to use an async export?
>>    
>>
>
>Absolutely, providing you trust the hardware that you are storing your
>journal on.
>An external journal is perfectly safe.
>async export is not.
>
>NeilBrown
>  
>



