From: Daniel Pocock <daniel@pocock.com.au>
To: "Myklebust, Trond" <Trond.Myklebust@netapp.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: extremely slow nfs when sync enabled
Date: Sun, 06 May 2012 22:12:28 +0000 [thread overview]
Message-ID: <4FA6F74C.7000505@pocock.com.au> (raw)
In-Reply-To: <1336340993.2600.11.camel@lade.trondhjem.org>
On 06/05/12 21:49, Myklebust, Trond wrote:
> On Sun, 2012-05-06 at 21:23 +0000, Daniel Pocock wrote:
>>
>> On 06/05/12 18:23, Myklebust, Trond wrote:
>>> On Sun, 2012-05-06 at 03:00 +0000, Daniel Pocock wrote:
>>>>
>>>> I've been observing some very slow nfs write performance when the server
>>>> has `sync' in /etc/exports
>>>>
>>>> I want to avoid using async, but I have tested it and on my gigabit
>>>> network, it gives almost the same speed as if I was on the server
>>>> itself. (e.g. 30MB/sec to one disk, or less than 1MB/sec to the same
>>>> disk over NFS with `sync')
>>>>
>>>> I'm using Debian 6 with 2.6.38 kernels on client and server, NFSv3
>>>>
>>>> I've also tried a client running Debian 7/Linux 3.2.0 with both NFSv3
>>>> and NFSv4, speed is still slow
>>>>
>>>> Looking at iostat on the server, I notice that avgrq-sz = 8 sectors
>>>> (4096 bytes) throughout the write operations
>>>>
>>>> I've tried various tests, e.g. dd a large file, or unpack a tarball with
>>>> many small files, the iostat output is always the same
>>>
>>> Were you using 'conv=sync'?
>>
>> No, it was not using conv=sync, just the vanilla dd:
>>
>> dd if=/dev/zero of=some-fat-file bs=65536 count=65536
>
> Then the results are not comparable.
If I run dd with conv=sync on the server, OS caching still plays a
factor, and write performance just appears misleadingly fast
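For what it's worth, a fairer local comparison would force the data out
of the page cache before dd exits; a sketch like this (file path and
sizes are just examples):

```shell
# Write 1 MiB and fsync before dd returns, so the timing includes the
# flush to disk -- plain dd can finish while everything is still
# sitting in the page cache. Note conv=fsync is the flushing option;
# conv=sync merely pads short input blocks with NULs.
dd if=/dev/zero of=/tmp/ddtest bs=65536 count=16 conv=fsync

# To bypass the page cache entirely (where the filesystem supports
# O_DIRECT):
#   dd if=/dev/zero of=/tmp/ddtest bs=65536 count=16 oflag=direct
```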
>>>> Looking at /proc/mounts on the clients, everything looks good, large
>>>> wsize, tcp:
>>>>
>>>> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.x.x.x,mountvers=3,mountport=58727,mountproto=udp,local_lock=none,addr=192.x.x.x
>>>> 0 0
>>>>
>>>> and
>>>> rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.x.x.x.,minorversion=0,local_lock=none,addr=192.x.x.x 0 0
>>>>
>>>> and in /proc/fs/nfs/exports on the server, I have sync and wdelay:
>>>>
>>>> /nfs4/daniel
>>>> 192.168.1.0/24,192.x.x.x(rw,insecure,root_squash,sync,wdelay,no_subtree_check,uuid=aa2a6f37:9cc94eeb:bcbf983c:d6e041d9,sec=1)
>>>> /home/daniel
>>>> 192.168.1.0/24,192.x.x.x(rw,root_squash,sync,wdelay,no_subtree_check,uuid=aa2a6f37:9cc94eeb:bcbf983c:d6e041d9)
>>>>
>>>> Can anyone suggest anything else? Or is this really the performance hit
>>>> of `sync'?
>>>
>>> It really depends on your disk setup. Particularly when your filesystem
>>> is using barriers (enabled by default on ext4 and xfs), a lot of raid
>>
>> On the server, I've tried both ext3 and ext4, explicitly changing things
>> like data=writeback,barrier=0, but the problem remains
>>
>> The only thing that made it faster was using hdparm -W1 /dev/sd[ab] to
>> enable the write-back cache on the disk
>
> That should in principle be safe to do as long as you are using
> barrier=1.
Ok, so the combination would be:

- enable write-back caching with hdparm
- use ext4 (and not ext3)
- barrier=1, and data=writeback? or which data= mode?

Is there a particular kernel version (on either the client or server
side) that offers more stability with this combination of features?
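To make the question concrete, the server-side mount I have in mind
would look something like this (device and mount point are examples;
whether data=writeback is safe here is exactly what I'm asking --
data=ordered is the ext4 default):

```
# /etc/fstab sketch -- barriers on, journal mode still to be decided:
/dev/vg0/export  /srv/export  ext4  defaults,barrier=1,data=ordered  0  2
```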
There are some other variations of my workflow I can try too; for
example, I've contemplated compiling C++ code on a RAM disk, because I
don't need to keep the hundreds of object files.
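As a sketch of that idea (compiler and paths are examples; /dev/shm is
the tmpfs mount most distributions provide):

```shell
# Keep throwaway object files on tmpfs so their writes never touch the
# sync NFS export; only final artifacts would be copied back.
BUILD_DIR=$(mktemp -d /dev/shm/build.XXXXXX)
echo 'int main(void) { return 0; }' > "$BUILD_DIR/t.c"
cc -c "$BUILD_DIR/t.c" -o "$BUILD_DIR/t.o"    # object file lands in RAM
test -s "$BUILD_DIR/t.o" && echo "object built in RAM"
rm -rf "$BUILD_DIR"
```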
>>> setups really _suck_ at dealing with fsync(). The latter is used every
>>
>> I'm using md RAID1, my setup is like this:
>>
>> 2x 1TB SATA disks ST31000528AS (7200rpm with 32MB cache and NCQ)
>>
>> SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI
>> mode] (rev 40)
>> - not using any of the BIOS softraid stuff
>>
>> Both devices have identical partitioning:
>> 1. 128MB boot
>> 2. md volume (1TB - 128MB)
>>
>> The entire md volume (/dev/md2) is then used as a PV for LVM
>>
>> I do my write tests on a fresh LV with no fragmentation
>>
>>> time the NFS client sends a COMMIT or trunc() instruction, and for
>>> pretty much all file and directory creation operations (you can use
>>> 'nfsstat' to monitor how many such operations the NFS client is sending
>>> as part of your test).
>>
>> I know that my two tests are very different in that way:
>>
>> - dd is just writing one big file, no fsync
>>
>> - unpacking a tarball (or compiling a large C++ project) does a lot of
>> small writes with many fsyncs
>>
>> In both cases, it is slow
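To put numbers on that, per the nfsstat suggestion, I could snapshot
the client op counters around each workload. A small helper like this
(paths and the workload are examples; nfsstat itself comes from
nfs-utils):

```shell
# Show how each NFS op count changed between two `nfsstat -c -3`
# snapshots; a COMMIT count that climbs with every small file points
# at fsync pressure from the sync export.
snapshot_diff() {
    # print only the lines that differ between the two dumps
    diff "$1" "$2" | grep '^[<>]' || true
}
# Intended use (mount point and tarball are examples):
#   nfsstat -c -3 > /tmp/before
#   tar xf big.tar -C /mnt/nfs/scratch
#   nfsstat -c -3 > /tmp/after
#   snapshot_diff /tmp/before /tmp/after
```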
>>
>>> Local disk can get away with doing a lot less fsync(), because the cache
>>> consistency guarantees are different:
>>> * in NFS, the server is allowed to crash or reboot without
>>> affecting the client's view of the filesystem.
>>> * in the local file system, the expectation is that on reboot any
>>> data that was lost won't need to be recovered (the application will
>>> have used fsync() for any data that does need to be persistent).
>>> Only the disk filesystem structures need to be recovered, and
>>> that is done using the journal (or fsck).
>>
>>
>> Is this an intractable problem though?
>>
>> Or do people just work around this, for example, enable async and
>> write-back cache, and then try to manage the risk by adding a UPS and/or
>> battery backed cache to their RAID setup (to reduce the probability of
>> unclean shutdown)?
>
> It all boils down to what kind of consistency guarantees you are
> comfortable living with. The default NFS server setup offers much
> stronger data consistency guarantees than local disk, and is therefore
> likely to be slower when using cheap hardware.
>
I'm keen on consistency: I don't like the idea of corrupting source
code or a whole git repository, for example.

How did you know I'm using cheap hardware? It is an HP MicroServer; I
even got the £100 cash-back cheque:
http://www8.hp.com/uk/en/campaign/focus-for-smb/solution.html#/tab2/

Seriously though, I've worked with some very large arrays in my
business environment, but I use this hardware at home for its low noise
and low heat dissipation rather than to save money, so I would like to
get the most out of it if possible.
Thread overview: 16+ messages
2012-05-06 3:00 extremely slow nfs when sync enabled Daniel Pocock
2012-05-06 18:23 ` Myklebust, Trond
2012-05-06 21:23 ` Daniel Pocock
2012-05-06 21:49 ` Myklebust, Trond
2012-05-06 22:12 ` Daniel Pocock [this message]
2012-05-06 22:12 ` Daniel Pocock
2012-05-06 22:42 ` Myklebust, Trond
2012-05-07 9:19 ` Daniel Pocock
2012-05-07 13:59 ` Daniel Pocock
2012-05-07 17:18 ` J. Bruce Fields
2012-05-08 12:06 ` Daniel Pocock
2012-05-08 12:45 ` J. Bruce Fields
2012-05-08 13:29 ` Myklebust, Trond
2012-05-08 13:43 ` Daniel Pocock
-- strict thread matches above, loose matches on Subject: below --
2012-05-06 9:26 Daniel Pocock
2012-05-06 11:03 ` Daniel Pocock