NFS performance (Currently 2.6.20)

* NFS performance (Currently 2.6.20)
From: Jesper Krogh @ 2008-02-06 10:04 UTC
  To: linux-nfs

Hi.

I'm currently trying to optimize our NFS server. We're running a
cluster setup with a single NFS server and some compute nodes pulling data
from it. The dataset is currently less than 10GB, so it fits in the memory
of the NFS server (confirmed via vmstat 1).
I'm getting around 500 Mbit/s (700 peak) out of the server on a
gigabit link, and the server is CPU-bottlenecked when this happens. The
clients show iowait of around 30-50%.

Is it reasonable to expect to be able to fill a gigabit link in this
scenario? (I'd like to put in a 10Gbit interface, but there is little
point while I have a CPU bottleneck.)
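
As a rough illustration of the numbers above, the following sketch converts the reported rates into link utilization and MB/s. It assumes a nominal 1000 Mbit/s link and ignores protocol overhead; the function names are made up for this example.

```python
# Rough arithmetic only: nominal link rate, no Ethernet/IP/RPC overhead.
LINK_MBIT = 1000  # assumed nominal gigabit rate in Mbit/s


def link_utilization(rate_mbit, link_mbit=LINK_MBIT):
    """Fraction of the nominal link rate that is in use."""
    return rate_mbit / link_mbit


def mbit_to_mbyte(rate_mbit):
    """Convert Mbit/s to MB/s (8 bits per byte)."""
    return rate_mbit / 8


for rate in (500, 700):  # typical and peak rates reported in this message
    print(f"{rate} Mbit/s = {mbit_to_mbyte(rate):.1f} MB/s, "
          f"{link_utilization(rate):.0%} of the link")
```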

Should I go for NFSv2 (the default if I don't change the mount options),
NFSv3, or NFSv4?

The NFSv3 default mount options are around 1MB for rsize and wsize, but the
nfs man page suggests setting them "up to" around 32K.

I probably only need some pointers to the documentation.

Thanks.
-- 
Jesper Krogh


* Re: NFS performance (Currently 2.6.20)
From: Gabriel Barazer @ 2008-02-06 14:37 UTC
  To: Jesper Krogh; +Cc: linux-nfs

Hi,

On 02/06/2008 11:04:34 AM +0100, "Jesper Krogh" wrote:
> Hi.
> 
> I'm currently trying to optimize our NFS server. We're running in a
> cluster setup with a single NFS server and some compute nodes pulling data
> from it. Currently the dataset is less than 10GB so it fits in memory of
> the NFS-server. (confirmed via vmstat 1).
> Currently I'm  getting around 500mbit (700 peak) of the server on a
> gigabit link and the server is CPU-bottlenecked when this happens. Clients
> having iowait around 30-50%.

I have a similar setup, and I'm very curious how you can read an
"iowait" value from the clients. On my nodes (server 2.6.21.5/clients
2.6.23.14), the iowait counter is only incremented when dealing with
block devices, and since my nodes are diskless my iowait is near 0%.

Maybe I'm wrong, but when the NFS server lags, it is my system
counter that increases (peaking at 30% system instead of 5-10%).
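
For anyone who wants to compare the two counters directly, here is a minimal sketch that computes the iowait and system shares from two samples of the aggregate "cpu" line in /proc/stat. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the documented one; the helper names and the sample values are invented for illustration.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:1 + len(CPU_FIELDS)]]
    return dict(zip(CPU_FIELDS, values))


def share_percent(before, after, field):
    """Percentage of elapsed CPU time spent in `field` between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas[field] / total if total else 0.0


# Invented sample values; on a real node, read the first line of
# /proc/stat twice, a second or so apart.
first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0")
second = parse_cpu_line("cpu 110 0 55 850 85 0 0 0")
print("iowait %:", share_percent(first, second, "iowait"))
print("system %:", share_percent(first, second, "system"))
```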

> Is it reasonable to expect to be able to fill a gigabit link in this
> scenario? (I'd like to put in a 10Gbit interface, but when I have a
> cpu-bottleneck)

I'm sure this is possible, but it depends heavily on the kind of
traffic you have. If you only pull data (which in theory
never invalidates the page cache on the server), and you use options
like 'noatime,nodiratime' to avoid updating the access times, it
seems possible to me. But maybe your CPU is busy doing something other
than handling NFS traffic. Maybe you should change your network
controller? I use the Intel Gigabit ones (integrated ESB2 with the e1000
driver) with rx-polling and Intel I/OAT (DMA engine) enabled, and this
really helps by reducing interrupts when dealing with a lot of traffic.

You will have to check whether your kernel has IOAT enabled in the "DMA
engines" section.
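
One way to do that check without opening menuconfig is to look the option up in the kernel's .config text. The sketch below is only an illustration: the helper name is made up, and the option names in the sample (CONFIG_DMA_ENGINE, CONFIG_INTEL_IOATDMA, CONFIG_NET_DMA) are my assumption of what the "DMA engines" section contains on kernels of this era, so verify them against your own tree.

```python
def kernel_config_value(config_text, option):
    """Return 'y', 'm', 'n' or another value for `option`, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "n"
    return None


# Sample text standing in for the contents of a kernel .config file.
sample_config = """\
CONFIG_DMA_ENGINE=y
CONFIG_INTEL_IOATDMA=m
# CONFIG_NET_DMA is not set
"""

for name in ("CONFIG_DMA_ENGINE", "CONFIG_INTEL_IOATDMA", "CONFIG_NET_DMA"):
    print(name, "->", kernel_config_value(sample_config, name))
```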

> 
> Should I go for NFSv2 (default if I dont change mount options) NFSv3 ? or
> NFSv4

NFSv2/3 have nearly the same performance, and NFSv4 takes a slight
performance hit, probably because it is still young: it's too early to work
on performance while features are not completely stable.

> 
> NFSv3 default mount options is around 1MB for rsize and wsize, but reading
> the nfs-man page, they suggest setting them "up to" around 32K.

The values for the rsize and wsize mount options depend on the amount of
memory you have (on the server, AFAIK), and when you have >4GB the values
are not very realistic anymore. On my systems I have the default
rsize/wsize set to 512KB and all is running fine, but I'm sure there is
some work to be done to size the buffers more precisely when
dealing with large amounts of memory (e.g. a 1MB buffer is nonsense).
The 32k value is a very old one, and the man page doesn't even explain
the memory-related rsize/wsize values.
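
Since the negotiated values can differ from what was requested, it may help to read the effective rsize/wsize back from the mount table. This is a small sketch that parses text in /proc/mounts format; the function name and the sample mount line are invented for the example.

```python
def nfs_transfer_sizes(mounts_text):
    """Map each NFS mount point to its (rsize, wsize) pair in bytes."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        rsize, wsize = opts.get("rsize"), opts.get("wsize")
        sizes[fields[1]] = (int(rsize) if rsize else None,
                            int(wsize) if wsize else None)
    return sizes


# Invented sample; on a client, pass the contents of /proc/mounts instead.
sample_mounts = (
    "server:/export /data nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp 0 0\n"
    "/dev/sda1 / ext3 rw 0 0\n"
)
print(nfs_transfer_sizes(sample_mounts))
```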

> 
> I probably only need some pointers to the documentation.

And the documentation probably needs a refresh, but things are
changing nearly every week here...

Gabriel


* Re: NFS performance (Currently 2.6.20)
From: Trond Myklebust @ 2008-02-06 15:18 UTC
  To: Gabriel Barazer; +Cc: Jesper Krogh, linux-nfs


On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:

> > 
> > Should I go for NFSv2 (default if I dont change mount options) NFSv3 ? or
> > NFSv4
> 
> NFSv2/3 have nearly the same performance

Only if you shoot yourself in the foot by setting the 'async' flag
in /etc/exports. Don't do that...

Most people will want to use NFSv3 for performance reasons. Unlike NFSv2
with 'async', NFSv3 with the 'sync' export flag set actually does _safe_
server-side caching of writes.
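
To spot exports that carry the 'async' flag, something like the sketch below could be run over the text of /etc/exports. It is a simplified parser written for illustration only (it does not handle quoted paths or continuation lines), and the function name and sample entries are invented.

```python
import re


def async_exports(exports_text):
    """Return (path, client) pairs whose option list contains 'async'."""
    flagged = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            match = re.fullmatch(r"([^()]*)\((.*)\)", client)
            if not match:
                continue  # client listed without an option list
            host, options = match.group(1), match.group(2).split(",")
            if "async" in options:
                flagged.append((path, host))
    return flagged


# Invented sample entries in /etc/exports format.
sample_exports = """\
/data 10.0.0.0/24(ro,async) node1(rw,sync)
# scratch space
/home *(rw,sync,no_subtree_check)
"""
print(async_exports(sample_exports))
```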

Trond



* Re: NFS performance (Currently 2.6.20)
From: Jesper Krogh @ 2008-02-06 15:59 UTC
  To: Gabriel Barazer; +Cc: linux-nfs

> Hi,
>> I'm currently trying to optimize our NFS server. We're running in a
>> cluster setup with a single NFS server and some compute nodes pulling
>> data from it. Currently the dataset is less than 10GB so it fits in
>> memory of the NFS-server. (confirmed via vmstat 1). Currently I'm
>> getting around 500mbit (700 peak) of the server on a gigabit link and
>> the server is CPU-bottlenecked when this happens. Clients having iowait
>> around 30-50%.
>
> I have a similar setup, and I'm very curious on how you can read an
> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
> 2.6.23.14), the iowait counter is only incremented when dealing with
> block devices, and since my nodes are diskless my iowait is near 0%.

Output in top is like this:
top - 16:51:01 up 119 days,  6:10,  1 user,  load average: 2.09, 2.00, 1.41
Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 50.0%id, 49.8%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2060188k total,  2047488k used,    12700k free,     2988k buffers
Swap:  4200988k total,    42776k used,  4158212k free,  1985500k cached

>> Is it reasonable to expect to be able to fill a gigabit link in this
>> scenario? (I'd like to put in a 10Gbit interface, but when I have a
>> cpu-bottleneck)
>
> I'm sure this is possible, but it is very dependant on which kind of
> traffic you have. If you have only data to pull (which theoretically never
> invalidate the page cache on the server), and you have options like
> 'noatime,nodiratime' to avoid nfs updating the access times, it
> seems possible to me. But maybe your CPU is busy doing something else than
> only computing NFS traffic. Maybe you should change your network
> controller ? I use the Intel Gigabit ones (integrated ESB2 with e1000
> driver) with rx-polling and Intel I/OAT enabled (DMA engine), and this
> really helps by reducing interrupts when dealing with a lot of traffic.

It is a Sun V20Z (dual Opteron). The NIC is:
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
Gigabit Ethernet (rev 03)

Jesper
-- 
Jesper Krogh



* Re: NFS performance (Currently 2.6.20)
From: Gabriel Barazer @ 2008-02-06 18:24 UTC
  To: Trond Myklebust; +Cc: Jesper Krogh, linux-nfs

On 02/06/2008 4:18:16 PM +0100, Trond Myklebust 
<trond.myklebust@fys.uio.no> wrote:
> On Wed, 2008-02-06 at 15:37 +0100, Gabriel Barazer wrote:
> 
>>> Should I go for NFSv2 (default if I dont change mount options) NFSv3 ? or
>>> NFSv4
>> NFSv2/3 have nearly the same performance
> 
> Only if you shoot yourself in the foot by setting the 'async' flag
> in /etc/exports. Don't do that...
> 
> Most people will want to use NFSv3 for performance reasons. Unlike NFSv2
> with 'async', NFSv3 with the 'sync' export flag set actually does _safe_
> server-side caching of writes.
> 

Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man
page should be updated. From the man page, I understood that
although writes aren't committed to the block devices, the server-side
cache is correctly synchronized (but lost if you pull the plug). Thanks
for the explanation. With a large battery-backed write cache on the
server, is there a performance hit when switching from async to sync in
NFSv3?

Off-topic: maybe the warning shown when the 'sync' option is omitted at
export should be removed and only shown when the 'async' option is used? We
really want to warn people before too many feet are shot :-)

To Jesper: I found that using the 'nolock' flag at mount time on the
NFS clients improves performance, but obviously only if you don't need
write locks (and your setup seems to do only intensive reads).
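
As an illustration of what such a client mount could look like, this sketch assembles an option string from the options discussed in this thread (nfsvers, rsize, wsize, nolock, all documented in nfs(5)). The helper name and the default values are only examples, not a recommendation.

```python
def client_mount_options(version=3, rsize=32768, wsize=32768, locking=False):
    """Build a comma-separated NFS mount option string."""
    options = [f"nfsvers={version}", f"rsize={rsize}", f"wsize={wsize}"]
    if not locking:
        # Only safe when no client relies on file locks.
        options.append("nolock")
    return ",".join(options)


print(client_mount_options())
```

The resulting string is what would go after -o on the mount command line or in the options column of /etc/fstab.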

Gabriel


* Re: NFS performance (Currently 2.6.20)
From: Trond Myklebust @ 2008-02-06 18:46 UTC
  To: Gabriel Barazer; +Cc: Jesper Krogh, linux-nfs


On Wed, 2008-02-06 at 19:24 +0100, Gabriel Barazer wrote:
> Oops (tm)! Fortunately I do mostly reads, but maybe the exports(5) man 
> page should be updated. According to the man page, I thought that 
> although writes aren't commited to the block devices, the server-side 
> cache is correctly synchronized (but lost if you pull the plug).

...or if the server crashes for some reason.

> Thanks 
> for the explanation. Having a battery backed large write cache on the 
> server, is there a performance hit when switching from async to sync in 
> NFSv3 ?

The main performance hits occur on operations like create(), mkdir(),
rename() and unlink(), since they are required to be immediately synced to
disk.
IOW: there will be a noticeable overhead when writing lots of small
files.

For large files, the overhead should be minimal, since all writes can be
cached by the server until the close() operation.
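
A crude way to see this difference on a given export is to time the two access patterns from a client. The sketch below is an illustrative micro-benchmark, not a rigorous one: the function names are invented, and it uses a temporary local directory so that it runs anywhere. Point `directory` at a path on the NFS mount to measure the server.

```python
import os
import tempfile
import time


def time_small_files(directory, count, size):
    """Seconds taken to create `count` separate files of `size` bytes each."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(directory, f"small_{i}.dat"), "wb") as handle:
            handle.write(payload)
    return time.perf_counter() - start


def time_one_large_file(directory, count, size):
    """Seconds taken to write the same total amount of data into one file."""
    payload = b"x" * size
    start = time.perf_counter()
    with open(os.path.join(directory, "large.dat"), "wb") as handle:
        for _ in range(count):
            handle.write(payload)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as directory:
    small = time_small_files(directory, count=200, size=4096)
    large = time_one_large_file(directory, count=200, size=4096)
    print(f"200 small files: {small:.4f}s, one large file: {large:.4f}s")
```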

Trond



* Re: NFS performance (Currently 2.6.20)
From: Gabriel Barazer @ 2008-02-06 20:04 UTC
  To: Jesper Krogh; +Cc: linux-nfs

On 02/06/2008 4:59:39 PM +0100, "Jesper Krogh" wrote:

>> I have a similar setup, and I'm very curious on how you can read an
>> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
>> 2.6.23.14), the iowait counter is only incremented when dealing with
>> block devices, and since my nodes are diskless my iowait is near 0%.
> 
> Output in top is like this:
> top - 16:51:01 up 119 days,  6:10,  1 user,  load average: 2.09, 2.00, 1.41
> Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 50.0%id, 49.8%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   2060188k total,  2047488k used,    12700k free,     2988k buffers
> Swap:  4200988k total,    42776k used,  4158212k free,  1985500k cached

You obviously have a block device on your nodes, so I suspect that
something is reading from or writing to it. Looking at how much memory is
used, your system must be constantly swapping. This could explain why your
iowait is so high (if your swap space is a block device or a file on a
block device; you don't use swap over NFS, do you?).

> It is a Sun V20Z (dual Opteron) NIC is:
> 02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704
> Gigabit Ethernet (rev 03)

I don't know if this adapter supports a DMA engine (there is no mention of
one on the Broadcom specs page). I've seen such a technology only in the
Intel I/O Acceleration Technology (I/OAT) implementation, which the
mainline Linux kernel supports, and I have really seen the difference. I
suppose your controllers are integrated on the motherboard?
Another thing that could make a difference: you could compile
your kernel with a lower timer frequency (CONFIG_HZ), such as 100 Hz. This
results in fewer interrupts being processed and higher throughput
(a very rough explanation, I know).
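
To put rough numbers on the CONFIG_HZ point, the sketch below counts timer interrupts per second for a few common HZ values. It assumes a periodic tick on every CPU (no tickless mode) and two CPUs for the dual-Opteron server; both are assumptions for illustration, and the function name is invented.

```python
def timer_interrupts_per_second(hz, cpus):
    """Timer interrupts per second, assuming one periodic tick per CPU."""
    return hz * cpus


CPUS = 2  # assumed: one core per socket on the dual-Opteron server

for hz in (100, 250, 1000):
    print(f"CONFIG_HZ={hz}: {timer_interrupts_per_second(hz, CPUS)} timer interrupts/s")
```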

Gabriel


* Re: NFS performance (Currently 2.6.20)
From: Jesper Krogh @ 2008-02-06 20:24 UTC
  To: Gabriel Barazer; +Cc: linux-nfs

Gabriel Barazer wrote:
> On 02/06/2008 4:59:39 PM +0100, "Jesper Krogh" <jesper-Q2TZfHgGEy4@public.gmane.org> wrote:
> 
>>> I have a similar setup, and I'm very curious on how you can read an
>>> "iowait" value from the clients: On my nodes (server 2.6.21.5/clients
>>> 2.6.23.14), the iowait counter is only incremented when dealing with
>>> block devices, and since my nodes are diskless my iowait is near 0%.
>>
>> Output in top is like this:
>> top - 16:51:01 up 119 days,  6:10,  1 user,  load average: 2.09, 2.00, 1.41
>> Tasks:  74 total,   2 running,  72 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  0.2%us,  0.0%sy,  0.0%ni, 50.0%id, 49.8%wa,  0.0%hi,  0.0%si,  0.0%st
>> Mem:   2060188k total,  2047488k used,    12700k free,     2988k buffers
>> Swap:  4200988k total,    42776k used,  4158212k free,  1985500k cached
> 
> You have obviously a block device on your nodes, so I suspect that 
> something is reading/writing to it. Looking at how much memory is used, 
> your system must be constantly swapping. This could explain why your 
> iowait is so high (if your swap space is a block device or a file on a 
> block device. You don't use swap over NFS do you?)

No swap over NFS and no swapping at all.

A "vmstat 1" output of the above situation looks like:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  0  2  42768  11580   1368 1987336    0    0     0     0  638  366  1  0 50 48
  0  2  42768  13088   1368 1985924    0    0     0     0  695  367  2  1 50 47
  0  2  42768  13028   1368 1986112    0    0     0     0  345  129  0  0 50 50
  1  1  42768  12720   1364 1986328    0    0     0     0 1043  710  6  1 50 42
  0  1  42768  12648   1364 1987308    0    0     0     0  636  374  2  4 50 44
  0  2  42768  11608   1364 1988436    0    0     0     0  696  382  1  0 51 49
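
The rows above can be checked mechanically. This sketch assumes the usual vmstat column order (r b swpd free buff cache si so bi bo in cs us sy id wa), since the second header line is not shown in the output; the helper name is invented.

```python
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat_row(row):
    """Turn one vmstat data row into a dict keyed by column name."""
    values = [int(v) for v in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(f"expected {len(VMSTAT_COLUMNS)} fields, got {len(values)}")
    return dict(zip(VMSTAT_COLUMNS, values))


rows = """\
0  2  42768  11580   1368 1987336    0    0     0     0  638  366  1  0 50 48
0  2  42768  13088   1368 1985924    0    0     0     0  695  367  2  1 50 47
0  2  42768  13028   1368 1986112    0    0     0     0  345  129  0  0 50 50
1  1  42768  12720   1364 1986328    0    0     0     0 1043  710  6  1 50 42
0  1  42768  12648   1364 1987308    0    0     0     0  636  374  2  4 50 44
0  2  42768  11608   1364 1988436    0    0     0     0  696  382  1  0 51 49
"""

samples = [parse_vmstat_row(row) for row in rows.strip().splitlines()]
swapping = any(s["si"] or s["so"] for s in samples)   # swap-in / swap-out activity
block_io = any(s["bi"] or s["bo"] for s in samples)   # block device reads / writes
mean_wa = sum(s["wa"] for s in samples) / len(samples)
print(f"swapping: {swapping}, block I/O: {block_io}, mean iowait: {mean_wa:.1f}%")
```

With si, so, bi and bo all at zero, the iowait in these samples is not coming from swap or from a local block device.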

You can also see in the "top" report that barely any swap is used.
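
The same point can be read off the "Mem:" and "Swap:" lines of the top report with a little arithmetic. The values below are copied from that report (in kB); the function names are invented for this sketch.

```python
# Values in kB, copied from the "Mem:" and "Swap:" lines of the top report.
mem_total, mem_used, buffers = 2060188, 2047488, 2988
swap_total, swap_used, cached = 4200988, 42776, 1985500


def application_memory(used, buffers, cached):
    """Memory in use once buffers and page cache are excluded."""
    return used - buffers - cached


def percent(part, whole):
    """`part` as a percentage of `whole`."""
    return 100.0 * part / whole


app_kb = application_memory(mem_used, buffers, cached)
print(f"memory held by applications: {app_kb} kB "
      f"({percent(app_kb, mem_total):.1f}% of RAM)")
print(f"swap in use: {percent(swap_used, swap_total):.1f}% of swap")
```

Almost all of the "used" memory is page cache, which the kernel can drop without swapping.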

Jesper
-- 
Jesper
