nfs performance: read only/gigE/nolock/1Tb per day


* nfs performance: read only/gigE/nolock/1Tb per day
@ 2002-04-21  3:27 jason andrade
  0 siblings, 0 replies; 12+ messages in thread
From: jason andrade @ 2002-04-21  3:27 UTC (permalink / raw)
  To: nfs

Hi,

This is a bit of a long query - i am happy to post a summary back to the
list if i can get sufficient responses.

I've been trying to sort out NFS issues for the last 12 months due to
the increase in traffic we have at our opensource archive (planetmirror.com).

We started with a RedHat 7.0 deployment and the 2.2 kernel series and moved
onto 2.4 to try to address some performance issues.

We are now using a Redhat 7.2 deployment and have recently upgraded to the
2.4.18-0.22 kernel tree in an effort to deal with NFS lock ups and performance
issues. 

At peak we are pushing between 700 and 1000 gigabytes of traffic daily.  I am not
sure if that's at the upper boundaries of what NFS testing is done at or not.
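
For scale, the daily volume can be turned into an average line rate. The
sketch below assumes decimal gigabytes (10**9 bytes) and traffic spread evenly
over 24 hours, neither of which is stated in the post; `average_rate` is a
name made up for this illustration.

```python
# Convert a daily transfer volume into the average rate it implies.
# Assumptions (not from the post): 1 GB = 10**9 bytes, and the traffic is
# spread evenly over the day. Real peaks will sit above this average.

SECONDS_PER_DAY = 24 * 60 * 60

def average_rate(gigabytes_per_day):
    """Return (MB/s, Mbit/s) averaged over one day."""
    bytes_per_second = gigabytes_per_day * 10**9 / SECONDS_PER_DAY
    return bytes_per_second / 10**6, bytes_per_second * 8 / 10**6

for volume in (700, 1000):
    mbyte, mbit = average_rate(volume)
    print(f"{volume} GB/day = {mbyte:.1f} MB/s = {mbit:.0f} Mbit/s on average")
```

Under those assumptions the average works out to roughly 8 to 12 MB/s, so
any saturation is more likely to show up at peak times than in the daily
average.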

I don't believe there are any back end bandwidth issues from the disk - there
are two QLA2200 HBAs each with 2 LUNs coming from a separate fibrechannel
RAID server (PV650F) with 10 disks in each lun (36G 10,000RPM fibrechannel)

Testing has shown the ability to exceed 50 Mbyte/sec from the disk subsystem.

Some questions/queries:

o we have upgraded our backbone so that the server and all clients have gigE
  cards (previously the server had gigE and the clients had 100Bt) into an
  unmanaged switch on a private NFS backbone (i.e. separate physical interface
  for nfs exports/client mounts from the "outbound" application interface)

  is there any benefit in jumbo packets and setting the MTU to 9000 ?

o we have periodic lockups - these were pretty bad with 2.4.9 or older with
  a lockup almost twice a day.  restarting the NFS subsystem made no difference
  and only a reboot of the server would clear it.

  we have been able to reproduce this with the 2.4.9-31 kernel though it is much
  rarer (once every 2-3 days).

  in an effort to avoid this, we've upgraded to 2.4.18-0.22 redhat rawhide kernel
  and i will monitor it over the next few days to see how it goes.

o we are using read only nfs - are there any optimizations or other tweaks that can
  be done knowing our front end boxes only mount filesystems as read only ?

  i have already turned off NFS locking on the server and client.

o is it possible to change NFS mount size from 1024 to 8192 (especially with GigE) ?
  i have tried this and was seeing slowdowns in nfs access, so reverted back to
  1024 block size.

o is there any benefit of nfs over TCP rather than UDP, when using a local gigE
  switch between server and clients ?  and any benefit in an increased block size
  (16K or 32K) if using tcp ?

o is there an easy way to work out what if any patches by Trond or Neil Brown are
  applied to redhat kernels ?  i'm having a hard time figuring out if i should be
  applying NFS_ALL patches to redhat rawhide trees.  and in particular, Neil has
  a patch that should make a significant performance improvement to SMP NFS servers
  which i'd like to see.  trying to track stuff through bugzilla, the various changelogs
  and manually is proving difficult.

o current config/performance.

  currently, top shows me that "system" uses about 70% of resources on both cpus,
  with the system around 30% idle.  i have 256 nfsds running on the server.  exports
  are ro with no_subtree_check.

  on the client, about 50-60% of cpu is spent in system, with average load around
  10-25.  at times it will spike to 100-200.  the front end box is attempting to
  service > 1000 apache clients and > 250 ftp clients.

  the NFS server filesystems are mounted ext3 with: 
  rw,async,auto,noexec,nosuid,nouser,noatime,nodev

  the NFS clients mount the filesystems with: 
  ro,rsize=1024,nolock,hard,bg,intr,retrans=8,nfsvers=3,timeo=10

cheers,

-jason

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* nfs performance: read only/gigE/nolock/1Tb per day
@ 2002-04-21 13:08 Gavin Woodhatch
  0 siblings, 0 replies; 12+ messages in thread
From: Gavin Woodhatch @ 2002-04-21 13:08 UTC (permalink / raw)
  To: nfs

Hi Jason,

I have seen (as posted before) our linux-nfs boxes receiving 9 - 10
MByte/s over a 100 Mbit network. This was while i was doing some
testing.

It has to be noted that this was a sequential read from a 500 MB file.
When reading the "real" data, i hit about 1 - 2 MByte/s.

In my Setup, i have not seen a great speed increase in using TCP.
I also don't know how good the linux-nfs Server is at that. I am just
using the client and am using a dedicated NAS Server.

The Block sizes on NFSv2 are from 1024 - 8192. With NFSv3 the Max. is
32768. The Blocksize is dependent on the NFS Version, not on TCP or
UDP Transport. I am using a Stock 2.4.17 Kernel with Trond's NFS-All
Patch.

Kind Regards

Gavin Woodhatch

NetZone Ltd.

> is it possible to change NFS mount size from 1024 to 8192 (especially with GigE).
>   i have tried this and was seeing slowdowns in nfs access, so reverted back to
>   1024 block size.

>   the NFS clients mount the filesystems with:
>   ro,rsize=1024,nolock,hard,bg,intr,retrans=8,nfsvers=3,timeo=10


* RE: nfs performance: read only/gigE/nolock/1Tb per day
@ 2002-04-22 14:49 Lever, Charles
  2002-04-22 15:32 ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Lever, Charles @ 2002-04-22 14:49 UTC (permalink / raw)
  To: 'jason andrade'; +Cc: nfs

i'm looking at a similar problem (1K rsize works, but 8K rsize
doesn't behave under load; only a server reboot will fix the
problem).  the environment is also a web server running an
NFS client, but the back-end is a NetApp filer.  the NFS traffic
goes over a private switched 100MB network.

try with NFSv3 and TCP.  my guess is you have a network problem
of some kind that causes packet loss.  this triggers the UDP
timeout/recovery mechanism which will slow you down and maybe
even get the server and client out of sync with each other.
you might also check your GbE settings -- flow control should
be enabled, and make sure both ends of all your links have
identically configured autonegotiation parameters.

(trond- losing sync may be a client problem since it appears to
happen with different server implementations.  what can we do
to get better information about this?)

also, jason, can you post the output of "nfsstat -c" ?

if your network is behaving, r/wsize=8K and jumbo packets over
GbE should work well, as long as you have the CPU power on both
client and server to handle the interrupt load.  before trying
this, though, you should ensure that your network is healthy
with regular frame size.



* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-22 14:49 Lever, Charles
@ 2002-04-22 15:32 ` Trond Myklebust
  2002-04-22 18:52   ` Bogdan Costescu
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2002-04-22 15:32 UTC (permalink / raw)
  To: Lever, Charles; +Cc: 'jason andrade', nfs

>>>>> " " == Charles Lever <Lever> writes:

     > i'm looking at a similar problem (1K rsize works, but 8K rsize
     > doesn't behave under load; only a server reboot will fix the
     > problem).  the environment is also a web server running an NFS
     > client, but the back-end is a NetApp filer.  the NFS traffic
     > goes over a private switched 100MB network.

     > try with NFSv3 and TCP.  my guess is you have a network problem
     > of some kind that causes packet loss.  this triggers the UDP
     > timeout/recovery mechanism which will slow you down and maybe
     > even get the server and client out of sync with each other.
     > you might also check your GbE settings -- flow control should
     > be enabled, and make sure both ends of all your links have
     > identically configured autonegotiation parameters.

     > (trond- losing sync may be a client problem since it appears to
     > happen with different server implementations.  what can we do
     > to get better information about this?)

I'm not sure what you mean by this. There is no 'sync' with UDP: each
packet going down the wire is either a UDP header or a fragment.

What might perhaps be happening is that the cards are somehow getting
messed up due to data flooding. Have you tried playing around with
driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
other interrupt-related variables? (see 'modinfo -p <module>' for the
list of supported parameters)

Cheers,
  Trond


* RE: nfs performance: read only/gigE/nolock/1Tb per day
@ 2002-04-22 16:23 Andrew Ryan
  2002-04-22 18:06 ` Pedro M. Rodrigues
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Ryan @ 2002-04-22 16:23 UTC (permalink / raw)
  To: Lever, Charles; +Cc: 'jason andrade', nfs

Using NFSv3/TCP (with Trond's patches!) is good advice, the performance is 
generally better from my tests, and if UDP is hanging on you, trying TCP 
can't seriously hurt. Note that with NFSv3/TCP, you may experience hangs 
under load as well, as I did, unless you use the latest 2.4.19-pre kernel 
with Trond's patches.

Jason, as to your earlier question about applying Trond's patches to RH 
kernels, the short answer is that yes, you can get them to apply (at least 
the last time I checked, which was the 2.4.9 RPM). But RH already includes 
some NFS patches, so you'd need to remove those and put in Trond's. You 
will need to be comfortable hacking up an RPM specfile and have some 
patience and diligence to get the resulting kernel RPM to build, however. 
And when you're done you won't have a strictly RH kernel, which won't be a 
problem unless you pay for technical support and expect to ever get it. But 
since RH seems to give very little attention to a stable, reliable NFS 
client implementation in their kernels, if you're stuck using NFS on linux, 
it may be your only choice.

andrew


* RE: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-22 16:23 nfs performance: read only/gigE/nolock/1Tb per day Andrew Ryan
@ 2002-04-22 18:06 ` Pedro M. Rodrigues
  0 siblings, 0 replies; 12+ messages in thread
From: Pedro M. Rodrigues @ 2002-04-22 18:06 UTC (permalink / raw)
  To: nfs


   Indeed. The NFS client part of RH kernels is really lacking. They 
work pretty well at NFS serving though, enough for me to have them in 
several servers without complaints. 


/Pedro

On 22 Apr 2002 at 9:23, Andrew Ryan wrote:

> technical support and expect to ever get it. But since RH seems to
> give very little attention to a stable, reliable NFS client
> implementation in their kernels, if you're stuck using NFS on linux,
> it may be your only choice.
> 
> 
> andrew
> 
> 



* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-22 15:32 ` Trond Myklebust
@ 2002-04-22 18:52   ` Bogdan Costescu
  2002-04-23 10:39     ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Bogdan Costescu @ 2002-04-22 18:52 UTC (permalink / raw)
  To: nfs; +Cc: Lever, Charles, 'jason andrade'

On 22 Apr 2002, Trond Myklebust wrote:

> What might perhaps be happening is that the cards are somehow getting
> messed up due to data flooding. Have you tried playing around with
> driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
> other interrupt-related variables? (see 'modinfo -p <module>' for the
> list of supported parameters)

In case of network problems, some more info can be obtained from 
/proc/net/; e.g. /proc/net/dev can give some idea about low level 
(driver) problems, where the most interesting might be "Rx overruns"
(the computer can't process packets as fast as they arrive and has to drop 
them as soon as the Rx ring becomes full - if your network driver has a 
parameter like "max_rx_desc" it should be increased). I don't know how to 
interpret all the data that is there, but either using the source or 
asking the Linux network developers at netdev@oss.sgi.com might help.

"max_interrupt_work" should not be modified unless a message like "ethx: 
Too much work in interrupt!" is logged by the kernel. In some cases, 
increasing "max_interrupt_work" without also increasing the Rx ring size 
would not help...
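
A minimal sketch of reading those counters programmatically, assuming the
usual /proc/net/dev layout (two header lines, then one line per interface
with eight receive counters followed by eight transmit counters). The "fifo"
receive column is taken here to be the overrun counter; the function names
are invented for the example.

```python
# Sketch: pull the receive-side error counters out of /proc/net/dev.
# Assumption: the usual Linux layout, two header lines followed by
# "iface: <8 receive counters> <8 transmit counters>".

RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo",
             "frame", "compressed", "multicast")

def parse_rx_counters(text):
    """Map each interface name to a dict of its receive counters."""
    stats = {}
    for line in text.splitlines()[2:]:        # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        values = [int(v) for v in counters.split()]
        stats[name.strip()] = dict(zip(RX_FIELDS, values[:len(RX_FIELDS)]))
    return stats

def suspicious(stats):
    """Return interfaces whose Rx drop or fifo (overrun) counters are non-zero."""
    return {name: (rx["drop"], rx["fifo"])
            for name, rx in stats.items()
            if rx["drop"] or rx["fifo"]}

if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as f:
            found = suspicious(parse_rx_counters(f.read()))
        for name, (drop, fifo) in found.items():
            print(f"{name}: rx drop={drop} rx fifo/overruns={fifo}")
    except OSError:
        print("/proc/net/dev not available on this system")
```

The counters are cumulative since the interface came up, so what matters is
whether they keep growing between two readings taken under load, not their
absolute value.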

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De


* Re: nfs performance: read only/gigE/nolock/1Tb per day
@ 2002-04-22 21:45 Heflin, Roger A.
  0 siblings, 0 replies; 12+ messages in thread
From: Heflin, Roger A. @ 2002-04-22 21:45 UTC (permalink / raw)
  To: nfs



> Date: Mon, 22 Apr 2002 20:52:23 +0200 (CEST)
> From: Bogdan Costescu <bogdan.costescu@iwr.uni-heidelberg.de>
> To: nfs@lists.sourceforge.net
> cc: "Lever, Charles" <Charles.Lever@netapp.com>,
>    "'jason andrade'" <jason@dstc.edu.au>
> Subject: Re: [NFS] nfs performance: read only/gigE/nolock/1Tb per day
>
> On 22 Apr 2002, Trond Myklebust wrote:
>
> > What might perhaps be happening is that the cards are somehow getting
> > messed up due to data flooding. Have you tried playing around with
> > driver parameters such as 'max_interrupt_work', 'max_rx_desc' and/or
> > other interrupt-related variables? (see 'modinfo -p <module>' for the
> > list of supported parameters)
>
> In case of network problems, some more info can be obtained from
> /proc/net/; e.g. /proc/net/dev can give some idea about low level
> (driver) problems, where the most interesting might be "Rx overruns"
> (the computer can't process packets as fast as they arrive and has to drop
> them as soon as the Rx ring becomes full - if your network driver has a
> parameter like "max_rx_desc" it should be increased). I don't know how to
> interpret all the data that is there, but either using the source or
> asking the Linux network developers at netdev@oss.sgi.com might help.
>
> "max_interrupt_work" should not be modified unless a message like "ethx:
> Too much work in interrupt!" is logged by the kernel. In some cases,
> increasing "max_interrupt_work" without also increasing the Rx ring size
> would not help.
>
	I would suggest using "netstat -s", as its output is a bit easier to read,
	and most (maybe all) of the counters you need to watch are there.
	Errors, timeouts, invalids, and fails rising too quickly are signs of
	problems with the underlying network: packets are getting lost.  I have
	found that if you lose even a small percentage of the packets you will
	take a large speed hit, and it will be quite a bit worse with UDP than
	with TCP.  The larger the UDP datagram, the worse it will be, because
	losing a single fragment means retransmitting the entire datagram.
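
	A rough way to see the effect of datagram size, assuming each
	fragment is lost independently with the same probability (a
	simplification; real loss tends to come in bursts). The fragment
	counts 6 and 23 are about what 8K and 32K datagrams split into at a
	1500-byte MTU, and `datagram_loss` is a name chosen for this sketch.

```python
# Sketch: chance that a fragmented UDP datagram is lost, given a per-packet
# loss rate. Assumption: fragments are lost independently of each other.

def datagram_loss(packet_loss, n_fragments):
    """Probability that at least one of n_fragments goes missing."""
    # The datagram survives only if every single fragment arrives.
    return 1 - (1 - packet_loss) ** n_fragments

# 1 fragment ~ 1K block, 6 ~ 8K block, 23 ~ 32K block at a 1500-byte MTU.
for n in (1, 6, 23):
    print(f"{n:2d} fragments: {datagram_loss(0.01, n):.1%} "
          f"of datagrams lost at 1% packet loss")
```

	With 1% packet loss this gives about 1% lost datagrams for a single
	fragment, about 6% for 6 fragments and about 21% for 23, and every
	lost datagram costs a full retransmit after the RPC timeout.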

				Roger


* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-22 18:52   ` Bogdan Costescu
@ 2002-04-23 10:39     ` Trond Myklebust
  2002-04-23 15:14       ` Bogdan Costescu
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2002-04-23 10:39 UTC (permalink / raw)
  To: Bogdan Costescu; +Cc: nfs, Lever, Charles, 'jason andrade'

>>>>> " " == Bogdan Costescu <bogdan.costescu@iwr.uni-heidelberg.de> writes:

     > "max_interrupt_work" should not be modified unless a message
     > like "ethx: Too much work in interrupt!" is logged by the
     > kernel. In some cases, increasing "max_interrupt_work" without
     > also increasing the Rx ring size would not help...

So what would an avalanche of ICMP Time Exceeded messages usually
indicate as far as the driver/card is concerned?

At the networking level, a single Time Exceeded message means that
some fragment(s) got dropped and/or lost, so some datagram never got
reassembled within /proc/sys/net/ipv4/ipfrag_time seconds
(as per RFC1122).

In the avalanching case that I've sometimes observed, it looks as
if *no* datagrams are getting rebuilt.
IOW: the client is just sitting there sending off ICMP messages, and
never reading the reply. Changing card/driver did not help in the
cases I observed, but shutting down the network, and then bringing it
up again sometimes did. Any suggestions?

Cheers,
  Trond


* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-23 10:39     ` Trond Myklebust
@ 2002-04-23 15:14       ` Bogdan Costescu
  2002-04-23 16:36         ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Bogdan Costescu @ 2002-04-23 15:14 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs, Lever, Charles, 'jason andrade'

On 23 Apr 2002, Trond Myklebust wrote:

> So what would an avalanche of ICMP Time Exceeded messages usually
> indicate as far as the driver/card is concerned?

Many and nothing 8-) As you say, this message is issued when the datagram 
couldn't be reassembled. There can be many low-level (driver/card/switch) 
causes why a packet doesn't make it in time to the destination, these are 
those that I can think of:

1. the server can't send the packet
  1.1 it's slower in producing packets than the NIC can handle -> Tx underrun
	usually associated with bus (PCI) problems.
  1.2 it produces too many packets (usually small ones and for datagram 
	protocols) and the NIC can't send them as fast -> Tx queue full, 
	in extreme cases (5 seconds in most drivers in 2.4 kernels) a Tx 
	timeout occurs.
  1.3 (actually could be included in the previous one) the NIC can't send 
	packets because of network congestion, usually happens on 
	half-duplex links (and mostly with hubs) because of collisions -> 
	Tx queue full, then maybe Tx timeout. Some cards/drivers can
	continue to try sending the packet indefinitely, some can just 
	drop the packet, some stall the transmission path after some number 
	of collisions and resetting it can take some time.
  1.4 link speed mismatch between NIC and hub/switch -> packets are 
	randomly dropped, there are frame errors, etc.
  1.5 the server has interrupt problems (APIC errors) and Tx interrupts 
	can be missed, such that the Tx queue is not emptied in time 
	(with interrupt mitigation) -> Tx timeout.
2. the hub/switch doesn't send the packet
  2.1 dual speed hub/switches have to buffer the packet(s) coming from 
	the fast ports and send them with lower speeds; in some cases this 
	buffer can be filled and packets are dropped.
  2.2 switches that have to deal with oversized (Jumbo) frames and split 
	them in normal (max. 1500 bytes payload) packets. Depending on how 
	well the splitting is handled (usually directly proportional 
	with how much the switch costs), packets can be dropped.
  2.3 switches under broadcast storms act just like hubs, packets can be 
	dropped.
3. the client can't receive the packet
  3.1 the client is too loaded or there are bus (PCI) problems and 
	the CPU cannot process packets as fast as they arrive -> Rx 
	overruns. As soon as the Rx ring is full, packets are dropped by 
	the NIC. If this happens only occasionally, a larger Rx ring helps 
	absorb the peaks.
  3.2 the client has interrupt problems (APIC errors) and it uses 
	Rx interrupt mitigation, such that a missed interrupt doesn't 
	start the processing of the packets. It's less likely than the Tx 
	interrupt mitigation case, because there is usually also a timer 
	based interrupt (in 2.4 only hardware support, in 2.5 also 
	software support from NAPI).
  3.3 the client has interrupt problems which manifest as some device 
	(other than the NIC) keeping interrupts disabled for too long (IDE 
	is one such example). The NIC generates the interrupt, but the 
	driver receives it with delay, such that the Rx ring can be 
	already full and Rx overruns occur. This situation is usually 
	associated with the "Too much work in interrupt" message, as the 
	driver has to process the Rx ring plus maybe some Tx interrupts, 
	media related interrupts, statistics interrupts, etc. (although 
	usually the Rx processing produces the highest number of loops, 
	that's why I included it here and not on the server/Tx side).
  3.4 link speed mismatch between NIC and hub/switch (see 1.4)
4. different fragments take different times to travel
  4.1 a router/switch with higher layer processing somewhere in the middle 
	might delay/drop packets
  4.2 even for computers connected to the same switch, it might happen 
	with channel bonding

Of course, the roles of server and client are here depicted only as 
transmitter and receiver respectively. In a bidirectional protocol, the 
roles alternate.

Again I have to state the obvious: the above situations can happen alone 
or in combination. When they are combined it's much harder to cure all of 
them, as some people say plainly "it just doesn't work" or give up too 
soon in solving them (e.g. "I fixed the link speed autonegotiation 
problem, but I still get dropped packets" which can be related to some 
congestion).

> In the avalanching case that I've sometimes observed, then it looks as
> if *no* datagrams are getting rebuilt.

How big are the datagrams compared with the MTU ? With 32K datagrams over 
Ethernet, you're talking about roughly a full Rx ring worth of packets (32 
is common for the Rx ring size)...
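
The fragment arithmetic behind that estimate can be sketched as follows,
assuming IPv4 with a 20-byte header and no options, an 8-byte UDP header, and
no allowance for the RPC/NFS headers (which add a little more on top).
`fragment_count` is a hypothetical helper, and the 9000-byte MTU is included
only for comparison with the jumbo frame question raised earlier in the thread.

```python
# Sketch: how many Ethernet frames one UDP datagram of a given size becomes.
# Assumptions: IPv4 header of 20 bytes (no options), UDP header of 8 bytes,
# RPC/NFS header overhead ignored.
import math

IP_HEADER = 20
UDP_HEADER = 8
RX_RING = 32   # the "common" Rx ring size mentioned in the text

def fragment_count(datagram_payload, mtu):
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((datagram_payload + UDP_HEADER) / per_fragment)

for mtu in (1500, 9000):
    for size in (8192, 32768):
        n = fragment_count(size, mtu)
        print(f"MTU {mtu}: {size}-byte datagram -> {n} fragments "
              f"({n / RX_RING:.0%} of a {RX_RING}-entry Rx ring)")
```

At a 1500-byte MTU a 32K datagram comes to 23 fragments, so two such
datagrams arriving back to back already exceed a 32-entry ring unless the
host drains it in between.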

> IOW: the client is just sitting there sending off ICMP messages, and
> never reading the reply.

Does the other side see these messages ? If so, are there any response 
messages sent out (but which don't make it back to the client) ?

> Changing card/driver did not help in the
> cases I observed, but shutting down the network, and then bringing it
> up again sometimes did. Any suggestions?

Down/up was on the sending or receiving/reassembling side ?
Shutting down an interface should clear all buffers/queues associated with 
it, so a restart gets a "clean" state. For reassembling, it probably 
means dropping all incomplete datagrams, but I'm not 100% sure, it may get 
more complicated when packets can take different ways between sender and 
receiver.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De







* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-23 15:14       ` Bogdan Costescu
@ 2002-04-23 16:36         ` Trond Myklebust
  2002-04-23 18:16           ` Bogdan Costescu
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2002-04-23 16:36 UTC (permalink / raw)
  To: Bogdan Costescu; +Cc: nfs

>>>>> " " == Bogdan Costescu <bogdan.costescu@iwr.uni-heidelberg.de> writes:

     > How big are the datagrams compared with the MTU ? With 32K
     > datagrams over Ethernet, you're talking about roughly a full Rx
     > ring worth of packets (32 is common for the Rx ring size)...

It was a while ago (I've since mothballed the machine) but I saw
it on a Pentium 90 with only 8k write sizes. 4k was fine, 8k gave
avalanches.

    >> IOW: the client is just sitting there sending off ICMP
    >> messages, and never reading the reply.

     > Does the other side see these messages ? If so, are there any
     > response messages sent out (but which don't make it back to the
     > client) ?

IIRC, yes, and the server was resending the datagrams. From the code,
it looks as if there is no attempt to stop loopback situations
occurring when this goes on:
i.e. resending an ICMP when the server resends a datagram which times
out again appears to be possible. This might be what was happening...

     > Down/up was on the sending or receiving/reassembling side ?

Down/up on the receiving/reassembling side.

Cheers,
  Trond


* Re: nfs performance: read only/gigE/nolock/1Tb per day
  2002-04-23 16:36         ` Trond Myklebust
@ 2002-04-23 18:16           ` Bogdan Costescu
  0 siblings, 0 replies; 12+ messages in thread
From: Bogdan Costescu @ 2002-04-23 18:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs, netdev


[ cc-ed to netdev; the discussion was about receiving bursts of ICMP Time 
Exceeded messages after some large NFS datagrams could not be reassembled; 
sometimes down/up the interface on the receiver/reassembly side cures it ]

On 23 Apr 2002, Trond Myklebust wrote:

>      > How big are the datagrams compared with the MTU ? With 32K
>      > datagrams over Ethernet, you're talking about roughly a full Rx
>      > ring worth of packets (32 is common for the Rx ring size)...
> 
> It was a while ago (I've since mothballed the machine) but I saw
> it on a Pentium 90 with only 8k write sizes. 4k was fine, 8k gave
> avalanches.

IMHO you can't completely eliminate hardware related problems: apart from 
having a slow CPU, some early PCI implementations were buggy (although you 
don't say if it's PCI or ISA and what's the link speed).

>      > Does the other side see these messages ?
> 
> IIRC, yes, and the server was resending the datagrams. From the code,
> it looks as if there is no attempt to stop loopback situations
> occurring when this goes on:
> i.e. resending an ICMP when the server resends a datagram which times
> out again appears to be possible. This might be what was happening...

That's why I cc-ed netdev. My knowledge above the driver level is close to 
non-existent...

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De




