Transient Automount Problems in CentOS 5.3

* Transient Automount Problems in CentOS 5.3
From: Jon Forrest @ 2010-02-04 16:50 UTC
  To: autofs

I have a new cluster running CentOS 5.3.
The cluster uses a Sun 7310 storage server
that provides NFS service over a private
1Gb/s ethernet with 9K jumbo frames to the
cluster.

We've noticed that a number of the compute
nodes sometimes generate the

automount[15023]: umount_autofs_indirect: ask umount returned busy /home

message. When this happens the program running on the
node dies. This has happened between 10 and 20 times.
We're not sure what's going on on a node when this
happens. Most of the time everything is fine and
the home directories are automounted without problem.

I've googled for this problem and I see that other people
have seen it too, but I've never seen a resolution,
especially not for RHEL5.

The auto.master line for this mount is

/home  /etc/auto.home  --timeout=1200 noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768

The network interface configuration is

eth0      Link encap:Ethernet  HWaddr 00:30:48:B9:F6:52
           inet addr:10.1.255.233  Bcast:10.1.255.255  Mask:255.255.0.0
           inet6 addr: fe80::230:48ff:feb9:f652/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
           RX packets:32999308 errors:0 dropped:0 overruns:0 frame:0
           TX packets:27468315 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:24225053296 (22.5 GiB)  TX bytes:73313582546 (68.2 GiB)
           Interrupt:74 Base address:0x2000

The automount is

Linux automount version 5.0.1-0.rc2.102

Directories:
         config dir:     /etc/sysconfig
         maps dir:       /etc
         modules dir:    /usr/lib64/autofs

Compile options:
   DISABLE_MOUNT_LOCKING ENABLE_IGNORE_BUSY_MOUNTS WITH_HESIOD WITH_LDAP
   WITH_SASL

The kernel is 2.6.18-128.1.14.el5.

Any advice on what to do?

Cordially,
-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032

* Re: Transient Automount Problems in CentOS 5.3
From: Carsten Aulbert @ 2010-02-04 16:57 UTC
  To: autofs

Hi Jon,

On Thursday 04 February 2010 17:50:25 Jon Forrest wrote:
> I have a new cluster running CentOS 5.3.
> The cluster uses a Sun 7310 storage server
> that provides NFS service over a private
> 1Gb/s ethernet with 9K jumbo frames to the
> cluster.
> 
> We've noticed that a number of the compute
> nodes sometimes generate the
> 
> automount[15023]: umount_autofs_indirect: ask umount returned busy /home
> 
> message. When this happens the program running on the
> node dies. This has happened between 10 and 20 times.
> We're not sure what's going on on a node when this
> happens. Most of the time everything is fine and
> the home directories are automounted without problem.

Have you ramped up the number of NFS daemons, the backlog settings, etc. on the
Solaris box (/etc/default/nfs)? Is the Solaris NIC running with jumbo frames as
well, and is the switch *really* capable of handling them? Please test with ping.

Cheers

Carsten

* Re: Transient Automount Problems in CentOS 5.3
From: Jon Forrest @ 2010-02-04 18:09 UTC
  To: Carsten Aulbert; +Cc: autofs

On 2/4/2010 8:57 AM, Carsten Aulbert wrote:

> Have you ramped up the number of NFS daemons, backlog stuff etc on the solaris
> box (/etc/default/nfs) and is the Solaris NIC running with jumbo frames as
> well (and the switch is *really* capable of doing it (please test with ping).

First of all, the Sun 7310 is one of those servers
that you "can't" login to. You manage it only
via a Web interface.

I've set the server to use 500 NFS server threads.
Is it necessary to go above that?

The Sun server, the switch, and all the compute nodes
in the cluster claim to support jumbo frames.
Turning them off is something that I could
do, but I'd have to do it globally because this is
a per-interface setting. There's no
way I can think of that would allow me to
keep jumbo frames on for only certain nodes
so that I could run a controlled experiment.

Running ping shows no problems.

I've thought it might be useful to turn off
automounting on a few of the cluster nodes to
see what happens.

One additional thing I've discovered is that the
compute nodes all are showing a bunch of this
error message via dmesg:

eth0: too many iterations (6) in nv_nic_irq.

I've looked this up and it might have something
to do with the problem. The trouble is that I can't
see when these error messages are generated so
I can't try to correlate them with the autofs
problem.
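
One possible way to line the two messages up in time, assuming the
kernel messages are also written to a syslog file such as
/var/log/messages (common on a stock install, but not confirmed in this
thread), is to compare the syslog timestamps. The Python sketch below
does that. The sample lines, the host name, and the 60-second window
are made up for illustration, the function names are hypothetical, and
because traditional syslog timestamps carry no year, one has to be
assumed. Month names are parsed with the current locale, so an English
or C locale is assumed as well.

```python
from datetime import datetime, timedelta

NIC_MARKER = "too many iterations"
AUTOFS_MARKER = "ask umount returned busy"


def parse_time(line: str, year: int) -> datetime:
    """Parse the 'Mon DD HH:MM:SS' prefix of a traditional syslog line."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def correlate(lines, year, window=timedelta(seconds=60)):
    """Return (autofs_time, nearby_nic_times) pairs, one per autofs error."""
    nic_times = [parse_time(l, year) for l in lines if NIC_MARKER in l]
    pairs = []
    for line in lines:
        if AUTOFS_MARKER in line:
            when = parse_time(line, year)
            # Keep only NIC messages logged within the window of this error.
            near = [t for t in nic_times if abs(t - when) <= window]
            pairs.append((when, near))
    return pairs


if __name__ == "__main__":
    # Made-up sample lines in syslog format, not real data from the cluster.
    sample = [
        "Feb  4 10:00:05 node01 kernel: eth0: too many iterations (6) in nv_nic_irq.",
        "Feb  4 10:00:30 node01 automount[15023]: umount_autofs_indirect: ask umount returned busy /home",
        "Feb  4 11:30:00 node01 kernel: eth0: too many iterations (6) in nv_nic_irq.",
    ]
    for when, near in correlate(sample, year=2010):
        print(when, "NIC messages within window:", len(near))
```

An empty list of nearby NIC messages for most automount errors would
suggest the two are unrelated; it would not prove it.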

I'm grasping at straws here.

I appreciate any additional suggestions.

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest@berkeley.edu

* Re: Transient Automount Problems in CentOS 5.3
From: Ian Kent @ 2010-02-05  4:42 UTC
  To: Jon Forrest; +Cc: autofs

On 02/05/2010 12:50 AM, Jon Forrest wrote:
> I have a new cluster running CentOS 5.3.
> The cluster uses a Sun 7310 storage server
> that provides NFS service over a private
> 1Gb/s ethernet with 9K jumbo frames to the
> cluster.
> 
> We've noticed that a number of the compute
> nodes sometimes generate the
> 
> automount[15023]: umount_autofs_indirect: ask umount returned busy /home
> 
> message. When this happens the program running on the
> node dies. This has happened between 10 and 20 times.
> We're not sure what's going on on a node when this
> happens. Most of the time everything is fine and
> the home directories are automounted without problem.
> 
> I've googled for this problem and I see that other people
> have seen it too, but I've never seen a resolution,
> especially not for RHEL5.
> 
> The auto.master line for this mount is
> 
> /home  /etc/auto.home  --timeout=1200
> noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768
> 
> The network interface configuration is
> 
> eth0      Link encap:Ethernet  HWaddr 00:30:48:B9:F6:52
>           inet addr:10.1.255.233  Bcast:10.1.255.255  Mask:255.255.0.0
>           inet6 addr: fe80::230:48ff:feb9:f652/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>           RX packets:32999308 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:27468315 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:24225053296 (22.5 GiB)  TX bytes:73313582546 (68.2 GiB)
>           Interrupt:74 Base address:0x2000
> 
> The automount is
> 
> Linux automount version 5.0.1-0.rc2.102
> 
> Directories:
>         config dir:     /etc/sysconfig
>         maps dir:       /etc
>         modules dir:    /usr/lib64/autofs
> 
> Compile options:
>   DISABLE_MOUNT_LOCKING ENABLE_IGNORE_BUSY_MOUNTS WITH_HESIOD WITH_LDAP
>   WITH_SASL
> 
> The kernel is 2.6.18-128.1.14.el5.
> 
> Any advice on what to do?

Update to 5.4, at least the autofs package and the kernel, and see if
the problem persists.

Ian

* Re: Transient Automount Problems in CentOS 5.3
From: Carsten Aulbert @ 2010-02-05  7:59 UTC
  To: Jon Forrest; +Cc: autofs

Hi Jon,

On Thursday 04 February 2010 19:09:37 Jon Forrest wrote:
> 
> First of all, the Sun 7310 is one of those servers
> that you "can't" login to. You manage it only
> via a Web interface.
> 
OK, sorry, I've not had such a beast under my fingers yet ;)

> I've set the server to use 500 NFS server threads.
> Is it necessary to go above that?
> 

That should be fine.

> Both the Sun server, the switch, and all the compute nodes
> in the cluster claim to support jumbo frames.

That's why I'm asking. We've had some network components (e.g. HP) which did
support 9k jumbo frames; however, going beyond 3-4.5k caused such a severe
performance drop that it was not advisable. What I would do to make certain
this is not a problem:

(1) ping -s 8972 -Mdo <remote host>
(try different payload sizes and remember that the switches might need some
overhead)
(2) Use netperf between different nodes and check that the performance is not
drastically reduced with large jumbo frames.
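
A note on the 8972 in test (1): it is presumably the 9000-byte MTU
minus the 20-byte IPv4 header (without options) and the 8-byte ICMP
echo header. A minimal Python sketch of that arithmetic follows; the
function name is only illustrative and is not part of any tool
mentioned in this thread.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo packet")
    return payload


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, icmp_payload_for_mtu(mtu))
```

For a standard 1500-byte MTU the same calculation gives 1472, which is
a useful second payload size to try.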

> Turning them off is something that I could
> do, but I'd have to do it globally because this is
> a per interface settings. There's no
> way I can think of that would allow me to
> keep jumbo frames on on only certain nodes
> so that I could run a controlled experiment.
> 

If the tests above work, there should be no need.

> 
> I've looked this up and it might have something
> to do with the problem. The trouble is that I can't
> see when these error messages are generated so
> I can't try to correlate them with the autofs
> problem.

Would it be possible for you to recompile the kernel with the same settings
and enable timestamps on printk lines (under kernel hacking)? That might help,
but it could take some work to get running.

Cheers

Carsten

* Re: Transient Automount Problems in CentOS 5.3
From: Jon Forrest @ 2010-02-05 16:31 UTC
  To: Ian Kent; +Cc: autofs

On 2/4/2010 8:42 PM, Ian Kent wrote:

> Update to 5.4, at least the autofs package and the kernel, and see if
> the problem persists.

For logistical reasons, updating the kernel is tricky.
Would updating the autofs package while keeping
the RHEL 5.3 kernel help the situation?

Jon

* Re: Transient Automount Problems in CentOS 5.3
From: Ian Kent @ 2010-02-05 16:43 UTC
  To: Jon Forrest; +Cc: autofs

On 02/06/2010 12:31 AM, Jon Forrest wrote:
> On 2/4/2010 8:42 PM, Ian Kent wrote:
> 
>> Update to 5.4, at least the autofs package and the kernel, and see if
>> the problem persists.
> 
> For logistical reasons updating the kernel is tricky.
> Would updating the autofs package while keeping
> the RHEL 5.3 kernel help the situations?

I doubt it. But don't just rush into it; you need to test before updating
anyway.

Ian

* Re: Transient Automount Problems in CentOS 5.3
From: Jon Forrest @ 2010-02-05 17:10 UTC
  To: Carsten Aulbert; +Cc: autofs

On 2/4/2010 11:59 PM, Carsten Aulbert wrote:

> (1) ping -s 8972 -Mdo <remote host>
> (try different payload sizes and remember that there might be some overhead in
> the switches needed)

This results in

  icmp_seq=2 Frag needed and DF set (mtu = 1500)

which is not what I expected. I wonder where the
"mtu = 1500" is coming from. ifconfig on the source
machine definitely shows an MTU of 9000 on the interface
(I just reconfirmed). I also confirmed that jumbo frames
are enabled on both the switch and the storage
server.

For yuks, I tried lowering the packet
size, and I found that I continued to see this
error until the packet size was 1472. So, either
ping is doing something I don't expect, or somebody
is lying about jumbo frames being enabled.
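
The 1472-byte threshold fits a standard 1500-byte MTU somewhere on the
path: 1472 bytes of payload plus 28 bytes of IPv4 and ICMP headers is
exactly 1500. The manual search described above could be automated
roughly as follows. This is only a sketch: `largest_passing_payload`
and `ping_probe` are hypothetical names, the probe assumes the Linux
iputils `ping` with its `-c`, `-M do`, and `-s` options, and the search
assumes that every payload up to some limit passes and every larger one
fails.

```python
import subprocess
from typing import Callable


def largest_passing_payload(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes probe is monotonic: True up to some limit, False above it.
    Returns low - 1 if even the smallest size fails.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best = mid
            low = mid + 1
        else:
            high = mid - 1
    return best


def ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one ping with the don't-fragment bit set."""
    def probe(size: int) -> bool:
        result = subprocess.run(
            ["ping", "-c", "1", "-M", "do", "-s", str(size), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    return probe


if __name__ == "__main__":
    # Stand-in probe that mimics the behaviour reported above, so the
    # sketch runs without a network; swap in ping_probe("<remote host>").
    print(largest_passing_payload(lambda size: size <= 1472, 0, 8972))
```

Running the real probe from several nodes and against several targets
would show whether the limit is the same everywhere or specific to one
path.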

> (2) Use netperf between different nodes and see if the performance is not
> drastically reduced with large jumbo frames.

The funny thing is that performance seems to be fine,
although that's purely subjective. I'll try the netperf
test to see what the numbers really are.

> Would it be possible for you to recompile the kernel with the same settings
> and enable timings in printk lines (under kernel hacking)? That might help,
> but might be some work to get working.

This is a 48 node cluster, so doing something like
that is something I'd like to hold off on doing
until I've exhausted everything else.

I appreciate your suggestions.

-- 
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest@berkeley.edu

* Re: Transient Automount Problems in CentOS 5.3
From: Carsten Aulbert @ 2010-02-05 17:30 UTC
  To: Jon Forrest; +Cc: autofs

Hi Jon

On Friday 05 February 2010 18:10:54 Jon Forrest wrote:
> 
> This results in
> 
>   icmp_seq=2 Frag needed and DF set (mtu = 1500)
> 
> which is not what I expected. I wonder where the
> "mtu = 1500" is coming from. ifconfig on the interface
> of the source machine is definitely 9000 (I just
> reconfirmed). I also confirmed that jumbo frames
> are enabled both on the switch and on the storage
> server. They are.
> 
> For yuks, I tried lowering the packet
> size lower, and I found that I continued to see this
> error until the packet size was 1472. So, either
> ping is doing something I don't expect, or somebody
> is lying about jumbo frames being enabled.
> 

That sounds like one component is not letting jumbo frames through.

> This is a 48 node cluster, so doing something like
> that is something I'd like to hold off on doing
> until I've exhausted everything else.

48 are easy; it starts to get interesting in the not-too-low three-digit regime,
and four digits are more fun ;)

What I would do next is follow these two routes (try both):

Route A:

Set the MTU to 1500 everywhere, stress the system, and see whether the
problems go away.

Route B:

(a) Directly link two nodes and try the ping again. If that does not work,
there seems to be a problem with the nodes' NICs.

(b) Directly link a node to the 7310 and try the ping again. If it does not
work but (a) did, the 7310 is acting up.

(c) Try the ping between two nodes with a single switch in between and see if
that works.

Basically, start at the simplest set-up and work your way up the complexity
ladder. It's tedious, but eventually you'll find the problem.
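
Route B amounts to a small decision procedure. A rough Python sketch is
below. The function name and the wording of the results are
illustrative only, and the conclusion drawn for step (c), suspecting
the switch when only that test fails, is an inference from the steps
above rather than something stated explicitly.

```python
from typing import Optional


def diagnose(node_to_node: bool,
             node_to_storage: Optional[bool] = None,
             via_switch: Optional[bool] = None) -> str:
    """Interpret the jumbo-frame ping results of Route B, steps (a) to (c).

    Each argument is True if the large don't-fragment ping worked,
    False if it failed, and None if the test has not been run yet.
    """
    if not node_to_node:
        return "suspect the nodes' NICs"        # step (a) failed
    if node_to_storage is None:
        return "run step (b)"
    if not node_to_storage:
        return "suspect the storage server"     # (a) worked, (b) failed
    if via_switch is None:
        return "run step (c)"
    if not via_switch:
        return "suspect the switch"             # inference, see the note above
    return "no single component isolated; add complexity and retest"


if __name__ == "__main__":
    print(diagnose(True, True, False))
```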

Cheers

Carsten
