* Transient Automount Problems in CentOS 5.3
@ 2010-02-04 16:50 Jon Forrest
2010-02-04 16:57 ` Carsten Aulbert
2010-02-05 4:42 ` Ian Kent
0 siblings, 2 replies; 10+ messages in thread
From: Jon Forrest @ 2010-02-04 16:50 UTC (permalink / raw)
To: autofs
I have a new cluster running CentOS 5.3.
The cluster uses a Sun 7310 storage server
that provides NFS service over a private
1Gb/s ethernet with 9K jumbo frames to the
cluster.
We've noticed that a number of the compute
nodes sometimes generate the
automount[15023]: umount_autofs_indirect: ask umount returned busy /home
message. When this happens the program running on the
node dies. This has happened between 10 and 20 times.
We're not sure what's going on on a node when this
happens. Most of the time everything is fine and
the home directories are automounted without problem.
I've googled for this problem and I see that other people
have seen it too, but I've never seen a resolution,
especially not for RHEL5.
The auto.master line for this mount is
/home /etc/auto.home --timeout=1200 noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768
The network interface configuration is
eth0 Link encap:Ethernet HWaddr 00:30:48:B9:F6:52
inet addr:10.1.255.233 Bcast:10.1.255.255 Mask:255.255.0.0
inet6 addr: fe80::230:48ff:feb9:f652/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:32999308 errors:0 dropped:0 overruns:0 frame:0
TX packets:27468315 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:24225053296 (22.5 GiB) TX bytes:73313582546 (68.2 GiB)
Interrupt:74 Base address:0x2000
The automount is
Linux automount version 5.0.1-0.rc2.102
Directories:
config dir: /etc/sysconfig
maps dir: /etc
modules dir: /usr/lib64/autofs
Compile options:
DISABLE_MOUNT_LOCKING ENABLE_IGNORE_BUSY_MOUNTS WITH_HESIOD WITH_LDAP
WITH_SASL
The kernel is 2.6.18-128.1.14.el5.
Any advice on what to do?
Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
* Re: Transient Automount Problems in CentOS 5.3
2010-02-04 16:50 Transient Automount Problems in CentOS 5.3 Jon Forrest
@ 2010-02-04 16:57 ` Carsten Aulbert
2010-02-04 18:09 ` Jon Forrest
2010-02-05 4:42 ` Ian Kent
1 sibling, 1 reply; 10+ messages in thread
From: Carsten Aulbert @ 2010-02-04 16:57 UTC (permalink / raw)
To: autofs
Hi Jon,
On Thursday 04 February 2010 17:50:25 Jon Forrest wrote:
> I have a new cluster running CentOS 5.3.
> The cluster uses a Sun 7310 storage server
> that provides NFS service over a private
> 1Gb/s ethernet with 9K jumbo frames to the
> cluster.
>
> We've noticed that a number of the compute
> nodes sometimes generate the
>
> automount[15023]: umount_autofs_indirect: ask umount returned busy /home
>
> message. When this happens the program running on the
> node dies. This has happened between 10 and 20 times.
> We're not sure what's going on on a node when this
> happens. Most of the time everything is fine and
> the home directories are automounted without problem.
Have you ramped up the number of NFS daemons, the backlog settings, etc. on the
Solaris box (/etc/default/nfs)? Is the Solaris NIC running with jumbo frames as
well, and is the switch *really* capable of handling them? (Please test with
ping.)
Cheers
Carsten
* Re: Transient Automount Problems in CentOS 5.3
2010-02-04 16:57 ` Carsten Aulbert
@ 2010-02-04 18:09 ` Jon Forrest
2010-02-05 7:59 ` Carsten Aulbert
0 siblings, 1 reply; 10+ messages in thread
From: Jon Forrest @ 2010-02-04 18:09 UTC (permalink / raw)
To: Carsten Aulbert; +Cc: autofs
On 2/4/2010 8:57 AM, Carsten Aulbert wrote:
> Have you ramped up the number of NFS daemons, the backlog settings, etc. on the
> Solaris box (/etc/default/nfs)? Is the Solaris NIC running with jumbo frames as
> well, and is the switch *really* capable of handling them? (Please test with
> ping.)
First of all, the Sun 7310 is one of those servers
that you "can't" log in to. You manage it only
via a Web interface.
I've set the server to use 500 NFS server threads.
Is it necessary to go above that?
The Sun server, the switch, and all the compute nodes
in the cluster claim to support jumbo frames.
Turning them off is something that I could
do, but I'd have to do it globally because this is
a per-interface setting. There's no
way I can think of that would allow me to
keep jumbo frames on for only certain nodes
so that I could run a controlled experiment.
Running ping shows no problems.
I've thought it might be useful to turn off
automounting on a few of the cluster nodes to
see what happens.
One additional thing I've discovered is that the
compute nodes are all showing many instances of this
error message in dmesg:
eth0: too many iterations (6) in nv_nic_irq.
I've looked this up and it might have something
to do with the problem. The trouble is that I can't
see when these error messages are generated so
I can't try to correlate them with the autofs
problem.
I'm grasping at straws here.
I appreciate any additional suggestions.
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest@berkeley.edu
* Re: Transient Automount Problems in CentOS 5.3
2010-02-04 16:50 Transient Automount Problems in CentOS 5.3 Jon Forrest
2010-02-04 16:57 ` Carsten Aulbert
@ 2010-02-05 4:42 ` Ian Kent
2010-02-05 16:31 ` Jon Forrest
1 sibling, 1 reply; 10+ messages in thread
From: Ian Kent @ 2010-02-05 4:42 UTC (permalink / raw)
To: Jon Forrest; +Cc: autofs
On 02/05/2010 12:50 AM, Jon Forrest wrote:
> I have a new cluster running CentOS 5.3.
> The cluster uses a Sun 7310 storage server
> that provides NFS service over a private
> 1Gb/s ethernet with 9K jumbo frames to the
> cluster.
>
> We've noticed that a number of the compute
> nodes sometimes generate the
>
> automount[15023]: umount_autofs_indirect: ask umount returned busy /home
>
> message. When this happens the program running on the
> node dies. This has happened between 10 and 20 times.
> We're not sure what's going on on a node when this
> happens. Most of the time everything is fine and
> the home directories are automounted without problem.
>
> I've googled for this problem and I see that other people
> have seen it too, but I've never seen a resolution,
> especially not for RHEL5.
>
> The auto.master line for this mount is
>
> /home /etc/auto.home --timeout=1200
> noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768
>
> The network interface configuration is
>
> eth0 Link encap:Ethernet HWaddr 00:30:48:B9:F6:52
> inet addr:10.1.255.233 Bcast:10.1.255.255 Mask:255.255.0.0
> inet6 addr: fe80::230:48ff:feb9:f652/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
> RX packets:32999308 errors:0 dropped:0 overruns:0 frame:0
> TX packets:27468315 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:24225053296 (22.5 GiB) TX bytes:73313582546 (68.2 GiB)
> Interrupt:74 Base address:0x2000
>
> The automount is
>
> Linux automount version 5.0.1-0.rc2.102
>
> Directories:
> config dir: /etc/sysconfig
> maps dir: /etc
> modules dir: /usr/lib64/autofs
>
> Compile options:
> DISABLE_MOUNT_LOCKING ENABLE_IGNORE_BUSY_MOUNTS WITH_HESIOD WITH_LDAP
> WITH_SASL
>
> The kernel is 2.6.18-128.1.14.el5.
>
> Any advice on what to do?
Update to 5.4, at least the autofs package and the kernel, and see if
the problem persists.
Ian
* Re: Transient Automount Problems in CentOS 5.3
2010-02-04 18:09 ` Jon Forrest
@ 2010-02-05 7:59 ` Carsten Aulbert
2010-02-05 17:10 ` Jon Forrest
0 siblings, 1 reply; 10+ messages in thread
From: Carsten Aulbert @ 2010-02-05 7:59 UTC (permalink / raw)
To: Jon Forrest; +Cc: autofs
Hi Jon,
On Thursday 04 February 2010 19:09:37 Jon Forrest wrote:
>
> First of all, the Sun 7310 is one of those servers
> that you "can't" log in to. You manage it only
> via a Web interface.
>
OK, sorry, I have not had such a beast under my fingers yet ;)
> I've set the server to use 500 NFS server threads.
> Is it necessary to go above that?
>
That should be fine.
> The Sun server, the switch, and all the compute nodes
> in the cluster claim to support jumbo frames.
That's why I'm asking. We've had some network components (e.g. HP) which did
support 9k jumbo frames, but going beyond 3-4.5k caused such a severe
performance drop that it was not advisable. What I would do to make certain
this is not a problem:
(1) ping -s 8972 -Mdo <remote host>
(try different payload sizes, and remember that the switches may need some
extra overhead)
(2) Use netperf between different nodes and check that performance is not
drastically reduced with large jumbo frames.
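The 8972 in step (1) comes from header overhead: an IPv4 header without options is 20 bytes and an ICMP echo header is 8 bytes, so the largest payload that fits in one unfragmented packet is the MTU minus 28. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and IPv4 without IP options is assumed.

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header with no options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_ping_payload(mtu):
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU too small for an ICMP echo packet")
    return payload


# Payload sizes worth trying for a few candidate MTUs.
for mtu in (1500, 4500, 9000):
    print(mtu, max_ping_payload(mtu))
```

If the largest payload that passes with the don't-fragment bit set is well below the value for MTU 9000, some hop on the path is using a smaller MTU.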
> Turning them off is something that I could
> do, but I'd have to do it globally because this is
> a per-interface setting. There's no
> way I can think of that would allow me to
> keep jumbo frames on for only certain nodes
> so that I could run a controlled experiment.
>
If the tests above work, there should be no need.
>
> I've looked this up and it might have something
> to do with the problem. The trouble is that I can't
> see when these error messages are generated so
> I can't try to correlate them with the autofs
> problem.
Would it be possible for you to recompile the kernel with the same settings
and enable timestamps on printk lines (CONFIG_PRINTK_TIME, under kernel
hacking)? That might help, but it could take some work to get going.
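One way to get approximate times for the nv_nic_irq messages without rebuilding the kernel is to poll the ring buffer from user space and stamp whatever is new. The Python 3 sketch below is only an illustration under stated assumptions: it relies on `dmesg -c` (print, then clear, the ring buffer; needs root), the function names and the 10-second interval are made up here, and it assumes a Python 3 interpreter is available on the node, which is not the case on a stock CentOS 5 install. The timestamps are only as precise as the polling interval.

```python
import subprocess
import time
from datetime import datetime


def read_and_clear_dmesg():
    """Return the kernel ring buffer contents and clear it (requires root)."""
    result = subprocess.run(
        ["dmesg", "-c"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


def stamp_lines(text, now):
    """Prefix every non-empty line with the wall-clock time `now`."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return ["%s %s" % (stamp, line) for line in text.splitlines() if line.strip()]


def watch(read=read_and_clear_dmesg, interval=10.0, rounds=None,
          emit=print, sleep=time.sleep):
    """Poll `read` every `interval` seconds; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        for line in stamp_lines(read(), datetime.now()):
            emit(line)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)


# Hypothetical usage, as root, with output redirected to a file:
#     watch(interval=10.0)
```

The stamped output can then be compared by time with the automount "returned busy" lines in syslog. Note that clearing the buffer means a later plain `dmesg` no longer shows those lines.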
Cheers
Carsten
* Re: Transient Automount Problems in CentOS 5.3
2010-02-05 4:42 ` Ian Kent
@ 2010-02-05 16:31 ` Jon Forrest
2010-02-05 16:43 ` Ian Kent
0 siblings, 1 reply; 10+ messages in thread
From: Jon Forrest @ 2010-02-05 16:31 UTC (permalink / raw)
To: Ian Kent; +Cc: autofs
On 2/4/2010 8:42 PM, Ian Kent wrote:
> Update to 5.4, at least the autofs package and the kernel, and see if
> the problem persists.
For logistical reasons updating the kernel is tricky.
Would updating the autofs package while keeping
the RHEL 5.3 kernel help the situation?
Jon
* Re: Transient Automount Problems in CentOS 5.3
2010-02-05 16:31 ` Jon Forrest
@ 2010-02-05 16:43 ` Ian Kent
0 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2010-02-05 16:43 UTC (permalink / raw)
To: Jon Forrest; +Cc: autofs
On 02/06/2010 12:31 AM, Jon Forrest wrote:
> On 2/4/2010 8:42 PM, Ian Kent wrote:
>
>> Update to 5.4, at least the autofs package and the kernel, and see if
>> the problem persists.
>
> For logistical reasons updating the kernel is tricky.
> Would updating the autofs package while keeping
> the RHEL 5.3 kernel help the situation?
Doubt it, but don't just rush into it; you need to test before updating
anyway.
Ian
* Re: Transient Automount Problems in CentOS 5.3
2010-02-05 7:59 ` Carsten Aulbert
@ 2010-02-05 17:10 ` Jon Forrest
2010-02-05 17:30 ` Carsten Aulbert
0 siblings, 1 reply; 10+ messages in thread
From: Jon Forrest @ 2010-02-05 17:10 UTC (permalink / raw)
To: Carsten Aulbert; +Cc: autofs
On 2/4/2010 11:59 PM, Carsten Aulbert wrote:
> (1) ping -s 8972 -Mdo <remote host>
> (try different payload sizes, and remember that the switches may need some
> extra overhead)
This results in
icmp_seq=2 Frag needed and DF set (mtu = 1500)
which is not what I expected. I wonder where the
"mtu = 1500" is coming from. ifconfig shows that the MTU
on the source machine's interface is definitely 9000 (I just
reconfirmed). I also confirmed that jumbo frames
are enabled on both the switch and the storage
server.
For yuks, I tried lowering the packet
size, and I found that I continued to see this
error until the packet size was 1472. So, either
ping is doing something I don't expect, or somebody
is lying about jumbo frames being enabled.
> (2) Use netperf between different nodes and check that performance is not
> drastically reduced with large jumbo frames.
The funny thing is that performance seems to be fine,
although that's purely subjective. I'll try the netperf
test to see what the numbers really are.
> Would it be possible for you to recompile the kernel with the same settings
> and enable timestamps on printk lines (CONFIG_PRINTK_TIME, under kernel
> hacking)? That might help, but it could take some work to get going.
This is a 48-node cluster, so I'd like to hold off
on doing something like that until I've exhausted
everything else.
I appreciate your suggestions.
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest@berkeley.edu
* Re: Transient Automount Problems in CentOS 5.3
2010-02-05 17:10 ` Jon Forrest
@ 2010-02-05 17:30 ` Carsten Aulbert
0 siblings, 0 replies; 10+ messages in thread
From: Carsten Aulbert @ 2010-02-05 17:30 UTC (permalink / raw)
To: Jon Forrest; +Cc: autofs
Hi Jon
On Friday 05 February 2010 18:10:54 Jon Forrest wrote:
>
> This results in
>
> icmp_seq=2 Frag needed and DF set (mtu = 1500)
>
> which is not what I expected. I wonder where the
> "mtu = 1500" is coming from. ifconfig shows that the MTU
> on the source machine's interface is definitely 9000 (I just
> reconfirmed). I also confirmed that jumbo frames
> are enabled on both the switch and the storage
> server.
>
> For yuks, I tried lowering the packet
> size, and I found that I continued to see this
> error until the packet size was 1472. So, either
> ping is doing something I don't expect, or somebody
> is lying about jumbo frames being enabled.
>
That sounds like one component is not allowing jumbo frames through.
> This is a 48-node cluster, so I'd like to hold off
> on doing something like that until I've exhausted
> everything else.
48 nodes are easy; it starts to get interesting in the not-too-low three-digit
regime, and four digits are more fun ;)
What I would do next is try both of the following routes:
Route A:
Set MTU 1500 everywhere, stress the system, and see whether the problems go away.
Route B:
(a) Directly link two nodes and try the ping again. If that does not work,
there seems to be a problem with the nodes' NICs.
(b) Directly link a node to the 7310 and try the ping again. If it does not work
but (a) did, the 7310 is acting up.
(c) Try the ping between two nodes with a single switch in between and see if
that works.
Basically, start at the simplest set-up and work your way up the complexity
ladder. It's tedious, but eventually you'll find the problem.
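At each rung of that ladder, the search for the largest payload that passes with the don't-fragment bit set can be automated. The Python 3 sketch below is a hedged illustration, not a tested tool: it assumes the Linux iputils ping (with its -M do, -s, -c and -W options), it assumes that every payload smaller than a passing one also passes, and the function names are made up here. A single lost packet would be misread as "too big", so treat the result as a hint.

```python
import subprocess


def ping_fits(host, payload):
    """True if one ping with DF set and this payload size gets a reply."""
    cmd = ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_payload(probe, low=0, high=8972):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Returns None if even `low` fails. Assumes probe is monotonic: once a size
    fails, every larger size fails too.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Hypothetical usage ("storage-host" is a placeholder name):
#     size = largest_payload(lambda n: ping_fits("storage-host", n))
#     print("largest payload:", size)
```

Adding 28 bytes (IPv4 plus ICMP headers) to the returned payload gives the effective path MTU for that set-up.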
Cheers
Carsten