* unacceptable bug in autofs kernel module
@ 2004-12-28 7:51 ramana
2004-12-29 1:02 ` Ian Kent
2005-02-04 0:38 ` mmarion
0 siblings, 2 replies; 18+ messages in thread
From: ramana @ 2004-12-28 7:51 UTC (permalink / raw)
To: autofs
Dear developers,
Here is the bug in the autofs3 module which is causing so much pain. It
simply stopped me from adding much more interesting features to Autodir
http://www.intraperson.com/autodir/
Taken from Linux kernel 2.4 autofs module source.
file: root.c
function: autofs_root_lookup.
protocol: 3
/*
 * If this dentry is unhashed, then we shouldn't honour this
 * lookup even if the dentry is positive. Returning ENOENT here
 * doesn't do the right thing for all system calls, but it should
 * be OK for the operations we permit from an autofs.
 */
/*
if ( dentry->d_inode && d_unhashed(dentry) )
        return ERR_PTR(-ENOENT);
*/
if ( dentry->d_inode && d_unhashed(dentry) ) {
        printk( "ENOENT for %s\n", dentry->d_name.name );
        return ERR_PTR(-ENOENT);
}
I added the printk to trace it easily. To my surprise, autofs4 also has
similar code.
Because of this, the user-space test program reports failures like this:
fail : /test/t944 : No such file or directory
fail : /test/t4187 : No such file or directory
fail : /test/t100 : No such file or directory
fail : /test/t806 : No such file or directory
fail : /test/t3451 : No such file or directory
fail : /test/t1790 : No such file or directory
fail : /test/t3555 : No such file or directory
fail : /test/t3098 : No such file or directory
fail : /test/t4085 : No such file or directory
fail : /test/t3935 : No such file or directory
with the corresponding kernel messages:
ENOENT for t944
ENOENT for t4187
ENOENT for t100
ENOENT for t806
ENOENT for t3451
ENOENT for t1790
ENOENT for t3555
ENOENT for t3098
ENOENT for t4085
ENOENT for t3935
The error rate (i.e. spurious ENOENT), measured over months of stress
tests, is around 0.002% and increases as system load increases. Even at
this rate I do not think it is acceptable in production systems.
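A failure count like the one reported above could be gathered with a small user-space helper along the following lines. This is only a sketch under assumptions: the original test program was not posted, so the function name count_enoent, the use of opendir() as the access that triggers the lookup, and the output format (copied from the report above) are all illustrative.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the original test program): try to open
 * each directory once and count how many attempts fail with ENOENT.
 * Other errors are reported on stderr but not counted.
 */
unsigned long count_enoent(const char *const paths[], size_t n)
{
    unsigned long failures = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        DIR *dir = opendir(paths[i]);

        if (dir != NULL) {
            closedir(dir);
            continue;
        }
        if (errno == ENOENT) {
            /* Same layout as the failure lines quoted above. */
            printf("fail : %s : %s\n", paths[i], strerror(errno));
            failures++;
        } else {
            fprintf(stderr, "error : %s : %s\n", paths[i], strerror(errno));
        }
    }
    return failures;
}
```

Dividing the returned count by the number of paths tried gives an error rate comparable to the 0.002% figure, provided every path in the list is one the daemon is expected to serve.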
Thanks in advance.
Regards
ramana
--
http://www.intraperson.com
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2004-12-28 7:51 unacceptable bug in autofs kernel module ramana
@ 2004-12-29 1:02 ` Ian Kent
2004-12-29 3:44 ` ramana
[not found] ` <41D21C1E.8040407@intraperson.com>
2005-02-04 0:38 ` mmarion
1 sibling, 2 replies; 18+ messages in thread
From: Ian Kent @ 2004-12-29 1:02 UTC (permalink / raw)
To: ramana; +Cc: autofs
On Tue, 28 Dec 2004, ramana wrote:
> Dear developers,
>
> Here is the bug in autofs3 module which causing so much pain. It simply
> stopped me from adding much more interesting features to Autodir
> http://www.intraperson.com/autodir/
Thanks for this.
You've provided some symptoms but you haven't provided any explanation as
to why this is a bug.
Can you explain why you need the kernel to honour a lookup for an already
deleted dentry?
This could be due to the way that autofs does a d_drop instead of a
d_delete in the directory unlink callback. However, the dentry, for all
intents and purposes, has already been deleted.
>
> Taken from Linux kernel 2.4 autofs module source.
>
> file: root.c
> function: autofs_root_lookup.
> protocol: 3
>
> /*
> * If this dentry is unhashed, then we shouldn't honour this
> * lookup even if the dentry is positive. Returning ENOENT here
> * doesn't do the right thing for all system calls, but it should
> * be OK for the operations we permit from an autofs.
> */
>
>
> /*
> if ( dentry->d_inode && d_unhashed(dentry) )
> return ERR_PTR(-ENOENT);
> */
>
>
> if ( dentry->d_inode && d_unhashed(dentry) ) {
> printk( "ENOENT for %s\n", dentry->d_name.name );
> return ERR_PTR(-ENOENT);
> }
>
> I added printk to easily trace it. To my surprise autofs 4 also has
> similar code.
>
> Because of this, user space test program reporting like this:
>
> fail : /test/t944 : No such file or directory
> fail : /test/t4187 : No such file or directory
> fail : /test/t100 : No such file or directory
> fail : /test/t806 : No such file or directory
> fail : /test/t3451 : No such file or directory
> fail : /test/t1790 : No such file or directory
> fail : /test/t3555 : No such file or directory
> fail : /test/t3098 : No such file or directory
> fail : /test/t4085 : No such file or directory
> fail : /test/t3935 : No such file or directory
>
> with corresponding kernel messages are,
>
> ENOENT for t944
> ENOENT for t4187
> ENOENT for t100
> ENOENT for t806
> ENOENT for t3451
> ENOENT for t1790
> ENOENT for t3555
> ENOENT for t3098
> ENOENT for t4085
> ENOENT for t3935
>
> The error rate as taken from months of stress tests -- ie ENOENT; is
> around 0.002% and increases as system load increases. Even at this rate
> I do not think it is acceptable in production systems.
>
> Thanks in advance.
>
> Regards
> ramana
>
> --
> http://www.intraperson.com
>
>
> _______________________________________________
> autofs mailing list
> autofs@linux.kernel.org
> http://linux.kernel.org/mailman/listinfo/autofs
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2004-12-29 1:02 ` Ian Kent
@ 2004-12-29 3:44 ` ramana
[not found] ` <41D21C1E.8040407@intraperson.com>
1 sibling, 0 replies; 18+ messages in thread
From: ramana @ 2004-12-29 3:44 UTC (permalink / raw)
To: Ian Kent, autofs
Ian Kent wrote:
>On Tue, 28 Dec 2004, ramana wrote:
>
>
>
>>Dear developers,
>>
>>Here is the bug in autofs3 module which causing so much pain. It simply
>>stopped me from adding much more interesting features to Autodir
>>http://www.intraperson.com/autodir/
>>
>>
>
>Thanks for this.
>
>You've provided some symptoms but you haven't provided any explanation as
>to why this is a bug.
>
>Can you explain why you need the kernel to honour a lookup for an already
>deleted dentry?
>
>
If it is deleted, then the kernel should clean up whatever is needed and
report back to the user-space autofs/autodir daemon that this directory
is missing, instead of directly reporting that the entry does not exist
because it was deleted. After all, that is what autofs is all about.
What is important is that the -t option is user-settable. If the user
chooses a low value like 1 or 2 seconds, these autofs directories will be
deleted more frequently. But deletion does not mean they do not actually
exist.
It is certainly a bug. Let us view it from a user-space application which
is accessing these autofs directories. Most of the time it gets access to
directories which exist and are perfectly legal. Then suddenly, at some
point, the kernel decides by itself to report that a directory does not
exist, without even asking the user-space autofs/autodir daemon.
What is important here is that if I access it again a while after the
ENOENT, everything works perfectly. Is this not inconsistent behavior?
The above statement is true, as I tested it from user-space applications
for millions of directory requests rather than looking at it from the
kernel point of view. If I access an autofs directory 999 times and get
ENOENT 1 time without a proper reason from the user-space daemon, it is
certainly a bug.
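Since a second access shortly after the ENOENT is reported to succeed, an application could paper over the problem by retrying. The sketch below shows one way to do that; it is a workaround under assumptions, not a fix for the kernel behaviour, and the name opendir_retry, the attempt count, and the delay are all hypothetical choices.

```c
#include <dirent.h>
#include <errno.h>
#include <time.h>

/*
 * Hypothetical workaround: retry opendir() when it fails with ENOENT,
 * sleeping delay_ms milliseconds between attempts. Returns the open
 * DIR pointer, or NULL with errno left as set by the last opendir().
 */
DIR *opendir_retry(const char *path, int attempts, long delay_ms)
{
    struct timespec delay;
    DIR *dir = NULL;
    int i;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < attempts; i++) {
        dir = opendir(path);
        if (dir != NULL || errno != ENOENT)
            break;      /* success, or an error a retry will not fix */
        if (i + 1 < attempts)
            nanosleep(&delay, NULL);
    }
    return dir;
}
```

A caller might use opendir_retry("/test/t944", 3, 100). Note that this also delays the answer for directories that genuinely do not exist, which is one reason it is only a stopgap.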
Thanks for your reply.
>This could be due to the way that autofs does a d_drop instead of a
>d_delete in the directory unlink callback. However, the dentry, for all
>intentional purposes, has already been deleted.
>
>
>
Regards
ramana
--
http://www.intraperson.com
^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <41D21C1E.8040407@intraperson.com>]
* Re: unacceptable bug in autofs kernel module
2004-12-28 7:51 unacceptable bug in autofs kernel module ramana
2004-12-29 1:02 ` Ian Kent
@ 2005-02-04 0:38 ` mmarion
2005-02-04 1:49 ` Ian Kent
1 sibling, 1 reply; 18+ messages in thread
From: mmarion @ 2005-02-04 0:38 UTC (permalink / raw)
To: ramana; +Cc: autofs
On 28 Dec, ramana wrote:
> Here is the bug in autofs3 module which causing so much pain. It simply
> stopped me from adding much more interesting features to Autodir
> http://www.intraperson.com/autodir/
[snip]
> Because of this, user space test program reporting like this:
>
> fail : /test/t944 : No such file or directory
> fail : /test/t4187 : No such file or directory
Hmm.. I wonder if this might be related to a weirdness we're seeing. Running
autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and users
use LSF to submit batch jobs to hosts. On linux hosts, user level programs
will sometimes exit quickly with a "file does not exist" error, even though you
can login to the host and see the file/dir just fine. As a hacked
work-around, we have a pre-exec script that tries to stat all the directories
they need to force the mounts to happen before their program touches the
files.
I didn't see any attempts to patch this bit.. did you have any ideas on how to
patch that particular piece of code? Or just comment it out?
--
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud. So I had to
put 17 bullets in 'em." ==> Simpsons
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2005-02-04 0:38 ` mmarion
@ 2005-02-04 1:49 ` Ian Kent
2005-02-04 2:59 ` ramana
0 siblings, 1 reply; 18+ messages in thread
From: Ian Kent @ 2005-02-04 1:49 UTC (permalink / raw)
To: mmarion; +Cc: autofs
On Thu, 3 Feb 2005 mmarion@qualcomm.com wrote:
> On 28 Dec, ramana wrote:
>
> > Here is the bug in autofs3 module which causing so much pain. It simply
> > stopped me from adding much more interesting features to Autodir
> > http://www.intraperson.com/autodir/
> [snip]
> > Because of this, user space test program reporting like this:
> >
> > fail : /test/t944 : No such file or directory
> > fail : /test/t4187 : No such file or directory
>
> Hmm.. I wonder if this might be related to a weirdness we're seeing. Running
> autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and users
> use LSF to submit batch jobs to hosts. On linux hosts, user level programs
> will sometimes exit quickly with a "file does not exist" error, even though you
> can login to the host and see the file/dir just fine. As a hacked
> work-around, we have a pre-exec script that tries to stat all the directories
> they need to force the mounts to happen before their program touches the
> files.
Does the stat actually mount anything?
It shouldn't?
Ian
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2005-02-04 1:49 ` Ian Kent
@ 2005-02-04 2:59 ` ramana
2005-02-05 13:46 ` ramana
0 siblings, 1 reply; 18+ messages in thread
From: ramana @ 2005-02-04 2:59 UTC (permalink / raw)
To: Ian Kent, mmarion; +Cc: autofs
Ian Kent wrote:
>On Thu, 3 Feb 2005 mmarion@qualcomm.com wrote:
>
>
>
>>On 28 Dec, ramana wrote:
>>
>>
>>
>>>Here is the bug in autofs3 module which causing so much pain. It simply
>>>stopped me from adding much more interesting features to Autodir
>>>http://www.intraperson.com/autodir/
>>>
>>>
>>[snip]
>>
>>
>>>Because of this, user space test program reporting like this:
>>>
>>>fail : /test/t944 : No such file or directory
>>>fail : /test/t4187 : No such file or directory
>>>
>>>
>>Hmm.. I wonder if this might be related to a weirdness we're seeing. Running
>>autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and users
>>use LSF to submit batch jobs to hosts. On linux hosts, user level programs
>>will sometimes exit quickly with a "file does not exist" error, even though you
>>can login to the host and see the file/dir just fine. As a hacked
>>work-around, we have a pre-exec script that tries to stat all the directories
>>they need to force the mounts to happen before their program touches the
>>files.
>>
>>
>
>Does the stat actually mount anything?
>It shouldn't?
>
>Ian
>
>
>
>
I moved the latest version of Autodir to the autofs4 kernel module, and so
far all stress tests tell me the autofs4 protocol is performing well without
these ENOENT errors. I have to do a few more tests before concluding
anything as final.
For more details check http://www.intraperson.com/autodir/.
Version: Autodir 0.93.0 and above.
Regards
ramana
--
http://www.intraperson.com
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 18:58 peter.a.harris
2005-03-07 19:49 ` Mike Marion
0 siblings, 1 reply; 18+ messages in thread
From: peter.a.harris @ 2005-02-04 18:58 UTC (permalink / raw)
To: autofs; +Cc: mmarion
Hello, all,
Funny you should mention - I was just getting ready to ask about this.
We are doing the same thing, i.e. submitting jobs via LSF. What we see are
file not found errors when trying to access a file somewhere down in the
tree of an automounted file system. For instance, a job will execute a Perl
script that starts with "#!/tools/perl5.8.3/bin/perl", which fails because
it cannot find the Perl executable. I log into the machine and do "ls
/tools/perl5.8.3/bin/perl" and get a file not found. I check /etc/mnttab or
/proc/mounts and /tools/perl5.8.3 is not mounted. So then I do an ls of
/tools/perl5.8.3 and the mount is made. Once I do that, the mount point is
generally well behaved for some random period of time, after which we go
through all this again.
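The manual check against /proc/mounts described above can be scripted. The following is a minimal sketch using the glibc mount-table routines setmntent(), getmntent() and endmntent(); the function name is_mount_point is hypothetical, and skipping entries of type "autofs" is an assumption made so that the automounter's own trigger mount is not mistaken for the real filesystem.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if `dir` is listed in the mount table
 * file `table` (for example "/proc/mounts") with a filesystem type
 * other than "autofs", 0 if it is not, and -1 if the table cannot be
 * opened.
 */
int is_mount_point(const char *table, const char *dir)
{
    FILE *fp = setmntent(table, "r");
    struct mntent *ent;
    int found = 0;

    if (fp == NULL)
        return -1;

    while ((ent = getmntent(fp)) != NULL) {
        /* Assumption: an autofs entry is only the trigger, not the data. */
        if (strcmp(ent->mnt_type, "autofs") == 0)
            continue;
        if (strcmp(ent->mnt_dir, dir) == 0) {
            found = 1;
            break;
        }
    }
    endmntent(fp);
    return found;
}
```

Calling is_mount_point("/proc/mounts", "/tools/perl5.8.3") before and after the failing access would show whether the automounter actually performed the mount.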
At first we thought it was networking problems because we were also seeing
some "server not responding" errors on our Solaris boxes. We found that if
the mount failed with an RPC timeout, then the automounter would not try
again until you did an ls of the mount point directory (or in some cases,
you would have to cd to the directory to get the mount to happen). We have
fixed some networking problems that we found and the number of these kinds
of error messages has gone way down. Now we only see them when the 10 boxes
all run a cron job at 10PM and try to mount the same file system at the same
time. Some win but most lose.
Testing (60 second expiry, multiple jobs accessing files every 2 to 3
minutes; caused lots of expirations and remounts) showed that we could also
lose track of a mount if the mount expired and then immediately remounted.
Well, it would not remount but the automounter thought it had. Similarly to
the above, an ls or cd would fix the problem.
Occasionally, the automounter fails to mount without any indication that I
can find in /var/log/messages. And, again, an ls or cd of the directory
will cause the mount to happen.
Most of the machines are running Red Hat EL 3 U4 (automount 4.1.3-47,
2.4.21-27.0.1ELhugemem/smp kernel). One is running 4.1.3-12. A couple are
running RHEL 3 U0, 2.4.21-4EL kernel, 4.1.0-2 automount. We have several
IBM blades with P4's and mostly 4GB of memory. We also have one HP DL585
running AMD64 with 16GB of memory. Most run with a 10 minute expiry, but
one is set to 30 minutes and one to 1 hour. That does not seem to affect
the error rate. Some are running soft mounts to the tools (which should be
read only) and some are running hard mounts - this too does not seem to make
a difference.
And, oh yes, these mounts are all from NetApp Filers.
Anybody else see this and/or have any ideas?
Pete Harris
Tektronix, Inc.
Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Phone: 1-503-627-3989
Fax: 1-503-627-5587
----------------------------------------------------------------------
-- Any opinions expressed are those of the author --
-- and may not be those of Tektronix, Inc. --
=-----Original Message-----
=From: autofs-bounces@linux.kernel.org [mailto:autofs-
=bounces@linux.kernel.org] On Behalf Of mmarion@qualcomm.com
=Sent: Thursday, February 03, 2005 4:39 PM
=To: ramana@intraperson.com
=Cc: autofs@linux.kernel.org
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On 28 Dec, ramana wrote:
=
=> Here is the bug in autofs3 module which causing so much pain. It simply
=> stopped me from adding much more interesting features to Autodir
=> http://www.intraperson.com/autodir/
=[snip]
=> Because of this, user space test program reporting like this:
=>
=> fail : /test/t944 : No such file or directory
=> fail : /test/t4187 : No such file or directory
=
=Hmm.. I wonder if this might be related to a weirdness we're seeing.
=Running
=autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and
=users
=use LSF to submit batch jobs to hosts. On linux hosts, user level
=programs
=will sometimes exit quickly with a "file does not exist" error, even
=though you
=can login to the host and see the file/dir just fine. As a hacked
=work-around, we have a pre-exec script that tries to stat all the
=directories
=they need to force the mounts to happen before their program touches the
=files.
=
=I didn't see any attempts to patch this bit.. did you have any ideas on
=how to
=patch that particular piece of code? Or just comment it out?
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud. So I had
=to
=put 17 bullets in 'em." ==> Simpsons
=
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2005-02-04 18:58 peter.a.harris
@ 2005-03-07 19:49 ` Mike Marion
0 siblings, 0 replies; 18+ messages in thread
From: Mike Marion @ 2005-03-07 19:49 UTC (permalink / raw)
To: peter.a.harris; +Cc: autofs
On Feb 4, 2005, at 10:58 AM, peter.a.harris@exgate.tek.com wrote:
> Funny you should mention - I was just getting ready to ask about this.
Curious.. those that are seeing that mount problem, especially for jobs
submitted via a system like lsf.. What kind of map(s) are you seeing
the problems on? Direct map with/without ghosting? Program map? yp?
ldap? etc..
We've been having the problem with a shell scripted, program map to
support our sun auto.direct map, but I did just get the newer direct
map support with ghosting working on a couple test boxes and want to do
some testing to see if that helps vs the non-visible way it was with
the program script.
--
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Drew: "Violence doesn't solve anything? World War I. World War II. Star
Wars.
every Super Bowl. Who says violence doesn't solve anything?!" ==> Drew
Cary Show
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 19:11 Lever, Charles
0 siblings, 0 replies; 18+ messages in thread
From: Lever, Charles @ 2005-02-04 19:11 UTC (permalink / raw)
To: peter.a.harris, Jeff Moyer; +Cc: autofs, mmarion
This sounds a lot like the mount command is not retrying a mount when it
gets a timed out RPC. Networking problems or an overloaded mountd on
the server would both be reasons for an RPC timeout during a mount.
Jeff, is the mount patch we worked on last summer available for RHEL 3,
or is it just a RHEL AS 2.1 fix at this point?
Peter, what release of Data ONTAP is running on the filer(s)?
> -----Original Message-----
> From: peter.a.harris@exgate.tek.com
> [mailto:peter.a.harris@exgate.tek.com]
> Sent: Friday, February 04, 2005 1:58 PM
> To: autofs@linux.kernel.org
> Cc: mmarion@qualcomm.com
> Subject: RE: [autofs] unacceptable bug in autofs kernel module
>
> Hello, all,
>
> Funny you should mention - I was just getting ready to ask about this.
>
> We are doing the same thing, i.e. submitting jobs via LSF.
> What we see are file not found errors when trying to access a
> file somewhere down in the tree of an automounted file
> system. For instance, a job will execute a Perl script that
> starts with "#!/tools/perl5.8.3/bin/perl", which fails
> because it cannot find the Perl executable. I log into the
> machine and do "ls /tools/perl5.8.3/bin/perl" and get a file
> not found. I check /etc/mnttab or /proc/mounts and
> /tools/perl5.8.3 is not mounted. So then I do an ls of
> /tools/perl5.8.3 and the mount is made. Once I do that, the
> mount point is generally well behaved for some random period
> of time when we will go through all this again.
>
> At first we thought it was networking problems because we
> were also seeing some "server not responding" errors on our
> Solaris boxes. We found that if the mount failed with an RPC
> timeout, then the automounter would not try again until you
> did an ls of the mount point directory (or in some cases, you
> would have to cd to the directory to get the mount to
> happen). We have fixed some networking problems that we
> found and the number of these kinds of error messages has
> gone way down. Now we only see them when the 10 boxes all
> run a cron job at 10PM and try to mount the same file system
> at the same time. Some win but most lose.
>
> Testing (60 second expiry, multiple jobs accessing files
> every 2 to 3 minutes; caused lots of expirations and
> remounts) showed that we could also lose track of a mount if
> the mount expired and then immediately remounted.
> Well, it would not remount but the automounter thought it
> had. Similarly to the above, and ls or cd would fix the problem.
>
> Occasionally, the automounter fails to mount without any
> indication that I can find in /var/log/messages. And, again,
> an ls or cd of the directory will cause the mount to happen.
>
> Most of the machines are running Red Hat EL 3 U4 (automount
> 4.1.3-47, 2.4.21-27.0.1ELhugemem/smp kernel). One is running
> 4.1.3-12. A couple are running RHEL 3 U0, 2.4.21-4EL kernel,
> 4.1.0-2 automouunt. We have several IBM blades with P4's and
> mostly 4GB of memory. We also have one HP DL585 running
> AMD64 with 16GB of memory. Most run with a 10 minute expiry,
> but one is set to 30 minutes and one to 1 hour. That does
> not seem to affect the error rate. Some are running soft
> mounts to the tools (which should be read only) and some are
> running hard mounts - this too does not seem to make a difference.
>
> And, oh yes, these mounts are all from NetApp Filers.
>
> Anybody else see this and/or have any ideas?
>
>
> Pete Harris
> Tektronix, Inc.
> Technical Computing
> MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
> Phone: 1-503-627-3989
> Fax: 1-503-627-5587
> ----------------------------------------------------------------------
> -- Any opinions expressed are those of the author --
> -- and may not be those of Tektronix, Inc. --
>
> =-----Original Message-----
> =From: autofs-bounces@linux.kernel.org [mailto:autofs-
> =bounces@linux.kernel.org] On Behalf Of mmarion@qualcomm.com
> =Sent: Thursday, February 03, 2005 4:39 PM
> =To: ramana@intraperson.com
> =Cc: autofs@linux.kernel.org
> =Subject: Re: [autofs] unacceptable bug in autofs kernel module
> =
> =On 28 Dec, ramana wrote:
> =
> => Here is the bug in autofs3 module which causing so much pain. It simply
> => stopped me from adding much more interesting features to Autodir
> => http://www.intraperson.com/autodir/
> =[snip]
> => Because of this, user space test program reporting like this:
> =>
> => fail : /test/t944 : No such file or directory
> => fail : /test/t4187 : No such file or directory
> =
> =Hmm.. I wonder if this might be related to a weirdness we're seeing.
> =Running autofs-4.1.3 with previous latest patch to kernel (pre-2005
> =release) and users use LSF to submit batch jobs to hosts. On linux hosts,
> =user level programs will sometimes exit quickly with a "file does not
> =exist" error, even though you can login to the host and see the file/dir
> =just fine. As a hacked work-around, we have a pre-exec script that tries
> =to stat all the directories they need to force the mounts to happen
> =before their program touches the files.
> =
> =I didn't see any attempts to patch this bit.. did you have any ideas on
> =how to patch that particular piece of code? Or just comment it out?
> =
> =--
> =Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
> =Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud. So I had
> =to put 17 bullets in 'em." ==> Simpsons
> =
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 22:34 peter.a.harris
0 siblings, 0 replies; 18+ messages in thread
From: peter.a.harris @ 2005-02-04 22:34 UTC (permalink / raw)
To: Charles.Lever, jmoyer; +Cc: autofs, mmarion
We are running 6.5.3.
Pete Harris
Tektronix, Inc. / Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Ph: 1-503-627-3989 / Fax: 1-503-627-5587
PGP: 0xD1F493F6 EA9E 25B8 EF02 3EBD 26CB 7E28 026E 74DB D1F4 93F6
----------------------------------------------------------------------
-- Any opinions expressed are those of the author --
-- and may not represent Tektronix, Inc. --
=-----Original Message-----
=From: Lever, Charles [mailto:Charles.Lever@netapp.com]
=Sent: Friday, February 04, 2005 11:11 AM
=To: Harris, Peter A; Jeff Moyer
=Cc: mmarion@qualcomm.com; autofs@linux.kernel.org
=Subject: RE: [autofs] unacceptable bug in autofs kernel module
=
=This sounds a lot like the mount command is not retrying a mount when it
=gets a timed out RPC. Networking problems or an overloaded mountd on
=the server would both be reasons for an RPC timeout during a mount.
=
=Jeff, is the mount patch we worked on last summer available for RHEL 3,
=or is it just a RHEL AS 2.1 fix at this point?
=
=Peter, what release of Data ONTAP is running on the filer(s)?
=
<Snip>
^ permalink raw reply [flat|nested] 18+ messages in thread
* unacceptable bug in autofs kernel module
@ 2005-02-25 21:22 Trinh, Ngan
0 siblings, 0 replies; 18+ messages in thread
From: Trinh, Ngan @ 2005-02-25 21:22 UTC (permalink / raw)
To: autofs
[-- Attachment #1.1: Type: text/plain, Size: 7077 bytes --]
I am experiencing the same problem as Peter. Is there any fix for
this?
================================================
[autofs] unacceptable bug in autofs kernel module
peter.a.harris at exgate.tek.com
Fri Feb 4 10:58:17 PST 2005
Hello, all,
Funny you should mention - I was just getting ready to ask about this.
We are doing the same thing, i.e. submitting jobs via LSF. What we see are
file not found errors when trying to access a file somewhere down in the
tree of an automounted file system. For instance, a job will execute a Perl
script that starts with "#!/tools/perl5.8.3/bin/perl", which fails because
it cannot find the Perl executable. I log into the machine and do "ls
/tools/perl5.8.3/bin/perl" and get a file not found. I check /etc/mnttab or
/proc/mounts and /tools/perl5.8.3 is not mounted. So then I do an ls of
/tools/perl5.8.3 and the mount is made. Once I do that, the mount point is
generally well behaved for some random period of time, after which we go
through all this again.
At first we thought it was networking problems because we were also seeing
some "server not responding" errors on our Solaris boxes. We found that if
the mount failed with an RPC timeout, then the automounter would not try
again until you did an ls of the mount point directory (or in some cases,
you would have to cd to the directory to get the mount to happen). We have
fixed some networking problems that we found and the number of these kinds
of error messages has gone way down. Now we only see them when the 10 boxes
all run a cron job at 10PM and try to mount the same file system at the same
time. Some win but most lose.
Testing (60 second expiry, multiple jobs accessing files every 2 to 3
minutes; caused lots of expirations and remounts) showed that we could also
lose track of a mount if the mount expired and then immediately remounted.
Well, it would not remount but the automounter thought it had. Similarly to
the above, an ls or cd would fix the problem.
Occasionally, the automounter fails to mount without any indication that I
can find in /var/log/messages. And, again, an ls or cd of the directory
will cause the mount to happen.
Most of the machines are running Red Hat EL 3 U4 (automount 4.1.3-47,
2.4.21-27.0.1ELhugemem/smp kernel). One is running 4.1.3-12. A couple are
running RHEL 3 U0, 2.4.21-4EL kernel, 4.1.0-2 automount. We have several
IBM blades with P4's and mostly 4GB of memory. We also have one HP DL585
running AMD64 with 16GB of memory. Most run with a 10 minute expiry, but
one is set to 30 minutes and one to 1 hour. That does not seem to affect
the error rate. Some are running soft mounts to the tools (which should be
read only) and some are running hard mounts - this too does not seem to make
a difference.
And, oh yes, these mounts are all from NetApp Filers.
Anybody else see this and/or have any ideas?
Pete Harris
Tektronix, Inc.
Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Phone: 1-503-627-3989
Fax: 1-503-627-5587
----------------------------------------------------------------------
-- Any opinions expressed are those of the author --
-- and may not be those of Tektronix, Inc. --
=-----Original Message-----
=From: autofs-bounces at linux.kernel.org [mailto:autofs-
=bounces at linux.kernel.org] On Behalf Of mmarion at qualcomm.com
=Sent: Thursday, February 03, 2005 4:39 PM
=To: ramana at intraperson.com
=Cc: autofs at linux.kernel.org
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On 28 Dec, ramana wrote:
=
=> Here is the bug in autofs3 module which causing so much pain. It simply
=> stopped me from adding much more interesting features to Autodir
=> http://www.intraperson.com/autodir/
=[snip]
=> Because of this, user space test program reporting like this:
=>
=> fail : /test/t944 : No such file or directory
=> fail : /test/t4187 : No such file or directory
=
=Hmm.. I wonder if this might be related to a weirdness we're seeing.
=Running autofs-4.1.3 with previous latest patch to kernel (pre-2005
=release) and users use LSF to submit batch jobs to hosts. On linux hosts,
=user level programs will sometimes exit quickly with a "file does not
=exist" error, even though you can login to the host and see the file/dir
=just fine. As a hacked work-around, we have a pre-exec script that tries
=to stat all the directories they need to force the mounts to happen
=before their program touches the files.
=
=I didn't see any attempts to patch this bit.. did you have any ideas on
=how to patch that particular piece of code? Or just comment it out?
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud. So I had
=to put 17 bullets in 'em." ==> Simpsons
=
[-- Attachment #1.2: Type: text/html, Size: 8136 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: unacceptable bug in autofs kernel module
@ 2005-03-08 0:16 peter.a.harris
2005-03-08 22:53 ` Mike Marion
0 siblings, 1 reply; 18+ messages in thread
From: peter.a.harris @ 2005-03-08 0:16 UTC (permalink / raw)
To: mmarion; +Cc: autofs
Mike,
We use a mixture of direct and indirect maps, all with ghosting turned on.
The problems we are seeing are evenly spread between the two types. Our
basic tools and home directories are indirect maps, the project data and
project specific tools are in a large direct map. All the maps are NIS (not
NIS+) served. All totaled we have about 1200+ mounts.
We have also seen some problems on the Solaris boxes in a timeframe similar
to our Linux problems. We have been getting more RPC timeouts and NFS
server not responding errors in the messages files. And the users have been
complaining of, what they have termed, "pausenia" and "stuttering".
Pausenia is where the user types a command and the shell locks for 15 to 60
seconds before anything happens. Stuttering is a file not found error,
followed by the file being there when the user immediately (after cursing
and retyping) reissues the command. The one difference we see on the
Solaris side is that the automounter seems to self-recover.
We have turned off the "/net" program map because there are too many
problems with things staying mounted, overly long exports lists from the
filers, etc. and we don't really need it on every machine.
Pete Harris
Central Engineering / Technical Computing
Phone: 1-503-627-3989
Fax: 1-503-627-5587
__++__ --------- ___,--. --+_._._:_
_|____|_ _________ |__|_|__| |_SP&S_| |_|_===__|
oo oo ~ oo oo ~ oo oo ~ ooo ooo~ o OOOO =o\
============================================================
Perform random acts of kindness and senseless beauty...
=-----Original Message-----
=From: Mike Marion [mailto:mmarion@qualcomm.com]
=Sent: Monday, March 07, 2005 11:49 AM
=To: Harris, Peter A
=Cc: autofs@linux.kernel.org
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On Feb 4, 2005, at 10:58 AM, peter.a.harris@exgate.tek.com wrote:
=
=> Funny you should mention - I was just getting ready to ask about this.
=
=Curious.. those that are seeing that mount problem, especially for jobs
=submitted via a system like lsf.. What kind of map(s) are you seeing
=the problems on? Direct map with/without ghosting? Program map? yp?
=ldap? etc..
=
=We've been having the problem with a shell scripted, program map to
=support our sun auto.direct map, but I did just get the newer direct
=map support with ghosting working on a couple test boxes and want to do
=some testing to see if that helps vs the non-visible way it was with
=the program script.
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Drew: "Violence doesn't solve anything? World War I. World War II. Star
=Wars. every Super Bowl. Who says violence doesn't solve anything?!" ==>
=Drew Cary Show
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
2005-03-08 0:16 peter.a.harris
@ 2005-03-08 22:53 ` Mike Marion
0 siblings, 0 replies; 18+ messages in thread
From: Mike Marion @ 2005-03-08 22:53 UTC (permalink / raw)
To: autofs
On Mar 7, 2005, at 4:16 PM, peter.a.harris@exgate.tek.com wrote:
> We use a mixture of direct and indirect maps, all with ghosting turned on.
> The problems we are seeing are evenly spread between the two types. Our
> basic tools and home directories are indirect maps, the project data and
> project specific tools are in a large direct map. All the maps are NIS (not
> NIS+) served. All totaled we have about 1200+ mounts.
Sounds close to us in size, though we had to move our map into a file
ages ago... we broke NIS at some point. A quick check shows more than
3000 mounts, but still 1200 is a lot. We've been using program maps so
far though on linux, and haven't seen any of the same "the file isn't
there.. even though it's there" issue on solaris at all.
> We have also seen some problems on the Solaris boxes in a timeframe similar
> to our Linux problems. We have been getting more RPC timeouts and NFS
> server not responding errors in the messages files. And the users have been
> complaining of, what they have termed, "pausenia" and "stuttering".
> Pausenia is where the user types a command and the shell locks for 15 to 60
> seconds before anything happens. Stuttering is a file not found error,
> followed by the file being there when the user immediately (after cursing
> and retyping) reissues the command. The one difference we see on the
> Solaris side is that the automounter seems to self-recover.
Is the pausenia on solaris happening near the top of the hour? We've
had a similar problem that seems to occur when automount on solaris
does its hourly check/flush where it parses the maps again. We
noticed when that happens that automount takes up a lot of CPU too.
We don't see the stuttering so much since it only seems to occur when a
batch lsf job is submitted, and not when the user is working
interactively.
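One rough way to answer the top-of-the-hour question is to note when each pause was observed and see how many fall close to an hour boundary. The sketch below is only an illustration: the function name, the 5-minute window, and the idea of collecting timestamps by hand are all assumptions, not something either site is known to do.

```python
from datetime import datetime


def near_top_of_hour(stamps, window_minutes=5):
    """Return the fraction of timestamps within window_minutes of an hour boundary."""
    if not stamps:
        return 0.0
    hits = 0
    for ts in stamps:
        minutes_past = ts.minute + ts.second / 60.0
        # Distance to the nearest hour boundary, before or after.
        distance = min(minutes_past, 60.0 - minutes_past)
        if distance <= window_minutes:
            hits += 1
    return hits / len(stamps)
```

If the pauses were spread evenly over the hour, roughly 2 * 5 / 60 (about 17%) of them would land inside a 5-minute window, so a much higher fraction would point at the hourly map re-parse.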
> We have turned off the "/net" program map because there are too many
> problems with things staying mounted, overly long exports lists from the
> filers, etc. and we don't really need it on every machine.
Ah yes.. that's why I had to write this huge, ugly, shell script that
scrubs hosts trying to find and flush such hung/gone mounts. If you
ever want to see it.. let me know. ;) I found I could force umount
most of the hung mounts by spoofing the IP of a missing host on an
aliased interface, though I had to disable tcp mounts on /net for that
to work.
--
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Drew: "Violence doesn't solve anything? World War I. World War II. Star
Wars. every Super Bowl. Who says violence doesn't solve anything?!" ==>
Drew Cary Show
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: unacceptable bug in autofs kernel module
@ 2005-03-21 20:54 devnull
0 siblings, 0 replies; 18+ messages in thread
From: devnull @ 2005-03-21 20:54 UTC (permalink / raw)
To: AutoFS on Linux Kernel
> the problems on? Direct map with/without ghosting? Program map? yp?
> ldap? etc..
Program Map, yp with ghosting.
I haven't bothered to turn ghosting off to see if that will help any. I
need ghosting in any case.
The issue I am seeing is rather simple, easy to reproduce and happens
every time.
Automount -V yields 4.1.2
4.1.4_beta2 has the problem too.
The job being run via "lsf" really has nothing to do with the problem, I
can run the job locally on the machine and reproduce it.
The application I run creates a directory to store certain results, and to
make sure that I am not looking at older results, my script deletes that
directory at the very beginning of the job.
Somehow this causes a problem: the directory is always reported
missing. However, if I remove the directory by hand and then run the
script, everything works great.
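A minimal reproduction of that sequence might look like the sketch below. It is an assumption about the ordering (delete, then create, then look up) rather than the actual job script, and `recreate_and_check` is a hypothetical name; the path passed in would be a results directory under the automounted tree.

```python
import os
import shutil


def recreate_and_check(results_dir):
    """Delete results_dir if present, create it again, then look it up.

    Returns True if the freshly created directory can be stat'ed and
    False if the lookup fails with "No such file or directory".
    """
    # Step 1: what the job script does first, remove any old results.
    shutil.rmtree(results_dir, ignore_errors=True)
    # Step 2: what the application does next, create the directory.
    os.makedirs(results_dir)
    # Step 3: look the directory up again, as later steps of the job would.
    try:
        os.stat(results_dir)
        return True
    except FileNotFoundError:
        return False
```

On a local filesystem this should always return True. Running it in a loop against the automounted path, and comparing with a run where the directory was removed by hand beforehand, would show whether the delete-then-create step alone is enough to trigger the failure.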
Where would I need to look to see what process is caching the contents of
the directory?
Thanks.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2005-03-21 20:54 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-12-28 7:51 unacceptable bug in autofs kernel module ramana
2004-12-29 1:02 ` Ian Kent
2004-12-29 3:44 ` ramana
[not found] ` <41D21C1E.8040407@intraperson.com>
[not found] ` <Pine.LNX.4.58.0412291418160.8463@wombat.indigo.net.au>
[not found] ` <41D28271.601@intraperson.com>
2004-12-30 0:38 ` Ian Kent
2004-12-30 0:47 ` Ian Kent
[not found] ` <41D370E7.9080409@intraperson.com>
2004-12-30 5:42 ` Ian Kent
2005-02-04 0:38 ` mmarion
2005-02-04 1:49 ` Ian Kent
2005-02-04 2:59 ` ramana
2005-02-05 13:46 ` ramana
-- strict thread matches above, loose matches on Subject: below --
2005-02-04 18:58 peter.a.harris
2005-03-07 19:49 ` Mike Marion
2005-02-04 19:11 Lever, Charles
2005-02-04 22:34 peter.a.harris
2005-02-25 21:22 Trinh, Ngan
2005-03-08 0:16 peter.a.harris
2005-03-08 22:53 ` Mike Marion
2005-03-21 20:54 devnull