* unacceptable bug in autofs kernel module
@ 2004-12-28  7:51 ramana
  2004-12-29  1:02 ` Ian Kent
  2005-02-04  0:38 ` mmarion
  0 siblings, 2 replies; 18+ messages in thread
From: ramana @ 2004-12-28  7:51 UTC (permalink / raw)
  To: autofs

Dear developers,

Here is a bug in the autofs3 module which is causing so much pain. It has 
simply stopped me from adding more interesting features to Autodir 
http://www.intraperson.com/autodir/

Taken from Linux kernel 2.4 autofs module source.

file:            root.c
function:    autofs_root_lookup.
protocol:    3

        /*
         * If this dentry is unhashed, then we shouldn't honour this
         * lookup even if the dentry is positive.  Returning ENOENT here
         * doesn't do the right thing for all system calls, but it should
         * be OK for the operations we permit from an autofs.
         */

        /*
        if ( dentry->d_inode && d_unhashed(dentry) )
                return ERR_PTR(-ENOENT);
        */

        if ( dentry->d_inode && d_unhashed(dentry) ) {
                printk( "ENOENT for %s\n", dentry->d_name.name );
                return ERR_PTR(-ENOENT);
        }

I added the printk to trace it easily. To my surprise, autofs4 also has 
similar code.

Because of this, my user space test program reports failures like this:

fail : /test/t944 : No such file or directory
fail : /test/t4187 : No such file or directory
fail : /test/t100 : No such file or directory
fail : /test/t806 : No such file or directory
fail : /test/t3451 : No such file or directory
fail : /test/t1790 : No such file or directory
fail : /test/t3555 : No such file or directory
fail : /test/t3098 : No such file or directory
fail : /test/t4085 : No such file or directory
fail : /test/t3935 : No such file or directory

The corresponding kernel messages are:

ENOENT for t944
ENOENT for t4187
ENOENT for t100
ENOENT for t806
ENOENT for t3451
ENOENT for t1790
ENOENT for t3555
ENOENT for t3098
ENOENT for t4085
ENOENT for t3935

The error rate (i.e. the rate of ENOENT failures), as measured over months 
of stress tests, is around 0.002% and increases as system load increases. 
Even at this rate I do not think it is acceptable in production systems.
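
For anyone who wants to take a measurement of this kind, the following is a
minimal sketch of a user space stress test, written in Python purely for
illustration. It is not the test program that produced the output above; the
name stress_lookup and its parameters are assumptions. It stats every path
repeatedly, prints each ENOENT in the same "fail : path : reason" form, and
returns the failure rate as a percentage. For scale, 0.002% is roughly 2
failures per 100,000 lookups.

```python
import errno
import os


def stress_lookup(paths, rounds, stat=os.stat):
    """Stat every path `rounds` times; return (attempts, failures, rate in %)."""
    attempts = 0
    failures = []
    for _ in range(rounds):
        for path in paths:
            attempts += 1
            try:
                stat(path)
            except OSError as exc:
                if exc.errno != errno.ENOENT:
                    raise  # only "No such file or directory" is counted here
                failures.append(path)
                print(f"fail : {path} : {os.strerror(exc.errno)}")
    rate = 100.0 * len(failures) / attempts if attempts else 0.0
    return attempts, failures, rate
```

A run over paths modelled on the /test/tNNNN names above could look like
stress_lookup(["/test/t%d" % i for i in range(5000)], rounds=100); those
paths and counts are hypothetical.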

Thanks in advance.

Regards
ramana

-- 
http://www.intraperson.com


* Re: unacceptable bug in autofs kernel module
  2004-12-28  7:51 unacceptable bug in autofs kernel module ramana
@ 2004-12-29  1:02 ` Ian Kent
  2004-12-29  3:44   ` ramana
       [not found]   ` <41D21C1E.8040407@intraperson.com>
  2005-02-04  0:38 ` mmarion
  1 sibling, 2 replies; 18+ messages in thread
From: Ian Kent @ 2004-12-29  1:02 UTC (permalink / raw)
  To: ramana; +Cc: autofs

On Tue, 28 Dec 2004, ramana wrote:

> Dear developers,
> 
> Here is the bug in autofs3 module which causing so much pain. It simply 
> stopped me from adding much more interesting features to Autodir 
> http://www.intraperson.com/autodir/

Thanks for this.

You've provided some symptoms but you haven't provided any explanation as 
to why this is a bug.

Can you explain why you need the kernel to honour a lookup for an already 
deleted dentry?

This could be due to the way that autofs does a d_drop instead of a 
d_delete in the directory unlink callback. However, the dentry, for all 
intents and purposes, has already been deleted.



* Re: unacceptable bug in autofs kernel module
  2004-12-29  1:02 ` Ian Kent
@ 2004-12-29  3:44   ` ramana
       [not found]   ` <41D21C1E.8040407@intraperson.com>
  1 sibling, 0 replies; 18+ messages in thread
From: ramana @ 2004-12-29  3:44 UTC (permalink / raw)
  To: Ian Kent, autofs

Ian Kent wrote:

>On Tue, 28 Dec 2004, ramana wrote:
>
>>Dear developers,
>>
>>Here is the bug in autofs3 module which causing so much pain. It simply
>>stopped me from adding much more interesting features to Autodir
>>http://www.intraperson.com/autodir/
>
>Thanks for this.
>
>You've provided some symptoms but you haven't provided any explanation as
>to why this is a bug.
>
>Can you explain why you need the kernel to honour a lookup for an already
>deleted dentry?

If it is deleted, then the kernel should clean up whatever needs cleaning 
up and report back to the user space autofs/autodir daemon that this 
directory is missing, instead of directly reporting that the entry does not 
exist because it is deleted. After all, that is what autofs is all about.

What is important is that the -t option is a user-settable option. If the 
user chooses a low value like 1 or 2 seconds, these autofs directories will 
be deleted more frequently. But deletion does not mean that they do not 
actually exist.

It is certainly a bug. Let us view it from the user space application that 
is accessing these autofs directories. Most of the time it gets access to 
directories which exist and are perfectly legal. Then suddenly, at some 
point, the kernel decides by itself to report that a directory does not 
exist, without even asking the user space autofs/autodir daemon.

What is important here is that if I access it again a while after the 
ENOENT, everything works perfectly. Is this not inconsistent behavior?

The above statement is true, as I tested it from user space applications 
with millions of directory requests rather than looking at it from the 
kernel's point of view. If I access an autofs directory 999 times 
successfully and get ENOENT 1 time without a proper reason from the user 
space daemon, it is certainly a bug.
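
The behaviour described above (ENOENT once, then success on a later access)
can be illustrated from user space with a small retry loop. This is only a
sketch under stated assumptions: stat_with_retry, the retry count, and the
delay are hypothetical, and it hides the symptom from the application rather
than fixing anything in the kernel module.

```python
import os
import time


def stat_with_retry(path, retries=3, delay=0.5, stat=os.stat, sleep=time.sleep):
    """Stat `path`, retrying only when the failure is ENOENT."""
    for attempt in range(retries + 1):
        try:
            return stat(path)
        except FileNotFoundError:  # raised by os.stat when errno is ENOENT
            if attempt == retries:
                raise  # still missing after every retry: treat as really absent
            sleep(delay)
```

The stat and sleep parameters exist only so the loop can be exercised without
a real autofs mount; a path that truly does not exist still fails, just later.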

Thanks for your reply.

>This could be due to the way that autofs does a d_drop instead of a
>d_delete in the directory unlink callback. However, the dentry, for all
>intents and purposes, has already been deleted.

Regards
ramana

-- 
http://www.intraperson.com


* Re: unacceptable bug in autofs kernel module
       [not found]       ` <41D28271.601@intraperson.com>
@ 2004-12-30  0:38         ` Ian Kent
  2004-12-30  0:47         ` Ian Kent
  1 sibling, 0 replies; 18+ messages in thread
From: Ian Kent @ 2004-12-30  0:38 UTC (permalink / raw)
  To: ramana; +Cc: autofs

On Wed, 29 Dec 2004, ramana wrote:

> Ian Kent wrote:
> 
> >Do you have a simple way of reproducing this?
> >  
> >
> 

And the environment.

Version of kernel?
Version of autofs?
What autofs4 module?

Ian


* Re: unacceptable bug in autofs kernel module
       [not found]       ` <41D28271.601@intraperson.com>
  2004-12-30  0:38         ` Ian Kent
@ 2004-12-30  0:47         ` Ian Kent
       [not found]           ` <41D370E7.9080409@intraperson.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Ian Kent @ 2004-12-30  0:47 UTC (permalink / raw)
  To: ramana; +Cc: autofs

On Wed, 29 Dec 2004, ramana wrote:

> autofs 3 with Fedora Core 1, but I suspect autofs4 will also behave 
> similarly, as the code does not look different with ENOENT replies.

autofs 3 is not being developed any more.

If you are testing you should use the latest autofs4. This can be done by 
using 2.6.10 or applying the appropriate patch.

If you want to try this out, let me know and I will post the latest patches 
I have to kernel.org now. 

Ian


* Re: unacceptable bug in autofs kernel module
       [not found]           ` <41D370E7.9080409@intraperson.com>
@ 2004-12-30  5:42             ` Ian Kent
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Kent @ 2004-12-30  5:42 UTC (permalink / raw)
  To: ramana; +Cc: autofs

On Thu, 30 Dec 2004, ramana wrote:

> Ian Kent wrote:
> 
> >On Wed, 29 Dec 2004, ramana wrote:
> >
> >>autofs 3 with Fedora Core 1, but I suspect autofs4 will also behave
> >>similarly, as the code does not look different with ENOENT replies.
> >
> >autofs 3 is not being developed any more.
> >
> >If you are testing you should use the latest autofs4. This can be done by
> >using 2.6.10 or applying the appropriate patch.
> >
> >If you want to try this out, let me know and I will post the latest patches
> >I have to kernel.org now.
> >
> >Ian
> >
> I would like to test it using the latest patches rather than going for 
> 2.6.10, as not everyone will migrate to 2.6 right now.

I need to do a couple more small things so I'll get them done and upload 
the patches asap.

What kernel version will you use?

> 
> I know the autofs 3 module is not maintained anymore, but I would like to 
> clarify one issue here. Is it the autofs3 module that is not maintained, 
> or the autofs3 protocol? I ask because, as far as I understand, the 
> autofs4 module supports the autofs3 protocol.

I've tried to maintain compatibility.

It's been a long time since I tested this at length, and many changes have 
been made since without this compatibility testing.

So yes, autofs4 should be backward compatible, and no, I can't confirm how 
good or not it is now.

Ian


* Re: unacceptable bug in autofs kernel module
  2004-12-28  7:51 unacceptable bug in autofs kernel module ramana
  2004-12-29  1:02 ` Ian Kent
@ 2005-02-04  0:38 ` mmarion
  2005-02-04  1:49   ` Ian Kent
  1 sibling, 1 reply; 18+ messages in thread
From: mmarion @ 2005-02-04  0:38 UTC (permalink / raw)
  To: ramana; +Cc: autofs

On 28 Dec, ramana wrote:

> Here is the bug in autofs3 module which causing so much pain. It simply 
> stopped me from adding much more interesting features to Autodir 
> http://www.intraperson.com/autodir/
[snip]
> Because of this, user space test program reporting like this:
> 
> fail : /test/t944 : No such file or directory
> fail : /test/t4187 : No such file or directory

Hmm.. I wonder if this might be related to a weirdness we're seeing. We're
running autofs-4.1.3 with the previous latest patch to the kernel (pre-2005
release), and users use LSF to submit batch jobs to hosts.  On Linux hosts,
user level programs will sometimes exit quickly with a "file does not exist"
error, even though you can log in to the host and see the file/dir just fine.
As a hacked work-around, we have a pre-exec script that tries to stat all the
directories they need to force the mounts to happen before their program
touches the files.
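
A pre-exec step of the kind described above might look roughly like the
sketch below. It is not the actual script; prime_mounts and
run_after_priming are hypothetical names. As an assumption, it lists each
directory instead of only stat-ing it, since whether a bare stat is enough
to trigger a mount is open to question.

```python
import os
import subprocess
import sys


def prime_mounts(directories, touch=os.listdir):
    """Access each directory; return (directory, reason) for every failure."""
    failed = []
    for directory in directories:
        try:
            touch(directory)  # listing the directory is what should force the mount
        except OSError as exc:
            failed.append((directory, exc.strerror))
    return failed


def run_after_priming(directories, command):
    """Run `command` only if every directory could be accessed first."""
    failed = prime_mounts(directories)
    for directory, reason in failed:
        print(f"pre-exec: {directory}: {reason}", file=sys.stderr)
    if failed:
        return 1  # refuse to start the job rather than let it fail half-way
    return subprocess.call(command)
```

Returning a non-zero status instead of starting the job is a design choice
made for this sketch; a batch system could use it to requeue the job.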

I didn't see any attempts to patch this bit.. did you have any ideas on how to
patch that particular piece of code?   Or just comment it out?

-- 
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud.  So I had to 
put 17 bullets in 'em." ==> Simpsons


* Re: unacceptable bug in autofs kernel module
  2005-02-04  0:38 ` mmarion
@ 2005-02-04  1:49   ` Ian Kent
  2005-02-04  2:59     ` ramana
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Kent @ 2005-02-04  1:49 UTC (permalink / raw)
  To: mmarion; +Cc: autofs

On Thu, 3 Feb 2005 mmarion@qualcomm.com wrote:

> On 28 Dec, ramana wrote:
> 
> > Here is the bug in autofs3 module which causing so much pain. It simply 
> > stopped me from adding much more interesting features to Autodir 
> > http://www.intraperson.com/autodir/
> [snip]
> > Because of this, user space test program reporting like this:
> > 
> > fail : /test/t944 : No such file or directory
> > fail : /test/t4187 : No such file or directory
> 
> Hmm.. I wonder if this might be related to a weirdness we're seeing. Running
> autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and users
> use LSF to submit batch jobs to hosts.  On linux hosts, user level programs
> will sometimes exit quickly with a "file does not exist" error, even though you
> can login to the host and see the file/dir just fine.  As a hacked
> work-around, we have a pre-exec script that tries to stat all the directories
> they need to force the mounts to happen before their program touches the
> files.

Does the stat actually mount anything?
It shouldn't?

Ian


* Re: unacceptable bug in autofs kernel module
  2005-02-04  1:49   ` Ian Kent
@ 2005-02-04  2:59     ` ramana
  2005-02-05 13:46       ` ramana
  0 siblings, 1 reply; 18+ messages in thread
From: ramana @ 2005-02-04  2:59 UTC (permalink / raw)
  To: Ian Kent, mmarion; +Cc: autofs

Ian Kent wrote:

>On Thu, 3 Feb 2005 mmarion@qualcomm.com wrote:
>
>>On 28 Dec, ramana wrote:
>>
>>>Here is the bug in autofs3 module which causing so much pain. It simply
>>>stopped me from adding much more interesting features to Autodir
>>>http://www.intraperson.com/autodir/
>>
>>[snip]
>>
>>>Because of this, user space test program reporting like this:
>>>
>>>fail : /test/t944 : No such file or directory
>>>fail : /test/t4187 : No such file or directory
>>
>>Hmm.. I wonder if this might be related to a weirdness we're seeing. Running
>>autofs-4.1.3 with previous latest patch to kernel (pre-2005 release) and users
>>use LSF to submit batch jobs to hosts.  On linux hosts, user level programs
>>will sometimes exit quickly with a "file does not exist" error, even though you
>>can login to the host and see the file/dir just fine.  As a hacked
>>work-around, we have a pre-exec script that tries to stat all the directories
>>they need to force the mounts to happen before their program touches the
>>files.
>
>Does the stat actually mount anything?
>It shouldn't?
>
>Ian

I moved the latest version of Autodir to the autofs4 kernel module, and so 
far all stress tests tell me the autofs4 protocol is performing well without 
these ENOENT errors. I have to do a few more tests before concluding 
anything as final.

For more details check http://www.intraperson.com/autodir/.
Version: Autodir 0.93.0 and above.

Regards
ramana

-- 
http://www.intraperson.com


* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 18:58 peter.a.harris
  2005-03-07 19:49 ` Mike Marion
  0 siblings, 1 reply; 18+ messages in thread
From: peter.a.harris @ 2005-02-04 18:58 UTC (permalink / raw)
  To: autofs; +Cc: mmarion

Hello, all,

Funny you should mention - I was just getting ready to ask about this.

We are doing the same thing, i.e. submitting jobs via LSF.  What we see are
file not found errors when trying to access a file somewhere down in the
tree of an automounted file system.  For instance, a job will execute a Perl
script that starts with "#!/tools/perl5.8.3/bin/perl", which fails because
it cannot find the Perl executable.  I log into the machine and do "ls
/tools/perl5.8.3/bin/perl" and get a file not found.  I check /etc/mnttab or
/proc/mounts and /tools/perl5.8.3 is not mounted.  So then I do an ls of
/tools/perl5.8.3 and the mount is made.  Once I do that, the mount point is
generally well behaved for some random period of time, after which we go
through all this again.
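
The manual check described above (look in /proc/mounts to see whether
/tools/perl5.8.3 is really mounted) can be scripted. What follows is a
sketch only: the function names are hypothetical, and skipping entries of
type "autofs" is an assumption, made so that the automounter's own trigger
mount is not mistaken for the real file system.

```python
def real_mount_points(mounts_text):
    """Return mount points from /proc/mounts-style text, skipping autofs entries."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, file system type, options, dump, pass.
        if len(fields) >= 3 and fields[2] != "autofs":
            points.add(fields[1])
    return points


def is_mounted(path, mounts_text):
    """True if `path` is itself a mount point of a non-autofs file system."""
    normalized = path.rstrip("/") or "/"
    # Limitation: /proc/mounts escapes spaces in paths as \040; not handled here.
    return normalized in real_mount_points(mounts_text)


def read_proc_mounts(mounts_file="/proc/mounts"):
    """Read the kernel's current mount table (Linux only)."""
    with open(mounts_file) as handle:
        return handle.read()
```

On a Linux host this would be called as
is_mounted("/tools/perl5.8.3", read_proc_mounts()).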

At first we thought it was networking problems because we were also seeing
some "server not responding" errors on our Solaris boxes.  We found that if
the mount failed with an RPC timeout, then the automounter would not try
again until you did an ls of the mount point directory (or in some cases,
you would have to cd to the directory to get the mount to happen).  We have
fixed some networking problems that we found and the number of these kinds
of error messages has gone way down.  Now we only see them when the 10 boxes
all run a cron job at 10PM and try to mount the same file system at the same
time.  Some win but most lose.

Testing (60 second expiry, multiple jobs accessing files every 2 to 3
minutes; caused lots of expirations and remounts) showed that we could also
lose track of a mount if the mount expired and then immediately remounted.
Well, it would not remount, but the automounter thought it had.  Similarly to
the above, an ls or cd would fix the problem.

Occasionally, the automounter fails to mount without any indication that I
can find in /var/log/messages.  And, again, an ls or cd of the directory
will cause the mount to happen.

Most of the machines are running Red Hat EL 3 U4 (automount 4.1.3-47,
2.4.21-27.0.1ELhugemem/smp kernel).  One is running 4.1.3-12.  A couple are
running RHEL 3 U0, 2.4.21-4EL kernel, 4.1.0-2 automount.  We have several
IBM blades with P4's and mostly 4GB of memory.  We also have one HP DL585
running AMD64 with 16GB of memory.  Most run with a 10 minute expiry, but
one is set to 30 minutes and one to 1 hour.  That does not seem to affect
the error rate.  Some are running soft mounts to the tools (which should be
read only) and some are running hard mounts - this too does not seem to make
a difference.

And, oh yes, these mounts are all from NetApp Filers.

Anybody else see this and/or have any ideas?


Pete Harris
Tektronix, Inc.
Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Phone:	1-503-627-3989
Fax:	1-503-627-5587
----------------------------------------------------------------------
--          Any opinions expressed are those of the author          --
--             and may not be those of Tektronix, Inc.              --



* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 19:11 Lever, Charles
  0 siblings, 0 replies; 18+ messages in thread
From: Lever, Charles @ 2005-02-04 19:11 UTC (permalink / raw)
  To: peter.a.harris, Jeff Moyer; +Cc: autofs, mmarion

This sounds a lot like the mount command is not retrying a mount when it
gets a timed out RPC.  Networking problems or an overloaded mountd on
the server would both be reasons for an RPC timeout during a mount.

Jeff, is the mount patch we worked on last summer available for RHEL 3,
or is it just a RHEL AS 2.1 fix at this point?

Peter, what release of Data ONTAP is running on the filer(s)?



* RE: unacceptable bug in autofs kernel module
@ 2005-02-04 22:34 peter.a.harris
  0 siblings, 0 replies; 18+ messages in thread
From: peter.a.harris @ 2005-02-04 22:34 UTC (permalink / raw)
  To: Charles.Lever, jmoyer; +Cc: autofs, mmarion

We are running 6.5.3.


Pete Harris                      
Tektronix, Inc. / Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Ph: 1-503-627-3989 / Fax: 1-503-627-5587
PGP: 0xD1F493F6      EA9E 25B8 EF02 3EBD 26CB 7E28 026E 74DB D1F4 93F6
----------------------------------------------------------------------
--          Any opinions expressed are those of the author          --
--               and may not represent Tektronix, Inc.              --



* Re: unacceptable bug in autofs kernel module
  2005-02-04  2:59     ` ramana
@ 2005-02-05 13:46       ` ramana
  0 siblings, 0 replies; 18+ messages in thread
From: ramana @ 2005-02-05 13:46 UTC (permalink / raw)
  To: autofs; +Cc: mmarion, Ian Kent


> I moved the latest version of Autodir to the autofs4 kernel module, and 
> so far all stress tests tell me the autofs4 protocol is performing well 
> without these ENOENT errors. I have to do a few more tests before 
> concluding anything as final.
>
> For more details check http://www.intraperson.com/autodir/.
> Version: Autodir 0.93.0 and above.
>
Recent tests, done on a lightly loaded machine, show me that this problem 
still exists even in the autofs4 kernel module.

Regards
ramana

-- 
http://www.intraperson.com


* unacceptable bug in autofs kernel module
@ 2005-02-25 21:22 Trinh, Ngan
  0 siblings, 0 replies; 18+ messages in thread
From: Trinh, Ngan @ 2005-02-25 21:22 UTC (permalink / raw)
  To: autofs




I am experiencing the same problem as Peter. Is there any fix for
this?


================================================


[autofs] unacceptable bug in autofs kernel module
peter.a.harris at exgate.tek.com
Fri Feb 4 10:58:17 PST 2005

Hello, all,

Funny you should mention - I was just getting ready to ask about this.

We are doing the same thing, i.e. submitting jobs via LSF.  What we see
are
file not found errors when trying to access a file somewhere down in the
tree of an automounted file system.  For instance, a job will execute a
Perl
script that starts with "#!/tools/perl5.8.3/bin/perl", which fails
because
it cannot find the Perl executable.  I log into the machine and do "ls
/tools/perl5.8.3/bin/perl" and get a file not found.  I check
/etc/mnttab or
/proc/mounts and /tools/perl5.8.3 is not mounted.  So then I do an ls of
/tools/perl5.8.3 and the mount is made.  Once I do that, the mount point
is
generally well behaved for some random period of time when we will go
through all this again.

At first we thought it was networking problems because we were also seeing
some "server not responding" errors on our Solaris boxes.  We found that
if the mount failed with an RPC timeout, then the automounter would not
try again until you did an ls of the mount point directory (or in some
cases, you would have to cd to the directory to get the mount to happen).
We have fixed some networking problems that we found and the number of
these kinds of error messages has gone way down.  Now we only see them
when the 10 boxes all run a cron job at 10PM and try to mount the same
file system at the same time.  Some win but most lose.

Testing (60 second expiry, multiple jobs accessing files every 2 to 3
minutes; caused lots of expirations and remounts) showed that we could
also lose track of a mount if the mount expired and then immediately
remounted.  Well, it would not remount but the automounter thought it had.
As above, an ls or cd would fix the problem.

Occasionally, the automounter fails to mount without any indication that I
can find in /var/log/messages.  And, again, an ls or cd of the directory
will cause the mount to happen.

Most of the machines are running Red Hat EL 3 U4 (automount 4.1.3-47,
2.4.21-27.0.1ELhugemem/smp kernel).  One is running 4.1.3-12.  A couple
are running RHEL 3 U0, 2.4.21-4EL kernel, 4.1.0-2 automount.  We have
several IBM blades with P4's and mostly 4GB of memory.  We also have one
HP DL585 running AMD64 with 16GB of memory.  Most run with a 10 minute
expiry, but one is set to 30 minutes and one to 1 hour.  That does not
seem to affect the error rate.  Some are running soft mounts to the tools
(which should be read only) and some are running hard mounts - this too
does not seem to make a difference.

And, oh yes, these mounts are all from NetApp Filers.

Anybody else see this and/or have any ideas?


Pete Harris
Tektronix, Inc.
Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Phone:	1-503-627-3989
Fax:	1-503-627-5587
----------------------------------------------------------------------
--          Any opinions expressed are those of the author          --
--             and may not be those of Tektronix, Inc.              --

=-----Original Message-----
=From: autofs-bounces at linux.kernel.org
=[mailto:autofs-bounces at linux.kernel.org] On Behalf Of
=mmarion at qualcomm.com
=Sent: Thursday, February 03, 2005 4:39 PM
=To: ramana at intraperson.com
=Cc: autofs at linux.kernel.org
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On 28 Dec, ramana wrote:
=
=> Here is the bug in autofs3 module which causing so much pain. It simply
=> stopped me from adding much more interesting features to Autodir
=> http://www.intraperson.com/autodir/
=[snip]
=> Because of this, user space test program reporting like this:
=>
=> fail : /test/t944 : No such file or directory
=> fail : /test/t4187 : No such file or directory
=
=Hmm.. I wonder if this might be related to a weirdness we're seeing.
=Running autofs-4.1.3 with previous latest patch to kernel (pre-2005
=release) and users use LSF to submit batch jobs to hosts.  On linux hosts,
=user level programs will sometimes exit quickly with a "file does not
=exist" error, even though you can login to the host and see the file/dir
=just fine.  As a hacked work-around, we have a pre-exec script that tries
=to stat all the directories they need to force the mounts to happen before
=their program touches the files.
=
=I didn't see any attempts to patch this bit.. did you have any ideas on
=how to patch that particular piece of code?   Or just comment it out?
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud.  So I had
=to put 17 bullets in 'em." ==> Simpsons
=
=_______________________________________________
=autofs mailing list
=autofs at linux.kernel.org
=http://linux.kernel.org/mailman/listinfo/autofs




[-- Attachment #1.2: Type: text/html, Size: 8136 bytes --]

[-- Attachment #2: Type: text/plain, Size: 140 bytes --]

_______________________________________________
autofs mailing list
autofs@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/autofs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: unacceptable bug in autofs kernel module
  2005-02-04 18:58 peter.a.harris
@ 2005-03-07 19:49 ` Mike Marion
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Marion @ 2005-03-07 19:49 UTC (permalink / raw)
  To: peter.a.harris; +Cc: autofs

On Feb 4, 2005, at 10:58 AM, peter.a.harris@exgate.tek.com wrote:

> Funny you should mention - I was just getting ready to ask about this.

Curious.. those that are seeing that mount problem, especially for jobs 
submitted via a system like lsf.. What kind of map(s) are you seeing 
the problems on?  Direct map with/without ghosting?   Program map?  yp? 
ldap? etc..

We've been having the problem with a shell scripted, program map to 
support our sun auto.direct map, but I did just get the newer direct 
map support with ghosting working on a couple test boxes and want to do 
some testing to see if that helps vs the non-visible way it was with 
the program script.

-- 
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Drew: "Violence doesn't solve anything? World War I. World War II. Star 
Wars.
every Super Bowl. Who says violence doesn't solve anything?!" ==> Drew 
Cary Show

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: unacceptable bug in autofs kernel module
@ 2005-03-08  0:16 peter.a.harris
  2005-03-08 22:53 ` Mike Marion
  0 siblings, 1 reply; 18+ messages in thread
From: peter.a.harris @ 2005-03-08  0:16 UTC (permalink / raw)
  To: mmarion; +Cc: autofs

Mike,

We use a mixture of direct and indirect maps, all with ghosting turned on.
The problems we are seeing are evenly spread between the two types.  Our
basic tools and home directories are indirect maps, the project data and
project specific tools are in a large direct map.  All the maps are NIS (not
NIS+) served.  All totaled we have about 1200+ mounts.

We have also seen some problems on the Solaris boxes in a timeframe similar
to our Linux problems.  We have been getting more RPC timeouts and NFS
server not responding errors in the messages files.  And the users have been
complaining of, what they have termed, "pausenia" and "stuttering".
Pausenia is where the user types a command and the shell locks for 15 to 60
seconds before anything happens.  Stuttering is a file not found error,
followed by the file being there when the user immediately (after cursing
and retyping) reissues the command.  The one difference we see on the
Solaris side is that the automounter seems to self-recover.

We have turned off the "/net" program map because there are too many
problems with things staying mounted, overly long exports lists from the
filers, etc. and we don't really need it on every machine.


Pete Harris
Central Engineering / Technical Computing
Phone:	1-503-627-3989
Fax:	1-503-627-5587

   __++__            ---------  ___,--. --+_._._:_    
  _|____|_ _________ |__|_|__| |_SP&S_| |_|_===__|    
   oo  oo ~ oo   oo ~ oo   oo ~ ooo ooo~ o OOOO =o\   
============================================================
Perform random acts of kindness and senseless beauty...

=-----Original Message-----
=From: Mike Marion [mailto:mmarion@qualcomm.com]
=Sent: Monday, March 07, 2005 11:49 AM
=To: Harris, Peter A
=Cc: autofs@linux.kernel.org
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On Feb 4, 2005, at 10:58 AM, peter.a.harris@exgate.tek.com wrote:
=
=> Funny you should mention - I was just getting ready to ask about this.
=
=Curious.. those that are seeing that mount problem, especially for jobs
=submitted via a system like lsf.. What kind of map(s) are you seeing
=the problems on?  Direct map with/without ghosting?   Program map?  yp?
=ldap? etc..
=
=We've been having the problem with a shell scripted, program map to
=support our sun auto.direct map, but I did just get the newer direct
=map support with ghosting working on a couple test boxes and want to do
=some testing to see if that helps vs the non-visible way it was with
=the program script.
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Drew: "Violence doesn't solve anything? World War I. World War II. Star
=Wars.
=every Super Bowl. Who says violence doesn't solve anything?!" ==> Drew
=Cary Show

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: unacceptable bug in autofs kernel module
  2005-03-08  0:16 peter.a.harris
@ 2005-03-08 22:53 ` Mike Marion
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Marion @ 2005-03-08 22:53 UTC (permalink / raw)
  To: autofs

On Mar 7, 2005, at 4:16 PM, peter.a.harris@exgate.tek.com wrote:

> We use a mixture of direct and indirect maps, all with ghosting turned 
> on.
> The problems we are seeing are evenly spread between the two types.  
> Our
> basic tools and home directories are indirect maps, the project data 
> and
> project specific tools are in a large direct map.  All the maps are 
> NIS (not
> NIS+) served.  All totaled we have about 1200+ mounts.

Sounds close to us in size, though we had to move our map into a file 
ages ago... we broke NIS at some point.  A quick check shows 3000+ 
mounts, but 1200 is still a lot.  We've been using program maps so far 
though on linux, and haven't seen any of the same "the file isn't 
there.. even though it's there" issue on solaris at all.

> We have also seen some problems on the Solaris boxes in a timeframe 
> similar
> to our Linux problems.  We have been getting more RPC timeouts and NFS
> server not responding errors in the messages files.  And the users 
> have been
> complaining of, what they have termed, "pausenia" and "stuttering".
> Pausenia is where the user types a command and the shell locks for 15 
> to 60
> seconds before anything happens.  Stuttering is a file not found error,
> followed by the file being there when the user immediately (after 
> cursing
> and retyping) reissues the command.  The one difference we see on the
> Solaris side is that the automounter seems to self-recover.

Is the pausenia on solaris happening near the top of the hour?  We've 
had a similar problem that seems to occur when automount on solaris 
does its hourly check/flush where it parses the maps again.  We 
noticed when that happens that automount takes up a lot of CPU too.

We don't see the stuttering so much since it only seems to occur when a 
batch lsf job is submitted, and not when the user is working 
interactively.

> We have turned off the "/net" program map because there are too many
> problems with things staying mounted, overly long exports lists from 
> the
> filers, etc. and we don't really need it on every machine.

Ah yes.. that's why I had to write this huge, ugly, shell script that 
scrubs hosts trying to find and flush such hung/gone mounts.   If you 
ever want to see it.. let me know. ;)  I found I could force umount 
most of the hung mounts by spoofing the IP of a missing host on an 
aliased interface, though I had to disable tcp mounts on /net for that 
to work.
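
The scrub script itself is not shown in this thread. As a rough
illustration of the detection half only, here is a hedged Python sketch
that decides whether a mount point still answers: it runs stat() in a child
process and gives up after a timeout. The name mount_responds and the
5 second default are assumptions, and nothing here attempts the umount or
the IP spoofing.

```python
import subprocess
import sys


def mount_responds(path, timeout=5.0):
    """Return True if stat(path) completes within timeout seconds.

    The stat runs in a child process so that a hung NFS mount blocks the
    child, not the caller. A fast failure such as ENOENT still counts as
    a response; only a timeout is reported as hung.
    """
    probe = [sys.executable, "-c",
             "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(probe,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort: a child stuck in uninterruptible I/O may ignore the
        # kill, so we deliberately do not wait for it again.
        proc.kill()
        return False
    return True
```

A scrubber could loop over the non-autofs mount points listed in
/proc/mounts and collect those for which mount_responds() returns False.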

-- 
Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
Drew: "Violence doesn't solve anything? World War I. World War II. Star 
Wars.
every Super Bowl. Who says violence doesn't solve anything?!" ==> Drew 
Cary Show

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: unacceptable bug in autofs kernel module
@ 2005-03-21 20:54 devnull
  0 siblings, 0 replies; 18+ messages in thread
From: devnull @ 2005-03-21 20:54 UTC (permalink / raw)
  To: AutoFS on Linux Kernel

> the problems on?  Direct map with/without ghosting?   Program map?  yp?
> ldap? etc..
Program Map, yp with ghosting.

I haven't bothered to turn ghosting off to see if that will help any. I 
need ghosting in any case.

The issue I am seeing is rather simple, easy to reproduce and happens 
every time.

Automount -V yields 4.1.2

4.1.4_beta2 has the problem too.

The job being run via "lsf" really has nothing to do with the problem, I 
can run the job locally on the machine and reproduce it.

The application I run creates a directory to store certain results, and to 
make sure that I am not looking at older results, my script deletes that 
directory at the very beginning of the job.

Somehow this causes a problem: the directory is always reported 
missing. However, if I remove the directory by hand and then run the 
script, everything works great.
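
The sequence described above can be sketched as a small standalone
reproducer in Python. This is only an assumption about what the job does
(delete the results directory, recreate it, write into it). The name
run_job and the file name result.txt are hypothetical.

```python
import os
import shutil


def run_job(results_dir):
    """Delete results_dir, recreate it, write a file; return (ok, error)."""
    try:
        # Step 1: the script's cleanup at the very beginning of the job.
        shutil.rmtree(results_dir, ignore_errors=True)
        # Step 2: the application creates the directory again.
        os.makedirs(results_dir)
        # Step 3: the application stores its results.
        with open(os.path.join(results_dir, "result.txt"), "w") as fh:
            fh.write("ok\n")
    except OSError as exc:
        # On an affected automounted path this is where ENOENT would show.
        return False, exc
    return os.path.isdir(results_dir), None
```

Running run_job() twice in a row against a path under the automounted
directory, and once against a local path such as /tmp, would show whether
the failure depends on autofs or on the delete-then-create pattern alone.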

Where would I need to look to see what process is caching the contents of 
the directory?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-03-21 20:54 UTC | newest]

Thread overview: 18+ messages
2004-12-28  7:51 unacceptable bug in autofs kernel module ramana
2004-12-29  1:02 ` Ian Kent
2004-12-29  3:44   ` ramana
     [not found]   ` <41D21C1E.8040407@intraperson.com>
     [not found]     ` <Pine.LNX.4.58.0412291418160.8463@wombat.indigo.net.au>
     [not found]       ` <41D28271.601@intraperson.com>
2004-12-30  0:38         ` Ian Kent
2004-12-30  0:47         ` Ian Kent
     [not found]           ` <41D370E7.9080409@intraperson.com>
2004-12-30  5:42             ` Ian Kent
2005-02-04  0:38 ` mmarion
2005-02-04  1:49   ` Ian Kent
2005-02-04  2:59     ` ramana
2005-02-05 13:46       ` ramana
  -- strict thread matches above, loose matches on Subject: below --
2005-02-04 18:58 peter.a.harris
2005-03-07 19:49 ` Mike Marion
2005-02-04 19:11 Lever, Charles
2005-02-04 22:34 peter.a.harris
2005-02-25 21:22 Trinh, Ngan
2005-03-08  0:16 peter.a.harris
2005-03-08 22:53 ` Mike Marion
2005-03-21 20:54 devnull
