Seeing autofs-5.0.2 core dumping

* Seeing autofs-5.0.2 core dumping
@ 2007-12-21  2:30 Mike Marion
  2007-12-21 11:34 ` Ian Kent
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Marion @ 2007-12-21  2:30 UTC
  To: autofs

In the last 2 days we're seeing our autofs 5.0.2 daemon dumping core,
and it seems to be triggered by a kill -HUP call to it to make it re-read
the maps.  We use all LDAP maps (and if HUP isn't needed there, we can
turn it off).  It only seems to trigger if the daemon has been running
for at least a few hours, as I can send it numerous HUP signals after
restarting it and it won't crash.

It looks like the HUP is making it try to shut down a subset of the
paths (and I see this in syslog sometimes without segfaulting too)..
where it does several entries of:

 automount[2475]: umounted direct mount <path>
followed by the same paths in the same order:
 automount[2475]: rmdir_path: lstat of <path> failed
and then it core dumps:
automount[7419]: segfault at 00002aaaac141e08 rip 0000000000410d63 rsp
0000000040627030 error 4

Sometimes that happens after 1 of the above failed rmdir_path lines,
sometimes after most or all.
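
As a reading aid for the segfault line above: on x86 the trailing
"error N" is the page-fault error code, a bit mask.  A minimal sketch
that decodes its low three bits follows; the function name decode_fault
is illustrative only and is not part of autofs or the kernel.

```python
def decode_fault(error_code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if error_code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if error_code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if error_code & 0x4 else "kernel",
    }


print(decode_fault(4))
# → {'cause': 'page not present', 'access': 'read', 'mode': 'user'}
```

So "error 4" describes a user-mode read of an unmapped address.  That
fits a stale or freed pointer being followed, but it does not by itself
say which one.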

gdb shows them all crashing at the same point:
#0  lookup_prune_cache (ap=0x54ace0, age=1198202622) at lookup.c:1014

Unfortunately I don't seem to have the exact same patched copy of
lookup.c: when I ran the build again and looked at the file after rpm
had patched it, line 1014 was blank.

This has only cropped up in the last few days.. 

Running SLES9-SP3 hosts with 2.6.16.21-0.8 kernel from sles10 built on
it (using src.rpm) with autofs5 patch added.  Autofs-5.0.2 with patches 
as of June of this year (I believe).

First possible thing that comes to mind:
- Are our maps just too big now?  We have huge maps now, a typical
  /proc/mounts has values like so:
$ grep ^auto. /proc/mounts  |wc
   6940   41640  815531

Yes.. we have almost 7000 mounts in the maps.  Those are all direct
mounts.  We have > 25,000 mounts in our homedir map, but that's an
indirect map.

If one of the newer patches in the last few months might address this,
I'll be happy to patch up.  

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
Groundskeeper Willie: "What? Have you gone waxie in your pister? I canna' fit
in the wee vent, you croquet playing mint muncher."
Principal Skinner: "Greese yourself up and go in you... you, guff speaking work
slacker!"
Willie: "Oooh.  Good comeback."  ==> Simpsons.

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21  2:30 Seeing autofs-5.0.2 core dumping Mike Marion
@ 2007-12-21 11:34 ` Ian Kent
  2007-12-21 19:27   ` Mike Marion
  0 siblings, 1 reply; 8+ messages in thread
From: Ian Kent @ 2007-12-21 11:34 UTC
  To: Mike Marion; +Cc: autofs

On Thu, 2007-12-20 at 18:30 -0800, Mike Marion wrote:
> In the last 2 days we're seeing our autofs 5.0.2 daemon dumping core,
> and it seems to be triggered by a kill -HUP call to it to make it re-read
> the maps.  We use all LDAP maps (and if HUP isn't needed there, we can
> turn it off).  It only seems to trigger if the daemon has been running
> for at least a few hours, as I can send it numerous HUP signals after
> restarting it and it won't crash.

When it rains it pours.
Second SEGV report today.

> 
> It looks like the HUP is making it try to shut down a subset of the
> paths (and I see this in syslog sometimes without segfaulting too)..
> where it does several entries of:
> 
>  automount[2475]: umounted direct mount <path>
> followed by the same paths in the same order:
>  automount[2475]: rmdir_path: lstat of <path> failed
> and then it core dumps:
> automount[7419]: segfault at 00002aaaac141e08 rip 0000000000410d63 rsp
> 0000000040627030 error 4

There was a bug that caused the direct map to be pruned out of existence
when a server connection failed for some reason. I don't remember seeing
a SEGV although I wasn't paying attention to that when I worked on it.

> 
> Sometimes that happens after 1 of the above failed rmdir_path lines,
> sometimes after most or all.
> 
> gdb shows them all crashing at the same point:
> #0  lookup_prune_cache (ap=0x54ace0, age=1198202622) at lookup.c:1014
> 
> Unfortunately I don't seem to have the exact same patched copy of
> lookup.c: when I ran the build again and looked at the file after rpm
> had patched it, line 1014 was blank.
> 
> This has only cropped up in the last few days.. 
> 
> Running SLES9-SP3 hosts with 2.6.16.21-0.8 kernel from sles10 built on
> it (using src.rpm) with autofs5 patch added.  Autofs-5.0.2 with patches 
> as of June of this year (I believe).

I'm not quite sure what that means but this doesn't sound like a kernel
problem so far.

> 
> First possible thing that comes to mind:
> - Are our maps just too big now?  We have huge maps now, a typical
>   /proc/mounts has values like so:
> $ grep ^auto. /proc/mounts  |wc
>    6940   41640  815531
> 
> Yes.. we have almost 7000 mounts in the maps.  Those are all direct
> mounts.  We have > 25,000 mounts in our homedir map, but that's an
> indirect map.

That shouldn't be a problem except that expires and map reads will take
much longer. If there are problems with synchronization I expect you
will see them before most others.

> 
> If one of the newer patches in the last few months might address this,
> I'll be happy to patch up.  

There are a lot of patches, about 62 now.
I need to consolidate and release 5.0.3 but I'm still testing and now
have a couple more bugs.

I would prefer to work from fully patched source if possible.

Ian

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21 11:34 ` Ian Kent
@ 2007-12-21 19:27   ` Mike Marion
  2007-12-21 21:13     ` Mike Marion
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Marion @ 2007-12-21 19:27 UTC
  To: Ian Kent; +Cc: autofs

On Fri, Dec 21, 2007 at 08:34:16PM +0900, Ian Kent wrote:

> There was a bug that caused the direct map to be pruned out of existence
> when a server connection failed for some reason. I don't remember seeing
> a SEGV although I wasn't paying attention to that when I worked on it.

Yeah, we're digging into trying to get more info via debugging.  The
weird thing is that every host with an issue seems to be doing the
umount/rmdir on the exact same subset of paths (about 13 out of almost
7000 entries) and I'm positive that at least 1 path was changed 2 days
ago.  Changed in that the local path is different, but the mount point
(server:/path) is the same.  So we're going to try to re-create this on
a host by hand to see if it has to do with the daemon seeing a change in
what it has cached vs what the maps now answer with.
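
The "cached vs what the maps now answer with" comparison can be
sketched as a plain diff of two snapshots.  This is only an
illustration of the idea, not how autofs's lookup_prune_cache() is
implemented; diff_maps and the sample paths are made up for the
example.

```python
def diff_maps(cached: dict, current: dict):
    """Compare two {local path: server:/path} snapshots.

    Returns (removed, added, moved), where moved pairs up a removed and
    an added local path that share the same server:/path source.
    """
    removed = sorted(set(cached) - set(current))
    added = sorted(set(current) - set(cached))
    # Index the removed entries by their source so relocations show up.
    old_by_source = {cached[path]: path for path in removed}
    moved = [
        (old_by_source[current[path]], path)
        for path in added
        if current[path] in old_by_source
    ]
    return removed, added, moved


# Hypothetical data: one entry keeps its source but changes local path.
cached = {"/prj/a": "srv:/vol/a", "/prj/old": "srv:/vol/x"}
current = {"/prj/a": "srv:/vol/a", "/prj/new": "srv:/vol/x"}
print(diff_maps(cached, current))
# → (['/prj/old'], ['/prj/new'], [('/prj/old', '/prj/new')])
```

An entry that shows up in "moved" is one where the old mount may still
be listed in /proc/mounts while the map no longer mentions its path.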

Even a new daemon seems to show this problem eventually too (if
restarted on a host after the daemon core dumped before) so it might be
that the change can be seen between what's in /proc/mounts still vs the
ldap map entries, and can poison the new daemon too.

BTW, not positive on this yet, but setting the logging to debug (vs
verbose) in /etc/sysconfig/autofs doesn't seem to trigger the same
logging with 5.0.2 that it did/does with 5.0.1.

> That shouldn't be a problem except that expires and map reads will take
> much longer. If there are problems with synchronization I expect you
> will see them before most others.

Yeah, what takes a _really_ long time, but actually works well (a credit
to the autofs5 code) is that if the running daemon is dorked or dies or
whatever... you can start up a new one, and it'll parse the entire
/proc/mounts path and take over the existing mounts and such just fine.

> I would prefer to work from fully patched source if possible.

We'll be doing some testing on this as well.  Just harder for us to get
buy-in to push out an update on something this important to the compute
pool.. but if we can prove it's more stable and fixes a problem, it's
a little easier.

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
Peggy: "12 years old and drinking a beer!?!"
Bobby: "I didn't even like it!"
Hank: "Well now you're just trying to get me mad!" ==> King of the Hill

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21 19:27   ` Mike Marion
@ 2007-12-21 21:13     ` Mike Marion
  2007-12-21 21:50       ` Mike Marion
  2007-12-21 22:47       ` Mike Marion
  0 siblings, 2 replies; 8+ messages in thread
From: Mike Marion @ 2007-12-21 21:13 UTC
  To: autofs

On Fri, Dec 21, 2007 at 11:27:45AM -0800, Mike Marion wrote:

> ago.  Changed in that the local path is different, but the mount point
> (server:/path) is the same.  So we're going to try to re-create this on
> a host by hand to see if it has to do with the daemon seeing a change in
> what it has cached vs what the maps now answer with.

Even weirder... I built a new box in our lab to use for testing and
found that a small subset of paths didn't get loaded in the initial
startup.. but when the default 10min timeout happened, they were then
added.  That subset exactly matches the subset we're seeing the bug
with.  Nothing obvious in verbose output as to why it didn't load those
paths initially.

Setting up a debug log and going to strace the daemon at the same time.

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
SCSI is *NOT* magic. There are *fundamental technical reasons* why it is
necessary to sacrifice a young goat to your SCSI chain now and then.

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21 21:13     ` Mike Marion
@ 2007-12-21 21:50       ` Mike Marion
  2007-12-21 22:47       ` Mike Marion
  1 sibling, 0 replies; 8+ messages in thread
From: Mike Marion @ 2007-12-21 21:50 UTC
  To: autofs

On Fri, Dec 21, 2007 at 01:13:00PM -0800, Mike Marion wrote:

> Setting up a debug log and going to strace the daemon at the same time.

Gah.. haven't seen a repeat of that across multiple restarts and reboots
though.

Fixed what I mentioned about debug logging too.. turns out I'm a doofus
and missed that we weren't specifically logging daemon.debug on all
hosts.  *smack*

About to debug log a case (which is easy to reproduce) where the HUP
triggers the "umounted direct mount" issue.

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
Sorry, please try again. Thank you for taking the Turing test.

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21 21:13     ` Mike Marion
  2007-12-21 21:50       ` Mike Marion
@ 2007-12-21 22:47       ` Mike Marion
  2007-12-22  8:20         ` Ian Kent
  1 sibling, 1 reply; 8+ messages in thread
From: Mike Marion @ 2007-12-21 22:47 UTC
  To: autofs

On Fri, Dec 21, 2007 at 01:13:00PM -0800, Mike Marion wrote:

> Even weirder... I built a new box in our lab to use for testing and
> found that a small subset of paths didn't get loaded in the initial
> startup.. but when the default 10min timeout happened, they were then
> added.  That subset exactly matches the subset we're seeing the bug
> with.  Nothing obvious in verbose output as to why it didn't load those
> paths initially.

Well I have tracked down what appears to be the root cause.  The above
made me run a loop of ldapsearch queries against our ldap farm and I
found that periodically I would get a return missing the exact paths
we're seeing being removed then re-added over and over.  We have a CSS
setup to handle load balancing and failover for the ldap farm and I
found that one server behind it is giving out the broken maps.  Working
on getting that fixed ASAP (I'm not the ldap guy).
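
The query loop described above boils down to asking the same question
many times and flagging any key that is missing from at least one
answer.  A minimal sketch of that check follows, assuming a
hypothetical fetch() callable that returns the map keys from one query
(in practice it might wrap an ldapsearch call; that part is not shown).

```python
import itertools
from typing import Callable, Iterable, Optional, Set


def find_flapping_keys(fetch: Callable[[], Iterable[str]], rounds: int) -> Set[str]:
    """Return keys present in some responses but absent from others."""
    seen_any: Set[str] = set()
    seen_all: Optional[Set[str]] = None
    for _ in range(rounds):
        keys = set(fetch())
        seen_any |= keys  # union of everything ever returned
        seen_all = keys if seen_all is None else seen_all & keys
    return seen_any - (seen_all or set())


# Stand-in for a load balancer alternating between a good and a bad server.
responses = itertools.cycle([{"/prj/a", "/prj/b", "/prj/c"}, {"/prj/a"}])
print(sorted(find_flapping_keys(lambda: next(responses), rounds=4)))
# → ['/prj/b', '/prj/c']
```

An empty result after enough rounds suggests every backend answered
with the same key set; it says nothing about whether the values match.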

No details on why this happening over several hours would eventually get
the daemon into a state where it finally core-dumped though.  But fixing
the map results should make that not happen anymore anyway.  If I can
manage to catch a debug output and/or strace of one that still
core-dumps before the ldap side is fixed, I'll pass it along.

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
"Do you know what this is?  No, I can see you don't.  You have that vacant look
in your eyes that says, 'Place my head to your ear.. you will hear the sea!'"
--Londo, Babylon 5.

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-21 22:47       ` Mike Marion
@ 2007-12-22  8:20         ` Ian Kent
  2007-12-24  7:40           ` Mike Marion
  0 siblings, 1 reply; 8+ messages in thread
From: Ian Kent @ 2007-12-22  8:20 UTC
  To: Mike Marion; +Cc: autofs

On Fri, 2007-12-21 at 14:47 -0800, Mike Marion wrote:
> Well I have tracked down what appears to be the root cause.  The above
> made me run a loop of ldapsearch queries against our ldap farm and I
> found that periodically I would get a return missing the exact paths
> we're seeing being removed then re-added over and over.  We have a CSS
> setup to handle load balancing and failover for the ldap farm and I
> found that one server behind it is giving out the broken maps.  Working
> on getting that fixed ASAP (I'm not the ldap guy).
> 
> No details on why this happening over several hours would eventually get
> the daemon into a state where it finally core-dumped though.  But fixing
> the map results should make that not happen anymore anyway.  If I can
> manage to catch a debug output and/or strace of one that still
> core-dumps before the ldap side is fixed, I'll pass it along.
> 

That's good news at least.
Any info you manage to get will be appreciated.

Ian

* Re: Seeing autofs-5.0.2 core dumping
  2007-12-22  8:20         ` Ian Kent
@ 2007-12-24  7:40           ` Mike Marion
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Marion @ 2007-12-24  7:40 UTC
  To: autofs

On Sat, Dec 22, 2007 at 05:20:11PM +0900, Ian Kent wrote:

> That's good news at least.
> Any info you manage to get will be appreciated.

Later this week (after holiday) I'm going to see if we can set up a box
that keeps alternating between 2 test ldap servers that reproduce what
we were seeing here... basically just remove some paths from one so that
each update keeps making it drop, then re-add the paths.  Will make sure
to turn on full debug logging and also likely will strace the daemon.
Hopefully we can re-create the crash and gather some good data on it.

-- 
Mike Marion-Unix SysAdmin/Staff IT Engineer-http://www.qualcomm.com
Ned: "How do you do it Homer?  How do you silence that little voice that says
think?"
Homer: "You mean Lisa?"  ==> Simpsons

Thread overview: 8+ messages
2007-12-21  2:30 Seeing autofs-5.0.2 core dumping Mike Marion
2007-12-21 11:34 ` Ian Kent
2007-12-21 19:27   ` Mike Marion
2007-12-21 21:13     ` Mike Marion
2007-12-21 21:50       ` Mike Marion
2007-12-21 22:47       ` Mike Marion
2007-12-22  8:20         ` Ian Kent
2007-12-24  7:40           ` Mike Marion
