* Re: Automount daemon getting killed by SIGBUS
[not found] <20100222194830.GA11730@libre.l.ngdn.org>
@ 2010-02-23 2:46 ` Ian Kent
2010-02-25 3:15 ` Ian Kent
1 sibling, 0 replies; 5+ messages in thread
From: Ian Kent @ 2010-02-23 2:46 UTC
To: Leonardo Chiquitto; +Cc: autofs
On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> Hello,
>
> We have a user reporting periodic crashes in automount. The daemon gets
> killed by SIGBUS when returning from spawn_mount():
>
> Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> Program terminated with signal 7, Bus error.
> #0 0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> address 0x80004062242c
> ) at spawn.c:412
> 412 }
>
> 0x0000555555566bcd <spawn_mount+829>: mov %r12d,%eax
> 0x0000555555566bd0 <spawn_mount+832>: pop %rbx
> 0x0000555555566bd1 <spawn_mount+833>: pop %r12
> 0x0000555555566bd3 <spawn_mount+835>: pop %r13
> 0x0000555555566bd5 <spawn_mount+837>: pop %r14
> 0x0000555555566bd7 <spawn_mount+839>: pop %r15
> 0x0000555555566bd9 <spawn_mount+841>: leaveq
> 0x0000555555566bda <spawn_mount+842>: retq
>
> Is it possible that we're exceeding stack usage at this point, mostly
> due to the call to alloca()? Do you think we should replace alloca() with
> regular malloc() in spawn.c (patch below)?
There were some changes to reduce the usage of alloca() contributed by
Val Henson some time ago but they didn't get all of them by any means.
Val pointed out the use of alloca() was bad so replacing them with
malloc() is a good idea whether this is a stack overflow or not. I'll
have a look at the patch and merge it.
Not sure if the source used here even has those patches since we don't
know what source it is.
Ian
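A minimal sketch of the kind of alloca()-to-malloc() replacement discussed above. The helper join_heap() is hypothetical and is not taken from spawn.c or from the patch in this thread; it only illustrates the pattern of moving a variable-length buffer from the stack to the heap.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from spawn.c): return "<prefix><name>" in a
 * heap buffer. The caller owns the result and must free() it. */
char *join_heap(const char *prefix, const char *name)
{
    size_t plen = strlen(prefix);
    size_t nlen = strlen(name);

    /* Previously this would have been alloca(plen + nlen + 1), which
     * grows the current stack frame and cannot report failure. */
    char *buf = malloc(plen + nlen + 1);
    if (buf == NULL)
        return NULL;                    /* malloc() failure is detectable */

    memcpy(buf, prefix, plen);
    memcpy(buf + plen, name, nlen + 1); /* nlen + 1 copies the trailing NUL */
    return buf;
}
```

The trade-off is that every exit path of the caller now has to free() the buffer, which alloca() handled implicitly on return.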
* Re: Automount daemon getting killed by SIGBUS
[not found] <20100222194830.GA11730@libre.l.ngdn.org>
2010-02-23 2:46 ` Automount daemon getting killed by SIGBUS Ian Kent
@ 2010-02-25 3:15 ` Ian Kent
2010-02-25 11:15 ` Leonardo Chiquitto
1 sibling, 1 reply; 5+ messages in thread
From: Ian Kent @ 2010-02-25 3:15 UTC
To: Leonardo Chiquitto; +Cc: autofs
On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> Hello,
>
> We have a user reporting periodic crashes in automount. The daemon gets
> killed by SIGBUS when returning from spawn_mount():
>
> Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> Program terminated with signal 7, Bus error.
> #0 0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> address 0x80004062242c
> ) at spawn.c:412
> 412 }
>
> 0x0000555555566bcd <spawn_mount+829>: mov %r12d,%eax
> 0x0000555555566bd0 <spawn_mount+832>: pop %rbx
> 0x0000555555566bd1 <spawn_mount+833>: pop %r12
> 0x0000555555566bd3 <spawn_mount+835>: pop %r13
> 0x0000555555566bd5 <spawn_mount+837>: pop %r14
> 0x0000555555566bd7 <spawn_mount+839>: pop %r15
> 0x0000555555566bd9 <spawn_mount+841>: leaveq
> 0x0000555555566bda <spawn_mount+842>: retq
>
> Is it possible that we're exceeding stack usage at this point, mostly
> due to the call to alloca()? Do you think we should replace alloca() with
> regular malloc() in spawn.c (patch below)?
Does this patch actually resolve your customers' problem?
What is the version in use and what additional patches have been applied?
Ian
* Re: Automount daemon getting killed by SIGBUS
2010-02-25 3:15 ` Ian Kent
@ 2010-02-25 11:15 ` Leonardo Chiquitto
2010-02-26 2:30 ` Ian Kent
0 siblings, 1 reply; 5+ messages in thread
From: Leonardo Chiquitto @ 2010-02-25 11:15 UTC
To: Ian Kent; +Cc: autofs
On Thu, Feb 25, 2010 at 11:15:31AM +0800, Ian Kent wrote:
> On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> > Hello,
> >
> > We have a user reporting periodic crashes in automount. The daemon gets
> > killed by SIGBUS when returning from spawn_mount():
> >
> > Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> > Program terminated with signal 7, Bus error.
> > #0 0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> > address 0x80004062242c
> > ) at spawn.c:412
> > 412 }
> >
> > 0x0000555555566bcd <spawn_mount+829>: mov %r12d,%eax
> > 0x0000555555566bd0 <spawn_mount+832>: pop %rbx
> > 0x0000555555566bd1 <spawn_mount+833>: pop %r12
> > 0x0000555555566bd3 <spawn_mount+835>: pop %r13
> > 0x0000555555566bd5 <spawn_mount+837>: pop %r14
> > 0x0000555555566bd7 <spawn_mount+839>: pop %r15
> > 0x0000555555566bd9 <spawn_mount+841>: leaveq
> > 0x0000555555566bda <spawn_mount+842>: retq
> >
> > Is it possible that we're exceeding stack usage at this point, mostly
> > due to the call to alloca()? Do you think we should replace alloca() with
> > regular malloc() in spawn.c (patch below)?
>
> Does this patch actually resolve your customers' problem?
Unfortunately I still don't know. Customer is currently running with
a workaround (increased stack limit from 8k to 32k) to avoid the
problem. I did some basic tests with the patch here but decided
to wait for your comments before submitting a test package.
> What is the version in use and what additional patches have been applied?
They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
don't have the other alloca() replacements that went in after 5.0.4.
Thanks!
Leonardo
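A minimal sketch of raising the stack limit from inside a process, assuming the workaround above refers to the RLIMIT_STACK resource limit. The thread does not say how the customer raised the limit or which unit "8k" and "32k" refer to, so the function name raise_stack_limit() and its byte-valued parameter are illustrative assumptions only.

```c
#include <sys/resource.h>

/* Hypothetical helper: make sure the soft RLIMIT_STACK is at least
 * `want` bytes, capped at the hard limit.
 * Returns 0 on success and -1 on error. */
int raise_stack_limit(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= want)
        return 0;

    /* An unprivileged process cannot exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;

    rl.rlim_cur = want;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

Stacks of threads created with explicit attributes are sized by pthread_attr_setstacksize() instead, so whether this limit affects a given automount thread depends on how that thread was created.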
* Re: Automount daemon getting killed by SIGBUS
2010-02-25 11:15 ` Leonardo Chiquitto
@ 2010-02-26 2:30 ` Ian Kent
2010-03-18 0:07 ` Leonardo Chiquitto
0 siblings, 1 reply; 5+ messages in thread
From: Ian Kent @ 2010-02-26 2:30 UTC
To: Leonardo Chiquitto; +Cc: autofs
On Thu, 2010-02-25 at 08:15 -0300, Leonardo Chiquitto wrote:
> On Thu, Feb 25, 2010 at 11:15:31AM +0800, Ian Kent wrote:
> > On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> > > Hello,
> > >
> > > We have a user reporting periodic crashes in automount. The daemon gets
> > > killed by SIGBUS when returning from spawn_mount():
> > >
> > > Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> > > Program terminated with signal 7, Bus error.
> > > #0 0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> > > address 0x80004062242c
> > > ) at spawn.c:412
> > > 412 }
> > >
> > > 0x0000555555566bcd <spawn_mount+829>: mov %r12d,%eax
> > > 0x0000555555566bd0 <spawn_mount+832>: pop %rbx
> > > 0x0000555555566bd1 <spawn_mount+833>: pop %r12
> > > 0x0000555555566bd3 <spawn_mount+835>: pop %r13
> > > 0x0000555555566bd5 <spawn_mount+837>: pop %r14
> > > 0x0000555555566bd7 <spawn_mount+839>: pop %r15
> > > 0x0000555555566bd9 <spawn_mount+841>: leaveq
> > > 0x0000555555566bda <spawn_mount+842>: retq
> > >
> > > Is it possible that we're exceeding stack usage at this point, mostly
> > > due to the call to alloca()? Do you think we should replace alloca() with
> > > regular malloc() in spawn.c (patch below)?
> >
> > Does this patch actually resolve your customers' problem?
>
> Unfortunately I still don't know. Customer is currently running with
> a workaround (increased stack limit from 8k to 32k) to avoid the
> problem. I did some basic tests with the patch here but decided
> to wait for your comments before submitting a test package.
>
> > What is the version in use and what additional patches have been applied?
>
> They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
> autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
> don't have the other alloca() replacements that went in after 5.0.4.
OK.
I had a bug report where the customer believed that the max open file
limit and stack size was a problem. It turned out that increasing them,
for some unknown reason, reduced the likelihood of the problem occurring,
but actually had nothing to do with the problem.
If automount crashes then you need to look at the gdb backtrace of the
running threads at the time of the crash with "thr a a bt" to get more
info. I don't know how you provide debug symbols for your packages but
you will need them if you want to make any sense at all of the backtrace.
Is your customer using direct mounts?
Is your customer using LDAP?
Have a look at the patches below and try and work out if they are
relevant to the code base you are working with:
autofs-5.0.4-fix-direct-map-cache-locking.patch
autofs-5.0.4-fix-dont-umount-existing-direct-mount-on-reread.patch
(of course this patch accompanies
autofs-5.0.4-dont-umount-existing-direct-mount-on-reread.patch)
autofs-5.0.4-fix-libxml2-non-thread-safe-calls.patch
There are also some other libxml2 patches, which took several tries to
get right, whose symptom is apparently random crashes:
autofs-5.0.4-fix-dumb-libxml2-check.patch
autofs-5.0.4-libxml2-workaround-fix.patch
autofs-5.0.4-library-reload-fix-update-fix-2.patch
autofs-5.0.4-library-reload-fix-update.patch
autofs-5.0.4-library-reload-fix-update-fix.patch
Not sure about the order of these and what their dependencies are.
I think all the patches have reasonably good descriptions.
Of course most of this stuff isn't relevant if LDAP isn't being used.
Ian
* Re: Automount daemon getting killed by SIGBUS
2010-02-26 2:30 ` Ian Kent
@ 2010-03-18 0:07 ` Leonardo Chiquitto
0 siblings, 0 replies; 5+ messages in thread
From: Leonardo Chiquitto @ 2010-03-18 0:07 UTC
To: Ian Kent; +Cc: autofs
> > > What is the version in use and what additional patches have been applied?
> >
> > They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
> > autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
> > don't have the other alloca() replacements that went in after 5.0.4.
>
> OK.
>
> I had a bug report where the customer believed that the max open file
> limit and stack size was a problem. It turned out that increasing them,
> for some unknown reason, reduced the likelihood of the problem occurring,
> but actually had nothing to do with the problem.
Increasing the stack size definitely helped here too. Customer is not
seeing the problem anymore and now that we have a workaround, it's
more complicated to keep asking for more tests. I spent a lot of time
trying to reproduce the problem in house to make testing easier, but
even with a very similar setup (LDAP plus thousands of mount points)
I was not able to make it crash.
> If automount crashes then you need to look at the gdb backtrace of the
> running threads at the time of the crash with "thr a a bt" to get more
> info. I don't know how you provide debug symbols for your packages but
> you will need them if you want to make any sense at all of the backtrace.
All threads look all right, except for thread 1 that apparently has a
corrupted stack (and hence caused the SIGBUS):
(gdb) thr a a bt
Thread 7 (Thread 3577):
#0 0x00002b39dd901a48 in do_sigwait () from /lib64/libpthread.so.0
#1 0x00002b39dd901aed in sigwait () from /lib64/libpthread.so.0
#2 0x000055555555d6aa in statemachine (arg=<value optimized out>)
at automount.c:1382
#3 main (arg=<value optimized out>) at automount.c:2105
Thread 6 (Thread 3578):
#0 0x00002b39dd8fe517 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x0000555555571802 in alarm_handler (arg=<value optimized out>)
at alarm.c:203
#2 0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6
Thread 5 (Thread 3579):
#0 0x00002b39dd8fe517 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x000055555556b72d in st_queue_handler (arg=<value optimized out>)
at state.c:1022
#2 0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#3 0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6
Thread 4 (Thread 3582):
#0 0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1 0x000055555555f2f7 in get_pkt (pkt=<value optimized out>,
ap=<value optimized out>) at automount.c:925
#2 handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1082
#3 handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1581
#4 0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5 0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6
Thread 3 (Thread 3585):
#0 0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1 0x000055555555f2f7 in get_pkt (pkt=<value optimized out>,
ap=<value optimized out>) at automount.c:925
#2 handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1082
#3 handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1581
#4 0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5 0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6
Thread 2 (Thread 3586):
#0 0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1 0x000055555555f2f7 in get_pkt (pkt=<value optimized out>,
ap=<value optimized out>) at automount.c:925
#2 handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1082
#3 handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
at automount.c:1581
#4 0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5 0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6
Thread 1 (Thread 11657):
Cannot access memory at address 0x800040623598
(gdb) thr 1
[Switching to thread 1 (Thread 11657)]#0 0x0000555555566bd0 in
spawn_mount (
logopt=Cannot access memory at address 0x80004062242c
) at spawn.c:412
412 }
(gdb) info registers
rax 0x0 0
rbx 0x406223f0 1080173552
rcx 0x1 1
rdx 0x0 0
rsi 0x0 0
rdi 0x1 1
rbp 0x800040623590 0x800040623590
rsp 0x800040623568 0x800040623568
r8 0x1 1
r9 0x2d89 11657
r10 0x8 8
r11 0x246 582
r12 0x0 0
r13 0x0 0
r14 0x2 2
r15 0x406223b0 1080173488
rip 0x555555566bd0 0x555555566bd0 <spawn_mount+832>
eflags 0x10287 [ CF PF SF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x63 99
gs 0x0 0
fctrl 0x37f 895
fstat 0x0 0
ftag 0xffff 65535
fiseg 0x0 0
fioff 0x0 0
foseg 0x0 0
fooff 0x0 0
fop 0x0 0
mxcsr 0x1f80 [ IM DM ZM OM UM PM ]
> Is your customer using direct mounts?
Yes, lots of direct mounts (more than 9000).
> Is your customer using LDAP?
Yes, all maps are retrieved from LDAP.
> Have a look at the patches below and try and work out if they are
> relevant to the code base you are working with:
Thanks a lot for the useful comments and for listing the patches. I'll
try to merge them in our package.
Kind regards,
Leonardo
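A small sketch that checks the rsp and rbp values from the register dump above against the x86-64 canonical-address rule, assuming 48-bit virtual addressing. The helper names is_canonical48() and clear_bit47() are hypothetical. This is only an observation about the dumped values, not a diagnosis of the crash.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical under 48-bit x86-64 virtual addressing:
 * bits 63..47 must be either all zeros or all ones. */
bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 most significant bits */
    return top == 0 || top == 0x1FFFF;
}

/* Return addr with bit 47 cleared, for comparing a suspect pointer
 * against nearby registers that look like valid stack addresses. */
uint64_t clear_bit47(uint64_t addr)
{
    return addr & ~(UINT64_C(1) << 47);
}
```

With the values from the dump, is_canonical48(0x800040623568) is false for rsp, and the same holds for rbp (0x800040623590). clear_bit47() maps rsp to 0x40623568, which lies 0x1178 bytes above rbx (0x406223f0). That fits the reading that thread 1's stack pointer is corrupted, but it says nothing about what corrupted it.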