* Re: Automount daemon getting killed by SIGBUS
       [not found] <20100222194830.GA11730@libre.l.ngdn.org>
@ 2010-02-23  2:46 ` Ian Kent
  2010-02-25  3:15 ` Ian Kent
  1 sibling, 0 replies; 5+ messages in thread
From: Ian Kent @ 2010-02-23  2:46 UTC (permalink / raw)
  To: Leonardo Chiquitto; +Cc: autofs

On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> Hello,
> 
> We have a user reporting periodic crashes in automount. The daemon gets
> killed by SIGBUS when returning from spawn_mount():
> 
> Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> Program terminated with signal 7, Bus error.
> #0  0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> address 0x80004062242c
> ) at spawn.c:412
> 412	}
> 
> 0x0000555555566bcd <spawn_mount+829>:	mov    %r12d,%eax
> 0x0000555555566bd0 <spawn_mount+832>:	pop    %rbx
> 0x0000555555566bd1 <spawn_mount+833>:	pop    %r12
> 0x0000555555566bd3 <spawn_mount+835>:	pop    %r13
> 0x0000555555566bd5 <spawn_mount+837>:	pop    %r14
> 0x0000555555566bd7 <spawn_mount+839>:	pop    %r15
> 0x0000555555566bd9 <spawn_mount+841>:	leaveq 
> 0x0000555555566bda <spawn_mount+842>:	retq   
> 
> Is it possible that we're exceeding stack usage at this point, mostly
> due to the call to alloca()? Do you think we should replace alloca() with
> regular malloc() in spawn.c (patch below)?

There were some changes to reduce the usage of alloca() contributed by
Val Henson some time ago but they didn't get all of them by any means.
Val pointed out the use of alloca() was bad so replacing them with
malloc() is a good idea whether this is a stack overflow or not. I'll
have a look at the patch and merge it.
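
For illustration only, since the patch itself is not quoted in this
thread: a minimal sketch of the general alloca()-to-malloc() conversion.
The helper name join_heap() and its arguments are hypothetical and not
taken from spawn.c. The point is that a heap allocation can fail cleanly
with NULL, whereas an oversized alloca() simply runs off the end of the
thread stack.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from spawn.c: build "<prefix><name>" in a
 * heap buffer instead of an alloca() buffer. Returns NULL on allocation
 * failure so the caller can report an error. The caller must free() the
 * result, which is the extra bookkeeping alloca() used to hide.
 */
char *join_heap(const char *prefix, const char *name)
{
    size_t plen = strlen(prefix);
    size_t nlen = strlen(name);
    char *buf = malloc(plen + nlen + 1);

    if (!buf)
        return NULL;

    memcpy(buf, prefix, plen);
    memcpy(buf + plen, name, nlen + 1);    /* copies the trailing NUL too */
    return buf;
}
```

Every exit path of the calling function then needs a matching free(),
which is usually the bulk of the work in such a conversion.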

Not sure if the source used here even has those patches since we don't
know what source it is.

Ian

* Re: Automount daemon getting killed by SIGBUS
       [not found] <20100222194830.GA11730@libre.l.ngdn.org>
  2010-02-23  2:46 ` Automount daemon getting killed by SIGBUS Ian Kent
@ 2010-02-25  3:15 ` Ian Kent
  2010-02-25 11:15   ` Leonardo Chiquitto
  1 sibling, 1 reply; 5+ messages in thread
From: Ian Kent @ 2010-02-25  3:15 UTC (permalink / raw)
  To: Leonardo Chiquitto; +Cc: autofs

On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> Hello,
> 
> We have a user reporting periodic crashes in automount. The daemon gets
> killed by SIGBUS when returning from spawn_mount():
> 
> Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> Program terminated with signal 7, Bus error.
> #0  0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> address 0x80004062242c
> ) at spawn.c:412
> 412	}
> 
> 0x0000555555566bcd <spawn_mount+829>:	mov    %r12d,%eax
> 0x0000555555566bd0 <spawn_mount+832>:	pop    %rbx
> 0x0000555555566bd1 <spawn_mount+833>:	pop    %r12
> 0x0000555555566bd3 <spawn_mount+835>:	pop    %r13
> 0x0000555555566bd5 <spawn_mount+837>:	pop    %r14
> 0x0000555555566bd7 <spawn_mount+839>:	pop    %r15
> 0x0000555555566bd9 <spawn_mount+841>:	leaveq 
> 0x0000555555566bda <spawn_mount+842>:	retq   
> 
> Is it possible that we're exceeding stack usage at this point, mostly
> due to the call to alloca()? Do you think we should replace alloca() with
> regular malloc() in spawn.c (patch below)?

Does this patch actually resolve your customers' problem?
What is the version in use and what additional patches have been applied?

Ian

* Re: Automount daemon getting killed by SIGBUS
  2010-02-25  3:15 ` Ian Kent
@ 2010-02-25 11:15   ` Leonardo Chiquitto
  2010-02-26  2:30     ` Ian Kent
  0 siblings, 1 reply; 5+ messages in thread
From: Leonardo Chiquitto @ 2010-02-25 11:15 UTC (permalink / raw)
  To: Ian Kent; +Cc: autofs

On Thu, Feb 25, 2010 at 11:15:31AM +0800, Ian Kent wrote:
> On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> > Hello,
> > 
> > We have a user reporting periodic crashes in automount. The daemon gets
> > killed by SIGBUS when returning from spawn_mount():
> > 
> > Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> > Program terminated with signal 7, Bus error.
> > #0  0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> > address 0x80004062242c
> > ) at spawn.c:412
> > 412	}
> > 
> > 0x0000555555566bcd <spawn_mount+829>:	mov    %r12d,%eax
> > 0x0000555555566bd0 <spawn_mount+832>:	pop    %rbx
> > 0x0000555555566bd1 <spawn_mount+833>:	pop    %r12
> > 0x0000555555566bd3 <spawn_mount+835>:	pop    %r13
> > 0x0000555555566bd5 <spawn_mount+837>:	pop    %r14
> > 0x0000555555566bd7 <spawn_mount+839>:	pop    %r15
> > 0x0000555555566bd9 <spawn_mount+841>:	leaveq 
> > 0x0000555555566bda <spawn_mount+842>:	retq   
> > 
> > Is it possible that we're exceeding stack usage at this point, mostly
> > due to the call to alloca()? Do you think we should replace alloca() with
> > regular malloc() in spawn.c (patch below)?
> 
> Does this patch actually resolve your customers' problem?

Unfortunately I still don't know. Customer is currently running with
a workaround (increased stack limit from 8k to 32k) to avoid the
problem. I did some basic tests with the patch here but decided
to wait for your comments before submitting a test package.
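
As background on why a stack limit change can affect a threaded daemon:
with glibc's NPTL, the default stack size of new threads is normally
taken from the soft RLIMIT_STACK in effect when the program started,
unless the program sets its own size through thread attributes. Whether
automount relies on that default is not stated in this thread. A minimal
sketch of both mechanisms follows; the function names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/resource.h>

/* Return the soft RLIMIT_STACK in bytes, or 0 if unknown or unlimited. */
size_t soft_stack_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (size_t) rl.rlim_cur;
}

/* Trivial thread body, present only so the wrapper below can be exercised. */
void *echo_arg(void *arg)
{
    return arg;
}

/*
 * Hypothetical wrapper: start a thread with an explicit stack size instead
 * of the default inherited from the process limits. Returns 0 on success
 * or the error number from the failing pthread_* call.
 */
int start_thread_with_stack(pthread_t *thid, size_t stack_bytes,
                            void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int ret;

    ret = pthread_attr_init(&attr);
    if (ret)
        return ret;

    ret = pthread_attr_setstacksize(&attr, stack_bytes);
    if (!ret)
        ret = pthread_create(thid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return ret;
}
```

Setting the size explicitly makes the daemon independent of whatever
limit the init script or shell happened to impose.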

> What is the version in use and what additional patches have been applied?

They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
don't have the other alloca() replacements that went in after 5.0.4.

Thanks!
Leonardo

* Re: Automount daemon getting killed by SIGBUS
  2010-02-25 11:15   ` Leonardo Chiquitto
@ 2010-02-26  2:30     ` Ian Kent
  2010-03-18  0:07       ` Leonardo Chiquitto
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Kent @ 2010-02-26  2:30 UTC (permalink / raw)
  To: Leonardo Chiquitto; +Cc: autofs

On Thu, 2010-02-25 at 08:15 -0300, Leonardo Chiquitto wrote:
> On Thu, Feb 25, 2010 at 11:15:31AM +0800, Ian Kent wrote:
> > On 02/23/2010 03:48 AM, Leonardo Chiquitto wrote:
> > > Hello,
> > > 
> > > We have a user reporting periodic crashes in automount. The daemon gets
> > > killed by SIGBUS when returning from spawn_mount():
> > > 
> > > Core was generated by `/usr/sbin/automount -p /var/run/automount.pid'.
> > > Program terminated with signal 7, Bus error.
> > > #0  0x0000555555566bd0 in spawn_mount (logopt=Cannot access memory at
> > > address 0x80004062242c
> > > ) at spawn.c:412
> > > 412	}
> > > 
> > > 0x0000555555566bcd <spawn_mount+829>:	mov    %r12d,%eax
> > > 0x0000555555566bd0 <spawn_mount+832>:	pop    %rbx
> > > 0x0000555555566bd1 <spawn_mount+833>:	pop    %r12
> > > 0x0000555555566bd3 <spawn_mount+835>:	pop    %r13
> > > 0x0000555555566bd5 <spawn_mount+837>:	pop    %r14
> > > 0x0000555555566bd7 <spawn_mount+839>:	pop    %r15
> > > 0x0000555555566bd9 <spawn_mount+841>:	leaveq 
> > > 0x0000555555566bda <spawn_mount+842>:	retq   
> > > 
> > > Is it possible that we're exceeding stack usage at this point, mostly
> > > due to the call to alloca()? Do you think we should replace alloca() with
> > > regular malloc() in spawn.c (patch below)?
> > 
> > Does this patch actually resolve your customers' problem?
> 
> Unfortunately I still don't know. Customer is currently running with
> a workaround (increased stack limit from 8k to 32k) to avoid the
> problem. I did some basic tests with the patch here but decided
> to wait for your comments before submitting a test package.
> 
> > What is the version in use and what additional patches have been applied?
> 
> They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
> autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
> don't have the other alloca() replacements that went in after 5.0.4.

OK.

I had a bug report where the customer believed that the max open file
limit and stack size were a problem. It turned out that increasing them,
for some unknown reason, reduced the likelihood of the problem occurring
but actually had nothing to do with the problem.

If automount crashes then you need to look at the gdb backtrace of the
running threads at the time of the crash with "thr a a bt" to get more
info. I don't know how you provide debug symbols for your packages but
you will need them if you want to make any sense at all of the backtrace.

Is your customer using direct mounts?
Is your customer using LDAP?

Have a look at the patches below and try and work out if they are
relevant to the code base you are working with:

autofs-5.0.4-fix-direct-map-cache-locking.patch
autofs-5.0.4-fix-dont-umount-existing-direct-mount-on-reread.patch
(of course this patch accompanies
autofs-5.0.4-dont-umount-existing-direct-mount-on-reread.patch)
autofs-5.0.4-fix-libxml2-non-thread-safe-calls.patch

There are also some other libxml2 patches, which took several tries to
get right, whose symptom is apparently random crashes:

autofs-5.0.4-fix-dumb-libxml2-check.patch
autofs-5.0.4-libxml2-workaround-fix.patch
autofs-5.0.4-library-reload-fix-update-fix-2.patch
autofs-5.0.4-library-reload-fix-update.patch
autofs-5.0.4-library-reload-fix-update-fix.patch

Not sure about the order of these and what their dependencies are.
I think all the patches have reasonably good descriptions.

Of course most of this stuff isn't relevant if LDAP isn't being used.

Ian

* Re: Automount daemon getting killed by SIGBUS
  2010-02-26  2:30     ` Ian Kent
@ 2010-03-18  0:07       ` Leonardo Chiquitto
  0 siblings, 0 replies; 5+ messages in thread
From: Leonardo Chiquitto @ 2010-03-18  0:07 UTC (permalink / raw)
  To: Ian Kent; +Cc: autofs

> > > What is the version in use and what additional patches have been applied?
> > 
> > They are running 5.0.3 plus all the patches in patch_order-5.0.3 and
> > autofs-5.0.4-fix_negative_cache_non-existent_key.patch, meaning that we
> > don't have the other alloca() replacements that went in after 5.0.4.
> 
> OK.
> 
> I had a bug report where the customer believed that the max open file
> limit and stack size were a problem. It turned out that increasing them,
> for some unknown reason, reduced the likelihood of the problem occurring
> but actually had nothing to do with the problem.

Increasing the stack size definitely helped here too. Customer is not
seeing the problem anymore and now that we have a workaround, it's
more complicated to keep asking for more tests. I spent a lot of time
trying to reproduce the problem in house to make testing easier, but
even with a very similar setup (LDAP plus thousands of mount points)
I was not able to make it crash.

> If automount crashes then you need to look at the gdb backtrace of the
> running threads at the time of the crash with "thr a a bt" to get more
> info. I don't know how you provide debug symbols for your packages but
> you will need them if you want to make any sense at all of the backtrace.

All threads look all right, except for thread 1, which apparently has a
corrupted stack (and hence caused the SIGBUS):

(gdb) thr a a bt
Thread 7 (Thread 3577):
#0  0x00002b39dd901a48 in do_sigwait () from /lib64/libpthread.so.0
#1  0x00002b39dd901aed in sigwait () from /lib64/libpthread.so.0
#2  0x000055555555d6aa in statemachine (arg=<value optimized out>)
    at automount.c:1382
#3  main (arg=<value optimized out>) at automount.c:2105

Thread 6 (Thread 3578):
#0  0x00002b39dd8fe517 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000555555571802 in alarm_handler (arg=<value optimized out>)
    at alarm.c:203
#2  0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6

Thread 5 (Thread 3579):
#0  0x00002b39dd8fe517 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x000055555556b72d in st_queue_handler (arg=<value optimized out>)
    at state.c:1022
#2  0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6

Thread 4 (Thread 3582):
#0  0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1  0x000055555555f2f7 in get_pkt (pkt=<value optimized out>, 
    ap=<value optimized out>) at automount.c:925
#2  handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1082
#3  handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1581
#4  0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6

Thread 3 (Thread 3585):
#0  0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1  0x000055555555f2f7 in get_pkt (pkt=<value optimized out>, 
    ap=<value optimized out>) at automount.c:925
#2  handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1082
#3  handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1581
#4  0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6

Thread 2 (Thread 3586):
#0  0x00002b39ddbc9b26 in poll () from /lib64/libc.so.6
#1  0x000055555555f2f7 in get_pkt (pkt=<value optimized out>, 
    ap=<value optimized out>) at automount.c:925
#2  handle_packet (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1082
#3  handle_mounts (pkt=<value optimized out>, ap=<value optimized out>)
    at automount.c:1581
#4  0x00002b39dd8fa193 in start_thread () from /lib64/libpthread.so.0
#5  0x00002b39ddbd1dfd in clone () from /lib64/libc.so.6

Thread 1 (Thread 11657):
Cannot access memory at address 0x800040623598

(gdb) thr 1
[Switching to thread 1 (Thread 11657)]#0  0x0000555555566bd0 in
spawn_mount (
    logopt=Cannot access memory at address 0x80004062242c
) at spawn.c:412
412	}

(gdb) info registers
rax            0x0	0
rbx            0x406223f0	1080173552
rcx            0x1	1
rdx            0x0	0
rsi            0x0	0
rdi            0x1	1
rbp            0x800040623590	0x800040623590
rsp            0x800040623568	0x800040623568
r8             0x1	1
r9             0x2d89	11657
r10            0x8	8
r11            0x246	582
r12            0x0	0
r13            0x0	0
r14            0x2	2
r15            0x406223b0	1080173488
rip            0x555555566bd0	0x555555566bd0 <spawn_mount+832>
eflags         0x10287	[ CF PF SF IF RF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x63	99
gs             0x0	0
fctrl          0x37f	895
fstat          0x0	0
ftag           0xffff	65535
fiseg          0x0	0
fioff          0x0	0
foseg          0x0	0
fooff          0x0	0
fop            0x0	0
mxcsr          0x1f80	[ IM DM ZM OM UM PM ]
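
One possible reading of the dump above, offered as an observation rather
than an established diagnosis: rsp and rbp both have bit 47 set, while
rbx and r15 hold values around 0x40622xxx. Assuming 48-bit virtual
addressing, an address with bit 47 set and bits 48..63 clear is
non-canonical, and as far as I know Linux on x86-64 reports a stack
access through a non-canonical rsp as SIGBUS rather than SIGSEGV. A
minimal sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumes 48-bit virtual addresses (4-level paging): an address is
 * canonical when bits 63..47 are all zeros or all ones.
 */
bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;    /* bits 63..47, 17 bits wide */

    return top == 0 || top == 0x1ffff;
}

/* Clear bit 47 only, to show what the value looks like without it. */
uint64_t clear_bit47(uint64_t addr)
{
    return addr & ~(UINT64_C(1) << 47);
}
```

With these, rsp (0x800040623568) and rbp (0x800040623590) both fail the
check, and clearing bit 47 of rsp gives 0x40623568, which lies in the
same range as rbx and r15. That points toward the stack pointer itself
being corrupted, although the dump alone does not show how.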

> Is your customer using direct mounts?

Yes, lots of direct mounts (more than 9000).

> Is your customer using LDAP?

Yes, all maps are retrieved from LDAP.

> Have a look at the patches below and try and work out if they are
> relevant to the code base you are working with:

Thanks a lot for the useful comments and for listing the patches. I'll
try to merge them in our package.

Kind regards,
Leonardo
