* Re: bug report: opensm 3.3.15 crash (with traces)
From: Florent Parent @ 2014-04-02 21:43 UTC
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
We experienced constant crashes of opensm 3.3.15 (3.3.15-1.el6.cq5)
after a recent upgrade. We compiled and installed 3.3.17 and the problem
went away.
OpenSM server: CentOS 6.5 w/ stock RDMA. OpenSM 3.3.15 was from the
CentOS repository.
A behaviour that may help diagnose this: an unusually large number of
messages were filling up the opensm.log file:
Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
80 of these messages occur periodically. smpquery on the paths shows
that these all point to the Sun QNEM switches (80 I4 chips).
"use_mfttop FALSE" eliminated these messages.
Florent
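A minimal sketch of how the ERR 3111 entries described above could be tallied per direct-route path, so that the distinct paths can then be checked with smpquery. It assumes each entry carries "error status = 0x.." and "Initial path: ..." on the same or following lines, as in the excerpt; the function name count_err_3111 and the sample text are illustrative only and are not part of OpenSM.

```python
import re
from collections import Counter

# Assumption: an ERR 3111 entry is followed (same or later lines) by the
# status and the "Initial path" field, as in the excerpt quoted in this thread.
ERR_3111 = re.compile(
    r"ERR 3111:\s*Received MAD with error status = (0x[0-9A-Fa-f]+)"
    r".*?Initial path: ([\d,]+)",
    re.DOTALL,
)


def count_err_3111(log_text):
    """Return a Counter keyed by (status, initial_path) for ERR 3111 entries."""
    return Counter(ERR_3111.findall(log_text))


# Sample taken from the log excerpt in this message.
sample = """Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
"""

counts = count_err_3111(sample)
for (status, path), n in counts.most_common():
    print(f"{n:5d}  status={status}  initial_path={path}")
```

On a real log, the text would be read from the opensm.log file instead of the inline sample; the number of distinct paths should then match the number of affected switch chips.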
*** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
linked list corrupted: 0x00007f9b3c4352a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x7f9b56279166]
/lib64/libc.so.6(+0x79f1f)[0x7f9b5627cf1f]
/lib64/libc.so.6(__libc_malloc+0x71)[0x7f9b5627d991]
/usr/sbin/opensm[0x4216f3]
/usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
/usr/sbin/opensm[0x446efb]
/usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
/usr/sbin/opensm[0x4422bb]
/usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f9b56ddb5fe]
/lib64/libpthread.so.0(+0x79d1)[0x7f9b5659e9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f9b562ebb6d]
*** glibc detected *** /usr/sbin/opensm: double free or corruption
(out): 0x00007fe2f42e1830 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x7fe30ec9d166]
/lib64/libc.so.6(+0x78c93)[0x7fe30ec9fc93]
/usr/sbin/opensm[0x449cf6]
/usr/sbin/opensm(osm_subn_rescan_conf_files+0x194)[0x44af14]
/usr/sbin/opensm[0x447260]
/usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
/usr/sbin/opensm[0x4422bb]
/usr/lib64/libosmcomp.so.3(+0x85fe)[0x7fe30f7ff5fe]
/lib64/libpthread.so.0(+0x79d1)[0x7fe30efc29d1]
/lib64/libc.so.6(clone+0x6d)[0x7fe30ed0fb6d]
*** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
linked list corrupted: 0x00007f200838ede0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x7f2025131166]
/lib64/libc.so.6(+0x79f1f)[0x7f2025134f1f]
/lib64/libc.so.6(__libc_malloc+0x71)[0x7f2025135991]
/usr/sbin/opensm[0x4216f3]
/usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
/usr/sbin/opensm[0x446efb]
/usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
/usr/sbin/opensm[0x4422bb]
/usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f2025c935fe]
/lib64/libpthread.so.0(+0x79d1)[0x7f20254569d1]
/lib64/libc.so.6(clone+0x6d)[0x7f20251a3b6d]
*** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
linked list corrupted: 0x00007f8464013df0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x76166)[0x7f847ec95166]
/lib64/libc.so.6(+0x79f1f)[0x7f847ec98f1f]
/lib64/libc.so.6(__libc_malloc+0x71)[0x7f847ec99991]
/usr/sbin/opensm[0x4216f3]
/usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
/usr/sbin/opensm[0x446efb]
/usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
/usr/sbin/opensm[0x4422bb]
/usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f847f7f75fe]
/lib64/libpthread.so.0(+0x79d1)[0x7f847efba9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f847ed07b6d]
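The four traces above fall into two groups: three through osm_pkey_mgr_process and one through osm_subn_rescan_conf_files. A rough sketch of how such glibc reports could be grouped automatically by error message and named frames follows. It assumes the report layout shown above (a "*** glibc detected ***" header followed by one frame per line); the helper name crash_signatures is hypothetical. Frames without a symbol name are skipped; resolving them would need a tool such as addr2line and matching debug symbols.

```python
import re
from collections import Counter

# One backtrace frame: module, optional "(symbol+0xoff)", then "[0xaddr]".
FRAME = re.compile(
    r"^(\S+?)(?:\((\w*)\+0x[0-9a-fA-F]+\))?\[0x[0-9a-fA-F]+\]$",
    re.MULTILINE,
)


def crash_signatures(text):
    """Return a list of (error_message, named_symbols) per glibc report."""
    sigs = []
    for block in text.split("*** glibc detected ***")[1:]:
        header, _, rest = block.partition("***")
        message = " ".join(header.split())  # undo mail line wrapping
        m = re.match(r"\S+: (.*): 0x[0-9a-fA-F]+$", message)
        error = m.group(1) if m else message
        symbols = tuple(sym for _mod, sym in FRAME.findall(rest) if sym)
        sigs.append((error, symbols))
    return sigs


# Shortened sample built from two of the traces in this message.
sample = """*** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
linked list corrupted: 0x00007f9b3c4352a0 ***
======= Backtrace: =========
/lib64/libc.so.6(__libc_malloc+0x71)[0x7f9b5627d991]
/usr/sbin/opensm[0x4216f3]
/usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
*** glibc detected *** /usr/sbin/opensm: double free or corruption
(out): 0x00007fe2f42e1830 ***
======= Backtrace: =========
/usr/sbin/opensm[0x449cf6]
/usr/sbin/opensm(osm_subn_rescan_conf_files+0x194)[0x44af14]
"""

for (error, symbols), n in Counter(crash_signatures(sample)).items():
    print(n, error, "via", " <- ".join(symbols))
```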
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: bug report: opensm 3.3.15 crash (with traces)
From: Hal Rosenstock @ 2014-04-04 16:56 UTC
To: Florent Parent; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Florent,
On 4/2/2014 5:43 PM, Florent Parent wrote:
> Hi,
>
> We experienced constant crashes of opensm 3.3.15 (3.3.15-1.el6.cq5)
> after a recent upgrade. We compiled and installed 3.3.17 and the problem
> went away.
>
> OpenSM server: CentOS 6.5 w/ stock RDMA. OpenSM 3.3.15 was from the
> CentOS repository.
>
> A behaviour that may help diagnose this: an unusually large number of
> messages were filling up the opensm.log file:
>
> Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
> Received MAD with error status = 0x1C
> SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
> Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
>
> 80 of these messages occur periodically. smpquery on the paths shows
> that these all point to the Sun QNEM switches (80 I4 chips).
> "use_mfttop FALSE" eliminated these messages.
Yes, this is caused by bad firmware. The best fix is to upgrade the
firmware on the devices indicated by the DR paths. There's also the
workaround on the OpenSM side that you are using.
This is orthogonal to the crashes below.
> Florent
>
>
> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
> linked list corrupted: 0x00007f9b3c4352a0 ***
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x76166)[0x7f9b56279166]
> /lib64/libc.so.6(+0x79f1f)[0x7f9b5627cf1f]
> /lib64/libc.so.6(__libc_malloc+0x71)[0x7f9b5627d991]
> /usr/sbin/opensm[0x4216f3]
> /usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
> /usr/sbin/opensm[0x446efb]
> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
> /usr/sbin/opensm[0x4422bb]
> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f9b56ddb5fe]
> /lib64/libpthread.so.0(+0x79d1)[0x7f9b5659e9d1]
> /lib64/libc.so.6(clone+0x6d)[0x7f9b562ebb6d]
>
> *** glibc detected *** /usr/sbin/opensm: double free or corruption
> (out): 0x00007fe2f42e1830 ***
Are you using partitions? Any idea on the scenario here?
I can isolate the patch (beyond 3.3.15) that fixes this if needed.
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x76166)[0x7fe30ec9d166]
> /lib64/libc.so.6(+0x78c93)[0x7fe30ec9fc93]
> /usr/sbin/opensm[0x449cf6]
> /usr/sbin/opensm(osm_subn_rescan_conf_files+0x194)[0x44af14]
> /usr/sbin/opensm[0x447260]
> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
> /usr/sbin/opensm[0x4422bb]
> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7fe30f7ff5fe]
> /lib64/libpthread.so.0(+0x79d1)[0x7fe30efc29d1]
> /lib64/libc.so.6(clone+0x6d)[0x7fe30ed0fb6d]
>
> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
> linked list corrupted: 0x00007f200838ede0 ***
This is one I'm unfamiliar with and will need to investigate further.
Did this one also go away with 3.3.17 ?
Thanks.
-- Hal
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x76166)[0x7f2025131166]
> /lib64/libc.so.6(+0x79f1f)[0x7f2025134f1f]
> /lib64/libc.so.6(__libc_malloc+0x71)[0x7f2025135991]
> /usr/sbin/opensm[0x4216f3]
> /usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
> /usr/sbin/opensm[0x446efb]
> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
> /usr/sbin/opensm[0x4422bb]
> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f2025c935fe]
> /lib64/libpthread.so.0(+0x79d1)[0x7f20254569d1]
> /lib64/libc.so.6(clone+0x6d)[0x7f20251a3b6d]
>
>
> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
> linked list corrupted: 0x00007f8464013df0 ***
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x76166)[0x7f847ec95166]
> /lib64/libc.so.6(+0x79f1f)[0x7f847ec98f1f]
> /lib64/libc.so.6(__libc_malloc+0x71)[0x7f847ec99991]
> /usr/sbin/opensm[0x4216f3]
> /usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
> /usr/sbin/opensm[0x446efb]
> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
> /usr/sbin/opensm[0x4422bb]
> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f847f7f75fe]
> /lib64/libpthread.so.0(+0x79d1)[0x7f847efba9d1]
> /lib64/libc.so.6(clone+0x6d)[0x7f847ed07b6d]
* Re: bug report: opensm 3.3.15 crash (with traces)
From: Florent Parent @ 2014-04-04 23:21 UTC
To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, Apr 4, 2014 at 12:56 PM, Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> Hi Florent,
>
> On 4/2/2014 5:43 PM, Florent Parent wrote:
>> Hi,
>>
>> We experienced constant crashes of opensm 3.3.15 (3.3.15-1.el6.cq5)
>> after a recent upgrade. We compiled and installed 3.3.17 and the problem
>> went away.
>>
>> OpenSM server: CentOS 6.5 w/ stock RDMA. OpenSM 3.3.15 was from the
>> CentOS repository.
>>
>> A behaviour that may help diagnose this: an unusually large number of
>> messages were filling up the opensm.log file:
>>
>> Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
>> Received MAD with error status = 0x1C
>> SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
>> Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
>>
>> 80 of these messages occur periodically. smpquery on the paths shows
>> that these all point to the Sun QNEM switches (80 I4 chips).
>> "use_mfttop FALSE" eliminated these messages.
>
> Yes, this is caused by bad firmware. The best fix is to upgrade the
> firmware on the devices indicated by the DR paths. There's also the
> workaround on the OpenSM side that you are using.
>
> This is orthogonal to the crashes below.
ok
>
>> Florent
>>
>>
>> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
>> linked list corrupted: 0x00007f9b3c4352a0 ***
>> ======= Backtrace: =========
>> /lib64/libc.so.6(+0x76166)[0x7f9b56279166]
>> /lib64/libc.so.6(+0x79f1f)[0x7f9b5627cf1f]
>> /lib64/libc.so.6(__libc_malloc+0x71)[0x7f9b5627d991]
>> /usr/sbin/opensm[0x4216f3]
>> /usr/sbin/opensm(osm_pkey_mgr_process+0x467)[0x422187]
>> /usr/sbin/opensm[0x446efb]
>> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
>> /usr/sbin/opensm[0x4422bb]
>> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7f9b56ddb5fe]
>> /lib64/libpthread.so.0(+0x79d1)[0x7f9b5659e9d1]
>> /lib64/libc.so.6(clone+0x6d)[0x7f9b562ebb6d]
>>
>> *** glibc detected *** /usr/sbin/opensm: double free or corruption
>> (out): 0x00007fe2f42e1830 ***
>
> Are you using partitions ? Any idea on the scenario here ?
>
> I can isolate the patch (beyond 3.3.15) that fixes this if needed.
No partitions. We installed 3.3.15 during a maintenance window. The crashes
started to occur only when the scheduler started dispatching jobs.
Since we're not seeing any issues so far with 3.3.17, this patch is
not required for us. I just thought it was good practice to report any
crash.
>
>> ======= Backtrace: =========
>> /lib64/libc.so.6(+0x76166)[0x7fe30ec9d166]
>> /lib64/libc.so.6(+0x78c93)[0x7fe30ec9fc93]
>> /usr/sbin/opensm[0x449cf6]
>> /usr/sbin/opensm(osm_subn_rescan_conf_files+0x194)[0x44af14]
>> /usr/sbin/opensm[0x447260]
>> /usr/sbin/opensm(osm_state_mgr_process+0x1f8)[0x448538]
>> /usr/sbin/opensm[0x4422bb]
>> /usr/lib64/libosmcomp.so.3(+0x85fe)[0x7fe30f7ff5fe]
>> /lib64/libpthread.so.0(+0x79d1)[0x7fe30efc29d1]
>> /lib64/libc.so.6(clone+0x6d)[0x7fe30ed0fb6d]
>>
>> *** glibc detected *** /usr/sbin/opensm: malloc(): smallbin double
>> linked list corrupted: 0x00007f200838ede0 ***
>
> This is one I'm unfamiliar with and will need to investigate further.
> Did this one also go away with 3.3.17 ?
Yes it did.
Thanks
Florent