* [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
@ 2011-11-22  1:51 Lukas Razik
       [not found] ` <1321926674.44951.YahooMailNeo-Igr0H0yBZInyX4RqAA4FmIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Lukas Razik @ 2011-11-22  1:51 UTC (permalink / raw)
  To: ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org,
	devel-ygRj4skf0tpg9hUCZPvPmw@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hello everybody!

I have Sun T5120 (SPARC64) servers with
- Debian: 6.0.3
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2
- InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
  with newest FW (2.9.1)
and the following issue:

If I try to mpirun a program like the osu_latency benchmark:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 --mca btl_openib_verbose 1 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency

then I get these errors:
<snip>
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
[cluster1:64027] *** Process received signal ***
[cluster1:64027] Signal: Bus error (10)
[cluster1:64027] Signal code: Invalid address alignment (1)
[cluster1:64027] Failing at address: 0xaa9053
[cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
[cluster1:64027] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
[cluster1:64027] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
[cluster1:64027] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
[cluster1:64027] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
[cluster1:64027] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster1:64027] *** End of error message ***
[cluster2:02759] *** Process received signal ***
[cluster2:02759] Signal: Bus error (10)
[cluster2:02759] Signal code: Invalid address alignment (1)
[cluster2:02759] Failing at address: 0xaa9053
[cluster2:02759] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
[cluster2:02759] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
[cluster2:02759] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
[cluster2:02759] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
[cluster2:02759] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
[cluster2:02759] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster2:02759] *** End of error message ***
---

The whole output can be found here:
http://net.razik.de/linux/T5120/openmpi-1.4.3-verbose.txt

That's my 'ompi_info --param all all' output:
http://net.razik.de/linux/T5120/openmpi-1.4.3_param_all_all.txt

Same error with OFED-1.5.4-rc4 and also the same with openmpi-1.4.4.

If I disable openib, then I get the right results:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --mca btl ^openib -np 2 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                       143.53
1                       140.50
<snip>
---

ibverbs seems to work:
$ ibv_srq_pingpong -n 1000000 cluster2
<snip>
8192000000 bytes in 4.15 seconds = 15806.63 Mbit/sec
1000000 iters in 4.15 seconds = 4.15 usec/iter
---

These are the installed OFED packages:
kernel-ib
ofed-scripts
libibverbs
libibverbs-devel
libibverbs-utils
libmlx4
libmlx4-devel
libibumad
libibumad-devel
libibmad
libibmad-devel
librdmacm
librdmacm-utils
librdmacm-devel
opensm-libs
ibutils
infiniband-diags
qperf
ofed-docs
mpi-selector
openmpi_gcc
mpitests_openmpi_gcc
---

I don't know which mailing list is the right one and I'm very thankful for any help!
If you have questions, please ask!

Best regards,
Lukas


The archives of the lists I've sent this email to:
http://lists.openfabrics.org/pipermail/ewg/2011-November/thread.html
http://www.open-mpi.org/community/lists/devel/2011/11/date.php
http://thread.gmane.org/gmane.linux.drivers.rdma/
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
       [not found] ` <1321926674.44951.YahooMailNeo-Igr0H0yBZInyX4RqAA4FmIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
@ 2011-11-22 18:44   ` Roland Dreier
       [not found]     ` <CAL1RGDVsj_cZ7qJZAmjAGAPd2a3wvs3-7jELtTN6Z1Vn88Z9OQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-12-06  5:36   ` [OMPI devel] " Eugene Loh
  1 sibling, 1 reply; 4+ messages in thread
From: Roland Dreier @ 2011-11-22 18:44 UTC (permalink / raw)
  To: Lukas Razik
  Cc: ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org,
	devel-ygRj4skf0tpg9hUCZPvPmw@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Nov 21, 2011 at 5:51 PM, Lukas Razik <linux-GLgANOly0l1BDLzU/O5InQ@public.gmane.org> wrote:
> [cluster1:64027] Signal code: Invalid address alignment (1)
> [cluster1:64027] Failing at address: 0xaa9053
> [cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]

Seems like openmpi is doing a misaligned access somewhere...

not sure how to turn this into a real location in the code, Open MPI guys??
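
A small sketch of why the reported address points at a misaligned access. The helper name is_aligned is hypothetical and not part of Open MPI; it only checks whether an address is a multiple of a power-of-two alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
 * alignment must be a non-zero power of two (1, 2, 4, 8, ...). */
static int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, the low bits of addr must all be zero. */
    return (addr & (alignment - 1)) == 0;
}
```

For the address in the trace, is_aligned(0xaa9053, 2) is 0: the address is odd, so any access wider than one byte at that address is misaligned, which SPARC reports as a bus error instead of handling it in hardware.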

 - R.

* Re: [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
       [not found]     ` <CAL1RGDVsj_cZ7qJZAmjAGAPd2a3wvs3-7jELtTN6Z1Vn88Z9OQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-11-22 19:25       ` Lukas Razik
  0 siblings, 0 replies; 4+ messages in thread
From: Lukas Razik @ 2011-11-22 19:25 UTC (permalink / raw)
  To: Roland Dreier
  Cc: ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org,
	devel-ygRj4skf0tpg9hUCZPvPmw@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, TERRY DONTJE

Roland Dreier <roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org> wrote:
>
> On Mon, Nov 21, 2011 at 5:51 PM, Lukas Razik <linux-GLgANOly0l1BDLzU/O5InQ@public.gmane.org> wrote:
>> [cluster1:64027] Signal code: Invalid address alignment (1)
>> [cluster1:64027] Failing at address: 0xaa9053
>> [cluster1:64027] [ 0]
> /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0)
> [0xfffff8010209e2f0]
>
> Seems like openmpi is doing a misaligned access somewhere...
>
> not sure how to turn this into a real location in the code, Open MPI guys??

Hello Roland,

one guy (Terry D. Dontje) already answered in the devel-ygRj4skf0tpg9hUCZPvPmw@public.gmane.org mailing list:
http://www.open-mpi.org/community/lists/devel/2011/11/10011.php

As I understand him, he thinks the same. I'm now trying what he suggested and will reply soon.
Thanks for your assessment!

Best regards,
Lukas

* Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
       [not found] ` <1321926674.44951.YahooMailNeo-Igr0H0yBZInyX4RqAA4FmIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
  2011-11-22 18:44   ` Roland Dreier
@ 2011-12-06  5:36   ` Eugene Loh
  1 sibling, 0 replies; 4+ messages in thread
From: Eugene Loh @ 2011-12-06  5:36 UTC (permalink / raw)
  To: Lukas Razik, Open MPI Developers
  Cc: ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 11/21/11 20:51, Lukas Razik wrote:
> Hello everybody!
>
> I've Sun T5120 (SPARC64) Servers with
> - Debian: 6.0.3
> - linux-2.6.39.4 (from kernel.org)
> - OFED-1.5.3.2
> - InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
>    with newest FW (2.9.1)
> and the following issue:
>
> If I try to mpirun a program like the osu_latency benchmark:
> $ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 --mca btl_openib_verbose 1 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
>
> then I get these errors:
> <snip>
> # OSU MPI Latency Test v3.1.1
> # Size            Latency (us)
> [cluster1:64027] *** Process received signal ***
> [cluster1:64027] Signal: Bus error (10)
> [cluster1:64027] Signal code: Invalid address alignment (1)
> [cluster1:64027] Failing at address: 0xaa9053
> [cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
> [cluster1:64027] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
> [cluster1:64027] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
> [cluster1:64027] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
> [cluster1:64027] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
> [cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
> [cluster1:64027] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
> [cluster1:64027] *** End of error message ***
> [cluster2:02759] *** Process received signal ***
> [cluster2:02759] Signal: Bus error (10)
> [cluster2:02759] Signal code: Invalid address alignment (1)
> [cluster2:02759] Failing at address: 0xaa9053
> [cluster2:02759] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
> [cluster2:02759] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
> [cluster2:02759] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
> [cluster2:02759] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
> [cluster2:02759] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
> [cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
> [cluster2:02759] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
> [cluster2:02759] *** End of error message ***
There does indeed seem to be a set of problems here with accessing
non-aligned words.

*IF* you were to use Oracle Solaris Studio compilers, you could use
-xmemalign=8i as Terry suggested, and it appears that this eliminates
these errors, albeit potentially with a loss of performance.

Your e-mail thread identified a problem with misalignment in

551         hdr->hdr_match.hdr_ctx = sendreq->req_send.req_base.req_comm->c_contextid;
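
A sketch of the general portable technique for a store like this one when the destination may be misaligned: copy the bytes with memcpy instead of dereferencing a typed pointer. This is not what Open MPI does at that line, the helper names are hypothetical, and the 16-bit width is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 16-bit value at dst, which may have any alignment.
 * memcpy moves the bytes without a word-sized access at dst. */
static void store_u16_unaligned(void *dst, uint16_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Load a 16-bit value from src, which may have any alignment. */
static uint16_t load_u16_unaligned(const void *src)
{
    uint16_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

Writing through such a helper at an odd offset into a byte buffer works on strict-alignment machines as well, at the cost of a byte-wise copy where the compiler cannot prove alignment.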

It appears one can get past this problem by configuring OMPI with --enable-openib-control-hdr-padding.  This turns on OMPI_OPENIB_PAD_HDR and gives you padding/alignment in ompi/mca/btl/openib/btl_openib_frag.h here:

struct mca_btl_openib_control_header_t {
     uint8_t  type;
#if OMPI_OPENIB_PAD_HDR
     uint8_t  padding[15];
#endif
};
typedef struct mca_btl_openib_control_header_t mca_btl_openib_control_header_t;

struct mca_btl_openib_eager_rdma_header_t {
     mca_btl_openib_control_header_t control;
     uint8_t padding[3];
     uint32_t rkey;
     ompi_ptr_t rdma_start;
};
typedef struct mca_btl_openib_eager_rdma_header_t mca_btl_openib_eager_rdma_header_t;

But then perhaps the padding in mca_btl_openib_eager_rdma_header_t needs to be adjusted.  I don't yet know.
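
One way to examine the padding question is to ask the compiler where rkey lands in each configuration. The sketch below copies the two layouts under hypothetical names; sketch_ptr_t is only a stand-in for ompi_ptr_t, assumed here to be an 8-byte value, and the helper functions are not part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ompi_ptr_t (assumption: 8 bytes wide). */
typedef union { uint64_t lval; void *pval; } sketch_ptr_t;

/* Control header as built without OMPI_OPENIB_PAD_HDR. */
struct ctl_unpadded { uint8_t type; };

/* Control header as built with OMPI_OPENIB_PAD_HDR. */
struct ctl_padded { uint8_t type; uint8_t padding[15]; };

struct eager_unpadded {
    struct ctl_unpadded control;
    uint8_t padding[3];
    uint32_t rkey;
    sketch_ptr_t rdma_start;
};

struct eager_padded {
    struct ctl_padded control;
    uint8_t padding[3];
    uint32_t rkey;
    sketch_ptr_t rdma_start;
};

/* Offset of rkey when the control header is a single byte. */
static size_t rkey_offset_unpadded(void)
{
    return offsetof(struct eager_unpadded, rkey);
}

/* Offset of rkey when the control header is padded to 16 bytes. */
static size_t rkey_offset_padded(void)
{
    size_t off = offsetof(struct eager_padded, rkey);
    /* The compiler keeps the field aligned relative to the struct start. */
    assert(off % sizeof(uint32_t) == 0);
    return off;
}
```

On common ABIs where uint32_t is 4-byte aligned, the offsets are 4 and 20: in the padded build the explicit padding[3] ends at byte 18 and the compiler adds one hidden byte before rkey. The fields are therefore aligned relative to the start of the struct in both builds, so a fault can still occur if the header itself is placed at an unaligned address in a buffer. Whether that accounts for the remaining failures is not established in this thread.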

This helps (more tests pass), but in many cases it just delays problems until a later point.

All of this is, I suppose, to say:
1)  Yes, there is a problem with misaligned words in the openib BTL.
2)  We are interested in and looking at the problem.
3)  No promises of outcome.
