From: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
To: Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org>
Cc: linux-rdma <linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: rdma_cm segfaults on RoCE with ConnectX-4 [WAS: Re: rping segfault with 4.9.28 on CentOS 7.3]
Date: Thu, 18 May 2017 08:07:45 +0300
Message-ID: <20170518050745.GZ3616@mtr-leonro.local>
In-Reply-To: <CAANLjFp3PWJBiVabMGksarwg9BTM7Cg4mPrBsXY+mYh_zBbsgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, May 17, 2017 at 12:14:18PM -0600, Robert LeBlanc wrote:
> Since I have a ConnectX-3 card in this same box, I set it up as
> InfiniBand. I can run all the tests (udaddy, rping, ib_send_bw with -R
> or -z) using the InfiniBand link, but the RoCE ConnectX-4 LX cards
> segfault on any rdma_cm communication.
>
> I put the ConnectX-3 into Ethernet mode and ran the tests again and it
> passed all of them while the ConnectX-4 LX cards still failed. We have
> some ConnectX-4 EN 100 Gb cards in other boxes that have the same
> problem.
>
> It really looks like this problem is specific to ConnectX-4 (mlx5
> driver) when running in RoCE. I _don't_ have ConnectX-4 IB cards to
> test. We are also seeing the problem with the Mellanox drivers. I
> can't find http://www.mellanox.com/page/custom_firmware_table to build
> a new OEM firmware for my SuperMicro branded cards to test the latest
> firmware.
Robert,
Please avoid top-posting; it is unreadable.
Regarding your issue, the best way forward is to open a customer
support request and use the established procedures to get proper and
prompt support through the customer channel.
Thanks
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, May 16, 2017 at 4:00 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> > The ib_read_bw looks like it can use rdma_cm or not. By default, I can
> > get things to work between the nodes. If I specify -R or -z, it fails.
> > It seems that the context is not being set properly when using
> > rdma_cm.
> >
> > "Server"
> > -----------
> >
> > # ib_read_bw
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> > RDMA_Read BW Test
> > Dual-port : OFF Device : mlx5_0
> > Number of qps : 1 Transport type : IB
> > Connection type : RC Using SRQ : OFF
> > CQ Moderation : 100
> > Mtu : 1024[B]
> > Link type : Ethernet
> > GID index : 2
> > Outstand reads : 16
> > rdma_cm QPs : OFF
> > Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> > local address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey 0x00175e
> > VAddr 0x007fc73fd6e000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> > remote address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey
> > 0x002797 VAddr 0x007fe5cccc5000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
> > 65536 1000 2728.79 2728.77 0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8319
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8319]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -R'.
> > Program terminated with signal 11, Segmentation fault.
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135 return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > #1 0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2 ctx_init (ctx=ctx@entry=0x110b000,
> > user_param=user_param@entry=0x110ad70) at
> > src/perftest_resources.c:1314
> > #3 0x000000000040585c in rdma_server_connect (ctx=0x110b000,
> > user_param=0x110ad70) at src/perftest_communication.c:1119
> > #4 0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffcd8fec470) at src/perftest_communication.c:1244
> > #5 0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135 return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130 }
> > 131
> > 132 int __ibv_query_device(struct ibv_context *context,
> > 133 struct ibv_device_attr *device_attr)
> > 134 {
> > 135 return context->ops.query_device(context, device_attr);
> > 136 }
> > 137 default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139 int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> > # ib_read_bw -z
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8369
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8369]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -z'.
> > Program terminated with signal 11, Segmentation fault.
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135 return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > #1 0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2 ctx_init (ctx=ctx@entry=0x1b3d000,
> > user_param=user_param@entry=0x1b3cd70) at
> > src/perftest_resources.c:1314
> > #3 0x000000000040585c in rdma_server_connect (ctx=0x1b3d000,
> > user_param=0x1b3cd70)
> > at src/perftest_communication.c:1119
> > #4 0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffe5f5ee7c0) at src/perftest_communication.c:1244
> > #5 0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0 __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135 return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130 }
> > 131
> > 132 int __ibv_query_device(struct ibv_context *context,
> > 133 struct ibv_device_attr *device_attr)
> > 134 {
> > 135 return context->ops.query_device(context, device_attr);
> > 136 }
> > 137 default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139 int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> >
> > "Client"
> > ----------
> > # ib_read_bw 192.168.13.13
> > ---------------------------------------------------------------------------------------
> > RDMA_Read BW Test
> > Dual-port : OFF Device : mlx5_0
> > Number of qps : 1 Transport type : IB
> > Connection type : RC Using SRQ : OFF
> > TX depth : 128
> > CQ Moderation : 100
> > Mtu : 1024[B]
> > Link type : Ethernet
> > GID index : 2
> > Outstand reads : 16
> > rdma_cm QPs : OFF
> > Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> > local address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey 0x002797
> > VAddr 0x007fe5cccc5000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> > remote address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey
> > 0x00175e VAddr 0x007fc73fd6e000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
> > Conflicting CPU frequency values detected: 1200.024000 != 2600.000000.
> > CPU Frequency is not max.
> > 65536 1000 2728.79 2728.77 0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R 192.168.13.13
> > Unexpected CM event bl blka 8
> > Unable to perform rdma_client function
> > Unable to init the socket connection
> >
> > # ib_read_bw -z 192.168.13.13
> > Unexpected CM event bl blka 8
> > Unable to perform rdma_client function
> > Unable to init the socket connection
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Tue, May 16, 2017 at 2:50 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >> I installed OFED 4.0-2.0.0.1 on a fresh snapshot with the stock kernel
> >> (3.10.0-514.16.1.el7.x86_64). I'm getting a segfault on the server
> >> side, but not on the client side. I don't see any debug packages in
> >> the OFED package to load the symbols.
> >>
> >> rping server:
> >>
> >> # gdb rping core.10405
> >> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >> Copyright (C) 2013 Free Software Foundation, Inc.
> >> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >> This is free software: you are free to change and redistribute it.
> >> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> >> and "show warranty" for details.
> >> This GDB was configured as "x86_64-redhat-linux-gnu".
> >> For bug reporting instructions, please see:
> >> <http://www.gnu.org/software/gdb/bugs/>...
> >> Reading symbols from /usr/bin/rping...Reading symbols from
> >> /usr/bin/rping...(no debugging symbols found)...done.
> >> (no debugging symbols found)...done.
> >> [New LWP 10405]
> >> [New LWP 10408]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> Core was generated by `rping -s'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0 0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> Missing separate debuginfos, use: debuginfo-install
> >> librdmacm-utils-1.1.0mlnx-OFED.4.0.1.6.1.40200.x86_64
> >> (gdb) bt
> >> #0 0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> #1 0x0000000000402fe6 in rping_setup_qp.isra.7 ()
> >> #2 0x0000000000401d04 in main ()
> >> (gdb) list
> >> No symbol table is loaded. Use the "file" command.
> >>
> >> rping client:
> >>
> >> # rping -c -a 192.168.13.13
> >> cma event RDMA_CM_EVENT_REJECTED, error 28
> >> wait for CONNECTED state 4
> >> connect error -1
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Tue, May 16, 2017 at 1:23 PM, Robert LeBlanc <robert-4JaGZRWAfWbajFs6igw21g@public.gmane.org> wrote:
> >>> This is using ConnectX-4 LX RoCE cards, using only in-box drivers.
> >>>
> >>> While trying to debug some iSER issues, I'm trying to do rping between
> >>> the two hosts, but I'm getting a segfault. Sagi suggested that there
> >>> may be something wrong with my kernel ABI. I did a make mrproper and
> >>> built the latest 4.9.28 kernel and installed the kernel headers.
> >>>
> >>> make -j 32 && sudo make modules_install && sudo make install && sudo
> >>> make headers_install INSTALL_HDR_PATH=/usr
> >>>
> >>> After booting into the new kernel, I kept getting the segfaults, so I
> >>> rebuilt the libibverbs, libibumad, librdmacm packages in case they
> >>> aren't picking up the new kernel headers. Still no luck.
> >>>
> >>> Here is the server of rping with the rebuilt packages:
> >>> # gdb rping core.22936
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 22936]
> >>> [New LWP 22939]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -s'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0 __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196 pd = context->ops.alloc_pd(context);
> >>> (gdb) bt
> >>> #0 __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> #1 0x000055f60331d5f6 in rping_setup_qp (cb=cb@entry=0x55f603d74780,
> >>> cm_id=<optimized out>) at examples/rping.c:519
> >>> #2 0x000055f60331be7e in rping_run_server (cb=0x55f603d74780) at
> >>> examples/rping.c:890
> >>> #3 main (argc=2, argv=0x7ffcd16aae88) at examples/rping.c:1268
> >>> (gdb) f 0
> >>> #0 __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196 pd = context->ops.alloc_pd(context);
> >>> (gdb) list
> >>> 191
> >>> 192 struct ibv_pd *__ibv_alloc_pd(struct ibv_context *context)
> >>> 193 {
> >>> 194 struct ibv_pd *pd;
> >>> 195
> >>> 196 pd = context->ops.alloc_pd(context);
> >>> 197 if (pd)
> >>> 198 pd->context = context;
> >>> 199
> >>> 200 return pd;
> >>> (gdb) p context
> >>> $1 = (struct ibv_context *) 0x0
> >>>
> >>> Here is the rping client that does not have the rebuilt packages:
> >>> # gdb rping core.8253
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 8253]
> >>> [New LWP 8256]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -c -a 192.168.13.13'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0 __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299 ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) bt
> >>> #0 __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> #1 0x0000560e293cd917 in rping_free_buffers (cb=0x560e295e5780) at
> >>> examples/rping.c:470
> >>> #2 0x0000560e293cbf57 in rping_run_client (cb=<optimized out>) at
> >>> examples/rping.c:1111
> >>> #3 main (argc=<optimized out>, argv=<optimized out>) at examples/rping.c:1270
> >>> (gdb) f 9
> >>> #0 0x0000000000000000 in ?? ()
> >>> (gdb) f 0
> >>> #0 __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299 ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) list
> >>> 294 {
> >>> 295 int ret;
> >>> 296 void *addr = mr->addr;
> >>> 297 size_t length = mr->length;
> >>> 298
> >>> 299 ret = mr->context->ops.dereg_mr(mr);
> >>> 300 if (!ret)
> >>> 301 ibv_dofork_range(addr, length);
> >>> 302
> >>> 303 return ret;
> >>> (gdb) p mr
> >>> $1 = (struct ibv_mr *) 0x560e295e93b0
> >>> (gdb) p *mr
> >>> $2 = {context = 0x7fd423be5090, pd = 0x560e295e9960, addr =
> >>> 0x560e295e57e8, length = 16, handle = 0, lkey = 72829, rkey = 72829}
> >>> (gdb) p *mr->context
> >>> Cannot access memory at address 0x7fd423be5090
> >>>
> >>> Any ideas on what I'm doing wrong?
> >>>
> >>> Thanks,
> >>>
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
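Note for archive readers: every server-side backtrace quoted above ends in a verbs entry point (src/verbs.c:135 and src/verbs.c:196) that dereferences a context pointer which gdb shows to be 0x0. The thread does not establish why the context is NULL on the rdma_cm path. The sketch below only illustrates the defensive pattern an application could apply before making such a call. It is a minimal, self-contained model: the demo_* types and functions are hypothetical stand-ins invented for illustration, not the libibverbs API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the types seen in the backtraces.
 * The real struct ibv_context has many more members. */
struct demo_pd;

struct demo_context {
    struct demo_pd *(*alloc_pd)(struct demo_context *ctx);
};

struct demo_pd {
    struct demo_context *context;
};

/* Guarded variant of the pattern listed at src/verbs.c:196: reject a NULL
 * context (or a missing hook) instead of dereferencing it. */
static struct demo_pd *demo_alloc_pd_checked(struct demo_context *ctx)
{
    struct demo_pd *pd;

    if (ctx == NULL || ctx->alloc_pd == NULL) {
        errno = EINVAL;
        return NULL;
    }

    pd = ctx->alloc_pd(ctx);
    if (pd)
        pd->context = ctx;

    return pd;
}

/* Toy provider hook so the sketch can be exercised without hardware. */
static struct demo_pd demo_static_pd;

static struct demo_pd *demo_provider_alloc_pd(struct demo_context *ctx)
{
    (void)ctx;
    return &demo_static_pd;
}
```

With a check like this, a caller that receives no device context would get a NULL return and EINVAL rather than signal 11. That only turns the crash into a reportable error; it would not explain or fix the missing context itself.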