public inbox for linux-rdma@vger.kernel.org
* ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-25  3:27 GAFBlizzard
       [not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: GAFBlizzard @ 2017-02-25  3:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hello,

I have Linux 4.10.0 stable running on two at91 ARM machines.  I have
rdma-core 13 installed on both.

"rxe_cfg status" shows normal information, e.g.:
  Name  Link  Driver  Speed  NMTU  IPv4_addr     RDEV  RMTU
  eth0  yes   macb           1500  192.168.0.12  rxe0  1024  (3)

"ibv_devinfo" likewise shows normal information, e.g.:
hca_id: rxe0
 transport:   InfiniBand (0)
 fw_ver:    0.0.0
 node_guid:   1034:56ff:fe84:1952
 sys_image_guid:   0000:0000:0000:0000
 vendor_id:   0x0000
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  1024 (3)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet


Every communication tool I have tried hangs after printing remote
address information.  No errors are printed or logged in dmesg.
Example:

## This is system A
# ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
  local address:  LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11
  remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12

## This is system B
# ibv_rc_pingpong -d rxe0 -g 1 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12
  remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11



If it makes a difference, I have a 10/100 switch connected at the
moment.  I am merely trying to verify functionality, not reach high
speeds.

I have found previous message(s) with similar problems on mailing
lists and online but no resolution to date.  Is there any
configuration option I might have missed?  I have no iptables
firewall, and have even tried directly connecting the two systems
instead of using the Ethernet switch.

Thanks,
G
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
       [not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-26 16:09   ` Yonatan Cohen
       [not found]     ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Yonatan Cohen @ 2017-02-26 16:09 UTC (permalink / raw)
  To: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Jason Gunthorpe, Youngjae Lee

Hi all,

I managed to reproduce the issue on my x86 setup.
Last time, my user-space libraries weren't up to date, which is why the test passed.

I bisected the rdma-core library and figured out that the following 
commit introduced this regression:
6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics

I haven't debugged this yet and would appreciate Jason's input.

Can you confirm that reverting the above commit resolves the issue on 
ARM as well?

Thanks



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
       [not found]     ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-27 16:29       ` Jason Gunthorpe
       [not found]         ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 16:29 UTC (permalink / raw)
  To: Yonatan Cohen
  Cc: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Youngjae Lee,
	Josh Beavers

On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:

> I bisected the rdma-core library and figured out that the following commit
> introduced this regression:
> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
> 
> I haven't debugged this yet and would appreciate Jason's input.

Oops, I think I typo'd it here:

--- a/providers/rxe/rxe_queue.h
+++ b/providers/rxe/rxe_queue.h
@@ -37,15 +37,16 @@
 #ifndef H_RXE_PCQ
 #define H_RXE_PCQ
 
+#include <stdatomic.h>
+
 /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
 struct rxe_queue {
        uint32_t                log2_elem_size;
        uint32_t                index_mask;
        uint32_t                pad_1[30];
-       volatile uint32_t       producer_index;
+       _Atomic(uint32_t)       producer_index;
        uint32_t                pad_2[31];
-       volatile uint32_t       consumer_index;
-       uint32_t                pad_3[31];
+       _Atomic(uint32_t)       consumer_index;


Ie deleted pad_3[31] by mistake!

My bad
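
To make the effect of the missing pad_3[31] concrete, here is a minimal sketch that compares the two header layouts from the diff above. The struct names (demo_queue_broken, demo_queue_fixed) and the helper demo_header_shortfall() are hypothetical and not part of rdma-core. The sketch assumes _Atomic(uint32_t) has the same size and alignment as uint32_t, which holds for GCC and Clang on common x86 and ARM targets but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Header layout as left by the regressing commit: pad_3 is missing. */
struct demo_queue_broken {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
};

/* Header layout with pad_3 restored, as in the pre-regression struct. */
struct demo_queue_fixed {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
	uint32_t		pad_3[31];
};

/* Compile-time checks: each index sits at the start of its own 128-byte region
 * (assumes a 4-byte, 4-byte-aligned _Atomic(uint32_t)). */
static_assert(offsetof(struct demo_queue_fixed, producer_index) == 128,
	      "producer_index must start at byte 128");
static_assert(offsetof(struct demo_queue_fixed, consumer_index) == 256,
	      "consumer_index must start at byte 256");
static_assert(sizeof(struct demo_queue_fixed) == 384,
	      "header must be 384 bytes long");

/* Number of bytes by which the broken header is shorter than the fixed one. */
static inline size_t demo_header_shortfall(void)
{
	return sizeof(struct demo_queue_fixed) - sizeof(struct demo_queue_broken);
}
```

Under these assumptions the broken header is 260 bytes rather than 384, so anything user space locates relative to the end of the header would be off by 124 bytes from where the other side of the shared queue expects it. A static_assert of this kind next to the struct definition would turn such a mismatch into a build failure rather than a hang at run time.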

Jason

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-27 17:02 Josh Beavers
       [not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Josh Beavers @ 2017-02-27 17:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Youngjae Lee

Jason and Yonatan,

On Mon, Feb 27, 2017 at 11:29 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:

> Ie deleted pad_3[31] by mistake!



I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
fixes ibv_rc_pingpong on my two at91 ARM boards.  For some reason the
first few packets seem to send slowly, but once it gets going the rest
send quickly.

Youngjae, I suspect this may correct the issue you reported in
http://www.spinics.net/lists/linux-rdma/msg46451.html.



IMPORTANT:  I had previously found the pad_3[31] issue and corrected
it.  That resulted in wr_id showing up in the kernel with the correct
value, but ibv_rc_pingpong would still sometimes (30% or so?) fail
with "Couldn't post send" "parse WC failed 1" on one side.  Weirdly,
it seems to fail more often just after a reboot, and only occasionally
once I run several tests.

Jason, was it intentional that rmb() was removed with no replacement
in rxe_post_one_recv()?   See
https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
for details.

Unfortunately, even after reverting the C11 atomics commit, I still
seem to observe "Couldn't post send" failures which kill the ping
occasionally.  Is this a known issue?


Thanks,
-G

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
       [not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-27 17:15   ` Jason Gunthorpe
       [not found]     ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 17:15 UTC (permalink / raw)
  To: Josh Beavers
  Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Youngjae Lee

On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:

> I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> fixes ibv_rc_pingpong on my two at91 ARM boards.  For some reason the
> first few packets seem to send slowly, but once it gets going the rest
> send quickly.

How are you able to get it to compile? Without that commit ARM32
should blow up?

> Jason, was it intentional that rmb() was removed with no replacement
> in rxe_post_one_recv()?   See
> https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
> for details.

Yes, the rmb should have been a wmb, and the atomic_thread_fence added
in advance_producer correctly replaces it.
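
As an illustration of the pattern described here, the following is a minimal single-producer sketch in C11. It is not the rdma-core code: struct demo_ring, demo_ring_init(), demo_advance_producer() and demo_post() are hypothetical names, and the choice of a release fence is an assumption about what a write barrier maps to in C11 terms.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 4-slot ring with a single producer. */
struct demo_ring {
	uint32_t		index_mask;	/* number of slots - 1 */
	_Atomic(uint32_t)	producer_index;
	uint32_t		slots[4];
};

static inline void demo_ring_init(struct demo_ring *q)
{
	q->index_mask = 3;
	atomic_init(&q->producer_index, 0);
}

static inline void demo_advance_producer(struct demo_ring *q)
{
	uint32_t next;

	/* Order the earlier slot writes before the index update (the role a wmb plays). */
	atomic_thread_fence(memory_order_release);
	next = (atomic_load_explicit(&q->producer_index,
				     memory_order_relaxed) + 1) & q->index_mask;
	atomic_store_explicit(&q->producer_index, next, memory_order_relaxed);
}

static inline void demo_post(struct demo_ring *q, uint32_t value)
{
	uint32_t idx = atomic_load_explicit(&q->producer_index,
					    memory_order_relaxed);

	q->slots[idx] = value;		/* fill the slot first ... */
	demo_advance_producer(q);	/* ... then publish it */
}
```

The fence sits between filling the slot and publishing the new index, so a consumer that observes the updated index with matching acquire ordering also observes the slot contents. With the fence inside the advance helper, the posting function needs no separate barrier of its own, which matches the explanation given above for why the old rmb() call has no direct replacement in rxe_post_one_recv().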

> Unfortunately, even after reverting the C11 atomics commit, I still
> seem to observe "Couldn't post send" failures which kill the ping
> occasionally.  Is this a known issue?

Sounds like this is unrelated to 6b26a9e24739.

Jason

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
       [not found]     ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 18:59       ` Jason Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 18:59 UTC (permalink / raw)
  To: Josh Beavers
  Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Youngjae Lee

On Mon, Feb 27, 2017 at 10:15:49AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:
> 
> > I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> > fixes ibv_rc_pingpong on my two at91 ARM boards.  For some reason the
> > first few packets seem to send slowly, but once it gets going the rest
> > send quickly.
> 
> How are you able to get it to compile? Without that commit ARM32
> should blow up?

https://github.com/linux-rdma/rdma-core/pull/84

Jason

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
       [not found]         ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 20:38           ` Majd Dibbiny
  0 siblings, 0 replies; 7+ messages in thread
From: Majd Dibbiny @ 2017-02-27 20:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yonatan Cohen, GAFBlizzard,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Youngjae Lee,
	Josh Beavers


> On Feb 27, 2017, at 7:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> 
>> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>> 
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>> 
>> I haven't debugged this yet and would appreciate Jason's input.
> 
> Oops, I think I typo'd it here:
> 
> --- a/providers/rxe/rxe_queue.h
> +++ b/providers/rxe/rxe_queue.h
> @@ -37,15 +37,16 @@
> #ifndef H_RXE_PCQ
> #define H_RXE_PCQ
> 
> +#include <stdatomic.h>
> +
> /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
> struct rxe_queue {
>        uint32_t                log2_elem_size;
>        uint32_t                index_mask;
>        uint32_t                pad_1[30];
> -       volatile uint32_t       producer_index;
> +       _Atomic(uint32_t)       producer_index;
>        uint32_t                pad_2[31];
> -       volatile uint32_t       consumer_index;
> -       uint32_t                pad_3[31];
> +       _Atomic(uint32_t)       consumer_index;
> 
> 
> Ie deleted pad_3[31] by mistake!
> 
> My bad
Thanks Jason for looking into it.
Can you please submit this fix to rdma-core?
> 
> Jason

^ permalink raw reply	[flat|nested] 7+ messages in thread
