public inbox for linux-rdma@vger.kernel.org
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-27 17:02 Josh Beavers
From: Josh Beavers @ 2017-02-27 17:02 UTC
  To: Jason Gunthorpe
  Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Youngjae Lee

Jason and Yonatan,

On Mon, Feb 27, 2017 at 11:29 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:

> Ie deleted pad_3[31] by mistake!



I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
fixes the ibv_rc_pingpong hang on my two at91 ARM boards.  For some
reason the first few packets seem to send slowly, but once it gets
going the rest send quickly.

Youngjae, I suspect this may correct the issue you reported in
http://www.spinics.net/lists/linux-rdma/msg46451.html.



IMPORTANT:  I had previously found the pad_3[31] issue and corrected
it on its own.  With that fix, wr_id shows up in the kernel with the
correct value, but ibv_rc_pingpong would still sometimes (30% or so?)
fail with "Couldn't post send" and "parse WC failed 1" on one side.
Weirdly, it seems to fail more often just after a reboot, and only
occasionally once I have run several tests.
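
For illustration, a minimal sketch of why a dropped pad can corrupt a
field such as wr_id.  It assumes pad_3[31] is a padding array in a
structure whose layout is shared between user space and the kernel;
the thread does not show the real struct, so demo_queue_buf, its
fields, and the offsets below are hypothetical, not rdma-core code.

```c
#include <assert.h>   /* static_assert (C11), assert */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical shared layout, for illustration only. */
struct demo_queue_buf {
	uint32_t log2_elem_size;
	uint32_t index_mask;
	uint32_t pad_1[30];
	uint32_t producer_index;
	uint32_t pad_2[31];
	uint32_t consumer_index;
	uint32_t pad_3[31];      /* the pad one side expects */
	uint8_t  data[];         /* queue entries start here */
};

/* The same layout with the trailing pad deleted by mistake. */
struct demo_queue_buf_broken {
	uint32_t log2_elem_size;
	uint32_t index_mask;
	uint32_t pad_1[30];
	uint32_t producer_index;
	uint32_t pad_2[31];
	uint32_t consumer_index;
	uint8_t  data[];         /* entries now start 124 bytes early */
};

/* Pin the offsets at compile time so a dropped pad fails the build. */
static_assert(offsetof(struct demo_queue_buf, data) == 384,
	      "entry area moved: layout no longer matches the other side");
static_assert(offsetof(struct demo_queue_buf_broken, data) == 260,
	      "broken layout places entries right after consumer_index");

/* Bytes by which the two sides disagree about where entries begin. */
static inline size_t demo_data_skew(void)
{
	return offsetof(struct demo_queue_buf, data) -
	       offsetof(struct demo_queue_buf_broken, data);
}
```

With the two sides 124 bytes apart, every entry field written by one
side is read at the wrong place by the other, which would be
consistent with a wrong wr_id.  A compile-time check like the first
static_assert is one cheap way to catch this kind of typo.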

Jason, was it intentional that rmb() was removed with no replacement
in rxe_post_one_recv()?   See
https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
for details.
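
To make the question concrete, here is a generic sketch of the C11
pattern that usually stands in for an rmb()/wmb() pair on a
single-producer, single-consumer ring.  Everything in it (demo_ring,
demo_post, demo_poll, DEMO_SLOTS) is hypothetical and is not the
rdma-core code; whether rxe_post_one_recv() needs an equivalent
acquire is exactly what is being asked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SLOTS 8u   /* must be a power of two */

/* Hypothetical ring shared by one producer and one consumer. */
struct demo_ring {
	_Atomic uint32_t producer_index;  /* advanced only by the producer */
	_Atomic uint32_t consumer_index;  /* advanced only by the consumer */
	uint64_t wr_id[DEMO_SLOTS];
};

/* Post one entry.  Returns 0 on success, -1 if the ring is full. */
static int demo_post(struct demo_ring *q, uint64_t wr_id)
{
	uint32_t prod = atomic_load_explicit(&q->producer_index,
					     memory_order_relaxed);
	/* Acquire load: pairs with the consumer's release store, so the
	 * slot is not reused before the consumer has finished with it.
	 * This is the role a read barrier plays in the macro style. */
	uint32_t cons = atomic_load_explicit(&q->consumer_index,
					     memory_order_acquire);

	if (((prod + 1) & (DEMO_SLOTS - 1)) == cons)
		return -1;

	q->wr_id[prod] = wr_id;
	/* Release store: publishes the entry before the new index. */
	atomic_store_explicit(&q->producer_index,
			      (prod + 1) & (DEMO_SLOTS - 1),
			      memory_order_release);
	return 0;
}

/* Take one entry.  Returns 0 on success, -1 if the ring is empty. */
static int demo_poll(struct demo_ring *q, uint64_t *wr_id)
{
	uint32_t cons = atomic_load_explicit(&q->consumer_index,
					     memory_order_relaxed);
	/* Acquire load: pairs with the producer's release store. */
	uint32_t prod = atomic_load_explicit(&q->producer_index,
					     memory_order_acquire);

	if (cons == prod)
		return -1;

	*wr_id = q->wr_id[cons];
	atomic_store_explicit(&q->consumer_index,
			      (cons + 1) & (DEMO_SLOTS - 1),
			      memory_order_release);
	return 0;
}
```

If the acquire on the opposite index is dropped with nothing in its
place, a weakly ordered CPU such as ARM may in principle reorder the
index read against the slot access, which is why a barrier removed
without a replacement stands out.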

Unfortunately, even after reverting the C11 atomics commit, I still
occasionally see "Couldn't post send" failures that kill the ping.
Is this a known issue?


Thanks,
-G
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-25  3:27 GAFBlizzard
From: GAFBlizzard @ 2017-02-25  3:27 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hello,

I have Linux 4.10.0 stable running on two at91 ARM machines.  I have
rdma-core 13 installed on both.

"rxe_cfg status" shows normal information, e.g.:
  Name  Link  Driver  Speed  NMTU  IPv4_addr     RDEV  RMTU
  eth0  yes   macb           1500  192.168.0.12  rxe0  1024  (3)

"ibv_devinfo" likewise shows normal information, e.g.:
hca_id: rxe0
 transport:   InfiniBand (0)
 fw_ver:    0.0.0
 node_guid:   1034:56ff:fe84:1952
 sys_image_guid:   0000:0000:0000:0000
 vendor_id:   0x0000
 vendor_part_id:   0
 hw_ver:    0x0
 phys_port_cnt:   1
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  1024 (3)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  Ethernet


Every communication tool I have tried hangs after printing remote
address information.  No errors are printed or logged in dmesg.
Example:

## This is system A
# ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
  local address:  LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID ::ffff:192.168.0.11
  remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID ::ffff:192.168.0.12

## This is system B
# ibv_rc_pingpong -d rxe0 -g 1 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID ::ffff:192.168.0.12
  remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID ::ffff:192.168.0.11



If it makes a difference, I have a 10/100 switch connected at the
moment.  I am merely trying to verify functionality, not reach high
speeds.

I have found previous message(s) with similar problems on mailing
lists and online but no resolution to date.  Is there any
configuration option I might have missed?  I have no iptables
firewall, and have even tried directly connecting the two systems
instead of using the Ethernet switch.

Thanks,
G

end of thread, other threads:[~2017-02-27 20:38 UTC | newest]

Thread overview: 7+ messages
2017-02-27 17:02 ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13 Josh Beavers
     [not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-27 17:15   ` Jason Gunthorpe
     [not found]     ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-27 18:59       ` Jason Gunthorpe
  -- strict thread matches above, loose matches on Subject: below --
2017-02-25  3:27 GAFBlizzard
     [not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-26 16:09   ` Yonatan Cohen
     [not found]     ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-02-27 16:29       ` Jason Gunthorpe
     [not found]         ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-02-27 20:38           ` Majd Dibbiny
