* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-27 17:02 Josh Beavers
[not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Josh Beavers @ 2017-02-27 17:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee
Jason and Yonatan,
On Mon, Feb 27, 2017 at 11:29 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:
> Ie deleted pad_3[31] by mistake!
I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
first few packets seem to send slowly, but once it gets going the rest
send quickly.
Youngjae, I suspect this may correct the issue you reported in
http://www.spinics.net/lists/linux-rdma/msg46451.html.
IMPORTANT: I had previously found the pad_3[31] issue and corrected
it. That resulted in wr_id showing up in the kernel with the correct
value, but ibv_rc_pingpong would still sometimes (30% or so?) fail
with "Couldn't post send" "parse WC failed 1" on one side. Weirdly,
it seems to fail more often just after a reboot, and only occasionally
once I run several tests.
Jason, was it intentional that rmb() was removed with no replacement
in rxe_post_one_recv()? See
https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
for details.
Unfortunately, even after reverting the C11 atomics commit, I still
seem to observe "Couldn't post send" failures which kill the ping
occasionally. Is this a known issue?
Thanks,
-G
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[parent not found: <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-27 17:15 ` Jason Gunthorpe
[not found] ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 17:15 UTC (permalink / raw)
To: Josh Beavers
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee

On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:
> I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
> first few packets seem to send slowly, but once it gets going the rest
> send quickly.

How are you able to get it to compile? Without that commit ARM32
should blow up?

> Jason, was it intentional that rmb() was removed with no replacement
> in rxe_post_one_recv()? See
> https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
> for details.

Yes, the rmb should have been a wmb, and the atomic_thread_fence added
in advance_producer correctly replaces it.

> Unfortunately, even after reverting the C11 atomics commit, I still
> seem to observe "Couldn't post send" failures which kill the ping
> occasionally. Is this a known issue?

Sounds like this is unrelated to 6b26..

Jason
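The ordering point in the reply above (a write barrier before the producer index is published, now expressed as atomic_thread_fence) can be illustrated with a small sketch. This is not the rdma-core code: the thread does not show the body of advance_producer, so struct ring, RING_SLOTS, ring_post and ring_poll below are hypothetical names for a generic single-producer, single-consumer ring written with C11 atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 8u	/* hypothetical size; must be a power of two */

/* Hypothetical ring, not the rxe_queue layout. */
struct ring {
	_Atomic(uint32_t)	producer_index;
	_Atomic(uint32_t)	consumer_index;
	uint64_t		slots[RING_SLOTS];
};

/* Producer side: fill the slot first, then publish the new index. */
static inline int ring_post(struct ring *r, uint64_t wr_id)
{
	uint32_t prod = atomic_load_explicit(&r->producer_index,
					     memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer_index,
					     memory_order_acquire);
	uint32_t next = (prod + 1) & (RING_SLOTS - 1);

	if (next == cons)
		return -1;	/* ring is full */

	r->slots[prod] = wr_id;

	/*
	 * Plays the role of the old wmb(): the slot write above must be
	 * visible before the index store below.
	 */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->producer_index, next, memory_order_relaxed);
	return 0;
}

/* Consumer side: observe the index, then read the slot it covers. */
static inline int ring_poll(struct ring *r, uint64_t *wr_id)
{
	uint32_t cons = atomic_load_explicit(&r->consumer_index,
					     memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer_index,
					     memory_order_relaxed);

	if (cons == prod)
		return -1;	/* ring is empty */

	/* Pairs with the release fence in ring_post(). */
	atomic_thread_fence(memory_order_acquire);
	*wr_id = r->slots[cons];

	atomic_store_explicit(&r->consumer_index,
			      (cons + 1) & (RING_SLOTS - 1),
			      memory_order_release);
	return 0;
}
```

In this sketch the fence sits on the producer side, which is why a read barrier at the point where the entry is posted would not give the needed ordering: the consumer has to see the slot contents no later than it sees the advanced index.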
[parent not found: <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 18:59 ` Jason Gunthorpe
0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 18:59 UTC (permalink / raw)
To: Josh Beavers
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee

On Mon, Feb 27, 2017 at 10:15:49AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:
>
> > I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> > fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
> > first few packets seem to send slowly, but once it gets going the rest
> > send quickly.
>
> How are you able to get it to compile? Without that commit ARM32
> should blow up?

https://github.com/linux-rdma/rdma-core/pull/84

Jason
* ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-25 3:27 GAFBlizzard
[not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: GAFBlizzard @ 2017-02-25 3:27 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello,
I have Linux 4.10.0 stable running on two at91 ARM machines. I have
rdma-core 13 installed on both.
"rxe_cfg status" shows normal information, e.g.:
  Name  Link  Driver  Speed  NMTU  IPv4_addr     RDEV  RMTU
  eth0  yes   macb           1500  192.168.0.12  rxe0  1024 (3)
"ibv_devinfo" likewise shows normal information, e.g.:
hca_id: rxe0
    transport:          InfiniBand (0)
    fw_ver:             0.0.0
    node_guid:          1034:56ff:fe84:1952
    sys_image_guid:     0000:0000:0000:0000
    vendor_id:          0x0000
    vendor_part_id:     0
    hw_ver:             0x0
    phys_port_cnt:      1
        port: 1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
Every communication tool I have tried hangs after printing remote
address information. No errors are printed or logged in dmesg.
Example:
## This is system A
# ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
local address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11
remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12
## This is system B
# ibv_rc_pingpong -d rxe0 -g 1 -i 1
local address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12
remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11
If it makes a difference, I have a 10/100 switch connected at the
moment. I am merely trying to verify functionality, not reach high
speeds.
I have found previous message(s) with similar problems on mailing
lists and online but no resolution to date. Is there any
configuration option I might have missed? I have no iptables
firewall, and have even tried directly connecting the two systems
instead of using the Ethernet switch.
Thanks,
G
[parent not found: <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-26 16:09 ` Yonatan Cohen
[not found] ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Yonatan Cohen @ 2017-02-26 16:09 UTC (permalink / raw)
To: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Jason Gunthorpe, Youngjae Lee

On 2/25/2017 5:27 AM, GAFBlizzard wrote:
> Hello,
>
> I have Linux 4.10.0 stable running on two at91 ARM machines. I have
> rdma-core 13 installed on both.
>
> "rxe_cfg status" shows normal information, e.g.:
>   Name  Link  Driver  Speed  NMTU  IPv4_addr     RDEV  RMTU
>   eth0  yes   macb           1500  192.168.0.12  rxe0  1024 (3)
>
> "ibv_devinfo" likewise shows normal information, e.g.:
> hca_id: rxe0
>     transport:          InfiniBand (0)
>     fw_ver:             0.0.0
>     node_guid:          1034:56ff:fe84:1952
>     sys_image_guid:     0000:0000:0000:0000
>     vendor_id:          0x0000
>     vendor_part_id:     0
>     hw_ver:             0x0
>     phys_port_cnt:      1
>         port: 1
>             state:          PORT_ACTIVE (4)
>             max_mtu:        4096 (5)
>             active_mtu:     1024 (3)
>             sm_lid:         0
>             port_lid:       0
>             port_lmc:       0x00
>             link_layer:     Ethernet
>
> Every communication tool I have tried hangs after printing remote
> address information. No errors are printed or logged in dmesg.
> Example:
>
> ## This is system A
> # ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
> local address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
> ::ffff:192.168.0.11
> remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
> ::ffff:192.168.0.12
>
> ## This is system B
> # ibv_rc_pingpong -d rxe0 -g 1 -i 1
> local address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
> ::ffff:192.168.0.12
> remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
> ::ffff:192.168.0.11
>
> If it makes a difference, I have a 10/100 switch connected at the
> moment. I am merely trying to verify functionality, not reach high
> speeds.
>
> I have found previous message(s) with similar problems on mailing
> lists and online but no resolution to date. Is there any
> configuration option I might have missed? I have no iptables
> firewall, and have even tried directly connecting the two systems
> instead of using the Ethernet switch.
>

Hi all,

I succeeded in reproducing the issue on my x86 setup.
Last time my user-space libraries weren't up to date, and thus it passed.

I bisected the rdma-core library and figured out that the following commit
introduced this regression:
6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics

I haven't debugged this yet and would appreciate Jason's input.

Can you confirm that reverting the above commit solves the issue on
ARM as well?

Thanks

> Thanks,
> G
[parent not found: <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-27 16:29 ` Jason Gunthorpe
[not found] ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 16:29 UTC (permalink / raw)
To: Yonatan Cohen
Cc: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Youngjae Lee,
Josh Beavers

On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:

> I bisected the rdma-core library and figured out that the following commit
> introduced this regression:
> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>
> I haven't debugged this yet and would appreciate Jason's input.

Oops, I think I typo'd it here:

--- a/providers/rxe/rxe_queue.h
+++ b/providers/rxe/rxe_queue.h
@@ -37,15 +37,16 @@
 #ifndef H_RXE_PCQ
 #define H_RXE_PCQ
 
+#include <stdatomic.h>
+
 /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
 struct rxe_queue {
 	uint32_t		log2_elem_size;
 	uint32_t		index_mask;
 	uint32_t		pad_1[30];
-	volatile uint32_t	producer_index;
+	_Atomic(uint32_t)	producer_index;
 	uint32_t		pad_2[31];
-	volatile uint32_t	consumer_index;
-	uint32_t		pad_3[31];
+	_Atomic(uint32_t)	consumer_index;

Ie deleted pad_3[31] by mistake!

My bad

Jason
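Why a deleted pad matters can be checked at compile time. The sketch below is only an illustration built from the fields visible in the diff above: the struct names rxe_queue_fixed and rxe_queue_broken are hypothetical, and the trailing data[] member is an assumption (the diff excerpt stops before the end of the struct) standing in for wherever the queue elements begin. It also assumes _Atomic(uint32_t) occupies 4 bytes, as it does on common GCC and Clang targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Layout with pad_3 kept, mirroring the fields shown in the diff. */
struct rxe_queue_fixed {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
	uint32_t		pad_3[31];
	uint8_t			data[];	/* assumed: elements follow the header */
};

/* Same layout with pad_3 deleted, as in the regression. */
struct rxe_queue_broken {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
	uint8_t			data[];	/* assumed: elements follow the header */
};

/* Each index sits at the start of its own 128-byte block. */
static_assert(offsetof(struct rxe_queue_fixed, producer_index) == 128,
	      "producer_index must start the second 128-byte block");
static_assert(offsetof(struct rxe_queue_fixed, consumer_index) == 256,
	      "consumer_index must start the third 128-byte block");

/* Byte offset at which each variant expects the first queue element. */
static inline size_t fixed_data_offset(void)
{
	return offsetof(struct rxe_queue_fixed, data);
}

static inline size_t broken_data_offset(void)
{
	return offsetof(struct rxe_queue_broken, data);
}
```

Under these assumptions the two index fields keep their offsets in both variants, but anything placed after the header moves 124 bytes earlier once pad_3 is gone, so a shared buffer would be read at a different offset than the one the other side wrote to. Assertions of this kind next to the struct definition would turn such a mismatch into a build failure.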
[parent not found: <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 20:38 ` Majd Dibbiny
0 siblings, 0 replies; 7+ messages in thread
From: Majd Dibbiny @ 2017-02-27 20:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Cohen, GAFBlizzard,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Youngjae Lee,
Josh Beavers

> On Feb 27, 2017, at 7:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
>> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:
>
> --- a/providers/rxe/rxe_queue.h
> +++ b/providers/rxe/rxe_queue.h
> @@ -37,15 +37,16 @@
>  #ifndef H_RXE_PCQ
>  #define H_RXE_PCQ
>
> +#include <stdatomic.h>
> +
>  /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
>  struct rxe_queue {
>  	uint32_t		log2_elem_size;
>  	uint32_t		index_mask;
>  	uint32_t		pad_1[30];
> -	volatile uint32_t	producer_index;
> +	_Atomic(uint32_t)	producer_index;
>  	uint32_t		pad_2[31];
> -	volatile uint32_t	consumer_index;
> -	uint32_t		pad_3[31];
> +	_Atomic(uint32_t)	consumer_index;
>
>
> Ie deleted pad_3[31] by mistake!
>
> My bad

Thanks Jason for looking into it.
Can you please submit this fix to rdma-core?

>
> Jason