* ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-25 3:27 GAFBlizzard
[not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: GAFBlizzard @ 2017-02-25 3:27 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello,
I have Linux 4.10.0 stable running on two at91 ARM machines. I have
rdma-core 13 installed on both.
"rxe_cfg status" shows normal information, e.g.:
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eth0 yes macb 1500 192.168.0.12 rxe0 1024 (3)
"ibv_devinfo" likewise shows normal information, e.g.:
hca_id: rxe0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 1034:56ff:fe84:1952
sys_image_guid: 0000:0000:0000:0000
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
Every communication tool I have tried hangs after printing remote
address information. No errors are printed or logged in dmesg.
Example:
## This is system A
# ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
local address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11
remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12
## This is system B
# ibv_rc_pingpong -d rxe0 -g 1 -i 1
local address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
::ffff:192.168.0.12
remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
::ffff:192.168.0.11
If it makes a difference, I have a 10/100 switch connected at the
moment. I am merely trying to verify functionality, not reach high
speeds.
I have found previous message(s) with similar problems on mailing
lists and online but no resolution to date. Is there any
configuration option I might have missed? I have no iptables
firewall, and have even tried directly connecting the two systems
instead of using the Ethernet switch.
Thanks,
G
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <CABQspYbv7j58pdLLbPegE8Bc3qhwb-3+4E8SQ2U9jObkeTbrzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-26 16:09 ` Yonatan Cohen
[not found] ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Yonatan Cohen @ 2017-02-26 16:09 UTC (permalink / raw)
To: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Jason Gunthorpe, Youngjae Lee
On 2/25/2017 5:27 AM, GAFBlizzard wrote:
> Hello,
>
> I have Linux 4.10.0 stable running on two at91 ARM machines. I have
> rdma-core 13 installed on both.
>
> "rxe_cfg status" shows normal information, e.g.:
> Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
> eth0 yes macb 1500 192.168.0.12 rxe0 1024 (3)
>
> "ibv_devinfo" likewise shows normal information, e.g.:
> hca_id: rxe0
> transport: InfiniBand (0)
> fw_ver: 0.0.0
> node_guid: 1034:56ff:fe84:1952
> sys_image_guid: 0000:0000:0000:0000
> vendor_id: 0x0000
> vendor_part_id: 0
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
>
>
> Every communication tool I have tried hangs after printing remote
> address information. No errors are printed or logged in dmesg.
> Example:
>
> ## This is system A
> # ibv_rc_pingpong -d rxe0 -g 1 -i 1 192.168.0.12
> local address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
> ::ffff:192.168.0.11
> remote address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
> ::ffff:192.168.0.12
>
> ## This is system B
> # ibv_rc_pingpong -d rxe0 -g 1 -i 1
> local address: LID 0x0000, QPN 0x000011, PSN 0xc55eed, GID
> ::ffff:192.168.0.12
> remote address: LID 0x0000, QPN 0x000011, PSN 0xd1d8a8, GID
> ::ffff:192.168.0.11
>
>
>
> If it makes a difference, I have a 10/100 switch connected at the
> moment. I am merely trying to verify functionality, not reach high
> speeds.
>
> I have found previous message(s) with similar problems on mailing
> lists and online but no resolution to date. Is there any
> configuration option I might have missed? I have no iptables
> firewall, and have even tried directly connecting the two systems
> instead of using the Ethernet switch.
>
Hi all,
I managed to reproduce the issue on my x86 setup.
Last time my user-space libraries weren't up to date, which is why it passed.
I bisected the rdma-core library and figured out that the following
commit introduced this regression:
6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
I haven't debugged this yet and would appreciate Jason's input.
Can you confirm that reverting the above commit solves the issue on
ARM as well?
Thanks
> Thanks,
> G
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <4e077022-5e5f-6ba8-530c-b86d2f09313e-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-02-27 16:29 ` Jason Gunthorpe
[not found] ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 16:29 UTC (permalink / raw)
To: Yonatan Cohen
Cc: GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Youngjae Lee,
Josh Beavers
On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
> I bisected the rdma-core library and figured out that the following commit
> introduced this regression:
> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>
> I haven't debugged this yet and would appreciate Jason's input.
Oops, I think I typo'd it here:
--- a/providers/rxe/rxe_queue.h
+++ b/providers/rxe/rxe_queue.h
@@ -37,15 +37,16 @@
 #ifndef H_RXE_PCQ
 #define H_RXE_PCQ

+#include <stdatomic.h>
+
 /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
 struct rxe_queue {
 	uint32_t		log2_elem_size;
 	uint32_t		index_mask;
 	uint32_t		pad_1[30];
-	volatile uint32_t	producer_index;
+	_Atomic(uint32_t)	producer_index;
 	uint32_t		pad_2[31];
-	volatile uint32_t	consumer_index;
-	uint32_t		pad_3[31];
+	_Atomic(uint32_t)	consumer_index;
I.e., I deleted pad_3[31] by mistake!
My bad
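
For illustration only, a minimal sketch of why the missing pad_3[31]
matters. The two struct names are hypothetical, and the sketch assumes
that the kernel side still carries pad_3 (as the "MUST MATCH" comment
implies), that queue elements are laid out directly after this header,
and that _Atomic(uint32_t) is 4 bytes wide, as it is with GCC and Clang
on x86 and ARM. Under those assumptions user space and kernel disagree
by 124 bytes about where the header ends:

```c
#include <assert.h>	/* static_assert (C11) */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: the header as the diff leaves it, without pad_3. */
struct demo_queue_broken {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
};

/* Hypothetical name: the same header with pad_3 restored. */
struct demo_queue_fixed {
	uint32_t		log2_elem_size;
	uint32_t		index_mask;
	uint32_t		pad_1[30];
	_Atomic(uint32_t)	producer_index;
	uint32_t		pad_2[31];
	_Atomic(uint32_t)	consumer_index;
	uint32_t		pad_3[31];
};

/* Both indices sit at the start of their own 128-byte block. */
static_assert(offsetof(struct demo_queue_fixed, producer_index) == 128,
	      "producer_index must start the second 128-byte block");
static_assert(offsetof(struct demo_queue_fixed, consumer_index) == 256,
	      "consumer_index must start the third 128-byte block");

/* Assumption: 4-byte _Atomic(uint32_t); the padded header is 3 * 128 bytes. */
static_assert(sizeof(struct demo_queue_fixed) == 384,
	      "padded header is expected to be 384 bytes");

/* Dropping pad_3 shortens the header by 31 words, i.e. 124 bytes. */
static_assert(sizeof(struct demo_queue_fixed) -
	      sizeof(struct demo_queue_broken) == 31 * sizeof(uint32_t),
	      "pad_3 accounts for the whole size difference");
```

A compile-time check of this kind next to a shared layout would turn
such a slip into a build failure rather than a silent hang.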
Jason
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
@ 2017-02-27 17:02 Josh Beavers
[not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Josh Beavers @ 2017-02-27 17:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee
Jason and Yonatan,
On Mon, Feb 27, 2017 at 11:29 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:
> Ie deleted pad_3[31] by mistake!
I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
first few packets seem to send slowly, but once it gets going the rest
send quickly.
Youngjae, I suspect this may correct the issue you reported in
http://www.spinics.net/lists/linux-rdma/msg46451.html.
IMPORTANT: I had previously found the pad_3[31] issue and corrected
it. That resulted in wr_id showing up in the kernel with the correct
value, but ibv_rc_pingpong would still sometimes (30% or so?) fail
with "Couldn't post send" "parse WC failed 1" on one side. Weirdly,
it seems to fail more often just after a reboot, and only occasionally
once I run several tests.
Jason, was it intentional that rmb() was removed with no replacement
in rxe_post_one_recv()? See
https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
for details.
Unfortunately, even after reverting the C11 atomics commit, I still
seem to observe "Couldn't post send" failures which kill the ping
occasionally. Is this a known issue?
Thanks,
-G
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <CAE=AiOMqGzMC6sD-cXB_sGRH_L15annm_4WottmY17oCSNZveA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-27 17:15 ` Jason Gunthorpe
[not found] ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 17:15 UTC (permalink / raw)
To: Josh Beavers
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee
On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:
> I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
> first few packets seem to send slowly, but once it gets going the rest
> send quickly.
How are you able to get it to compile? Without that commit ARM32
should blow up?
> Jason, was it intentional that rmb() was removed with no replacement
> in rxe_post_one_recv()? See
> https://github.com/linux-rdma/rdma-core/commit/6b26a9e24739576ac3f4ae308485389a5b285497?diff=split#diff-f6b2d2321c2b3273e3453d055a62fa98
> for details.
Yes, the rmb should have been a wmb, and the atomic_thread_fence added
in advance_producer correctly replaces it.
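
As a rough sketch of that ordering (this is not the rxe provider code;
the demo_* names, the slot type, and the queue size are all made up for
illustration): the producer fills the element first, issues a release
fence, and only then publishes the new producer index, so a consumer
that sees the new index also sees the element contents.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SLOTS 8u	/* hypothetical size; a power of two so indices can be masked */
static_assert((DEMO_SLOTS & (DEMO_SLOTS - 1)) == 0, "DEMO_SLOTS must be a power of two");

/* Hypothetical single-producer/single-consumer ring, not struct rxe_queue. */
struct demo_queue {
	_Atomic(uint32_t)	producer_index;
	_Atomic(uint32_t)	consumer_index;
	uint32_t		index_mask;
	uint64_t		slots[DEMO_SLOTS];
};

static inline void demo_queue_init(struct demo_queue *q)
{
	atomic_init(&q->producer_index, 0);
	atomic_init(&q->consumer_index, 0);
	q->index_mask = DEMO_SLOTS - 1;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
static inline int demo_post(struct demo_queue *q, uint64_t wr_id)
{
	uint32_t prod = atomic_load_explicit(&q->producer_index,
					     memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&q->consumer_index,
					     memory_order_acquire);

	if (((prod + 1) & q->index_mask) == cons)
		return -1;		/* one slot is kept empty to mark "full" */

	q->slots[prod] = wr_id;		/* 1. fill the element */

	/* 2. write barrier: the element must be visible before the index */
	atomic_thread_fence(memory_order_release);

	/* 3. publish the element by advancing the producer index */
	atomic_store_explicit(&q->producer_index,
			      (prod + 1) & q->index_mask,
			      memory_order_relaxed);
	return 0;
}

/* Consumer side: returns 0 and stores the element, or -1 if empty. */
static inline int demo_poll(struct demo_queue *q, uint64_t *wr_id)
{
	uint32_t cons = atomic_load_explicit(&q->consumer_index,
					     memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&q->producer_index,
					     memory_order_acquire);

	if (cons == prod)
		return -1;

	*wr_id = q->slots[cons];
	atomic_store_explicit(&q->consumer_index,
			      (cons + 1) & q->index_mask,
			      memory_order_release);
	return 0;
}
```

A read barrier at step 2 would not order the element store against the
index store, which is why a write barrier is the right primitive on the
posting side.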
> Unfortunately, even after reverting the C11 atomics commit, I still
> seem to observe "Couldn't post send" failures which kill the ping
> occasionally. Is this a known issue?
Sounds like this is unrelated to 6b26..
Jason
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <20170227171549.GG5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 18:59 ` Jason Gunthorpe
0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2017-02-27 18:59 UTC (permalink / raw)
To: Josh Beavers
Cc: Yonatan Cohen, GAFBlizzard, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Youngjae Lee
On Mon, Feb 27, 2017 at 10:15:49AM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 27, 2017 at 12:02:07PM -0500, Josh Beavers wrote:
>
> > I just confirmed that reverting the C11 atomics commit (6b26a9e24739)
> > fixes ibv_rc_pingpong on my two at91 ARM boards. For some reason the
> > first few packets seem to send slowly, but once it gets going the rest
> > send quickly.
>
> How are you able to get it to compile? Without that commit ARM32
> should blow up?
https://github.com/linux-rdma/rdma-core/pull/84
Jason
* Re: ibv_rc_pingpong, rping, and other tools hang with Linux 4.10.0 and rdma-core 13
[not found] ` <20170227162916.GC5891-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2017-02-27 20:38 ` Majd Dibbiny
0 siblings, 0 replies; 7+ messages in thread
From: Majd Dibbiny @ 2017-02-27 20:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Cohen, GAFBlizzard,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Youngjae Lee,
Josh Beavers
> On Feb 27, 2017, at 7:09 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
>
>> On Sun, Feb 26, 2017 at 06:09:34PM +0200, Yonatan Cohen wrote:
>>
>> I bisected the rdma-core library and figured out that the following commit
>> introduced this regression:
>> 6b26a9e24739 Use C11 atomics instead of wmb/rmb macros for CPU-only atomics
>>
>> I haven't debugged this yet and would appreciate Jason's input.
>
> Oops, I think I typo'd it here:
>
> --- a/providers/rxe/rxe_queue.h
> +++ b/providers/rxe/rxe_queue.h
> @@ -37,15 +37,16 @@
> #ifndef H_RXE_PCQ
> #define H_RXE_PCQ
>
> +#include <stdatomic.h>
> +
> /* MUST MATCH kernel struct rxe_pqc in rxe_queue.h */
> struct rxe_queue {
> uint32_t log2_elem_size;
> uint32_t index_mask;
> uint32_t pad_1[30];
> - volatile uint32_t producer_index;
> + _Atomic(uint32_t) producer_index;
> uint32_t pad_2[31];
> - volatile uint32_t consumer_index;
> - uint32_t pad_3[31];
> + _Atomic(uint32_t) consumer_index;
>
>
> Ie deleted pad_3[31] by mistake!
>
> My bad
Thanks Jason for looking into it.
Can you please submit this fix to rdma-core?
>
> Jason