public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
@ 2013-05-21 15:07 Or Gerlitz
       [not found] ` <519B8DB3.3010500-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Or Gerlitz @ 2013-05-21 15:07 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Alex Rosenbaum,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

Hi Sean,

We have a user-space application with an M (clients) x N (servers) RC
connectivity pattern using librdmacm. Basically, there are N nodes,
each running M client processes, and each client connects to all N
servers.

Under some as-yet-unknown conditions, many of the clients' connection
attempts fail with an RDMA_CM_EVENT_UNREACHABLE event whose status is
-ETIMEDOUT.  Looking at the rdma-cm kernel code, I see that the only
location which generates this event is cma_ib_handler, on receiving
IB_CM_REQ_ERROR (or IB_CM_REP_ERROR).

Digging down into the CM, I see that the only place where
IB_CM_REQ_ERROR is delivered is cm_process_send_error, which is called
when the status of a MAD send completion is neither success nor flush.

Digging down into the MAD code and the CM's usage of it, I see that
the MAD code will invoke the send completion handler with the
IB_WC_RESP_TIMEOUT_ERR status, and that the CM code programs the number
of retries set by its consumer (rdma-cm in this case) into the MAD send
buffer.
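
For reference, the per-try MAD timeout and the total retry window can
be sketched with the kernel's approximate IBTA-time conversion. This is
a back-of-envelope Python sketch; the exponent value 20 below is purely
illustrative, not taken from our setup:

```python
def cm_convert_to_ms(iba_time: int) -> int:
    """Approximate 4.096us * 2^iba_time as milliseconds, as the kernel
    CM does (cm_convert_to_ms in drivers/infiniband/core/cm.c):
    4.096us * 256 ~= 1.05ms, so shifting by (iba_time - 8) is close."""
    return 1 << max(iba_time - 8, 0)

def worst_case_wait_ms(per_try_timeout_ms: int, retries: int) -> int:
    # The MAD layer sends the request once, then retransmits up to
    # `retries` times, each attempt waiting `per_try_timeout_ms`.
    return (retries + 1) * per_try_timeout_ms

per_try = cm_convert_to_ms(20)          # illustrative exponent -> 4096 ms
print(per_try)                          # 4096
print(worst_case_wait_ms(per_try, 15))  # 65536 (~65s worst case)
```

So with the 15 retries mentioned below, a REQ can linger for on the
order of a minute before the UNREACHABLE event fires, depending on the
programmed timeout exponent.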

Running this over an M=8, N=4 setup, i.e. four nodes, each running one
server process and eight client processes, and sampling the IB CM
counters before and after the job (summing the numbers from the four
nodes), we see the following:

cm_tx_msgs.req    = 395
cm_tx_retries.req = 270
cm_rx_msgs.req    = 390

cm_tx_msgs.rep    = 375
cm_tx_retries.rep = 255
cm_rx_msgs.rep    = 380

cm_tx_msgs.rtu    = 108
cm_rx_msgs.rtu    = 103

cm_tx_msgs.mra    = 540
cm_rx_msgs.mra    = 270
cm_tx_retries.mra = 270

In cm_send_handler we see that the CM TX retry counter is incremented
by the number of retries reported by the MAD layer. I also see that
the RDMA-CM programs the CM for 15 retries, and the CM in turn
programs this into the MAD send buffers.

 From the RTU counters it's clear that at most ~100 connections got
established out of the expected 128.
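
Assuming the tx counters include retransmits while the retry counters
count only the retransmits (our reading of cm_send_handler, not a
documented guarantee), the numbers above can be reconciled like this:

```python
# Back-of-envelope reconciliation of the CM counters quoted above.
# Assumption: cm_tx_msgs includes retransmits; cm_tx_retries counts
# only the retransmits.
counters = {
    "req": {"tx": 395, "tx_retries": 270, "rx": 390},
    "rep": {"tx": 375, "tx_retries": 255, "rx": 380},
    "rtu": {"tx": 108, "tx_retries": 0,   "rx": 103},
}

clients = 8 * 4                # 8 client processes on each of 4 nodes
servers = 4                    # one server process per node
expected = clients * servers   # 128 connection attempts in total
per_server_reqs = clients      # each server sees 32 REQs, well under
                               # the ucma backlog of 128

unique_req = counters["req"]["tx"] - counters["req"]["tx_retries"]
print(unique_req, expected)    # ~125 unique REQs vs 128 attempts

# rx_msgs.req (390) is close to tx_msgs.req (395), so nearly every
# REQ MAD, retransmits included, reached a peer: the REQs arrive but
# are not answered in time, pointing at slow servers, not MAD loss.
```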

One thing seen in the nodes' dmesg is a message from an old patch of
yours which exists in OFED 1.5.3 but didn't make it (or wasn't
accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 >
8192, decreasing used timeout_ms". Does this provide any insight into
the problem?

One more piece of info: these apps don't call rdma_disconnect at all.
When they are done, or if something goes wrong (e.g. that unreachable
event), they simply issue rdma_destroy_id, which as far as I can see
in the rdma-cm/cm code reaches a CM function that sends a DREQ (if the
ID is in the established state) and puts the ID into the timewait
state.
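
That destroy path can be sketched as a tiny state machine. This is a
hypothetical Python sketch of the behavior just described, not the
kernel code:

```python
# Sketch of the teardown decision on rdma_destroy_id: the CM sends a
# DREQ only if the connection reached the established state, then
# parks the ID in TIMEWAIT; otherwise it just tears the ID down.
from enum import Enum, auto

class CmState(Enum):
    IDLE = auto()
    REQ_SENT = auto()
    ESTABLISHED = auto()
    TIMEWAIT = auto()

def destroy_id(state: CmState):
    """Return (next_state, dreq_sent)."""
    if state is CmState.ESTABLISHED:
        return CmState.TIMEWAIT, True   # graceful: DREQ, then timewait
    return CmState.IDLE, False          # never connected: no DREQ

print(destroy_id(CmState.ESTABLISHED))  # (CmState.TIMEWAIT, True)
print(destroy_id(CmState.REQ_SENT))     # (CmState.IDLE, False)
```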

So it seems we're not losing MADs. Also, on the stack they use (that
1.5.3) the ucma backlog size is 128, but each server process gets only
32 requests (8x4), so we don't think ucma is dropping REQs for lack of
backlog budget.

Or.







--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found] ` <519B8DB3.3010500-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-05-21 15:24   ` Hefty, Sean
       [not found]     ` <1828884A29C6694DAF28B7E6B8A823736FD29AE3-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Hefty, Sean @ 2013-05-21 15:24 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alex Rosenbaum,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

> Under some as-yet-unknown conditions, many of the clients' connection
> attempts fail with an RDMA_CM_EVENT_UNREACHABLE event whose status is
> -ETIMEDOUT.  Looking at the rdma-cm kernel code, I see that the only
> location which generates this event is cma_ib_handler, on receiving
> IB_CM_REQ_ERROR (or IB_CM_REP_ERROR).

Does the server continue to run and accept connections while this test runs?
 
> Digging down into the CM, I see that the only place where
> IB_CM_REQ_ERROR is delivered is cm_process_send_error, which is called
> when the status of a MAD send completion is neither success nor flush.
> 
> Digging down into the MAD code and the CM's usage of it, I see that
> the MAD code will invoke the send completion handler with the
> IB_WC_RESP_TIMEOUT_ERR status, and that the CM code programs the number
> of retries set by its consumer (rdma-cm in this case) into the MAD send
> buffer.

Correct - it appears that the connection requests are timing out.
 
> Running this over an M=8, N=4 setup, i.e. four nodes, each running one
> server process and eight client processes, and sampling the IB CM
> counters before and after the job (summing the numbers from the four
> nodes), we see the following:

This is such a small number of nodes/clients, that it shouldn't be related to scaling.  The normal timeout/retry mechanism should work fine.
 
> cm_tx_msgs.req    = 395
> cm_tx_retries.req = 270
> cm_rx_msgs.req    = 390
> 
> cm_tx_msgs.rep    = 375
> cm_tx_retries.rep = 255
> cm_rx_msgs.rep    = 380
> 
> cm_tx_msgs.rtu    = 108
> cm_rx_msgs.rtu    = 103
> 
> cm_tx_msgs.mra    = 540
> cm_rx_msgs.mra    = 270
> cm_tx_retries.mra = 270
> 
> In cm_send_handler we see that the CM TX retry counter is incremented
> by the number of retries reported by the MAD layer. I also see that
> the RDMA-CM programs the CM for 15 retries, and the CM in turn
> programs this into the MAD send buffers.
> 
>  From the RTU counters it's clear that at most ~100 connections got
> established out of the expected 128.
> 
> One thing seen in the nodes' dmesg is a message from an old patch of
> yours which exists in OFED 1.5.3 but didn't make it (or wasn't
> accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 >
> 8192, decreasing used timeout_ms". Does this provide any insight into
> the problem?

I don't remember this patch at all.  The upstream kernel will respond to duplicate CM REQs by sending an MRA, which should increase the connection timeout.  This was added to handle servers which were slow to accept connection requests.

> One more piece of info: these apps don't call rdma_disconnect at all.
> When they are done, or if something goes wrong (e.g. that unreachable
> event), they simply issue rdma_destroy_id, which as far as I can see
> in the rdma-cm/cm code reaches a CM function that sends a DREQ (if the
> ID is in the established state) and puts the ID into the timewait
> state.

Calling rdma_disconnect shouldn't matter.  Destroy will send the DREQ/DREP if the user hasn't disconnected.
 
> So it seems we're not losing MADs. Also, on the stack they use (that
> 1.5.3) the ucma backlog size is 128, but each server process gets only
> 32 requests (8x4), so we don't think ucma is dropping REQs for lack of
> backlog budget.

My first guess is that the server isn't responding to new requests.

- Sean


* Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found]     ` <1828884A29C6694DAF28B7E6B8A823736FD29AE3-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-05-21 15:25       ` Or Gerlitz
  2013-05-21 18:21       ` Or Gerlitz
  2013-05-23 10:31       ` Alex Rosenbaum
  2 siblings, 0 replies; 7+ messages in thread
From: Or Gerlitz @ 2013-05-21 15:25 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Alex Rosenbaum,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

On 21/05/2013 18:24, Hefty, Sean wrote:
> I don't remember this patch at all.

Alex, can you please send Sean this patch?


* Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found]     ` <1828884A29C6694DAF28B7E6B8A823736FD29AE3-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-05-21 15:25       ` Or Gerlitz
@ 2013-05-21 18:21       ` Or Gerlitz
       [not found]         ` <CAJZOPZJ44fgPtBHpu5eXSVUQb0zP7rJH1UvL2RneDCrhGVLSwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-05-23 10:31       ` Alex Rosenbaum
  2 siblings, 1 reply; 7+ messages in thread
From: Or Gerlitz @ 2013-05-21 18:21 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, Alex Rosenbaum,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

On Tue, May 21, 2013 at 6:24 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

>> One thing seen in the nodes' dmesg is a message from an old patch of
>> yours which exists in OFED 1.5.3 but didn't make it (or wasn't
>> accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 >
>> 8192, decreasing used timeout_ms". Does this provide any insight into
>> the problem?

> I don't remember this patch at all.

Alex sent it to you. Is that something which is missing upstream, or,
alternatively, could it create trouble on the OFED stack where it's
applied?


> My first guess is that the server isn't responding to new requests.

Yep, smells like this could be the root cause here. Dina and Alex will
do some tweaking of the server code to make sure there's no starvation
in servicing new connection requests.

Or.


* RE: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found]         ` <CAJZOPZJ44fgPtBHpu5eXSVUQb0zP7rJH1UvL2RneDCrhGVLSwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-05-21 18:54           ` Hefty, Sean
  0 siblings, 0 replies; 7+ messages in thread
From: Hefty, Sean @ 2013-05-21 18:54 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, Alex Rosenbaum,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

> >> One thing seen in the nodes' dmesg is a message from an old patch of
> >> yours which exists in OFED 1.5.3 but didn't make it (or wasn't
> >> accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 >
> >> 8192, decreasing used timeout_ms". Does this provide any insight into
> >> the problem?
> 
> > I don't remember this patch at all.
> 
> Alex sent it to you, is that something which is missing upstream or
> alternatively could create troubles on that ofed stack where its
> applied?

I saw the patch and recall it now.  It was a fix for an SRP target and shouldn't affect your tests or cause any issues.


* Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found]     ` <1828884A29C6694DAF28B7E6B8A823736FD29AE3-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-05-21 15:25       ` Or Gerlitz
  2013-05-21 18:21       ` Or Gerlitz
@ 2013-05-23 10:31       ` Alex Rosenbaum
       [not found]         ` <519DF00C.9010304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 7+ messages in thread
From: Alex Rosenbaum @ 2013-05-23 10:31 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

On 5/21/2013 6:24 PM, Hefty, Sean wrote:
> My first guess is that the server isn't responding to new requests. - 
> Sean 

This is where we're looking now.
Now testing on 17 servers with 8 clients per server.

When disabling all RDMA traffic in the test, we get 100% of RDMA
connections established. So at least we know this is not some
fundamental issue with our setup.

After modifying our code to raise the priority of RDMA connection
handling above the RDMA traffic (CQ completion handling), we still see
many UNREACHABLE events, but only after quite a few clients got
connected and started pushing traffic (1 GB RDMA WRITEs from server to
client).

We are now adding code (via the conn_attr private data) to compare
timestamps between rdma_connect, RDMA_CM_EVENT_CONNECT_REQUEST,
rdma_accept, and the client-side UNREACHABLE or CONNECTED events.
We'll have a better understanding once we see these results.
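
For what it's worth, packing such a timestamp into the CM private data
could look like the following hypothetical sketch (the field layout is
our own invention; the ~56-byte budget is the commonly cited limit for
IB RC REQ private data through librdmacm):

```python
# Hypothetical sketch: carry a client-side connect timestamp in the
# rdma_connect private data so both sides can compare event timings.
import struct
import time

PRIVATE_DATA_MAX = 56  # approx. IB RC REQ private-data budget via librdmacm

def pack_connect_ts() -> bytes:
    """Encode a monotonic nanosecond timestamp as 8 big-endian bytes."""
    data = struct.pack("!Q", time.monotonic_ns())
    assert len(data) <= PRIVATE_DATA_MAX
    return data

def unpack_connect_ts(data: bytes) -> int:
    """Decode the timestamp on the passive side (from the REQ event)."""
    (ts_ns,) = struct.unpack("!Q", data[:8])
    return ts_ns
```

The server would read this out of the connect-request event's private
data and subtract it from its own clock at rdma_accept time (which only
makes sense if the hosts' clocks are comparable or the same clock base
is used on both ends).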

thanks,

Alex


* Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme
       [not found]         ` <519DF00C.9010304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-05-26 14:46           ` Alex Rosenbaum
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Rosenbaum @ 2013-05-26 14:46 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Dina Leventol

On 5/23/2013 1:31 PM, Alex Rosenbaum wrote:
> On 5/21/2013 6:24 PM, Hefty, Sean wrote:
>> My first guess is that the server isn't responding to new requests. - 
>> Sean 
>
> This is where we're looking now.
> Now testing on 17 servers with 8 clients per server.
>
> When disabling all RDMA traffic in the test, we get 100% of RDMA
> connections established. So at least we know this is not some
> fundamental issue with our setup.
>
> After modifying our code to raise the priority of RDMA connection
> handling above the RDMA traffic (CQ completion handling), we still see
> many UNREACHABLE events, but only after quite a few clients got
> connected and started pushing traffic (1 GB RDMA WRITEs from server to
> client).
>
> We are now adding code (via the conn_attr private data) to compare
> timestamps between rdma_connect, RDMA_CM_EVENT_CONNECT_REQUEST,
> rdma_accept, and the client-side UNREACHABLE or CONNECTED events.
> We'll have a better understanding once we see these results.
>
> thanks,
>
> Alex
We found the piece of code that caused the server to hang long enough
for the client-side rdma_connect() to fail, despite the retries, with
RDMA_CM_EVENT_UNREACHABLE (-ETIMEDOUT).
OK, case closed.



end of thread, other threads:[~2013-05-26 14:46 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-21 15:07 better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme Or Gerlitz
     [not found] ` <519B8DB3.3010500-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-05-21 15:24   ` Hefty, Sean
     [not found]     ` <1828884A29C6694DAF28B7E6B8A823736FD29AE3-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-05-21 15:25       ` Or Gerlitz
2013-05-21 18:21       ` Or Gerlitz
     [not found]         ` <CAJZOPZJ44fgPtBHpu5eXSVUQb0zP7rJH1UvL2RneDCrhGVLSwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-05-21 18:54           ` Hefty, Sean
2013-05-23 10:31       ` Alex Rosenbaum
     [not found]         ` <519DF00C.9010304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-05-26 14:46           ` Alex Rosenbaum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox