* cq error timeout issue
@ 2011-10-30 12:51 Vlad Weinbaum
From: Vlad Weinbaum @ 2011-10-30 12:51 UTC
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
I try to use ibv_poll_cq to identify connectivity problems. The
scenario is following, based on modified rping example:
1) preliminary steps done and rdma connection established between
Client and Server, retry_count in rdma_conn_param is set 1;
2) Server lost its link (corresponding switch port disabled), Client
is still connected to the switch;
3) Client calls ibv_post_send
4) Client polls cq with ibv_poll_cq and gets expected
IBV_WC_RETRY_EXC_ERR after about 1 second.
Can this timeout be decreased? If it is impossible, can you suggest
something else?
Used software is Ofed-1.5.3.1, IB device is Mellanox MT26428.
Thanks,
Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* RE: cq error timeout issue
@ 2011-10-31 6:51 Hefty, Sean
From: Hefty, Sean @ 2011-10-31 6:51 UTC
To: Vlad Weinbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> I try to use ibv_poll_cq to identify connectivity problems. The
> scenario is following, based on modified rping example:
>
> 1) preliminary steps done and rdma connection established between
> Client and Server, retry_count in rdma_conn_param is set 1;
> 2) Server lost its link (corresponding switch port disabled), Client
> is still connected to the switch;
> 3) Client calls ibv_post_send
> 4) Client polls cq with ibv_poll_cq and gets expected
> IBV_WC_RETRY_EXC_ERR after about 1 second.
>
> Can this timeout be decreased? If it is impossible, can you suggest
> something else?

I don't believe easily. The timeout is based on the path record returned by the SM, which is really what an app should use. If you can adjust the timeout at the SM, that would be best.

If you can use a newer kernel, another alternative is to use rdma_set_option to provide your own path record as input in place of calling rdma_resolve_route.

Btw, with a small timeout and few retries, if you're not using QoS, you may want to enable that to prevent false timeouts.

- Sean
* Re: cq error timeout issue
@ 2011-10-31 7:08 Vlad Weinbaum
From: Vlad Weinbaum @ 2011-10-31 7:08 UTC
To: Hefty, Sean
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Thank you Sean,
I have an access to OpenSM configuration on the switch in my testing
environment. I tried to reduce the timeout there (subnet_timeout,
packet_life_time), but unsuccessfully.
Btw, I found detail that I cannot explain. I query the QP after connect
and get timeout value 16, that must be 4 us * 2^16 = 256 ms, but I get
about 800 ms.
Thanks,
Vlad

On Mon, Oct 31, 2011 at 8:51 AM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> I try to use ibv_poll_cq to identify connectivity problems. The
>> scenario is following, based on modified rping example:
>>
>> 1) preliminary steps done and rdma connection established between
>> Client and Server, retry_count in rdma_conn_param is set 1;
>> 2) Server lost its link (corresponding switch port disabled), Client
>> is still connected to the switch;
>> 3) Client calls ibv_post_send
>> 4) Client polls cq with ibv_poll_cq and gets expected
>> IBV_WC_RETRY_EXC_ERR after about 1 second.
>>
>> Can this timeout be decreased? If it is impossible, can you suggest
>> something else?
>
> I don't believe easily. The timeout is based on the path record returned by the SM, which is really what an app should use. If you can adjust the timeout at the SM, that would be best.
>
> If you can use a newer kernel, another alternative is to use rdma_set_option to provide your own path record as input in place of calling rdma_resolve_route.
>
> Btw, with a small timeout and few retries, if you're not using QoS, you may want to enable that to prevent false timeouts.
>
> - Sean
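
[Editor's sketch] The nominal figure in Vlad's message follows from the verbs encoding of the QP timeout attribute, 4.096 usec * 2^timeout. A minimal C sketch of that arithmetic is below; the function name ack_timeout_ns is hypothetical and not part of libibverbs. With the exact 4.096 usec base, a value of 16 gives roughly 268 ms rather than 256 ms, which is still well short of the observed 800 ms.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Nominal Local ACK Timeout in nanoseconds for the 5-bit QP attribute
 * `timeout`, using 4.096 usec * 2^timeout (4.096 usec == 4096 ns).
 * Hypothetical helper for illustration only.
 * A value of 0 means "wait forever" in the verbs API; it is reported
 * here as 0 and must be special-cased by the caller.
 */
uint64_t ack_timeout_ns(uint8_t timeout)
{
    assert(timeout <= 31);          /* the attribute is 5 bits wide */
    if (timeout == 0)
        return 0;
    return 4096ULL << timeout;      /* 4096 ns * 2^timeout */
}
```

For example, ack_timeout_ns(16) is 268435456 ns, about 268.4 ms.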
* Re: cq error timeout issue
@ 2011-10-31 13:00 Or Gerlitz
From: Or Gerlitz @ 2011-10-31 13:00 UTC
To: Vlad Weinbaum
Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Oct 31, 2011 at 9:08 AM, Vlad Weinbaum
<vlad.weinbaum-adSD+vzF2QdBDgjK7y7TUQ@public.gmane.org> wrote:
> [...] I found detail that I cannot explain. I query the QP after connect and get timeout value 16,
> that must be 4 us * 2^16 = 256 ms, but I get about 800 ms.

As Sean indicated, the timeout is **based** on the packet_life_time,
in case you're configuring
your QP through the rdma-cm, the IB stack code actually adds one to
the packet_life_time quantity
as a rough estimate for the local hca ack delay, still maybe there is
a hole here and the query qp verb
isn't reporting correctly, what was the value you configured on the sm
side, and how many retries did
you use for the qp? each retry will double the time you should be
observing in practice.

Or.
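
[Editor's sketch] Or's description can be turned into a rough model: the rdma-cm programs packet_life_time + 1 as the QP timeout, and the initial transmission plus each retry waits one timeout before the error completion is raised. The C sketch below encodes only that model; both function names are hypothetical, the kernel's real calculation also takes the HCA's ack delay into account, and actual HCA behaviour may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption taken from the thread: the QP timeout is the path record's
 * packet lifetime plus one, clamped to the 5-bit maximum of 31.
 */
uint8_t cm_timeout_from_pkt_life(uint8_t packet_life_time)
{
    unsigned int t = (unsigned int)packet_life_time + 1u;
    return (uint8_t)(t > 31u ? 31u : t);
}

/*
 * Rough model (an assumption, not measured behaviour): the first
 * transmission and each of the retry_cnt retransmissions wait one full
 * timeout of 4.096 usec * 2^timeout before giving up.
 */
uint64_t modeled_wait_ns(uint8_t timeout, uint8_t retry_cnt)
{
    assert(timeout >= 1 && timeout <= 31);
    assert(retry_cnt <= 7);         /* retry_cnt is a 3-bit attribute */
    return (4096ULL << timeout) * (uint64_t)(retry_cnt + 1u);
}
```

With timeout 16 and retry_count 1 this model gives 536870912 ns, about 537 ms, so on its own it does not account for the roughly 800 ms Vlad observed.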
* Re: cq error timeout issue
@ 2011-11-01 13:25 Vlad Weinbaum
From: Vlad Weinbaum @ 2011-11-01 13:25 UTC
To: Or Gerlitz
Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi,
packet_life_time is 12 in sm.conf.
I'm debugging the polling process and see that ibv_poll_cq returns
error when mlx4_cqe->owner_sr_opcode gets MLX4_CQE_OPCODE_ERROR value
(in libmlx4, mlx4_poll_one).
Where this value came from?
Thanks,
Vlad

On Mon, Oct 31, 2011 at 3:00 PM, Or Gerlitz <or.gerlitz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, Oct 31, 2011 at 9:08 AM, Vlad Weinbaum
> <vlad.weinbaum-adSD+vzF2QdBDgjK7y7TUQ@public.gmane.org> wrote:
>> [...] I found detail that I cannot explain. I query the QP after connect and get timeout value 16,
>> that must be 4 us * 2^16 = 256 ms, but I get about 800 ms.
>
> As Sean indicated, the timeout is **based** on the packet_life_time,
> in case you're configuring
> your QP through the rdma-cm, the IB stack code actually adds one to
> the packet_life_time quantity
> as a rough estimate for the local hca ack delay, still maybe there is
> a hole here and the query qp verb
> isn't reporting correctly, what was the value you configured on the sm
> side, and how many retries did
> you use for the qp? each retry will double the time you should be
> observing in practice.
>
> Or.
* Re: cq error timeout issue
@ 2011-11-01 14:04 Hal Rosenstock
From: Hal Rosenstock @ 2011-11-01 14:04 UTC
To: Vlad Weinbaum
Cc: Or Gerlitz, Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Just to clarify:

On 11/1/2011 9:25 AM, Vlad Weinbaum wrote:
> packet_life_time is 12 in sm.conf.

OpenSM packet_life_time setting has nothing to do with this (but is
used for SwitchInfo:LifeTimeValue setting):

# The code of maximal time a packet can live in a switch
# The actual time is 4.096usec * 2^<packet_life_time>
# The value 0x14 disables this mechanism

The rate in SA PathRecord comes from subnet timeout in the absence of QoS:

	/*
	 * Set packet lifetime.
	 * According to spec definition IBA 1.2 Table 205
	 * PacketLifeTime description, for loopback paths,
	 * packetLifeTime shall be zero.
	 */
	if (p_src_port == p_dest_port)
		pkt_life = 0;
	else if (p_qos_level && p_qos_level->pkt_life_set)
		pkt_life = p_qos_level->pkt_life;
	else
		pkt_life = sa->p_subn->opt.subnet_timeout;

-- Hal
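
[Editor's sketch] The selection logic Hal quotes can be restated as a stand-alone function to make the precedence explicit: loopback first, then a QoS-level override, then the subnet timeout. The struct and function below are hypothetical stand-ins for illustration, not the OpenSM types; note that the sm.conf packet_life_time value does not appear anywhere in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a QoS level: only the two fields the quoted code reads. */
struct qos_level {
    int     pkt_life_set;   /* non-zero if this level overrides packet lifetime */
    uint8_t pkt_life;       /* override value, used only when pkt_life_set != 0 */
};

/*
 * Mirrors the precedence in the quoted snippet:
 *   1. loopback path        -> 0
 *   2. QoS level override   -> qos->pkt_life
 *   3. otherwise            -> subnet_timeout
 */
uint8_t select_pkt_life(int is_loopback,
                        const struct qos_level *qos,
                        uint8_t subnet_timeout)
{
    if (is_loopback)
        return 0;
    if (qos != NULL && qos->pkt_life_set)
        return qos->pkt_life;
    return subnet_timeout;
}
```

Under this reading, lowering the path record's packet lifetime without QoS means lowering subnet_timeout on the SM, or defining a QoS level that sets its own packet lifetime.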