public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
* identify the race condition in this code and win the respect of linux-rdma developers!
@ 2011-09-15 23:58 Hefty, Sean
       [not found] ` <1828884A29C6694DAF28B7E6B8A8237316E5B2B1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hefty, Sean @ 2011-09-15 23:58 UTC (permalink / raw)
  To: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

I have a ping-pong test application that loops doing: send, wait for send completion, wait for receive completion.  The test occasionally hangs in the following code at ibv_get_cq_event() (error handling removed):

Case 1:
	ret = ibv_poll_cq(id->recv_cq, 1, wc);
	ret = ibv_req_notify_cq(id->recv_cq, 0);

	while (!(ret = ibv_poll_cq(id->recv_cq, 1, wc))) {
		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
		ibv_ack_cq_events(id->recv_cq, 1);
	}

The test does NOT hang if the code is restructured as either of the below: 

Case 2:
	ret = ibv_poll_cq(id->recv_cq, 1, wc);
	ret = ibv_req_notify_cq(id->recv_cq, 0);

	do {
		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
		ibv_ack_cq_events(id->recv_cq, 1);
	} while (!(ret = ibv_poll_cq(id->recv_cq, 1, wc)));

or
Case 3:
	ret = ibv_poll_cq(id->recv_cq, 1, wc);
	while (1)
		ret = ibv_req_notify_cq(id->recv_cq, 0);

		ret = ibv_poll_cq(id->recv_cq, 1, wc);
		if (ret)
			break;
		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
		ibv_ack_cq_events(id->recv_cq, 1);
	}

Personally, I would have expected case 2 to hang and cases 1 and 3 to work.  Does anyone see why case 1 should hang?  This is with mlx4.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found] ` <1828884A29C6694DAF28B7E6B8A8237316E5B2B1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-09-16  0:50   ` Jason Gunthorpe
       [not found]     ` <20110916005008.GC6020-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2011-09-16  0:50 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Thu, Sep 15, 2011 at 11:58:34PM +0000, Hefty, Sean wrote:
> I have a ping-pong test application that loops doing: send, wait for
> send completion, wait for receive completion.  The test occasionally
> hangs in the following code at ibv_get_cq_event() (error handling
> removed):

> Case 1:
> 	ret = ibv_poll_cq(id->recv_cq, 1, wc);
> 	ret = ibv_req_notify_cq(id->recv_cq, 0);
> 
> 	while (!(ret = ibv_poll_cq(id->recv_cq, 1, wc))) {
> 		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
> 		ibv_ack_cq_events(id->recv_cq, 1);
> 	}

If ibv_get_cq_event returns here, because the CQ notify triggers but
poll_cq goes not return a WC then you go back into get_cq_event
without requesting notification.

I'm pretty sure ibv_req_notify_cq does nothing if there is already a
notification pending.

> Case 3:
> 	ret = ibv_poll_cq(id->recv_cq, 1, wc);
> 	while (1)
> 		ret = ibv_req_notify_cq(id->recv_cq, 0);
> 
> 		ret = ibv_poll_cq(id->recv_cq, 1, wc);
> 		if (ret)
> 			break;
> 		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
> 		ibv_ack_cq_events(id->recv_cq, 1);
> 	}

Where as this, is correct. The sequence must always be
ibv_req_notify_cq, ibv_poll_cq, ibv_get_cq_event

Case 2 will also fail, but it will be statistically less likely since
you are looking for a WC being added between the poll_cq and
req_notify_cq. While case 1 only requires a run through the loop with
a pending notification but no WC entries.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]     ` <20110916005008.GC6020-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-09-16 14:51       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237316E5B320-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Hefty, Sean @ 2011-09-16 14:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> > Case 1:
> > 	ret = ibv_poll_cq(id->recv_cq, 1, wc);
> > 	ret = ibv_req_notify_cq(id->recv_cq, 0);
> >
> > 	while (!(ret = ibv_poll_cq(id->recv_cq, 1, wc))) {
> > 		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
> > 		ibv_ack_cq_events(id->recv_cq, 1);
> > 	}
> 
> If ibv_get_cq_event returns here, because the CQ notify triggers but
> poll_cq goes not return a WC then you go back into get_cq_event
> without requesting notification.

I understand this, but I still don't quite see how the situation fully occurs.

> I'm pretty sure ibv_req_notify_cq does nothing if there is already a
> notification pending.

This last statement might be it, but I couldn't see that in the code.  I was under the assumption that once an event occurred, the CQ would be 'un-armed'.  ibv_req_notify_cq() would rearm it.  In the above code then, ibv_get_cq_event() would pull the old event, but then a new one should eventually show up with a corresponding completion.  The flow would be:

poll
rearm
-- completion --
while (poll) -> returns cq entry, but leaves event, cq is not armed?

next test iteration

poll
rearm -> should arm cq
while (poll)
	get_event -> returns previous event
	-- loop back around waiting for new event --

this, of course, isn't happening.  Maybe there's a significant delay between the device writing the CQ entry and generating the event, such that the 2nd rearm doesn't do anything, as you suggest..

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237316E5B320-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-09-16 15:59           ` Jason Gunthorpe
       [not found]             ` <20110916155955.GA18548-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2011-09-16 15:59 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Fri, Sep 16, 2011 at 02:51:04PM +0000, Hefty, Sean wrote:
> > > Case 1:
> > > 	ret = ibv_poll_cq(id->recv_cq, 1, wc);
> > > 	ret = ibv_req_notify_cq(id->recv_cq, 0);
> > >
> > > 	while (!(ret = ibv_poll_cq(id->recv_cq, 1, wc))) {
> > > 		ret = ibv_get_cq_event(id->recv_cq_channel, &cq, &context);
> > > 		ibv_ack_cq_events(id->recv_cq, 1);
> > > 	}
> > 
> > If ibv_get_cq_event returns here, because the CQ notify triggers but
> > poll_cq goes not return a WC then you go back into get_cq_event
> > without requesting notification.
> 
> I understand this, but I still don't quite see how the situation fully occurs.
> 
> > I'm pretty sure ibv_req_notify_cq does nothing if there is already a
> > notification pending.
> 
> This last statement might be it, but I couldn't see that in the
> code.  I was under the assumption that once an event occurred, the
> CQ would be 'un-armed'.  ibv_req_notify_cq() would rearm it.  In the
> above code then, ibv_get_cq_event() would pull the old event, but
> then a new one should eventually show up with a corresponding
> completion.  The flow would be:

As I see it the problem flow would be this:

poll
ibv_req_notify_cq
// HCA Write CQ Entry
// Trigger EVENT
poll -> return CQ

poll
ibv_req_notify_cq // does nothing, event is already triggered
poll
ibv_get_cq_event // returns immediately due to prior Trigger event
poll
ibv_get_cq_event // blocks because the CQ was never rearmed.

Which would happen easily and frequently without having to invoke any
spooky infrequent racing.

poll and cq_events are totally independent, I belive the implementation
of CQ events is like:

enum {DISABLED,ARMED,TRIGGERED} flag;

And the only sane, race free implementation of req_notify must thus be

if (flag == DISABLED) flag = ARMED

And for CQ entry add:
if (flag == ARMED) flag = TRIGGERED

And for get_cq_event:

if (flag == TRIGGERED) {flag = DISABLED; return;}

What you are imagining is that ibv_req_notify_cq does this:

if (flag == ARMED || flag == TRIGGERED) flag = ARMED

which is not a good choice because it will destroy a triggered state
and that creates all sorts of bad race conditions, again because of
the independence of poll and cq_event.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]             ` <20110916155955.GA18548-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2011-09-16 16:57               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B3B8-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2011-09-16 19:36               ` Hefty, Sean
  1 sibling, 1 reply; 8+ messages in thread
From: Hefty, Sean @ 2011-09-16 16:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> poll and cq_events are totally independent, I belive the implementation
> of CQ events is like:
> 
> enum {DISABLED,ARMED,TRIGGERED} flag;

I was suggesting that there were only 2 states, not 3, with ibv_get_cq_event() simply retrieving a queued event from the kernel without touching the CQ state.

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B3B8-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-09-16 17:21                   ` Hefty, Sean
  0 siblings, 0 replies; 8+ messages in thread
From: Hefty, Sean @ 2011-09-16 17:21 UTC (permalink / raw)
  To: Hefty, Sean, Jason Gunthorpe
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

Maybe someone at Mellanox can provide some guidance here.  Here are the arm/get_event calls from the libmlx4. 

int mlx4_arm_cq(struct ibv_cq *ibvcq, int solicited)
{
	...
	sn  = cq->arm_sn & 3;
	ci  = cq->cons_index & 0xffffff;
	cmd = solicited ? MLX4_CQ_DB_REQ_NOT_SOL : MLX4_CQ_DB_REQ_NOT;

	*cq->arm_db = htonl(sn << 28 | cmd | ci);

	wmb();

	doorbell[0] = htonl(sn << 28 | cmd | cq->cqn);
	doorbell[1] = htonl(ci);
	mlx4_write64(doorbell, to_mctx(ibvcq->context), MLX4_CQ_DOORBELL);
	return 0;
}

void mlx4_cq_event(struct ibv_cq *cq)
{
	to_mcq(cq)->arm_sn++;
}

What exactly is the role of arm_sn?

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]             ` <20110916155955.GA18548-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2011-09-16 16:57               ` Hefty, Sean
@ 2011-09-16 19:36               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B55E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: Hefty, Sean @ 2011-09-16 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

> As I see it the problem flow would be this:
> 
> poll
> ibv_req_notify_cq
> // HCA Write CQ Entry
> // Trigger EVENT

Since the trigger event must be separate from writing the cq entry, if it doesn't happen until...

> poll -> return CQ
> 
> poll
> ibv_req_notify_cq // does nothing, event is already triggered

here, then the situation you describe can occur based on the spec.  And if we allow an implementation to make ibv_req_notify_cq a no-op if there is an event outstanding, the race condition is easier to hit.

> poll
> ibv_get_cq_event // returns immediately due to prior Trigger event
> poll
> ibv_get_cq_event // blocks because the CQ was never rearmed.

I changed the code to do this:

	do {
		poll
		rearm
		poll
		get_event
	} while (1);

which I believe fixes the issue.  Thanks for the help.  You are now the winner of my respect.  :)

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: identify the race condition in this code and win the respect of linux-rdma developers!
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B55E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2011-09-16 19:57                   ` Jason Gunthorpe
  0 siblings, 0 replies; 8+ messages in thread
From: Jason Gunthorpe @ 2011-09-16 19:57 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)

On Fri, Sep 16, 2011 at 07:36:17PM +0000, Hefty, Sean wrote:

> I was suggesting that there were only 2 states, not 3, with
> ibv_get_cq_event() simply retrieving a queued event from the kernel
> without touching the CQ state.

But that isn't true, as you point out in your next message
ibv_get_cq_event calls mlx4_cq_event which does alter the CQ state,
particularly in how it behaves WRT to mlx4_arm_cq.

Hard to say what the arm_sn does, but my guess is that it is creating
a one-shot that is sensitive to ibv_get_cq_event, eg you will never
get more than one event queued in the kernel queue, which creates the
3 state system I described.. (at least that matches my experience)

Certainly, considering that arm_sn is a 3 bit counter, calling
mlx4_cq_event many times then calling arm_cq will alias the counter
from the point of view of the chip and that can't possibly be good.

> I changed the code to do this:
> 
> 	do {
> 		poll
> 		rearm
> 		poll
> 		get_event
> 	} while (1);
> 
> which I believe fixes the issue.

In the past I looked at all this and concluded this was the only
possible correct way to write the loop. You have to guarentee rearm is
called before every get_event and you have to do a poll after every
rearm.

> Thanks for the help.  You are now the winner of my respect.  :)

NP :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-09-16 19:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-09-15 23:58 identify the race condition in this code and win the respect of linux-rdma developers! Hefty, Sean
     [not found] ` <1828884A29C6694DAF28B7E6B8A8237316E5B2B1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-09-16  0:50   ` Jason Gunthorpe
     [not found]     ` <20110916005008.GC6020-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-09-16 14:51       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A8237316E5B320-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-09-16 15:59           ` Jason Gunthorpe
     [not found]             ` <20110916155955.GA18548-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2011-09-16 16:57               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B3B8-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-09-16 17:21                   ` Hefty, Sean
2011-09-16 19:36               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237316E5B55E-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2011-09-16 19:57                   ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox