* IB_POLL_DIRECT
From: Bob Pearson @ 2023-06-06 20:54 UTC
  To: Jason Gunthorpe, linux-rdma@vger.kernel.org, Bernard Metzler,
	Bart Van Assche

Jason,

Both the rxe driver and the siw driver exhibit failures on my machine when running the blktests srp
test suite on the for-next branch. This has been true for months, so I decided to try again to track it down.
After a lot of tracing, it looks like the built-in cq handling in core/cq.c stops processing
some completion queues.

The traffic is between the srp driver and the srpt driver. The srpt driver uses

	cq = ib_cq_pool_get(..., IB_POLL_WORKQUEUE) and

the srp driver uses

	cq = ib_alloc_cq(..., IB_POLL_SOFTIRQ) for receive cqs and
	cq = ib_alloc_cq(..., IB_POLL_DIRECT) for send cqs.

AFAIK the poll workqueue and poll softirq cqs are working correctly, but the poll direct cq sometimes
loses the thread and just stops processing those cqs. The test cases sometimes recover after about
a 2 second delay and start processing again, but eventually fail after about a 10 second delay,
clean up and go home.

The failures feel like a race or at least are timing sensitive. If you run the test suite several times
various test cases will sometimes succeed and sometimes fail. But they always fail in the same way.

Looking at the mlxn drivers for inspiration, I don't see anything specific about IB_POLL_DIRECT except
that they have a private version of send_queue_drain which also calls a cqe drain function which calls
ib_process_cq_direct() in a loop until the cq is drained. But this is only during qp tear down. (No other
verbs driver does this but as far as I know no other driver is passing blktests.) This is only done for
IB_POLL_DIRECT, so I wonder, is this required to use that correctly?
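
For reference, a drain loop of the kind described above can be sketched in user space roughly as follows. This is a toy model only: `toy_cq`, `toy_poll()`, `toy_drain()` and the marker entry are hypothetical names invented for illustration, and using a marker to detect the end of the drain is an assumption of this sketch, not something taken from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CQ_DEPTH     16	/* hypothetical ring size */
#define TOY_DRAIN_MARKER (-1)	/* hypothetical id of the final "drain" entry */

/* Toy stand-in for a directly polled CQ. Not a kernel type. */
struct toy_cq {
	int wr_id[TOY_CQ_DEPTH];	/* unreaped completions */
	size_t head, tail;		/* consumer and producer positions */
	bool drained;			/* set once the marker has been reaped */
};

/* Device side: queue one completion; returns -1 if the ring is full. */
static int toy_cq_complete(struct toy_cq *cq, int wr_id)
{
	if (cq->tail - cq->head == TOY_CQ_DEPTH)
		return -1;
	cq->wr_id[cq->tail++ % TOY_CQ_DEPTH] = wr_id;
	return 0;
}

/* Consumer side: reap at most budget entries; returns how many were reaped. */
static int toy_poll(struct toy_cq *cq, int budget)
{
	int n = 0;

	assert(budget > 0);
	while (n < budget && cq->head != cq->tail) {
		if (cq->wr_id[cq->head++ % TOY_CQ_DEPTH] == TOY_DRAIN_MARKER)
			cq->drained = true;
		n++;
	}
	return n;
}

/* Teardown: queue a marker, then poll in a loop until it has been reaped. */
static int toy_drain(struct toy_cq *cq)
{
	int total = 0;

	if (toy_cq_complete(cq, TOY_DRAIN_MARKER) != 0)
		return -1;
	while (!cq->drained)
		total += toy_poll(cq, 4);
	return total;	/* includes the marker itself */
}
```

The point of the sketch is only that nothing empties the queue unless somebody polls it, so teardown has to do the polling itself.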

I am still figuring out how IB_POLL_DIRECT works. It doesn't allow the driver to call cq->comp_handler so
I don't know how it figures out when there are new wcs to process.

Any ideas would be really helpful.

Bob

* Re: IB_POLL_DIRECT
From: Jason Gunthorpe @ 2023-06-07  0:21 UTC
  To: Bob Pearson; +Cc: linux-rdma@vger.kernel.org, Bernard Metzler, Bart Van Assche

On Tue, Jun 06, 2023 at 03:54:25PM -0500, Bob Pearson wrote:
> AFAIK the poll workqueue and poll softirq cqs are working correctly, but the poll direct cq sometimes
> loses the thread and just stops processing those cqs. The test cases sometimes recover after about
> a 2 second delay and start processing again, but eventually fail after about a 10 second delay,
> clean up and go home.

This sort of sounds like a race with re-arming?
 
> The failures feel like a race or at least are timing sensitive. If you run the test suite several times
> various test cases will sometimes succeed and sometimes fail. But they always fail in the same way.
> 
> Looking at the mlxn drivers for inspiration, I don't see anything specific about IB_POLL_DIRECT except
> that they have a private version of send_queue_drain which also calls a cqe drain function which calls
> ib_process_cq_direct() in a loop until the cq is drained. But this is only during qp tear down. (No other
> verbs driver does this but as far as I know no other driver is passing blktests.) This is only done for
> IB_POLL_DIRECT, so I wonder, is this required to use that correctly?
> 
> I am still figuring out how IB_POLL_DIRECT works. It doesn't allow the driver to call cq->comp_handler so
> I don't know how it figures out when there are new wcs to process.

IIRC POLL_DIRECT means you don't get completion interrupts and instead
the ULP has to occasionally call ib_process_cq_direct() which will
pull out the CQEs.
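
As a rough illustration of that contract, here is a minimal user-space sketch. It is a toy model, not the kernel API: `toy_cq`, `toy_cq_complete()` and `toy_process_cq_direct()` are hypothetical names, and the ring only stands in for a CQ created with IB_POLL_DIRECT.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CQ_DEPTH 8	/* hypothetical ring size */

/* Toy stand-in for a CQ allocated with IB_POLL_DIRECT. Not a kernel type. */
struct toy_cq {
	int wr_id[TOY_CQ_DEPTH];	/* unreaped completions */
	size_t head;			/* next completion to reap */
	size_t tail;			/* next free slot */
	unsigned long reaped;		/* completions handed to the ULP so far */
};

/* Device side: queue a completion. Note that no handler runs here. */
static int toy_cq_complete(struct toy_cq *cq, int wr_id)
{
	if (cq->tail - cq->head == TOY_CQ_DEPTH)
		return -1;	/* ring full */
	cq->wr_id[cq->tail++ % TOY_CQ_DEPTH] = wr_id;
	return 0;
}

/* ULP side: reap at most budget completions; returns how many were reaped. */
static int toy_process_cq_direct(struct toy_cq *cq, int budget)
{
	int n = 0;

	assert(budget > 0);
	while (n < budget && cq->head != cq->tail) {
		cq->head++;
		cq->reaped++;
		n++;
	}
	return n;
}
```

In this model completions simply pile up after `toy_cq_complete()`; `reaped` only moves when the consumer decides to call `toy_process_cq_direct()`.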

So you should look at how ib_process_cq_direct() is being called in
srp and presumably something about that logic is not calling it.

It kind of looks like SRP is using it to reap send completions when
the send queue progresses, so maybe your issue is that the sendq is
getting stuck?
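
A hedged sketch of that pattern, again as a user-space toy rather than the real srp code: `toy_chan`, `toy_get_tx()` and the budget of 4 are hypothetical, chosen only to show why reaping that is tied to send-queue progress stops when the send queue stops.

```c
#include <assert.h>

#define TOY_SQ_DEPTH    4	/* hypothetical number of send buffers */
#define TOY_REAP_BUDGET 4	/* hypothetical per-call poll budget */

/* Toy channel: not the srp data structure, just two counters. */
struct toy_chan {
	int free_tx;	/* send buffers the ULP can hand out */
	int done_tx;	/* finished sends still sitting unreaped in the send CQ */
};

static void toy_chan_init(struct toy_chan *ch)
{
	ch->free_tx = TOY_SQ_DEPTH;
	ch->done_tx = 0;
}

/* Device side: one in-flight send finishes. Nothing is reaped yet. */
static void toy_send_done(struct toy_chan *ch)
{
	ch->done_tx++;
}

/* Stand-in for polling the send CQ: each reaped entry frees its buffer. */
static int toy_reap_send_cq(struct toy_chan *ch, int budget)
{
	int n;

	assert(budget > 0);
	n = ch->done_tx < budget ? ch->done_tx : budget;
	ch->done_tx -= n;
	ch->free_tx += n;
	return n;
}

/* Take a send buffer, reaping first. Returns 0 on success, -1 if none. */
static int toy_get_tx(struct toy_chan *ch)
{
	toy_reap_send_cq(ch, TOY_REAP_BUDGET);
	if (ch->free_tx == 0)
		return -1;
	ch->free_tx--;
	return 0;
}
```

In this model the only caller of the reap step is `toy_get_tx()`, so if nothing asks for a send buffer the finished sends stay unreaped indefinitely.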

Jason


* Re: IB_POLL_DIRECT
From: Bob Pearson @ 2023-06-07 14:54 UTC
  To: Jason Gunthorpe
  Cc: linux-rdma@vger.kernel.org, Bernard Metzler, Bart Van Assche

On 6/6/23 19:21, Jason Gunthorpe wrote:
> On Tue, Jun 06, 2023 at 03:54:25PM -0500, Bob Pearson wrote:
>> AFAIK the poll workqueue and poll softirq cqs are working correctly, but the poll direct cq sometimes
>> loses the thread and just stops processing those cqs. The test cases sometimes recover after about
>> a 2 second delay and start processing again, but eventually fail after about a 10 second delay,
>> clean up and go home.
> 
> This sort of sounds like a race with re-arming?
>  
>> The failures feel like a race or at least are timing sensitive. If you run the test suite several times
>> various test cases will sometimes succeed and sometimes fail. But they always fail in the same way.
>>
>> Looking at the mlxn drivers for inspiration, I don't see anything specific about IB_POLL_DIRECT except
>> that they have a private version of send_queue_drain which also calls a cqe drain function which calls
>> ib_process_cq_direct() in a loop until the cq is drained. But this is only during qp tear down. (No other
>> verbs driver does this but as far as I know no other driver is passing blktests.) This is only done for
>> IB_POLL_DIRECT, so I wonder, is this required to use that correctly?
>>
>> I am still figuring out how IB_POLL_DIRECT works. It doesn't allow the driver to call cq->comp_handler so
>> I don't know how it figures out when there are new wcs to process.
> 
> IIRC POLL_DIRECT means you don't get completion interrupts and instead
> the ULP has to occasionally call ib_process_cq_direct() which will
> pull out the CQEs.
> 
> So you should look at how ib_process_cq_direct() is being called in
> srp and presumably something about that logic is not calling it.
> 
> It kind of looks like SRP is using it to reap send completions when
> the send queue progresses, so maybe your issue is that the sendq is
> getting stuck?
> 
> Jason

Jason,

This was helpful. I had counted the total number of posts and polls across all the CQs,
and the counts differed throughout the run. I added a spinlock around some unprotected code that
changes the notify state of the CQs, and then the totals balanced at the end of
the run, but there were still many more posts than polls at the delay points. Then I switched the
CQs in the srp driver from IB_POLL_DIRECT to IB_POLL_WORKQUEUE, and the posts and
polls stayed on track the whole time, but there were still big delays and the test failed.
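
The kind of accounting described above could look roughly like this user-space sketch using C11 atomics. The names (`toy_cq_stats`, `toy_count_post()`, `toy_count_poll()`, `toy_backlog()`) are hypothetical and this is not the instrumentation that was actually used; it only shows the idea of comparing two running totals.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical global counters shared by all CQs under test. */
struct toy_cq_stats {
	atomic_long posted;	/* completions added to any CQ */
	atomic_long polled;	/* completions removed by any poller */
};

static void toy_stats_init(struct toy_cq_stats *s)
{
	atomic_init(&s->posted, 0);
	atomic_init(&s->polled, 0);
}

/* Call where a completion is added to a CQ. */
static void toy_count_post(struct toy_cq_stats *s)
{
	atomic_fetch_add(&s->posted, 1);
}

/* Call after a poll that returned n completions. */
static void toy_count_poll(struct toy_cq_stats *s, long n)
{
	assert(n >= 0);
	atomic_fetch_add(&s->polled, n);
}

/* Posts minus polls: zero when every completion has been reaped. */
static long toy_backlog(struct toy_cq_stats *s)
{
	return atomic_load(&s->posted) - atomic_load(&s->polled);
}
```

A backlog that returns to zero only at the end of the run, but is large at the delay points, matches the first observation; a backlog that stays near zero throughout matches the IB_POLL_WORKQUEUE run.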

This is consistent with your suggestion above: sporadic calls to ib_process_cq_direct()
would explain the differences in the counts. But since the delays remain even with
IB_POLL_WORKQUEUE, at the end of the day they seem to be caused by some other problem.

I'll keep chasing it.

Thanks,

Bob
