From: Bob Pearson <rpearsonhpe@gmail.com>
To: Edward Srouji <edwards@nvidia.com>,
linux-rdma@vger.kernel.org, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: Fwd: pyverbs test_resize_cq fails in some cases
Date: Wed, 19 Jul 2023 11:14:34 -0500 [thread overview]
Message-ID: <8f924d85-1ca7-d51a-d8ed-d019957244f1@gmail.com> (raw)
In-Reply-To: <94b33c65-e0b5-663a-f041-3b96df3ef977@nvidia.com>
On 7/18/23 10:13, Edward Srouji wrote:
>
> On 7/18/2023 5:08 AM, Bob Pearson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 3/23/23 21:38, Bob Pearson wrote:
>>>
>>>
>>> -------- Forwarded Message --------
>>> Subject: pyverbs test_resize_cq fails in some cases
>>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>>> From: Bob Pearson <rpearsonhpe@gmail.com>
>>> To: Edward Srouji <edwards@nvidia.com>
>>> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
>>>
>>> Edward,
>>>
>>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>>
>>> The code is definitely racy:
>>>
>>> def test_resize_cq(self):
>>> """
>>> Test resize CQ, start with specific value and then increase and decrease
>>> the CQ size. The test also check bad flow of decrease the CQ size when
>>> there are more completions on it than the new value.
>>> """
>>> self.create_players(CQUDResources, cq_depth=3)
>>> # Decrease the CQ size.
>>> new_cq_size = 1
>>> try:
>>> self.client.cq.resize(new_cq_size)
>>> except PyverbsRDMAError as ex:
>>> if ex.error_code == errno.EOPNOTSUPP:
>>> raise unittest.SkipTest('Resize CQ is not supported')
>>> raise ex
>>> self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>> f'The actual CQ size ({self.client.cq.cqe}) is less '
>>> 'than guaranteed ({new_cq_size})')
>>>
>>> # Increase the CQ size.
>>> new_cq_size = 7
>>> self.client.cq.resize(new_cq_size)
>>> self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>> f'The actual CQ size ({self.client.cq.cqe}) is less '
>>> 'than guaranteed ({new_cq_size})')
>>>
>>> # Fill the CQ entries except one for avoid cq_overrun warnings.
>>> send_wr, _ = u.get_send_elements(self.client, False)
>>> ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>>> for i in range(self.client.cq.cqe - 1):
>>> u.send(self.client, send_wr, ah=ah_client)
>>>
>>> ### This posts 6 send work requests but does not wait for them to complete
>>> ### The following proceeds while the sends are executing in the background
>>> ### and the test can fail. An easy fix is to add the line
>>> time.sleep(1/1000) ### or maybe something a little larger but this works for me.
>>> ### alternatively you could test the data at the destination.
>>>
>>> # Decrease the CQ size to less than the CQ unpolled entries.
>>> new_cq_size = 1
>>> with self.assertRaises(PyverbsRDMAError) as ex:
>>> self.client.cq.resize(new_cq_size)
>>> self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>>
>>> Bob
>> Edward,
>>
>> This is showing up again. The software drivers are slower than the hardware drivers.
>> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
>> scheduled instead of running inline this test will fail some percentage of the time.
>> Especially when the test is run for the first time but 5-10% of the time after that.
>> If you run the send side logic inline it is fast enough that the test doesn't fail but
>> that doesn't allow the workqueue tasks to spread out on the available cores so for high
>> QP count it is much better to schedule them.
>>
>> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
>> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
>> run time for the whole suite noticeably.
>
> I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which is required to wait for the port to become up - but it's not run by default until the user specifically configure an mlx5_vfio device).
>
> What if we poll the CQ for one completion, to make sure that the driver had the time to publish a CQE (which might be enough "time" for us)?
> i.e.
>
> $ git diff
> diff --git a/tests/test_cq.py b/tests/test_cq.py
> index 98485e3f8..e17676304 100644
> --- a/tests/test_cq.py
> +++ b/tests/test_cq.py
> @@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
> ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
> for i in range(self.client.cq.cqe - 1):
> u.send(self.client, send_wr, ah=ah_client)
> + u.poll_cq(self.client.cq)
>
> # Decrease the CQ size to less than the CQ unpolled entries.
> new_cq_size = 1
>
> Could you try to see if that solves the issue for you?
>
>>
>> Bottom line this test case is racy.
>>
>> If you want I can submit a 2 line patch that does this. Or if you would that would be
>> better.
>>
>> Bob
Yes. May take a day or two to get to it but I will definitely try it.
Thanks
Bob
prev parent reply other threads:[~2023-07-19 16:14 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <f218337a-b975-bfa7-635d-bafc42c2df04@gmail.com>
2023-03-24 2:38 ` Fwd: pyverbs test_resize_cq fails in some cases Bob Pearson
2023-07-18 2:08 ` Bob Pearson
2023-07-18 15:13 ` Edward Srouji
2023-07-19 16:14 ` Bob Pearson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8f924d85-1ca7-d51a-d8ed-d019957244f1@gmail.com \
--to=rpearsonhpe@gmail.com \
--cc=edwards@nvidia.com \
--cc=jgg@nvidia.com \
--cc=linux-rdma@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox