From: Edward Srouji <edwards@nvidia.com>
To: Bob Pearson <rpearsonhpe@gmail.com>,
linux-rdma@vger.kernel.org, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: Fwd: pyverbs test_resize_cq fails in some cases
Date: Tue, 18 Jul 2023 18:13:49 +0300 [thread overview]
Message-ID: <94b33c65-e0b5-663a-f041-3b96df3ef977@nvidia.com> (raw)
In-Reply-To: <a958a957-dfe3-9348-632f-ee0c9af13238@gmail.com>
On 7/18/2023 5:08 AM, Bob Pearson wrote:
> External email: Use caution opening links or attachments
>
>
> On 3/23/23 21:38, Bob Pearson wrote:
>>
>>
>> -------- Forwarded Message --------
>> Subject: pyverbs test_resize_cq fails in some cases
>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>> From: Bob Pearson <rpearsonhpe@gmail.com>
>> To: Edward Srouji <edwards@nvidia.com>
>> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
>>
>> Edward,
>>
>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>
>> The code is definitely racy:
>>
>> def test_resize_cq(self):
>> """
>> Test resize CQ, start with specific value and then increase and decrease
>> the CQ size. The test also check bad flow of decrease the CQ size when
>> there are more completions on it than the new value.
>> """
>> self.create_players(CQUDResources, cq_depth=3)
>> # Decrease the CQ size.
>> new_cq_size = 1
>> try:
>> self.client.cq.resize(new_cq_size)
>> except PyverbsRDMAError as ex:
>> if ex.error_code == errno.EOPNOTSUPP:
>> raise unittest.SkipTest('Resize CQ is not supported')
>> raise ex
>> self.assertTrue(self.client.cq.cqe >= new_cq_size,
>> f'The actual CQ size ({self.client.cq.cqe}) is less '
>> 'than guaranteed ({new_cq_size})')
>>
>> # Increase the CQ size.
>> new_cq_size = 7
>> self.client.cq.resize(new_cq_size)
>> self.assertTrue(self.client.cq.cqe >= new_cq_size,
>> f'The actual CQ size ({self.client.cq.cqe}) is less '
>> 'than guaranteed ({new_cq_size})')
>>
>> # Fill the CQ entries except one for avoid cq_overrun warnings.
>> send_wr, _ = u.get_send_elements(self.client, False)
>> ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>> for i in range(self.client.cq.cqe - 1):
>> u.send(self.client, send_wr, ah=ah_client)
>>
>> ### This posts 6 send work requests but does not wait for them to complete
>> ### The following proceeds while the sends are executing in the background
>> ### and the test can fail. An easy fix is to add the line
>> time.sleep(1/1000) ### or maybe something a little larger but this works for me.
>> ### alternatively you could test the data at the destination.
>>
>> # Decrease the CQ size to less than the CQ unpolled entries.
>> new_cq_size = 1
>> with self.assertRaises(PyverbsRDMAError) as ex:
>> self.client.cq.resize(new_cq_size)
>> self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>
>> Bob
> Edward,
>
> This is showing up again. The software drivers are slower than the hardware drivers.
> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
> scheduled instead of running inline this test will fail some percentage of the time.
> Especially when the test is run for the first time but 5-10% of the time after that.
> If you run the send side logic inline it is fast enough that the test doesn't fail but
> that doesn't allow the workqueue tasks to spread out on the available cores so for high
> QP count it is much better to schedule them.
>
> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
> run time for the whole suite noticeably.
I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which
is required to wait for the port to become up - but it's not run by
default until the user specifically configure an mlx5_vfio device).
What if we poll the CQ for one completion, to make sure that the driver
had the time to publish a CQE (which might be enough "time" for us)?
i.e.
$ git diff
diff --git a/tests/test_cq.py b/tests/test_cq.py
index 98485e3f8..e17676304 100644
--- a/tests/test_cq.py
+++ b/tests/test_cq.py
@@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
ah_client = u.get_global_ah(self.client, self.gid_index,
self.ib_port)
for i in range(self.client.cq.cqe - 1):
u.send(self.client, send_wr, ah=ah_client)
+ u.poll_cq(self.client.cq)
# Decrease the CQ size to less than the CQ unpolled entries.
new_cq_size = 1
Could you try to see if that solves the issue for you?
>
> Bottom line this test case is racy.
>
> If you want I can submit a 2 line patch that does this. Or if you would that would be
> better.
>
> Bob
next prev parent reply other threads:[~2023-07-18 15:14 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <f218337a-b975-bfa7-635d-bafc42c2df04@gmail.com>
2023-03-24 2:38 ` Fwd: pyverbs test_resize_cq fails in some cases Bob Pearson
2023-07-18 2:08 ` Bob Pearson
2023-07-18 15:13 ` Edward Srouji [this message]
2023-07-19 16:14 ` Bob Pearson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=94b33c65-e0b5-663a-f041-3b96df3ef977@nvidia.com \
--to=edwards@nvidia.com \
--cc=jgg@nvidia.com \
--cc=linux-rdma@vger.kernel.org \
--cc=rpearsonhpe@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox