Re: Fwd: pyverbs test_resize_cq fails in some cases

public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed

From: Edward Srouji <edwards@nvidia.com>
To: Bob Pearson <rpearsonhpe@gmail.com>,
	linux-rdma@vger.kernel.org, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: Fwd: pyverbs test_resize_cq fails in some cases
Date: Tue, 18 Jul 2023 18:13:49 +0300	[thread overview]
Message-ID: <94b33c65-e0b5-663a-f041-3b96df3ef977@nvidia.com> (raw)
In-Reply-To: <a958a957-dfe3-9348-632f-ee0c9af13238@gmail.com>


On 7/18/2023 5:08 AM, Bob Pearson wrote:
> External email: Use caution opening links or attachments
>
>
> On 3/23/23 21:38, Bob Pearson wrote:
>>
>>
>> -------- Forwarded Message --------
>> Subject: pyverbs test_resize_cq fails in some cases
>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>> From: Bob Pearson <rpearsonhpe@gmail.com>
>> To: Edward Srouji <edwards@nvidia.com>
>> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
>>
>> Edward,
>>
>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>
>> The code is definitely racy:
>>
>> def test_resize_cq(self):
>>          """
>>          Test resize CQ, start with specific value and then increase and decrease
>>          the CQ size. The test also check bad flow of decrease the CQ size when
>>          there are more completions on it than the new value.
>>          """
>>          self.create_players(CQUDResources, cq_depth=3)
>>          # Decrease the CQ size.
>>          new_cq_size = 1
>>          try:
>>              self.client.cq.resize(new_cq_size)
>>          except PyverbsRDMAError as ex:
>>              if ex.error_code == errno.EOPNOTSUPP:
>>                  raise unittest.SkipTest('Resize CQ is not supported')
>>              raise ex
>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>                          'than guaranteed ({new_cq_size})')
>>
>>          # Increase the CQ size.
>>          new_cq_size = 7
>>          self.client.cq.resize(new_cq_size)
>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>                          'than guaranteed ({new_cq_size})')
>>
>>          # Fill the CQ entries except one for avoid cq_overrun warnings.
>>          send_wr, _ = u.get_send_elements(self.client, False)
>>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>>          for i in range(self.client.cq.cqe - 1):
>>              u.send(self.client, send_wr, ah=ah_client)
>>
>>        ### This posts 6 send work requests but does not wait for them to complete
>>        ### The following proceeds while the sends are executing in the background
>>        ### and the test can fail. An easy fix is to add the line
>>          time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
>>        ### alternatively you could test the data at the destination.
>>
>>          # Decrease the CQ size to less than the CQ unpolled entries.
>>          new_cq_size = 1
>>          with self.assertRaises(PyverbsRDMAError) as ex:
>>              self.client.cq.resize(new_cq_size)
>>          self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>
>> Bob
> Edward,
>
> This is showing up again. The software drivers are slower than the hardware drivers.
> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
> scheduled instead of running inline this test will fail some percentage of the time.
> Especially when the test is run for the first time but 5-10% of the time after that.
> If you run the send side logic inline it is fast enough that the test doesn't fail but
> that doesn't allow the workqueue tasks to spread out on the available cores so for high
> QP count it is much better to schedule them.
>
> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
> run time for the whole suite noticeably.

I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which 
is required to wait for the port to become up - but it's not run by 
default until the user specifically configure an mlx5_vfio device).

What if we poll the CQ for one completion, to make sure that the driver 
had the time to publish a CQE (which might be enough "time" for us)?
i.e.

$ git diff
diff --git a/tests/test_cq.py b/tests/test_cq.py
index 98485e3f8..e17676304 100644
--- a/tests/test_cq.py
+++ b/tests/test_cq.py
@@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
          ah_client = u.get_global_ah(self.client, self.gid_index, 
self.ib_port)
          for i in range(self.client.cq.cqe - 1):
              u.send(self.client, send_wr, ah=ah_client)
+        u.poll_cq(self.client.cq)

          # Decrease the CQ size to less than the CQ unpolled entries.
          new_cq_size = 1

Could you try to see if that solves the issue for you?

>
> Bottom line this test case is racy.
>
> If you want I can submit a 2 line patch that does this. Or if you would that would be
> better.
>
> Bob

next prev parent reply	other threads:[~2023-07-18 15:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <f218337a-b975-bfa7-635d-bafc42c2df04@gmail.com>
2023-03-24  2:38 ` Fwd: pyverbs test_resize_cq fails in some cases Bob Pearson
2023-07-18  2:08   ` Bob Pearson
2023-07-18 15:13     ` Edward Srouji [this message]
2023-07-19 16:14       ` Bob Pearson

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:98485e3f dfblob:e1767630 )
 OR (
bs:"Re: Fwd: pyverbs test_resize_cq fails in some cases" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=94b33c65-e0b5-663a-f041-3b96df3ef977@nvidia.com \
    --to=edwards@nvidia.com \
    --cc=jgg@nvidia.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=rpearsonhpe@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox