Fwd: pyverbs test_resize_cq fails in some cases

public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed

* Fwd: pyverbs test_resize_cq fails in some cases
       [not found] <f218337a-b975-bfa7-635d-bafc42c2df04@gmail.com>
@ 2023-03-24  2:38 ` Bob Pearson
  2023-07-18  2:08   ` Bob Pearson
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Pearson @ 2023-03-24  2:38 UTC (permalink / raw)
  To: linux-rdma




-------- Forwarded Message --------
Subject: pyverbs test_resize_cq fails in some cases
Date: Thu, 23 Mar 2023 21:37:28 -0500
From: Bob Pearson <rpearsonhpe@gmail.com>
To: Edward Srouji <edwards@nvidia.com>
CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>

Edward,

The pyverbs test: test_resize_cq fails for the rxe driver in some cases.

The code is definitely racy:

def test_resize_cq(self):
        """
        Test resize CQ, start with specific value and then increase and decrease
        the CQ size. The test also check bad flow of decrease the CQ size when
        there are more completions on it than the new value.
        """
        self.create_players(CQUDResources, cq_depth=3)
        # Decrease the CQ size.
        new_cq_size = 1
        try:
            self.client.cq.resize(new_cq_size)
        except PyverbsRDMAError as ex:
            if ex.error_code == errno.EOPNOTSUPP:
                raise unittest.SkipTest('Resize CQ is not supported')
            raise ex
        self.assertTrue(self.client.cq.cqe >= new_cq_size,
                        f'The actual CQ size ({self.client.cq.cqe}) is less '
                        'than guaranteed ({new_cq_size})')

        # Increase the CQ size.
        new_cq_size = 7
        self.client.cq.resize(new_cq_size)
        self.assertTrue(self.client.cq.cqe >= new_cq_size,
                        f'The actual CQ size ({self.client.cq.cqe}) is less '
                        'than guaranteed ({new_cq_size})')

        # Fill the CQ entries except one for avoid cq_overrun warnings.
        send_wr, _ = u.get_send_elements(self.client, False)
        ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
        for i in range(self.client.cq.cqe - 1):
            u.send(self.client, send_wr, ah=ah_client)

	### This posts 6 send work requests but does not wait for them to complete
	### The following proceeds while the sends are executing in the background
	### and the test can fail. An easy fix is to add the line
        time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
	### alternatively you could test the data at the destination.

        # Decrease the CQ size to less than the CQ unpolled entries.
        new_cq_size = 1
        with self.assertRaises(PyverbsRDMAError) as ex:
            self.client.cq.resize(new_cq_size)
        self.assertEqual(ex.exception.error_code, errno.EINVAL)

Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Fwd: pyverbs test_resize_cq fails in some cases
  2023-03-24  2:38 ` Fwd: pyverbs test_resize_cq fails in some cases Bob Pearson
@ 2023-07-18  2:08   ` Bob Pearson
  2023-07-18 15:13     ` Edward Srouji
  0 siblings, 1 reply; 4+ messages in thread
From: Bob Pearson @ 2023-07-18  2:08 UTC (permalink / raw)
  To: linux-rdma, Edward Srouji, Jason Gunthorpe

On 3/23/23 21:38, Bob Pearson wrote:
> 
> 
> 
> -------- Forwarded Message --------
> Subject: pyverbs test_resize_cq fails in some cases
> Date: Thu, 23 Mar 2023 21:37:28 -0500
> From: Bob Pearson <rpearsonhpe@gmail.com>
> To: Edward Srouji <edwards@nvidia.com>
> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
> 
> Edward,
> 
> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
> 
> The code is definitely racy:
> 
> def test_resize_cq(self):
>         """
>         Test resize CQ, start with specific value and then increase and decrease
>         the CQ size. The test also check bad flow of decrease the CQ size when
>         there are more completions on it than the new value.
>         """
>         self.create_players(CQUDResources, cq_depth=3)
>         # Decrease the CQ size.
>         new_cq_size = 1
>         try:
>             self.client.cq.resize(new_cq_size)
>         except PyverbsRDMAError as ex:
>             if ex.error_code == errno.EOPNOTSUPP:
>                 raise unittest.SkipTest('Resize CQ is not supported')
>             raise ex
>         self.assertTrue(self.client.cq.cqe >= new_cq_size,
>                         f'The actual CQ size ({self.client.cq.cqe}) is less '
>                         'than guaranteed ({new_cq_size})')
> 
>         # Increase the CQ size.
>         new_cq_size = 7
>         self.client.cq.resize(new_cq_size)
>         self.assertTrue(self.client.cq.cqe >= new_cq_size,
>                         f'The actual CQ size ({self.client.cq.cqe}) is less '
>                         'than guaranteed ({new_cq_size})')
> 
>         # Fill the CQ entries except one for avoid cq_overrun warnings.
>         send_wr, _ = u.get_send_elements(self.client, False)
>         ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>         for i in range(self.client.cq.cqe - 1):
>             u.send(self.client, send_wr, ah=ah_client)
> 
> 	### This posts 6 send work requests but does not wait for them to complete
> 	### The following proceeds while the sends are executing in the background
> 	### and the test can fail. An easy fix is to add the line
>         time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
> 	### alternatively you could test the data at the destination.
> 
>         # Decrease the CQ size to less than the CQ unpolled entries.
>         new_cq_size = 1
>         with self.assertRaises(PyverbsRDMAError) as ex:
>             self.client.cq.resize(new_cq_size)
>         self.assertEqual(ex.exception.error_code, errno.EINVAL)
> 
> Bob

Edward,

This is showing up again. The software drivers are slower than the hardware drivers.
Now that the send side logic in the rxe driver is handled by a workqueue and if it is
scheduled instead of running inline this test will fail some percentage of the time.
Especially when the test is run for the first time but 5-10% of the time after that.
If you run the send side logic inline it is fast enough that the test doesn't fail but
that doesn't allow the workqueue tasks to spread out on the available cores so for high
QP count it is much better to schedule them.

There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
better is time.sleep(0.001) which works fine for this test and doesn't slow down the
run time for the whole suite noticeably.

Bottom line this test case is racy.

If you want I can submit a 2 line patch that does this. Or if you would that would be
better.

Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Fwd: pyverbs test_resize_cq fails in some cases
  2023-07-18  2:08   ` Bob Pearson
@ 2023-07-18 15:13     ` Edward Srouji
  2023-07-19 16:14       ` Bob Pearson
  0 siblings, 1 reply; 4+ messages in thread
From: Edward Srouji @ 2023-07-18 15:13 UTC (permalink / raw)
  To: Bob Pearson, linux-rdma, Jason Gunthorpe


On 7/18/2023 5:08 AM, Bob Pearson wrote:
> External email: Use caution opening links or attachments
>
>
> On 3/23/23 21:38, Bob Pearson wrote:
>>
>>
>> -------- Forwarded Message --------
>> Subject: pyverbs test_resize_cq fails in some cases
>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>> From: Bob Pearson <rpearsonhpe@gmail.com>
>> To: Edward Srouji <edwards@nvidia.com>
>> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
>>
>> Edward,
>>
>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>
>> The code is definitely racy:
>>
>> def test_resize_cq(self):
>>          """
>>          Test resize CQ, start with specific value and then increase and decrease
>>          the CQ size. The test also check bad flow of decrease the CQ size when
>>          there are more completions on it than the new value.
>>          """
>>          self.create_players(CQUDResources, cq_depth=3)
>>          # Decrease the CQ size.
>>          new_cq_size = 1
>>          try:
>>              self.client.cq.resize(new_cq_size)
>>          except PyverbsRDMAError as ex:
>>              if ex.error_code == errno.EOPNOTSUPP:
>>                  raise unittest.SkipTest('Resize CQ is not supported')
>>              raise ex
>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>                          'than guaranteed ({new_cq_size})')
>>
>>          # Increase the CQ size.
>>          new_cq_size = 7
>>          self.client.cq.resize(new_cq_size)
>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>                          'than guaranteed ({new_cq_size})')
>>
>>          # Fill the CQ entries except one for avoid cq_overrun warnings.
>>          send_wr, _ = u.get_send_elements(self.client, False)
>>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>>          for i in range(self.client.cq.cqe - 1):
>>              u.send(self.client, send_wr, ah=ah_client)
>>
>>        ### This posts 6 send work requests but does not wait for them to complete
>>        ### The following proceeds while the sends are executing in the background
>>        ### and the test can fail. An easy fix is to add the line
>>          time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
>>        ### alternatively you could test the data at the destination.
>>
>>          # Decrease the CQ size to less than the CQ unpolled entries.
>>          new_cq_size = 1
>>          with self.assertRaises(PyverbsRDMAError) as ex:
>>              self.client.cq.resize(new_cq_size)
>>          self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>
>> Bob
> Edward,
>
> This is showing up again. The software drivers are slower than the hardware drivers.
> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
> scheduled instead of running inline this test will fail some percentage of the time.
> Especially when the test is run for the first time but 5-10% of the time after that.
> If you run the send side logic inline it is fast enough that the test doesn't fail but
> that doesn't allow the workqueue tasks to spread out on the available cores so for high
> QP count it is much better to schedule them.
>
> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
> run time for the whole suite noticeably.

I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which 
is required to wait for the port to become up - but it's not run by 
default until the user specifically configure an mlx5_vfio device).

What if we poll the CQ for one completion, to make sure that the driver 
had the time to publish a CQE (which might be enough "time" for us)?
i.e.

$ git diff
diff --git a/tests/test_cq.py b/tests/test_cq.py
index 98485e3f8..e17676304 100644
--- a/tests/test_cq.py
+++ b/tests/test_cq.py
@@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
          ah_client = u.get_global_ah(self.client, self.gid_index, 
self.ib_port)
          for i in range(self.client.cq.cqe - 1):
              u.send(self.client, send_wr, ah=ah_client)
+        u.poll_cq(self.client.cq)

          # Decrease the CQ size to less than the CQ unpolled entries.
          new_cq_size = 1

Could you try to see if that solves the issue for you?

>
> Bottom line this test case is racy.
>
> If you want I can submit a 2 line patch that does this. Or if you would that would be
> better.
>
> Bob

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Fwd: pyverbs test_resize_cq fails in some cases
  2023-07-18 15:13     ` Edward Srouji
@ 2023-07-19 16:14       ` Bob Pearson
  0 siblings, 0 replies; 4+ messages in thread
From: Bob Pearson @ 2023-07-19 16:14 UTC (permalink / raw)
  To: Edward Srouji, linux-rdma, Jason Gunthorpe

On 7/18/23 10:13, Edward Srouji wrote:
> 
> On 7/18/2023 5:08 AM, Bob Pearson wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 3/23/23 21:38, Bob Pearson wrote:
>>>
>>>
>>> -------- Forwarded Message --------
>>> Subject: pyverbs test_resize_cq fails in some cases
>>> Date: Thu, 23 Mar 2023 21:37:28 -0500
>>> From: Bob Pearson <rpearsonhpe@gmail.com>
>>> To: Edward Srouji <edwards@nvidia.com>
>>> CC: Jason Gunthorpe <jgg@nvidia.com>, Zhu Yanjun <zyjzyj2000@gmail.com>
>>>
>>> Edward,
>>>
>>> The pyverbs test: test_resize_cq fails for the rxe driver in some cases.
>>>
>>> The code is definitely racy:
>>>
>>> def test_resize_cq(self):
>>>          """
>>>          Test resize CQ, start with specific value and then increase and decrease
>>>          the CQ size. The test also check bad flow of decrease the CQ size when
>>>          there are more completions on it than the new value.
>>>          """
>>>          self.create_players(CQUDResources, cq_depth=3)
>>>          # Decrease the CQ size.
>>>          new_cq_size = 1
>>>          try:
>>>              self.client.cq.resize(new_cq_size)
>>>          except PyverbsRDMAError as ex:
>>>              if ex.error_code == errno.EOPNOTSUPP:
>>>                  raise unittest.SkipTest('Resize CQ is not supported')
>>>              raise ex
>>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>>                          'than guaranteed ({new_cq_size})')
>>>
>>>          # Increase the CQ size.
>>>          new_cq_size = 7
>>>          self.client.cq.resize(new_cq_size)
>>>          self.assertTrue(self.client.cq.cqe >= new_cq_size,
>>>                          f'The actual CQ size ({self.client.cq.cqe}) is less '
>>>                          'than guaranteed ({new_cq_size})')
>>>
>>>          # Fill the CQ entries except one for avoid cq_overrun warnings.
>>>          send_wr, _ = u.get_send_elements(self.client, False)
>>>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>>>          for i in range(self.client.cq.cqe - 1):
>>>              u.send(self.client, send_wr, ah=ah_client)
>>>
>>>        ### This posts 6 send work requests but does not wait for them to complete
>>>        ### The following proceeds while the sends are executing in the background
>>>        ### and the test can fail. An easy fix is to add the line
>>>          time.sleep(1/1000)  ### or maybe something a little larger but this works for me.
>>>        ### alternatively you could test the data at the destination.
>>>
>>>          # Decrease the CQ size to less than the CQ unpolled entries.
>>>          new_cq_size = 1
>>>          with self.assertRaises(PyverbsRDMAError) as ex:
>>>              self.client.cq.resize(new_cq_size)
>>>          self.assertEqual(ex.exception.error_code, errno.EINVAL)
>>>
>>> Bob
>> Edward,
>>
>> This is showing up again. The software drivers are slower than the hardware drivers.
>> Now that the send side logic in the rxe driver is handled by a workqueue and if it is
>> scheduled instead of running inline this test will fail some percentage of the time.
>> Especially when the test is run for the first time but 5-10% of the time after that.
>> If you run the send side logic inline it is fast enough that the test doesn't fail but
>> that doesn't allow the workqueue tasks to spread out on the available cores so for high
>> QP count it is much better to schedule them.
>>
>> There are a few time.sleep(1) calls in the pyverbs test suite but that is wasteful and
>> better is time.sleep(0.001) which works fine for this test and doesn't slow down the
>> run time for the whole suite noticeably.
> 
> I prefer not to add a sleep (the only sleeps are in mlx5_vfio test which is required to wait for the port to become up - but it's not run by default until the user specifically configure an mlx5_vfio device).
> 
> What if we poll the CQ for one completion, to make sure that the driver had the time to publish a CQE (which might be enough "time" for us)?
> i.e.
> 
> $ git diff
> diff --git a/tests/test_cq.py b/tests/test_cq.py
> index 98485e3f8..e17676304 100644
> --- a/tests/test_cq.py
> +++ b/tests/test_cq.py
> @@ -123,6 +123,7 @@ class CQTest(RDMATestCase):
>          ah_client = u.get_global_ah(self.client, self.gid_index, self.ib_port)
>          for i in range(self.client.cq.cqe - 1):
>              u.send(self.client, send_wr, ah=ah_client)
> +        u.poll_cq(self.client.cq)
> 
>          # Decrease the CQ size to less than the CQ unpolled entries.
>          new_cq_size = 1
> 
> Could you try to see if that solves the issue for you?
> 
>>
>> Bottom line this test case is racy.
>>
>> If you want I can submit a 2 line patch that does this. Or if you would that would be
>> better.
>>
>> Bob

Yes. May take a day or two to get to it but I will definitely try it.

Thanks

Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-07-19 16:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <f218337a-b975-bfa7-635d-bafc42c2df04@gmail.com>
2023-03-24  2:38 ` Fwd: pyverbs test_resize_cq fails in some cases Bob Pearson
2023-07-18  2:08   ` Bob Pearson
2023-07-18 15:13     ` Edward Srouji
2023-07-19 16:14       ` Bob Pearson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox