* [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
@ 2025-08-17 12:37 Daisuke Matsuda
2025-08-18 4:44 ` Zhu Yanjun
2025-08-25 18:10 ` Jason Gunthorpe
0 siblings, 2 replies; 9+ messages in thread
From: Daisuke Matsuda @ 2025-08-17 12:37 UTC (permalink / raw)
To: linux-rdma, leon, jgg, zyjzyj2000
Cc: yanjun.zhu, philipp.reisner, Daisuke Matsuda
When running the test_resize_cq testcase from rdma-core, polling a
completion queue from userspace may occasionally hang and eventually fail
with a timeout:
=====
ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
Test resize CQ, start with specific value and then increase and decrease
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
u.poll_cq(self.client.cq)
File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
wcs = _poll_cq(cq, count, data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
remaining)
=====
The issue occurs when rxe_cq_post() fails to post a CQE because the queue
is temporarily full, and the CQE is effectively lost. To mitigate this,
add a bounded busy-wait with a fallback to rescheduling so that the CQE
does not get lost.
Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
---
drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index fffd144d509e..7b0fba63204e 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int cqe,
/* caller holds reference to cq */
int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
{
+ unsigned long flags;
+ u32 spin_cnt = 3000;
struct ib_event ev;
- int full;
void *addr;
- unsigned long flags;
+ int full;
spin_lock_irqsave(&cq->cq_lock, flags);
full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
+ if (likely(!full))
+ goto post_queue;
+
+ /* constant backoff until queue is ready */
+ while (spin_cnt--) {
+ full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
+ if (!full)
+ goto post_queue;
+
+ cpu_relax();
+ }
+
+ /* try giving up cpu and retry */
+ if (full) {
+ spin_unlock_irqrestore(&cq->cq_lock, flags);
+ cond_resched();
+ spin_lock_irqsave(&cq->cq_lock, flags);
+
+ full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
+ }
+
if (unlikely(full)) {
rxe_err_cq(cq, "queue full\n");
spin_unlock_irqrestore(&cq->cq_lock, flags);
@@ -105,6 +127,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
return -EBUSY;
}
+ post_queue:
addr = queue_producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
memcpy(addr, cqe, sizeof(*cqe));
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-17 12:37 [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize Daisuke Matsuda
@ 2025-08-18 4:44 ` Zhu Yanjun
2025-08-19 15:15 ` Daisuke Matsuda
2025-08-25 18:10 ` Jason Gunthorpe
1 sibling, 1 reply; 9+ messages in thread
From: Zhu Yanjun @ 2025-08-18 4:44 UTC (permalink / raw)
To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000; +Cc: philipp.reisner
On 2025/8/17 5:37, Daisuke Matsuda wrote:
> When running the test_resize_cq testcase from rdma-core, polling a
> completion queue from userspace may occasionally hang and eventually fail
> with a timeout:
> =====
> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
> Test resize CQ, start with specific value and then increase and decrease
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
> u.poll_cq(self.client.cq)
> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
> wcs = _poll_cq(cq, count, data)
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
> raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
> remaining)
> =====
>
> The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
> being temporarily full, and the CQE is effectively lost. To mitigate this,
> add a bounded busy-wait with fallback rescheduling so that CQE does not get
> lost.
>
> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
> ---
> drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
> index fffd144d509e..7b0fba63204e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
> @@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int cqe,
> /* caller holds reference to cq */
> int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
> {
> + unsigned long flags;
> + u32 spin_cnt = 3000;
> struct ib_event ev;
> - int full;
> void *addr;
> - unsigned long flags;
> + int full;
>
> spin_lock_irqsave(&cq->cq_lock, flags);
>
> full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
> + if (likely(!full))
> + goto post_queue;
> +
> + /* constant backoff until queue is ready */
> + while (spin_cnt--) {
> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
> + if (!full)
> + goto post_queue;
> +
> + cpu_relax();
> + }
The loop runs 3000 times. Each iteration checks queue_full() and executes
cpu_relax(). On modern CPUs, each iteration may take a few cycles, e.g.,
4–10 cycles (depending on memory/cache).
Suppose 1 cycle ≈ 0.3 ns on a 3 GHz CPU, so 10 cycles ≈ 3 ns.
3000 iterations × 10 cycles ≈ 30,000 cycles
30,000 cycles × 0.3 ns = 9,000 ns = 9 microseconds
So the "critical section" while spinning lasts on the order of
microseconds, not milliseconds.
I was concerned that 3000 iterations might make the spinlock critical
section too long, but based on the analysis above, it appears to still be
a short-duration critical section.
I am not sure whether this is too large a spinlock critical section.
If it is not,
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Zhu Yanjun
> +
> + /* try giving up cpu and retry */
> + if (full) {
> + spin_unlock_irqrestore(&cq->cq_lock, flags);
> + cond_resched();
> + spin_lock_irqsave(&cq->cq_lock, flags);
> +
> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
> + }
> +
> if (unlikely(full)) {
> rxe_err_cq(cq, "queue full\n");
> spin_unlock_irqrestore(&cq->cq_lock, flags);
> @@ -105,6 +127,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
> return -EBUSY;
> }
>
> + post_queue:
> addr = queue_producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
> memcpy(addr, cqe, sizeof(*cqe));
>
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-18 4:44 ` Zhu Yanjun
@ 2025-08-19 15:15 ` Daisuke Matsuda
2025-08-21 3:12 ` Zhu Yanjun
0 siblings, 1 reply; 9+ messages in thread
From: Daisuke Matsuda @ 2025-08-19 15:15 UTC (permalink / raw)
To: Zhu Yanjun, linux-rdma, leon, jgg, zyjzyj2000; +Cc: philipp.reisner
On 2025/08/18 13:44, Zhu Yanjun wrote:
> On 2025/8/17 5:37, Daisuke Matsuda wrote:
>> When running the test_resize_cq testcase from rdma-core, polling a
>> completion queue from userspace may occasionally hang and eventually fail
>> with a timeout:
>> =====
>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>> Test resize CQ, start with specific value and then increase and decrease
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
>> u.poll_cq(self.client.cq)
>> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>> wcs = _poll_cq(cq, count, data)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^
>> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>> raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
>> remaining)
>> =====
>>
>> The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
>> being temporarily full, and the CQE is effectively lost. To mitigate this,
>> add a bounded busy-wait with fallback rescheduling so that CQE does not get
>> lost.
>>
>> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
>> ---
>> drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
>> index fffd144d509e..7b0fba63204e 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
>> @@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int cqe,
>> /* caller holds reference to cq */
>> int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
>> {
>> + unsigned long flags;
>> + u32 spin_cnt = 3000;
>> struct ib_event ev;
>> - int full;
>> void *addr;
>> - unsigned long flags;
>> + int full;
>> spin_lock_irqsave(&cq->cq_lock, flags);
>> full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>> + if (likely(!full))
>> + goto post_queue;
>> +
>> + /* constant backoff until queue is ready */
>> + while (spin_cnt--) {
>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>> + if (!full)
>> + goto post_queue;
>> +
>> + cpu_relax();
>> + }
>
> The loop runs 3000 times.
> Each iteration:
>
> Checks queue_full()
> Executes cpu_relax()
>
> On modern CPUs, each iteration may take a few cycles, e.g., 4–10 cycles per iteration (depends on memory/cache).
>
> Suppose 1 cycle = ~0.3 ns on a 3 GHz CPU, 10 cycles ≈ 3 ns
> 3000 iterations × 10 cycles ≈ 30,000 cycles
>
> 30000 cycles * 0.3 ns = 9000 ns = 9 microseconds
>
> So the “critical section” while spinning is tens of microseconds, not milliseconds.
>
> I was concerned that 3000 iterations might make the spin lock critical section too long, but based on the analysis above, it appears that this is still a short-duration critical section.
Thank you for the review.
Assuming the two loads in queue_full() hit in the L1 cache, I estimate each iteration could take around
15–20 cycles. Based on your calculation, the maximum total time would be approximately 18 microseconds.
>
> I am not sure if it is a big spin lock critical section or not.
> If it is not,
In my opinion, this duration is acceptable, as the thread does not actually spin for that long
in practice. During my testing, it never reached the cond_resched() fallback, so the
current spin count appears sufficient to avoid the failure case.
Thanks,
Daisuke
>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> Zhu Yanjun
>
>> +
>> + /* try giving up cpu and retry */
>> + if (full) {
>> + spin_unlock_irqrestore(&cq->cq_lock, flags);
>> + cond_resched();
>> + spin_lock_irqsave(&cq->cq_lock, flags);
>> +
>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>> + }
>> +
>> if (unlikely(full)) {
>> rxe_err_cq(cq, "queue full\n");
>> spin_unlock_irqrestore(&cq->cq_lock, flags);
>> @@ -105,6 +127,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
>> return -EBUSY;
>> }
>> + post_queue:
>> addr = queue_producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
>> memcpy(addr, cqe, sizeof(*cqe));
>
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-19 15:15 ` Daisuke Matsuda
@ 2025-08-21 3:12 ` Zhu Yanjun
2025-08-23 4:19 ` Daisuke Matsuda
0 siblings, 1 reply; 9+ messages in thread
From: Zhu Yanjun @ 2025-08-21 3:12 UTC (permalink / raw)
To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000; +Cc: philipp.reisner
On 2025/8/19 8:15, Daisuke Matsuda wrote:
> On 2025/08/18 13:44, Zhu Yanjun wrote:
>> On 2025/8/17 5:37, Daisuke Matsuda wrote:
>>> When running the test_resize_cq testcase from rdma-core, polling a
>>> completion queue from userspace may occasionally hang and eventually
>>> fail
>>> with a timeout:
>>> =====
>>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>>> Test resize CQ, start with specific value and then increase and decrease
>>> ----------------------------------------------------------------------
>>> Traceback (most recent call last):
>>> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in
>>> test_resize_cq
>>> u.poll_cq(self.client.cq)
>>> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>>> wcs = _poll_cq(cq, count, data)
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^
>>> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>>> raise PyverbsError(f'Got timeout on polling ({count} CQEs
>>> remaining)')
>>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
>>> remaining)
>>> =====
>>>
>>> The issue is caused when rxe_cq_post() fails to post a CQE due to the
>>> queue
>>> being temporarily full, and the CQE is effectively lost. To mitigate
>>> this,
>>> add a bounded busy-wait with fallback rescheduling so that CQE does
>>> not get
>>> lost.
>>>
>>> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
>>> ---
>>> drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
>>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/
>>> sw/rxe/rxe_cq.c
>>> index fffd144d509e..7b0fba63204e 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
>>> @@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int cqe,
>>> /* caller holds reference to cq */
>>> int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
>>> {
>>> + unsigned long flags;
>>> + u32 spin_cnt = 3000;
>>> struct ib_event ev;
>>> - int full;
>>> void *addr;
>>> - unsigned long flags;
>>> + int full;
>>> spin_lock_irqsave(&cq->cq_lock, flags);
>>> full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>> + if (likely(!full))
>>> + goto post_queue;
>>> +
>>> + /* constant backoff until queue is ready */
>>> + while (spin_cnt--) {
>>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>> + if (!full)
>>> + goto post_queue;
>>> +
>>> + cpu_relax();
>>> + }
>>
>> The loop runs 3000 times.
>> Each iteration:
>>
>> Checks queue_full()
>> Executes cpu_relax()
>>
>> On modern CPUs, each iteration may take a few cycles, e.g., 4–10
>> cycles per iteration (depends on memory/cache).
>>
>> Suppose 1 cycle = ~0.3 ns on a 3 GHz CPU, 10 cycles ≈ 3 ns
>> 3000 iterations × 10 cycles ≈ 30,000 cycles
>>
>> 30000 cycles * 0.3 ns = 9000 ns = 9 microseconds
>>
>> So the “critical section” while spinning is tens of microseconds, not
>> milliseconds.
>>
>> I was concerned that 3000 iterations might make the spin lock critical
>> section too long, but based on the analysis above, it appears that
>> this is still a short-duration critical section.
>
> Thank you for the review.
>
> Assuming the two loads in queue_full() hit in the L1 cache, I estimate
> each iteration could take around
> 15–20 cycles. Based on your calculation, the maximum total time would be
> approximately 18 microseconds.
======================================================================
ERROR: test_rdmacm_async_write (tests.test_rdmacm.CMTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/..../rdma-core/tests/test_rdmacm.py", line 71, in
test_rdmacm_async_write
self.two_nodes_rdmacm_traffic(CMAsyncConnection,
File "/..../rdma-core/tests/base.py", line 447, in
two_nodes_rdmacm_traffic
raise Exception('Exception in active/passive side occurred')
Exception: Exception in active/passive side occurred
After applying your commit, I ran the following run_tests.py loop 10000 times.
The above error will sometimes appear; the frequency is very low.
"
for (( i = 0; i < 10000; i++ ))
do
rdma-core/build/bin/run_tests.py --dev rxe0
done
"
It is weird.
Yanjun.Zhu
>
>>
>> I am not sure if it is a big spin lock critical section or not.
>> If it is not,
>
> In my opinion, this duration is acceptable, as the thread does not
> actually spin for that long
> in practice. During my testing, it never reached the cond_resched()
> fallback, so the
> current spin count appears sufficient to avoid the failure case.
>
> Thanks,
> Daisuke
>
>>
>> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> Zhu Yanjun
>>
>>> +
>>> + /* try giving up cpu and retry */
>>> + if (full) {
>>> + spin_unlock_irqrestore(&cq->cq_lock, flags);
>>> + cond_resched();
>>> + spin_lock_irqsave(&cq->cq_lock, flags);
>>> +
>>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>> + }
>>> +
>>> if (unlikely(full)) {
>>> rxe_err_cq(cq, "queue full\n");
>>> spin_unlock_irqrestore(&cq->cq_lock, flags);
>>> @@ -105,6 +127,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe
>>> *cqe, int solicited)
>>> return -EBUSY;
>>> }
>>> + post_queue:
>>> addr = queue_producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>> memcpy(addr, cqe, sizeof(*cqe));
>>
>
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-21 3:12 ` Zhu Yanjun
@ 2025-08-23 4:19 ` Daisuke Matsuda
2025-08-23 5:22 ` Zhu Yanjun
0 siblings, 1 reply; 9+ messages in thread
From: Daisuke Matsuda @ 2025-08-23 4:19 UTC (permalink / raw)
To: Zhu Yanjun, linux-rdma, leon, jgg, zyjzyj2000; +Cc: philipp.reisner
On 2025/08/21 12:12, Zhu Yanjun wrote:
> On 2025/8/19 8:15, Daisuke Matsuda wrote:
>> On 2025/08/18 13:44, Zhu Yanjun wrote:
>>> On 2025/8/17 5:37, Daisuke Matsuda wrote:
>>>> When running the test_resize_cq testcase from rdma-core, polling a
>>>> completion queue from userspace may occasionally hang and eventually fail
>>>> with a timeout:
>>>> =====
>>>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>>>> Test resize CQ, start with specific value and then increase and decrease
>>>> ----------------------------------------------------------------------
>>>> Traceback (most recent call last):
>>>> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
>>>> u.poll_cq(self.client.cq)
>>>> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>>>> wcs = _poll_cq(cq, count, data)
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>>>> raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
>>>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
>>>> remaining)
>>>> =====
>>>>
>>>> The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
>>>> being temporarily full, and the CQE is effectively lost. To mitigate this,
>>>> add a bounded busy-wait with fallback rescheduling so that CQE does not get
>>>> lost.
>>>>
>>>> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
>>>> ---
>>>> drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
>>>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/ sw/rxe/rxe_cq.c
>>>> index fffd144d509e..7b0fba63204e 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
>>>> @@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int cqe,
>>>> /* caller holds reference to cq */
>>>> int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
>>>> {
>>>> + unsigned long flags;
>>>> + u32 spin_cnt = 3000;
>>>> struct ib_event ev;
>>>> - int full;
>>>> void *addr;
>>>> - unsigned long flags;
>>>> + int full;
>>>> spin_lock_irqsave(&cq->cq_lock, flags);
>>>> full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>>> + if (likely(!full))
>>>> + goto post_queue;
>>>> +
>>>> + /* constant backoff until queue is ready */
>>>> + while (spin_cnt--) {
>>>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>>> + if (!full)
>>>> + goto post_queue;
>>>> +
>>>> + cpu_relax();
>>>> + }
>>>
>>> The loop runs 3000 times.
>>> Each iteration:
>>>
>>> Checks queue_full()
>>> Executes cpu_relax()
>>>
>>> On modern CPUs, each iteration may take a few cycles, e.g., 4–10 cycles per iteration (depends on memory/cache).
>>>
>>> Suppose 1 cycle = ~0.3 ns on a 3 GHz CPU, 10 cycles ≈ 3 ns
>>> 3000 iterations × 10 cycles ≈ 30,000 cycles
>>>
>>> 30000 cycles * 0.3 ns = 9000 ns = 9 microseconds
>>>
>>> So the “critical section” while spinning is tens of microseconds, not milliseconds.
>>>
>>> I was concerned that 3000 iterations might make the spin lock critical section too long, but based on the analysis above, it appears that this is still a short-duration critical section.
>>
>> Thank you for the review.
>>
>> Assuming the two loads in queue_full() hit in the L1 cache, I estimate each iteration could take around
>> 15–20 cycles. Based on your calculation, the maximum total time would be approximately 18 microseconds.
>
> ======================================================================
> ERROR: test_rdmacm_async_write (tests.test_rdmacm.CMTestCase)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "/..../rdma-core/tests/test_rdmacm.py", line 71, in test_rdmacm_async_write
> self.two_nodes_rdmacm_traffic(CMAsyncConnection,
> File "/..../rdma-core/tests/base.py", line 447, in two_nodes_rdmacm_traffic
> raise Exception('Exception in active/passive side occurred')
> Exception: Exception in active/passive side occurred
>
> After appying your commit, I run the following run_tests.py for 10000 times.
> The above error sometimes will appear. The frequency is very low.
>
> "
> for (( i = 0; i < 10000; i++ ))
> do
> rdma-core/build/bin/run_tests.py --dev rxe0
> done
> "
> It is weird.
I tried running test_rdmacm_async_write alone 50000 times, but could not
reproduce this one.
There have been multiple latency-related issues in RXE, so it is not
surprising that a new one is uncovered by changing a seemingly unrelated
part.
How about applying the additional change below:
===
diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index 7b0fba63204e..8f8d56051b8d 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -102,7 +102,9 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
if (!full)
goto post_queue;
+ spin_unlock_irqrestore(&cq->cq_lock, flags);
cpu_relax();
+ spin_lock_irqsave(&cq->cq_lock, flags);
}
/* try giving up cpu and retry */
===
This makes cpu_relax() almost meaningless, but it ensures the lock is released in each iteration.
It would be nice if you could report the failure frequency and whether the test takes longer than usual in the failure cases.
I think that could be a helpful starting point for finding a solution.
Thanks,
Daisuke
>
> Yanjun.Zhu
>
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-23 4:19 ` Daisuke Matsuda
@ 2025-08-23 5:22 ` Zhu Yanjun
0 siblings, 0 replies; 9+ messages in thread
From: Zhu Yanjun @ 2025-08-23 5:22 UTC (permalink / raw)
To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000; +Cc: philipp.reisner
On 2025/8/22 21:19, Daisuke Matsuda wrote:
> On 2025/08/21 12:12, Zhu Yanjun wrote:
>> On 2025/8/19 8:15, Daisuke Matsuda wrote:
>>> On 2025/08/18 13:44, Zhu Yanjun wrote:
>>>> On 2025/8/17 5:37, Daisuke Matsuda wrote:
>>>>> When running the test_resize_cq testcase from rdma-core, polling a
>>>>> completion queue from userspace may occasionally hang and
>>>>> eventually fail
>>>>> with a timeout:
>>>>> =====
>>>>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>>>>> Test resize CQ, start with specific value and then increase and
>>>>> decrease
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> Traceback (most recent call last):
>>>>> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in
>>>>> test_resize_cq
>>>>> u.poll_cq(self.client.cq)
>>>>> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>>>>> wcs = _poll_cq(cq, count, data)
>>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>>>>> raise PyverbsError(f'Got timeout on polling ({count} CQEs
>>>>> remaining)')
>>>>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
>>>>> remaining)
>>>>> =====
>>>>>
>>>>> The issue is caused when rxe_cq_post() fails to post a CQE due to
>>>>> the queue
>>>>> being temporarily full, and the CQE is effectively lost. To
>>>>> mitigate this,
>>>>> add a bounded busy-wait with fallback rescheduling so that CQE
>>>>> does not get
>>>>> lost.
>>>>>
>>>>> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com>
>>>>> ---
>>>>> drivers/infiniband/sw/rxe/rxe_cq.c | 27 +++++++++++++++++++++++++--
>>>>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c
>>>>> b/drivers/infiniband/ sw/rxe/rxe_cq.c
>>>>> index fffd144d509e..7b0fba63204e 100644
>>>>> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
>>>>> @@ -84,14 +84,36 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int
>>>>> cqe,
>>>>> /* caller holds reference to cq */
>>>>> int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int
>>>>> solicited)
>>>>> {
>>>>> + unsigned long flags;
>>>>> + u32 spin_cnt = 3000;
>>>>> struct ib_event ev;
>>>>> - int full;
>>>>> void *addr;
>>>>> - unsigned long flags;
>>>>> + int full;
>>>>> spin_lock_irqsave(&cq->cq_lock, flags);
>>>>> full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>>>> + if (likely(!full))
>>>>> + goto post_queue;
>>>>> +
>>>>> + /* constant backoff until queue is ready */
>>>>> + while (spin_cnt--) {
>>>>> + full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>>>> + if (!full)
>>>>> + goto post_queue;
>>>>> +
>>>>> + cpu_relax();
>>>>> + }
>>>>
>>>> The loop runs 3000 times.
>>>> Each iteration:
>>>>
>>>> Checks queue_full()
>>>> Executes cpu_relax()
>>>>
>>>> On modern CPUs, each iteration may take a few cycles, e.g., 4–10
>>>> cycles per iteration (depends on memory/cache).
>>>>
>>>> Suppose 1 cycle = ~0.3 ns on a 3 GHz CPU, 10 cycles ≈ 3 ns
>>>> 3000 iterations × 10 cycles ≈ 30,000 cycles
>>>>
>>>> 30000 cycles * 0.3 ns = 9000 ns = 9 microseconds
>>>>
>>>> So the “critical section” while spinning is tens of microseconds,
>>>> not milliseconds.
>>>>
>>>> I was concerned that 3000 iterations might make the spin lock
>>>> critical section too long, but based on the analysis above, it
>>>> appears that this is still a short-duration critical section.
>>>
>>> Thank you for the review.
>>>
>>> Assuming the two loads in queue_full() hit in the L1 cache, I
>>> estimate each iteration could take around
>>> 15–20 cycles. Based on your calculation, the maximum total time
>>> would be approximately 18 microseconds.
>>
>> ======================================================================
>> ERROR: test_rdmacm_async_write (tests.test_rdmacm.CMTestCase)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>> File "/..../rdma-core/tests/test_rdmacm.py", line 71, in
>> test_rdmacm_async_write
>> self.two_nodes_rdmacm_traffic(CMAsyncConnection,
>> File "/..../rdma-core/tests/base.py", line 447, in
>> two_nodes_rdmacm_traffic
>> raise Exception('Exception in active/passive side occurred')
>> Exception: Exception in active/passive side occurred
>>
>> After appying your commit, I run the following run_tests.py for 10000
>> times.
>> The above error sometimes will appear. The frequency is very low.
>>
>> "
>> for (( i = 0; i < 10000; i++ ))
>> do
>> rdma-core/build/bin/run_tests.py --dev rxe0
>> done
>> "
>> It is weird.
>
> I tried running test_rdmacm_async_write alone for 50000 times, but
> could not reproduce this one.
> There have been multiple latency-related issues in RXE, so it is not
> surprising a new one is
> uncovered by changing seemingly irrelevant part.
>
> How about applying additional change below:
> ===
> diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c
> b/drivers/infiniband/sw/rxe/rxe_cq.c
> index 7b0fba63204e..8f8d56051b8d 100644
> --- a/drivers/infiniband/sw/rxe/rxe_cq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_cq.c
> @@ -102,7 +102,9 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe
> *cqe, int solicited)
> if (!full)
> goto post_queue;
>
> + spin_unlock_irqrestore(&cq->cq_lock, flags);
> cpu_relax();
> + spin_lock_irqsave(&cq->cq_lock, flags);
> }
>
> /* try giving up cpu and retry */
> ===
> This makes cpu_relax() almost meaningless, but ensures the lock is
> released in each iteration.
>
> It would be nice if you could provide the frequency and whether it
> takes longer than usual in failure cases.
> I think that could be helpful as a starting point to find a solution.
With a clean KVM QEMU VM, after applying your commit, the same problem
occurs every time the above script is run.
Yanjun.Zhu
>
> Thanks,
> Daisuke
>
>>
>> Yanjun.Zhu
>>
>
>
--
Best Regards,
Yanjun.Zhu
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-17 12:37 [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize Daisuke Matsuda
2025-08-18 4:44 ` Zhu Yanjun
@ 2025-08-25 18:10 ` Jason Gunthorpe
2025-08-27 11:14 ` Daisuke Matsuda
1 sibling, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2025-08-25 18:10 UTC (permalink / raw)
To: Daisuke Matsuda; +Cc: linux-rdma, leon, zyjzyj2000, yanjun.zhu, philipp.reisner
On Sun, Aug 17, 2025 at 12:37:52PM +0000, Daisuke Matsuda wrote:
> When running the test_resize_cq testcase from rdma-core, polling a
> completion queue from userspace may occasionally hang and eventually fail
> with a timeout:
> =====
> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
> Test resize CQ, start with specific value and then increase and decrease
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
> u.poll_cq(self.client.cq)
> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
> wcs = _poll_cq(cq, count, data)
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
> raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
> remaining)
> =====
>
> The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
> being temporarily full, and the CQE is effectively lost. To mitigate this,
> add a bounded busy-wait with fallback rescheduling so that CQE does not get
> lost.
Nothing should spin like this; that is not right.
The CQE queue is intended to be properly sized for the workload, and I
seem to remember that overflowing a CQ can just ERR the QP.
Alternatively you can drop the packet and do nothing.
But not spin hoping something empties it.
Jason
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-25 18:10 ` Jason Gunthorpe
@ 2025-08-27 11:14 ` Daisuke Matsuda
2025-08-27 12:04 ` Jason Gunthorpe
0 siblings, 1 reply; 9+ messages in thread
From: Daisuke Matsuda @ 2025-08-27 11:14 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: linux-rdma, leon, zyjzyj2000, yanjun.zhu, philipp.reisner
On 2025/08/26 3:10, Jason Gunthorpe wrote:
> On Sun, Aug 17, 2025 at 12:37:52PM +0000, Daisuke Matsuda wrote:
>> When running the test_resize_cq testcase from rdma-core, polling a
>> completion queue from userspace may occasionally hang and eventually fail
>> with a timeout:
>> =====
>> ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
>> Test resize CQ, start with specific value and then increase and decrease
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>> File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
>> u.poll_cq(self.client.cq)
>> File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
>> wcs = _poll_cq(cq, count, data)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^
>> File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
>> raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
>> pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
>> remaining)
>> =====
>>
>> The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
>> being temporarily full, and the CQE is effectively lost. To mitigate this,
>> add a bounded busy-wait with fallback rescheduling so that CQE does not get
>> lost.
>
> Nothing should spin like this, that is not right.
>
> The CQE queue is intended to be properly sized for the workload and I
> seem to remember overflowing a CQE can just ERR the QP.
>
> Alternatively you can drop the packet and do nothing.
>
> But not spin hoping something empties it.
>
> Jason
Okay, please drop this patch.
In a sense, the failure indicates that RXE is behaving as intended.
This issue seems to have always existed, though its frequency appears to vary over time.
Perhaps the switch from tasklet to workqueue introduced additional latency that influences this behavior.
Thanks,
Daisuke
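The two alternatives Jason suggests above — move the QP to the error state on overflow, or drop the completion and do nothing — can be sketched the same way (again an illustrative Python model with hypothetical names, not rxe code):

```python
# Illustrative model of the two CQ-overflow policies discussed above:
# flag a fatal error on overflow, or silently drop the CQE.
from collections import deque


class OverflowCQ:
    def __init__(self, depth, err_on_overflow=True):
        self.depth = depth
        self.err_on_overflow = err_on_overflow
        self.ring = deque()
        self.qp_in_error = False  # models "ERR the QP" on overflow
        self.dropped = 0          # models "drop the packet and do nothing"

    def post(self, cqe):
        if len(self.ring) < self.depth:
            self.ring.append(cqe)
            return True
        if self.err_on_overflow:
            self.qp_in_error = True  # overflow is treated as fatal
        else:
            self.dropped += 1        # completion silently discarded
        return False
```

Either policy keeps the poster from blocking: overflow becomes a visible error (or an accounted drop) instead of a wait, which matches the position that a CQ must be sized for its workload in the first place.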
* Re: [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize
2025-08-27 11:14 ` Daisuke Matsuda
@ 2025-08-27 12:04 ` Jason Gunthorpe
0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2025-08-27 12:04 UTC (permalink / raw)
To: Daisuke Matsuda; +Cc: linux-rdma, leon, zyjzyj2000, yanjun.zhu, philipp.reisner
On Wed, Aug 27, 2025 at 08:14:53PM +0900, Daisuke Matsuda wrote:
> On 2025/08/26 3:10, Jason Gunthorpe wrote:
> > On Sun, Aug 17, 2025 at 12:37:52PM +0000, Daisuke Matsuda wrote:
> > > When running the test_resize_cq testcase from rdma-core, polling a
> > > completion queue from userspace may occasionally hang and eventually fail
> > > with a timeout:
> > > =====
> > > ERROR: test_resize_cq (tests.test_cq.CQTest.test_resize_cq)
> > > Test resize CQ, start with specific value and then increase and decrease
> > > ----------------------------------------------------------------------
> > > Traceback (most recent call last):
> > > File "/root/deb/rdma-core/tests/test_cq.py", line 135, in test_resize_cq
> > > u.poll_cq(self.client.cq)
> > > File "/root/deb/rdma-core/tests/utils.py", line 687, in poll_cq
> > > wcs = _poll_cq(cq, count, data)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^
> > > File "/root/deb/rdma-core/tests/utils.py", line 669, in _poll_cq
> > > raise PyverbsError(f'Got timeout on polling ({count} CQEs remaining)')
> > > pyverbs.pyverbs_error.PyverbsError: Got timeout on polling (1 CQEs
> > > remaining)
> > > =====
> > >
> > > The issue is caused when rxe_cq_post() fails to post a CQE due to the queue
> > > being temporarily full, and the CQE is effectively lost. To mitigate this,
> > > add a bounded busy-wait with fallback rescheduling so that CQE does not get
> > > lost.
> >
> > Nothing should spin like this, that is not right.
> >
> > The CQE queue is intended to be properly sized for the workload and I
> > seem to remember overflowing a CQE can just ERR the QP.
> >
> > Alternatively you can drop the packet and do nothing.
> >
> > But not spin hoping something empties it.
> >
> > Jason
>
> Okay, please drop this patch.
> In a sense, the failure indicates that RXE is behaving as intended.
>
> This issue seems to have always existed, though its frequency appears to vary over time.
> Perhaps the switch from tasklet to workqueue introduced additional latency that influences this behavior.
I'm not sure the test suite should be hitting cq overflows though, it
should have properly sized cqs for what it is doing.
Something sounds wrong.
Jason
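A rough worst-case sizing rule for "properly sized cqs" can be written down in a few lines (my own rule-of-thumb sketch, not from the thread: it assumes every outstanding send and receive work request on every QP sharing the CQ could complete before a single poll, which is conservative when the SQ and RQ use separate CQs):

```python
# Conservative lower bound on CQ depth for a set of QPs sharing one CQ:
# every outstanding WR could complete before the application polls, so
# the CQ must be able to hold all of them at once.

def min_cq_depth(qps):
    """qps: iterable of (send_depth, recv_depth) pairs, one per QP."""
    return sum(send + recv for send, recv in qps)
```

For example, two QPs with send/recv depths (16, 16) and (8, 4) sharing one CQ would need at least 44 entries under this rule; resizing the CQ below that while WRs are in flight is exactly the situation where overflow becomes possible.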
end of thread, other threads:[~2025-08-27 12:05 UTC | newest]
Thread overview: 9+ messages
2025-08-17 12:37 [PATCH for-rc v1] RDMA/rxe: Avoid CQ polling hang triggered by CQ resize Daisuke Matsuda
2025-08-18 4:44 ` Zhu Yanjun
2025-08-19 15:15 ` Daisuke Matsuda
2025-08-21 3:12 ` Zhu Yanjun
2025-08-23 4:19 ` Daisuke Matsuda
2025-08-23 5:22 ` Zhu Yanjun
2025-08-25 18:10 ` Jason Gunthorpe
2025-08-27 11:14 ` Daisuke Matsuda
2025-08-27 12:04 ` Jason Gunthorpe