linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
@ 2025-07-03 18:23 Michael Margolin
  2025-07-06  7:25 ` Leon Romanovsky
  2025-08-25 14:56 ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: Michael Margolin @ 2025-07-03 18:23 UTC (permalink / raw)
  To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman, Yonatan Nachum

Add command id to the printed message for additional debug information.

Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
---
 drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c
index bafd210dd43e..e6377602a9c4 100644
--- a/drivers/infiniband/hw/efa/efa_com.c
+++ b/drivers/infiniband/hw/efa/efa_com.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
 /*
- * Copyright 2018-2024 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2025 Amazon.com, Inc. or its affiliates. All rights reserved.
  */
 
 #include "efa_com.h"
@@ -30,6 +30,7 @@ struct efa_comp_ctx {
 	struct efa_admin_acq_entry *user_cqe;
 	u32 comp_size;
 	enum efa_cmd_status status;
+	u16 cmd_id;
 	u8 cmd_opcode;
 	u8 occupied;
 };
@@ -333,6 +334,7 @@ static struct efa_comp_ctx *__efa_com_submit_admin_cmd(struct efa_com_admin_queu
 	comp_ctx->comp_size = comp_size_in_bytes;
 	comp_ctx->user_cqe = comp;
 	comp_ctx->cmd_opcode = cmd->aq_common_descriptor.opcode;
+	comp_ctx->cmd_id = cmd_id;
 
 	reinit_completion(&comp_ctx->wait_event);
 
@@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
 		if (comp_ctx->status == EFA_CMD_COMPLETED)
 			ibdev_err_ratelimited(
 				aq->efa_dev,
-				"The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
+				"The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
 				efa_com_cmd_str(comp_ctx->cmd_opcode),
 				comp_ctx->cmd_opcode, comp_ctx->status,
-				comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
+				comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
+				aq->cq.cc);
 		else
 			ibdev_err_ratelimited(
 				aq->efa_dev,
-				"The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
+				"The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
 				efa_com_cmd_str(comp_ctx->cmd_opcode),
 				comp_ctx->cmd_opcode, comp_ctx->status,
-				comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
+				comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
+				aq->cq.cc);
 
 		clear_bit(EFA_AQ_STATE_RUNNING_BIT, &aq->state);
 		err = -ETIME;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin
@ 2025-07-06  7:25 ` Leon Romanovsky
  2025-07-06 16:32   ` Margolin, Michael
  2025-08-25 14:56 ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2025-07-06  7:25 UTC (permalink / raw)
  To: Michael Margolin
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
> Add command id to the printed message for additional debug information.
> 
> Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
> Signed-off-by: Michael Margolin <mrgolin@amazon.com>
> ---
>  drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)

Please don't forget to add changelogs.

> 
> diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c
> index bafd210dd43e..e6377602a9c4 100644
> --- a/drivers/infiniband/hw/efa/efa_com.c
> +++ b/drivers/infiniband/hw/efa/efa_com.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
>  /*
> - * Copyright 2018-2024 Amazon.com, Inc. or its affiliates. All rights reserved.
> + * Copyright 2018-2025 Amazon.com, Inc. or its affiliates. All rights reserved.
>   */

<...>

>  	reinit_completion(&comp_ctx->wait_event);
>  
> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
>  		if (comp_ctx->status == EFA_CMD_COMPLETED)
>  			ibdev_err_ratelimited(
>  				aq->efa_dev,
> -				"The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> +				"The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>  				efa_com_cmd_str(comp_ctx->cmd_opcode),
>  				comp_ctx->cmd_opcode, comp_ctx->status,
> -				comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> +				comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> +				aq->cq.cc);
>  		else
>  			ibdev_err_ratelimited(
>  				aq->efa_dev,
> -				"The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> +				"The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>  				efa_com_cmd_str(comp_ctx->cmd_opcode),
>  				comp_ctx->cmd_opcode, comp_ctx->status,
> -				comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> +				comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> +				aq->cq.cc);

I have very strong feeling that you don't really use these prints in real life.

For example, comp_ctx->cmd_id is printed with %d, while code and comment
around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.

It has a lot of information separated to LSB and MSB bits which are not readable
while printing with %d.

You are also printing comp_ctx->status, which is clear from if/else section.

So no, I don't buy this claim for "additional debug information", while
existing is not used.

Thanks

>  
>  		clear_bit(EFA_AQ_STATE_RUNNING_BIT, &aq->state);
>  		err = -ETIME;
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-06  7:25 ` Leon Romanovsky
@ 2025-07-06 16:32   ` Margolin, Michael
  2025-07-07  6:28     ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Margolin, Michael @ 2025-07-06 16:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum


On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
>>        reinit_completion(&comp_ctx->wait_event);
>>
>> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
>>                if (comp_ctx->status == EFA_CMD_COMPLETED)
>>                        ibdev_err_ratelimited(
>>                                aq->efa_dev,
>> -                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>> +                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>                                efa_com_cmd_str(comp_ctx->cmd_opcode),
>>                                comp_ctx->cmd_opcode, comp_ctx->status,
>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>> +                             aq->cq.cc);
>>                else
>>                        ibdev_err_ratelimited(
>>                                aq->efa_dev,
>> -                             "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>> +                             "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>                                efa_com_cmd_str(comp_ctx->cmd_opcode),
>>                                comp_ctx->cmd_opcode, comp_ctx->status,
>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>> +                             aq->cq.cc);
> I have very strong feeling that you don't really use these prints in real life.
>
> For example, comp_ctx->cmd_id is printed with %d, while code and comment
> around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.
>
> It has a lot of information separated to LSB and MSB bits which are not readable
> while printing with %d.
>
> You are also printing comp_ctx->status, which is clear from if/else section.
>
> So no, I don't buy this claim for "additional debug information", while
> existing is not used.

What do you mean by that?!?

These errors are extremely rare and are not manually reproducible, 
that's why we want to collect as much information as we can when it happens.

I'm less bothered by the format as long as we have the info we need.


Michael

>
> Thanks
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-06 16:32   ` Margolin, Michael
@ 2025-07-07  6:28     ` Leon Romanovsky
  2025-07-07  9:51       ` Margolin, Michael
  0 siblings, 1 reply; 10+ messages in thread
From: Leon Romanovsky @ 2025-07-07  6:28 UTC (permalink / raw)
  To: Margolin, Michael
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote:
> 
> On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > 
> > 
> > 
> > On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
> > >        reinit_completion(&comp_ctx->wait_event);
> > > 
> > > @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
> > >                if (comp_ctx->status == EFA_CMD_COMPLETED)
> > >                        ibdev_err_ratelimited(
> > >                                aq->efa_dev,
> > > -                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > +                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > >                                efa_com_cmd_str(comp_ctx->cmd_opcode),
> > >                                comp_ctx->cmd_opcode, comp_ctx->status,
> > > -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> > > +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> > > +                             aq->cq.cc);
> > >                else
> > >                        ibdev_err_ratelimited(
> > >                                aq->efa_dev,
> > > -                             "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > +                             "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > >                                efa_com_cmd_str(comp_ctx->cmd_opcode),
> > >                                comp_ctx->cmd_opcode, comp_ctx->status,
> > > -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> > > +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> > > +                             aq->cq.cc);
> > I have very strong feeling that you don't really use these prints in real life.
> > 
> > For example, comp_ctx->cmd_id is printed with %d, while code and comment
> > around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.
> > 
> > It has a lot of information separated to LSB and MSB bits which are not readable
> > while printing with %d.
> > 
> > You are also printing comp_ctx->status, which is clear from if/else section.
> > 
> > So no, I don't buy this claim for "additional debug information", while
> > existing is not used.
> 
> What do you mean by that?!?

If you take a close look on the prints, you will see the reasons.
For example, you print comp_ctx->status which can be only 0 or 1,
while it is already clear what its value from the print itself.

> 
> These errors are extremely rare and are not manually reproducible, that's
> why we want to collect as much information as we can when it happens.

Do it out-of-tree, there is no need in upstream code for internal debug
sessions.

> 
> I'm less bothered by the format as long as we have the info we need.

Like I said, it is clear that you never actually relied on this information.
Better if you completely delete these prints and keep them out-of-tree.

Thanks

> 
> 
> Michael
> 
> > 
> > Thanks
> > 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-07  6:28     ` Leon Romanovsky
@ 2025-07-07  9:51       ` Margolin, Michael
  2025-07-07 10:28         ` Leon Romanovsky
  0 siblings, 1 reply; 10+ messages in thread
From: Margolin, Michael @ 2025-07-07  9:51 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum


On 7/7/2025 9:28 AM, Leon Romanovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote:
>> On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>
>>>
>>>
>>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
>>>>         reinit_completion(&comp_ctx->wait_event);
>>>>
>>>> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
>>>>                 if (comp_ctx->status == EFA_CMD_COMPLETED)
>>>>                         ibdev_err_ratelimited(
>>>>                                 aq->efa_dev,
>>>> -                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>> +                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>                                 efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>                                 comp_ctx->cmd_opcode, comp_ctx->status,
>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>>>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>>>> +                             aq->cq.cc);
>>>>                 else
>>>>                         ibdev_err_ratelimited(
>>>>                                 aq->efa_dev,
>>>> -                             "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>> +                             "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>                                 efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>                                 comp_ctx->cmd_opcode, comp_ctx->status,
>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>>>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>>>> +                             aq->cq.cc);
>>> I have very strong feeling that you don't really use these prints in real life.
>>>
>>> For example, comp_ctx->cmd_id is printed with %d, while code and comment
>>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.
>>>
>>> It has a lot of information separated to LSB and MSB bits which are not readable
>>> while printing with %d.
>>>
>>> You are also printing comp_ctx->status, which is clear from if/else section.
>>>
>>> So no, I don't buy this claim for "additional debug information", while
>>> existing is not used.
>> What do you mean by that?!?
> If you take a close look on the prints, you will see the reasons.
> For example, you print comp_ctx->status which can be only 0 or 1,
> while it is already clear what its value from the print itself.
>
>> These errors are extremely rare and are not manually reproducible, that's
>> why we want to collect as much information as we can when it happens.
> Do it out-of-tree, there is no need in upstream code for internal debug
> sessions.

It's not for internal debug, it is used in production. Why would I 
upstream internal debug prints?

>> I'm less bothered by the format as long as we have the info we need.
> Like I said, it is clear that you never actually relied on this information.
> Better if you completely delete these prints and keep them out-of-tree.

Not sure that I follow, can you please elaborate on what "relying" means 
from your POV?

Michael


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-07  9:51       ` Margolin, Michael
@ 2025-07-07 10:28         ` Leon Romanovsky
  2025-07-07 11:18           ` Margolin, Michael
  2025-07-07 11:30           ` Gal Pressman
  0 siblings, 2 replies; 10+ messages in thread
From: Leon Romanovsky @ 2025-07-07 10:28 UTC (permalink / raw)
  To: Margolin, Michael
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote:
> 
> On 7/7/2025 9:28 AM, Leon Romanovsky wrote:
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > 
> > 
> > 
> > On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote:
> > > On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
> > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > > 
> > > > 
> > > > 
> > > > On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
> > > > >         reinit_completion(&comp_ctx->wait_event);
> > > > > 
> > > > > @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
> > > > >                 if (comp_ctx->status == EFA_CMD_COMPLETED)
> > > > >                         ibdev_err_ratelimited(
> > > > >                                 aq->efa_dev,
> > > > > -                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > > > +                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > > >                                 efa_com_cmd_str(comp_ctx->cmd_opcode),
> > > > >                                 comp_ctx->cmd_opcode, comp_ctx->status,
> > > > > -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> > > > > +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> > > > > +                             aq->cq.cc);
> > > > >                 else
> > > > >                         ibdev_err_ratelimited(
> > > > >                                 aq->efa_dev,
> > > > > -                             "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > > > +                             "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
> > > > >                                 efa_com_cmd_str(comp_ctx->cmd_opcode),
> > > > >                                 comp_ctx->cmd_opcode, comp_ctx->status,
> > > > > -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
> > > > > +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
> > > > > +                             aq->cq.cc);
> > > > I have very strong feeling that you don't really use these prints in real life.
> > > > 
> > > > For example, comp_ctx->cmd_id is printed with %d, while code and comment
> > > > around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.
> > > > 
> > > > It has a lot of information separated to LSB and MSB bits which are not readable
> > > > while printing with %d.
> > > > 
> > > > You are also printing comp_ctx->status, which is clear from if/else section.
> > > > 
> > > > So no, I don't buy this claim for "additional debug information", while
> > > > existing is not used.
> > > What do you mean by that?!?
> > If you take a close look on the prints, you will see the reasons.
> > For example, you print comp_ctx->status which can be only 0 or 1,
> > while it is already clear what its value from the print itself.
> > 
> > > These errors are extremely rare and are not manually reproducible, that's
> > > why we want to collect as much information as we can when it happens.
> > Do it out-of-tree, there is no need in upstream code for internal debug
> > sessions.
> 
> It's not for internal debug, it is used in production. Why would I upstream
> internal debug prints?

It is used in internal cloud for the NICs not available to the rest of
the world. So yes, it is your internal debug print.

> 
> > > I'm less bothered by the format as long as we have the info we need.
> > Like I said, it is clear that you never actually relied on this information.
> > Better if you completely delete these prints and keep them out-of-tree.
> 
> Not sure that I follow, can you please elaborate on what "relying" means
> from your POV?

"relied" == "used"

> 
> Michael
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-07 10:28         ` Leon Romanovsky
@ 2025-07-07 11:18           ` Margolin, Michael
  2025-08-18  7:55             ` Margolin, Michael
  2025-07-07 11:30           ` Gal Pressman
  1 sibling, 1 reply; 10+ messages in thread
From: Margolin, Michael @ 2025-07-07 11:18 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum


On 7/7/2025 1:28 PM, Leon Romanovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote:
>> On 7/7/2025 9:28 AM, Leon Romanovsky wrote:
>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>
>>>
>>>
>>> On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote:
>>>> On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
>>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
>>>>>>          reinit_completion(&comp_ctx->wait_event);
>>>>>>
>>>>>> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com
>>>>>>                  if (comp_ctx->status == EFA_CMD_COMPLETED)
>>>>>>                          ibdev_err_ratelimited(
>>>>>>                                  aq->efa_dev,
>>>>>> -                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>> +                             "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>>                                  efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>>>                                  comp_ctx->cmd_opcode, comp_ctx->status,
>>>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>>>>>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>>>>>> +                             aq->cq.cc);
>>>>>>                  else
>>>>>>                          ibdev_err_ratelimited(
>>>>>>                                  aq->efa_dev,
>>>>>> -                             "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>> +                             "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>>                                  efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>>>                                  comp_ctx->cmd_opcode, comp_ctx->status,
>>>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc);
>>>>>> +                             comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc,
>>>>>> +                             aq->cq.cc);
>>>>> I have very strong feeling that you don't really use these prints in real life.
>>>>>
>>>>> For example, comp_ctx->cmd_id is printed with %d, while code and comment
>>>>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X.
>>>>>
>>>>> It has a lot of information separated to LSB and MSB bits which are not readable
>>>>> while printing with %d.
>>>>>
>>>>> You are also printing comp_ctx->status, which is clear from if/else section.
>>>>>
>>>>> So no, I don't buy this claim for "additional debug information", while
>>>>> existing is not used.
>>>> What do you mean by that?!?
>>> If you take a close look on the prints, you will see the reasons.
>>> For example, you print comp_ctx->status which can be only 0 or 1,
>>> while it is already clear what its value from the print itself.
>>>
>>>> These errors are extremely rare and are not manually reproducible, that's
>>>> why we want to collect as much information as we can when it happens.
>>> Do it out-of-tree, there is no need in upstream code for internal debug
>>> sessions.
>> It's not for internal debug, it is used in production. Why would I upstream
>> internal debug prints?
> It is used in internal cloud for the NICs not available to the rest of
> the world. So yes, it is your internal debug print.
>>>> I'm less bothered by the format as long as we have the info we need.
>>> Like I said, it is clear that you never actually relied on this information.
>>> Better if you completely delete these prints and keep them out-of-tree.
>> Not sure that I follow, can you please elaborate on what "relying" means
>> from your POV?
> "relied" == "used"
>
The NIC might be available only in AWS but customers decide what Linux 
versions they use. Many of our customers choose to use upstream versions 
and avoid taking debug or modified versions.

You are right saying the appearance for this print is rare, that's why 
as you pointed out, it isn't perfectly formatted. But this is since the 
issue rarely reproduce, what is exactly the reason we need the print 
in-place when it does.

I can definitely improve the styling if it makes things better.

Michael


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-07 10:28         ` Leon Romanovsky
  2025-07-07 11:18           ` Margolin, Michael
@ 2025-07-07 11:30           ` Gal Pressman
  1 sibling, 0 replies; 10+ messages in thread
From: Gal Pressman @ 2025-07-07 11:30 UTC (permalink / raw)
  To: Leon Romanovsky, Margolin, Michael
  Cc: jgg, linux-rdma, sleybo, matua, Yonatan Nachum

On 07/07/2025 13:28, Leon Romanovsky wrote:
>> It's not for internal debug, it is used in production. Why would I upstream
>> internal debug prints?
> 
> It is used in internal cloud for the NICs not available to the rest of
> the world. So yes, it is your internal debug print.

It is available to literally anyone in the world through AWS, with
upstream kernel.

You may like or dislike the patch, but this argument has nothing to do
with anything.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-07 11:18           ` Margolin, Michael
@ 2025-08-18  7:55             ` Margolin, Michael
  0 siblings, 0 replies; 10+ messages in thread
From: Margolin, Michael @ 2025-08-18  7:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum


On 7/7/2025 2:18 PM, Margolin, Michael wrote:
>
> On 7/7/2025 1:28 PM, Leon Romanovsky wrote:
>> CAUTION: This email originated from outside of the organization. Do 
>> not click links or open attachments unless you can confirm the sender 
>> and know the content is safe.
>>
>>
>>
>> On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote:
>>> On 7/7/2025 9:28 AM, Leon Romanovsky wrote:
>>>> CAUTION: This email originated from outside of the organization. Do 
>>>> not click links or open attachments unless you can confirm the 
>>>> sender and know the content is safe.
>>>>
>>>>
>>>>
>>>> On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote:
>>>>> On 7/6/2025 10:25 AM, Leon Romanovsky wrote:
>>>>>> CAUTION: This email originated from outside of the organization. 
>>>>>> Do not click links or open attachments unless you can confirm the 
>>>>>> sender and know the content is safe.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
>>>>>>> reinit_completion(&comp_ctx->wait_event);
>>>>>>>
>>>>>>> @@ -557,17 +559,19 @@ static int 
>>>>>>> efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx 
>>>>>>> *com
>>>>>>>                  if (comp_ctx->status == EFA_CMD_COMPLETED)
>>>>>>>                          ibdev_err_ratelimited(
>>>>>>>                                  aq->efa_dev,
>>>>>>> -                             "The device sent a completion but 
>>>>>>> the driver didn't receive any MSI-X interrupt for admin cmd 
>>>>>>> %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, 
>>>>>>> cq consumer: %d)\n",
>>>>>>> +                             "The device sent a completion but 
>>>>>>> the driver didn't receive any MSI-X interrupt for admin cmd 
>>>>>>> %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq 
>>>>>>> consumer: %d)\n",
>>>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>>>> comp_ctx->cmd_opcode, comp_ctx->status,
>>>>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, 
>>>>>>> aq->cq.cc);
>>>>>>> +                             comp_ctx->cmd_id, aq->sq.pc, 
>>>>>>> aq->sq.cc,
>>>>>>> +                             aq->cq.cc);
>>>>>>>                  else
>>>>>>>                          ibdev_err_ratelimited(
>>>>>>>                                  aq->efa_dev,
>>>>>>> -                             "The device didn't send any 
>>>>>>> completion for admin cmd %s(%d) status %d (ctx 0x%p, sq 
>>>>>>> producer: %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>>> +                             "The device didn't send any 
>>>>>>> completion for admin cmd %s(%d) status %d (id: %d, sq producer: 
>>>>>>> %d, sq consumer: %d, cq consumer: %d)\n",
>>>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode),
>>>>>>> comp_ctx->cmd_opcode, comp_ctx->status,
>>>>>>> -                             comp_ctx, aq->sq.pc, aq->sq.cc, 
>>>>>>> aq->cq.cc);
>>>>>>> +                             comp_ctx->cmd_id, aq->sq.pc, 
>>>>>>> aq->sq.cc,
>>>>>>> +                             aq->cq.cc);
>>>>>> I have very strong feeling that you don't really use these prints 
>>>>>> in real life.
>>>>>>
>>>>>> For example, comp_ctx->cmd_id is printed with %d, while code and 
>>>>>> comment
>>>>>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it 
>>>>>> needs to be 0x%X.
>>>>>>
>>>>>> It has a lot of information separated to LSB and MSB bits which 
>>>>>> are not readable
>>>>>> while printing with %d.
>>>>>>
>>>>>> You are also printing comp_ctx->status, which is clear from 
>>>>>> if/else section.
>>>>>>
>>>>>> So no, I don't buy this claim for "additional debug information", 
>>>>>> while
>>>>>> existing is not used.
>>>>> What do you mean by that?!?
>>>> If you take a close look on the prints, you will see the reasons.
>>>> For example, you print comp_ctx->status which can be only 0 or 1,
>>>> while it is already clear what its value from the print itself.
>>>>
>>>>> These errors are extremely rare and are not manually reproducible, 
>>>>> that's
>>>>> why we want to collect as much information as we can when it happens.
>>>> Do it out-of-tree, there is no need in upstream code for internal 
>>>> debug
>>>> sessions.
>>> It's not for internal debug, it is used in production. Why would I 
>>> upstream
>>> internal debug prints?
>> It is used in internal cloud for the NICs not available to the rest of
>> the world. So yes, it is your internal debug print.
>>>>> I'm less bothered by the format as long as we have the info we need.
>>>> Like I said, it is clear that you never actually relied on this 
>>>> information.
>>>> Better if you completely delete these prints and keep them 
>>>> out-of-tree.
>>> Not sure that I follow, can you please elaborate on what "relying" 
>>> means
>>> from your POV?
>> "relied" == "used"
>>
> The NIC might be available only in AWS but customers decide what Linux 
> versions they use. Many of our customers choose to use upstream 
> versions and avoid taking debug or modified versions.
>
> You are right saying the appearance for this print is rare, that's why 
> as you pointed out, it isn't perfectly formatted. But this is since 
> the issue rarely reproduce, what is exactly the reason we need the 
> print in-place when it does.
>
> I can definitely improve the styling if it makes things better.
>
> Michael
>
Leon, do you have any additional concerns regarding this patch? It's 
more than 6 weeks in the pipe now.

We are asking customers to use patched drivers and we shouldn't be there.


Michael


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print
  2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin
  2025-07-06  7:25 ` Leon Romanovsky
@ 2025-08-25 14:56 ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2025-08-25 14:56 UTC (permalink / raw)
  To: Michael Margolin
  Cc: leon, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum

On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote:
> Add command id to the printed message for additional debug information.
> 
> Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
> Signed-off-by: Michael Margolin <mrgolin@amazon.com>
> ---
>  drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)

I picked this up, I think we can debate what level of debug prints or
traces are appropriate next someone adds them, but tweaking an
existing one seems fine to me.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-08-25 14:56 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin
2025-07-06  7:25 ` Leon Romanovsky
2025-07-06 16:32   ` Margolin, Michael
2025-07-07  6:28     ` Leon Romanovsky
2025-07-07  9:51       ` Margolin, Michael
2025-07-07 10:28         ` Leon Romanovsky
2025-07-07 11:18           ` Margolin, Michael
2025-08-18  7:55             ` Margolin, Michael
2025-07-07 11:30           ` Gal Pressman
2025-08-25 14:56 ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).