* [PATCH for-next v2] RDMA/efa: Extend admin timeout error print @ 2025-07-03 18:23 Michael Margolin 2025-07-06 7:25 ` Leon Romanovsky 2025-08-25 14:56 ` Jason Gunthorpe 0 siblings, 2 replies; 10+ messages in thread From: Michael Margolin @ 2025-07-03 18:23 UTC (permalink / raw) To: jgg, leon, linux-rdma; +Cc: sleybo, matua, gal.pressman, Yonatan Nachum Add command id to the printed message for additional debug information. Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> --- drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c index bafd210dd43e..e6377602a9c4 100644 --- a/drivers/infiniband/hw/efa/efa_com.c +++ b/drivers/infiniband/hw/efa/efa_com.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause /* - * Copyright 2018-2024 Amazon.com, Inc. or its affiliates. All rights reserved. + * Copyright 2018-2025 Amazon.com, Inc. or its affiliates. All rights reserved. */ #include "efa_com.h" @@ -30,6 +30,7 @@ struct efa_comp_ctx { struct efa_admin_acq_entry *user_cqe; u32 comp_size; enum efa_cmd_status status; + u16 cmd_id; u8 cmd_opcode; u8 occupied; }; @@ -333,6 +334,7 @@ static struct efa_comp_ctx *__efa_com_submit_admin_cmd(struct efa_com_admin_queu comp_ctx->comp_size = comp_size_in_bytes; comp_ctx->user_cqe = comp; comp_ctx->cmd_opcode = cmd->aq_common_descriptor.opcode; + comp_ctx->cmd_id = cmd_id; reinit_completion(&comp_ctx->wait_event); @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com if (comp_ctx->status == EFA_CMD_COMPLETED) ibdev_err_ratelimited( aq->efa_dev, - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", efa_com_cmd_str(comp_ctx->cmd_opcode), comp_ctx->cmd_opcode, comp_ctx->status, - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, + aq->cq.cc); else ibdev_err_ratelimited( aq->efa_dev, - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", efa_com_cmd_str(comp_ctx->cmd_opcode), comp_ctx->cmd_opcode, comp_ctx->status, - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, + aq->cq.cc); clear_bit(EFA_AQ_STATE_RUNNING_BIT, &aq->state); err = -ETIME; -- 2.47.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin @ 2025-07-06 7:25 ` Leon Romanovsky 2025-07-06 16:32 ` Margolin, Michael 2025-08-25 14:56 ` Jason Gunthorpe 1 sibling, 1 reply; 10+ messages in thread From: Leon Romanovsky @ 2025-07-06 7:25 UTC (permalink / raw) To: Michael Margolin Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: > Add command id to the printed message for additional debug information. > > Reviewed-by: Yonatan Nachum <ynachum@amazon.com> > Signed-off-by: Michael Margolin <mrgolin@amazon.com> > --- > drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) Please don't forget to add changelogs. > > diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c > index bafd210dd43e..e6377602a9c4 100644 > --- a/drivers/infiniband/hw/efa/efa_com.c > +++ b/drivers/infiniband/hw/efa/efa_com.c > @@ -1,6 +1,6 @@ > // SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause > /* > - * Copyright 2018-2024 Amazon.com, Inc. or its affiliates. All rights reserved. > + * Copyright 2018-2025 Amazon.com, Inc. or its affiliates. All rights reserved. > */ <...> > reinit_completion(&comp_ctx->wait_event); > > @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com > if (comp_ctx->status == EFA_CMD_COMPLETED) > ibdev_err_ratelimited( > aq->efa_dev, > - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > efa_com_cmd_str(comp_ctx->cmd_opcode), > comp_ctx->cmd_opcode, comp_ctx->status, > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > + aq->cq.cc); > else > ibdev_err_ratelimited( > aq->efa_dev, > - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > efa_com_cmd_str(comp_ctx->cmd_opcode), > comp_ctx->cmd_opcode, comp_ctx->status, > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > + aq->cq.cc); I have very strong feeling that you don't really use these prints in real life. For example, comp_ctx->cmd_id is printed with %d, while code and comment around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. It has a lot of information separated to LSB and MSB bits which are not readable while printing with %d. You are also printing comp_ctx->status, which is clear from if/else section. So no, I don't buy this claim for "additional debug information", while existing is not used. Thanks > > clear_bit(EFA_AQ_STATE_RUNNING_BIT, &aq->state); > err = -ETIME; > -- > 2.47.1 > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-06 7:25 ` Leon Romanovsky @ 2025-07-06 16:32 ` Margolin, Michael 2025-07-07 6:28 ` Leon Romanovsky 0 siblings, 1 reply; 10+ messages in thread From: Margolin, Michael @ 2025-07-06 16:32 UTC (permalink / raw) To: Leon Romanovsky Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On 7/6/2025 10:25 AM, Leon Romanovsky wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: >> reinit_completion(&comp_ctx->wait_event); >> >> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com >> if (comp_ctx->status == EFA_CMD_COMPLETED) >> ibdev_err_ratelimited( >> aq->efa_dev, >> - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >> + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >> efa_com_cmd_str(comp_ctx->cmd_opcode), >> comp_ctx->cmd_opcode, comp_ctx->status, >> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >> + aq->cq.cc); >> else >> ibdev_err_ratelimited( >> aq->efa_dev, >> - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >> + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >> efa_com_cmd_str(comp_ctx->cmd_opcode), >> comp_ctx->cmd_opcode, comp_ctx->status, >> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >> + aq->cq.cc); > I have very strong feeling that you don't really use these prints in real life. > > For example, comp_ctx->cmd_id is printed with %d, while code and comment > around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. > > It has a lot of information separated to LSB and MSB bits which are not readable > while printing with %d. > > You are also printing comp_ctx->status, which is clear from if/else section. > > So no, I don't buy this claim for "additional debug information", while > existing is not used. What do you mean by that?!? These errors are extremely rare and are not manually reproducible, that's why we want to collect as much information as we can when it happens. I'm less bothered by the format as long as we have the info we need. Michael > > Thanks > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-06 16:32 ` Margolin, Michael @ 2025-07-07 6:28 ` Leon Romanovsky 2025-07-07 9:51 ` Margolin, Michael 0 siblings, 1 reply; 10+ messages in thread From: Leon Romanovsky @ 2025-07-07 6:28 UTC (permalink / raw) To: Margolin, Michael Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote: > > On 7/6/2025 10:25 AM, Leon Romanovsky wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > > > > > On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: > > > reinit_completion(&comp_ctx->wait_event); > > > > > > @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com > > > if (comp_ctx->status == EFA_CMD_COMPLETED) > > > ibdev_err_ratelimited( > > > aq->efa_dev, > > > - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > efa_com_cmd_str(comp_ctx->cmd_opcode), > > > comp_ctx->cmd_opcode, comp_ctx->status, > > > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > > > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > > > + aq->cq.cc); > > > else > > > ibdev_err_ratelimited( > > > aq->efa_dev, > > > - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > efa_com_cmd_str(comp_ctx->cmd_opcode), > > > comp_ctx->cmd_opcode, comp_ctx->status, > > > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > > > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > > > + aq->cq.cc); > > I have very strong feeling that you don't really use these prints in real life. > > > > For example, comp_ctx->cmd_id is printed with %d, while code and comment > > around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. > > > > It has a lot of information separated to LSB and MSB bits which are not readable > > while printing with %d. > > > > You are also printing comp_ctx->status, which is clear from if/else section. > > > > So no, I don't buy this claim for "additional debug information", while > > existing is not used. > > What do you mean by that?!? If you take a close look on the prints, you will see the reasons. For example, you print comp_ctx->status which can be only 0 or 1, while it is already clear what its value from the print itself. > > These errors are extremely rare and are not manually reproducible, that's > why we want to collect as much information as we can when it happens. Do it out-of-tree, there is no need in upstream code for internal debug sessions. > > I'm less bothered by the format as long as we have the info we need. Like I said, it is clear that you never actually relied on this information. Better if you completely delete these prints and keep them out-of-tree. Thanks > > > Michael > > > > > Thanks > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-07 6:28 ` Leon Romanovsky @ 2025-07-07 9:51 ` Margolin, Michael 2025-07-07 10:28 ` Leon Romanovsky 0 siblings, 1 reply; 10+ messages in thread From: Margolin, Michael @ 2025-07-07 9:51 UTC (permalink / raw) To: Leon Romanovsky Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On 7/7/2025 9:28 AM, Leon Romanovsky wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote: >> On 7/6/2025 10:25 AM, Leon Romanovsky wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: >>>> reinit_completion(&comp_ctx->wait_event); >>>> >>>> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com >>>> if (comp_ctx->status == EFA_CMD_COMPLETED) >>>> ibdev_err_ratelimited( >>>> aq->efa_dev, >>>> - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>> + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >>>> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >>>> + aq->cq.cc); >>>> else >>>> ibdev_err_ratelimited( >>>> aq->efa_dev, >>>> - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>> + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >>>> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >>>> + aq->cq.cc); >>> I have very strong feeling that you don't really use these prints in real life. >>> >>> For example, comp_ctx->cmd_id is printed with %d, while code and comment >>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. >>> >>> It has a lot of information separated to LSB and MSB bits which are not readable >>> while printing with %d. >>> >>> You are also printing comp_ctx->status, which is clear from if/else section. >>> >>> So no, I don't buy this claim for "additional debug information", while >>> existing is not used. >> What do you mean by that?!? > If you take a close look on the prints, you will see the reasons. > For example, you print comp_ctx->status which can be only 0 or 1, > while it is already clear what its value from the print itself. > >> These errors are extremely rare and are not manually reproducible, that's >> why we want to collect as much information as we can when it happens. > Do it out-of-tree, there is no need in upstream code for internal debug > sessions. It's not for internal debug, it is used in production. Why would I upstream internal debug prints? >> I'm less bothered by the format as long as we have the info we need. > Like I said, it is clear that you never actually relied on this information. > Better if you completely delete these prints and keep them out-of-tree. Not sure that I follow, can you please elaborate on what "relying" means from your POV? Michael ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-07 9:51 ` Margolin, Michael @ 2025-07-07 10:28 ` Leon Romanovsky 2025-07-07 11:18 ` Margolin, Michael 2025-07-07 11:30 ` Gal Pressman 0 siblings, 2 replies; 10+ messages in thread From: Leon Romanovsky @ 2025-07-07 10:28 UTC (permalink / raw) To: Margolin, Michael Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote: > > On 7/7/2025 9:28 AM, Leon Romanovsky wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > > > > > On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote: > > > On 7/6/2025 10:25 AM, Leon Romanovsky wrote: > > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > > > > > > > > > > > > > On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: > > > > > reinit_completion(&comp_ctx->wait_event); > > > > > > > > > > @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com > > > > > if (comp_ctx->status == EFA_CMD_COMPLETED) > > > > > ibdev_err_ratelimited( > > > > > aq->efa_dev, > > > > > - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > > > + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > > > efa_com_cmd_str(comp_ctx->cmd_opcode), > > > > > comp_ctx->cmd_opcode, comp_ctx->status, > > > > > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > > > > > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > > > > > + aq->cq.cc); > > > > > else > > > > > ibdev_err_ratelimited( > > > > > aq->efa_dev, > > > > > - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > > > + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", > > > > > efa_com_cmd_str(comp_ctx->cmd_opcode), > > > > > comp_ctx->cmd_opcode, comp_ctx->status, > > > > > - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); > > > > > + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, > > > > > + aq->cq.cc); > > > > I have very strong feeling that you don't really use these prints in real life. > > > > > > > > For example, comp_ctx->cmd_id is printed with %d, while code and comment > > > > around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. > > > > > > > > It has a lot of information separated to LSB and MSB bits which are not readable > > > > while printing with %d. > > > > > > > > You are also printing comp_ctx->status, which is clear from if/else section. > > > > > > > > So no, I don't buy this claim for "additional debug information", while > > > > existing is not used. > > > What do you mean by that?!? > > If you take a close look on the prints, you will see the reasons. > > For example, you print comp_ctx->status which can be only 0 or 1, > > while it is already clear what its value from the print itself. > > > > > These errors are extremely rare and are not manually reproducible, that's > > > why we want to collect as much information as we can when it happens. > > Do it out-of-tree, there is no need in upstream code for internal debug > > sessions. > > It's not for internal debug, it is used in production. Why would I upstream > internal debug prints? It is used in internal cloud for the NICs not available to the rest of the world. So yes, it is your internal debug print. > > > > I'm less bothered by the format as long as we have the info we need. > > Like I said, it is clear that you never actually relied on this information. > > Better if you completely delete these prints and keep them out-of-tree. > > Not sure that I follow, can you please elaborate on what "relying" means > from your POV? "relied" == "used" > > Michael > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-07 10:28 ` Leon Romanovsky @ 2025-07-07 11:18 ` Margolin, Michael 2025-08-18 7:55 ` Margolin, Michael 2025-07-07 11:30 ` Gal Pressman 1 sibling, 1 reply; 10+ messages in thread From: Margolin, Michael @ 2025-07-07 11:18 UTC (permalink / raw) To: Leon Romanovsky Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On 7/7/2025 1:28 PM, Leon Romanovsky wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote: >> On 7/7/2025 9:28 AM, Leon Romanovsky wrote: >>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>> >>> >>> >>> On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote: >>>> On 7/6/2025 10:25 AM, Leon Romanovsky wrote: >>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >>>>> >>>>> >>>>> >>>>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: >>>>>> reinit_completion(&comp_ctx->wait_event); >>>>>> >>>>>> @@ -557,17 +559,19 @@ static int efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx *com >>>>>> if (comp_ctx->status == EFA_CMD_COMPLETED) >>>>>> ibdev_err_ratelimited( >>>>>> aq->efa_dev, >>>>>> - "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>>>> + "The device sent a completion but the driver didn't receive any MSI-X interrupt for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>>>> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >>>>>> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >>>>>> + aq->cq.cc); >>>>>> else >>>>>> ibdev_err_ratelimited( >>>>>> aq->efa_dev, >>>>>> - "The device didn't send any completion for admin cmd %s(%d) status %d (ctx 0x%p, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>>>> + "The device didn't send any completion for admin cmd %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>>>> - comp_ctx, aq->sq.pc, aq->sq.cc, aq->cq.cc); >>>>>> + comp_ctx->cmd_id, aq->sq.pc, aq->sq.cc, >>>>>> + aq->cq.cc); >>>>> I have very strong feeling that you don't really use these prints in real life. >>>>> >>>>> For example, comp_ctx->cmd_id is printed with %d, while code and comment >>>>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it needs to be 0x%X. >>>>> >>>>> It has a lot of information separated to LSB and MSB bits which are not readable >>>>> while printing with %d. >>>>> >>>>> You are also printing comp_ctx->status, which is clear from if/else section. >>>>> >>>>> So no, I don't buy this claim for "additional debug information", while >>>>> existing is not used. >>>> What do you mean by that?!? >>> If you take a close look on the prints, you will see the reasons. >>> For example, you print comp_ctx->status which can be only 0 or 1, >>> while it is already clear what its value from the print itself. >>> >>>> These errors are extremely rare and are not manually reproducible, that's >>>> why we want to collect as much information as we can when it happens. >>> Do it out-of-tree, there is no need in upstream code for internal debug >>> sessions. >> It's not for internal debug, it is used in production. Why would I upstream >> internal debug prints? > It is used in internal cloud for the NICs not available to the rest of > the world. So yes, it is your internal debug print. >>>> I'm less bothered by the format as long as we have the info we need. >>> Like I said, it is clear that you never actually relied on this information. >>> Better if you completely delete these prints and keep them out-of-tree. >> Not sure that I follow, can you please elaborate on what "relying" means >> from your POV? > "relied" == "used" > The NIC might be available only in AWS but customers decide what Linux versions they use. Many of our customers choose to use upstream versions and avoid taking debug or modified versions. You are right saying the appearance for this print is rare, that's why as you pointed out, it isn't perfectly formatted. But this is since the issue rarely reproduce, what is exactly the reason we need the print in-place when it does. I can definitely improve the styling if it makes things better. Michael ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-07 11:18 ` Margolin, Michael @ 2025-08-18 7:55 ` Margolin, Michael 0 siblings, 0 replies; 10+ messages in thread From: Margolin, Michael @ 2025-08-18 7:55 UTC (permalink / raw) To: Leon Romanovsky Cc: jgg, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On 7/7/2025 2:18 PM, Margolin, Michael wrote: > > On 7/7/2025 1:28 PM, Leon Romanovsky wrote: >> CAUTION: This email originated from outside of the organization. Do >> not click links or open attachments unless you can confirm the sender >> and know the content is safe. >> >> >> >> On Mon, Jul 07, 2025 at 12:51:40PM +0300, Margolin, Michael wrote: >>> On 7/7/2025 9:28 AM, Leon Romanovsky wrote: >>>> CAUTION: This email originated from outside of the organization. Do >>>> not click links or open attachments unless you can confirm the >>>> sender and know the content is safe. >>>> >>>> >>>> >>>> On Sun, Jul 06, 2025 at 07:32:05PM +0300, Margolin, Michael wrote: >>>>> On 7/6/2025 10:25 AM, Leon Romanovsky wrote: >>>>>> CAUTION: This email originated from outside of the organization. >>>>>> Do not click links or open attachments unless you can confirm the >>>>>> sender and know the content is safe. >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: >>>>>>> reinit_completion(&comp_ctx->wait_event); >>>>>>> >>>>>>> @@ -557,17 +559,19 @@ static int >>>>>>> efa_com_wait_and_process_admin_cq_interrupts(struct efa_comp_ctx >>>>>>> *com >>>>>>> if (comp_ctx->status == EFA_CMD_COMPLETED) >>>>>>> ibdev_err_ratelimited( >>>>>>> aq->efa_dev, >>>>>>> - "The device sent a completion but >>>>>>> the driver didn't receive any MSI-X interrupt for admin cmd >>>>>>> %s(%d) status %d (ctx: 0x%p, sq producer: %d, sq consumer: %d, >>>>>>> cq consumer: %d)\n", >>>>>>> + "The device sent a completion but >>>>>>> the driver didn't receive any MSI-X interrupt for admin cmd >>>>>>> %s(%d) status %d (id: %d, sq producer: %d, sq consumer: %d, cq >>>>>>> consumer: %d)\n", >>>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>>>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>>>>> - comp_ctx, aq->sq.pc, aq->sq.cc, >>>>>>> aq->cq.cc); >>>>>>> + comp_ctx->cmd_id, aq->sq.pc, >>>>>>> aq->sq.cc, >>>>>>> + aq->cq.cc); >>>>>>> else >>>>>>> ibdev_err_ratelimited( >>>>>>> aq->efa_dev, >>>>>>> - "The device didn't send any >>>>>>> completion for admin cmd %s(%d) status %d (ctx 0x%p, sq >>>>>>> producer: %d, sq consumer: %d, cq consumer: %d)\n", >>>>>>> + "The device didn't send any >>>>>>> completion for admin cmd %s(%d) status %d (id: %d, sq producer: >>>>>>> %d, sq consumer: %d, cq consumer: %d)\n", >>>>>>> efa_com_cmd_str(comp_ctx->cmd_opcode), >>>>>>> comp_ctx->cmd_opcode, comp_ctx->status, >>>>>>> - comp_ctx, aq->sq.pc, aq->sq.cc, >>>>>>> aq->cq.cc); >>>>>>> + comp_ctx->cmd_id, aq->sq.pc, >>>>>>> aq->sq.cc, >>>>>>> + aq->cq.cc); >>>>>> I have very strong feeling that you don't really use these prints >>>>>> in real life. >>>>>> >>>>>> For example, comp_ctx->cmd_id is printed with %d, while code and >>>>>> comment >>>>>> around cmd_id in __efa_com_submit_admin_cmd() suggests that it >>>>>> needs to be 0x%X. >>>>>> >>>>>> It has a lot of information separated to LSB and MSB bits which >>>>>> are not readable >>>>>> while printing with %d. >>>>>> >>>>>> You are also printing comp_ctx->status, which is clear from >>>>>> if/else section. >>>>>> >>>>>> So no, I don't buy this claim for "additional debug information", >>>>>> while >>>>>> existing is not used. >>>>> What do you mean by that?!? >>>> If you take a close look on the prints, you will see the reasons. >>>> For example, you print comp_ctx->status which can be only 0 or 1, >>>> while it is already clear what its value from the print itself. >>>> >>>>> These errors are extremely rare and are not manually reproducible, >>>>> that's >>>>> why we want to collect as much information as we can when it happens. >>>> Do it out-of-tree, there is no need in upstream code for internal >>>> debug >>>> sessions. >>> It's not for internal debug, it is used in production. Why would I >>> upstream >>> internal debug prints? >> It is used in internal cloud for the NICs not available to the rest of >> the world. So yes, it is your internal debug print. >>>>> I'm less bothered by the format as long as we have the info we need. >>>> Like I said, it is clear that you never actually relied on this >>>> information. >>>> Better if you completely delete these prints and keep them >>>> out-of-tree. >>> Not sure that I follow, can you please elaborate on what "relying" >>> means >>> from your POV? >> "relied" == "used" >> > The NIC might be available only in AWS but customers decide what Linux > versions they use. Many of our customers choose to use upstream > versions and avoid taking debug or modified versions. > > You are right saying the appearance for this print is rare, that's why > as you pointed out, it isn't perfectly formatted. But this is since > the issue rarely reproduce, what is exactly the reason we need the > print in-place when it does. > > I can definitely improve the styling if it makes things better. > > Michael > Leon, do you have any additional concerns regarding this patch? It's more than 6 weeks in the pipe now. We are asking customers to use patched drivers and we shouldn't be there. Michael ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-07 10:28 ` Leon Romanovsky 2025-07-07 11:18 ` Margolin, Michael @ 2025-07-07 11:30 ` Gal Pressman 1 sibling, 0 replies; 10+ messages in thread From: Gal Pressman @ 2025-07-07 11:30 UTC (permalink / raw) To: Leon Romanovsky, Margolin, Michael Cc: jgg, linux-rdma, sleybo, matua, Yonatan Nachum On 07/07/2025 13:28, Leon Romanovsky wrote: >> It's not for internal debug, it is used in production. Why would I upstream >> internal debug prints? > > It is used in internal cloud for the NICs not available to the rest of > the world. So yes, it is your internal debug print. It is available to literally anyone in the world through AWS, with upstream kernel. You may like or dislike the patch, but this argument has nothing to do with anything. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH for-next v2] RDMA/efa: Extend admin timeout error print 2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin 2025-07-06 7:25 ` Leon Romanovsky @ 2025-08-25 14:56 ` Jason Gunthorpe 1 sibling, 0 replies; 10+ messages in thread From: Jason Gunthorpe @ 2025-08-25 14:56 UTC (permalink / raw) To: Michael Margolin Cc: leon, linux-rdma, sleybo, matua, gal.pressman, Yonatan Nachum On Thu, Jul 03, 2025 at 06:23:14PM +0000, Michael Margolin wrote: > Add command id to the printed message for additional debug information. > > Reviewed-by: Yonatan Nachum <ynachum@amazon.com> > Signed-off-by: Michael Margolin <mrgolin@amazon.com> > --- > drivers/infiniband/hw/efa/efa_com.c | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) I picked this up, I think we can debate what level of debug prints or traces are appropriate next someone adds them, but tweaking an existing one seems fine to me. Jason ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-08-25 14:56 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-03 18:23 [PATCH for-next v2] RDMA/efa: Extend admin timeout error print Michael Margolin 2025-07-06 7:25 ` Leon Romanovsky 2025-07-06 16:32 ` Margolin, Michael 2025-07-07 6:28 ` Leon Romanovsky 2025-07-07 9:51 ` Margolin, Michael 2025-07-07 10:28 ` Leon Romanovsky 2025-07-07 11:18 ` Margolin, Michael 2025-08-18 7:55 ` Margolin, Michael 2025-07-07 11:30 ` Gal Pressman 2025-08-25 14:56 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).