* [PATCH for-next 0/3] EFA statistics minor updates
@ 2020-04-20 6:22 Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw)
To: Jason Gunthorpe, Doug Ledford
Cc: linux-rdma, Alexander Matushevsky, Gal Pressman
This series contains three small patches to collect a few counters we
found helpful when encountering different issues.
Regards,
Gal
Gal Pressman (3):
RDMA/efa: Report create CQ error counter
RDMA/efa: Count mmap failures
RDMA/efa: Count admin commands errors
drivers/infiniband/hw/efa/efa.h | 3 ++-
drivers/infiniband/hw/efa/efa_com.c | 5 ++++-
drivers/infiniband/hw/efa/efa_com.h | 3 ++-
drivers/infiniband/hw/efa/efa_verbs.c | 13 +++++++++++--
4 files changed, 19 insertions(+), 5 deletions(-)
base-commit: 8f3d9f354286745c751374f5f1fcafee6b3f3136
--
2.26.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter
2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
@ 2020-04-20 6:22 ` Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
` (2 subsequent siblings)
3 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw)
To: Jason Gunthorpe, Doug Ledford
Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah,
Yossi Leybovich
Create CQ errors are already being counted, report them along all other
counters.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
drivers/infiniband/hw/efa/efa_verbs.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 5c57098a4aee..b555845d6c14 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -41,6 +41,7 @@ struct efa_user_mmap_entry {
op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \
op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \
op(EFA_CREATE_QP_ERR, "create_qp_err") \
+ op(EFA_CREATE_CQ_ERR, "create_cq_err") \
op(EFA_REG_MR_ERR, "reg_mr_err") \
op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
op(EFA_CREATE_AH_ERR, "create_ah_err")
@@ -1753,6 +1754,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
stats->value[EFA_KEEP_ALIVE_RCVD] = atomic64_read(&s->keep_alive_rcvd);
stats->value[EFA_ALLOC_PD_ERR] = atomic64_read(&s->sw_stats.alloc_pd_err);
stats->value[EFA_CREATE_QP_ERR] = atomic64_read(&s->sw_stats.create_qp_err);
+ stats->value[EFA_CREATE_CQ_ERR] = atomic64_read(&s->sw_stats.create_cq_err);
stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err);
stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err);
stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err);
--
2.26.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
@ 2020-04-20 6:22 ` Gal Pressman
2020-04-24 14:59 ` Jason Gunthorpe
2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
3 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw)
To: Jason Gunthorpe, Doug Ledford
Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah,
Yossi Leybovich
Add a new stat that counts mmap failures, which might help when
debugging different issues.
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
drivers/infiniband/hw/efa/efa.h | 3 ++-
drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
index aa7396a1588a..77c9ff798117 100644
--- a/drivers/infiniband/hw/efa/efa.h
+++ b/drivers/infiniband/hw/efa/efa.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
/*
- * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
*/
#ifndef _EFA_H_
@@ -40,6 +40,7 @@ struct efa_sw_stats {
atomic64_t reg_mr_err;
atomic64_t alloc_ucontext_err;
atomic64_t create_ah_err;
+ atomic64_t mmap_err;
};
/* Don't use anything other than atomic64 */
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index b555845d6c14..75eef1ec2474 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
op(EFA_CREATE_CQ_ERR, "create_cq_err") \
op(EFA_REG_MR_ERR, "reg_mr_err") \
op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
- op(EFA_CREATE_AH_ERR, "create_ah_err")
+ op(EFA_CREATE_AH_ERR, "create_ah_err") \
+ op(EFA_MMAP_ERR, "mmap_err")
#define EFA_STATS_ENUM(ename, name) ename,
#define EFA_STATS_STR(ename, name) [ename] = name,
@@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
ibdev_dbg(&dev->ibdev,
"pgoff[%#lx] does not have valid entry\n",
vma->vm_pgoff);
+ atomic64_inc(&dev->stats.sw_stats.mmap_err);
return -EINVAL;
}
entry = to_emmap(rdma_entry);
@@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
err = -EINVAL;
}
- if (err)
+ if (err) {
ibdev_dbg(
&dev->ibdev,
"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
entry->address, rdma_entry->npages * PAGE_SIZE,
entry->mmap_flag, err);
+ atomic64_inc(&dev->stats.sw_stats.mmap_err);
+ }
rdma_user_mmap_entry_put(rdma_entry);
return err;
@@ -1758,6 +1762,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err);
stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err);
stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err);
+ stats->value[EFA_MMAP_ERR] = atomic64_read(&s->sw_stats.mmap_err);
return ARRAY_SIZE(efa_stats_names);
}
--
2.26.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH for-next 3/3] RDMA/efa: Count admin commands errors
2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
@ 2020-04-20 6:22 ` Gal Pressman
2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
3 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw)
To: Jason Gunthorpe, Doug Ledford
Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Daniel Kranzdorf,
Yossi Leybovich
Add a new stat that counts admin commands failures, which might help
when debugging different issues.
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
drivers/infiniband/hw/efa/efa_com.c | 5 ++++-
drivers/infiniband/hw/efa/efa_com.h | 3 ++-
drivers/infiniband/hw/efa/efa_verbs.c | 2 ++
3 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c
index 7fce69f5568f..336bc2c57bb1 100644
--- a/drivers/infiniband/hw/efa/efa_com.c
+++ b/drivers/infiniband/hw/efa/efa_com.c
@@ -631,17 +631,20 @@ int efa_com_cmd_exec(struct efa_com_admin_queue *aq,
cmd->aq_common_descriptor.opcode, PTR_ERR(comp_ctx));
up(&aq->avail_cmds);
+ atomic64_inc(&aq->stats.cmd_err);
return PTR_ERR(comp_ctx);
}
err = efa_com_wait_and_process_admin_cq(comp_ctx, aq);
- if (err)
+ if (err) {
ibdev_err_ratelimited(
aq->efa_dev,
"Failed to process command %s (opcode %u) comp_status %d err %d\n",
efa_com_cmd_str(cmd->aq_common_descriptor.opcode),
cmd->aq_common_descriptor.opcode, comp_ctx->comp_status,
err);
+ atomic64_inc(&aq->stats.cmd_err);
+ }
up(&aq->avail_cmds);
diff --git a/drivers/infiniband/hw/efa/efa_com.h b/drivers/infiniband/hw/efa/efa_com.h
index c67dd8109d1c..5e4c88877ddb 100644
--- a/drivers/infiniband/hw/efa/efa_com.h
+++ b/drivers/infiniband/hw/efa/efa_com.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
/*
- * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
*/
#ifndef _EFA_COM_H_
@@ -47,6 +47,7 @@ struct efa_com_admin_sq {
struct efa_com_stats_admin {
atomic64_t submitted_cmd;
atomic64_t completed_cmd;
+ atomic64_t cmd_err;
atomic64_t no_completion;
};
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 75eef1ec2474..ce7a3ead1ba7 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -37,6 +37,7 @@ struct efa_user_mmap_entry {
op(EFA_RX_DROPS, "rx_drops") \
op(EFA_SUBMITTED_CMDS, "submitted_cmds") \
op(EFA_COMPLETED_CMDS, "completed_cmds") \
+ op(EFA_CMDS_ERR, "cmds_err") \
op(EFA_NO_COMPLETION_CMDS, "no_completion_cmds") \
op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \
op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \
@@ -1752,6 +1753,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
as = &dev->edev.aq.stats;
stats->value[EFA_SUBMITTED_CMDS] = atomic64_read(&as->submitted_cmd);
stats->value[EFA_COMPLETED_CMDS] = atomic64_read(&as->completed_cmd);
+ stats->value[EFA_CMDS_ERR] = atomic64_read(&as->cmd_err);
stats->value[EFA_NO_COMPLETION_CMDS] = atomic64_read(&as->no_completion);
s = &dev->stats;
--
2.26.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
@ 2020-04-24 14:59 ` Jason Gunthorpe
2020-04-24 15:25 ` Gal Pressman
0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 14:59 UTC (permalink / raw)
To: Gal Pressman
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> Add a new stat that counts mmap failures, which might help when
> debugging different issues.
>
> Reviewed-by: Firas JahJah <firasj@amazon.com>
> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> Signed-off-by: Gal Pressman <galpress@amazon.com>
> drivers/infiniband/hw/efa/efa.h | 3 ++-
> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
> 2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> index aa7396a1588a..77c9ff798117 100644
> +++ b/drivers/infiniband/hw/efa/efa.h
> @@ -1,6 +1,6 @@
> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> /*
> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
> */
>
> #ifndef _EFA_H_
> @@ -40,6 +40,7 @@ struct efa_sw_stats {
> atomic64_t reg_mr_err;
> atomic64_t alloc_ucontext_err;
> atomic64_t create_ah_err;
> + atomic64_t mmap_err;
> };
>
> /* Don't use anything other than atomic64 */
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> index b555845d6c14..75eef1ec2474 100644
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
> op(EFA_REG_MR_ERR, "reg_mr_err") \
> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> - op(EFA_CREATE_AH_ERR, "create_ah_err")
> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
> + op(EFA_MMAP_ERR, "mmap_err")
>
> #define EFA_STATS_ENUM(ename, name) ename,
> #define EFA_STATS_STR(ename, name) [ename] = name,
> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> ibdev_dbg(&dev->ibdev,
> "pgoff[%#lx] does not have valid entry\n",
> vma->vm_pgoff);
> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
> return -EINVAL;
> }
> entry = to_emmap(rdma_entry);
> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> err = -EINVAL;
> }
>
> - if (err)
> + if (err) {
> ibdev_dbg(
> &dev->ibdev,
> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
> entry->address, rdma_entry->npages * PAGE_SIZE,
> entry->mmap_flag, err);
> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
Really? Isn't this something that is only possible with a buggy
rdma-core provider? Why count it?
Jason
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-24 14:59 ` Jason Gunthorpe
@ 2020-04-24 15:25 ` Gal Pressman
2020-04-24 18:26 ` Jason Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-24 15:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On 24/04/2020 17:59, Jason Gunthorpe wrote:
> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>> Add a new stat that counts mmap failures, which might help when
>> debugging different issues.
>>
>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>> drivers/infiniband/hw/efa/efa.h | 3 ++-
>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>> 2 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>> index aa7396a1588a..77c9ff798117 100644
>> +++ b/drivers/infiniband/hw/efa/efa.h
>> @@ -1,6 +1,6 @@
>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>> /*
>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>> */
>>
>> #ifndef _EFA_H_
>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>> atomic64_t reg_mr_err;
>> atomic64_t alloc_ucontext_err;
>> atomic64_t create_ah_err;
>> + atomic64_t mmap_err;
>> };
>>
>> /* Don't use anything other than atomic64 */
>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>> index b555845d6c14..75eef1ec2474 100644
>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>> op(EFA_REG_MR_ERR, "reg_mr_err") \
>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>> - op(EFA_CREATE_AH_ERR, "create_ah_err")
>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
>> + op(EFA_MMAP_ERR, "mmap_err")
>>
>> #define EFA_STATS_ENUM(ename, name) ename,
>> #define EFA_STATS_STR(ename, name) [ename] = name,
>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>> ibdev_dbg(&dev->ibdev,
>> "pgoff[%#lx] does not have valid entry\n",
>> vma->vm_pgoff);
>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>> return -EINVAL;
>> }
>> entry = to_emmap(rdma_entry);
>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>> err = -EINVAL;
>> }
>>
>> - if (err)
>> + if (err) {
>> ibdev_dbg(
>> &dev->ibdev,
>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>> entry->address, rdma_entry->npages * PAGE_SIZE,
>> entry->mmap_flag, err);
>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>
> Really? Isn't this something that is only possible with a buggy
> rdma-core provider? Why count it?
Though unlikely, it could happen, otherwise this error flow wouldn't exist in
the first place.
If for some reason a customer app steps on a bug we're not aware of, this
counter could serve as a red flag.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-24 15:25 ` Gal Pressman
@ 2020-04-24 18:26 ` Jason Gunthorpe
2020-04-26 6:42 ` Gal Pressman
0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 18:26 UTC (permalink / raw)
To: Gal Pressman
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
> On 24/04/2020 17:59, Jason Gunthorpe wrote:
> > On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> >> Add a new stat that counts mmap failures, which might help when
> >> debugging different issues.
> >>
> >> Reviewed-by: Firas JahJah <firasj@amazon.com>
> >> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> >> Signed-off-by: Gal Pressman <galpress@amazon.com>
> >> drivers/infiniband/hw/efa/efa.h | 3 ++-
> >> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
> >> 2 files changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> >> index aa7396a1588a..77c9ff798117 100644
> >> +++ b/drivers/infiniband/hw/efa/efa.h
> >> @@ -1,6 +1,6 @@
> >> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> >> /*
> >> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> >> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
> >> */
> >>
> >> #ifndef _EFA_H_
> >> @@ -40,6 +40,7 @@ struct efa_sw_stats {
> >> atomic64_t reg_mr_err;
> >> atomic64_t alloc_ucontext_err;
> >> atomic64_t create_ah_err;
> >> + atomic64_t mmap_err;
> >> };
> >>
> >> /* Don't use anything other than atomic64 */
> >> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> >> index b555845d6c14..75eef1ec2474 100644
> >> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> >> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
> >> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
> >> op(EFA_REG_MR_ERR, "reg_mr_err") \
> >> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> >> - op(EFA_CREATE_AH_ERR, "create_ah_err")
> >> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
> >> + op(EFA_MMAP_ERR, "mmap_err")
> >>
> >> #define EFA_STATS_ENUM(ename, name) ename,
> >> #define EFA_STATS_STR(ename, name) [ename] = name,
> >> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >> ibdev_dbg(&dev->ibdev,
> >> "pgoff[%#lx] does not have valid entry\n",
> >> vma->vm_pgoff);
> >> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >> return -EINVAL;
> >> }
> >> entry = to_emmap(rdma_entry);
> >> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >> err = -EINVAL;
> >> }
> >>
> >> - if (err)
> >> + if (err) {
> >> ibdev_dbg(
> >> &dev->ibdev,
> >> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
> >> entry->address, rdma_entry->npages * PAGE_SIZE,
> >> entry->mmap_flag, err);
> >> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >
> > Really? Isn't this something that is only possible with a buggy
> > rdma-core provider? Why count it?
>
> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
> the first place.
>
> If for some reason a customer app steps on a bug we're not aware of, this
> counter could serve as a red flag.
But there are lots of cases where a buggy provider can cause error
exits, why choose this one to count against all the others?
Jason
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-24 18:26 ` Jason Gunthorpe
@ 2020-04-26 6:42 ` Gal Pressman
2020-04-26 13:30 ` Jason Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-26 6:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On 24/04/2020 21:26, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
>> On 24/04/2020 17:59, Jason Gunthorpe wrote:
>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>>>> Add a new stat that counts mmap failures, which might help when
>>>> debugging different issues.
>>>>
>>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>>> drivers/infiniband/hw/efa/efa.h | 3 ++-
>>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>>>> 2 files changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>>>> index aa7396a1588a..77c9ff798117 100644
>>>> +++ b/drivers/infiniband/hw/efa/efa.h
>>>> @@ -1,6 +1,6 @@
>>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>>>> /*
>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>> */
>>>>
>>>> #ifndef _EFA_H_
>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>>>> atomic64_t reg_mr_err;
>>>> atomic64_t alloc_ucontext_err;
>>>> atomic64_t create_ah_err;
>>>> + atomic64_t mmap_err;
>>>> };
>>>>
>>>> /* Don't use anything other than atomic64 */
>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>>>> index b555845d6c14..75eef1ec2474 100644
>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>>>> op(EFA_REG_MR_ERR, "reg_mr_err") \
>>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>>>> - op(EFA_CREATE_AH_ERR, "create_ah_err")
>>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
>>>> + op(EFA_MMAP_ERR, "mmap_err")
>>>>
>>>> #define EFA_STATS_ENUM(ename, name) ename,
>>>> #define EFA_STATS_STR(ename, name) [ename] = name,
>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>> ibdev_dbg(&dev->ibdev,
>>>> "pgoff[%#lx] does not have valid entry\n",
>>>> vma->vm_pgoff);
>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>> return -EINVAL;
>>>> }
>>>> entry = to_emmap(rdma_entry);
>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>> err = -EINVAL;
>>>> }
>>>>
>>>> - if (err)
>>>> + if (err) {
>>>> ibdev_dbg(
>>>> &dev->ibdev,
>>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>>>> entry->address, rdma_entry->npages * PAGE_SIZE,
>>>> entry->mmap_flag, err);
>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>
>>> Really? Isn't this something that is only possible with a buggy
>>> rdma-core provider? Why count it?
>>
>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
>> the first place.
>>
>> If for some reason a customer app steps on a bug we're not aware of, this
>> counter could serve as a red flag.
>
> But there are lots of cases where a buggy provider can cause error
> exits, why choose this one to count against all the others?
It's not one against all others, most if not all of our userspace facing API
error flows have a similar counter.
And TBH, I think that the mmap flow is quite convoluted with the cookie response
from the crate verb, so it deserves a counter IMO.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-26 6:42 ` Gal Pressman
@ 2020-04-26 13:30 ` Jason Gunthorpe
2020-04-26 14:17 ` Gal Pressman
0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-26 13:30 UTC (permalink / raw)
To: Gal Pressman
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote:
> On 24/04/2020 21:26, Jason Gunthorpe wrote:
> > On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
> >> On 24/04/2020 17:59, Jason Gunthorpe wrote:
> >>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> >>>> Add a new stat that counts mmap failures, which might help when
> >>>> debugging different issues.
> >>>>
> >>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
> >>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> >>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
> >>>> drivers/infiniband/hw/efa/efa.h | 3 ++-
> >>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
> >>>> 2 files changed, 9 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> >>>> index aa7396a1588a..77c9ff798117 100644
> >>>> +++ b/drivers/infiniband/hw/efa/efa.h
> >>>> @@ -1,6 +1,6 @@
> >>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> >>>> /*
> >>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> >>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
> >>>> */
> >>>>
> >>>> #ifndef _EFA_H_
> >>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
> >>>> atomic64_t reg_mr_err;
> >>>> atomic64_t alloc_ucontext_err;
> >>>> atomic64_t create_ah_err;
> >>>> + atomic64_t mmap_err;
> >>>> };
> >>>>
> >>>> /* Don't use anything other than atomic64 */
> >>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> >>>> index b555845d6c14..75eef1ec2474 100644
> >>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> >>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
> >>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
> >>>> op(EFA_REG_MR_ERR, "reg_mr_err") \
> >>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> >>>> - op(EFA_CREATE_AH_ERR, "create_ah_err")
> >>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
> >>>> + op(EFA_MMAP_ERR, "mmap_err")
> >>>>
> >>>> #define EFA_STATS_ENUM(ename, name) ename,
> >>>> #define EFA_STATS_STR(ename, name) [ename] = name,
> >>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>>> ibdev_dbg(&dev->ibdev,
> >>>> "pgoff[%#lx] does not have valid entry\n",
> >>>> vma->vm_pgoff);
> >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >>>> return -EINVAL;
> >>>> }
> >>>> entry = to_emmap(rdma_entry);
> >>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>>> err = -EINVAL;
> >>>> }
> >>>>
> >>>> - if (err)
> >>>> + if (err) {
> >>>> ibdev_dbg(
> >>>> &dev->ibdev,
> >>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
> >>>> entry->address, rdma_entry->npages * PAGE_SIZE,
> >>>> entry->mmap_flag, err);
> >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >>>
> >>> Really? Isn't this something that is only possible with a buggy
> >>> rdma-core provider? Why count it?
> >>
> >> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
> >> the first place.
> >>
> >> If for some reason a customer app steps on a bug we're not aware of, this
> >> counter could serve as a red flag.
> >
> > But there are lots of cases where a buggy provider can cause error
> > exits, why choose this one to count against all the others?
>
> It's not one against all others, most if not all of our userspace facing API
> error flows have a similar counter.
Hurm, seems a bit strange, but OK
> And TBH, I think that the mmap flow is quite convoluted with the cookie response
> from the crate verb, so it deserves a counter IMO.
How so? Userspace takes the u64 from the command and pass it to mmap,
what is convoluted?
Jason
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
2020-04-26 13:30 ` Jason Gunthorpe
@ 2020-04-26 14:17 ` Gal Pressman
0 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-26 14:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
Yossi Leybovich
On 26/04/2020 16:30, Jason Gunthorpe wrote:
> On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote:
>> On 24/04/2020 21:26, Jason Gunthorpe wrote:
>>> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
>>>> On 24/04/2020 17:59, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>>>>>> Add a new stat that counts mmap failures, which might help when
>>>>>> debugging different issues.
>>>>>>
>>>>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>>>>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>>>>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>>>>> drivers/infiniband/hw/efa/efa.h | 3 ++-
>>>>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>>>>>> 2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>>>>>> index aa7396a1588a..77c9ff798117 100644
>>>>>> +++ b/drivers/infiniband/hw/efa/efa.h
>>>>>> @@ -1,6 +1,6 @@
>>>>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>>>>>> /*
>>>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>>>> */
>>>>>>
>>>>>> #ifndef _EFA_H_
>>>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>>>>>> atomic64_t reg_mr_err;
>>>>>> atomic64_t alloc_ucontext_err;
>>>>>> atomic64_t create_ah_err;
>>>>>> + atomic64_t mmap_err;
>>>>>> };
>>>>>>
>>>>>> /* Don't use anything other than atomic64 */
>>>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>>>>>> index b555845d6c14..75eef1ec2474 100644
>>>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>>>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>>>>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>>>>>> op(EFA_REG_MR_ERR, "reg_mr_err") \
>>>>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>>>>>> - op(EFA_CREATE_AH_ERR, "create_ah_err")
>>>>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \
>>>>>> + op(EFA_MMAP_ERR, "mmap_err")
>>>>>>
>>>>>> #define EFA_STATS_ENUM(ename, name) ename,
>>>>>> #define EFA_STATS_STR(ename, name) [ename] = name,
>>>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>>> ibdev_dbg(&dev->ibdev,
>>>>>> "pgoff[%#lx] does not have valid entry\n",
>>>>>> vma->vm_pgoff);
>>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>>>> return -EINVAL;
>>>>>> }
>>>>>> entry = to_emmap(rdma_entry);
>>>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>>> err = -EINVAL;
>>>>>> }
>>>>>>
>>>>>> - if (err)
>>>>>> + if (err) {
>>>>>> ibdev_dbg(
>>>>>> &dev->ibdev,
>>>>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>>>>>> entry->address, rdma_entry->npages * PAGE_SIZE,
>>>>>> entry->mmap_flag, err);
>>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>>>
>>>>> Really? Isn't this something that is only possible with a buggy
>>>>> rdma-core provider? Why count it?
>>>>
>>>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
>>>> the first place.
>>>>
>>>> If for some reason a customer app steps on a bug we're not aware of, this
>>>> counter could serve as a red flag.
>>>
>>> But there are lots of cases where a buggy provider can cause error
>>> exits, why choose this one to count against all the others?
>>
>> It's not one against all others, most if not all of our userspace facing API
>> error flows have a similar counter.
>
> Hurm, seems a bit strange, but OK
>
>> And TBH, I think that the mmap flow is quite convoluted with the cookie response
>> from the crate verb, so it deserves a counter IMO.
>
> How so? Userspace takes the u64 from the command and pass it to mmap,
> what is convoluted?
It's the only flow that involves two phases/userspace calls and not a single
command-response call, so it's a bit more error prone. Convoluted was a bit
harsh :).
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 0/3] EFA statistics minor updates
2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
` (2 preceding siblings ...)
2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
@ 2020-05-02 23:33 ` Jason Gunthorpe
3 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2020-05-02 23:33 UTC (permalink / raw)
To: Gal Pressman; +Cc: Doug Ledford, linux-rdma, Alexander Matushevsky
On Mon, Apr 20, 2020 at 09:22:10AM +0300, Gal Pressman wrote:
> This series contains three small patches to collect a few counters we
> found helpful when encountering different issues.
Applied to for-next, thanks
Jason
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2020-05-02 23:33 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
2020-04-24 14:59 ` Jason Gunthorpe
2020-04-24 15:25 ` Gal Pressman
2020-04-24 18:26 ` Jason Gunthorpe
2020-04-26 6:42 ` Gal Pressman
2020-04-26 13:30 ` Jason Gunthorpe
2020-04-26 14:17 ` Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).