* [PATCH for-next 0/3] EFA statistics minor updates
@ 2020-04-20 6:22 Gal Pressman
2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw)
To: Jason Gunthorpe, Doug Ledford
Cc: linux-rdma, Alexander Matushevsky, Gal Pressman
This series contains three small patches to collect a few counters we
found helpful when encountering different issues.
Regards,
Gal
Gal Pressman (3):
RDMA/efa: Report create CQ error counter
RDMA/efa: Count mmap failures
RDMA/efa: Count admin commands errors
drivers/infiniband/hw/efa/efa.h | 3 ++-
drivers/infiniband/hw/efa/efa_com.c | 5 ++++-
drivers/infiniband/hw/efa/efa_com.h | 3 ++-
drivers/infiniband/hw/efa/efa_verbs.c | 13 +++++++++++--
4 files changed, 19 insertions(+), 5 deletions(-)
base-commit: 8f3d9f354286745c751374f5f1fcafee6b3f3136
--
2.26.1
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter 2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman @ 2020-04-20 6:22 ` Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah, Yossi Leybovich Create CQ errors are already being counted, report them along all other counters. Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> --- drivers/infiniband/hw/efa/efa_verbs.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c index 5c57098a4aee..b555845d6c14 100644 --- a/drivers/infiniband/hw/efa/efa_verbs.c +++ b/drivers/infiniband/hw/efa/efa_verbs.c @@ -41,6 +41,7 @@ struct efa_user_mmap_entry { op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \ op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \ op(EFA_CREATE_QP_ERR, "create_qp_err") \ + op(EFA_CREATE_CQ_ERR, "create_cq_err") \ op(EFA_REG_MR_ERR, "reg_mr_err") \ op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ op(EFA_CREATE_AH_ERR, "create_ah_err") @@ -1753,6 +1754,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats, stats->value[EFA_KEEP_ALIVE_RCVD] = atomic64_read(&s->keep_alive_rcvd); stats->value[EFA_ALLOC_PD_ERR] = atomic64_read(&s->sw_stats.alloc_pd_err); stats->value[EFA_CREATE_QP_ERR] = atomic64_read(&s->sw_stats.create_qp_err); + stats->value[EFA_CREATE_CQ_ERR] = atomic64_read(&s->sw_stats.create_cq_err); stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err); stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err); stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err); -- 2.26.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman @ 2020-04-20 6:22 ` Gal Pressman 2020-04-24 14:59 ` Jason Gunthorpe 2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman 2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe 3 siblings, 1 reply; 11+ messages in thread From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah, Yossi Leybovich Add a new stat that counts mmap failures, which might help when debugging different issues. Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> --- drivers/infiniband/hw/efa/efa.h | 3 ++- drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h index aa7396a1588a..77c9ff798117 100644 --- a/drivers/infiniband/hw/efa/efa.h +++ b/drivers/infiniband/hw/efa/efa.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ /* - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. */ #ifndef _EFA_H_ @@ -40,6 +40,7 @@ struct efa_sw_stats { atomic64_t reg_mr_err; atomic64_t alloc_ucontext_err; atomic64_t create_ah_err; + atomic64_t mmap_err; }; /* Don't use anything other than atomic64 */ diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c index b555845d6c14..75eef1ec2474 100644 --- a/drivers/infiniband/hw/efa/efa_verbs.c +++ b/drivers/infiniband/hw/efa/efa_verbs.c @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { op(EFA_CREATE_CQ_ERR, "create_cq_err") \ op(EFA_REG_MR_ERR, "reg_mr_err") \ op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ - op(EFA_CREATE_AH_ERR, "create_ah_err") + op(EFA_CREATE_AH_ERR, "create_ah_err") \ + op(EFA_MMAP_ERR, "mmap_err") #define EFA_STATS_ENUM(ename, name) ename, #define EFA_STATS_STR(ename, name) [ename] = name, @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, ibdev_dbg(&dev->ibdev, "pgoff[%#lx] does not have valid entry\n", vma->vm_pgoff); + atomic64_inc(&dev->stats.sw_stats.mmap_err); return -EINVAL; } entry = to_emmap(rdma_entry); @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, err = -EINVAL; } - if (err) + if (err) { ibdev_dbg( &dev->ibdev, "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", entry->address, rdma_entry->npages * PAGE_SIZE, entry->mmap_flag, err); + atomic64_inc(&dev->stats.sw_stats.mmap_err); + } rdma_user_mmap_entry_put(rdma_entry); return err; @@ -1758,6 +1762,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats, stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err); stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err); stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err); + stats->value[EFA_MMAP_ERR] = atomic64_read(&s->sw_stats.mmap_err); return ARRAY_SIZE(efa_stats_names); } -- 2.26.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman @ 2020-04-24 14:59 ` Jason Gunthorpe 2020-04-24 15:25 ` Gal Pressman 0 siblings, 1 reply; 11+ messages in thread From: Jason Gunthorpe @ 2020-04-24 14:59 UTC (permalink / raw) To: Gal Pressman Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: > Add a new stat that counts mmap failures, which might help when > debugging different issues. > > Reviewed-by: Firas JahJah <firasj@amazon.com> > Reviewed-by: Yossi Leybovich <sleybo@amazon.com> > Signed-off-by: Gal Pressman <galpress@amazon.com> > drivers/infiniband/hw/efa/efa.h | 3 ++- > drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- > 2 files changed, 9 insertions(+), 3 deletions(-) > > diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h > index aa7396a1588a..77c9ff798117 100644 > +++ b/drivers/infiniband/hw/efa/efa.h > @@ -1,6 +1,6 @@ > /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ > /* > - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. > + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. > */ > > #ifndef _EFA_H_ > @@ -40,6 +40,7 @@ struct efa_sw_stats { > atomic64_t reg_mr_err; > atomic64_t alloc_ucontext_err; > atomic64_t create_ah_err; > + atomic64_t mmap_err; > }; > > /* Don't use anything other than atomic64 */ > diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c > index b555845d6c14..75eef1ec2474 100644 > +++ b/drivers/infiniband/hw/efa/efa_verbs.c > @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { > op(EFA_CREATE_CQ_ERR, "create_cq_err") \ > op(EFA_REG_MR_ERR, "reg_mr_err") \ > op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ > - op(EFA_CREATE_AH_ERR, "create_ah_err") > + op(EFA_CREATE_AH_ERR, "create_ah_err") \ > + op(EFA_MMAP_ERR, "mmap_err") > > #define EFA_STATS_ENUM(ename, name) ename, > #define EFA_STATS_STR(ename, name) [ename] = name, > @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > ibdev_dbg(&dev->ibdev, > "pgoff[%#lx] does not have valid entry\n", > vma->vm_pgoff); > + atomic64_inc(&dev->stats.sw_stats.mmap_err); > return -EINVAL; > } > entry = to_emmap(rdma_entry); > @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > err = -EINVAL; > } > > - if (err) > + if (err) { > ibdev_dbg( > &dev->ibdev, > "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", > entry->address, rdma_entry->npages * PAGE_SIZE, > entry->mmap_flag, err); > + atomic64_inc(&dev->stats.sw_stats.mmap_err); Really? Isn't this something that is only possible with a buggy rdma-core provider? Why count it? Jason ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-24 14:59 ` Jason Gunthorpe @ 2020-04-24 15:25 ` Gal Pressman 2020-04-24 18:26 ` Jason Gunthorpe 0 siblings, 1 reply; 11+ messages in thread From: Gal Pressman @ 2020-04-24 15:25 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On 24/04/2020 17:59, Jason Gunthorpe wrote: > On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: >> Add a new stat that counts mmap failures, which might help when >> debugging different issues. >> >> Reviewed-by: Firas JahJah <firasj@amazon.com> >> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> >> Signed-off-by: Gal Pressman <galpress@amazon.com> >> drivers/infiniband/hw/efa/efa.h | 3 ++- >> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- >> 2 files changed, 9 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h >> index aa7396a1588a..77c9ff798117 100644 >> +++ b/drivers/infiniband/hw/efa/efa.h >> @@ -1,6 +1,6 @@ >> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ >> /* >> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. >> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. >> */ >> >> #ifndef _EFA_H_ >> @@ -40,6 +40,7 @@ struct efa_sw_stats { >> atomic64_t reg_mr_err; >> atomic64_t alloc_ucontext_err; >> atomic64_t create_ah_err; >> + atomic64_t mmap_err; >> }; >> >> /* Don't use anything other than atomic64 */ >> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c >> index b555845d6c14..75eef1ec2474 100644 >> +++ b/drivers/infiniband/hw/efa/efa_verbs.c >> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { >> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ >> op(EFA_REG_MR_ERR, "reg_mr_err") \ >> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ >> - op(EFA_CREATE_AH_ERR, "create_ah_err") >> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ >> + op(EFA_MMAP_ERR, "mmap_err") >> >> #define EFA_STATS_ENUM(ename, name) ename, >> #define EFA_STATS_STR(ename, name) [ename] = name, >> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >> ibdev_dbg(&dev->ibdev, >> "pgoff[%#lx] does not have valid entry\n", >> vma->vm_pgoff); >> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >> return -EINVAL; >> } >> entry = to_emmap(rdma_entry); >> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >> err = -EINVAL; >> } >> >> - if (err) >> + if (err) { >> ibdev_dbg( >> &dev->ibdev, >> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", >> entry->address, rdma_entry->npages * PAGE_SIZE, >> entry->mmap_flag, err); >> + atomic64_inc(&dev->stats.sw_stats.mmap_err); > > Really? Isn't this something that is only possible with a buggy > rdma-core provider? Why count it? Though unlikely, it could happen, otherwise this error flow wouldn't exist in the first place. If for some reason a customer app steps on a bug we're not aware of, this counter could serve as a red flag. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-24 15:25 ` Gal Pressman @ 2020-04-24 18:26 ` Jason Gunthorpe 2020-04-26 6:42 ` Gal Pressman 0 siblings, 1 reply; 11+ messages in thread From: Jason Gunthorpe @ 2020-04-24 18:26 UTC (permalink / raw) To: Gal Pressman Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote: > On 24/04/2020 17:59, Jason Gunthorpe wrote: > > On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: > >> Add a new stat that counts mmap failures, which might help when > >> debugging different issues. > >> > >> Reviewed-by: Firas JahJah <firasj@amazon.com> > >> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> > >> Signed-off-by: Gal Pressman <galpress@amazon.com> > >> drivers/infiniband/hw/efa/efa.h | 3 ++- > >> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- > >> 2 files changed, 9 insertions(+), 3 deletions(-) > >> > >> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h > >> index aa7396a1588a..77c9ff798117 100644 > >> +++ b/drivers/infiniband/hw/efa/efa.h > >> @@ -1,6 +1,6 @@ > >> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ > >> /* > >> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. > >> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. > >> */ > >> > >> #ifndef _EFA_H_ > >> @@ -40,6 +40,7 @@ struct efa_sw_stats { > >> atomic64_t reg_mr_err; > >> atomic64_t alloc_ucontext_err; > >> atomic64_t create_ah_err; > >> + atomic64_t mmap_err; > >> }; > >> > >> /* Don't use anything other than atomic64 */ > >> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c > >> index b555845d6c14..75eef1ec2474 100644 > >> +++ b/drivers/infiniband/hw/efa/efa_verbs.c > >> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { > >> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ > >> op(EFA_REG_MR_ERR, "reg_mr_err") \ > >> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ > >> - op(EFA_CREATE_AH_ERR, "create_ah_err") > >> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ > >> + op(EFA_MMAP_ERR, "mmap_err") > >> > >> #define EFA_STATS_ENUM(ename, name) ename, > >> #define EFA_STATS_STR(ename, name) [ename] = name, > >> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > >> ibdev_dbg(&dev->ibdev, > >> "pgoff[%#lx] does not have valid entry\n", > >> vma->vm_pgoff); > >> + atomic64_inc(&dev->stats.sw_stats.mmap_err); > >> return -EINVAL; > >> } > >> entry = to_emmap(rdma_entry); > >> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > >> err = -EINVAL; > >> } > >> > >> - if (err) > >> + if (err) { > >> ibdev_dbg( > >> &dev->ibdev, > >> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", > >> entry->address, rdma_entry->npages * PAGE_SIZE, > >> entry->mmap_flag, err); > >> + atomic64_inc(&dev->stats.sw_stats.mmap_err); > > > > Really? Isn't this something that is only possible with a buggy > > rdma-core provider? Why count it? > > Though unlikely, it could happen, otherwise this error flow wouldn't exist in > the first place. > > If for some reason a customer app steps on a bug we're not aware of, this > counter could serve as a red flag. But there are lots of cases where a buggy provider can cause error exits, why choose this one to count against all the others? Jason ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-24 18:26 ` Jason Gunthorpe @ 2020-04-26 6:42 ` Gal Pressman 2020-04-26 13:30 ` Jason Gunthorpe 0 siblings, 1 reply; 11+ messages in thread From: Gal Pressman @ 2020-04-26 6:42 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On 24/04/2020 21:26, Jason Gunthorpe wrote: > On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote: >> On 24/04/2020 17:59, Jason Gunthorpe wrote: >>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: >>>> Add a new stat that counts mmap failures, which might help when >>>> debugging different issues. >>>> >>>> Reviewed-by: Firas JahJah <firasj@amazon.com> >>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> >>>> Signed-off-by: Gal Pressman <galpress@amazon.com> >>>> drivers/infiniband/hw/efa/efa.h | 3 ++- >>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- >>>> 2 files changed, 9 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h >>>> index aa7396a1588a..77c9ff798117 100644 >>>> +++ b/drivers/infiniband/hw/efa/efa.h >>>> @@ -1,6 +1,6 @@ >>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ >>>> /* >>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. >>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. >>>> */ >>>> >>>> #ifndef _EFA_H_ >>>> @@ -40,6 +40,7 @@ struct efa_sw_stats { >>>> atomic64_t reg_mr_err; >>>> atomic64_t alloc_ucontext_err; >>>> atomic64_t create_ah_err; >>>> + atomic64_t mmap_err; >>>> }; >>>> >>>> /* Don't use anything other than atomic64 */ >>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c >>>> index b555845d6c14..75eef1ec2474 100644 >>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c >>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { >>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ >>>> op(EFA_REG_MR_ERR, "reg_mr_err") \ >>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ >>>> - op(EFA_CREATE_AH_ERR, "create_ah_err") >>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ >>>> + op(EFA_MMAP_ERR, "mmap_err") >>>> >>>> #define EFA_STATS_ENUM(ename, name) ename, >>>> #define EFA_STATS_STR(ename, name) [ename] = name, >>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>> ibdev_dbg(&dev->ibdev, >>>> "pgoff[%#lx] does not have valid entry\n", >>>> vma->vm_pgoff); >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>>> return -EINVAL; >>>> } >>>> entry = to_emmap(rdma_entry); >>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>> err = -EINVAL; >>>> } >>>> >>>> - if (err) >>>> + if (err) { >>>> ibdev_dbg( >>>> &dev->ibdev, >>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", >>>> entry->address, rdma_entry->npages * PAGE_SIZE, >>>> entry->mmap_flag, err); >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>> >>> Really? Isn't this something that is only possible with a buggy >>> rdma-core provider? Why count it? >> >> Though unlikely, it could happen, otherwise this error flow wouldn't exist in >> the first place. >> >> If for some reason a customer app steps on a bug we're not aware of, this >> counter could serve as a red flag. > > But there are lots of cases where a buggy provider can cause error > exits, why choose this one to count against all the others? It's not one against all others, most if not all of our userspace facing API error flows have a similar counter. And TBH, I think that the mmap flow is quite convoluted with the cookie response from the crate verb, so it deserves a counter IMO. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-26 6:42 ` Gal Pressman @ 2020-04-26 13:30 ` Jason Gunthorpe 2020-04-26 14:17 ` Gal Pressman 0 siblings, 1 reply; 11+ messages in thread From: Jason Gunthorpe @ 2020-04-26 13:30 UTC (permalink / raw) To: Gal Pressman Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote: > On 24/04/2020 21:26, Jason Gunthorpe wrote: > > On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote: > >> On 24/04/2020 17:59, Jason Gunthorpe wrote: > >>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: > >>>> Add a new stat that counts mmap failures, which might help when > >>>> debugging different issues. > >>>> > >>>> Reviewed-by: Firas JahJah <firasj@amazon.com> > >>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> > >>>> Signed-off-by: Gal Pressman <galpress@amazon.com> > >>>> drivers/infiniband/hw/efa/efa.h | 3 ++- > >>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- > >>>> 2 files changed, 9 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h > >>>> index aa7396a1588a..77c9ff798117 100644 > >>>> +++ b/drivers/infiniband/hw/efa/efa.h > >>>> @@ -1,6 +1,6 @@ > >>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ > >>>> /* > >>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. > >>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. > >>>> */ > >>>> > >>>> #ifndef _EFA_H_ > >>>> @@ -40,6 +40,7 @@ struct efa_sw_stats { > >>>> atomic64_t reg_mr_err; > >>>> atomic64_t alloc_ucontext_err; > >>>> atomic64_t create_ah_err; > >>>> + atomic64_t mmap_err; > >>>> }; > >>>> > >>>> /* Don't use anything other than atomic64 */ > >>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c > >>>> index b555845d6c14..75eef1ec2474 100644 > >>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c > >>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { > >>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ > >>>> op(EFA_REG_MR_ERR, "reg_mr_err") \ > >>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ > >>>> - op(EFA_CREATE_AH_ERR, "create_ah_err") > >>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ > >>>> + op(EFA_MMAP_ERR, "mmap_err") > >>>> > >>>> #define EFA_STATS_ENUM(ename, name) ename, > >>>> #define EFA_STATS_STR(ename, name) [ename] = name, > >>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > >>>> ibdev_dbg(&dev->ibdev, > >>>> "pgoff[%#lx] does not have valid entry\n", > >>>> vma->vm_pgoff); > >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); > >>>> return -EINVAL; > >>>> } > >>>> entry = to_emmap(rdma_entry); > >>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, > >>>> err = -EINVAL; > >>>> } > >>>> > >>>> - if (err) > >>>> + if (err) { > >>>> ibdev_dbg( > >>>> &dev->ibdev, > >>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", > >>>> entry->address, rdma_entry->npages * PAGE_SIZE, > >>>> entry->mmap_flag, err); > >>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); > >>> > >>> Really? Isn't this something that is only possible with a buggy > >>> rdma-core provider? Why count it? > >> > >> Though unlikely, it could happen, otherwise this error flow wouldn't exist in > >> the first place. > >> > >> If for some reason a customer app steps on a bug we're not aware of, this > >> counter could serve as a red flag. > > > > But there are lots of cases where a buggy provider can cause error > > exits, why choose this one to count against all the others? > > It's not one against all others, most if not all of our userspace facing API > error flows have a similar counter. Hurm, seems a bit strange, but OK > And TBH, I think that the mmap flow is quite convoluted with the cookie response > from the crate verb, so it deserves a counter IMO. How so? Userspace takes the u64 from the command and pass it to mmap, what is convoluted? Jason ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures 2020-04-26 13:30 ` Jason Gunthorpe @ 2020-04-26 14:17 ` Gal Pressman 0 siblings, 0 replies; 11+ messages in thread From: Gal Pressman @ 2020-04-26 14:17 UTC (permalink / raw) To: Jason Gunthorpe Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah, Yossi Leybovich On 26/04/2020 16:30, Jason Gunthorpe wrote: > On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote: >> On 24/04/2020 21:26, Jason Gunthorpe wrote: >>> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote: >>>> On 24/04/2020 17:59, Jason Gunthorpe wrote: >>>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: >>>>>> Add a new stat that counts mmap failures, which might help when >>>>>> debugging different issues. >>>>>> >>>>>> Reviewed-by: Firas JahJah <firasj@amazon.com> >>>>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> >>>>>> Signed-off-by: Gal Pressman <galpress@amazon.com> >>>>>> drivers/infiniband/hw/efa/efa.h | 3 ++- >>>>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- >>>>>> 2 files changed, 9 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h >>>>>> index aa7396a1588a..77c9ff798117 100644 >>>>>> +++ b/drivers/infiniband/hw/efa/efa.h >>>>>> @@ -1,6 +1,6 @@ >>>>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ >>>>>> /* >>>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. >>>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. >>>>>> */ >>>>>> >>>>>> #ifndef _EFA_H_ >>>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats { >>>>>> atomic64_t reg_mr_err; >>>>>> atomic64_t alloc_ucontext_err; >>>>>> atomic64_t create_ah_err; >>>>>> + atomic64_t mmap_err; >>>>>> }; >>>>>> >>>>>> /* Don't use anything other than atomic64 */ >>>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c >>>>>> index b555845d6c14..75eef1ec2474 100644 >>>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c >>>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { >>>>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ >>>>>> op(EFA_REG_MR_ERR, "reg_mr_err") \ >>>>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ >>>>>> - op(EFA_CREATE_AH_ERR, "create_ah_err") >>>>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ >>>>>> + op(EFA_MMAP_ERR, "mmap_err") >>>>>> >>>>>> #define EFA_STATS_ENUM(ename, name) ename, >>>>>> #define EFA_STATS_STR(ename, name) [ename] = name, >>>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>>>> ibdev_dbg(&dev->ibdev, >>>>>> "pgoff[%#lx] does not have valid entry\n", >>>>>> vma->vm_pgoff); >>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>>>>> return -EINVAL; >>>>>> } >>>>>> entry = to_emmap(rdma_entry); >>>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>>>> err = -EINVAL; >>>>>> } >>>>>> >>>>>> - if (err) >>>>>> + if (err) { >>>>>> ibdev_dbg( >>>>>> &dev->ibdev, >>>>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", >>>>>> entry->address, rdma_entry->npages * PAGE_SIZE, >>>>>> entry->mmap_flag, err); >>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>>>> >>>>> Really? Isn't this something that is only possible with a buggy >>>>> rdma-core provider? Why count it? >>>> >>>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in >>>> the first place. >>>> >>>> If for some reason a customer app steps on a bug we're not aware of, this >>>> counter could serve as a red flag. >>> >>> But there are lots of cases where a buggy provider can cause error >>> exits, why choose this one to count against all the others? >> >> It's not one against all others, most if not all of our userspace facing API >> error flows have a similar counter. > > Hurm, seems a bit strange, but OK > >> And TBH, I think that the mmap flow is quite convoluted with the cookie response >> from the crate verb, so it deserves a counter IMO. > > How so? Userspace takes the u64 from the command and pass it to mmap, > what is convoluted? It's the only flow that involves two phases/userspace calls and not a single command-response call, so it's a bit more error prone. Convoluted was a bit harsh :). ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH for-next 3/3] RDMA/efa: Count admin commands errors 2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman @ 2020-04-20 6:22 ` Gal Pressman 2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe 3 siblings, 0 replies; 11+ messages in thread From: Gal Pressman @ 2020-04-20 6:22 UTC (permalink / raw) To: Jason Gunthorpe, Doug Ledford Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Daniel Kranzdorf, Yossi Leybovich Add a new stat that counts admin commands failures, which might help when debugging different issues. Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> --- drivers/infiniband/hw/efa/efa_com.c | 5 ++++- drivers/infiniband/hw/efa/efa_com.h | 3 ++- drivers/infiniband/hw/efa/efa_verbs.c | 2 ++ 3 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c index 7fce69f5568f..336bc2c57bb1 100644 --- a/drivers/infiniband/hw/efa/efa_com.c +++ b/drivers/infiniband/hw/efa/efa_com.c @@ -631,17 +631,20 @@ int efa_com_cmd_exec(struct efa_com_admin_queue *aq, cmd->aq_common_descriptor.opcode, PTR_ERR(comp_ctx)); up(&aq->avail_cmds); + atomic64_inc(&aq->stats.cmd_err); return PTR_ERR(comp_ctx); } err = efa_com_wait_and_process_admin_cq(comp_ctx, aq); - if (err) + if (err) { ibdev_err_ratelimited( aq->efa_dev, "Failed to process command %s (opcode %u) comp_status %d err %d\n", efa_com_cmd_str(cmd->aq_common_descriptor.opcode), cmd->aq_common_descriptor.opcode, comp_ctx->comp_status, err); + atomic64_inc(&aq->stats.cmd_err); + } up(&aq->avail_cmds); diff --git a/drivers/infiniband/hw/efa/efa_com.h b/drivers/infiniband/hw/efa/efa_com.h index c67dd8109d1c..5e4c88877ddb 100644 --- a/drivers/infiniband/hw/efa/efa_com.h +++ b/drivers/infiniband/hw/efa/efa_com.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ /* - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. */ #ifndef _EFA_COM_H_ @@ -47,6 +47,7 @@ struct efa_com_admin_sq { struct efa_com_stats_admin { atomic64_t submitted_cmd; atomic64_t completed_cmd; + atomic64_t cmd_err; atomic64_t no_completion; }; diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c index 75eef1ec2474..ce7a3ead1ba7 100644 --- a/drivers/infiniband/hw/efa/efa_verbs.c +++ b/drivers/infiniband/hw/efa/efa_verbs.c @@ -37,6 +37,7 @@ struct efa_user_mmap_entry { op(EFA_RX_DROPS, "rx_drops") \ op(EFA_SUBMITTED_CMDS, "submitted_cmds") \ op(EFA_COMPLETED_CMDS, "completed_cmds") \ + op(EFA_CMDS_ERR, "cmds_err") \ op(EFA_NO_COMPLETION_CMDS, "no_completion_cmds") \ op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \ op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \ @@ -1752,6 +1753,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats, as = &dev->edev.aq.stats; stats->value[EFA_SUBMITTED_CMDS] = atomic64_read(&as->submitted_cmd); stats->value[EFA_COMPLETED_CMDS] = atomic64_read(&as->completed_cmd); + stats->value[EFA_CMDS_ERR] = atomic64_read(&as->cmd_err); stats->value[EFA_NO_COMPLETION_CMDS] = atomic64_read(&as->no_completion); s = &dev->stats; -- 2.26.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH for-next 0/3] EFA statistics minor updates 2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman ` (2 preceding siblings ...) 2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman @ 2020-05-02 23:33 ` Jason Gunthorpe 3 siblings, 0 replies; 11+ messages in thread From: Jason Gunthorpe @ 2020-05-02 23:33 UTC (permalink / raw) To: Gal Pressman; +Cc: Doug Ledford, linux-rdma, Alexander Matushevsky On Mon, Apr 20, 2020 at 09:22:10AM +0300, Gal Pressman wrote: > This series contains three small patches to collect a few counters we > found helpful when encountering different issues. Applied to for-next, thanks Jason ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2020-05-02 23:33 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-04-20 6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman 2020-04-24 14:59 ` Jason Gunthorpe 2020-04-24 15:25 ` Gal Pressman 2020-04-24 18:26 ` Jason Gunthorpe 2020-04-26 6:42 ` Gal Pressman 2020-04-26 13:30 ` Jason Gunthorpe 2020-04-26 14:17 ` Gal Pressman 2020-04-20 6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman 2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).