* Re: [PATCH v2] infiniband: remove WARN that is not kernel bug
From: Leon Romanovsky @ 2016-11-21 11:44 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: syzkaller, Jason Gunthorpe, Valdis.Kletnieks-PjAqaU27lzQ,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w, Hal Rosenstock,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, LKML
In-Reply-To: <CACT4Y+ZO9sTbBGwNUUMkAQ5m0N-4aT0S9xypCRbBUch1pQNw9g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 2469 bytes --]
On Mon, Nov 21, 2016 at 11:30:21AM +0100, Dmitry Vyukov wrote:
> On Mon, Nov 21, 2016 at 11:25 AM, Miroslav Benes <mbenes-AlSwsSmVLrQ@public.gmane.org> wrote:
> > On Mon, 21 Nov 2016, Dmitry Vyukov wrote:
> >
> >> WARNINGs mean kernel bugs.
> >> The one in ucma_write() points to user programming error
> >> or a malicious attempt. This is not a kernel bug, remove it.
> >>
> >> BUG/WARNs that are not kernel bugs hinder automated testing effots.
> >>
> >> Signed-off-by: Dmitry Vyukov <dvyukov-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >> Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >> Cc: Hal Rosenstock <hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> Cc: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> >> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> Cc: syzkaller-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
> >>
> >> ---
> >> Changes since v1:
> >> - added printk_once
> >> ---
> >> drivers/infiniband/core/ucma.c | 5 ++++-
> >> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> >> index 9520154..405d0ce 100644
> >> --- a/drivers/infiniband/core/ucma.c
> >> +++ b/drivers/infiniband/core/ucma.c
> >> @@ -1584,8 +1584,11 @@ static ssize_t ucma_write(struct file *filp, const char __user *buf,
> >> struct rdma_ucm_cmd_hdr hdr;
> >> ssize_t ret;
> >>
> >> - if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
> >> + if (!ib_safe_file_access(filp)) {
> >> + printk_once("ucma_write: process %d (%s) tried to do something hinky\n",
> >> + task_tgid_vnr(current), current->comm);
> >> return -EACCES;
> >> + }
> >>
> >> if (len < sizeof(hdr))
> >> return -EINVAL;
> >
> > FWIW, WARN_ON_ONCE came with commit e6bd18f57aad ("IB/security: Restrict
> > use of the write() interface"). Would it make sense to change the other
> > places as well?
>
>
> I guess so.
> Can I ask somebody of infiniband maintainers to take care of this?
Please see below,
Hope it helps.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[-- Attachment #1.2: 0001-IB-core-qib-Remove-WARN-that-is-not-kernel-bug.patch --]
[-- Type: text/x-diff, Size: 3383 bytes --]
From e0bfe4771591f9ffbcaa4e66d6257ed95e2d7188 Mon Sep 17 00:00:00 2001
From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Mon, 21 Nov 2016 13:30:59 +0200
Subject: [PATCH] IB/{core, qib}: Remove WARN that is not kernel bug
WARNINGs mean kernel bugs, in this case, they are placed
to mark programming errors and/or malicious attempts.
BUG/WARNs that are not kernel bugs hinder automated testing efforts.
Signed-off-by: Dmitry Vyukov <dvyukov-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/ucm.c | 5 ++++-
drivers/infiniband/core/ucma.c | 5 ++++-
drivers/infiniband/core/uverbs_main.c | 5 ++++-
drivers/infiniband/hw/qib/qib_file_ops.c | 5 ++++-
4 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 7713ef0..0076b55 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1104,8 +1104,11 @@ static ssize_t ib_ucm_write(struct file *filp, const char __user *buf,
struct ib_ucm_cmd_hdr hdr;
ssize_t result;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("ucm_write: process %d (%s) tried to do something hinky\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (len < sizeof(hdr))
return -EINVAL;
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 9520154..31fa27f 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1584,8 +1584,11 @@ static ssize_t ucma_write(struct file *filp, const char __user *buf,
struct rdma_ucm_cmd_hdr hdr;
ssize_t ret;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("ucma_write: process %d (%s) tried to do something hinky\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (len < sizeof(hdr))
return -EINVAL;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 44b1104..329eb15 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -746,8 +746,11 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
int srcu_key;
ssize_t ret;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("uverbs_write: process %d (%s) tried to do something hinky\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (count < sizeof hdr)
return -EINVAL;
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index 382466a..b43ea52 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -2066,8 +2066,11 @@ static ssize_t qib_write(struct file *fp, const char __user *data,
ssize_t ret = 0;
void *dest;
- if (WARN_ON_ONCE(!ib_safe_file_access(fp)))
+ if (!ib_safe_file_access(fp)) {
+ pr_err_once("qib_write: process %d (%s) tried to do something hinky\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (count < sizeof(cmd.type)) {
ret = -EINVAL;
--
2.7.4
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply related
* Re: [PATCH v2] infiniband: remove WARN that is not kernel bug
From: Dmitry Vyukov @ 2016-11-21 11:48 UTC (permalink / raw)
To: Leon Romanovsky
Cc: syzkaller, Jason Gunthorpe, Valdis.Kletnieks, dledford,
sean.hefty, Hal Rosenstock, linux-rdma, LKML
In-Reply-To: <20161121114417.GA4158@leon.nu>
On Mon, Nov 21, 2016 at 12:44 PM, Leon Romanovsky <leon@kernel.org> wrote:
> On Mon, Nov 21, 2016 at 11:30:21AM +0100, Dmitry Vyukov wrote:
>> On Mon, Nov 21, 2016 at 11:25 AM, Miroslav Benes <mbenes@suse.cz> wrote:
>> > On Mon, 21 Nov 2016, Dmitry Vyukov wrote:
>> >
>> >> WARNINGs mean kernel bugs.
>> >> The one in ucma_write() points to user programming error
>> >> or a malicious attempt. This is not a kernel bug, remove it.
>> >>
>> >> BUG/WARNs that are not kernel bugs hinder automated testing effots.
>> >>
>> >> Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
>> >> Cc: Doug Ledford <dledford@redhat.com>
>> >> Cc: Sean Hefty <sean.hefty@intel.com>
>> >> Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
>> >> Cc: Leon Romanovsky <leon@kernel.org>
>> >> Cc: linux-rdma@vger.kernel.org
>> >> Cc: linux-kernel@vger.kernel.org
>> >> Cc: syzkaller@googlegroups.com
>> >>
>> >> ---
>> >> Changes since v1:
>> >> - added printk_once
>> >> ---
>> >> drivers/infiniband/core/ucma.c | 5 ++++-
>> >> 1 file changed, 4 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
>> >> index 9520154..405d0ce 100644
>> >> --- a/drivers/infiniband/core/ucma.c
>> >> +++ b/drivers/infiniband/core/ucma.c
>> >> @@ -1584,8 +1584,11 @@ static ssize_t ucma_write(struct file *filp, const char __user *buf,
>> >> struct rdma_ucm_cmd_hdr hdr;
>> >> ssize_t ret;
>> >>
>> >> - if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
>> >> + if (!ib_safe_file_access(filp)) {
>> >> + printk_once("ucma_write: process %d (%s) tried to do something hinky\n",
>> >> + task_tgid_vnr(current), current->comm);
>> >> return -EACCES;
>> >> + }
>> >>
>> >> if (len < sizeof(hdr))
>> >> return -EINVAL;
>> >
>> > FWIW, WARN_ON_ONCE came with commit e6bd18f57aad ("IB/security: Restrict
>> > use of the write() interface"). Would it make sense to change the other
>> > places as well?
>>
>>
>> I guess so.
>> Can I ask somebody of infiniband maintainers to take care of this?
>
> Please see below,
> Hope it helps.
In ib_ucm_write function there is a wrong prefix:
+ pr_err_once("ucm_write: process %d (%s) tried to do something hinky\n",
Otherwise looks good. Thanks.
^ permalink raw reply
* Re: [PATCH v2] infiniband: remove WARN that is not kernel bug
From: Leon Romanovsky @ 2016-11-21 12:14 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: syzkaller, Jason Gunthorpe, Valdis.Kletnieks-PjAqaU27lzQ,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w, Hal Rosenstock,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, LKML
In-Reply-To: <CACT4Y+Y1sL-8K7HhZ5M5ABVPG0kM=dKZh+6gvuNRvzXFtiA5aQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 2805 bytes --]
On Mon, Nov 21, 2016 at 12:48:35PM +0100, Dmitry Vyukov wrote:
> On Mon, Nov 21, 2016 at 12:44 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On Mon, Nov 21, 2016 at 11:30:21AM +0100, Dmitry Vyukov wrote:
> >> On Mon, Nov 21, 2016 at 11:25 AM, Miroslav Benes <mbenes-AlSwsSmVLrQ@public.gmane.org> wrote:
> >> > On Mon, 21 Nov 2016, Dmitry Vyukov wrote:
> >> >
> >> >> WARNINGs mean kernel bugs.
> >> >> The one in ucma_write() points to user programming error
> >> >> or a malicious attempt. This is not a kernel bug, remove it.
> >> >>
> >> >> BUG/WARNs that are not kernel bugs hinder automated testing effots.
> >> >>
> >> >> Signed-off-by: Dmitry Vyukov <dvyukov-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> >> >> Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >> >> Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> >> >> Cc: Hal Rosenstock <hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> >> >> Cc: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> >> >> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> >> Cc: syzkaller-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
> >> >>
> >> >> ---
> >> >> Changes since v1:
> >> >> - added printk_once
> >> >> ---
> >> >> drivers/infiniband/core/ucma.c | 5 ++++-
> >> >> 1 file changed, 4 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> >> >> index 9520154..405d0ce 100644
> >> >> --- a/drivers/infiniband/core/ucma.c
> >> >> +++ b/drivers/infiniband/core/ucma.c
> >> >> @@ -1584,8 +1584,11 @@ static ssize_t ucma_write(struct file *filp, const char __user *buf,
> >> >> struct rdma_ucm_cmd_hdr hdr;
> >> >> ssize_t ret;
> >> >>
> >> >> - if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
> >> >> + if (!ib_safe_file_access(filp)) {
> >> >> + printk_once("ucma_write: process %d (%s) tried to do something hinky\n",
> >> >> + task_tgid_vnr(current), current->comm);
> >> >> return -EACCES;
> >> >> + }
> >> >>
> >> >> if (len < sizeof(hdr))
> >> >> return -EINVAL;
> >> >
> >> > FWIW, WARN_ON_ONCE came with commit e6bd18f57aad ("IB/security: Restrict
> >> > use of the write() interface"). Would it make sense to change the other
> >> > places as well?
> >>
> >>
> >> I guess so.
> >> Can I ask somebody of infiniband maintainers to take care of this?
> >
> > Please see below,
> > Hope it helps.
>
> In ib_ucm_write function there is a wrong prefix:
>
> + pr_err_once("ucm_write: process %d (%s) tried to do something hinky\n",
I did it intentionally to have the same errors for all flows.
>
> Otherwise looks good. Thanks.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [PATCH v5 5/9] IB/isert: Replace semaphore sem with completion
From: Sagi Grimberg @ 2016-11-21 12:33 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Binoy Jayan, Doug Ledford, Sean Hefty, Hal Rosenstock,
Leon Romanovsky, Sagi Grimberg, Bart Van Assche,
Nicholas Bellinger, Jenny Derzhavetz, Ira Weiny, Steve Wise,
Mark Bloch, Tatyana E Nikolova, Matan Barak, Lijun Ou,
Wei Hu(Xavier), Faisal Latif, Mustafa Ismail, Mark Brown,
linux-rdma
In-Reply-To: <2625189.HB0S3sfnY7@wuerfel>
>>> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
>>> index 6dd43f6..de80f56 100644
>>> --- a/drivers/infiniband/ulp/isert/ib_isert.c
>>> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>>> @@ -619,7 +619,7 @@
>>> mutex_unlock(&isert_np->mutex);
>>>
>>> isert_info("np %p: Allow accept_np to continue\n", isert_np);
>>> - up(&isert_np->sem);
>>> + complete(&isert_np->comp);
>>> }
>>>
>>> static void
>>> @@ -2311,7 +2311,7 @@ struct rdma_cm_id *
>>> isert_err("Unable to allocate struct isert_np\n");
>>> return -ENOMEM;
>>> }
>>> - sema_init(&isert_np->sem, 0);
>>> + init_completion(&isert_np->comp);
>>
>> This is still racy, a connect event can complete just before we
>> init the completion and *will* get lost...
>>
>> This code started off with a waitqueue which exposes the same
>> problem, see:
>> 531b7bf4bd79 Target/iser: Fix iscsit_accept_np and rdma_cm racy flow
>>
>> So, still NAK from me...
>
> I don't see what you mean here. The code using a waitqueue had no
> counter but the completion does.
The problem here is that init_completion sets the counter to zero
and between this thread wakes up and until it initializes comp->done
we might have another connect event completing it and I fail to
see how this is handled correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v5 5/9] IB/isert: Replace semaphore sem with completion
From: Arnd Bergmann @ 2016-11-21 14:50 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Binoy Jayan, Doug Ledford, Sean Hefty, Hal Rosenstock,
Leon Romanovsky, Sagi Grimberg, Bart Van Assche,
Nicholas Bellinger, Jenny Derzhavetz, Ira Weiny, Steve Wise,
Mark Bloch, Tatyana E Nikolova, Matan Barak, Lijun Ou,
Wei Hu(Xavier), Faisal Latif, Mustafa Ismail, Mark Brown,
linux-rdma
In-Reply-To: <5489f4ef-02fb-5207-6261-c08ee805acae@grimberg.me>
On Monday, November 21, 2016 2:33:02 PM CET Sagi Grimberg wrote:
> >>> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
> >>> index 6dd43f6..de80f56 100644
> >>> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> >>> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> >>> @@ -619,7 +619,7 @@
> >>> mutex_unlock(&isert_np->mutex);
> >>>
> >>> isert_info("np %p: Allow accept_np to continue\n", isert_np);
> >>> - up(&isert_np->sem);
> >>> + complete(&isert_np->comp);
> >>> }
> >>>
> >>> static void
> >>> @@ -2311,7 +2311,7 @@ struct rdma_cm_id *
> >>> isert_err("Unable to allocate struct isert_np\n");
> >>> return -ENOMEM;
> >>> }
> >>> - sema_init(&isert_np->sem, 0);
> >>> + init_completion(&isert_np->comp);
> >>
> >> This is still racy, a connect event can complete just before we
> >> init the completion and *will* get lost...
> >>
> >> This code started off with a waitqueue which exposes the same
> >> problem, see:
> >> 531b7bf4bd79 Target/iser: Fix iscsit_accept_np and rdma_cm racy flow
> >>
> >> So, still NAK from me...
> >
> > I don't see what you mean here. The code using a waitqueue had no
> > counter but the completion does.
>
> The problem here is that init_completion sets the counter to zero
> and between this thread wakes up and until it initializes comp->done
> we might have another connect event completing it and I fail to
> see how this is handled correctly.
But the sema_init()/init_completion() is done right after the
structure is allocated, so nothing should have a pointer to it
at this time.
Later when we do the down()/wait_for_completion(), that will
just decrement the counter rather than initialize, and converting
from one to the other doesn't change anything here.
Arnd
^ permalink raw reply
* Re: [PATCH v6 12/16] IB/pvrdma: Add Queue Pair support
From: Yuval Shaia @ 2016-11-21 15:51 UTC (permalink / raw)
To: Adit Ranadive
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
pv-drivers-pghWNbHTmq7QT0dZR+AlfA, jhansen-pghWNbHTmq7QT0dZR+AlfA,
asarwade-pghWNbHTmq7QT0dZR+AlfA,
georgezhang-pghWNbHTmq7QT0dZR+AlfA,
bryantan-pghWNbHTmq7QT0dZR+AlfA
In-Reply-To: <6a643e92376856394d45638d80a90619d3abac37.1475458407.git.aditr-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
On Sun, Oct 02, 2016 at 07:10:32PM -0700, Adit Ranadive wrote:
> +
> +/**
> + * pvrdma_post_send - post send work request entries on a QP
> + * @ibqp: the QP
> + * @wr: work request list to post
> + * @bad_wr: the first bad WR returned
> + *
> + * @return: 0 on success, otherwise errno returned.
> + */
> +int pvrdma_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> + struct ib_send_wr **bad_wr)
> +{
> + struct pvrdma_qp *qp = to_vqp(ibqp);
> + struct pvrdma_dev *dev = to_vdev(ibqp->device);
> + unsigned long flags;
> + struct pvrdma_sq_wqe_hdr *wqe_hdr;
> + struct pvrdma_sge *sge;
> + int i, index;
> + int nreq;
> + int ret;
> +
> + /*
> + * In states lower than RTS, we can fail immediately. In other states,
> + * just post and let the device figure it out.
> + */
> + if (qp->state < IB_QPS_RTS) {
> + *bad_wr = wr;
> + return -EINVAL;
> + }
> +
> + spin_lock_irqsave(&qp->sq.lock, flags);
> +
> + index = pvrdma_idx(&qp->sq.ring->prod_tail, qp->sq.wqe_cnt);
> + for (nreq = 0; wr; nreq++, wr = wr->next) {
nreq is not in used so better remove it.
> + unsigned int tail;
> +
> + if (unlikely(!pvrdma_idx_ring_has_space(
> + qp->sq.ring, qp->sq.wqe_cnt, &tail))) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "send queue is full\n");
> + *bad_wr = wr;
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + if (unlikely(wr->num_sge > qp->sq.max_sg || wr->num_sge < 0)) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "send SGE overflow\n");
> + *bad_wr = wr;
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (unlikely(wr->opcode < 0)) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "invalid send opcode\n");
> + *bad_wr = wr;
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Only support UD, RC.
> + * Need to check opcode table for thorough checking.
> + * opcode _UD _UC _RC
> + * _SEND x x x
> + * _SEND_WITH_IMM x x x
> + * _RDMA_WRITE x x
> + * _RDMA_WRITE_WITH_IMM x x
> + * _LOCAL_INV x x
> + * _SEND_WITH_INV x x
> + * _RDMA_READ x
> + * _ATOMIC_CMP_AND_SWP x
> + * _ATOMIC_FETCH_AND_ADD x
> + * _MASK_ATOMIC_CMP_AND_SWP x
> + * _MASK_ATOMIC_FETCH_AND_ADD x
> + * _REG_MR x
> + *
> + */
> + if (qp->ibqp.qp_type != IB_QPT_UD &&
> + qp->ibqp.qp_type != IB_QPT_RC &&
> + wr->opcode != IB_WR_SEND) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "unsupported queuepair type\n");
> + *bad_wr = wr;
> + ret = -EINVAL;
> + goto out;
> + } else if (qp->ibqp.qp_type == IB_QPT_UD ||
> + qp->ibqp.qp_type == IB_QPT_GSI) {
> + if (wr->opcode != IB_WR_SEND &&
> + wr->opcode != IB_WR_SEND_WITH_IMM) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "invalid send opcode\n");
> + *bad_wr = wr;
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> +
> + wqe_hdr = (struct pvrdma_sq_wqe_hdr *)get_sq_wqe(qp, index);
> + memset(wqe_hdr, 0, sizeof(*wqe_hdr));
> + wqe_hdr->wr_id = wr->wr_id;
> + wqe_hdr->num_sge = wr->num_sge;
> + wqe_hdr->opcode = ib_wr_opcode_to_pvrdma(wr->opcode);
> + wqe_hdr->send_flags = ib_send_flags_to_pvrdma(wr->send_flags);
> + if (wr->opcode == IB_WR_SEND_WITH_IMM ||
> + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
> + wqe_hdr->ex.imm_data = wr->ex.imm_data;
> +
> + switch (qp->ibqp.qp_type) {
> + case IB_QPT_GSI:
> + case IB_QPT_UD:
> + if (unlikely(!ud_wr(wr)->ah)) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "invalid address handle\n");
> + *bad_wr = wr;
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Use qkey from qp context if high order bit set,
> + * otherwise from work request.
> + */
> + wqe_hdr->wr.ud.remote_qpn = ud_wr(wr)->remote_qpn;
> + wqe_hdr->wr.ud.remote_qkey =
> + ud_wr(wr)->remote_qkey & 0x80000000 ?
> + qp->qkey : ud_wr(wr)->remote_qkey;
> + wqe_hdr->wr.ud.av = to_vah(ud_wr(wr)->ah)->av;
> +
> + break;
> + case IB_QPT_RC:
> + switch (wr->opcode) {
> + case IB_WR_RDMA_READ:
> + case IB_WR_RDMA_WRITE:
> + case IB_WR_RDMA_WRITE_WITH_IMM:
> + wqe_hdr->wr.rdma.remote_addr =
> + rdma_wr(wr)->remote_addr;
> + wqe_hdr->wr.rdma.rkey = rdma_wr(wr)->rkey;
> + break;
> + case IB_WR_LOCAL_INV:
> + case IB_WR_SEND_WITH_INV:
> + wqe_hdr->ex.invalidate_rkey =
> + wr->ex.invalidate_rkey;
> + break;
> + case IB_WR_ATOMIC_CMP_AND_SWP:
> + case IB_WR_ATOMIC_FETCH_AND_ADD:
> + wqe_hdr->wr.atomic.remote_addr =
> + atomic_wr(wr)->remote_addr;
> + wqe_hdr->wr.atomic.rkey = atomic_wr(wr)->rkey;
> + wqe_hdr->wr.atomic.compare_add =
> + atomic_wr(wr)->compare_add;
> + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP)
> + wqe_hdr->wr.atomic.swap =
> + atomic_wr(wr)->swap;
> + break;
> + case IB_WR_REG_MR:
> + ret = set_reg_seg(wqe_hdr, reg_wr(wr));
> + if (ret < 0) {
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "Failed to set fast register work request\n");
> + *bad_wr = wr;
> + goto out;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + break;
> + default:
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "invalid queuepair type\n");
> + ret = -EINVAL;
> + *bad_wr = wr;
> + goto out;
> + }
> +
> + sge = (struct pvrdma_sge *)(wqe_hdr + 1);
> + for (i = 0; i < wr->num_sge; i++) {
> + /* Need to check wqe_size 0 or max size */
> + sge->addr = wr->sg_list[i].addr;
> + sge->length = wr->sg_list[i].length;
> + sge->lkey = wr->sg_list[i].lkey;
> + sge++;
> + }
> +
> + /* Make sure wqe is written before index update */
> + smp_wmb();
> +
> + index++;
> + if (unlikely(index >= qp->sq.wqe_cnt))
> + index = 0;
> + /* Update shared sq ring */
> + pvrdma_idx_ring_inc(&qp->sq.ring->prod_tail,
> + qp->sq.wqe_cnt);
> + }
> +
> + ret = 0;
> +
> +out:
> + spin_unlock_irqrestore(&qp->sq.lock, flags);
> +
> + if (!ret)
> + pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_SEND | qp->qp_handle);
> +
> + return ret;
> +}
> +
> +/**
> + * pvrdma_post_receive - post receive work request entries on a QP
> + * @ibqp: the QP
> + * @wr: the work request list to post
> + * @bad_wr: the first bad WR returned
> + *
> + * @return: 0 on success, otherwise errno returned.
> + */
> +int pvrdma_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
> + struct ib_recv_wr **bad_wr)
> +{
> + struct pvrdma_dev *dev = to_vdev(ibqp->device);
> + unsigned long flags;
> + struct pvrdma_qp *qp = to_vqp(ibqp);
> + struct pvrdma_rq_wqe_hdr *wqe_hdr;
> + struct pvrdma_sge *sge;
> + int index, nreq;
> + int ret = 0;
> + int i;
> +
> + /*
> + * In the RESET state, we can fail immediately. For other states,
> + * just post and let the device figure it out.
> + */
> + if (qp->state == IB_QPS_RESET) {
> + *bad_wr = wr;
> + return -EINVAL;
> + }
> +
> + spin_lock_irqsave(&qp->rq.lock, flags);
> +
> + index = pvrdma_idx(&qp->rq.ring->prod_tail, qp->rq.wqe_cnt);
> + for (nreq = 0; wr; nreq++, wr = wr->next) {
ditto
> + unsigned int tail;
> +
> + if (unlikely(wr->num_sge > qp->rq.max_sg ||
> + wr->num_sge < 0)) {
> + ret = -EINVAL;
> + *bad_wr = wr;
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "recv SGE overflow\n");
> + goto out;
> + }
> +
> + if (unlikely(!pvrdma_idx_ring_has_space(
> + qp->rq.ring, qp->rq.wqe_cnt, &tail))) {
> + ret = -ENOMEM;
> + *bad_wr = wr;
> + dev_warn_ratelimited(&dev->pdev->dev,
> + "recv queue full\n");
> + goto out;
> + }
> +
> + wqe_hdr = (struct pvrdma_rq_wqe_hdr *)get_rq_wqe(qp, index);
> + wqe_hdr->wr_id = wr->wr_id;
> + wqe_hdr->num_sge = wr->num_sge;
> + wqe_hdr->total_len = 0;
> +
> + sge = (struct pvrdma_sge *)(wqe_hdr + 1);
> + for (i = 0; i < wr->num_sge; i++) {
> + sge->addr = wr->sg_list[i].addr;
> + sge->length = wr->sg_list[i].length;
> + sge->lkey = wr->sg_list[i].lkey;
> + sge++;
> + }
> +
> + /* Make sure wqe is written before index update */
> + smp_wmb();
> +
> + index++;
> + if (unlikely(index >= qp->rq.wqe_cnt))
> + index = 0;
> + /* Update shared rq ring */
> + pvrdma_idx_ring_inc(&qp->rq.ring->prod_tail,
> + qp->rq.wqe_cnt);
> + }
> +
> + spin_unlock_irqrestore(&qp->rq.lock, flags);
> +
> + pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_RECV | qp->qp_handle);
> +
> + return ret;
> +
> +out:
> + spin_unlock_irqrestore(&qp->rq.lock, flags);
> +
> + return ret;
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* RE: [PATCH for-next 03/11] IB/hns: Optimize the logic of allocating memory using APIs
From: Salil Mehta @ 2016-11-21 16:12 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford@redhat.com, Huwei (Xavier), oulijun,
mehta.salil.lnk@gmail.com, linux-rdma@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
Zhangping (ZP)
In-Reply-To: <20161116083602.GH4240@leon.nu>
> -----Original Message-----
> From: Leon Romanovsky [mailto:leon@kernel.org]
> Sent: Wednesday, November 16, 2016 8:36 AM
> To: Salil Mehta
> Cc: dledford@redhat.com; Huwei (Xavier); oulijun;
> mehta.salil.lnk@gmail.com; linux-rdma@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm;
> Zhangping (ZP)
> Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> allocating memory using APIs
>
> On Tue, Nov 15, 2016 at 03:52:46PM +0000, Salil Mehta wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > Sent: Wednesday, November 09, 2016 7:22 AM
> > > To: Salil Mehta
> > > Cc: dledford@redhat.com; Huwei (Xavier); oulijun;
> > > mehta.salil.lnk@gmail.com; linux-rdma@vger.kernel.org;
> > > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm;
> > > Zhangping (ZP)
> > > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> > > allocating memory using APIs
> > >
> > > On Fri, Nov 04, 2016 at 04:36:25PM +0000, Salil Mehta wrote:
> > > > From: "Wei Hu (Xavier)" <xavier.huwei@huawei.com>
> > > >
> > > > This patch modified the logic of allocating memory using APIs in
> > > > hns RoCE driver. We used kcalloc instead of kmalloc_array and
> > > > bitmap_zero. And When kcalloc failed, call vzalloc to alloc
> > > > memory.
> > > >
> > > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
> > > > Signed-off-by: Ping Zhang <zhangping5@huawei.com>
> > > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > > ---
> > > > drivers/infiniband/hw/hns/hns_roce_mr.c | 15 ++++++++-------
> > > > 1 file changed, 8 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > index fb87883..d3dfb5f 100644
> > > > --- a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > @@ -137,11 +137,12 @@ static int hns_roce_buddy_init(struct
> > > hns_roce_buddy *buddy, int max_order)
> > > >
> > > > for (i = 0; i <= buddy->max_order; ++i) {
> > > > s = BITS_TO_LONGS(1 << (buddy->max_order - i));
> > > > - buddy->bits[i] = kmalloc_array(s, sizeof(long),
> > > GFP_KERNEL);
> > > > - if (!buddy->bits[i])
> > > > - goto err_out_free;
> > > > -
> > > > - bitmap_zero(buddy->bits[i], 1 << (buddy->max_order -
> i));
> > > > + buddy->bits[i] = kcalloc(s, sizeof(long),
> GFP_KERNEL);
> > > > + if (!buddy->bits[i]) {
> > > > + buddy->bits[i] = vzalloc(s * sizeof(long));
> > >
> > > I wonder, why don't you use directly vzalloc instead of kcalloc
> > > fallback?
> > As we know we will have physical contiguous pages if the kcalloc
> > call succeeds. This will give us a chance to have better performance
> > over the allocations which are just virtually contiguous through the
> > function vzalloc(). Therefore, later has only been used as a fallback
> > when our memory request cannot be entertained through kcalloc.
> >
> > Are you suggesting that there will not be much performance penalty
> > if we use just vzalloc ?
>
> Not exactly,
> I asked it, because we have similar code in our drivers and this
> construction looks strange to me.
>
> 1. If performance is critical, we will use kmalloc.
> 2. If performance is not critical, we will use vmalloc.
>
> But in this case, such construction shows me that we can live with
> vmalloc performance and kmalloc allocation are not really needed.
>
> In your specific case, I'm not sure that kcalloc will ever fail.
Performance is definitely critical here. Though, I agree this is bit
unusual way of memory allocation. In actual, we were encountering
memory alloc failures using kmalloc (if you see allocation amount
is on the higher side and is exponential) so we ended up using
vmalloc as fall back - It is very naïve allocation scheme.
Maybe we need to rethink this allocation scheme part? Also, I can pull
back this particular patch for now or just live with vzalloc() till
we figure out proper solution to this?
>
> Thanks
>
>
> >
> > >
> > > > + if (!buddy->bits[i])
> > > > + goto err_out_free;
> > > > + }
> > > > }
^ permalink raw reply
* Re: [PATCH v5 2/9] IB/core: Replace semaphore sm_sem with an atomic wait
From: Arnd Bergmann @ 2016-11-21 16:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Binoy Jayan, linux-rdma, Mustafa Ismail, Lijun Ou,
Nicholas Bellinger, Leon Romanovsky, target-devel,
Tatyana E Nikolova, Doug Ledford, Jenny Derzhavetz, Sagi Grimberg,
Sean Hefty, Faisal Latif, Hal Rosenstock, linux-kernel,
Wei Hu(Xavier), Mark Brown, Mark Bloch, Steve Wise, Ira Weiny
In-Reply-To: <CA+55aFxGjaqduhRCyk0mVxEA7aqQ-omdG8SBreZ=x5cW2ovngQ@mail.gmail.com>
On Monday, November 21, 2016 7:57:53 AM CET Linus Torvalds wrote:
> Don't do this.
>
> Never ever do your own locking primitives. You will get the memory ordering
> wrong. And even if you get it right, why do it?
>
> If you want to get rid of semaphores, and replace them with a mutex, that's
> OK. But don't replace them with something more complex like an open coded
> waiting model.
I think a mutex would't work here, since fops->open() and fops->close()
are not called from the same context and lockdep will complain
about that.
Version of the series had replaced the semaphore with a completion
here, which worked correctly, but one reviewer suggested using
the wait_event() instead since it's confusing to have a completion
starting out in 'completed' state.
Arnd
^ permalink raw reply
* Re: [PATCH v2] infiniband: remove WARN that is not kernel bug
From: Jason Gunthorpe @ 2016-11-21 16:52 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Dmitry Vyukov, syzkaller, Valdis.Kletnieks, dledford, sean.hefty,
Hal Rosenstock, linux-rdma, LKML
In-Reply-To: <20161121121408.GC4158@leon.nu>
On Mon, Nov 21, 2016 at 02:14:08PM +0200, Leon Romanovsky wrote:
> >
> > In ib_ucm_write function there is a wrong prefix:
> >
> > + pr_err_once("ucm_write: process %d (%s) tried to do something hinky\n",
>
> I did it intentionally to have the same errors for all flows.
Lets actually use a good message too please?
pr_err_once("ucm_write: process %d (%s) changed security contexts after opening FD, this is not allowed.\n",
Jason
^ permalink raw reply
* Re: [PATCH v5 2/9] IB/core: Replace semaphore sm_sem with an atomic wait
From: Christoph Hellwig @ 2016-11-21 16:57 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Linus Torvalds, Binoy Jayan, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Mustafa Ismail, Lijun Ou, Nicholas Bellinger, Leon Romanovsky,
target-devel-u79uwXL29TY76Z2rM5mHXA, Tatyana E Nikolova,
Doug Ledford, Jenny Derzhavetz, Sagi Grimberg, Sean Hefty,
Faisal Latif, Hal Rosenstock, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
Wei Hu(Xavier), Mark Brown, Mark Bloch, Steve Wise
In-Reply-To: <4374986.OFD8bCgUa1@wuerfel>
On Mon, Nov 21, 2016 at 05:52:27PM +0100, Arnd Bergmann wrote:
> I think a mutex would't work here, since fops->open() and fops->close()
> are not called from the same context and lockdep will complain
> about that.
>
> Version of the series had replaced the semaphore with a completion
> here, which worked correctly, but one reviewer suggested using
> the wait_event() instead since it's confusing to have a completion
> starting out in 'completed' state.
On the other hand the existing semaphore works perfectly fine, and
is way easier to understand than any of the other models, which also
don't provide any benefit at all.
Please stop messing with perfectly fine semaphores now - there are
plenty that are much better replaced with mutexes or completions,
but this (and various others are not). Leave them alone and do
something useful instead.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of allocating memory using APIs
From: Leon Romanovsky @ 2016-11-21 17:14 UTC (permalink / raw)
To: Salil Mehta
Cc: dledford@redhat.com, Huwei (Xavier), oulijun,
mehta.salil.lnk@gmail.com, linux-rdma@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Linuxarm,
Zhangping (ZP)
In-Reply-To: <F4CC6FACFEB3C54C9141D49AD221F7F91A7AD7DF@lhreml503-mbx>
[-- Attachment #1: Type: text/plain, Size: 4775 bytes --]
On Mon, Nov 21, 2016 at 04:12:38PM +0000, Salil Mehta wrote:
> > -----Original Message-----
> > From: Leon Romanovsky [mailto:leon@kernel.org]
> > Sent: Wednesday, November 16, 2016 8:36 AM
> > To: Salil Mehta
> > Cc: dledford@redhat.com; Huwei (Xavier); oulijun;
> > mehta.salil.lnk@gmail.com; linux-rdma@vger.kernel.org;
> > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm;
> > Zhangping (ZP)
> > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> > allocating memory using APIs
> >
> > On Tue, Nov 15, 2016 at 03:52:46PM +0000, Salil Mehta wrote:
> > > > -----Original Message-----
> > > > From: Leon Romanovsky [mailto:leon@kernel.org]
> > > > Sent: Wednesday, November 09, 2016 7:22 AM
> > > > To: Salil Mehta
> > > > Cc: dledford@redhat.com; Huwei (Xavier); oulijun;
> > > > mehta.salil.lnk@gmail.com; linux-rdma@vger.kernel.org;
> > > > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm;
> > > > Zhangping (ZP)
> > > > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> > > > allocating memory using APIs
> > > >
> > > > On Fri, Nov 04, 2016 at 04:36:25PM +0000, Salil Mehta wrote:
> > > > > From: "Wei Hu (Xavier)" <xavier.huwei@huawei.com>
> > > > >
> > > > > This patch modified the logic of allocating memory using APIs in
> > > > > hns RoCE driver. We used kcalloc instead of kmalloc_array and
> > > > > bitmap_zero. And When kcalloc failed, call vzalloc to alloc
> > > > > memory.
> > > > >
> > > > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
> > > > > Signed-off-by: Ping Zhang <zhangping5@huawei.com>
> > > > > Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
> > > > > ---
> > > > > drivers/infiniband/hw/hns/hns_roce_mr.c | 15 ++++++++-------
> > > > > 1 file changed, 8 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > index fb87883..d3dfb5f 100644
> > > > > --- a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > @@ -137,11 +137,12 @@ static int hns_roce_buddy_init(struct
> > > > hns_roce_buddy *buddy, int max_order)
> > > > >
> > > > > for (i = 0; i <= buddy->max_order; ++i) {
> > > > > s = BITS_TO_LONGS(1 << (buddy->max_order - i));
> > > > > - buddy->bits[i] = kmalloc_array(s, sizeof(long),
> > > > GFP_KERNEL);
> > > > > - if (!buddy->bits[i])
> > > > > - goto err_out_free;
> > > > > -
> > > > > - bitmap_zero(buddy->bits[i], 1 << (buddy->max_order -
> > i));
> > > > > + buddy->bits[i] = kcalloc(s, sizeof(long),
> > GFP_KERNEL);
> > > > > + if (!buddy->bits[i]) {
> > > > > + buddy->bits[i] = vzalloc(s * sizeof(long));
> > > >
> > > > I wonder, why don't you use directly vzalloc instead of kcalloc
> > > > fallback?
> > > As we know we will have physical contiguous pages if the kcalloc
> > > call succeeds. This will give us a chance to have better performance
> > > over the allocations which are just virtually contiguous through the
> > > function vzalloc(). Therefore, later has only been used as a fallback
> > > when our memory request cannot be entertained through kcalloc.
> > >
> > > Are you suggesting that there will not be much performance penalty
> > > if we use just vzalloc ?
> >
> > Not exactly,
> > I asked it, because we have similar code in our drivers and this
> > construction looks strange to me.
> >
> > 1. If performance is critical, we will use kmalloc.
> > 2. If performance is not critical, we will use vmalloc.
> >
> > But in this case, such construction shows me that we can live with
> > vmalloc performance and kmalloc allocation are not really needed.
> >
> > In your specific case, I'm not sure that kcalloc will ever fail.
> Performance is definitely critical here. Though, I agree this is bit
> unusual way of memory allocation. In actual, we were encountering
> memory alloc failures using kmalloc (if you see allocation amount
> is on the higher side and is exponential) so we ended up using
> vmalloc as fall back - It is very naïve allocation scheme.
I understand it, we did the same, see our mlx5_vzalloc call.
BTW, we used __GFP_NOWARN flag, which you should consider to use
in your case too.
>
> Maybe we need to rethink this allocation scheme part? Also, I can pull
> back this particular patch for now or just live with vzalloc() till
> we figure out proper solution to this?
It is up to you, I don't think that you should drop it, AFAIK, there is
no other proper solution.
>
> >
> > Thanks
> >
> >
> > >
> > > >
> > > > > + if (!buddy->bits[i])
> > > > > + goto err_out_free;
> > > > > + }
> > > > > }
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [PATCH v2] infiniband: remove WARN that is not kernel bug
From: Leon Romanovsky @ 2016-11-21 17:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Dmitry Vyukov, syzkaller, Valdis.Kletnieks-PjAqaU27lzQ,
dledford-H+wXaHxf7aLQT0dZR+AlfA,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w, Hal Rosenstock,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, LKML
In-Reply-To: <20161121165253.GA22237-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 523 bytes --]
On Mon, Nov 21, 2016 at 09:52:53AM -0700, Jason Gunthorpe wrote:
> On Mon, Nov 21, 2016 at 02:14:08PM +0200, Leon Romanovsky wrote:
> > >
> > > In ib_ucm_write function there is a wrong prefix:
> > >
> > > + pr_err_once("ucm_write: process %d (%s) tried to do something hinky\n",
> >
> > I did it intentionally to have the same errors for all flows.
>
> Lets actually use a good message too please?
>
> pr_err_once("ucm_write: process %d (%s) changed security contexts after opening FD, this is not allowed.\n",
>
> Jason
[-- Attachment #1.2: 0001-IB-core-qib-Remove-WARN-that-is-not-kernel-bug.patch --]
[-- Type: text/x-diff, Size: 3630 bytes --]
From 70f95b2d35aea42e5b97e7d27ab2f4e8effcbe67 Mon Sep 17 00:00:00 2001
From: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Mon, 21 Nov 2016 13:30:59 +0200
Subject: [PATCH rdma-next V2] IB/{core, qib}: Remove WARN that is not kernel bug
WARNINGs mean kernel bugs, in this case, they are placed
to mark programming errors and/or malicious attempts.
BUG/WARNs that are not kernel bugs hinder automated testing efforts.
Signed-off-by: Dmitry Vyukov <dvyukov-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
* Improved error prints.
---
drivers/infiniband/core/ucm.c | 5 ++++-
drivers/infiniband/core/ucma.c | 5 ++++-
drivers/infiniband/core/uverbs_main.c | 5 ++++-
drivers/infiniband/hw/qib/qib_file_ops.c | 5 ++++-
4 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 7713ef0..579f9a7 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1104,8 +1104,11 @@ static ssize_t ib_ucm_write(struct file *filp, const char __user *buf,
struct ib_ucm_cmd_hdr hdr;
ssize_t result;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("ucm_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (len < sizeof(hdr))
return -EINVAL;
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 9520154..e12f8fa 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1584,8 +1584,11 @@ static ssize_t ucma_write(struct file *filp, const char __user *buf,
struct rdma_ucm_cmd_hdr hdr;
ssize_t ret;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("ucma_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (len < sizeof(hdr))
return -EINVAL;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 44b1104..38c79ad 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -746,8 +746,11 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
int srcu_key;
ssize_t ret;
- if (WARN_ON_ONCE(!ib_safe_file_access(filp)))
+ if (!ib_safe_file_access(filp)) {
+ pr_err_once("uverbs_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (count < sizeof hdr)
return -EINVAL;
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index 382466a..2d1eacf 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -2066,8 +2066,11 @@ static ssize_t qib_write(struct file *fp, const char __user *data,
ssize_t ret = 0;
void *dest;
- if (WARN_ON_ONCE(!ib_safe_file_access(fp)))
+ if (!ib_safe_file_access(fp)) {
+ pr_err_once("qib_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
+ task_tgid_vnr(current), current->comm);
return -EACCES;
+ }
if (count < sizeof(cmd.type)) {
ret = -EINVAL;
--
2.7.4
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply related
* Re: [PATCH v5 2/9] IB/core: Replace semaphore sm_sem with an atomic wait
From: Linus Torvalds @ 2016-11-21 17:57 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Binoy Jayan, linux-rdma@vger.kernel.org, Mustafa Ismail, Lijun Ou,
Nicholas Bellinger, Leon Romanovsky, target-devel,
Tatyana E Nikolova, Doug Ledford, Jenny Derzhavetz, Sagi Grimberg,
Sean Hefty, Faisal Latif, Hal Rosenstock,
Linux Kernel Mailing List, Wei Hu(Xavier), Mark Brown, Mark Bloch,
Steve
In-Reply-To: <4374986.OFD8bCgUa1@wuerfel>
On Mon, Nov 21, 2016 at 8:52 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> Version of the series had replaced the semaphore with a completion
> here, which worked correctly, but one reviewer suggested using
> the wait_event() instead since it's confusing to have a completion
> starting out in 'completed' state.
Quite frankly, in that case just keep it as a semaphore.
So we have a few cases:
- completions are *slow*.
They are designed to be entirely synchronous, exactly so that you
can wait for a completion to finish, and then immediately free the
memory that the completion is in.
- completions used fro mutual exclusion are *confusing*.
Yes, they end up working as a counting semaphore, and yes, that's
by design, but at the same time, they are not _meant_ to be used as a
semaphore. The name matters. They are meant to be used the way they
are named: to be a "I'm done now" event. Don't use them for anything
else, because you *will* be confusing everybody else.
- open-coding wait-queues are really fragile and are *not* meant for exclusion.
They don't have lockdep checking, but more importantly, people
always invariably get the memory ordering wrong. And by "wrong" I mean
both "doesn't work" (because of insufficient memory ordering) and
"slow" (due to excessive memory ordering).
- mutexes are good for exclusion.
mutexes are both high-performance and non-confusing. Yes, they get
lockdep warnings, but that's usually a *good* thing. If you get
lockdep warnings from mutex use, you're almost certainly doing
something iffy.
mutexes do have a subtle downside: exactly because they are fast,
they are not entirely synchronous. You can't use them for completion
style notifications if you are going to release the memory they are
allocated in.
So mutexes need to be either statically allocated, or in
reference-counted allocations. That's so that the lifetime of the
memory is guaranteed to be cover all the users (including the wakers).
- semaphores are "old-fashioned mutexes". A mutex is better than a
semaphore, but a semaphore is better than just about all the other
alternatives. There's nothing _wrong_ with using a semaphore per se.
In this case, either use a semaphore or a mutex. If you are doing
mutual exclusion, those are really the only two acceptable sleeping
models.
Linus
^ permalink raw reply
* [PATCH 0/3] Address three Coverity complaints about RDMA code
From: Bart Van Assche @ 2016-11-21 18:20 UTC (permalink / raw)
To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Hello Doug,
The three patches in this series address warnings reported by Coverity
on the RDMA code. Please consider these patches for kernel version v4.10.
Thanks,
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH 1/3] IB/mad: Fix an array index check
From: Bart Van Assche @ 2016-11-21 18:21 UTC (permalink / raw)
To: Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty
In-Reply-To: <94eb11fc-b558-b994-c933-da784caefc53-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
(80) elements. Hence compare the array index with that value instead
of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
reports the following:
Overrunning array class->method_table of 80 8-byte elements at element index 127 (byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class) (which evaluates to 127).
Fixes: commit b7ab0b19a85f ("IB/mad: Verify mgmt class in received MADs")
Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
---
drivers/infiniband/core/mad.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 40cbd6b..2395fe2 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1746,7 +1746,7 @@ find_mad_agent(struct ib_mad_port_private *port_priv,
if (!class)
goto out;
if (convert_mgmt_class(mad_hdr->mgmt_class) >=
- IB_MGMT_MAX_METHODS)
+ ARRAY_SIZE(class->method_table))
goto out;
method = class->method_table[convert_mgmt_class(
mad_hdr->mgmt_class)];
--
2.10.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 2/3] IPoIB: Avoid reading an uninitialized member variable
From: Bart Van Assche @ 2016-11-21 18:21 UTC (permalink / raw)
To: Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Erez Shitrit
In-Reply-To: <94eb11fc-b558-b994-c933-da784caefc53-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
This patch avoids that Coverity reports the following:
Using uninitialized value port_attr.state when calling printk
Fixes: commit 94232d9ce817 ("IPoIB: Start multicast join process only on active ports")
Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
---
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 1909dd2..fddff40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -575,8 +575,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
return;
- if (ib_query_port(priv->ca, priv->port, &port_attr) ||
- port_attr.state != IB_PORT_ACTIVE) {
+ if (ib_query_port(priv->ca, priv->port, &port_attr)) {
+ ipoib_dbg(priv, "ib_query_port() failed\n");
+ return;
+ }
+ if (port_attr.state != IB_PORT_ACTIVE) {
ipoib_dbg(priv, "port state is not ACTIVE (state = %d) suspending join task\n",
port_attr.state);
return;
--
2.10.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH 3/3] IB/multicast: Check ib_find_pkey() return value
From: Bart Van Assche @ 2016-11-21 18:22 UTC (permalink / raw)
To: Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty
In-Reply-To: <94eb11fc-b558-b994-c933-da784caefc53-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
This patch avoids that Coverity complains about not checking the
ib_find_pkey() return value.
Fixes: commit 547af76521b3 ("IB/multicast: Report errors on multicast groups if P_key changes")
Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
---
drivers/infiniband/core/multicast.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index e51b739..322cb67 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -518,8 +518,11 @@ static void join_handler(int status, struct ib_sa_mcmember_rec *rec,
process_join_error(group, status);
else {
int mgids_changed, is_mgid0;
- ib_find_pkey(group->port->dev->device, group->port->port_num,
- be16_to_cpu(rec->pkey), &pkey_index);
+
+ if (ib_find_pkey(group->port->dev->device,
+ group->port->port_num, be16_to_cpu(rec->pkey),
+ &pkey_index))
+ pkey_index = MCAST_INVALID_PKEY_INDEX;
spin_lock_irq(&group->port->lock);
if (group->state == MCAST_BUSY &&
--
2.10.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [PATCH 1/3] IB/mad: Fix an array index check
From: Hal Rosenstock @ 2016-11-21 20:17 UTC (permalink / raw)
To: Bart Van Assche, Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty
In-Reply-To: <396b3fe2-6a84-efed-07de-3e6381009ad1-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
On 11/21/2016 1:21 PM, Bart Van Assche wrote:
> The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
> (80) elements. Hence compare the array index with that value instead
> of with IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity
> reports the following:
>
> Overrunning array class->method_table of 80 8-byte elements at element index 127 (byte offset 1016) using index convert_mgmt_class(mad_hdr->mgmt_class) (which evaluates to 127).
>
> Fixes: commit b7ab0b19a85f ("IB/mad: Verify mgmt class in received MADs")
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Reviewed-by: Hal Rosenstock <hal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> ---
> drivers/infiniband/core/mad.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
> index 40cbd6b..2395fe2 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -1746,7 +1746,7 @@ find_mad_agent(struct ib_mad_port_private *port_priv,
> if (!class)
> goto out;
> if (convert_mgmt_class(mad_hdr->mgmt_class) >=
> - IB_MGMT_MAX_METHODS)
> + ARRAY_SIZE(class->method_table))
> goto out;
> method = class->method_table[convert_mgmt_class(
> mad_hdr->mgmt_class)];
>
I think there is also similar thing in missing check in ib_register_mad_agent where:
/*
* Make sure MAD registration (if supplied)
* is non overlapping with any existing ones
*/
if (mad_reg_req) {
mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class);
if (!is_vendor_class(mgmt_class)) {
class = port_priv->version[mad_reg_req->
mgmt_class_version].class;
if (class) {
method = class->method_table[mgmt_class];
so here the class' method_table is also accessed without checking mgmt_class for range violation, right ?
-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* RE: [PATCH for-next 03/11] IB/hns: Optimize the logic of allocating memory using APIs
From: Salil Mehta @ 2016-11-21 20:20 UTC (permalink / raw)
To: Leon Romanovsky
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Huwei (Xavier),
oulijun, mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linuxarm,
Zhangping (ZP)
In-Reply-To: <20161121171423.GA23083-2ukJVAZIZ/Y@public.gmane.org>
> -----Original Message-----
> From: netdev-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:netdev-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Leon Romanovsky
> Sent: Monday, November 21, 2016 5:14 PM
> To: Salil Mehta
> Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; Huwei (Xavier); oulijun;
> mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Linuxarm;
> Zhangping (ZP)
> Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> allocating memory using APIs
>
> On Mon, Nov 21, 2016 at 04:12:38PM +0000, Salil Mehta wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org]
> > > Sent: Wednesday, November 16, 2016 8:36 AM
> > > To: Salil Mehta
> > > Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; Huwei (Xavier); oulijun;
> > > mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> > > netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Linuxarm;
> > > Zhangping (ZP)
> > > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic of
> > > allocating memory using APIs
> > >
> > > On Tue, Nov 15, 2016 at 03:52:46PM +0000, Salil Mehta wrote:
> > > > > -----Original Message-----
> > > > > From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org]
> > > > > Sent: Wednesday, November 09, 2016 7:22 AM
> > > > > To: Salil Mehta
> > > > > Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; Huwei (Xavier); oulijun;
> > > > > mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> > > > > netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Linuxarm;
> > > > > Zhangping (ZP)
> > > > > Subject: Re: [PATCH for-next 03/11] IB/hns: Optimize the logic
> of
> > > > > allocating memory using APIs
> > > > >
> > > > > On Fri, Nov 04, 2016 at 04:36:25PM +0000, Salil Mehta wrote:
> > > > > > From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > > > >
> > > > > > This patch modified the logic of allocating memory using APIs
> in
> > > > > > hns RoCE driver. We used kcalloc instead of kmalloc_array and
> > > > > > bitmap_zero. And When kcalloc failed, call vzalloc to alloc
> > > > > > memory.
> > > > > >
> > > > > > Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > > > > Signed-off-by: Ping Zhang <zhangping5-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > > > > Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> > > > > > ---
> > > > > > drivers/infiniband/hw/hns/hns_roce_mr.c | 15 ++++++++-----
> --
> > > > > > 1 file changed, 8 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > > index fb87883..d3dfb5f 100644
> > > > > > --- a/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > > +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
> > > > > > @@ -137,11 +137,12 @@ static int hns_roce_buddy_init(struct
> > > > > hns_roce_buddy *buddy, int max_order)
> > > > > >
> > > > > > for (i = 0; i <= buddy->max_order; ++i) {
> > > > > > s = BITS_TO_LONGS(1 << (buddy->max_order - i));
> > > > > > - buddy->bits[i] = kmalloc_array(s, sizeof(long),
> > > > > GFP_KERNEL);
> > > > > > - if (!buddy->bits[i])
> > > > > > - goto err_out_free;
> > > > > > -
> > > > > > - bitmap_zero(buddy->bits[i], 1 << (buddy->max_order -
> > > i));
> > > > > > + buddy->bits[i] = kcalloc(s, sizeof(long),
> > > GFP_KERNEL);
> > > > > > + if (!buddy->bits[i]) {
> > > > > > + buddy->bits[i] = vzalloc(s * sizeof(long));
> > > > >
> > > > > I wonder, why don't you use directly vzalloc instead of kcalloc
> > > > > fallback?
> > > > As we know we will have physical contiguous pages if the kcalloc
> > > > call succeeds. This will give us a chance to have better
> performance
> > > > over the allocations which are just virtually contiguous through
> the
> > > > function vzalloc(). Therefore, later has only been used as a
> fallback
> > > > when our memory request cannot be entertained through kcalloc.
> > > >
> > > > Are you suggesting that there will not be much performance
> penalty
> > > > if we use just vzalloc ?
> > >
> > > Not exactly,
> > > I asked it, because we have similar code in our drivers and this
> > > construction looks strange to me.
> > >
> > > 1. If performance is critical, we will use kmalloc.
> > > 2. If performance is not critical, we will use vmalloc.
> > >
> > > But in this case, such construction shows me that we can live with
> > > vmalloc performance and kmalloc allocation are not really needed.
> > >
> > > In your specific case, I'm not sure that kcalloc will ever fail.
> > Performance is definitely critical here. Though, I agree this is bit
> > unusual way of memory allocation. In actual, we were encountering
> > memory alloc failures using kmalloc (if you see allocation amount
> > is on the higher side and is exponential) so we ended up using
> > vmalloc as fall back - It is very naïve allocation scheme.
>
> I understand it, we did the same, see our mlx5_vzalloc call.
> BTW, we used __GFP_NOWARN flag, which you should consider to use
> in your case too.
Ok. Will add this flag and refloat patch V3.
Thanks
>
> >
> > Maybe we need to rethink this allocation scheme part? Also, I can
> pull
> > back this particular patch for now or just live with vzalloc() till
> > we figure out proper solution to this?
>
> It is up to you, I don't think that you should drop it, AFAIK, there is
> no other proper solution.
Ok we will live with it for now and later maybe we can see how we can optimize
pre-allocation of physically contiguous memory.
Thanks for your suggestions!
Salil
>
> >
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > >
> > > > > > + if (!buddy->bits[i])
> > > > > > + goto err_out_free;
> > > > > > + }
> > > > > > }
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Enabling peer to peer device transactions for PCIe devices
From: Deucher, Alexander @ 2016-11-21 20:36 UTC (permalink / raw)
To: 'linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org',
'linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org',
'linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org',
'Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org',
'dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org',
'linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org'
Cc: Bridgman, John, Kuehling, Felix, Sagalovitch, Serguei,
Blinzer, Paul, Koenig, Christian, Suthikulpanit, Suravee,
Sander, Ben
This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices
Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with Linux memory sub-system by created corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
Previously proposed solutions and related proposals:
1.P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.
2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pro: Doesn't waste system memory for ZONE metadata
Cons: CPU access to ZONE metadata slow, may be lost, corrupted on device reset.
3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.
4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)
6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?
Alex
^ permalink raw reply
* Re: [PATCH 1/3] IB/mad: Fix an array index check
From: Bart Van Assche @ 2016-11-21 20:52 UTC (permalink / raw)
To: Hal Rosenstock, Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty
In-Reply-To: <d5924c69-c052-84d3-06e3-b74ed8eb58d3-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
On 11/21/2016 12:17 PM, Hal Rosenstock wrote:
> I think there is also similar thing in missing check in ib_register_mad_agent where:
>
> /*
> * Make sure MAD registration (if supplied)
> * is non overlapping with any existing ones
> */
> if (mad_reg_req) {
> mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class);
> if (!is_vendor_class(mgmt_class)) {
> class = port_priv->version[mad_reg_req->
> mgmt_class_version].class;
> if (class) {
> method = class->method_table[mgmt_class];
>
> so here the class' method_table is also accessed without checking mgmt_class for range violation, right ?
Hello Hal,
I think such a check is already present in ib_register_mad_agent():
if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) {
/*
* IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only
* one in this range currently allowed
*/
if (mad_reg_req->mgmt_class !=
IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) {
dev_notice(&device->dev,
"%s: Invalid Mgmt Class 0x%x\n",
__func__, mad_reg_req->mgmt_class);
goto error1;
}
} [ ... ]
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 1/3] IB/mad: Fix an array index check
From: Hal Rosenstock @ 2016-11-21 21:17 UTC (permalink / raw)
To: Bart Van Assche, Doug Ledford
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Sean Hefty
In-Reply-To: <ab9e8563-160b-a9de-08b3-b09bdd52fbad-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Hi Bart,
On 11/21/2016 3:52 PM, Bart Van Assche wrote:
> Hello Hal,
>
> I think such a check is already present in ib_register_mad_agent():
>
> if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) {
You're right.
-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-21 21:30 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma, netdev, Dennis Dalessandro
In-Reply-To: <20161119190445.GG22775@obsidianresearch.com>
On Sat, Nov 19, 2016 at 12:04:45PM -0700, Jason Gunthorpe wrote:
>On Fri, Nov 18, 2016 at 02:42:10PM -0800, Vishwanathapura, Niranjana wrote:
>> +HFI-VNIC DRIVER
>> +M: Dennis Dalessandro <dennis.dalessandro@intel.com>
>> +M: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> +L: linux-rdma@vger.kernel.org
>> +S: Supported
>> +F: drivers/infiniband/sw/intel/vnic
>
>This is either a net driver or a ULP, no idea why it should go in this
>directory!?
>
>It sounds like an ethernet driver, so you should probably put it
>there...
>
The hfi_vnic is an Ethernet driver. It is similar to ULP like ipoib, but
instead it is Ethernet over Omni-path here.
The VNIC Ethernet (hfi_vnic) driver encapsulates Ethernet packets in an
Omni-path header.
The hfi_vnic Ethernet driver do not access the HW. It interfaces with HFI1
driver which sends/receives Omni-Path encapsulated Ethernet frames from HW.
Also, the VNIC control path driver (VEMA) is an IB MAD agent which should be
under drivers/infiniband/.. .
Putting the VNIC Ethernet driver and the VNIC control driver together under a
single module (hfi_vnic.ko) provided a simpler interface between them.
So, we have put the driver under drivers/infiniband/sw/intel for two reasons:
a) We have VNIC control driver (VEMA) which is an IB mad agent.
b) hfi_vnic Ethernet driver is dependent on HFI1 driver for sending/receving
Omni-path encapsulated Ethernet packets from HW.
>> +/* hfi_vnic_bus_init - initialize the hfi vnic bus drvier */
>> +static int hfi_vnic_bus_init(void)
>> +{
>> + int rc;
>> +
>> + ida_init(&hfi_vnic_ctrl_ida);
>> + idr_init(&hfi_vnic_idr);
>> +
>> + rc = bus_register(&hfi_vnic_bus);
>
>Why on earth do we need this? Didn't I give you enough grief for the
>psm stuff and now you want to create an entire subystem hidden away!?
>
>Use some netlink scheme to control your vnic like the rest of the net
>stack..
>
The hfi_vnic_bus is only abstracting the HW independent functionality (like
Ethernet interface, encapsulation, IB MAD interface etc) with the HW dependent
functionality (sending/receiving packets on the wire).
Thus providing a cleaner interface between HW independent hfi_vnic Ethernet and
Control drivers and the HW dependent HFI1 driver.
There is no other User interface here other than the standard Ethernet
interface through network stack.
HFI1 driver creates VNIC devices on the hfi_vnic_bus as below and the hfi_vnic
Ethernet and Control drivers drive them.
#ls /sys/bus/hfi_vnic_bus/devices/
hfi_vnic_ctrl_00 /* control device for HFI instance 0 */
hfi_vnic_00.01.00 /* first VNIC port on HFI instance 0, port 1 */
hfi_vnic_00.01.01 /* second VNIC port on HFI instance 0, port 1 */
The design is as shown in the below diagram.
+-------------------+ +----------------------+
| | | Linux |
| IB MAD | | Network |
| | | Stack |
+-------------------+ +----------------------+
| |
| |
+--------------------------------------------+
| |
| HFI VNIC Module |
| (HFI VNIC Netdev and EMA drivers) |
| (HW independent) |
+--------------------------------------------+
|
|
+--------------------------------------------+
| HFI VNIC Bus |
+--------------------------------------------+
|
|
+--------------------------------------------+
| |
| HFI1 Driver with VNIC support |
| (HW dependent) |
+--------------------------------------------+
Niranjana
>Jason
^ permalink raw reply
* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-21 21:39 UTC (permalink / raw)
To: Vishwanathapura, Niranjana
Cc: Doug Ledford, linux-rdma, netdev, Dennis Dalessandro
In-Reply-To: <20161121213017.GB67872@knc-06.sc.intel.com>
On Mon, Nov 21, 2016 at 01:30:17PM -0800, Vishwanathapura, Niranjana wrote:
> On Sat, Nov 19, 2016 at 12:04:45PM -0700, Jason Gunthorpe wrote:
> >On Fri, Nov 18, 2016 at 02:42:10PM -0800, Vishwanathapura, Niranjana wrote:
> >>+HFI-VNIC DRIVER
> >>+M: Dennis Dalessandro <dennis.dalessandro@intel.com>
> >>+M: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> >>+L: linux-rdma@vger.kernel.org
> >>+S: Supported
> >>+F: drivers/infiniband/sw/intel/vnic
> >
> >This is either a net driver or a ULP, no idea why it should go in this
> >directory!?
> >
> >It sounds like an ethernet driver, so you should probably put it
> >there...
> >
>
> The hfi_vnic is an Ethernet driver. It is similar to ULP like ipoib, but
> instead it is Ethernet over Omni-path here.
> The VNIC Ethernet (hfi_vnic) driver encapsulates Ethernet packets in an
> Omni-path header.
> The hfi_vnic Ethernet driver do not access the HW. It interfaces with HFI1
> driver which sends/receives Omni-Path encapsulated Ethernet frames from HW.
> Also, the VNIC control path driver (VEMA) is an IB MAD agent which should be
> under drivers/infiniband/.. .
> Putting the VNIC Ethernet driver and the VNIC control driver together under
> a single module (hfi_vnic.ko) provided a simpler interface between them.
>
> So, we have put the driver under drivers/infiniband/sw/intel for two reasons:
> a) We have VNIC control driver (VEMA) which is an IB mad agent.
> b) hfi_vnic Ethernet driver is dependent on HFI1 driver for sending/receving
> Omni-path encapsulated Ethernet packets from HW.
Sounds like this driver belongs under net/ someplace to me.
NAK on drivers/infiniband/sw/ at least - that dir is only for HCA
drivers.
> >>+/* hfi_vnic_bus_init - initialize the hfi vnic bus drvier */
> >>+static int hfi_vnic_bus_init(void)
> >>+{
> >>+ int rc;
> >>+
> >>+ ida_init(&hfi_vnic_ctrl_ida);
> >>+ idr_init(&hfi_vnic_idr);
> >>+
> >>+ rc = bus_register(&hfi_vnic_bus);
> >
> >Why on earth do we need this? Didn't I give you enough grief for the
> >psm stuff and now you want to create an entire subystem hidden away!?
> >
> >Use some netlink scheme to control your vnic like the rest of the net
> >stack..
> >
>
> The hfi_vnic_bus is only abstracting the HW independent functionality (like
> Ethernet interface, encapsulation, IB MAD interface etc) with the HW
> dependent functionality (sending/receiving packets on the wire).
> Thus providing a cleaner interface between HW independent hfi_vnic Ethernet
> and Control drivers and the HW dependent HFI1 driver.
That doesn't explain anything, sound like you don't need it so get rid
of it.
> There is no other User interface here other than the standard Ethernet
> interface through network stack.
Good, then this isn't needed, because it doesn't provide a user interface.
> #ls /sys/bus/hfi_vnic_bus/devices/
> hfi_vnic_ctrl_00 /* control device for HFI instance 0 */
> hfi_vnic_00.01.00 /* first VNIC port on HFI instance 0, port 1 */
Jason
^ permalink raw reply
* [PATCH 0/5] IB/srp patches for kernel v4.10
From: Bart Van Assche @ 2016-11-21 21:56 UTC (permalink / raw)
To: Doug Ledford
Cc: Max Gurtovoy, Christoph Hellwig,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Hello Doug,
The patches in this series are what I came up with while testing the
v4.9-rc5 IB/srp driver. Please consider these patches for inclusion in
kernel v4.10. The individual patches in this series are:
0001-IB-srp-Fix-CONFIG_DYNAMIC_DEBUG-n-build.patch
0002-IB-srp-Introduce-a-local-variable-in-srp_add_one.patch
0003-IB-srp-Make-login-failures-easier-to-debug.patch
0004-IB-srp-Make-mapping-failures-easier-to-debug.patch
0005-IB-srp-Make-writing-into-the-add_target-sysfs-attrib.patch
Thanks,
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox