Netdev List


* Re: [PATCH for-next 0/6] IB/hns: Bug Fixes for HNS RoCE Driver
From: Doug Ledford @ 2016-12-12 22:09 UTC (permalink / raw)
  To: Salil Mehta
  Cc: xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk, lijun_nudt,
	linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>



On 11/29/2016 6:10 PM, Salil Mehta wrote:
> This patch-set contains bug fixes for the HNS RoCE driver.
> 
> Lijun Ou (1):
>   IB/hns: Fix the IB device name
> 
> Shaobo Xu (2):
>   IB/hns: Fix the bug when free mr
>   IB/hns: Fix the bug when free cq
> 
> Wei Hu (Xavier) (3):
>   IB/hns: Fix the bug when destroy qp
>   IB/hns: Fix the bug of setting port mtu
>   IB/hns: Delete the redundant memset operation
> 
>  drivers/infiniband/hw/hns/hns_roce_cmd.h    |    5 -
>  drivers/infiniband/hw/hns/hns_roce_common.h |   42 ++
>  drivers/infiniband/hw/hns/hns_roce_cq.c     |   27 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h |   18 +
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  967 ++++++++++++++++++++++++---
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   57 ++
>  drivers/infiniband/hw/hns/hns_roce_main.c   |   26 +-
>  drivers/infiniband/hw/hns/hns_roce_mr.c     |   21 +-
>  8 files changed, 1026 insertions(+), 137 deletions(-)
> 

Series applied, thanks.

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD



* Re: [PATCH V3 for-next 00/11] Code improvements & fixes for HNS RoCE driver
From: Doug Ledford @ 2016-12-12 22:09 UTC (permalink / raw)
  To: Salil Mehta
  Cc: xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk, lijun_nudt,
	linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161123194109.420760-1-salil.mehta@huawei.com>



On 11/23/2016 2:40 PM, Salil Mehta wrote:
> This patchset introduces some code improvements and fixes
> for the identified problems in the HNS RoCE driver.
> 
> Lijun Ou (4):
>   IB/hns: Add the interface for querying QP1
>   IB/hns: add self loopback for CM
>   IB/hns: Modify the condition of notifying hardware loopback
>   IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp()
> 
> Salil Mehta (1):
>   IB/hns: Fix for Checkpatch.pl comment style errors
> 
> Shaobo Xu (1):
>   IB/hns: Implement the add_gid/del_gid and optimize the GIDs
>     management
> 
> Wei Hu (Xavier) (5):
>   IB/hns: Add code for refreshing CQ CI using TPTR
>   IB/hns: Optimize the logic of allocating memory using APIs
>   IB/hns: Modify the macro for the timeout when cmd process
>   IB/hns: Modify query info named port_num when querying RC QP
>   IB/hns: Change qpn allocation to round-robin mode.
> 
>  drivers/infiniband/hw/hns/hns_roce_alloc.c  |   11 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.c    |    8 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.h    |    7 +-
>  drivers/infiniband/hw/hns/hns_roce_common.h |    2 -
>  drivers/infiniband/hw/hns/hns_roce_cq.c     |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h |   45 ++--
>  drivers/infiniband/hw/hns/hns_roce_eq.c     |    6 +-
>  drivers/infiniband/hw/hns/hns_roce_hem.c    |    6 +-
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  267 +++++++++++++++++------
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_main.c   |  311 +++++++--------------------
>  drivers/infiniband/hw/hns/hns_roce_mr.c     |   22 +-
>  drivers/infiniband/hw/hns/hns_roce_pd.c     |    5 +-
>  drivers/infiniband/hw/hns/hns_roce_qp.c     |    2 +-
>  14 files changed, 364 insertions(+), 362 deletions(-)
> 

Series applied, thanks.

-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD



* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-12 21:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers
In-Reply-To: <3c022731-e703-34ac-55f1-60f5b94b6d62@stressinduktion.org>

On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 12.12.2016 19:05, Josef Bacik wrote:
>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
>>  wrote:
>>>  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>> 
>>>> 
>>>>   Hmm... Does your ephemeral port range include the port your load
>>>>   balancing app is using ?
>>> 
>>>  I suspect that you might have processes doing bind( port = 0) that are
>>>  trapped into the bind_conflict() scan ?
>>> 
>>>  With 100,000 + timewaits there, this possibly hurts.
>>> 
>>>  Can you try the following loop breaker ?
>> 
>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>  operation.  I tested this patch and it made no difference.  I'm going to
>>  test simply restarting the app without changing to the SO_REUSEPORT
>>  option.  Thanks,
> 
> Would it be possible to trace the time the function uses with trace? If
> we don't see the number growing considerably over time we probably can
> rule out that we loop somewhere in there (I would instrument
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
> 
> __inet_hash_connect -> __inet_check_established also takes a lock
> (inet_ehash_lockp) which can be locked from inet_diag code path during
> socket diag info dumping.
> 
> Unfortunately we couldn't reproduce it so far. :/

Working on getting the timing info, will probably be tomorrow due to 
meetings.  I did test simply restarting the app without changing to the 
config that enabled the use of SO_REUSEPORT and the problem didn't 
occur, so it definitely has something to do with SO_REUSEPORT.  Thanks,

Josef


* Re: Soft lockup in tc_classify
From: Or Gerlitz @ 2016-12-12 21:18 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang
  Cc: Shahar Klein, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion
In-Reply-To: <584EA60B.80803@iogearbox.net>

On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:

> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation is involved during that test?

What's missing to have the deletion race fixed? Making a patch, or
testing a patch that was already sent?


* Re: [RFC PATCH net-next v3 1/2] macb: Add 1588 support in Cadence GEM.
From: Richard Cochran @ 2016-12-12 21:09 UTC (permalink / raw)
  To: Andrei.Pistirica
  Cc: harini.katakam, rafalo, netdev, linux-kernel, linux-arm-kernel,
	davem, nicolas.ferre, harinikatakamlinux, punnaia, michals,
	anirudh, boris.brezillon, alexandre.belloni, tbultel
In-Reply-To: <07C910AB6AC6C345A093D5A08F5AF568CB74D84D@CHN-SV-EXMX03.mchp-main.com>

On Mon, Dec 12, 2016 at 10:22:43AM +0000, Andrei.Pistirica@microchip.com wrote:
> Richard, do you agree with this?

Yes, but please trim your replies next time.  Scrolling through pages
of quoted headers and stale content in order to read one line is very
annoying.

Thanks,
Richard


* Re: [PATCH V2 10/22] bnxt_re: Support for CQ verbs
From: Jonathan Toppins @ 2016-12-12 21:03 UTC (permalink / raw)
  To: Selvin Xavier, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Eddie Wai, Devesh Sharma,
	Somnath Kotur, Sriharsha Basavapatna
In-Reply-To: <1481266096-23331-11-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

On 12/09/2016 01:48 AM, Selvin Xavier wrote:
> This patch implements support for create_cq, destroy_cq and req_notify_cq
> verbs.
> 
> Signed-off-by: Eddie Wai <eddie.wai-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Somnath Kotur <somnath.kotur-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c    | 183 ++++++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h    |  47 ++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 154 ++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h |  19 +++
>  drivers/infiniband/hw/bnxtre/bnxt_re_main.c     |   4 +
>  include/uapi/rdma/bnxt_re_uverbs_abi.h          |  11 ++
>  6 files changed, 418 insertions(+)

Something I just realized is that this patch series does not modify the
MAINTAINERS file. Who from Broadcom will be maintaining this driver?
You probably want to include this info in the v3 series.

[...]

> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> index 3417829..f316598 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> @@ -60,6 +60,16 @@
>  #include "bnxt_re_ib_verbs.h"
>  #include <rdma/bnxt_re_uverbs_abi.h>
>  
> +static int bnxt_re_copy_to_udata(struct bnxt_re_dev *rdev, void *data, int len,
> +				 struct ib_udata *udata)
> +{
> +	int rc;
> +
> +	rc = ib_copy_to_udata(udata, data, len);
> +
> +	return rc ? -EFAULT : 0;
> +}

This function seems to provide no value by wrapping ib_copy_to_udata,
any reason to keep it? From the two call sites for this function it
appears it can be replaced with a direct call to ib_copy_to_udata.


* Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
From: Ozgur Karatas @ 2016-12-12 20:24 UTC (permalink / raw)
  To: Leon Romanovsky, Tariq Toukan; +Cc: yishaih@mellanox.com, netdev, linux-kernel
In-Reply-To: <20161212181838.GB8204@mtr-leonro.local>



12.12.2016, 20:18, "Leon Romanovsky" <leon@kernel.org>:
> On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
>>  Dear Romanovsky;
>
> Please avoid top-posting in your replies.
> Thanks

Dear Leon,
thanks for the information. I will pay attention.

>>  I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call "WARN_ON" function, get a error to implicit declaration, right?
>>
>>  Because, you will use to "BUG_ON" get a error implicit declaration of functions.
>
> I'm not sure that I followed you. mem->offset is set by sg_set_buf from
> the buf variable returned by dma_alloc_coherent(). HW needs to get a very
> precise size of this buf, in multiples of pages and aligned to page
> boundaries.

I have studied your code below and I think that is the right patch.
You are the expert in this matter; thank you for correcting me.

I will take your style as an example to learn from.

Regards,

Ozgur Karatas

> See the patch inline which removes this BUG_ON in proper and safe way.
>
> From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
> From: Leon Romanovsky <leonro@mellanox.com>
> Date: Mon, 12 Dec 2016 20:02:45 +0200
> Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine
>
> This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
> by checking DMA address alignment in advance and performing proper
> folding in case of error.
>
> Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
> Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 2a9dd46..e1f9e7c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem,
>          if (!buf)
>                  return -ENOMEM;
>
> +        if (offset_in_page(buf)) {
> +                dma_free_coherent(dev, PAGE_SIZE << order,
> +                                  buf, sg_dma_address(mem));
> +                return -ENOMEM;
> +        }
> +
>          sg_set_buf(mem, buf, PAGE_SIZE << order);
> -        BUG_ON(mem->offset);
>          sg_dma_len(mem) = PAGE_SIZE << order;
>          return 0;
>  }
> --
> 2.10.2
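
For readers unfamiliar with the check in the patch above, here is a minimal
user-space sketch in C of what a page-offset test computes. It is not the
kernel implementation: model_offset_in_page() and model_chunk_bytes() are
hypothetical names, and the 4 KiB page size is an assumption made only for
this illustration. A non-zero result means the buffer does not start on a
page boundary, which is the case the patch rejects with -ENOMEM.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* User-space stand-in for offset_in_page(): byte offset within the page. */
static unsigned long model_offset_in_page(const void *p)
{
	/* The mask trick only works for power-of-two page sizes. */
	assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0);
	return (unsigned long)((uintptr_t)p & (MODEL_PAGE_SIZE - 1));
}

/* Size of an allocation of 2^order pages, i.e. PAGE_SIZE << order. */
static unsigned long model_chunk_bytes(unsigned int order)
{
	return MODEL_PAGE_SIZE << order;
}
```

With these helpers, an address such as 0x10000 yields offset 0 (aligned),
while 0x10234 yields 0x234 (not aligned).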


* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Paul Moore @ 2016-12-12 20:18 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: netdev, linux-kernel, linux-audit, edumazet, xiyou.wangcong,
	dvyukov
In-Reply-To: <5714bd7468cfec225407a6c367e658478d590495.1481534171.git.rgb@redhat.com>

On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a reference in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> <xiyou.wangcong@gmail.com> on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   34 ++++++++++++++++++++++++++++------
>  1 files changed, 28 insertions(+), 6 deletions(-)

My previous question about testing still stands, but I took a closer
look and have some additional comments, see below ...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
>         struct sk_buff *skb;
>
>         /* break the connection */
> +       sock_put(audit_sock);
>         audit_pid = 0;
> +       audit_nlk_portid = 0;
>         audit_sock = NULL;
>
>         /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
>         if (rc >= 0) {
>                 consume_skb(skb);
>                 rc = 0;
> +       } else {
> +               if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {

I dislike the way you wrote this because instead of simply looking at
this to see if it is correct I need to sort out all the bits and find out
if there are other error codes that could run afoul of this check ...
make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
Actually, since EPERM is 1, -EPERM (-1 in two's complement is
0xffffffff) is going to cause this to be true for pretty much any
value of rc, yes?
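
To make the point about the mask concrete, here is a small user-space
sketch in C. It is not kernel code: mask_check(), explicit_check() and
errno_mask_self_check() are hypothetical names that only mirror the two
forms of the test discussed above. It assumes Linux errno values (EPERM is
1) and a two's complement int, so that ORing -EPERM into the mask sets
every bit.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: the test as written in the patch under review. */
static int mask_check(int rc)
{
	return (rc & (-ENOMEM | -EPERM | -ECONNREFUSED)) != 0;
}

/* Hypothetical helper: the explicit comparison suggested in the review. */
static int explicit_check(int rc)
{
	return rc == -ENOMEM || rc == -EPERM || rc == -ECONNREFUSED;
}

/* Sanity checks: the mask form cannot tell error codes apart. */
static void errno_mask_self_check(void)
{
	/* -EPERM is -1, so the ORed mask has every bit set. */
	assert((-ENOMEM | -EPERM | -ECONNREFUSED) == -1);

	/* Any non-zero rc passes the mask test, including unrelated errors. */
	assert(mask_check(-EAGAIN));
	assert(mask_check(-EINVAL));

	/* The explicit form only matches the three listed codes. */
	assert(!explicit_check(-EAGAIN));
	assert(!explicit_check(-EINVAL));
	assert(explicit_check(-ENOMEM));
	assert(explicit_check(-EPERM));
	assert(explicit_check(-ECONNREFUSED));
}
```

Which codes belong in the explicit list is still open in this thread
(ENOMEM in particular), so the set used here is only a placeholder.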

> +                       mutex_lock(&audit_cmd_mutex);
> +                       auditd_reset();
> +                       mutex_unlock(&audit_cmd_mutex);
> +               }

The code in audit#next handles netlink_unicast() errors in
kauditd_thread() and you are adding error handling code here in
kauditd_send_unicast_skb() ... that's messy.  I don't care too much
where the auditd_reset() call is made, but let's only do it in one
function; FWIW, I originally put the error handling code in
kauditd_thread() because there was other error handling code that
needed to be done in that scope so it resulted in cleaner code.

Related, I see you are now considering ENOMEM to be a fatal condition,
that differs from the AUDITD_BAD macro in kauditd_thread(); this
difference needs to be reconciled.

Finally, you should update the comment header block for auditd_reset()
to say that it needs to be called with the audit_cmd_mutex held.

> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
>                                 return -EACCES;
>                         }
>                         if (audit_pid && new_pid &&
> -                           audit_replace(requesting_pid) != -ECONNREFUSED) {
> +                           (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {

Do we simply want to treat any error here as fatal, and not just
ECONN/EPERM/ENOMEM?  If not, let's come up with a single macro to
handle the fatal netlink_unicast() return codes so we have some chance
to keep things consistent in the future.

-- 
paul moore
www.paul-moore.com


* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Vivien Didelot @ 2016-12-12 20:03 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: Volodymyr Bendiuga, Volodymyr Bendiuga, netdev,
	Volodymyr Bendiuga
In-Reply-To: <20161212190915.GA8885@lunn.ch>

Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> Humm, it looks like we are doing the atu_get wrong. We are looking for
> a specific MAC address. Yet we seem to be walking the whole table to
> find it, rather than getting the hardware to do the search. 

We are not doing it wrong, the hardware does the search. A classic dump
of an ATU database consists of starting from the broadcast address
ff:ff:ff:ff:ff:ff and issuing GetNext operations until we get back to the
broadcast address. Only addresses in use are returned by GetNext, thus
dumping an empty database is completed in a single operation.

I implemented atu_get intentionally this way because it provides simpler
code, rather than doing arithmetic on MAC addresses (Unless I am unaware
of simple increment/decrement code.)

> The current code is:
>
> static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
>                              const u8 *addr, struct mv88e6xxx_atu_entry *entry)
> {
>         struct mv88e6xxx_atu_entry next;
>         int err;
>
>         eth_broadcast_addr(next.mac);
>
>         err = _mv88e6xxx_atu_mac_write(chip, next.mac);
>
> We should be setting next.mac to one less than the address we are
> looking for.
>
> Volodymyr, please could you try that, and see how much of a speed up
> you get.
>
> There is another optimization which can be made. We only say there is
> no such entry once we have reached the end of the table. But it will
> return the entries in ascending order. So if the entry it returned is
> bigger than what we are looking for, we can immediately abort the
> search and say it does not exist.

However your two suggestions to optimize the lookup are correct. It'd be
interesting to see if that makes a significant difference or not.

Thanks,

        Vivien


* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Andrew Lunn @ 2016-12-12 19:09 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Volodymyr Bendiuga, Vivien Didelot, Volodymyr Bendiuga, netdev,
	Volodymyr Bendiuga
In-Reply-To: <48ff1136-dd8f-7704-a512-c23b27989bf8@gmail.com>

On Mon, Dec 12, 2016 at 08:37:50AM -0800, Florian Fainelli wrote:
> On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> > Hi,
> > 
> > I apologise for the incorrectly formatted patch, I will fix and resend it.
> > The problem with the ATU right now is that it is too slow when inserting
> > entries. When the OS boots up, it might insert some multicast entries into
> > the ATU (if they are preconfigured by the user). I ran a test with 10 mc
> > entries being configured for each port (13 ports), and it took 15 seconds,
> > which made the system quite slow in responding to other commands while it
> > was inserting mc entries. The implementation with a hashtable made the
> > insert command for 13 ports and 10 entries per port take about 700 msec.

Humm, it looks like we are doing the atu_get wrong. We are looking for
a specific MAC address. Yet we seem to be walking the whole table to
find it, rather than getting the hardware to do the search. 

The current code is:

static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
                             const u8 *addr, struct mv88e6xxx_atu_entry *entry)
{
        struct mv88e6xxx_atu_entry next;
        int err;

        eth_broadcast_addr(next.mac);

        err = _mv88e6xxx_atu_mac_write(chip, next.mac);

We should be setting next.mac to one less than the address we are
looking for.

Volodymyr, please could you try that, and see how much of a speed up
you get.

There is another optimization which can be made. We only say there is
no such entry once we have reached the end of the table. But it will
return the entries in ascending order. So if the entry it returned is
bigger than what we are looking for, we can immediately abort the
search and say it does not exist.
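
The two suggestions above can be sketched in user-space C against a
software model of the table. This is only an illustration under stated
assumptions: struct atu_model, atu_model_get_next() and atu_model_lookup()
are hypothetical and do not touch any switch registers, and the model
assumes GetNext returns the lowest entry strictly above the address it is
given, with the broadcast address meaning "start from the beginning".
mac_dec() shows one way to compute "one less than the address".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

static const uint8_t bcast_mac[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Hypothetical software model of one ATU database: MACs in ascending order. */
struct atu_model {
	const uint8_t (*entries)[ETH_ALEN];
	size_t count;
};

/*
 * Subtract one from a MAC address treated as a 48-bit big-endian integer.
 * 00:00:00:00:00:00 wraps around to ff:ff:ff:ff:ff:ff.
 */
static void mac_dec(uint8_t mac[ETH_ALEN])
{
	int i;

	for (i = ETH_ALEN - 1; i >= 0; i--) {
		if (mac[i]-- != 0)
			break;	/* this byte was non-zero, so no borrow is needed */
	}
}

/*
 * Model of GetNext: copy the lowest entry strictly above 'mac' into 'out'
 * and return 1. Starting from the broadcast address means "start from the
 * lowest entry". Return 0 when the end of the table is reached.
 */
static int atu_model_get_next(const struct atu_model *atu,
			      const uint8_t mac[ETH_ALEN],
			      uint8_t out[ETH_ALEN])
{
	int from_start;
	size_t i;

	assert(atu != NULL);
	from_start = memcmp(mac, bcast_mac, ETH_ALEN) == 0;

	for (i = 0; i < atu->count; i++) {
		if (from_start || memcmp(atu->entries[i], mac, ETH_ALEN) > 0) {
			memcpy(out, atu->entries[i], ETH_ALEN);
			return 1;
		}
	}
	return 0;
}

/*
 * Lookup with a single GetNext: start one below the wanted address. The
 * entry that comes back is the lowest one >= addr, so anything other than
 * an exact match means the address is not in the table.
 */
static int atu_model_lookup(const struct atu_model *atu,
			    const uint8_t addr[ETH_ALEN])
{
	uint8_t start[ETH_ALEN];
	uint8_t next[ETH_ALEN];

	memcpy(start, addr, ETH_ALEN);
	mac_dec(start);

	if (!atu_model_get_next(atu, start, next))
		return 0;
	return memcmp(next, addr, ETH_ALEN) == 0;
}
```

In this model, starting one below the target makes the early abort fall out
for free: the first entry returned is either the target or something larger,
so one operation answers the question.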

   Andrew


* Re: Soft lockup in tc_classify
From: Cong Wang @ 2016-12-12 19:07 UTC (permalink / raw)
  To: Shahar Klein
  Cc: Daniel Borkmann, Linux Kernel Network Developers, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Or Gerlitz,
	Hadar Hen Zion
In-Reply-To: <1e715873-34ba-0a76-c94e-064ca4cf895b@mellanox.com>

On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein <shahark@mellanox.com> wrote:
>
>
> On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
>>
>> Hi Shahar,
>>
>> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>>>
>>> Hi All,
>>>
>>> sorry for the spam, the first time was sent with html part and was
>>> rejected.
>>>
>>> We observed an issue where a classifier instance next member is
>>> pointing back to itself, causing a CPU soft lockup.
>>> We found it by running traffic on many udp connections and then adding
>>> a new flower rule using tc.
>>>
>>> We added a quick workaround to verify it:
>>>
>>> In tc_classify:
>>>
>>>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>>                  int err;
>>> +               if (tp == tp->next)
>>> +                     RCU_INIT_POINTER(tp->next, NULL);
>>>
>>>
>>> We also had a print here showing tp->next is pointing to tp. With this
>>> workaround we are not hitting the issue anymore.
>>> We are not sure we fully understand the mechanism here - with the rtnl
>>> and rcu locks.
>>> We'll appreciate your help solving this issue.
>>
>>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation is involved during that test?

Hmm, I thought RCU_INIT_POINTER() respects readers, but it seems not?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.
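
As a plain illustration of the symptom reported in this thread (not of its
root cause), the following user-space C sketch shows why a node whose next
pointer refers to itself keeps a simple list walk from ever finishing.
struct chain_node, node_points_to_self() and walk_bounded() are
hypothetical names; there is no RCU or locking in this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a classifier instance; only 'next' matters here. */
struct chain_node {
	struct chain_node *next;
};

/* True when a node's next pointer refers back to the node itself. */
static int node_points_to_self(const struct chain_node *n)
{
	return n != NULL && n->next == n;
}

/*
 * Walk the chain the way a plain "for (; p; p = p->next)" loop would, but
 * give up after 'limit' nodes. Returns the number of nodes visited, or -1
 * if the limit was hit, which is what a self-referencing node produces.
 */
static int walk_bounded(const struct chain_node *head, int limit)
{
	const struct chain_node *p;
	int visited = 0;

	assert(limit >= 0);
	for (p = head; p != NULL; p = p->next) {
		if (visited == limit)
			return -1;
		visited++;
	}
	return visited;
}
```

Without the limit, the loop over a self-referencing node would spin forever,
which matches the soft lockup described in the report.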

>>
>> Do you have a script and kernel .config for reproducing this?
>
>
> I'm using a user space socket app (https://github.com/shahar-klein/noodle)
> on a vm to push udp packets from ~2000 different udp src ports ramping up
> at ~100 per second towards another vm on the same Hypervisor. Once the
> traffic starts I'm pushing ingress flower tc udp rules
> (even_udp_src_port->mirred, odd->drop) on the relevant representor in the
> Hypervisor.

Would you mind sharing your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, I just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?

Thanks!


* Re: Soft lockup in inet_put_port on 4.6
From: Hannes Frederic Sowa @ 2016-12-12 18:44 UTC (permalink / raw)
  To: Josef Bacik, Eric Dumazet; +Cc: Tom Herbert, Linux Kernel Network Developers
In-Reply-To: <1481565929.24490.0@smtp.office365.com>

On 12.12.2016 19:05, Josef Bacik wrote:
> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>
>>>
>>>  Hmm... Does your ephemeral port range include the port your load
>>>  balancing app is using ?
>>
>> I suspect that you might have processes doing bind( port = 0) that are
>> trapped into the bind_conflict() scan ?
>>
>> With 100,000 + timewaits there, this possibly hurts.
>>
>> Can you try the following loop breaker ?
> 
> It doesn't appear that the app is doing bind(port = 0) during normal
> operation.  I tested this patch and it made no difference.  I'm going to
> test simply restarting the app without changing to the SO_REUSEPORT
> option.  Thanks,

Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/

Thanks,
Hannes


* Re: [PATCH V2  13/22] bnxt_re: Support QP verbs
From: Leon Romanovsky @ 2016-12-12 18:27 UTC (permalink / raw)
  To: Selvin Xavier
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Eddie Wai, Devesh Sharma, Somnath Kotur, Sriharsha Basavapatna
In-Reply-To: <1481266096-23331-14-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>


On Thu, Dec 08, 2016 at 10:48:07PM -0800, Selvin Xavier wrote:
> This patch implements create_qp, destroy_qp, query_qp and modify_qp verbs.
>
> v2: Fixed sparse warnings
>
> Signed-off-by: Eddie Wai <eddie.wai-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Somnath Kotur <somnath.kotur-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c    | 873 ++++++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h    | 250 +++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re.h          |  14 +
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 762 +++++++++++++++++++++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h |  21 +
>  drivers/infiniband/hw/bnxtre/bnxt_re_main.c     |   6 +
>  include/uapi/rdma/bnxt_re_uverbs_abi.h          |  10 +
>  7 files changed, 1936 insertions(+)
>
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> index 636306f..edc9411 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> @@ -50,6 +50,69 @@
>  #include "bnxt_qplib_fp.h"
>
>  static void bnxt_qplib_arm_cq_enable(struct bnxt_qplib_cq *cq);
> +
> +static void bnxt_qplib_free_qp_hdr_buf(struct bnxt_qplib_res *res,
> +				       struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_q *rq = &qp->rq;
> +	struct bnxt_qplib_q *sq = &qp->sq;
> +
> +	if (qp->rq_hdr_buf)
> +		dma_free_coherent(&res->pdev->dev,
> +				  rq->hwq.max_elements * qp->rq_hdr_buf_size,
> +				  qp->rq_hdr_buf, qp->rq_hdr_buf_map);
> +	if (qp->sq_hdr_buf)
> +		dma_free_coherent(&res->pdev->dev,
> +				  sq->hwq.max_elements * qp->sq_hdr_buf_size,
> +				  qp->sq_hdr_buf, qp->sq_hdr_buf_map);
> +	qp->rq_hdr_buf = NULL;
> +	qp->sq_hdr_buf = NULL;
> +	qp->rq_hdr_buf_map = 0;
> +	qp->sq_hdr_buf_map = 0;
> +	qp->sq_hdr_buf_size = 0;
> +	qp->rq_hdr_buf_size = 0;
> +}
> +
> +static int bnxt_qplib_alloc_qp_hdr_buf(struct bnxt_qplib_res *res,
> +				       struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_q *rq = &qp->rq;
> +	struct bnxt_qplib_q *sq = &qp->rq;
> +	int rc = 0;
> +
> +	if (qp->sq_hdr_buf_size && sq->hwq.max_elements) {
> +		qp->sq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> +					sq->hwq.max_elements *
> +					qp->sq_hdr_buf_size,
> +					&qp->sq_hdr_buf_map, GFP_KERNEL);
> +		if (!qp->sq_hdr_buf) {
> +			rc = -ENOMEM;
> +			dev_err(&res->pdev->dev,
> +				"QPLIB: Failed to create sq_hdr_buf");
> +			goto fail;
> +		}
> +	}
> +
> +	if (qp->rq_hdr_buf_size && rq->hwq.max_elements) {
> +		qp->rq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> +						    rq->hwq.max_elements *
> +						    qp->rq_hdr_buf_size,
> +						    &qp->rq_hdr_buf_map,
> +						    GFP_KERNEL);
> +		if (!qp->rq_hdr_buf) {
> +			rc = -ENOMEM;
> +			dev_err(&res->pdev->dev,
> +				"QPLIB: Failed to create rq_hdr_buf");
> +			goto fail;
> +		}
> +	}
> +	return 0;
> +
> +fail:
> +	bnxt_qplib_free_qp_hdr_buf(res, qp);
> +	return rc;
> +}
> +
>  static void bnxt_qplib_service_nq(unsigned long data)
>  {
>  	struct bnxt_qplib_nq *nq = (struct bnxt_qplib_nq *)data;
> @@ -215,6 +278,816 @@ int bnxt_qplib_alloc_nq(struct pci_dev *pdev, struct bnxt_qplib_nq *nq)
>  	return 0;
>  }
>
> +/* QP */
> +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> +	struct cmdq_create_qp1 req;
> +	struct creq_create_qp1_resp *resp;
> +	struct bnxt_qplib_pbl *pbl;
> +	struct bnxt_qplib_q *sq = &qp->sq;
> +	struct bnxt_qplib_q *rq = &qp->rq;
> +	int rc;
> +	u16 cmd_flags = 0;
> +	u32 qp_flags = 0;
> +
> +	RCFW_CMD_PREP(req, CREATE_QP1, cmd_flags);
> +
> +	/* General */
> +	req.type = qp->type;
> +	req.dpi = cpu_to_le32(qp->dpi->dpi);
> +	req.qp_handle = cpu_to_le64(qp->qp_handle);
> +
> +	/* SQ */
> +	sq->hwq.max_elements = sq->max_wqe;
> +	rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, NULL, 0,
> +				       &sq->hwq.max_elements,
> +				       BNXT_QPLIB_MAX_SQE_ENTRY_SIZE, 0,
> +				       PAGE_SIZE, HWQ_TYPE_QUEUE);
> +	if (rc)
> +		goto exit;
> +
> +	sq->swq = kcalloc(sq->hwq.max_elements, sizeof(*sq->swq), GFP_KERNEL);
> +	if (!sq->swq) {
> +		rc = -ENOMEM;
> +		goto fail_sq;
> +	}
> +	pbl = &sq->hwq.pbl[PBL_LVL_0];
> +	req.sq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> +	req.sq_pg_size_sq_lvl =
> +		((sq->hwq.level & CMDQ_CREATE_QP1_SQ_LVL_MASK)
> +				<<  CMDQ_CREATE_QP1_SQ_LVL_SFT) |
> +		(pbl->pg_size == ROCE_PG_SIZE_4K ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_4K :
> +		 pbl->pg_size == ROCE_PG_SIZE_8K ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_8K :
> +		 pbl->pg_size == ROCE_PG_SIZE_64K ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_64K :
> +		 pbl->pg_size == ROCE_PG_SIZE_2M ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_2M :
> +		 pbl->pg_size == ROCE_PG_SIZE_8M ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_8M :
> +		 pbl->pg_size == ROCE_PG_SIZE_1G ?
> +				CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_1G :
> +		 CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_4K);
> +
> +	if (qp->scq)
> +		req.scq_cid = cpu_to_le32(qp->scq->id);
> +
> +	qp_flags |= CMDQ_CREATE_QP1_QP_FLAGS_RESERVED_LKEY_ENABLE;
> +
> +	/* RQ */
> +	if (rq->max_wqe) {
> +		rq->hwq.max_elements = qp->rq.max_wqe;
> +		rc = bnxt_qplib_alloc_init_hwq(res->pdev, &rq->hwq, NULL, 0,
> +					       &rq->hwq.max_elements,
> +					       BNXT_QPLIB_MAX_RQE_ENTRY_SIZE, 0,
> +					       PAGE_SIZE, HWQ_TYPE_QUEUE);
> +		if (rc)
> +			goto fail_sq;
> +
> +		rq->swq = kcalloc(rq->hwq.max_elements, sizeof(*rq->swq),
> +				  GFP_KERNEL);
> +		if (!rq->swq) {
> +			rc = -ENOMEM;
> +			goto fail_rq;
> +		}
> +		pbl = &rq->hwq.pbl[PBL_LVL_0];
> +		req.rq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> +		req.rq_pg_size_rq_lvl =
> +			((rq->hwq.level & CMDQ_CREATE_QP1_RQ_LVL_MASK) <<
> +			 CMDQ_CREATE_QP1_RQ_LVL_SFT) |
> +				(pbl->pg_size == ROCE_PG_SIZE_4K ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_4K :
> +				 pbl->pg_size == ROCE_PG_SIZE_8K ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_8K :
> +				 pbl->pg_size == ROCE_PG_SIZE_64K ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_64K :
> +				 pbl->pg_size == ROCE_PG_SIZE_2M ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_2M :
> +				 pbl->pg_size == ROCE_PG_SIZE_8M ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_8M :
> +				 pbl->pg_size == ROCE_PG_SIZE_1G ?
> +					CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_1G :
> +				 CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_4K);
> +		if (qp->rcq)
> +			req.rcq_cid = cpu_to_le32(qp->rcq->id);
> +	}
> +
> +	/* Header buffer - allow hdr_buf pass in */
> +	rc = bnxt_qplib_alloc_qp_hdr_buf(res, qp);
> +	if (rc) {
> +		rc = -ENOMEM;
> +		goto fail;
> +	}
> +	req.qp_flags = cpu_to_le32(qp_flags);
> +	req.sq_size = cpu_to_le32(sq->hwq.max_elements);
> +	req.rq_size = cpu_to_le32(rq->hwq.max_elements);
> +
> +	req.sq_fwo_sq_sge =
> +		cpu_to_le16((sq->max_sge & CMDQ_CREATE_QP1_SQ_SGE_MASK) <<
> +			    CMDQ_CREATE_QP1_SQ_SGE_SFT);
> +	req.rq_fwo_rq_sge =
> +		cpu_to_le16((rq->max_sge & CMDQ_CREATE_QP1_RQ_SGE_MASK) <<
> +			    CMDQ_CREATE_QP1_RQ_SGE_SFT);
> +
> +	req.pd_id = cpu_to_le32(qp->pd->id);
> +
> +	resp = (struct creq_create_qp1_resp *)
> +			bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> +						     NULL, 0);
> +	if (!resp) {
> +		dev_err(&res->pdev->dev, "QPLIB: FP: CREATE_QP1 send failed");
> +		rc = -EINVAL;
> +		goto fail;
> +	}
> +	/**/

It looks like you forgot to put any text in this comment.

> +	if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> +		/* Cmd timed out */
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP1 timed out");
> +		rc = -ETIMEDOUT;
> +		goto fail;
> +	}
> +	if (RCFW_RESP_STATUS(resp) ||
> +	    RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP1 failed ");
> +		dev_err(&rcfw->pdev->dev,
> +			"QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> +			RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> +			RCFW_RESP_COOKIE(resp));
> +		rc = -EINVAL;
> +		goto fail;
> +	}
> +	qp->id = le32_to_cpu(resp->xid);
> +	qp->cur_qp_state = CMDQ_MODIFY_QP_NEW_STATE_RESET;
> +	sq->flush_in_progress = false;
> +	rq->flush_in_progress = false;
> +
> +	return 0;
> +
> +fail:
> +	bnxt_qplib_free_qp_hdr_buf(res, qp);
> +fail_rq:
> +	bnxt_qplib_free_hwq(res->pdev, &rq->hwq);
> +	kfree(rq->swq);
> +fail_sq:
> +	bnxt_qplib_free_hwq(res->pdev, &sq->hwq);
> +	kfree(sq->swq);
> +exit:
> +	return rc;
> +}
> +
> +int bnxt_qplib_create_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> +	struct sq_send *hw_sq_send_hdr, **hw_sq_send_ptr;
> +	struct cmdq_create_qp req;
> +	struct creq_create_qp_resp *resp;
> +	struct bnxt_qplib_pbl *pbl;
> +	struct sq_psn_search **psn_search_ptr;
> +	unsigned long long int psn_search, poff = 0;
> +	struct bnxt_qplib_q *sq = &qp->sq;
> +	struct bnxt_qplib_q *rq = &qp->rq;
> +	struct bnxt_qplib_hwq *xrrq;
> +	int i, rc, req_size, psn_sz;
> +	u16 cmd_flags = 0, max_ssge;
> +	u32 sw_prod, qp_flags = 0;
> +
> +	RCFW_CMD_PREP(req, CREATE_QP, cmd_flags);
> +
> +	/* General */
> +	req.type = qp->type;
> +	req.dpi = cpu_to_le32(qp->dpi->dpi);
> +	req.qp_handle = cpu_to_le64(qp->qp_handle);
> +
> +	/* SQ */
> +	psn_sz = (qp->type == CMDQ_CREATE_QP_TYPE_RC) ?
> +		 sizeof(struct sq_psn_search) : 0;
> +	sq->hwq.max_elements = sq->max_wqe;
> +	rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, sq->sglist,
> +				       sq->nmap, &sq->hwq.max_elements,
> +				       BNXT_QPLIB_MAX_SQE_ENTRY_SIZE,
> +				       psn_sz,
> +				       PAGE_SIZE, HWQ_TYPE_QUEUE);
> +	if (rc)
> +		goto exit;
> +
> +	sq->swq = kcalloc(sq->hwq.max_elements, sizeof(*sq->swq), GFP_KERNEL);
> +	if (!sq->swq) {
> +		rc = -ENOMEM;
> +		goto fail_sq;
> +	}
> +	hw_sq_send_ptr = (struct sq_send **)sq->hwq.pbl_ptr;
> +	if (psn_sz) {
> +		psn_search_ptr = (struct sq_psn_search **)
> +				  &hw_sq_send_ptr[SQE_PG(sq->hwq.max_elements)];
> +		psn_search = (unsigned long long int)
> +			      &hw_sq_send_ptr[SQE_PG(sq->hwq.max_elements)]
> +			      [SQE_IDX(sq->hwq.max_elements)];
> +		if (psn_search & ~PAGE_MASK) {
> +			/* If the psn_search does not start on a page boundary,
> +			 * then calculate the offset
> +			 */
> +			poff = (psn_search & ~PAGE_MASK) /
> +				BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE;
> +		}
> +		for (i = 0; i < sq->hwq.max_elements; i++)
> +			sq->swq[i].psn_search =
> +				&psn_search_ptr[PSNE_PG(i + poff)]
> +					       [PSNE_IDX(i + poff)];
> +	}
> +	pbl = &sq->hwq.pbl[PBL_LVL_0];
> +	req.sq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> +	req.sq_pg_size_sq_lvl =
> +		((sq->hwq.level & CMDQ_CREATE_QP_SQ_LVL_MASK)
> +				 <<  CMDQ_CREATE_QP_SQ_LVL_SFT) |
> +		(pbl->pg_size == ROCE_PG_SIZE_4K ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_4K :
> +		 pbl->pg_size == ROCE_PG_SIZE_8K ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_8K :
> +		 pbl->pg_size == ROCE_PG_SIZE_64K ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_64K :
> +		 pbl->pg_size == ROCE_PG_SIZE_2M ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_2M :
> +		 pbl->pg_size == ROCE_PG_SIZE_8M ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_8M :
> +		 pbl->pg_size == ROCE_PG_SIZE_1G ?
> +				CMDQ_CREATE_QP_SQ_PG_SIZE_PG_1G :
> +		 CMDQ_CREATE_QP_SQ_PG_SIZE_PG_4K);
> +
> +	/* initialize all SQ WQEs to LOCAL_INVALID (sq prep for hw fetch) */
> +	hw_sq_send_ptr = (struct sq_send **)sq->hwq.pbl_ptr;
> +	for (sw_prod = 0; sw_prod < sq->hwq.max_elements; sw_prod++) {
> +		hw_sq_send_hdr = &hw_sq_send_ptr[SQE_PG(sw_prod)]
> +						[SQE_IDX(sw_prod)];
> +		hw_sq_send_hdr->wqe_type = SQ_BASE_WQE_TYPE_LOCAL_INVALID;
> +	}
> +
> +	if (qp->scq)
> +		req.scq_cid = cpu_to_le32(qp->scq->id);
> +
> +	qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_RESERVED_LKEY_ENABLE;
> +	qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_FR_PMR_ENABLED;
> +	if (qp->sig_type)
> +		qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_FORCE_COMPLETION;
> +
> +	/* RQ */
> +	if (rq->max_wqe) {
> +		rq->hwq.max_elements = rq->max_wqe;
> +		rc = bnxt_qplib_alloc_init_hwq(res->pdev, &rq->hwq, rq->sglist,
> +					       rq->nmap, &rq->hwq.max_elements,
> +					       BNXT_QPLIB_MAX_RQE_ENTRY_SIZE, 0,
> +					       PAGE_SIZE, HWQ_TYPE_QUEUE);
> +		if (rc)
> +			goto fail_sq;
> +
> +		rq->swq = kcalloc(rq->hwq.max_elements, sizeof(*rq->swq),
> +				  GFP_KERNEL);
> +		if (!rq->swq) {
> +			rc = -ENOMEM;
> +			goto fail_rq;
> +		}
> +		pbl = &rq->hwq.pbl[PBL_LVL_0];
> +		req.rq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> +		req.rq_pg_size_rq_lvl =
> +			((rq->hwq.level & CMDQ_CREATE_QP_RQ_LVL_MASK) <<
> +			 CMDQ_CREATE_QP_RQ_LVL_SFT) |
> +				(pbl->pg_size == ROCE_PG_SIZE_4K ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_4K :
> +				 pbl->pg_size == ROCE_PG_SIZE_8K ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_8K :
> +				 pbl->pg_size == ROCE_PG_SIZE_64K ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_64K :
> +				 pbl->pg_size == ROCE_PG_SIZE_2M ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_2M :
> +				 pbl->pg_size == ROCE_PG_SIZE_8M ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_8M :
> +				 pbl->pg_size == ROCE_PG_SIZE_1G ?
> +					CMDQ_CREATE_QP_RQ_PG_SIZE_PG_1G :
> +				 CMDQ_CREATE_QP_RQ_PG_SIZE_PG_4K);
> +	}
> +
> +	if (qp->rcq)
> +		req.rcq_cid = cpu_to_le32(qp->rcq->id);
> +	req.qp_flags = cpu_to_le32(qp_flags);
> +	req.sq_size = cpu_to_le32(sq->hwq.max_elements);
> +	req.rq_size = cpu_to_le32(rq->hwq.max_elements);
> +	qp->sq_hdr_buf = NULL;
> +	qp->rq_hdr_buf = NULL;
> +
> +	rc = bnxt_qplib_alloc_qp_hdr_buf(res, qp);
> +	if (rc)
> +		goto fail_rq;
> +
> +	/* CTRL-22434: Irrespective of the requested SGE count on the SQ
> +	 * always create the QP with max send sges possible if the requested
> +	 * inline size is greater than 0.
> +	 */
> +	max_ssge = qp->max_inline_data ? 6 : sq->max_sge;
> +	req.sq_fwo_sq_sge = cpu_to_le16(
> +				((max_ssge & CMDQ_CREATE_QP_SQ_SGE_MASK)
> +				 << CMDQ_CREATE_QP_SQ_SGE_SFT) | 0);
> +	req.rq_fwo_rq_sge = cpu_to_le16(
> +				((rq->max_sge & CMDQ_CREATE_QP_RQ_SGE_MASK)
> +				 << CMDQ_CREATE_QP_RQ_SGE_SFT) | 0);
> +	/* ORRQ and IRRQ */
> +	if (psn_sz) {
> +		xrrq = &qp->orrq;
> +		xrrq->max_elements =
> +			ORD_LIMIT_TO_ORRQ_SLOTS(qp->max_rd_atomic);
> +		req_size = xrrq->max_elements *
> +			   BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE + PAGE_SIZE - 1;
> +		req_size &= ~(PAGE_SIZE - 1);
> +		rc = bnxt_qplib_alloc_init_hwq(res->pdev, xrrq, NULL, 0,
> +					       &xrrq->max_elements,
> +					       BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE,
> +					       0, req_size, HWQ_TYPE_CTX);
> +		if (rc)
> +			goto fail_buf_free;
> +		pbl = &xrrq->pbl[PBL_LVL_0];
> +		req.orrq_addr = cpu_to_le64(pbl->pg_map_arr[0]);
> +
> +		xrrq = &qp->irrq;
> +		xrrq->max_elements = IRD_LIMIT_TO_IRRQ_SLOTS(
> +						qp->max_dest_rd_atomic);
> +		req_size = xrrq->max_elements *
> +			   BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE + PAGE_SIZE - 1;
> +		req_size &= ~(PAGE_SIZE - 1);
> +
> +		rc = bnxt_qplib_alloc_init_hwq(res->pdev, xrrq, NULL, 0,
> +					       &xrrq->max_elements,
> +					       BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE,
> +					       0, req_size, HWQ_TYPE_CTX);
> +		if (rc)
> +			goto fail_orrq;
> +
> +		pbl = &xrrq->pbl[PBL_LVL_0];
> +		req.irrq_addr = cpu_to_le64(pbl->pg_map_arr[0]);
> +	}
> +	req.pd_id = cpu_to_le32(qp->pd->id);
> +
> +	resp = (struct creq_create_qp_resp *)
> +			bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> +						     NULL, 0);
> +	if (!resp) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP send failed");
> +		rc = -EINVAL;
> +		goto fail;
> +	}
> +	/**/
> +	if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> +		/* Cmd timed out */
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP timed out");
> +		rc = -ETIMEDOUT;
> +		goto fail;
> +	}
> +	if (RCFW_RESP_STATUS(resp) ||
> +	    RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP failed ");
> +		dev_err(&rcfw->pdev->dev,
> +			"QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> +			RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> +			RCFW_RESP_COOKIE(resp));
> +		rc = -EINVAL;
> +		goto fail;
> +	}
> +	qp->id = le32_to_cpu(resp->xid);
> +	qp->cur_qp_state = CMDQ_MODIFY_QP_NEW_STATE_RESET;
> +	sq->flush_in_progress = false;
> +	rq->flush_in_progress = false;
> +
> +	return 0;
> +
> +fail:
> +	if (qp->irrq.max_elements)
> +		bnxt_qplib_free_hwq(res->pdev, &qp->irrq);
> +fail_orrq:
> +	if (qp->orrq.max_elements)
> +		bnxt_qplib_free_hwq(res->pdev, &qp->orrq);
> +fail_buf_free:
> +	bnxt_qplib_free_qp_hdr_buf(res, qp);
> +fail_rq:
> +	bnxt_qplib_free_hwq(res->pdev, &rq->hwq);
> +	kfree(rq->swq);
> +fail_sq:
> +	bnxt_qplib_free_hwq(res->pdev, &sq->hwq);
> +	kfree(sq->swq);
> +exit:
> +	return rc;
> +}
> +
> +static void __filter_modify_flags(struct bnxt_qplib_qp *qp)
> +{

It would make this easier to review if you broke this function into
smaller pieces and got rid of the switch->switch->if construction.
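
Something along these lines, as a rough standalone sketch rather than a
drop-in replacement: one small helper per transition that actually does
work, plus a flat dispatcher. Every sketch_/SK_ name below is a made-up
stand-in for the corresponding bnxt_qplib/CMDQ_MODIFY_QP_* definition, the
bit values are placeholders, and the RTR->RTS mask list is shortened.

```c
#include <stdint.h>

/* Placeholder states and mask bits (hypothetical); the real values come
 * from the CMDQ_MODIFY_QP_* definitions in the patch.
 */
enum sketch_state { SK_RESET, SK_INIT, SK_RTR, SK_RTS };

#define SK_MASK_PATH_MTU	(1ULL << 0)
#define SK_MASK_VLAN_ID		(1ULL << 1)
#define SK_MASK_SRC_MAC		(1ULL << 2)
#define SK_MASK_SGID_INDEX	(1ULL << 3)
#define SK_MASK_DGID		(1ULL << 4)
#define SK_MASK_RQ_PSN		(1ULL << 5)
#define SK_PATH_MTU_2048	2048	/* placeholder, not the HW encoding */

/* Attributes the FW rejects on RTR->RTS; shortened for the sketch */
#define SK_RTS_FORBIDDEN	(SK_MASK_DGID | SK_MASK_SGID_INDEX | \
				 SK_MASK_PATH_MTU | SK_MASK_RQ_PSN)

struct sketch_qp {
	uint64_t modify_flags;
	uint8_t state;		/* requested state */
	uint8_t cur_qp_state;	/* current state */
	uint32_t path_mtu;
	uint32_t max_rd_atomic;
	uint32_t max_dest_rd_atomic;
	uint16_t sgid_index;
};

static void sketch_filter_init_to_rtr(struct sketch_qp *qp)
{
	/* Default the path MTU to 2048 if the caller did not request one */
	if (!(qp->modify_flags & SK_MASK_PATH_MTU)) {
		qp->modify_flags |= SK_MASK_PATH_MTU;
		qp->path_mtu = SK_PATH_MTU_2048;
	}
	qp->modify_flags &= ~(SK_MASK_VLAN_ID | SK_MASK_SRC_MAC);
	/* FW requires max_dest_rd_atomic >= 1 */
	if (qp->max_dest_rd_atomic < 1)
		qp->max_dest_rd_atomic = 1;
	/* FW requires an SGID index; fall back to index 0 */
	if (!(qp->modify_flags & SK_MASK_SGID_INDEX)) {
		qp->modify_flags |= SK_MASK_SGID_INDEX;
		qp->sgid_index = 0;
	}
}

static void sketch_filter_rtr_to_rts(struct sketch_qp *qp)
{
	/* FW requires max_rd_atomic >= 1 */
	if (qp->max_rd_atomic < 1)
		qp->max_rd_atomic = 1;
	qp->modify_flags &= ~SK_RTS_FORBIDDEN;
}

void sketch_filter_modify_flags(struct sketch_qp *qp)
{
	if (qp->cur_qp_state == SK_INIT && qp->state == SK_RTR)
		sketch_filter_init_to_rtr(qp);
	else if (qp->cur_qp_state == SK_RTR && qp->state == SK_RTS)
		sketch_filter_rtr_to_rts(qp);
	/* All other transitions leave the flags untouched */
}
```

The empty case/default arms go away, and each transition's FW quirks can be
read and reviewed on their own.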

> +	switch (qp->cur_qp_state) {
> +	case CMDQ_MODIFY_QP_NEW_STATE_RESET:
> +		switch (qp->state) {
> +		case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> +			break;
> +		default:
> +			break;
> +		}
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> +		switch (qp->state) {
> +		case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> +			/* INIT->RTR, configure the path_mtu to the default
> +			 * 2048 if not being requested
> +			 */
> +			if (!(qp->modify_flags &
> +			      CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU)) {
> +				qp->modify_flags |=
> +					CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> +				qp->path_mtu = CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> +			}
> +			qp->modify_flags &=
> +				~CMDQ_MODIFY_QP_MODIFY_MASK_VLAN_ID;
> +			/* Bono FW requires the max_dest_rd_atomic to be >= 1 */
> +			if (qp->max_dest_rd_atomic < 1)
> +				qp->max_dest_rd_atomic = 1;
> +			qp->modify_flags &= ~CMDQ_MODIFY_QP_MODIFY_MASK_SRC_MAC;
> +			/* Bono FW 20.6.5 requires SGID_INDEX configuration */
> +			if (!(qp->modify_flags &
> +			      CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX)) {
> +				qp->modify_flags |=
> +					CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX;
> +				qp->ah.sgid_index = 0;
> +			}
> +			break;
> +		default:
> +			break;
> +		}
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> +		switch (qp->state) {
> +		case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> +			/* Bono FW requires the max_rd_atomic to be >= 1 */
> +			if (qp->max_rd_atomic < 1)
> +				qp->max_rd_atomic = 1;
> +			/* Bono FW does not allow PKEY_INDEX,
> +			 * DGID, FLOW_LABEL, SGID_INDEX, HOP_LIMIT,
> +			 * TRAFFIC_CLASS, DEST_MAC, PATH_MTU, RQ_PSN,
> +			 * MIN_RNR_TIMER, MAX_DEST_RD_ATOMIC, DEST_QP_ID
> +			 * modification
> +			 */
> +			qp->modify_flags &=
> +				~(CMDQ_MODIFY_QP_MODIFY_MASK_PKEY |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_DGID |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER |
> +				  CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC
> +				  | CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID);
> +			break;
> +		default:
> +			break;
> +		}
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_SQD:
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_SQE:
> +		break;
> +	case CMDQ_MODIFY_QP_NEW_STATE_ERR:
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +int bnxt_qplib_modify_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> +	struct cmdq_modify_qp req;
> +	struct creq_modify_qp_resp *resp;
> +	u16 cmd_flags = 0, pkey;
> +	u32 temp32[4];
> +	u32 bmask;
> +
> +	RCFW_CMD_PREP(req, MODIFY_QP, cmd_flags);
> +
> +	/* Filter out the qp_attr_mask based on the state->new transition */
> +	__filter_modify_flags(qp);
> +	bmask = qp->modify_flags;
> +	req.modify_mask = cpu_to_le64(qp->modify_flags);
> +	req.qp_cid = cpu_to_le32(qp->id);
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_STATE) {
> +		req.network_type_en_sqd_async_notify_new_state =
> +				(qp->state & CMDQ_MODIFY_QP_NEW_STATE_MASK) |
> +				(qp->en_sqd_async_notify ?
> +					CMDQ_MODIFY_QP_EN_SQD_ASYNC_NOTIFY : 0);
> +	}
> +	req.network_type_en_sqd_async_notify_new_state |= qp->nw_type;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS)
> +		req.access = qp->access;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_PKEY) {
> +		if (!bnxt_qplib_get_pkey(res, &res->pkey_tbl,
> +					 qp->pkey_index, &pkey))
> +			req.pkey = cpu_to_le16(pkey);
> +	}
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_QKEY)
> +		req.qkey = cpu_to_le32(qp->qkey);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DGID) {
> +		memcpy(temp32, qp->ah.dgid.data, sizeof(struct bnxt_qplib_gid));
> +		req.dgid[0] = cpu_to_le32(temp32[0]);
> +		req.dgid[1] = cpu_to_le32(temp32[1]);
> +		req.dgid[2] = cpu_to_le32(temp32[2]);
> +		req.dgid[3] = cpu_to_le32(temp32[3]);
> +	}
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL)
> +		req.flow_label = cpu_to_le32(qp->ah.flow_label);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX)
> +		req.sgid_index = cpu_to_le16(res->sgid_tbl.hw_id
> +					     [qp->ah.sgid_index]);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT)
> +		req.hop_limit = qp->ah.hop_limit;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS)
> +		req.traffic_class = qp->ah.traffic_class;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC)
> +		memcpy(req.dest_mac, qp->ah.dmac, 6);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU)
> +		req.path_mtu = cpu_to_le16(qp->path_mtu);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_TIMEOUT)
> +		req.timeout = qp->timeout;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RETRY_CNT)
> +		req.retry_cnt = qp->retry_cnt;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RNR_RETRY)
> +		req.rnr_retry = qp->rnr_retry;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER)
> +		req.min_rnr_timer = qp->min_rnr_timer;
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN)
> +		req.rq_psn = cpu_to_le32(qp->rq.psn);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN)
> +		req.sq_psn = cpu_to_le32(qp->sq.psn);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MAX_RD_ATOMIC)
> +		req.max_rd_atomic =
> +			ORD_LIMIT_TO_ORRQ_SLOTS(qp->max_rd_atomic);
> +
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC)
> +		req.max_dest_rd_atomic =
> +			IRD_LIMIT_TO_IRRQ_SLOTS(qp->max_dest_rd_atomic);
> +
> +	req.sq_size = cpu_to_le32(qp->sq.hwq.max_elements);
> +	req.rq_size = cpu_to_le32(qp->rq.hwq.max_elements);
> +	req.sq_sge = cpu_to_le16(qp->sq.max_sge);
> +	req.rq_sge = cpu_to_le16(qp->rq.max_sge);
> +	req.max_inline_data = cpu_to_le32(qp->max_inline_data);
> +	if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID)
> +		req.dest_qp_id = cpu_to_le32(qp->dest_qpn);
> +
> +	req.vlan_pcp_vlan_dei_vlan_id = cpu_to_le16(qp->vlan_id);
> +
> +	resp = (struct creq_modify_qp_resp *)
> +			bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> +						     NULL, 0);
> +	if (!resp) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP send failed");
> +		return -EINVAL;
> +	}
> +	/**/
> +	if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> +		/* Cmd timed out */
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP timed out");
> +		return -ETIMEDOUT;
> +	}
> +	if (RCFW_RESP_STATUS(resp) ||
> +	    RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP failed ");
> +		dev_err(&rcfw->pdev->dev,
> +			"QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> +			RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> +			RCFW_RESP_COOKIE(resp));
> +		return -EINVAL;
> +	}
> +	qp->cur_qp_state = qp->state;
> +	return 0;
> +}
> +
> +int bnxt_qplib_query_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> +	struct cmdq_query_qp req;
> +	struct creq_query_qp_resp *resp;
> +	struct creq_query_qp_resp_sb *sb;
> +	u16 cmd_flags = 0;
> +	u32 temp32[4];
> +	int i;
> +
> +	RCFW_CMD_PREP(req, QUERY_QP, cmd_flags);
> +
> +	req.qp_cid = cpu_to_le32(qp->id);
> +	req.resp_size = sizeof(*sb) / BNXT_QPLIB_CMDQE_UNITS;
> +	resp = (struct creq_query_qp_resp *)
> +			bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> +						     (void **)&sb, 0);
> +	if (!resp) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP send failed");
> +		return -EINVAL;
> +	}
> +	/**/
> +	if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> +		/* Cmd timed out */
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP timed out");
> +		return -ETIMEDOUT;
> +	}
> +	if (RCFW_RESP_STATUS(resp) ||
> +	    RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP failed ");
> +		dev_err(&rcfw->pdev->dev,
> +			"QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> +			RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> +			RCFW_RESP_COOKIE(resp));
> +		return -EINVAL;
> +	}
> +	/* Extract the context from the side buffer */
> +	qp->state = sb->en_sqd_async_notify_state &
> +			CREQ_QUERY_QP_RESP_SB_STATE_MASK;
> +	qp->en_sqd_async_notify = sb->en_sqd_async_notify_state &
> +				  CREQ_QUERY_QP_RESP_SB_EN_SQD_ASYNC_NOTIFY ?
> +				  true : false;
> +	qp->access = sb->access;
> +	qp->pkey_index = le16_to_cpu(sb->pkey);
> +	qp->qkey = le32_to_cpu(sb->qkey);
> +
> +	temp32[0] = le32_to_cpu(sb->dgid[0]);
> +	temp32[1] = le32_to_cpu(sb->dgid[1]);
> +	temp32[2] = le32_to_cpu(sb->dgid[2]);
> +	temp32[3] = le32_to_cpu(sb->dgid[3]);
> +	memcpy(qp->ah.dgid.data, temp32, sizeof(qp->ah.dgid.data));
> +
> +	qp->ah.flow_label = le32_to_cpu(sb->flow_label);
> +
> +	qp->ah.sgid_index = 0;
> +	for (i = 0; i < res->sgid_tbl.max; i++) {
> +		if (res->sgid_tbl.hw_id[i] == le16_to_cpu(sb->sgid_index)) {
> +			qp->ah.sgid_index = i;
> +			break;
> +		}
> +	}
> +	if (i == res->sgid_tbl.max)
> +		dev_warn(&res->pdev->dev, "QPLIB: SGID not found??");
> +
> +	qp->ah.hop_limit = sb->hop_limit;
> +	qp->ah.traffic_class = sb->traffic_class;
> +	memcpy(qp->ah.dmac, sb->dest_mac, 6);
> +	qp->ah.vlan_id = le16_to_cpu((sb->path_mtu_dest_vlan_id &
> +				CREQ_QUERY_QP_RESP_SB_VLAN_ID_MASK) >>
> +				CREQ_QUERY_QP_RESP_SB_VLAN_ID_SFT);
> +	qp->path_mtu = sb->path_mtu_dest_vlan_id &
> +				    CREQ_QUERY_QP_RESP_SB_PATH_MTU_MASK;
> +	qp->timeout = sb->timeout;
> +	qp->retry_cnt = sb->retry_cnt;
> +	qp->rnr_retry = sb->rnr_retry;
> +	qp->min_rnr_timer = sb->min_rnr_timer;
> +	qp->rq.psn = le32_to_cpu(sb->rq_psn);
> +	qp->max_rd_atomic = ORRQ_SLOTS_TO_ORD_LIMIT(sb->max_rd_atomic);
> +	qp->sq.psn = le32_to_cpu(sb->sq_psn);
> +	qp->max_dest_rd_atomic =
> +			IRRQ_SLOTS_TO_IRD_LIMIT(sb->max_dest_rd_atomic);
> +	qp->sq.max_wqe = qp->sq.hwq.max_elements;
> +	qp->rq.max_wqe = qp->rq.hwq.max_elements;
> +	qp->sq.max_sge = le16_to_cpu(sb->sq_sge);
> +	qp->rq.max_sge = le32_to_cpu(sb->rq_sge);
> +	qp->max_inline_data = le32_to_cpu(sb->max_inline_data);
> +	qp->dest_qpn = le32_to_cpu(sb->dest_qp_id);
> +	memcpy(qp->smac, sb->src_mac, 6);
> +	qp->vlan_id = le16_to_cpu(sb->vlan_pcp_vlan_dei_vlan_id);
> +	return 0;
> +}
> +
> +static void __clean_cq(struct bnxt_qplib_cq *cq, u64 qp)
> +{
> +	struct bnxt_qplib_hwq *cq_hwq = &cq->hwq;
> +	struct cq_base *hw_cqe, **hw_cqe_ptr;
> +	int i;
> +
> +	for (i = 0; i < cq_hwq->max_elements; i++) {
> +		hw_cqe_ptr = (struct cq_base **)cq_hwq->pbl_ptr;
> +		hw_cqe = &hw_cqe_ptr[CQE_PG(i)][CQE_IDX(i)];
> +		if (!CQE_CMP_VALID(hw_cqe, i, cq_hwq->max_elements))
> +			continue;
> +		switch (hw_cqe->cqe_type_toggle & CQ_BASE_CQE_TYPE_MASK) {
> +		case CQ_BASE_CQE_TYPE_REQ:
> +		case CQ_BASE_CQE_TYPE_TERMINAL:
> +		{
> +			struct cq_req *cqe = (struct cq_req *)hw_cqe;
> +
> +			if (qp == le64_to_cpu(cqe->qp_handle))
> +				cqe->qp_handle = 0;
> +			break;
> +		}
> +		case CQ_BASE_CQE_TYPE_RES_RC:
> +		case CQ_BASE_CQE_TYPE_RES_UD:
> +		case CQ_BASE_CQE_TYPE_RES_RAWETH_QP1:
> +		{
> +			struct cq_res_rc *cqe = (struct cq_res_rc *)hw_cqe;
> +
> +			if (qp == le64_to_cpu(cqe->qp_handle))
> +				cqe->qp_handle = 0;
> +			break;
> +		}
> +		default:
> +			break;
> +		}
> +	}
> +}
> +
> +static unsigned long bnxt_qplib_lock_cqs(struct bnxt_qplib_qp *qp)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&qp->scq->hwq.lock, flags);
> +	if (qp->rcq && qp->rcq != qp->scq)
> +		spin_lock(&qp->rcq->hwq.lock);
> +
> +	return flags;
> +}
> +
> +static void bnxt_qplib_unlock_cqs(struct bnxt_qplib_qp *qp,
> +				  unsigned long flags)
> +{
> +	if (qp->rcq && qp->rcq != qp->scq)
> +		spin_unlock(&qp->rcq->hwq.lock);
> +	spin_unlock_irqrestore(&qp->scq->hwq.lock, flags);
> +}
> +
> +int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res,
> +			  struct bnxt_qplib_qp *qp)
> +{
> +	struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> +	struct cmdq_destroy_qp req;
> +	struct creq_destroy_qp_resp *resp;
> +	unsigned long flags;
> +	u16 cmd_flags = 0;
> +
> +	RCFW_CMD_PREP(req, DESTROY_QP, cmd_flags);
> +
> +	req.qp_cid = cpu_to_le32(qp->id);
> +	resp = (struct creq_destroy_qp_resp *)
> +			bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> +						     NULL, 0);
> +	if (!resp) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP send failed");
> +		return -EINVAL;
> +	}
> +	/**/
> +	if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> +		/* Cmd timed out */
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP timed out");
> +		return -ETIMEDOUT;
> +	}
> +	if (RCFW_RESP_STATUS(resp) ||
> +	    RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> +		dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP failed ");
> +		dev_err(&rcfw->pdev->dev,
> +			"QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> +			RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> +			RCFW_RESP_COOKIE(resp));
> +		return -EINVAL;
> +	}
> +
> +	/* Must walk the associated CQs to nullified the QP ptr */
> +	flags = bnxt_qplib_lock_cqs(qp);
> +	__clean_cq(qp->scq, (u64)qp);
> +	if (qp->rcq != qp->scq)
> +		__clean_cq(qp->rcq, (u64)qp);
> +	bnxt_qplib_unlock_cqs(qp, flags);
> +
> +	bnxt_qplib_free_qp_hdr_buf(res, qp);
> +	bnxt_qplib_free_hwq(res->pdev, &qp->sq.hwq);
> +	kfree(qp->sq.swq);
> +
> +	bnxt_qplib_free_hwq(res->pdev, &qp->rq.hwq);
> +	kfree(qp->rq.swq);
> +
> +	if (qp->irrq.max_elements)
> +		bnxt_qplib_free_hwq(res->pdev, &qp->irrq);
> +	if (qp->orrq.max_elements)
> +		bnxt_qplib_free_hwq(res->pdev, &qp->orrq);
> +
> +	return 0;
> +}
> +
>  /* CQ */
>
>  /* Spinlock must be held */
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> index 1991eaa..f6d2be5 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> @@ -38,8 +38,246 @@
>
>  #ifndef __BNXT_QPLIB_FP_H__
>  #define __BNXT_QPLIB_FP_H__
> +struct bnxt_qplib_sge {
> +	u64				addr;
> +	u32				lkey;
> +	u32				size;
> +};
> +
> +#define BNXT_QPLIB_MAX_SQE_ENTRY_SIZE	sizeof(struct sq_send)
> +
> +#define SQE_CNT_PER_PG		(PAGE_SIZE / BNXT_QPLIB_MAX_SQE_ENTRY_SIZE)
> +#define SQE_MAX_IDX_PER_PG	(SQE_CNT_PER_PG - 1)
> +#define SQE_PG(x)		(((x) & ~SQE_MAX_IDX_PER_PG) / SQE_CNT_PER_PG)
> +#define SQE_IDX(x)		((x) & SQE_MAX_IDX_PER_PG)
> +
> +#define BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE	sizeof(struct sq_psn_search)
> +
> +#define PSNE_CNT_PER_PG		(PAGE_SIZE / BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE)
> +#define PSNE_MAX_IDX_PER_PG	(PSNE_CNT_PER_PG - 1)
> +#define PSNE_PG(x)		(((x) & ~PSNE_MAX_IDX_PER_PG) / PSNE_CNT_PER_PG)
> +#define PSNE_IDX(x)		((x) & PSNE_MAX_IDX_PER_PG)
> +
> +#define BNXT_QPLIB_QP_MAX_SGL	6
> +
> +struct bnxt_qplib_swq {
> +	u64				wr_id;
> +	u8				type;
> +	u8				flags;
> +	u32				start_psn;
> +	u32				next_psn;
> +	struct sq_psn_search		*psn_search;
> +};
> +
> +struct bnxt_qplib_swqe {
> +	/* General */
> +	u64				wr_id;
> +	u8				reqs_type;
> +	u8				type;
> +#define BNXT_QPLIB_SWQE_TYPE_SEND			0
> +#define BNXT_QPLIB_SWQE_TYPE_SEND_WITH_IMM		1
> +#define BNXT_QPLIB_SWQE_TYPE_SEND_WITH_INV		2
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_WRITE			4
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_WRITE_WITH_IMM	5
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_READ			6
> +#define BNXT_QPLIB_SWQE_TYPE_ATOMIC_CMP_AND_SWP		8
> +#define BNXT_QPLIB_SWQE_TYPE_ATOMIC_FETCH_AND_ADD	11
> +#define BNXT_QPLIB_SWQE_TYPE_LOCAL_INV			12
> +#define BNXT_QPLIB_SWQE_TYPE_FAST_REG_MR		13
> +#define BNXT_QPLIB_SWQE_TYPE_REG_MR			13
> +#define BNXT_QPLIB_SWQE_TYPE_BIND_MW			14
> +#define BNXT_QPLIB_SWQE_TYPE_RECV			128
> +#define BNXT_QPLIB_SWQE_TYPE_RECV_RDMA_IMM		129
> +	u8				flags;
> +#define BNXT_QPLIB_SWQE_FLAGS_SIGNAL_COMP		BIT(0)
> +#define BNXT_QPLIB_SWQE_FLAGS_RD_ATOMIC_FENCE		BIT(1)
> +#define BNXT_QPLIB_SWQE_FLAGS_UC_FENCE			BIT(2)
> +#define BNXT_QPLIB_SWQE_FLAGS_SOLICIT_EVENT		BIT(3)
> +#define BNXT_QPLIB_SWQE_FLAGS_INLINE			BIT(4)
> +	struct bnxt_qplib_sge		sg_list[BNXT_QPLIB_QP_MAX_SGL];
> +	int				num_sge;
> +	/* Max inline data is 96 bytes */
> +	u32				inline_len;
> +#define BNXT_QPLIB_SWQE_MAX_INLINE_LENGTH		96
> +	u8		inline_data[BNXT_QPLIB_SWQE_MAX_INLINE_LENGTH];
> +
> +	union {
> +		/* Send, with imm, inval key */
> +		struct {
> +			u32		imm_data_or_inv_key;
> +			u32		q_key;
> +			u32		dst_qp;
> +			u16		avid;
> +		} send;
> +
> +		/* Send Raw Ethernet and QP1 */
> +		struct {
> +			u16		lflags;
> +			u16		cfa_action;
> +			u32		cfa_meta;
> +		} rawqp1;
> +
> +		/* RDMA write, with imm, read */
> +		struct {
> +			u32		imm_data_or_inv_key;
> +			u64		remote_va;
> +			u32		r_key;
> +		} rdma;
> +
> +		/* Atomic cmp/swap, fetch/add */
> +		struct {
> +			u64		remote_va;
> +			u32		r_key;
> +			u64		swap_data;
> +			u64		cmp_data;
> +		} atomic;
> +
> +		/* Local Invalidate */
> +		struct {
> +			u32		inv_l_key;
> +		} local_inv;
> +
> +		/* FR-PMR */
> +		struct {
> +			u8		access_cntl;
> +			u8		pg_sz_log;
> +			bool		zero_based;
> +			u32		l_key;
> +			u32		length;
> +			u8		pbl_pg_sz_log;
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_4K			0
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_8K			1
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_64K			4
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_256K			6
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_1M			8
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_2M			9
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_4M			10
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_1G			18
> +			u8		levels;
> +#define PAGE_SHIFT_4K	12
> +			u64		*pbl_ptr;
> +			dma_addr_t	pbl_dma_ptr;
> +			u64		*page_list;
> +			u16		page_list_len;
> +			u64		va;
> +		} frmr;
> +
> +		/* Bind */
> +		struct {
> +			u8		access_cntl;
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_LOCAL_WRITE		BIT(0)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_READ		BIT(1)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_WRITE	BIT(2)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_ATOMIC	BIT(3)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_WINDOW_BIND		BIT(4)
> +			bool		zero_based;
> +			u8		mw_type;
> +			u32		parent_l_key;
> +			u32		r_key;
> +			u64		va;
> +			u32		length;
> +		} bind;
> +	};
> +};
> +
> +#define BNXT_QPLIB_MAX_RQE_ENTRY_SIZE	sizeof(struct rq_wqe)
> +
> +#define RQE_CNT_PER_PG		(PAGE_SIZE / BNXT_QPLIB_MAX_RQE_ENTRY_SIZE)
> +#define RQE_MAX_IDX_PER_PG	(RQE_CNT_PER_PG - 1)
> +#define RQE_PG(x)		(((x) & ~RQE_MAX_IDX_PER_PG) / RQE_CNT_PER_PG)
> +#define RQE_IDX(x)		((x) & RQE_MAX_IDX_PER_PG)
> +
> +struct bnxt_qplib_q {
> +	struct bnxt_qplib_hwq		hwq;
> +	struct bnxt_qplib_swq		*swq;
> +	struct scatterlist		*sglist;
> +	u32				nmap;
> +	u32				max_wqe;
> +	u16				max_sge;
> +	u32				psn;
> +	bool				flush_in_progress;
> +};
> +
> +struct bnxt_qplib_qp {
> +	struct bnxt_qplib_pd		*pd;
> +	struct bnxt_qplib_dpi		*dpi;
> +	u64				qp_handle;
> +	u32				id;
> +	u8				type;
> +	u8				sig_type;
> +	u64				modify_flags;
> +	u8				state;
> +	u8				cur_qp_state;
> +	u32				max_inline_data;
> +	u32				mtu;
> +	u32				path_mtu;
> +	bool				en_sqd_async_notify;
> +	u16				pkey_index;
> +	u32				qkey;
> +	u32				dest_qp_id;
> +	u8				access;
> +	u8				timeout;
> +	u8				retry_cnt;
> +	u8				rnr_retry;
> +	u32				min_rnr_timer;
> +	u32				max_rd_atomic;
> +	u32				max_dest_rd_atomic;
> +	u32				dest_qpn;
> +	u8				smac[6];
> +	u16				vlan_id;
> +	u8				nw_type;
> +	struct bnxt_qplib_ah		ah;
> +
> +#define BTH_PSN_MASK			((1 << 24) - 1)
> +	/* SQ */
> +	struct bnxt_qplib_q		sq;
> +	/* RQ */
> +	struct bnxt_qplib_q		rq;
> +	/* SRQ */
> +	struct bnxt_qplib_srq		*srq;
> +	/* CQ */
> +	struct bnxt_qplib_cq		*scq;
> +	struct bnxt_qplib_cq		*rcq;
> +	/* IRRQ and ORRQ */
> +	struct bnxt_qplib_hwq		irrq;
> +	struct bnxt_qplib_hwq		orrq;
> +	/* Header buffer for QP1 */
> +	int				sq_hdr_buf_size;
> +	int				rq_hdr_buf_size;
> +/*
> + * Buffer space for ETH(14), IP or GRH(40), UDP header(8)
> + * and ib_bth + ib_deth (20).
> + * Max required is 82 when RoCE V2 is enabled
> + */
> +#define BNXT_QPLIB_MAX_QP1_SQ_HDR_SIZE_V2	86
> +	/* Ethernet header	=  14 */
> +	/* ib_grh		=  40 (provided by MAD) */
> +	/* ib_bth + ib_deth	=  20 */
> +	/* MAD			= 256 (provided by MAD) */
> +	/* iCRC			=   4 */
> +#define BNXT_QPLIB_MAX_QP1_RQ_ETH_HDR_SIZE	14
> +#define BNXT_QPLIB_MAX_QP1_RQ_HDR_SIZE_V2	512
> +#define BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV4	20
> +#define BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV6	40
> +#define BNXT_QPLIB_MAX_QP1_RQ_BDETH_HDR_SIZE	20
> +	void				*sq_hdr_buf;
> +	dma_addr_t			sq_hdr_buf_map;
> +	void				*rq_hdr_buf;
> +	dma_addr_t			rq_hdr_buf_map;
> +};
> +
>  #define BNXT_QPLIB_MAX_CQE_ENTRY_SIZE	sizeof(struct cq_base)
>
> +#define CQE_CNT_PER_PG		(PAGE_SIZE / BNXT_QPLIB_MAX_CQE_ENTRY_SIZE)
> +#define CQE_MAX_IDX_PER_PG	(CQE_CNT_PER_PG - 1)
> +#define CQE_PG(x)		(((x) & ~CQE_MAX_IDX_PER_PG) / CQE_CNT_PER_PG)
> +#define CQE_IDX(x)		((x) & CQE_MAX_IDX_PER_PG)
> +
> +#define ROCE_CQE_CMP_V			0
> +#define CQE_CMP_VALID(hdr, raw_cons, cp_bit)			\
> +	(!!((hdr)->cqe_type_toggle & CQ_BASE_TOGGLE) ==		\
> +	   !((raw_cons) & (cp_bit)))
> +
>  struct bnxt_qplib_cqe {
>  	u8				status;
>  	u8				type;
> @@ -82,6 +320,13 @@ struct bnxt_qplib_cq {
>  	wait_queue_head_t		waitq;
>  };
>
> +#define BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE	sizeof(struct xrrq_irrq)
> +#define BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE	sizeof(struct xrrq_orrq)
> +#define IRD_LIMIT_TO_IRRQ_SLOTS(x)	(2 * (x) + 2)
> +#define IRRQ_SLOTS_TO_IRD_LIMIT(s)	(((s) >> 1) - 1)
> +#define ORD_LIMIT_TO_ORRQ_SLOTS(x)	((x) + 1)
> +#define ORRQ_SLOTS_TO_ORD_LIMIT(s)	((s) - 1)
> +
>  #define BNXT_QPLIB_MAX_NQE_ENTRY_SIZE	sizeof(struct nq_base)
>
>  #define NQE_CNT_PER_PG		(PAGE_SIZE / BNXT_QPLIB_MAX_NQE_ENTRY_SIZE)
> @@ -140,6 +385,11 @@ int bnxt_qplib_enable_nq(struct pci_dev *pdev, struct bnxt_qplib_nq *nq,
>  			 int (*srqn_handler)(struct bnxt_qplib_nq *nq,
>  					     void *srq,
>  					     u8 event));
> +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_create_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_modify_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_query_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
>  int bnxt_qplib_create_cq(struct bnxt_qplib_res *res, struct bnxt_qplib_cq *cq);
>  int bnxt_qplib_destroy_cq(struct bnxt_qplib_res *res, struct bnxt_qplib_cq *cq);
>
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re.h b/drivers/infiniband/hw/bnxtre/bnxt_re.h
> index 3a93a88..84af86b 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re.h
> @@ -64,6 +64,14 @@ struct bnxt_re_work {
>  	struct net_device	*vlan_dev;
>  };
>
> +struct bnxt_re_sqp_entries {
> +	struct bnxt_qplib_sge sge;
> +	u64 wrid;
> +	/* For storing the actual qp1 cqe */
> +	struct bnxt_qplib_cqe cqe;
> +	struct bnxt_re_qp *qp1_qp;
> +};
> +
>  #define BNXT_RE_MIN_MSIX		2
>  #define BNXT_RE_MAX_MSIX		16
>  #define BNXT_RE_AEQ_IDX			0
> @@ -112,6 +120,12 @@ struct bnxt_re_dev {
>  	atomic_t			mw_count;
>  	/* Max of 2 lossless traffic class supported per port */
>  	u16				cosq[2];
> +
> +	/* QP for handling QP1 packets */
> +	u32				sqp_id;
> +	struct bnxt_re_qp		*qp1_sqp;
> +	struct bnxt_re_ah		*sqp_ah;
> +	struct bnxt_re_sqp_entries sqp_tbl[1024];
>  };
>
>  #define to_bnxt_re(ptr, type, member)	\
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> index 5e41317..77860a2 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> @@ -649,6 +649,481 @@ int bnxt_re_query_ah(struct ib_ah *ib_ah, struct ib_ah_attr *ah_attr)
>  	return 0;
>  }
>
> +/* Queue Pairs */
> +int bnxt_re_destroy_qp(struct ib_qp *ib_qp)
> +{
> +	struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> +	struct bnxt_re_dev *rdev = qp->rdev;
> +	int rc;
> +
> +	rc = bnxt_qplib_destroy_qp(&rdev->qplib_res, &qp->qplib_qp);
> +	if (rc) {
> +		dev_err(rdev_to_dev(rdev), "Failed to destroy HW QP");
> +		return rc;
> +	}
> +	if (ib_qp->qp_type == IB_QPT_GSI && rdev->qp1_sqp) {
> +		rc = bnxt_qplib_destroy_ah(&rdev->qplib_res,
> +					   &rdev->sqp_ah->qplib_ah);
> +		if (rc) {
> +			dev_err(rdev_to_dev(rdev),
> +				"Failed to destroy HW AH for shadow QP");
> +			return rc;
> +		}
> +
> +		rc = bnxt_qplib_destroy_qp(&rdev->qplib_res,
> +					   &rdev->qp1_sqp->qplib_qp);
> +		if (rc) {
> +			dev_err(rdev_to_dev(rdev),
> +				"Failed to destroy Shadow QP");
> +			return rc;
> +		}
> +		mutex_lock(&rdev->qp_lock);
> +		list_del(&rdev->qp1_sqp->list);
> +		atomic_dec(&rdev->qp_count);
> +		mutex_unlock(&rdev->qp_lock);
> +
> +		kfree(rdev->sqp_ah);
> +		kfree(rdev->qp1_sqp);
> +	}
> +
> +	if (qp->rumem && !IS_ERR(qp->rumem))
> +		ib_umem_release(qp->rumem);
> +	if (qp->sumem && !IS_ERR(qp->sumem))
> +		ib_umem_release(qp->sumem);
> +
> +	mutex_lock(&rdev->qp_lock);
> +	list_del(&qp->list);
> +	atomic_dec(&rdev->qp_count);
> +	mutex_unlock(&rdev->qp_lock);
> +	kfree(qp);
> +	return 0;
> +}
> +
> +static u8 __from_ib_qp_type(enum ib_qp_type type)
> +{
> +	switch (type) {
> +	case IB_QPT_GSI:
> +		return CMDQ_CREATE_QP1_TYPE_GSI;
> +	case IB_QPT_RC:
> +		return CMDQ_CREATE_QP_TYPE_RC;
> +	case IB_QPT_UD:
> +		return CMDQ_CREATE_QP_TYPE_UD;
> +	case IB_QPT_RAW_ETHERTYPE:
> +		return CMDQ_CREATE_QP_TYPE_RAW_ETHERTYPE;
> +	default:
> +		return IB_QPT_MAX;
> +	}
> +}
> +
> +static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
> +			 struct bnxt_re_qp *qp, struct ib_udata *udata)
> +{
> +	struct bnxt_re_qp_req ureq;
> +	struct bnxt_qplib_qp *qplib_qp = &qp->qplib_qp;
> +	struct ib_umem *umem;
> +	int bytes = 0;
> +	struct ib_ucontext *context = pd->ib_pd.uobject->context;
> +	struct bnxt_re_ucontext *cntx = to_bnxt_re(context,
> +						  struct bnxt_re_ucontext,
> +						  ib_uctx);
> +	if (ib_copy_from_udata(&ureq, udata, sizeof(ureq)))
> +		return -EFAULT;
> +
> +	bytes = (qplib_qp->sq.max_wqe * BNXT_QPLIB_MAX_SQE_ENTRY_SIZE);
> +	/* Consider mapping PSN search memory only for RC QPs. */
> +	if (qplib_qp->type == CMDQ_CREATE_QP_TYPE_RC)
> +		bytes += (qplib_qp->sq.max_wqe * sizeof(struct sq_psn_search));
> +	bytes = PAGE_ALIGN(bytes);
> +	umem = ib_umem_get(context, ureq.qpsva, bytes,
> +			   IB_ACCESS_LOCAL_WRITE, 1);
> +	if (IS_ERR(umem))
> +		return PTR_ERR(umem);
> +
> +	qp->sumem = umem;
> +	qplib_qp->sq.sglist = umem->sg_head.sgl;
> +	qplib_qp->sq.nmap = umem->nmap;
> +	qplib_qp->qp_handle = ureq.qp_handle;
> +
> +	if (!qp->qplib_qp.srq) {
> +		bytes = (qplib_qp->rq.max_wqe * BNXT_QPLIB_MAX_RQE_ENTRY_SIZE);
> +		bytes = PAGE_ALIGN(bytes);
> +		umem = ib_umem_get(context, ureq.qprva, bytes,
> +				   IB_ACCESS_LOCAL_WRITE, 1);
> +		if (IS_ERR(umem))
> +			goto rqfail;
> +		qp->rumem = umem;
> +		qplib_qp->rq.sglist = umem->sg_head.sgl;
> +		qplib_qp->rq.nmap = umem->nmap;
> +	}
> +
> +	qplib_qp->dpi = cntx->dpi;
> +	return 0;
> +rqfail:
> +	ib_umem_release(qp->sumem);
> +	qp->sumem = NULL;
> +	qplib_qp->sq.sglist = NULL;
> +	qplib_qp->sq.nmap = 0;
> +
> +	return PTR_ERR(umem);
> +}
> +
> +static struct bnxt_re_ah *bnxt_re_create_shadow_qp_ah(struct bnxt_re_pd *pd,
> +					       struct bnxt_qplib_res *qp1_res,
> +					       struct bnxt_qplib_qp *qp1_qp)
> +{
> +	struct bnxt_re_dev *rdev = pd->rdev;
> +	struct bnxt_re_ah *ah;
> +	union ib_gid sgid;
> +	int rc;
> +
> +	ah = kzalloc(sizeof(*ah), GFP_KERNEL);
> +	if (!ah)
> +		return NULL;
> +
> +	memset(ah, 0, sizeof(*ah));
> +	ah->rdev = rdev;
> +	ah->qplib_ah.pd = &pd->qplib_pd;
> +
> +	rc = bnxt_re_query_gid(&rdev->ibdev, 1, 0, &sgid);
> +	if (rc)
> +		goto fail;
> +
> +	/* supply the dgid data same as sgid */
> +	memcpy(ah->qplib_ah.dgid.data, &sgid.raw,
> +	       sizeof(union ib_gid));
> +	ah->qplib_ah.sgid_index = 0;
> +
> +	ah->qplib_ah.traffic_class = 0;
> +	ah->qplib_ah.flow_label = 0;
> +	ah->qplib_ah.hop_limit = 1;
> +	ah->qplib_ah.sl = 0;
> +	/* Have DMAC same as SMAC */
> +	ether_addr_copy(ah->qplib_ah.dmac, rdev->netdev->dev_addr);
> +
> +	rc = bnxt_qplib_create_ah(&rdev->qplib_res, &ah->qplib_ah);
> +	if (rc) {
> +		dev_err(rdev_to_dev(rdev),
> +			"Failed to allocate HW AH for Shadow QP");
> +		goto fail;
> +	}
> +
> +	return ah;
> +
> +fail:
> +	kfree(ah);
> +	return NULL;
> +}
> +
> +static struct bnxt_re_qp *bnxt_re_create_shadow_qp(struct bnxt_re_pd *pd,
> +					    struct bnxt_qplib_res *qp1_res,
> +					    struct bnxt_qplib_qp *qp1_qp)
> +{
> +	struct bnxt_re_dev *rdev = pd->rdev;
> +	struct bnxt_re_qp *qp;
> +	int rc;
> +
> +	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> +	if (!qp)
> +		return NULL;
> +
> +	memset(qp, 0, sizeof(*qp));
> +	qp->rdev = rdev;
> +
> +	/* Initialize the shadow QP structure from the QP1 values */
> +	ether_addr_copy(qp->qplib_qp.smac, rdev->netdev->dev_addr);
> +
> +	qp->qplib_qp.pd = &pd->qplib_pd;
> +	qp->qplib_qp.qp_handle = (u64)&qp->qplib_qp;
> +	qp->qplib_qp.type = IB_QPT_UD;
> +
> +	qp->qplib_qp.max_inline_data = 0;
> +	qp->qplib_qp.sig_type = true;
> +
> +	/* Shadow QP SQ depth should be same as QP1 RQ depth */
> +	qp->qplib_qp.sq.max_wqe = qp1_qp->rq.max_wqe;
> +	qp->qplib_qp.sq.max_sge = 2;
> +
> +	qp->qplib_qp.scq = qp1_qp->scq;
> +	qp->qplib_qp.rcq = qp1_qp->rcq;
> +
> +	qp->qplib_qp.rq.max_wqe = qp1_qp->rq.max_wqe;
> +	qp->qplib_qp.rq.max_sge = qp1_qp->rq.max_sge;
> +
> +	qp->qplib_qp.mtu = qp1_qp->mtu;
> +
> +	qp->qplib_qp.sq_hdr_buf_size = 0;
> +	qp->qplib_qp.rq_hdr_buf_size = BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV6;
> +	qp->qplib_qp.dpi = &rdev->dpi_privileged;
> +
> +	rc = bnxt_qplib_create_qp(qp1_res, &qp->qplib_qp);
> +	if (rc)
> +		goto fail;
> +
> +	rdev->sqp_id = qp->qplib_qp.id;
> +
> +	spin_lock_init(&qp->sq_lock);
> +	INIT_LIST_HEAD(&qp->list);
> +	mutex_lock(&rdev->qp_lock);
> +	list_add_tail(&qp->list, &rdev->qp_list);
> +	atomic_inc(&rdev->qp_count);
> +	mutex_unlock(&rdev->qp_lock);
> +	return qp;
> +fail:
> +	kfree(qp);
> +	return NULL;
> +}
> +
> +struct ib_qp *bnxt_re_create_qp(struct ib_pd *ib_pd,
> +				struct ib_qp_init_attr *qp_init_attr,
> +				struct ib_udata *udata)
> +{
> +	struct bnxt_re_pd *pd = to_bnxt_re(ib_pd, struct bnxt_re_pd, ib_pd);
> +	struct bnxt_re_dev *rdev = pd->rdev;
> +	struct bnxt_qplib_dev_attr *dev_attr = &rdev->dev_attr;
> +	struct bnxt_re_qp *qp;
> +	struct bnxt_re_srq *srq;
> +	struct bnxt_re_cq *cq;
> +	int rc, entries;
> +
> +	if ((qp_init_attr->cap.max_send_wr > dev_attr->max_qp_wqes) ||
> +	    (qp_init_attr->cap.max_recv_wr > dev_attr->max_qp_wqes) ||
> +	    (qp_init_attr->cap.max_send_sge > dev_attr->max_qp_sges) ||
> +	    (qp_init_attr->cap.max_recv_sge > dev_attr->max_qp_sges) ||
> +	    (qp_init_attr->cap.max_inline_data > dev_attr->max_inline_data))
> +		return ERR_PTR(-EINVAL);
> +
> +	qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> +	if (!qp)
> +		return ERR_PTR(-ENOMEM);
> +
> +	qp->rdev = rdev;
> +	ether_addr_copy(qp->qplib_qp.smac, rdev->netdev->dev_addr);
> +	qp->qplib_qp.pd = &pd->qplib_pd;
> +	qp->qplib_qp.qp_handle = (u64)&qp->qplib_qp;
> +	qp->qplib_qp.type = __from_ib_qp_type(qp_init_attr->qp_type);
> +	if (qp->qplib_qp.type == IB_QPT_MAX) {
> +		dev_err(rdev_to_dev(rdev), "QP type 0x%x not supported",
> +			qp->qplib_qp.type);
> +		rc = -EINVAL;
> +		goto fail;
> +	}
> +	qp->qplib_qp.max_inline_data = qp_init_attr->cap.max_inline_data;
> +	qp->qplib_qp.sig_type = ((qp_init_attr->sq_sig_type ==
> +				  IB_SIGNAL_ALL_WR) ? true : false);
> +
> +	entries = roundup_pow_of_two(qp_init_attr->cap.max_send_wr + 1);
> +	if (entries > dev_attr->max_qp_wqes + 1)
> +		entries = dev_attr->max_qp_wqes + 1;
> +	qp->qplib_qp.sq.max_wqe = entries;
> +
> +	qp->qplib_qp.sq.max_sge = qp_init_attr->cap.max_send_sge;
> +	if (qp->qplib_qp.sq.max_sge > dev_attr->max_qp_sges)
> +		qp->qplib_qp.sq.max_sge = dev_attr->max_qp_sges;
> +
> +	if (qp_init_attr->send_cq) {
> +		cq = to_bnxt_re(qp_init_attr->send_cq, struct bnxt_re_cq,
> +				ib_cq);
> +		if (!cq) {
> +			dev_err(rdev_to_dev(rdev), "Send CQ not found");
> +			rc = -EINVAL;
> +			goto fail;
> +		}
> +		qp->qplib_qp.scq = &cq->qplib_cq;
> +	}
> +
> +	if (qp_init_attr->recv_cq) {
> +		cq = to_bnxt_re(qp_init_attr->recv_cq, struct bnxt_re_cq,
> +				ib_cq);
> +		if (!cq) {
> +			dev_err(rdev_to_dev(rdev), "Receive CQ not found");
> +			rc = -EINVAL;
> +			goto fail;
> +		}
> +		qp->qplib_qp.rcq = &cq->qplib_cq;
> +	}
> +
> +	if (qp_init_attr->srq) {
> +		dev_err(rdev_to_dev(rdev), "SRQ not supported");
> +		rc = -ENOTSUPP;
> +		goto fail;
> +	} else {
> +		/* Allocate 1 more than what's provided so posting max doesn't
> +		 * mean empty
> +		 */
> +		entries = roundup_pow_of_two(qp_init_attr->cap.max_recv_wr + 1);
> +		if (entries > dev_attr->max_qp_wqes + 1)
> +			entries = dev_attr->max_qp_wqes + 1;
> +		qp->qplib_qp.rq.max_wqe = entries;
> +
> +		qp->qplib_qp.rq.max_sge = qp_init_attr->cap.max_recv_sge;
> +		if (qp->qplib_qp.rq.max_sge > dev_attr->max_qp_sges)
> +			qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> +	}
> +
> +	qp->qplib_qp.mtu = ib_mtu_enum_to_int(iboe_get_mtu(rdev->netdev->mtu));
> +
> +	if (qp_init_attr->qp_type == IB_QPT_GSI) {
> +		qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> +		if (qp->qplib_qp.rq.max_sge > dev_attr->max_qp_sges)
> +			qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> +		qp->qplib_qp.sq.max_sge++;
> +		if (qp->qplib_qp.sq.max_sge > dev_attr->max_qp_sges)
> +			qp->qplib_qp.sq.max_sge = dev_attr->max_qp_sges;
> +
> +		qp->qplib_qp.rq_hdr_buf_size =
> +					BNXT_QPLIB_MAX_QP1_RQ_HDR_SIZE_V2;
> +
> +		qp->qplib_qp.sq_hdr_buf_size =
> +					BNXT_QPLIB_MAX_QP1_SQ_HDR_SIZE_V2;
> +		qp->qplib_qp.dpi = &rdev->dpi_privileged;
> +		rc = bnxt_qplib_create_qp1(&rdev->qplib_res, &qp->qplib_qp);
> +		if (rc) {
> +			dev_err(rdev_to_dev(rdev), "Failed to create HW QP1");
> +			goto fail;
> +		}
> +		/* Create a shadow QP to handle the QP1 traffic */
> +		rdev->qp1_sqp = bnxt_re_create_shadow_qp(pd, &rdev->qplib_res,
> +							 &qp->qplib_qp);
> +		if (!rdev->qp1_sqp) {
> +			rc = -EINVAL;
> +			dev_err(rdev_to_dev(rdev),
> +				"Failed to create Shadow QP for QP1");
> +			goto qp_destroy;
> +		}
> +		rdev->sqp_ah = bnxt_re_create_shadow_qp_ah(pd, &rdev->qplib_res,
> +							   &qp->qplib_qp);
> +		if (!rdev->sqp_ah) {
> +			bnxt_qplib_destroy_qp(&rdev->qplib_res,
> +					      &rdev->qp1_sqp->qplib_qp);
> +			rc = -EINVAL;
> +			dev_err(rdev_to_dev(rdev),
> +				"Failed to create AH entry for ShadowQP");
> +			goto qp_destroy;
> +		}
> +
> +	} else {
> +		qp->qplib_qp.max_rd_atomic = dev_attr->max_qp_rd_atom;
> +		qp->qplib_qp.max_dest_rd_atomic = dev_attr->max_qp_init_rd_atom;
> +		if (udata) {
> +			rc = bnxt_re_init_user_qp(rdev, pd, qp, udata);
> +			if (rc)
> +				goto fail;
> +		} else {
> +			qp->qplib_qp.dpi = &rdev->dpi_privileged;
> +		}
> +
> +		rc = bnxt_qplib_create_qp(&rdev->qplib_res, &qp->qplib_qp);
> +		if (rc) {
> +			dev_err(rdev_to_dev(rdev), "Failed to create HW QP");
> +			goto fail;
> +		}
> +	}
> +
> +	qp->ib_qp.qp_num = qp->qplib_qp.id;
> +	spin_lock_init(&qp->sq_lock);
> +
> +	if (udata) {
> +		struct bnxt_re_qp_resp resp;
> +
> +		resp.qpid = qp->ib_qp.qp_num;
> +		rc = bnxt_re_copy_to_udata(rdev, &resp, sizeof(resp), udata);
> +		if (rc) {
> +			dev_err(rdev_to_dev(rdev), "Failed to copy QP udata");
> +			goto qp_destroy;
> +		}
> +	}
> +	INIT_LIST_HEAD(&qp->list);
> +	mutex_lock(&rdev->qp_lock);
> +	list_add_tail(&qp->list, &rdev->qp_list);
> +	atomic_inc(&rdev->qp_count);
> +	mutex_unlock(&rdev->qp_lock);
> +
> +	return &qp->ib_qp;
> +qp_destroy:
> +	bnxt_qplib_destroy_qp(&rdev->qplib_res, &qp->qplib_qp);
> +fail:
> +	kfree(qp);
> +	return ERR_PTR(rc);
> +}
> +
> +static u8 __from_ib_qp_state(enum ib_qp_state state)
> +{
> +	switch (state) {
> +	case IB_QPS_RESET:
> +		return CMDQ_MODIFY_QP_NEW_STATE_RESET;
> +	case IB_QPS_INIT:
> +		return CMDQ_MODIFY_QP_NEW_STATE_INIT;
> +	case IB_QPS_RTR:
> +		return CMDQ_MODIFY_QP_NEW_STATE_RTR;
> +	case IB_QPS_RTS:
> +		return CMDQ_MODIFY_QP_NEW_STATE_RTS;
> +	case IB_QPS_SQD:
> +		return CMDQ_MODIFY_QP_NEW_STATE_SQD;
> +	case IB_QPS_SQE:
> +		return CMDQ_MODIFY_QP_NEW_STATE_SQE;
> +	case IB_QPS_ERR:
> +	default:
> +		return CMDQ_MODIFY_QP_NEW_STATE_ERR;
> +	}
> +}
> +
> +static enum ib_qp_state __to_ib_qp_state(u8 state)
> +{
> +	switch (state) {
> +	case CMDQ_MODIFY_QP_NEW_STATE_RESET:
> +		return IB_QPS_RESET;
> +	case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> +		return IB_QPS_INIT;
> +	case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> +		return IB_QPS_RTR;
> +	case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> +		return IB_QPS_RTS;
> +	case CMDQ_MODIFY_QP_NEW_STATE_SQD:
> +		return IB_QPS_SQD;
> +	case CMDQ_MODIFY_QP_NEW_STATE_SQE:
> +		return IB_QPS_SQE;
> +	case CMDQ_MODIFY_QP_NEW_STATE_ERR:
> +	default:
> +		return IB_QPS_ERR;
> +	}
> +}
> +
> +static u32 __from_ib_mtu(enum ib_mtu mtu)
> +{
> +	switch (mtu) {
> +	case IB_MTU_256:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_256;
> +	case IB_MTU_512:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_512;
> +	case IB_MTU_1024:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_1024;
> +	case IB_MTU_2048:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> +	case IB_MTU_4096:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_4096;
> +	default:
> +		return CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> +	}
> +}
> +
> +static enum ib_mtu __to_ib_mtu(u32 mtu)
> +{
> +	switch (mtu & CREQ_QUERY_QP_RESP_SB_PATH_MTU_MASK) {
> +	case CMDQ_MODIFY_QP_PATH_MTU_MTU_256:
> +		return IB_MTU_256;
> +	case CMDQ_MODIFY_QP_PATH_MTU_MTU_512:
> +		return IB_MTU_512;
> +	case CMDQ_MODIFY_QP_PATH_MTU_MTU_1024:
> +		return IB_MTU_1024;
> +	case CMDQ_MODIFY_QP_PATH_MTU_MTU_2048:
> +		return IB_MTU_2048;
> +	case CMDQ_MODIFY_QP_PATH_MTU_MTU_4096:
> +		return IB_MTU_4096;
> +	default:
> +		return IB_MTU_2048;
> +	}
> +}
> +
>  static int __from_ib_access_flags(int iflags)
>  {
>  	int qflags = 0;
> @@ -690,6 +1165,293 @@ static enum ib_access_flags __to_ib_access_flags(int qflags)
>  		iflags |= IB_ACCESS_ON_DEMAND;
>  	return iflags;
>  };
> +
> +static int bnxt_re_modify_shadow_qp(struct bnxt_re_dev *rdev,
> +			     struct bnxt_re_qp *qp1_qp,
> +			     int qp_attr_mask)
> +{
> +	struct bnxt_re_qp *qp = rdev->qp1_sqp;
> +	int rc = 0;
> +
> +	if (qp_attr_mask & IB_QP_STATE) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_STATE;
> +		qp->qplib_qp.state = qp1_qp->qplib_qp.state;
> +	}
> +	if (qp_attr_mask & IB_QP_PKEY_INDEX) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_PKEY;
> +		qp->qplib_qp.pkey_index = qp1_qp->qplib_qp.pkey_index;
> +	}
> +
> +	if (qp_attr_mask & IB_QP_QKEY) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_QKEY;
> +		/* Using a Random  QKEY */
> +		qp->qplib_qp.qkey = 0x81818181;
> +	}
> +	if (qp_attr_mask & IB_QP_SQ_PSN) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN;
> +		qp->qplib_qp.sq.psn = qp1_qp->qplib_qp.sq.psn;
> +	}
> +
> +	rc = bnxt_qplib_modify_qp(&rdev->qplib_res, &qp->qplib_qp);
> +	if (rc)
> +		dev_err(rdev_to_dev(rdev),
> +			"Failed to modify Shadow QP for QP1");
> +	return rc;
> +}
> +
> +int bnxt_re_modify_qp(struct ib_qp *ib_qp, struct ib_qp_attr *qp_attr,
> +		      int qp_attr_mask, struct ib_udata *udata)
> +{
> +	struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> +	struct bnxt_re_dev *rdev = qp->rdev;
> +	struct bnxt_qplib_dev_attr *dev_attr = &rdev->dev_attr;
> +	enum ib_qp_state curr_qp_state, new_qp_state;
> +	int rc, entries;
> +	int status;
> +	union ib_gid sgid;
> +	struct ib_gid_attr sgid_attr;
> +	u8 nw_type;
> +
> +	qp->qplib_qp.modify_flags = 0;
> +	if (qp_attr_mask & IB_QP_STATE) {
> +		curr_qp_state = __to_ib_qp_state(qp->qplib_qp.cur_qp_state);
> +		new_qp_state = qp_attr->qp_state;
> +		if (!ib_modify_qp_is_ok(curr_qp_state, new_qp_state,
> +					ib_qp->qp_type, qp_attr_mask,
> +					IB_LINK_LAYER_ETHERNET)) {
> +			dev_err(rdev_to_dev(rdev),
> +				"Invalid attribute mask: %#x specified ",
> +				qp_attr_mask);
> +			dev_err(rdev_to_dev(rdev),
> +				"for qpn: %#x type: %#x",
> +				ib_qp->qp_num, ib_qp->qp_type);
> +			dev_err(rdev_to_dev(rdev),
> +				"curr_qp_state=0x%x, new_qp_state=0x%x\n",
> +				curr_qp_state, new_qp_state);
> +			return -EINVAL;
> +		}
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_STATE;
> +		qp->qplib_qp.state = __from_ib_qp_state(qp_attr->qp_state);
> +	}
> +	if (qp_attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_EN_SQD_ASYNC_NOTIFY;
> +		qp->qplib_qp.en_sqd_async_notify = true;
> +	}
> +	if (qp_attr_mask & IB_QP_ACCESS_FLAGS) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS;
> +		qp->qplib_qp.access =
> +			__from_ib_access_flags(qp_attr->qp_access_flags);
> +		/* LOCAL_WRITE access must be set to allow RC receive */
> +		qp->qplib_qp.access |= BNXT_QPLIB_ACCESS_LOCAL_WRITE;
> +	}
> +	if (qp_attr_mask & IB_QP_PKEY_INDEX) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_PKEY;
> +		qp->qplib_qp.pkey_index = qp_attr->pkey_index;
> +	}
> +	if (qp_attr_mask & IB_QP_QKEY) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_QKEY;
> +		qp->qplib_qp.qkey = qp_attr->qkey;
> +	}
> +	if (qp_attr_mask & IB_QP_AV) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_DGID |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC |
> +				     CMDQ_MODIFY_QP_MODIFY_MASK_VLAN_ID;
> +		memcpy(qp->qplib_qp.ah.dgid.data, qp_attr->ah_attr.grh.dgid.raw,
> +		       sizeof(qp->qplib_qp.ah.dgid.data));
> +		qp->qplib_qp.ah.flow_label = qp_attr->ah_attr.grh.flow_label;
> +		/* If RoCE V2 is enabled, stack will have two entries for
> +		 * each GID entry. Avoiding this duplicate entry in HW. Dividing
> +		 * the GID index by 2 for RoCE V2
> +		 */
> +		qp->qplib_qp.ah.sgid_index =
> +					qp_attr->ah_attr.grh.sgid_index / 2;
> +		qp->qplib_qp.ah.host_sgid_index =
> +					qp_attr->ah_attr.grh.sgid_index;
> +		qp->qplib_qp.ah.hop_limit = qp_attr->ah_attr.grh.hop_limit;
> +		qp->qplib_qp.ah.traffic_class =
> +					qp_attr->ah_attr.grh.traffic_class;
> +		qp->qplib_qp.ah.sl = qp_attr->ah_attr.sl;
> +		ether_addr_copy(qp->qplib_qp.ah.dmac, qp_attr->ah_attr.dmac);
> +
> +		status = ib_get_cached_gid(&rdev->ibdev, 1,
> +					   qp_attr->ah_attr.grh.sgid_index,
> +					   &sgid, &sgid_attr);
> +		if (!status && sgid_attr.ndev) {
> +			memcpy(qp->qplib_qp.smac, sgid_attr.ndev->dev_addr,
> +			       ETH_ALEN);
> +			dev_put(sgid_attr.ndev);
> +			nw_type = ib_gid_to_network_type(sgid_attr.gid_type,
> +							 &sgid);
> +			switch (nw_type) {
> +			case RDMA_NETWORK_IPV4:
> +				qp->qplib_qp.nw_type =
> +					CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV2_IPV4;
> +				break;
> +			case RDMA_NETWORK_IPV6:
> +				qp->qplib_qp.nw_type =
> +					CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV2_IPV6;
> +				break;
> +			default:
> +				qp->qplib_qp.nw_type =
> +					CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV1;
> +				break;
> +			}
> +		}
> +	}
> +
> +	if (qp_attr_mask & IB_QP_PATH_MTU) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> +		qp->qplib_qp.path_mtu = __from_ib_mtu(qp_attr->path_mtu);
> +	} else if (qp_attr->qp_state == IB_QPS_RTR) {
> +		qp->qplib_qp.modify_flags |=
> +			CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> +		qp->qplib_qp.path_mtu =
> +			__from_ib_mtu(iboe_get_mtu(rdev->netdev->mtu));
> +	}
> +
> +	if (qp_attr_mask & IB_QP_TIMEOUT) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_TIMEOUT;
> +		qp->qplib_qp.timeout = qp_attr->timeout;
> +	}
> +	if (qp_attr_mask & IB_QP_RETRY_CNT) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_RETRY_CNT;
> +		qp->qplib_qp.retry_cnt = qp_attr->retry_cnt;
> +	}
> +	if (qp_attr_mask & IB_QP_RNR_RETRY) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_RNR_RETRY;
> +		qp->qplib_qp.rnr_retry = qp_attr->rnr_retry;
> +	}
> +	if (qp_attr_mask & IB_QP_MIN_RNR_TIMER) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER;
> +		qp->qplib_qp.min_rnr_timer = qp_attr->min_rnr_timer;
> +	}
> +	if (qp_attr_mask & IB_QP_RQ_PSN) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN;
> +		qp->qplib_qp.rq.psn = qp_attr->rq_psn;
> +	}
> +	if (qp_attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_MAX_RD_ATOMIC;
> +		qp->qplib_qp.max_rd_atomic = qp_attr->max_rd_atomic;
> +	}
> +	if (qp_attr_mask & IB_QP_SQ_PSN) {
> +		qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN;
> +		qp->qplib_qp.sq.psn = qp_attr->sq_psn;
> +	}
> +	if (qp_attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC;
> +		qp->qplib_qp.max_dest_rd_atomic = qp_attr->max_dest_rd_atomic;
> +	}
> +	if (qp_attr_mask & IB_QP_CAP) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_SQ_SIZE |
> +				CMDQ_MODIFY_QP_MODIFY_MASK_RQ_SIZE |
> +				CMDQ_MODIFY_QP_MODIFY_MASK_SQ_SGE |
> +				CMDQ_MODIFY_QP_MODIFY_MASK_RQ_SGE |
> +				CMDQ_MODIFY_QP_MODIFY_MASK_MAX_INLINE_DATA;
> +		if ((qp_attr->cap.max_send_wr >= dev_attr->max_qp_wqes) ||
> +		    (qp_attr->cap.max_recv_wr >= dev_attr->max_qp_wqes) ||
> +		    (qp_attr->cap.max_send_sge >= dev_attr->max_qp_sges) ||
> +		    (qp_attr->cap.max_recv_sge >= dev_attr->max_qp_sges) ||
> +		    (qp_attr->cap.max_inline_data >=
> +						dev_attr->max_inline_data)) {
> +			dev_err(rdev_to_dev(rdev),
> +				"Create QP failed - max exceeded");
> +			return -EINVAL;
> +		}
> +		entries = roundup_pow_of_two(qp_attr->cap.max_send_wr);
> +		if (entries > dev_attr->max_qp_wqes)
> +			entries = dev_attr->max_qp_wqes;
> +		qp->qplib_qp.sq.max_wqe = entries;
> +		qp->qplib_qp.sq.max_sge = qp_attr->cap.max_send_sge;
> +		if (qp->qplib_qp.rq.max_wqe) {
> +			entries = roundup_pow_of_two(qp_attr->cap.max_recv_wr);
> +			if (entries > dev_attr->max_qp_wqes)
> +				entries = dev_attr->max_qp_wqes;
> +			qp->qplib_qp.rq.max_wqe = entries;
> +			qp->qplib_qp.rq.max_sge = qp_attr->cap.max_recv_sge;
> +		} else {
> +			/* SRQ was used prior, just ignore the RQ caps */
> +		}
> +	}
> +	if (qp_attr_mask & IB_QP_DEST_QPN) {
> +		qp->qplib_qp.modify_flags |=
> +				CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID;
> +		qp->qplib_qp.dest_qpn = qp_attr->dest_qp_num;
> +	}
> +	rc = bnxt_qplib_modify_qp(&rdev->qplib_res, &qp->qplib_qp);
> +	if (rc) {
> +		dev_err(rdev_to_dev(rdev), "Failed to modify HW QP");
> +		return rc;
> +	}
> +	if (ib_qp->qp_type == IB_QPT_GSI && rdev->qp1_sqp)
> +		rc = bnxt_re_modify_shadow_qp(rdev, qp, qp_attr_mask);
> +	return rc;
> +}
> +
> +int bnxt_re_query_qp(struct ib_qp *ib_qp, struct ib_qp_attr *qp_attr,
> +		     int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr)
> +{
> +	struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> +	struct bnxt_re_dev *rdev = qp->rdev;
> +	struct bnxt_qplib_qp qplib_qp;
> +	int rc;
> +
> +	memset(&qplib_qp, 0, sizeof(struct bnxt_qplib_qp));
> +	qplib_qp.id = qp->qplib_qp.id;
> +	qplib_qp.ah.host_sgid_index = qp->qplib_qp.ah.host_sgid_index;
> +
> +	rc = bnxt_qplib_query_qp(&rdev->qplib_res, &qplib_qp);
> +	if (rc) {
> +		dev_err(rdev_to_dev(rdev), "Failed to query HW QP");
> +		return rc;
> +	}
> +	qp_attr->qp_state = __to_ib_qp_state(qplib_qp.state);
> +	qp_attr->en_sqd_async_notify = qplib_qp.en_sqd_async_notify ? 1 : 0;
> +	qp_attr->qp_access_flags = __to_ib_access_flags(qplib_qp.access);
> +	qp_attr->pkey_index = qplib_qp.pkey_index;
> +	qp_attr->qkey = qplib_qp.qkey;
> +	memcpy(qp_attr->ah_attr.grh.dgid.raw, qplib_qp.ah.dgid.data,
> +	       sizeof(qplib_qp.ah.dgid.data));
> +	qp_attr->ah_attr.grh.flow_label = qplib_qp.ah.flow_label;
> +	qp_attr->ah_attr.grh.sgid_index = qplib_qp.ah.host_sgid_index;
> +	qp_attr->ah_attr.grh.hop_limit = qplib_qp.ah.hop_limit;
> +	qp_attr->ah_attr.grh.traffic_class = qplib_qp.ah.traffic_class;
> +	qp_attr->ah_attr.sl = qplib_qp.ah.sl;
> +	ether_addr_copy(qp_attr->ah_attr.dmac, qplib_qp.ah.dmac);
> +	qp_attr->path_mtu = __to_ib_mtu(qplib_qp.path_mtu);
> +	qp_attr->timeout = qplib_qp.timeout;
> +	qp_attr->retry_cnt = qplib_qp.retry_cnt;
> +	qp_attr->rnr_retry = qplib_qp.rnr_retry;
> +	qp_attr->min_rnr_timer = qplib_qp.min_rnr_timer;
> +	qp_attr->rq_psn = qplib_qp.rq.psn;
> +	qp_attr->max_rd_atomic = qplib_qp.max_rd_atomic;
> +	qp_attr->sq_psn = qplib_qp.sq.psn;
> +	qp_attr->max_dest_rd_atomic = qplib_qp.max_dest_rd_atomic;
> +	qp_init_attr->sq_sig_type = qplib_qp.sig_type ? IB_SIGNAL_ALL_WR :
> +							IB_SIGNAL_REQ_WR;
> +	qp_attr->dest_qp_num = qplib_qp.dest_qpn;
> +
> +	qp_attr->cap.max_send_wr = qp->qplib_qp.sq.max_wqe;
> +	qp_attr->cap.max_send_sge = qp->qplib_qp.sq.max_sge;
> +	qp_attr->cap.max_recv_wr = qp->qplib_qp.rq.max_wqe;
> +	qp_attr->cap.max_recv_sge = qp->qplib_qp.rq.max_sge;
> +	qp_attr->cap.max_inline_data = qp->qplib_qp.max_inline_data;
> +	qp_init_attr->cap = qp_attr->cap;
> +
> +	return 0;
> +}
> +
>  /* Completion Queues */
>  int bnxt_re_destroy_cq(struct ib_cq *ib_cq)
>  {
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> index ba9a4c9..75ee88a 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> @@ -57,6 +57,19 @@ struct bnxt_re_ah {
>  	struct bnxt_qplib_ah	qplib_ah;
>  };
>
> +struct bnxt_re_qp {
> +	struct list_head	list;
> +	struct bnxt_re_dev	*rdev;
> +	struct ib_qp		ib_qp;
> +	spinlock_t		sq_lock;	/* protect sq */
> +	struct bnxt_qplib_qp	qplib_qp;
> +	struct ib_umem		*sumem;
> +	struct ib_umem		*rumem;
> +	/* QP1 */
> +	u32			send_psn;
> +	struct ib_ud_header	qp1_hdr;
> +};
> +
>  struct bnxt_re_cq {
>  	struct bnxt_re_dev	*rdev;
>  	spinlock_t              cq_lock;	/* protect cq */
> @@ -141,6 +154,14 @@ struct ib_ah *bnxt_re_create_ah(struct ib_pd *pd,
>  int bnxt_re_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
>  int bnxt_re_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
>  int bnxt_re_destroy_ah(struct ib_ah *ah);
> +struct ib_qp *bnxt_re_create_qp(struct ib_pd *pd,
> +				struct ib_qp_init_attr *qp_init_attr,
> +				struct ib_udata *udata);
> +int bnxt_re_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> +		      int qp_attr_mask, struct ib_udata *udata);
> +int bnxt_re_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> +		     int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
> +int bnxt_re_destroy_qp(struct ib_qp *qp);
>  struct ib_cq *bnxt_re_create_cq(struct ib_device *ibdev,
>  				const struct ib_cq_init_attr *attr,
>  				struct ib_ucontext *context,
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_main.c b/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> index 3d1504e..5facacc 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> @@ -445,6 +445,12 @@ static int bnxt_re_register_ib(struct bnxt_re_dev *rdev)
>  	ibdev->modify_ah		= bnxt_re_modify_ah;
>  	ibdev->query_ah			= bnxt_re_query_ah;
>  	ibdev->destroy_ah		= bnxt_re_destroy_ah;
> +
> +	ibdev->create_qp		= bnxt_re_create_qp;
> +	ibdev->modify_qp		= bnxt_re_modify_qp;
> +	ibdev->query_qp			= bnxt_re_query_qp;
> +	ibdev->destroy_qp		= bnxt_re_destroy_qp;
> +
>  	ibdev->create_cq		= bnxt_re_create_cq;
>  	ibdev->destroy_cq		= bnxt_re_destroy_cq;
>  	ibdev->req_notify_cq		= bnxt_re_req_notify_cq;
> diff --git a/include/uapi/rdma/bnxt_re_uverbs_abi.h b/include/uapi/rdma/bnxt_re_uverbs_abi.h
> index 5444eff..e6732f8 100644
> --- a/include/uapi/rdma/bnxt_re_uverbs_abi.h
> +++ b/include/uapi/rdma/bnxt_re_uverbs_abi.h
> @@ -66,6 +66,16 @@ struct bnxt_re_cq_resp {
>  	__u32 phase;
>  } __packed;
>
> +struct bnxt_re_qp_req {
> +	__u64 qpsva;
> +	__u64 qprva;
> +	__u64 qp_handle;
> +} __packed;
> +
> +struct bnxt_re_qp_resp {
> +	__u32 qpid;
> +} __packed;
> +
>  enum bnxt_re_shpg_offt {
>  	BNXT_RE_BEG_RESV_OFFT	= 0x00,
>  	BNXT_RE_AVID_OFFT	= 0x10,
> --
> 2.5.5
>


* Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
From: Leon Romanovsky @ 2016-12-12 18:18 UTC (permalink / raw)
  To: Ozgur Karatas, Tariq Toukan; +Cc: yishaih@mellanox.com, netdev, linux-kernel
In-Reply-To: <2090831481547868@web27h.yandex.ru>

[-- Attachment #1: Type: text/plain, Size: 2204 bytes --]

On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
> Dear Romanovsky;

Please avoid top-posting in your replies.
Thanks

>
> I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call "WARN_ON" function, get a error to implicit declaration, right?
>
> Because, you will use to "BUG_ON" get a error implicit declaration of functions.

I'm not sure that I followed you. mem->offset is set by sg_set_buf from
the buf variable returned by dma_alloc_coherent(). HW needs a very
precise size for this buf, in multiples of pages and aligned to page
boundaries.

>
>         sg_set_buf(mem, buf, PAGE_SIZE << order);
>         WARN_ON(mem->offset);

See the patch inline, which removes this BUG_ON in a proper and safe way.

From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
From: Leon Romanovsky <leonro@mellanox.com>
Date: Mon, 12 Dec 2016 20:02:45 +0200
Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine

This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address alignment in advance and performing proper
folding in case of error.

Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 2a9dd46..e1f9e7c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem,
 	if (!buf)
 		return -ENOMEM;

+	if (offset_in_page(buf)) {
+		dma_free_coherent(dev, PAGE_SIZE << order,
+				  buf, sg_dma_address(mem));
+		return -ENOMEM;
+	}
+
 	sg_set_buf(mem, buf, PAGE_SIZE << order);
-	BUG_ON(mem->offset);
 	sg_dma_len(mem) = PAGE_SIZE << order;
 	return 0;
 }
--
2.10.2
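
For readers unfamiliar with the alignment check, here is a minimal userspace
sketch of the same idea: test the low bits of the address and release the
buffer instead of crashing when it is not page aligned. This is not the mlx4
code; the 4 KiB page size and every example_* name are assumptions made only
for illustration, and aligned_alloc() stands in for dma_alloc_coherent().

```c
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Userspace stand-in for the kernel's offset_in_page(): the low bits of
 * the address, i.e. how far it sits past the start of its page. */
static unsigned long example_offset_in_page(const void *p)
{
	return (uintptr_t)p & (EXAMPLE_PAGE_SIZE - 1);
}

/* Allocate (EXAMPLE_PAGE_SIZE << order) bytes and reject the buffer,
 * freeing it first, if it does not start on a page boundary.
 * Returns NULL on any failure. */
static void *example_alloc_aligned(unsigned int order)
{
	size_t size = EXAMPLE_PAGE_SIZE << order;
	void *buf = aligned_alloc(EXAMPLE_PAGE_SIZE, size);

	if (!buf)
		return NULL;
	if (example_offset_in_page(buf)) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

The point of the pattern is that an unexpected offset becomes an ordinary
error return with the buffer released, rather than a fatal assertion.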


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-12 18:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Mike Rapoport, netdev@vger.kernel.org, linux-mm,
	Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
	Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
	Vladislav Yasevich
In-Reply-To: <20161212181344.3ddfa9c3@redhat.com>

On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings.  In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to
provide those services in software.

> I was shooting for a model that worked without hardware support.  And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem that then send and
receive packets from the registered memory regions. This is not strictly
remote memory access but this is a basic mode of operation supported by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack so that
memory regions can be registered with a network socket and work requests
can then be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not support, as usual.

The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place the "Remote memory access" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.

^ permalink raw reply

* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-12 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, Tom Herbert,
	Linux Kernel Network Developers
In-Reply-To: <1481343298.4930.208.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com> 
wrote:
> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
> 
>> 
>>  Hmm... Is your ephemeral port range includes the port your load
>>  balancing app is using ?
> 
> I suspect that you might have processes doing bind( port = 0) that are
> trapped into the bind_conflict() scan ?
> 
> With 100,000 + timewaits there, this possibly hurts.
> 
> Can you try the following loop breaker ?

It doesn't appear that the app is doing bind(port = 0) during normal 
operation.  I tested this patch and it made no difference.  I'm going 
to test simply restarting the app without changing to the SO_REUSEPORT 
option.  Thanks,

Josef

^ permalink raw reply

* Re: Soft lockup in tc_classify
From: Shahar Klein @ 2016-12-12 16:04 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: shahark, Roi Dayan, David Miller, Cong Wang, Jiri Pirko,
	John Fastabend, Or Gerlitz, Hadar Hen Zion
In-Reply-To: <584EA60B.80803@iogearbox.net>



On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
> Hi Shahar,
>
> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>> Hi All,
>>
>> sorry for the spam, the first time was sent with html part and was
>> rejected.
>>
>> We observed an issue where a classifier instance next member is
>> pointing back to itself, causing a CPU soft lockup.
>> We found it by running traffic on many udp connections and then adding
>> a new flower rule using tc.
>>
>> We added a quick workaround to verify it:
>>
>> In tc_classify:
>>
>>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>                  int err;
>> +               if (tp == tp->next)
>> +                     RCU_INIT_POINTER(tp->next, NULL);
>>
>>
>> We also had a print here showing tp->next is pointing to tp. With this
>> workaround we are not hitting the issue anymore.
>> We are not sure we fully understand the mechanism here - with the rtnl
>> and rcu locks.
>> We'll appreciate your help solving this issue.
>
> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation in involved during that test?
>
> Do you have a script and kernel .config for reproducing this?

I'm using a user space socket app
(https://github.com/shahar-klein/noodle) on a VM to push UDP packets
from ~2000 different UDP source ports, ramping up at ~100 per second, towards
another VM on the same hypervisor. Once the traffic starts, I push
ingress flower tc UDP rules (even_udp_src_port->mirred, odd->drop) on the
relevant representor in the hypervisor.

>
> Thanks,
> Daniel
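
To make the reported symptom concrete, here is a small userspace sketch of
why a chain entry whose next pointer refers back to itself never lets the
walk finish. It is not the tc_classify code: struct example_tp, the prio
field and the step limit are hypothetical, and no RCU or locking is modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of a classifier chain. */
struct example_tp {
	struct example_tp *next;
	int prio;
};

/* Walk the chain, giving up after max_steps entries. Returns the number
 * of entries visited, or -1 if the walk did not reach the end of the
 * chain, which is what happens when an entry points back to itself. */
static int example_walk(const struct example_tp *head, int max_steps)
{
	int n = 0;

	for (const struct example_tp *tp = head; tp; tp = tp->next) {
		if (n == max_steps)
			return -1;
		n++;
	}
	return n;
}
```

Without the step limit, the second case is an endless loop, which on a CPU
running with bottom halves disabled would show up as a soft lockup.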

^ permalink raw reply

* Re: [PATCH] sh_eth: add wake-on-lan support via magic packet
From: Sergei Shtylyov @ 2016-12-12 17:35 UTC (permalink / raw)
  To: Niklas Söderlund; +Cc: Simon Horman, netdev, linux-renesas-soc
In-Reply-To: <20161212154955.GA17342@bigcity.dyn.berto.se>

On 12/12/2016 06:49 PM, Niklas Söderlund wrote:

> Thanks for your feedback.

    Not at all, it's my duty now. :-)
    I should probably have warned you not to post the new version to netdev --
DaveM has closed his net-next.git tree (ahead of the usual time, which would
have been the 4.9 release), so your posting would only upset him...

[...]
>>>>   You only enable the WOL support for the R-Car gen2 chips but never say that
>>>> explicitly, neither in the subject nor here.
>>>>
>>>>> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
>>>>> ---
>>>>>  drivers/net/ethernet/renesas/sh_eth.c | 120 +++++++++++++++++++++++++++++++---
>>>>>  drivers/net/ethernet/renesas/sh_eth.h |   4 ++
>>>>>  2 files changed, 116 insertions(+), 8 deletions(-)
>>>>
>>>>> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
>>>>> index 05b0dc5..3974046 100644
>>>>> --- a/drivers/net/ethernet/renesas/sh_eth.c
>>>>> +++ b/drivers/net/ethernet/renesas/sh_eth.c
[...]
>>>> +		/* Handle MagicPacket interrupt */
>>>> +		if (sh_eth_read(ndev, ECSR) & ECSR_MPD)
>>
>>    What if it wasn't enabled ATM?
>
> Sorry I don't understand this comment.

    I'm trying to handle only the enabled interrupts but this hasn't been
consistently done yet (only for EESR, not ECSR), so never mind. :-)

[...]
>>>>> @@ -3150,15 +3193,71 @@ static int sh_eth_drv_remove(struct platform_device *pdev)
[...]

>>> This is how it's done in
>>> other parts of the driver when disabling interrupts.
>>
>>    Not in all parts of the driver that disable EESIPR interrupts... I must
>> confess that I never liked that 'mdp->irq_enabled' flag and still suspect we
>> can get things done without it... I need to look at this code again, sigh...

    Well, most probably we can't, but I have a patch almost ready that turns
the boolean flag into a u32 field holding the EESIPR value to be written next.
Would that help you?

>>> This is also why I only check for MagicPacket interrupts if irq_enabled
>>> is false.
>>
>>   I would have preferred that this was done with the other EMAC interrupts,
>> in sh_eth_error().
>
> I removed the check for Magic Packet in sh_eth_interrupt() and running
> without setting mdp->irq_enabled = false. sh_eth_error() will then clear
> any ECI interrupt so no need to add Magic Packet detection to it since
> all that is needed on Magic Packet is to clear the interrupt which
> already is done. This works and I can do multiple suspend/resume cycles,
> will be in v2 thanks for the suggestion.

    OK, let's see what you have when I have some more time. We have a lot of
time for ironing things out till net-next is opened again (which will happen
after -rc1)...

[...]
>>>>> +
>>>>> +	/* Enable MagicPacket */
>>>>> +	sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE);
>>>>> +
>>>>> +	/* Increased clock usage so device won't be suspended */
>>>>> +	clk_enable(mdp->clk);
>>>>
>>>>    Hum, intermixing runtime PM with clock API doesn't look good...
>>>
>>> I agree it looks weird but I need a way to increment the usage count for
>>> the clock otherwise the PM code will disable the module clock and WoL
>>> will not work.
>>
>>    How will it do it if you don't call sh_eth_close() in this case?
>>
>>> Note that this call will not enable the clock just
>>> increase the usage count so it won't be disabled when the PM code
>>> decrease it after the sh_eth suspend function is run.
>>
>>    You mean that the PM code calls RPM or clk API on its own? That's strange...
>
> Yes it calls clk API.

    Hum, will have to look into it as well...

[...]

MBR, Sergei

^ permalink raw reply

* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Jesper Dangaard Brouer @ 2016-12-12 17:13 UTC (permalink / raw)
  To: John Fastabend
  Cc: Mike Rapoport, netdev@vger.kernel.org, linux-mm, Willem de Bruijn,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Mel Gorman, Tom Herbert, Brenden Blanco, Tariq Toukan,
	Saeed Mahameed, Jesse Brandeburg, Kalman Meth, Vladislav Yasevich,
	brouer
In-Reply-To: <584EB8DF.8000308@gmail.com>

On Mon, 12 Dec 2016 06:49:03 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> >>  
> >>> Hello Jesper,
> >>>
> >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
> >>>> Hi all,
> >>>>
> >>>> This is my design for how to safely handle RX zero-copy in the network
> >>>> stack, by using page_pool[1] and modifying NIC drivers.  Safely means
> >>>> not leaking kernel info in pages mapped to userspace and resilience
> >>>> so a malicious userspace app cannot crash the kernel.
> >>>>
> >>>> Design target
> >>>> =============
> >>>>
> >>>> Allow the NIC to function as a normal Linux NIC and be shared in a
> >>>> safe manner, between the kernel network stack and an accelerated
> >>>> userspace application using RX zero-copy delivery.
> >>>>
> >>>> Target is to provide the basis for building RX zero-copy solutions in
> >>>> a memory safe manner.  An efficient communication channel for userspace
> >>>> delivery is out of scope for this document, but OOM considerations are
> >>>> discussed below (`Userspace delivery and OOM`_).    
> >>>
> >>> Sorry, if this reply is a bit off-topic.  
> >>
> >> It is very much on topic IMHO :-)
> >>  
> >>> I'm working on implementation of RX zero-copy for virtio and I've dedicated
> >>> some thought about making guest memory available for physical NIC DMAs.
> >>> I believe this is quite related to your page_pool proposal, at least from
> >>> the NIC driver perspective, so I'd like to share some thoughts here.  
> >>
> >> Seems quite related. I'm very interested in cooperating with you! I'm
> >> not very familiar with virtio, and how packets/pages gets channeled
> >> into virtio.  
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> >   
> 
> Great I'm also doing something similar.
> 
> My plan was to embed the zero copy as an AF_PACKET mode and then push
> a AF_PACKET backend into vhost. I'll post a patch later this week.
> 
> >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> >>> using macvtap, and then propagate guest RX memory allocations to the NIC
> >>> using something like new .ndo_set_rx_buffers method.  
> >>
> >> I believe the page_pool API/design aligns with this idea/use-case.
> >>  
> >>> What is your view about interface between the page_pool and the NIC
> >>> drivers?  
> >>
> >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> >> a page_pool per RX queue.  This is done for two reasons (1) performance
> >> and (2) for supporting use-cases where only one single RX-ring queue is
> >> (re)configured to support RX-zero-copy.  There are some associated
> >> extra cost of enabling this mode, thus it makes sense to only enable it
> >> when needed.
> >>
> >> I've not decided how this gets enabled, maybe some new driver NDO.  It
> >> could also happen when a XDP program gets loaded, which request this
> >> feature.
> >>
> >> The macvtap solution is nice and we should support it, but it requires
> >> VM to have their MAC-addr registered on the physical switch.  This
> >> design is about adding flexibility. Registering an XDP eBPF filter
> >> provides the maximum flexibility for matching the destination VM.  
> > 
> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.  
> 
> I don't think XDP has much to do with this code and they should be done
> separately. XDP runs eBPF code on received packets after the DMA engine
> has already placed the packet in memory so its too late in the process.

It does not have to be connected to XDP.  My idea should support RX
zero-copy into normal sockets, without XDP.

My idea was to pre-VMA map the RX ring when zero-copy is requested,
thus it is not too late in the process.  When a frame travels through the
normal network stack, it requires the SKB-read-only-page mode (skb-frags).
If the SKB reaches a socket that supports zero-copy, then we can do RX
zero-copy on normal sockets.

 
> The other piece here is enabling XDP in vhost but that is again separate
> IMO.
> 
> Notice that ixgbe supports pushing packets into a macvlan via 'tc'
> traffic steering commands so even though macvlan gets an L2 address it
> doesn't mean it can't use other criteria to steer traffic to it.

This sounds interesting, as it allows much more flexible macvlan
matching, which I like, but it still depends on HW support.

 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.  
> 
> Interesting this was one of the original ideas behind the macvlan
> offload mode. iirc Vlad also was interested in this.
> 
> I'm guessing this was used because of the ability to push macvlan onto
> its own queue?
> 
> >    
> >>  
> >>> Have you considered using "push" model for setting the NIC's RX memory?  
> >>
> >> I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".  
> 
> I prefer the ndo op. This matches up well with AF_PACKET model where we
> have "slots" and offload is just a transparent "push" of these "slots"
> to the driver. Below we have a snippet of our proposed API,

Hmmm. If you can rely on the hardware setup to give you steering and
dedicated access to the RX rings, then I guess the "push" model could be
a more direct API approach.

I was shooting for a model that works without hardware support, and
then transparently benefits from HW support by configuring a HW filter
into a specific RX queue and attaching to that queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply

* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Vivien Didelot @ 2016-12-12 17:11 UTC (permalink / raw)
  To: Florian Fainelli, Volodymyr Bendiuga
  Cc: Volodymyr Bendiuga, andrew, netdev, Volodymyr Bendiuga
In-Reply-To: <48ff1136-dd8f-7704-a512-c23b27989bf8@gmail.com>

Hi all,

Florian Fainelli <f.fainelli@gmail.com> writes:

> Seeing such a change makes me wonder if we should not try to push some
> of this hashtable abstraction (provided that we agree we want it) at a
> higher layer, like net/dsa/slave.c?

That is the major reason why I am reluctant to cache stuff in drivers.

In most cases, we want the DSA drivers to be "stupid", as stateless
as possible, simply implementing the supported DSA switch operations.

The DSA core then handles the generic logic of how switch fabrics should
behave, and thus all DSA drivers are consistent and benefit from this.

Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Paul Moore @ 2016-12-12 17:10 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: netdev, linux-kernel, edumazet, linux-audit, xiyou.wangcong,
	dvyukov
In-Reply-To: <5714bd7468cfec225407a6c367e658478d590495.1481534171.git.rgb@redhat.com>

On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a reference in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> <xiyou.wangcong@gmail.com> on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   34 ++++++++++++++++++++++++++++------
>  1 files changed, 28 insertions(+), 6 deletions(-)

This is coming in pretty late for the v4.10 merge window, much later
than I would usually take things, but this is arguably important, and
(at first glance) relatively low risk - what testing have you done on
this?
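
As a reading aid for the quoted commit message, the hold-on-store and
put-on-reset rule can be sketched in plain C as below. This is only an
illustration under stated assumptions: every example_* name is hypothetical,
the counter is a plain int rather than an atomic, and the caller is assumed
to be serialized (the role audit_cmd_mutex plays in the patch).

```c
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted socket. */
struct example_sock {
	int refcnt;
};

/* Hypothetical stand-in for the global pointer being protected. */
static struct example_sock *example_audit_sock;

static void example_hold(struct example_sock *sk) { sk->refcnt++; }
static void example_put(struct example_sock *sk)  { sk->refcnt--; }

/* Store a new socket: take a reference on the new one before dropping
 * the reference held on the old one, then publish the pointer. */
static void example_set(struct example_sock *sk)
{
	if (sk)
		example_hold(sk);
	if (example_audit_sock)
		example_put(example_audit_sock);
	example_audit_sock = sk;
}

/* Reset: drop the stored reference, if any, and clear the pointer. */
static void example_reset(void)
{
	example_set(NULL);
}
```

Checking for a NULL pointer before the put makes the reset path safe to call
when nothing is currently stored.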

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
>         struct sk_buff *skb;
>
>         /* break the connection */
> +       sock_put(audit_sock);
>         audit_pid = 0;
> +       audit_nlk_portid = 0;
>         audit_sock = NULL;
>
>         /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
>         if (rc >= 0) {
>                 consume_skb(skb);
>                 rc = 0;
> +       } else {
> +               if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> +                       mutex_lock(&audit_cmd_mutex);
> +                       auditd_reset();
> +                       mutex_unlock(&audit_cmd_mutex);
> +               }
>         }
>
>         return rc;
> @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
>
>                                 auditd = 0;
>                                 if (AUDITD_BAD(rc, reschedule)) {
> +                                       mutex_lock(&audit_cmd_mutex);
>                                         auditd_reset();
> +                                       mutex_unlock(&audit_cmd_mutex);
>                                         reschedule = 0;
>                                 }
>                         } else
> @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
>                                 auditd = 0;
>                                 if (AUDITD_BAD(rc, reschedule)) {
>                                         kauditd_hold_skb(skb);
> +                                       mutex_lock(&audit_cmd_mutex);
>                                         auditd_reset();
> +                                       mutex_unlock(&audit_cmd_mutex);
>                                         reschedule = 0;
>                                 } else
>                                         /* temporary problem (we hope), queue
> @@ -623,7 +635,9 @@ quick_loop:
>                                 if (rc) {
>                                         auditd = 0;
>                                         if (AUDITD_BAD(rc, reschedule)) {
> +                                               mutex_lock(&audit_cmd_mutex);
>                                                 auditd_reset();
> +                                               mutex_unlock(&audit_cmd_mutex);
>                                                 reschedule = 0;
>                                         }
>
> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
>                                 return -EACCES;
>                         }
>                         if (audit_pid && new_pid &&
> -                           audit_replace(requesting_pid) != -ECONNREFUSED) {
> +                           (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {
>                                 audit_log_config_change("audit_pid", new_pid, audit_pid, 0);
>                                 return -EEXIST;
>                         }
>                         if (audit_enabled != AUDIT_OFF)
>                                 audit_log_config_change("audit_pid", new_pid, audit_pid, 1);
> -                       audit_pid = new_pid;
> -                       audit_nlk_portid = NETLINK_CB(skb).portid;
> -                       audit_sock = skb->sk;
> -                       if (!new_pid)
> +                       if (new_pid) {
> +                               if (audit_sock)
> +                                       sock_put(audit_sock);
> +                               audit_pid = new_pid;
> +                               audit_nlk_portid = NETLINK_CB(skb).portid;
> +                               sock_hold(skb->sk);
> +                               audit_sock = skb->sk;
> +                       } else {
>                                 auditd_reset();
> +                       }
>                         wake_up_interruptible(&kauditd_wait);
>                 }
>                 if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
> @@ -1283,8 +1302,11 @@ static void __net_exit audit_net_exit(struct net *net)
>  {
>         struct audit_net *aunet = net_generic(net, audit_net_id);
>         struct sock *sock = aunet->nlsk;
> -       if (sock == audit_sock)
> +       if (sock == audit_sock) {
> +               mutex_lock(&audit_cmd_mutex);
>                 auditd_reset();
> +               mutex_unlock(&audit_cmd_mutex);
> +       }
>
>         RCU_INIT_POINTER(aunet->nlsk, NULL);
>         synchronize_net();
> --
> 1.7.1
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
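
A side note on the `rc & (-ENOMEM|-EPERM|-ECONNREFUSED)` tests in the quoted
diff: negative errno values are not bit flags. On Linux EPERM is 1, so
-EPERM is -1 (all bits set), the OR of the three values is -1, and the AND
is non-zero for every non-zero rc. The sketch below shows this and one
possible explicit comparison; the example_* helpers are hypothetical and are
not part of the patch.

```c
#include <errno.h>

/* The mask as written in the quoted diff. Because -EPERM is -1, the
 * result has every bit set. */
static int example_mask(void)
{
	return -ENOMEM | -EPERM | -ECONNREFUSED;
}

/* Explicit comparison against the three error codes of interest. */
static int example_is_fatal(int rc)
{
	switch (rc) {
	case -ENOMEM:
	case -EPERM:
	case -ECONNREFUSED:
		return 1;
	default:
		return 0;
	}
}
```

With the mask, an unrelated error such as -EAGAIN is treated the same as the
three listed codes; the switch keeps them apart.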



-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
From: Jason Gunthorpe @ 2016-12-12 17:07 UTC (permalink / raw)
  To: Selvin Xavier
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CA+sbYW1ZEa5fndGkvN8OXr-orcUx4jaL73Di8zBJQX_uCdK=Ww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sat, Dec 10, 2016 at 11:06:58AM +0530, Selvin Xavier wrote:
> On Fri, Dec 9, 2016 at 12:17 PM, Selvin Xavier
> <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
> > I am preparing a git repository with these changes as per Jason's
> > comment and will share the details later today.
> 
> Please use bnxt_re branch in this git repository.
> 
> https://github.com/Broadcom/linux-rdma-nxt.git

Why are you using __packed in bnxt_re_uverbs_abi.h ? that doesn't seem
necessary. It is a good idea to make sure all those structures are a
multiple of 64 bits (add explicit reserved fields), and make sure you
test 32 bit verbs as well.
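
A minimal sketch of what this suggestion could look like, using the field
names from the quoted uverbs ABI patch. The example_ prefix, the rsvd field
and the use of stdint types instead of __u32/__u64 are assumptions made so
that the fragment builds in userspace; this is not the driver's header.

```c
#include <stdint.h>

/* Three 64-bit fields: already a multiple of 64 bits, no packing needed. */
struct example_qp_req {
	uint64_t qpsva;
	uint64_t qprva;
	uint64_t qp_handle;
};

/* One 32-bit field plus an explicit reserved field, so the structure is
 * 8 bytes on both 32-bit and 64-bit builds without __packed. */
struct example_qp_resp {
	uint32_t qpid;
	uint32_t rsvd;
};

/* Compile-time checks that both layouts are multiples of 64 bits. */
_Static_assert(sizeof(struct example_qp_req) % 8 == 0,
	       "example_qp_req must be a multiple of 64 bits");
_Static_assert(sizeof(struct example_qp_resp) % 8 == 0,
	       "example_qp_resp must be a multiple of 64 bits");
```

With explicit reserved fields the compiler has no padding left to insert, so
32-bit and 64-bit userspace see the same layout.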

Why are you using debugfs just to export counters? Isn't the core code
counter framework good enough?

Please try and avoid writing functions as defines (eg rdev_to_dev,
to_bnxt_re, SQE_PG, RCFW_CMDQ_COOKIE, PTR_PG etc)
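
As an illustration of the alternative, a conversion helper can be written as
a static inline function so that the argument and return types are checked
by the compiler. The structures and names below are hypothetical; they do
not reproduce the actual bnxt_re definitions, which are not shown in this
thread.

```c
#include <stddef.h>

/* Hypothetical embedded and containing structures. */
struct example_ib_device {
	int index;
};

struct example_re_dev {
	int id;
	struct example_ib_device ibdev;
};

/* Typed replacement for a container_of-style macro: only accepts a
 * pointer to the embedded type and always returns the containing type. */
static inline struct example_re_dev *
example_to_re_dev(struct example_ib_device *ibdev)
{
	return (struct example_re_dev *)
		((char *)ibdev - offsetof(struct example_re_dev, ibdev));
}
```

Passing the wrong pointer type to the inline version produces a compiler
diagnostic, which a bare macro may silently accept.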

There is something wrong with the tabs and spaces (see
https://github.com/Broadcom/linux-rdma-nxt/blob/03e23b087f7e86ea28656273994e065827210ce5/drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h)

FWIW, I really dislike the column alignment style, it is so hard to
maintain..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add ethtool set regs support
From: Florian Fainelli @ 2016-12-12 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List,
	John W . Linville
In-Reply-To: <20161211152229.GC29761@lunn.ch>

On 12/11/2016 07:22 AM, Andrew Lunn wrote:
> On Sun, Dec 11, 2016 at 02:18:00PM +0200, Saeed Mahameed wrote:
>> On Wed, Dec 7, 2016 at 4:41 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
>>>> Hi Dave,
>>>>
>>>> This series adds the support for setting device registers from user
>>>> space ethtool.
>>>
>>> Is this not the start of allowing binary only drivers in user space?
>>>
>>
>> It is not, we want to do same as set_eeprom already do,
>> Just set some HW registers, for analysis/debug/tweak/configure HW
>> dependent register for the NIC netdev sake.
> 
> Mellanox has a good reputation of open drivers. However, this API
> sounds like it would be the first step towards user space
> drivers. This is an API which can peek and poke registers, so it
> probably could be mis-used to put part of a driver in user
> space. Hence we should avoid this sort of API to start with.

I don't necessarily share your concerns here on the proprietary vs. open
source driver, because this interface is limited to the register space,
not the data path; there are only a handful of things you can do here,
and getting a NIC to work is probably not one of them.

My concern is more with the support/debugging aspect: if someone starts
tweaking a gazillion registers through that interface, and I have no
way to tell, how am I going to support that? Of course, the first step
is: have you tried to turn it on and off again, and see if this is
reproducible, but what if I was asked/told to tweak this or that value
first, etc... it can be hard to collect the exact state in which a NIC
is at the time of the problem.

NB: on the proprietary driver side, you can already mmap() the PCI
device's space and write an entire user-space based driver (DPDK) and
bypass the kernel for most things, ethtool -D is not much worse here
since it just offers a subset (and a small one) of that.
-- 
Florian

^ permalink raw reply

* Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
From: Jonathan Toppins @ 2016-12-12 16:54 UTC (permalink / raw)
  To: Selvin Xavier, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1481266096-23331-1-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

On 12/09/2016 01:47 AM, Selvin Xavier wrote:
> This series introduces the RoCE driver for the Broadcom
> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. 
> This driver is dependent on the bnxt_en NIC driver and is 
> based on the bnxt_re branch in Doug's repository. bnxt_en changes
> required for this patch series is already available in this branch.
> 
> I am preparing a git repository with these changes as per Jason's
> comment and will share the details later today.
> 
> v1-> v2:
>   * The license text in each file updated to reflect Dual license.
>   * Makefile and Kconfig changes are pushed to the last patch
>   * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
>   * Remove duplicate structure definitions from bnxt_re_hsi.h as
>     it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
>   * Removed some unused code reported during code review.
>   * Fixed few sparse warnings
> 

I get the following sparse errors (filtered for only bnxt_re ones),
please let me know if they are false positives:

$ make C=2  drivers/net/ethernet/broadcom/bnxt/bnxt_en.ko
drivers/infiniband/hw/bnxtre/bnxt_re.ko
  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHECK   arch/x86/purgatory/purgatory.c
[...]
  CHECK   arch/x86/purgatory/sha256.c
  CHECK   arch/x86/purgatory/string.c
[...]
  CHK     include/generated/bounds.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  CHECK   scripts/mod/empty.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  MODPOST 2 modules
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_main.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
[...]
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_debugfs.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c
drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c:729:6: warning: symbol
'bnxt_qplib_cleanup_pkey_tbl' was not declared. Should it be static?
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context
imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit
drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context
imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock
  MODPOST 2 modules

-Jon

^ permalink raw reply
