* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Cong Wang @ 2016-12-12 23:58 UTC
To: Richard Guy Briggs
Cc: Linux Kernel Network Developers, LKML, linux-audit, Dmitry Vyukov,
Eric Dumazet, Eric Paris, Paul Moore, sgrubb
In-Reply-To: <5714bd7468cfec225407a6c367e658478d590495.1481534171.git.rgb@redhat.com>
On Mon, Dec 12, 2016 at 2:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a reference in
> audit_sock and release it when we reset audit_sock. audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> <xiyou.wangcong@gmail.com> on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues. This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
> kernel/audit.c | 34 ++++++++++++++++++++++++++++------
> 1 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
> struct sk_buff *skb;
>
> /* break the connection */
> + sock_put(audit_sock);
Why can't audit_sock be NULL here?
> audit_pid = 0;
> + audit_nlk_portid = 0;
> audit_sock = NULL;
>
> /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> if (rc >= 0) {
> consume_skb(skb);
> rc = 0;
> + } else {
> + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
Are these errno values bit flags?
> + mutex_lock(&audit_cmd_mutex);
> + auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> + }
> }
>
> return rc;
> @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
>
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> + mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
> } else
> @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> kauditd_hold_skb(skb);
> + mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> } else
> /* temporary problem (we hope), queue
> @@ -623,7 +635,9 @@ quick_loop:
> if (rc) {
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> + mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
>
> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> return -EACCES;
> }
> if (audit_pid && new_pid &&
> - audit_replace(requesting_pid) != -ECONNREFUSED) {
> + (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> audit_log_config_change("audit_pid", new_pid, audit_pid, 0);
> return -EEXIST;
> }
> if (audit_enabled != AUDIT_OFF)
> audit_log_config_change("audit_pid", new_pid, audit_pid, 1);
> - audit_pid = new_pid;
> - audit_nlk_portid = NETLINK_CB(skb).portid;
> - audit_sock = skb->sk;
> - if (!new_pid)
> + if (new_pid) {
> + if (audit_sock)
> + sock_put(audit_sock);
> + audit_pid = new_pid;
> + audit_nlk_portid = NETLINK_CB(skb).portid;
> + sock_hold(skb->sk);
Why is the refcnt still needed here? I need it because I removed the code
in the net exit code path.
> + audit_sock = skb->sk;
> + } else {
> auditd_reset();
> + }
> wake_up_interruptible(&kauditd_wait);
> }
> if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
> @@ -1283,8 +1302,11 @@ static void __net_exit audit_net_exit(struct net *net)
> {
> struct audit_net *aunet = net_generic(net, audit_net_id);
> struct sock *sock = aunet->nlsk;
> - if (sock == audit_sock)
> + if (sock == audit_sock) {
> + mutex_lock(&audit_cmd_mutex);
You need to put the if check inside the mutex too. Again, this could be
removed if you use refcnt.
> auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> + }
>
> RCU_INIT_POINTER(aunet->nlsk, NULL);
> synchronize_net();
> --
> 1.7.1
>
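The question about errno bits can be made concrete. Errno values are small integers rather than bit flags, and because -EPERM is -1 (all bits set in two's complement), the mask in the quoted hunk is itself -1, so the bitwise test is true for any nonzero rc. A minimal sketch, assuming Linux errno values; the helper names are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mask as written in the quoted hunk. -EPERM is -1 (all bits set),
 * so the OR of the three values is also -1. */
static int errno_mask(void)
{
	return -ENOMEM | -EPERM | -ECONNREFUSED;
}

/* Bitwise test as in the hunk: true for every nonzero rc. */
static bool bad_bitwise(int rc)
{
	return (rc & errno_mask()) != 0;
}

/* Explicit comparison: true only for the three listed errors. */
static bool bad_explicit(int rc)
{
	return rc == -ENOMEM || rc == -EPERM || rc == -ECONNREFUSED;
}

/* Self-check illustrating the difference for an unrelated error. */
static void check_errno_tests(void)
{
	assert(bad_bitwise(-EAGAIN));    /* unintended match */
	assert(!bad_explicit(-EAGAIN));  /* correctly rejected */
}
```

With the explicit comparison, an unrelated error such as -EAGAIN would no longer trigger auditd_reset().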
* Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)
From: Doug Ledford @ 2016-12-12 23:52 UTC
To: Selvin Xavier, linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1481266096-23331-1-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
On 12/9/2016 1:47 AM, Selvin Xavier wrote:
> This series introduces the RoCE driver for the Broadcom
> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs.
> This driver is dependent on the bnxt_en NIC driver and is
> based on the bnxt_re branch in Doug's repository. bnxt_en changes
> required for this patch series are already available in this branch.
>
> I am preparing a git repository with these changes as per Jason's
> comment and will share the details later today.
>
> v1-> v2:
> * The license text in each file updated to reflect Dual license.
> * Makefile and Kconfig changes are pushed to the last patch
> * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
> * Remove duplicate structure definitions from bnxt_re_hsi.h as
> it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
> * Removed some unused code reported during code review.
> * Fixed few sparse warnings
>
> Doug,
> Please review and consider applying this to linux-rdma repository.
There are still outstanding review comments to be addressed, and the
v2 patchset doesn't compile for me in 0day testing. I'm going to bounce
this one to 4.11.
--
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
GPG Key ID: 0E572FDD
* Re: [iproute2 v2 net-next 0/8] Add support for vrf helper
From: Stephen Hemminger @ 2016-12-12 23:43 UTC
To: David Ahern; +Cc: netdev
In-Reply-To: <1481401934-4026-1-git-send-email-dsa@cumulusnetworks.com>
On Sat, 10 Dec 2016 12:32:06 -0800
David Ahern <dsa@cumulusnetworks.com> wrote:
> This series adds support to iproute2 to run a command against a specific
> VRF. The user semantics are similar to 'ip netns'.
>
> The 'ip vrf' subcommand supports 3 usages:
>
> 1. Run a command against a given vrf:
> ip vrf exec NAME CMD
>
> Uses the recently committed cgroup/sock BPF option. vrf directory
> is added to cgroup2 mount. Individual vrfs are created under it. BPF
> filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the
> device index of the VRF. From there the current process (ip's pid) is
> added to the cgroup.procs file and the given command is executed. In
> doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically
> bound to the VRF domain.
>
> The association is inherited parent to child allowing the command to
> be a shell from which other commands are run relative to the VRF.
>
> 2. Show the VRF a process is bound to:
> ip vrf id
> This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
> entry.
>
> 3. Show process ids bound to a VRF
> ip vrf pids NAME
> This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
> shows the process ids in the particular vrf cgroup.
>
> v2
> - updated subject of patch 3 to avoid spam filters on vger
>
> David Ahern (8):
> lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
> bpf: export bpf_prog_load
> Add libbpf.h header with BPF_ macros
> move cmd_exec to lib utils
> Add filesystem APIs to lib
> change name_is_vrf to return index
> libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
> answers error
> Introduce ip vrf command
>
> include/bpf_util.h | 6 ++
> include/libbpf.h | 184 ++++++++++++++++++++++++++++++++
> include/libnetlink.h | 3 +
> include/utils.h | 4 +
> ip/Makefile | 3 +-
> ip/ip.c | 4 +-
> ip/ip_common.h | 4 +-
> ip/iplink_vrf.c | 29 ++++--
> ip/ipnetns.c | 34 ------
> ip/ipvrf.c | 289 +++++++++++++++++++++++++++++++++++++++++++++++++++
> lib/Makefile | 2 +-
> lib/bpf.c | 71 ++++++++-----
> lib/exec.c | 41 ++++++++
> lib/fs.c | 143 +++++++++++++++++++++++++
> lib/libnetlink.c | 20 +++-
> man/man8/ip-vrf.8 | 88 ++++++++++++++++
> 16 files changed, 850 insertions(+), 75 deletions(-)
> create mode 100644 include/libbpf.h
> create mode 100644 ip/ipvrf.c
> create mode 100644 lib/exec.c
> create mode 100644 lib/fs.c
> create mode 100644 man/man8/ip-vrf.8
>
Please use tooling that puts v2 on all the updated patches.
It makes it easier to spot them in patchwork.
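The cover letter says 'ip vrf id' essentially looks at /proc/pid/cgroup for a "::/vrf/" entry. A minimal sketch of that lookup on a single line follows. The helper name and buffer handling are hypothetical, and the code in ip/ipvrf.c may well differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: if a /proc/<pid>/cgroup line contains "::/vrf/",
 * copy the path component that follows into name and return true. */
static bool vrf_name_from_cgroup_line(const char *line, char *name, size_t len)
{
	static const char marker[] = "::/vrf/";
	const char *p;
	size_t n;

	assert(line && name && len > 0);

	p = strstr(line, marker);
	if (!p)
		return false;
	p += sizeof(marker) - 1;

	/* The VRF name ends at the next '/', newline, or end of string. */
	n = strcspn(p, "/\n");
	if (n == 0 || n >= len)
		return false;

	memcpy(name, p, n);
	name[n] = '\0';
	return true;
}
```

A caller would read /proc/<pid>/cgroup line by line and stop at the first line for which the helper returns true.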
* "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE
From: Theodore Ts'o @ 2016-12-12 23:33 UTC
To: jasowang; +Cc: netdev, mst, nhorman, davem
Hi,
I was doing a last minute regression test of the ext4 tree before
sending a pull request to Linus, which I do using gce-xfstests[1], and
I found that using networking was broken on GCE on linux-next. I was
using next-20161209, and after bisecting things, I narrowed down the
commit which was causing things to break to commit 449000102901:
"virtio-net: enable multiqueue by default". Reverting this commit on
top of next-20161209 fixed the problem.
[1] http://thunk.org/gce-xfstests
You can reproduce the problem by building the kernel for Google
Compute Engine --- I use a config such as this [2], and then try to
boot a kernel on a VM. The way I do this involves booting a test
appliance and then kexec'ing into the kernel to be tested[3], using a
2cpu configuration. (GCE machine type: n1-standard-2)
[2] https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9
[3] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
You can then take a look at the serial console using a command such as
"gcloud compute instances get-serial-port-output <instance-name>", and
you will get something like this (see attached). The important bit is
that the dhclient command completely fails to get a response from the
network, from which I deduce that networking send, receive, or both
are badly affected by the commit in question.
Please let me know if there's anything I can do to help you debug this
further.
Cheers,
- Ted
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] Linux version 4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] Command line: root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0 fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 fstestapi=1.3
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Dec 11 23:53:20 xfstests-201612120451 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel Variables...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel Variables.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static Device Nodes in /dev.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device Manager...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device Manager.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all Devices.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for Complete Device Initialization...
Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: clean, 56268/655360 files, 357439/2620928 blocks
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check on Root Device.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and Kernel File Systems...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and Kernel File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to make systemd work better on Debian.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random Seed...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems (Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File Systems (Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load/Save Random Seed.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Wait for Complete Device Initialization.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Activation of LVM2 logical volumes...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Copy rules generated while the root was ro...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS0.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS1.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Copy rules generated while the root was ro.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS2.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS3.
Dec 11 23:53:20 xfstests-201612120451 systemd-udevd[2568]: could not open moddep file '/lib/modules/4.9.0-rc8-ext4-06387-g03e5cbd/modules.dep.bin'
Dec 11 23:53:20 xfstests-201612120451 lvm[2579]: No volume groups found
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Activation of LVM2 logical volumes.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Encrypted Volumes.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Encrypted Volumes.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Activation of LVM2 logical volumes...
Dec 11 23:53:20 xfstests-201612120451 lvm[2625]: No volume groups found
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Activation of LVM2 logical volumes.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Dec 11 23:53:20 xfstests-201612120451 lvm[2627]: No volume groups found
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remote File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Remote File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Create Volatile Files and Directories...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting LSB: Generate ssh host keys if they do not exist...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting LSB: Raise network interfaces....
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Volatile Files and Directories.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started LSB: Generate ssh host keys if they do not exist.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Update UTMP about System Boot/Shutdown...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Update UTMP about System Boot/Shutdown.
Dec 11 23:53:20 xfstests-201612120451 dhclient: Internet Systems Consortium DHCP Client 4.3.1
Dec 11 23:53:20 xfstests-201612120451 dhclient: Copyright 2004-2014 Internet Systems Consortium.
Dec 11 23:53:20 xfstests-201612120451 dhclient: All rights reserved.
Dec 11 23:53:20 xfstests-201612120451 dhclient: For info, please visit https://www.isc.org/software/dhcp/
Dec 11 23:53:20 xfstests-201612120451 dhclient:
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Copyright 2004-2014 Internet Systems Consortium.
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: All rights reserved.
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: For info, please visit https://www.isc.org/software/dhcp/
Dec 11 23:53:20 xfstests-201612120451 dhclient: Listening on LPF/eth0/42:01:0a:f0:00:03
Dec 11 23:53:20 xfstests-201612120451 dhclient: Sending on LPF/eth0/42:01:0a:f0:00:03
Dec 11 23:53:20 xfstests-201612120451 dhclient: Sending on Socket/fallback
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Listening on LPF/eth0/42:01:0a:f0:00:03
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Sending on LPF/eth0/42:01:0a:f0:00:03
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Sending on Socket/fallback
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 13
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 13
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 17
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 17
Dec 11 23:53:20 xfstests-201612120451 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Dec 11 23:53:20 xfstests-201612120451 dhclient: No DHCPOFFERS received.
Dec 11 23:53:20 xfstests-201612120451 dhclient: Trying recorded lease 10.240.0.3
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: No DHCPOFFERS received.
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: Trying recorded lease 10.240.0.3
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: connect: Network is unreachable
Dec 11 23:53:20 xfstests-201612120451 logger: /etc/dhcp/dhclient-exit-hooks returned non-zero exit status 2
Dec 11 23:53:20 xfstests-201612120451 dhclient: bound: renewal in 38598 seconds.
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: bound: renewal in 38598 seconds.
Dec 11 23:53:20 xfstests-201612120451 networking[2633]: done.
* [ANNOUNCE] iproute2 4.9
From: Stephen Hemminger @ 2016-12-12 23:24 UTC
To: netdev; +Cc: linux-kernel
Release of iproute2 for Linux 4.9, just in time for your holiday
giving.
Update to iproute2 utility to support new features in Linux 4.9.
Mostly this is refinements to add new flags to tipc, l2tp, ss
and macsec support. There are also a couple of performance
enhancements for handling lots of interfaces and namespaces.
Source:
https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.9.0.tar.gz
Repository:
git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.
---
Alexei Starovoitov (1):
iptnl: add support for collect_md flag in IPv4 and IPv6 tunnels
Anton Aksola (1):
iproute2: build nsid-name cache only for commands that need it
Asbjørn Sloth Tønnesen (9):
man: ip-l2tp.8: fix l2spec_type documentation
man: ip-l2tp.8: remove non-existent tunnel parameter name
l2tp: fix integers with too few significant bits
l2tp: fix L2TP_ATTR_{RECV,SEND}_SEQ handling
l2tp: fix L2TP_ATTR_UDP_CSUM handling
l2tp: read IPv6 UDP checksum attributes from kernel
l2tp: support sequence numbering
l2tp: show tunnel: expose UDP checksum state
man: ip-l2tp.8: document UDP checksum options
Craig Dillabaugh (1):
action gact: list pipe as a valid action
Daniel Borkmann (1):
tc, ipt: don't enforce iproute2 dependency on iptables-devel
Daniel Hopf (1):
macsec: Nr. of packets and octets for macsec tx stats were swapped
Eric Dumazet (1):
tc: fq: display unthrottle latency
Hadar Hen Zion (2):
tc: flower: Introduce vlan support
tc: m_vlan: Add priority option to push vlan action
Hangbin Liu (4):
misc/ss: tcp cwnd should be unsigned
ip rule: merge ip rule flush and list, save together
ip rule: add selector support
devlink: Convert conditional in dl_argv_handle_port() to switch()
Isaac Boukris (1):
iproute2: ss: escape all null bytes in abstract unix domain socket
Jakub Kicinski (1):
tc: cls_bpf: handle skip_sw and skip_hw flags
Jamal Hadi Salim (4):
actions ife: Introduce encoding and decoding of tcindex metadata
actions: add skbmod action
man pages: Add tc-ife to Makefile
tc filters: add support to get individual filters by handle
Lorenzo Colitti (1):
ss: Support displaying and filtering on socket marks.
Lucas Bates (2):
man pages: update ife action to include tcindex
man pages: add man page for skbmod action
Mahesh Bandewar (1):
ip: (ipvlan) introduce L3s mode
Mike Frysinger (1):
ifstat/nstat: fix help output alignment
Moshe Shemesh (1):
ip link: Add support to configure SR-IOV VF to vlan protocol 802.1ad (VST QinQ)
Neal Cardwell (1):
ss: output TCP BBR diag information
Nikolay Aleksandrov (4):
bridge: vlan: add support to display per-vlan statistics
ipmroute: add support for age dumping
bridge: vlan: remove wrong stats help
bridge: add support for the multicast flood flag
Parthasarathy Bhuvaragan (7):
tipc: remove dead code
tipc: add link monitor set threshold
tipc: add link monitor get threshold
tipc: add link monitor summary
tipc: refractor bearer to facilitate link monitor
tipc: add link monitor list
tipc: update man page for link monitor
Paul Blakey (1):
tc: flower: Fix usage message
Phil Sutter (6):
iproute: fix documentation for ip rule scan order
include: Add linux/sctp.h
ss: Add support for SCTP protocol
ipaddress: Simplify vf_info parsing
ipaddress: Print IFLA_VF_QUERY_RSS_EN setting
man: ip-route.8: Add notes about dropped IPv4 route cache
Richard Alpe (3):
tipc: add peer remove functionality
tipc: introduce bearer add for remoteip
tipc: add the ability to get UDP bearer options
Roi Dayan (2):
devlink: Add usage help for eswitch subcommand
devlink: Add option to set and show eswitch inline mode
Roman Mashak (7):
ife action: allow specifying index in hex
ife: print prio, mark and hash as unsigned
ife: improve help text
tc: updated man page to reflect GET command to retrieve a single filter.
tc: improved usage help for fw classifier.
tc: print raw qdisc handle.
tc: distinguish Add/Replace filter operations
Shmulik Ladkani (1):
tc: m_vlan: Add vlan modify action
Simon Horman (1):
ss: initialise variables outside of for loop
Stephen Hemminger (24):
update headers to 4.8-rc2 net-next
update TIPC headers
tipc: cleanup style issues
update kernel headers from net-next
update bpf.h
update headers from pre 4.9 (net-next)
iplink: cleanup style errors
ip: iprule style cleanup
tc: skbmod style cleanup
tc_filter: style cleanup
ip: macvlan style cleanup
Revert "iproute2: macvlan: add "source" mode"
cleanup debris from revert
ss: break really long lines
ip: style cleanup
tc: cleanup style of qdisc code
update headers based on 4.9-rc7
libnetlink: style cleanups
l2tp: style cleanup
Revert "devlink: Add option to set and show eswitch inline mode"
Revert "devlink: Add usage help for eswitch subcommand"
update kernel headers
update to 4.9 release headers
v4.9.0
Zhang Shengju (3):
iproute2: fix the link group name getting error
libnetlink: reduce size of message sent to kernel
link: add team and team_slave link type
david decotigny (2):
iproute2: avoid exit in case of error.
iproute2: a non-expected rtnl message is an error
michael-dev@fami-braun.de (2):
iproute2: macvlan: add "source" mode
iproute2: macvlan: add "source" mode
stefan@datenfreihafen.org (1):
ip: update link types to show 6lowpan and ieee802.15.4 monitor
* Re: Soft lockup in tc_classify
From: Cong Wang @ 2016-12-12 22:51 UTC
To: Or Gerlitz
Cc: Daniel Borkmann, Shahar Klein, Linux Netdev List, Roi Dayan,
David Miller, Jiri Pirko, John Fastabend, Hadar Hen Zion
In-Reply-To: <CAJ3xEMjABmvAMs6h0EqBgPH8QDDwF_x0COx01MkEw2pa+fp7LA@mail.gmail.com>
On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation is involved during that test?
>
> What's missing to have the deletion race fixed? Making a patch, or
> testing a patch which was sent?
If you think it would help with this problem, here is my patch rebased
on the latest net-next.
Again, I don't yet see how it could help in this case; in particular, I
don't see how we could have a loop in this singly linked list.
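The attached patch changes ->delete() to report whether it removed the last filter, so that the caller can unlink tp before calling ->destroy(). A rough userspace sketch of that contract follows. All names here are hypothetical, the list is a plain singly linked list, and RCU grace periods are deliberately not modeled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct filter {
	struct filter *next;
	int handle;
};

struct proto {
	struct filter *filters;
	bool linked;     /* still visible to readers */
	bool destroyed;
};

/* Remove one filter and report whether the proto is now empty. */
static int proto_delete(struct proto *tp, int handle, bool *last)
{
	struct filter **pp;

	for (pp = &tp->filters; *pp; pp = &(*pp)->next) {
		if ((*pp)->handle == handle) {
			*pp = (*pp)->next;
			*last = (tp->filters == NULL);
			return 0;
		}
	}
	return -1;
}

/* Destroy no longer checks for emptiness; it must only run
 * once the proto has been unlinked. */
static void proto_destroy(struct proto *tp)
{
	assert(!tp->linked);
	tp->filters = NULL;
	tp->destroyed = true;
}

/* Caller: unlink only after delete reports the last filter is gone. */
static int remove_filter(struct proto *tp, int handle)
{
	bool last = false;
	int err = proto_delete(tp, handle, &last);

	if (err == 0 && last) {
		tp->linked = false;
		proto_destroy(tp);
	}
	return err;
}
```

The point of the split is that the emptiness check happens in the delete step, so the destroy step never runs on a proto that is still linked.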
[-- Attachment #2: tc-filter-destroy.diff --]
[-- Type: text/plain, Size: 21977 bytes --]
commit f6becda1e12fd8ef74e901fe39adb4558ce6c8f9
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed Nov 23 14:58:01 2016 -0800
net_sched: move the empty tp check from ->destroy() to ->delete()
Roi reported we could have a race condition where in ->classify() path
we dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL.
This is possible because ->destroy() could be called when deleting
a filter to check if we are the last one in tp, this tp is still
linked and visible at that time.
The root cause of this problem is the semantic of ->destroy(), it
does two things (for non-force case):
1) check if tp is empty
2) if tp is empty we could really destroy it
and its caller, if it cares, needs to check its return value to see if
it is really destroyed. Therefore we can't unlink tp unless we know
it is empty.
As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().
What's more, even if we unlink it before ->destroy(), it could still have
readers since we don't wait for a grace period here, we should not modify
tp->root in ->destroy() either.
Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Roi Dayan <roid@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 498f81b..b5eda3f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -203,14 +203,14 @@ struct tcf_proto_ops {
const struct tcf_proto *,
struct tcf_result *);
int (*init)(struct tcf_proto*);
- bool (*destroy)(struct tcf_proto*, bool);
+ void (*destroy)(struct tcf_proto*);
unsigned long (*get)(struct tcf_proto*, u32 handle);
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
unsigned long *, bool);
- int (*delete)(struct tcf_proto*, unsigned long);
+ int (*delete)(struct tcf_proto*, unsigned long, bool*);
void (*walk)(struct tcf_proto*, struct tcf_walker *arg);
/* rtnetlink specific */
@@ -405,7 +405,7 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
const struct Qdisc_ops *ops, u32 parentid);
void __qdisc_calculate_pkt_len(struct sk_buff *skb,
const struct qdisc_size_table *stab);
-bool tcf_destroy(struct tcf_proto *tp, bool force);
+void tcf_destroy(struct tcf_proto *tp);
void tcf_destroy_chain(struct tcf_proto __rcu **fl);
int skb_do_redirect(struct sk_buff *);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..f9179e0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -321,7 +321,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
tfilter_notify(net, skb, n, tp, fh,
RTM_DELTFILTER, false);
- tcf_destroy(tp, true);
+ tcf_destroy(tp);
err = 0;
goto errout;
}
@@ -331,25 +331,29 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
!(n->nlmsg_flags & NLM_F_CREATE))
goto errout;
} else {
+ bool last;
+
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
err = -EEXIST;
if (n->nlmsg_flags & NLM_F_EXCL) {
if (tp_created)
- tcf_destroy(tp, true);
+ tcf_destroy(tp);
goto errout;
}
break;
case RTM_DELTFILTER:
- err = tp->ops->delete(tp, fh);
+ err = tp->ops->delete(tp, fh, &last);
if (err == 0) {
- struct tcf_proto *next = rtnl_dereference(tp->next);
-
tfilter_notify(net, skb, n, tp,
t->tcm_handle,
RTM_DELTFILTER, false);
- if (tcf_destroy(tp, false))
+ if (last) {
+ struct tcf_proto *next = rtnl_dereference(tp->next);
+
RCU_INIT_POINTER(*back, next);
+ tcf_destroy(tp);
+ }
}
goto errout;
case RTM_GETTFILTER:
@@ -372,7 +376,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
} else {
if (tp_created)
- tcf_destroy(tp, true);
+ tcf_destroy(tp);
}
errout:
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 5877f60..8d822e5 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -93,30 +93,28 @@ static void basic_delete_filter(struct rcu_head *head)
kfree(f);
}
-static bool basic_destroy(struct tcf_proto *tp, bool force)
+static void basic_destroy(struct tcf_proto *tp)
{
struct basic_head *head = rtnl_dereference(tp->root);
struct basic_filter *f, *n;
- if (!force && !list_empty(&head->flist))
- return false;
-
list_for_each_entry_safe(f, n, &head->flist, link) {
list_del_rcu(&f->link);
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, basic_delete_filter);
}
kfree_rcu(head, rcu);
- return true;
}
-static int basic_delete(struct tcf_proto *tp, unsigned long arg)
+static int basic_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
+ struct basic_head *head = rtnl_dereference(tp->root);
struct basic_filter *f = (struct basic_filter *) arg;
list_del_rcu(&f->link);
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, basic_delete_filter);
+ *last = list_empty(&head->flist);
return 0;
}
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index adc7760..55c9961 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -268,25 +268,24 @@ static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
call_rcu(&prog->rcu, cls_bpf_delete_prog_rcu);
}
-static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
+static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
+ struct cls_bpf_head *head = rtnl_dereference(tp->root);
+
__cls_bpf_delete(tp, (struct cls_bpf_prog *) arg);
+ *last = list_empty(&head->plist);
return 0;
}
-static bool cls_bpf_destroy(struct tcf_proto *tp, bool force)
+static void cls_bpf_destroy(struct tcf_proto *tp)
{
struct cls_bpf_head *head = rtnl_dereference(tp->root);
struct cls_bpf_prog *prog, *tmp;
- if (!force && !list_empty(&head->plist))
- return false;
-
list_for_each_entry_safe(prog, tmp, &head->plist, link)
__cls_bpf_delete(tp, prog);
kfree_rcu(head, rcu);
- return true;
}
static unsigned long cls_bpf_get(struct tcf_proto *tp, u32 handle)
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index c1f2007..51c822d 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -131,20 +131,16 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
return err;
}
-static bool cls_cgroup_destroy(struct tcf_proto *tp, bool force)
+static void cls_cgroup_destroy(struct tcf_proto *tp)
{
struct cls_cgroup_head *head = rtnl_dereference(tp->root);
- if (!force)
- return false;
/* Head can still be NULL due to cls_cgroup_init(). */
if (head)
call_rcu(&head->rcu, cls_cgroup_destroy_rcu);
-
- return true;
}
-static int cls_cgroup_delete(struct tcf_proto *tp, unsigned long arg)
+static int cls_cgroup_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
return -EOPNOTSUPP;
}
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 6575aba..ea2be75 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -563,12 +563,14 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
return err;
}
-static int flow_delete(struct tcf_proto *tp, unsigned long arg)
+static int flow_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
+ struct flow_head *head = rtnl_dereference(tp->root);
struct flow_filter *f = (struct flow_filter *)arg;
list_del_rcu(&f->list);
call_rcu(&f->rcu, flow_destroy_filter);
+ *last = list_empty(&head->filters);
return 0;
}
@@ -584,20 +586,16 @@ static int flow_init(struct tcf_proto *tp)
return 0;
}
-static bool flow_destroy(struct tcf_proto *tp, bool force)
+static void flow_destroy(struct tcf_proto *tp)
{
struct flow_head *head = rtnl_dereference(tp->root);
struct flow_filter *f, *next;
- if (!force && !list_empty(&head->filters))
- return false;
-
list_for_each_entry_safe(f, next, &head->filters, list) {
list_del_rcu(&f->list);
call_rcu(&f->rcu, flow_destroy_filter);
}
kfree_rcu(head, rcu);
- return true;
}
static unsigned long flow_get(struct tcf_proto *tp, u32 handle)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e040c51..328938b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -312,21 +312,16 @@ static void fl_destroy_rcu(struct rcu_head *rcu)
schedule_work(&head->work);
}
-static bool fl_destroy(struct tcf_proto *tp, bool force)
+static void fl_destroy(struct tcf_proto *tp)
{
struct cls_fl_head *head = rtnl_dereference(tp->root);
struct cls_fl_filter *f, *next;
- if (!force && !list_empty(&head->filters))
- return false;
-
list_for_each_entry_safe(f, next, &head->filters, list)
__fl_delete(tp, f);
__module_get(THIS_MODULE);
call_rcu(&head->rcu, fl_destroy_rcu);
-
- return true;
}
static unsigned long fl_get(struct tcf_proto *tp, u32 handle)
@@ -877,7 +872,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
return err;
}
-static int fl_delete(struct tcf_proto *tp, unsigned long arg)
+static int fl_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct cls_fl_head *head = rtnl_dereference(tp->root);
struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
@@ -886,6 +881,7 @@ static int fl_delete(struct tcf_proto *tp, unsigned long arg)
rhashtable_remove_fast(&head->ht, &f->ht_node,
head->ht_params);
__fl_delete(tp, f);
+ *last = list_empty(&head->filters);
return 0;
}
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 9dc63d5..bc8ceb7 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -127,20 +127,14 @@ static void fw_delete_filter(struct rcu_head *head)
kfree(f);
}
-static bool fw_destroy(struct tcf_proto *tp, bool force)
+static void fw_destroy(struct tcf_proto *tp)
{
struct fw_head *head = rtnl_dereference(tp->root);
struct fw_filter *f;
int h;
if (head == NULL)
- return true;
-
- if (!force) {
- for (h = 0; h < HTSIZE; h++)
- if (rcu_access_pointer(head->ht[h]))
- return false;
- }
+ return;
for (h = 0; h < HTSIZE; h++) {
while ((f = rtnl_dereference(head->ht[h])) != NULL) {
@@ -150,17 +144,17 @@ static bool fw_destroy(struct tcf_proto *tp, bool force)
call_rcu(&f->rcu, fw_delete_filter);
}
}
- RCU_INIT_POINTER(tp->root, NULL);
kfree_rcu(head, rcu);
- return true;
}
-static int fw_delete(struct tcf_proto *tp, unsigned long arg)
+static int fw_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct fw_head *head = rtnl_dereference(tp->root);
struct fw_filter *f = (struct fw_filter *)arg;
struct fw_filter __rcu **fp;
struct fw_filter *pfp;
+ int ret = -EINVAL;
+ int h;
if (head == NULL || f == NULL)
goto out;
@@ -173,11 +167,21 @@ static int fw_delete(struct tcf_proto *tp, unsigned long arg)
RCU_INIT_POINTER(*fp, rtnl_dereference(f->next));
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, fw_delete_filter);
- return 0;
+ ret = 0;
+ break;
}
}
+
+ *last = true;
+ for (h = 0; h < HTSIZE; h++) {
+ if (rcu_access_pointer(head->ht[h])) {
+ *last = false;
+ break;
+ }
+ }
+
out:
- return -EINVAL;
+ return ret;
}
static const struct nla_policy fw_policy[TCA_FW_MAX + 1] = {
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index f935429..7d54805 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -99,15 +99,12 @@ static void mall_destroy_hw_filter(struct tcf_proto *tp,
&offload);
}
-static bool mall_destroy(struct tcf_proto *tp, bool force)
+static void mall_destroy(struct tcf_proto *tp)
{
struct cls_mall_head *head = rtnl_dereference(tp->root);
struct net_device *dev = tp->q->dev_queue->dev;
struct cls_mall_filter *f = head->filter;
- if (!force && f)
- return false;
-
if (f) {
if (tc_should_offload(dev, tp, f->flags))
mall_destroy_hw_filter(tp, f, (unsigned long) f);
@@ -115,7 +112,6 @@ static bool mall_destroy(struct tcf_proto *tp, bool force)
call_rcu(&f->rcu, mall_destroy_filter);
}
kfree_rcu(head, rcu);
- return true;
}
static unsigned long mall_get(struct tcf_proto *tp, u32 handle)
@@ -224,7 +220,7 @@ static int mall_change(struct net *net, struct sk_buff *in_skb,
return err;
}
-static int mall_delete(struct tcf_proto *tp, unsigned long arg)
+static int mall_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct cls_mall_head *head = rtnl_dereference(tp->root);
struct cls_mall_filter *f = (struct cls_mall_filter *) arg;
@@ -236,6 +232,7 @@ static int mall_delete(struct tcf_proto *tp, unsigned long arg)
RCU_INIT_POINTER(head->filter, NULL);
tcf_unbind_filter(tp, &f->res);
call_rcu(&f->rcu, mall_destroy_filter);
+ *last = true;
return 0;
}
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 455fc8f..1a38e41 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -276,20 +276,13 @@ static void route4_delete_filter(struct rcu_head *head)
kfree(f);
}
-static bool route4_destroy(struct tcf_proto *tp, bool force)
+static void route4_destroy(struct tcf_proto *tp)
{
struct route4_head *head = rtnl_dereference(tp->root);
int h1, h2;
if (head == NULL)
- return true;
-
- if (!force) {
- for (h1 = 0; h1 <= 256; h1++) {
- if (rcu_access_pointer(head->table[h1]))
- return false;
- }
- }
+ return;
for (h1 = 0; h1 <= 256; h1++) {
struct route4_bucket *b;
@@ -312,12 +305,10 @@ static bool route4_destroy(struct tcf_proto *tp, bool force)
kfree_rcu(b, rcu);
}
}
- RCU_INIT_POINTER(tp->root, NULL);
kfree_rcu(head, rcu);
- return true;
}
-static int route4_delete(struct tcf_proto *tp, unsigned long arg)
+static int route4_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct route4_head *head = rtnl_dereference(tp->root);
struct route4_filter *f = (struct route4_filter *)arg;
@@ -325,7 +316,7 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
struct route4_filter *nf;
struct route4_bucket *b;
unsigned int h = 0;
- int i;
+ int i, h1;
if (!head || !f)
return -EINVAL;
@@ -356,16 +347,25 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
rt = rtnl_dereference(b->ht[i]);
if (rt)
- return 0;
+ goto out;
}
/* OK, session has no flows */
RCU_INIT_POINTER(head->table[to_hash(h)], NULL);
kfree_rcu(b, rcu);
+ break;
+ }
+ }
- return 0;
+out:
+ *last = true;
+ for (h1 = 0; h1 <= 256; h1++) {
+ if (rcu_access_pointer(head->table[h1])) {
+ *last = false;
+ break;
}
}
+
return 0;
}
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 322438f..1aaff10 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -302,22 +302,13 @@ static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
call_rcu(&f->rcu, rsvp_delete_filter_rcu);
}
-static bool rsvp_destroy(struct tcf_proto *tp, bool force)
+static void rsvp_destroy(struct tcf_proto *tp)
{
struct rsvp_head *data = rtnl_dereference(tp->root);
int h1, h2;
if (data == NULL)
- return true;
-
- if (!force) {
- for (h1 = 0; h1 < 256; h1++) {
- if (rcu_access_pointer(data->ht[h1]))
- return false;
- }
- }
-
- RCU_INIT_POINTER(tp->root, NULL);
+ return;
for (h1 = 0; h1 < 256; h1++) {
struct rsvp_session *s;
@@ -337,10 +328,9 @@ static bool rsvp_destroy(struct tcf_proto *tp, bool force)
}
}
kfree_rcu(data, rcu);
- return true;
}
-static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
+static int rsvp_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct rsvp_head *head = rtnl_dereference(tp->root);
struct rsvp_filter *nfp, *f = (struct rsvp_filter *)arg;
@@ -348,7 +338,7 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
unsigned int h = f->handle;
struct rsvp_session __rcu **sp;
struct rsvp_session *nsp, *s = f->sess;
- int i;
+ int i, h1;
fp = &s->ht[(h >> 8) & 0xFF];
for (nfp = rtnl_dereference(*fp); nfp;
@@ -361,7 +351,7 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
for (i = 0; i <= 16; i++)
if (s->ht[i])
- return 0;
+ goto out;
/* OK, session has no flows */
sp = &head->ht[h & 0xFF];
@@ -370,13 +360,23 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
if (nsp == s) {
RCU_INIT_POINTER(*sp, s->next);
kfree_rcu(s, rcu);
- return 0;
+ goto out;
}
}
- return 0;
+ break;
}
}
+
+out:
+ *last = true;
+ for (h1 = 0; h1 < 256; h1++) {
+ if (rcu_access_pointer(head->ht[h1])) {
+ *last = false;
+ break;
+ }
+ }
+
return 0;
}
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 0751245..9149a03 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -150,7 +150,7 @@ static void tcindex_destroy_fexts(struct rcu_head *head)
kfree(f);
}
-static int tcindex_delete(struct tcf_proto *tp, unsigned long arg)
+static int tcindex_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct tcindex_data *p = rtnl_dereference(tp->root);
struct tcindex_filter_result *r = (struct tcindex_filter_result *) arg;
@@ -186,6 +186,8 @@ static int tcindex_delete(struct tcf_proto *tp, unsigned long arg)
call_rcu(&f->rcu, tcindex_destroy_fexts);
else
call_rcu(&r->rcu, tcindex_destroy_rexts);
+
+ *last = false;
return 0;
}
@@ -193,7 +195,9 @@ static int tcindex_destroy_element(struct tcf_proto *tp,
unsigned long arg,
struct tcf_walker *walker)
{
- return tcindex_delete(tp, arg);
+ bool last;
+
+ return tcindex_delete(tp, arg, &last);
}
static void __tcindex_destroy(struct rcu_head *head)
@@ -529,14 +533,11 @@ static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
}
}
-static bool tcindex_destroy(struct tcf_proto *tp, bool force)
+static void tcindex_destroy(struct tcf_proto *tp)
{
struct tcindex_data *p = rtnl_dereference(tp->root);
struct tcf_walker walker;
- if (!force)
- return false;
-
pr_debug("tcindex_destroy(tp %p),p %p\n", tp, p);
walker.count = 0;
walker.skip = 0;
@@ -544,7 +545,6 @@ static bool tcindex_destroy(struct tcf_proto *tp, bool force)
tcindex_walk(tp, &walker);
call_rcu(&p->rcu, __tcindex_destroy);
- return true;
}
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index ae83c3ae..787573b 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -582,37 +582,13 @@ static bool ht_empty(struct tc_u_hnode *ht)
return true;
}
-static bool u32_destroy(struct tcf_proto *tp, bool force)
+static void u32_destroy(struct tcf_proto *tp)
{
struct tc_u_common *tp_c = tp->data;
struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
WARN_ON(root_ht == NULL);
- if (!force) {
- if (root_ht) {
- if (root_ht->refcnt > 1)
- return false;
- if (root_ht->refcnt == 1) {
- if (!ht_empty(root_ht))
- return false;
- }
- }
-
- if (tp_c->refcnt > 1)
- return false;
-
- if (tp_c->refcnt == 1) {
- struct tc_u_hnode *ht;
-
- for (ht = rtnl_dereference(tp_c->hlist);
- ht;
- ht = rtnl_dereference(ht->next))
- if (!ht_empty(ht))
- return false;
- }
- }
-
if (root_ht && --root_ht->refcnt == 0)
u32_destroy_hnode(tp, root_ht);
@@ -637,20 +613,22 @@ static bool u32_destroy(struct tcf_proto *tp, bool force)
}
tp->data = NULL;
- return true;
}
-static int u32_delete(struct tcf_proto *tp, unsigned long arg)
+static int u32_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
{
struct tc_u_hnode *ht = (struct tc_u_hnode *)arg;
struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
+ struct tc_u_common *tp_c = tp->data;
+ int ret = 0;
if (ht == NULL)
- return 0;
+ goto out;
if (TC_U32_KEY(ht->handle)) {
u32_remove_hw_knode(tp, ht->handle);
- return u32_delete_key(tp, (struct tc_u_knode *)ht);
+ ret = u32_delete_key(tp, (struct tc_u_knode *)ht);
+ goto out;
}
if (root_ht == ht)
@@ -663,7 +641,40 @@ static int u32_delete(struct tcf_proto *tp, unsigned long arg)
return -EBUSY;
}
- return 0;
+out:
+ *last = true;
+ if (root_ht) {
+ if (root_ht->refcnt > 1) {
+ *last = false;
+ goto ret;
+ }
+ if (root_ht->refcnt == 1) {
+ if (!ht_empty(root_ht)) {
+ *last = false;
+ goto ret;
+ }
+ }
+ }
+
+ if (tp_c->refcnt > 1) {
+ *last = false;
+ goto ret;
+ }
+
+ if (tp_c->refcnt == 1) {
+ struct tc_u_hnode *ht;
+
+ for (ht = rtnl_dereference(tp_c->hlist);
+ ht;
+ ht = rtnl_dereference(ht->next))
+ if (!ht_empty(ht)) {
+ *last = false;
+ break;
+ }
+ }
+
+ret:
+ return ret;
}
#define NR_U32_NODE (1<<12)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index d7b9342..20293ee 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1899,15 +1899,11 @@ int tc_classify(struct sk_buff *skb, const struct tcf_proto *tp,
}
EXPORT_SYMBOL(tc_classify);
-bool tcf_destroy(struct tcf_proto *tp, bool force)
+void tcf_destroy(struct tcf_proto *tp)
{
- if (tp->ops->destroy(tp, force)) {
- module_put(tp->ops->owner);
- kfree_rcu(tp, rcu);
- return true;
- }
-
- return false;
+ tp->ops->destroy(tp);
+ module_put(tp->ops->owner);
+ kfree_rcu(tp, rcu);
}
void tcf_destroy_chain(struct tcf_proto __rcu **fl)
@@ -1916,7 +1912,7 @@ void tcf_destroy_chain(struct tcf_proto __rcu **fl)
while ((tp = rtnl_dereference(*fl)) != NULL) {
RCU_INIT_POINTER(*fl, tp->next);
- tcf_destroy(tp, true);
+ tcf_destroy(tp);
}
}
EXPORT_SYMBOL(tcf_destroy_chain);
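
For readers skimming the diff, the contract change can be sketched in user space roughly as follows: ->delete() reports through a bool out-parameter whether it removed the last filter, and ->destroy() becomes unconditional. This is only a toy model under stated assumptions. Every name here (toy_proto, toy_filter, toy_delete, toy_destroy, toy_del_request, toy_add) is hypothetical, and there is no RCU, RTNL locking or module refcounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_filter {
	struct toy_filter *next;
	int handle;
};

struct toy_proto {
	struct toy_filter *filters;	/* singly linked list of filters */
};

/* Helper for the example: add a filter at the head of the list. */
static int toy_add(struct toy_proto *tp, int handle)
{
	struct toy_filter *f = malloc(sizeof(*f));

	if (!f)
		return -1;
	f->handle = handle;
	f->next = tp->filters;
	tp->filters = f;
	return 0;
}

/* Remove one filter; *last tells the caller whether the list is now empty. */
static int toy_delete(struct toy_proto *tp, int handle, bool *last)
{
	struct toy_filter **fp;
	int ret = -1;			/* not found */

	assert(tp && last);
	for (fp = &tp->filters; *fp; fp = &(*fp)->next) {
		if ((*fp)->handle == handle) {
			struct toy_filter *f = *fp;

			*fp = f->next;
			free(f);
			ret = 0;
			break;
		}
	}
	*last = (tp->filters == NULL);
	return ret;
}

/* Unconditional teardown: no "force" flag and no return value. */
static void toy_destroy(struct toy_proto *tp)
{
	while (tp->filters) {
		struct toy_filter *f = tp->filters;

		tp->filters = f->next;
		free(f);
	}
	free(tp);
}

/* Caller side, loosely mirroring the RTM_DELTFILTER branch above. */
static bool toy_del_request(struct toy_proto **tpp, int handle)
{
	bool last = false;

	if (toy_delete(*tpp, handle, &last) == 0 && last) {
		toy_destroy(*tpp);
		*tpp = NULL;
		return true;		/* the proto itself was torn down */
	}
	return false;
}
```

The point of the design is that the emptiness check lives in one place, the delete path, so the destroy path no longer needs a "force" argument or a return value to say whether it actually did anything.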
^ permalink raw reply related
* [PATCH] net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2016-12-12 22:28 UTC (permalink / raw)
To: hsweeten, davem; +Cc: netdev, linux-kernel, Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
Move this driver to the new API {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
drivers/net/ethernet/cirrus/ep93xx_eth.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/cirrus/ep93xx_eth.c b/drivers/net/ethernet/cirrus/ep93xx_eth.c
index a1de0d1..396c886 100644
--- a/drivers/net/ethernet/cirrus/ep93xx_eth.c
+++ b/drivers/net/ethernet/cirrus/ep93xx_eth.c
@@ -715,16 +715,18 @@ static void ep93xx_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *i
strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
}
-static int ep93xx_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int ep93xx_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
{
struct ep93xx_priv *ep = netdev_priv(dev);
- return mii_ethtool_gset(&ep->mii, cmd);
+ return mii_ethtool_get_link_ksettings(&ep->mii, cmd);
}
-static int ep93xx_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int ep93xx_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
{
struct ep93xx_priv *ep = netdev_priv(dev);
- return mii_ethtool_sset(&ep->mii, cmd);
+ return mii_ethtool_set_link_ksettings(&ep->mii, cmd);
}
static int ep93xx_nway_reset(struct net_device *dev)
@@ -741,10 +743,10 @@ static u32 ep93xx_get_link(struct net_device *dev)
static const struct ethtool_ops ep93xx_ethtool_ops = {
.get_drvinfo = ep93xx_get_drvinfo,
- .get_settings = ep93xx_get_settings,
- .set_settings = ep93xx_set_settings,
.nway_reset = ep93xx_nway_reset,
.get_link = ep93xx_get_link,
+ .get_link_ksettings = ep93xx_get_link_ksettings,
+ .set_link_ksettings = ep93xx_set_link_ksettings,
};
static const struct net_device_ops ep93xx_netdev_ops = {
--
1.7.4.4
^ permalink raw reply related
* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-12 22:24 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers,
Josef Bacik
In-Reply-To: <3c022731-e703-34ac-55f1-60f5b94b6d62@stressinduktion.org>
On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 12.12.2016 19:05, Josef Bacik wrote:
>> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet
>> <eric.dumazet@gmail.com>
>> wrote:
>>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>>
>>>>
>>>> Hmm... Is your ephemeral port range includes the port your load
>>>> balancing app is using ?
>>>
>>> I suspect that you might have processes doing bind( port = 0) that
>>> are
>>> trapped into the bind_conflict() scan ?
>>>
>>> With 100,000 + timewaits there, this possibly hurts.
>>>
>>> Can you try the following loop breaker ?
>>
>> It doesn't appear that the app is doing bind(port = 0) during normal
>> operation. I tested this patch and it made no difference. I'm
>> going to
>> test simply restarting the app without changing to the SO_REUSEPORT
>> option. Thanks,
>
> Would it be possible to trace the time the function uses with trace?
> If
> we don't see the number growing considerably over time we probably can
> rule out that we loop somewhere in there (I would instrument
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
>
> __inet_hash_connect -> __inet_check_established also takes a lock
> (inet_ehash_lockp) which can be locked from inet_diag code path during
> socket diag info dumping.
>
> Unfortunately we couldn't reproduce it so far. :/
So I had a bcc script running to time how long we spent in
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but
of course I'm an idiot and didn't actually separate out the stats, so I
can't tell _which_ one was taking forever. But anyway, here's the
distribution on the box under normal operation:
     Some shit           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 74       |                                        |
      2048 -> 4095       : 10537    |****************************************|
      4096 -> 8191       : 8497     |********************************        |
      8192 -> 16383      : 3745     |**************                          |
     16384 -> 32767      : 300      |*                                       |
     32768 -> 65535      : 250      |                                        |
     65536 -> 131071     : 180      |                                        |
    131072 -> 262143     : 71       |                                        |
    262144 -> 524287     : 18       |                                        |
    524288 -> 1048575    : 5        |                                        |
The times are in nanoseconds. Here are two distributions captured during
the problem:
     Some shit           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 21       |                                        |
      2048 -> 4095       : 21820    |****************************************|
      4096 -> 8191       : 11598    |*********************                   |
      8192 -> 16383      : 4337     |*******                                 |
     16384 -> 32767      : 290      |                                        |
     32768 -> 65535      : 59       |                                        |
     65536 -> 131071     : 23       |                                        |
    131072 -> 262143     : 12       |                                        |
    262144 -> 524287     : 6        |                                        |
    524288 -> 1048575    : 19       |                                        |
   1048576 -> 2097151    : 1079     |*                                       |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 1        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 0        |                                        |
  33554432 -> 67108863   : 1192     |**                                      |
     Some shit           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 48       |                                        |
      2048 -> 4095       : 14714    |********************                    |
      4096 -> 8191       : 6769     |*********                               |
      8192 -> 16383      : 2234     |***                                     |
     16384 -> 32767      : 422      |                                        |
     32768 -> 65535      : 208      |                                        |
     65536 -> 131071     : 61       |                                        |
    131072 -> 262143     : 10       |                                        |
    262144 -> 524287     : 416      |                                        |
    524288 -> 1048575    : 826      |*                                       |
   1048576 -> 2097151    : 598      |                                        |
   2097152 -> 4194303    : 10       |                                        |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 1        |                                        |
  16777216 -> 33554431   : 289      |                                        |
  33554432 -> 67108863   : 921      |*                                       |
  67108864 -> 134217727  : 74       |                                        |
 134217728 -> 268435455  : 75       |                                        |
 268435456 -> 536870911  : 48       |                                        |
 536870912 -> 1073741823 : 25       |                                        |
1073741824 -> 2147483647 : 3        |                                        |
2147483648 -> 4294967295 : 2        |                                        |
4294967296 -> 8589934591 : 1        |                                        |
As you can see, we start getting tail latencies in the 4-8 second range.
Tomorrow I'll separate out the stats so we can know which function is
the problem child. Sorry about not doing that first. Thanks,
Josef
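
Separating the stats could look roughly like the sketch below: one power-of-two histogram per traced function instead of a single shared one. This is plain user-space C for illustration, not the bcc script used above. The enum, the array layout and the helper names (log2_slot, record, dump) are assumptions; only the three function names come from the thread.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NR_FUNCS 3
#define NR_SLOTS 64

/* Hypothetical keys for the three functions named in the thread. */
enum traced_func { BIND_CONFLICT, HASH_CONNECT, GET_PORT };

static const char *func_names[NR_FUNCS] = {
	"inet_csk_bind_conflict", "__inet_hash_connect", "inet_csk_get_port",
};

/* One histogram per function, so a slow tail can be attributed. */
static uint64_t hist[NR_FUNCS][NR_SLOTS];

/* Power-of-two bucket index: 0..1 -> 0, 2..3 -> 1, 4..7 -> 2, ... */
static unsigned int log2_slot(uint64_t ns)
{
	unsigned int slot = 0;

	while (ns > 1) {
		ns >>= 1;
		slot++;
	}
	return slot;
}

/* Account one measured duration (in nanoseconds) to one function. */
static void record(enum traced_func fn, uint64_t ns)
{
	assert((unsigned int)fn < NR_FUNCS);
	hist[fn][log2_slot(ns)]++;
}

/* Print the non-empty buckets of every per-function histogram. */
static void dump(void)
{
	unsigned int fn, slot;

	for (fn = 0; fn < NR_FUNCS; fn++) {
		printf("%s\n", func_names[fn]);
		for (slot = 0; slot < NR_SLOTS; slot++) {
			uint64_t low = slot ? (UINT64_C(1) << slot) : 0;
			uint64_t high = slot ? (low << 1) - 1 : 1;

			if (hist[fn][slot])
				printf("%20" PRIu64 " -> %-20" PRIu64 " : %" PRIu64 "\n",
				       low, high, hist[fn][slot]);
		}
	}
}
```

With this layout, a sample of about 5 seconds recorded against inet_csk_get_port lands in the 4294967296 -> 8589934591 bucket of that function only, which is the attribution the combined histogram above cannot give.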
^ permalink raw reply
* Re: [PATCH for-next 0/6] IB/hns: Bug Fixes for HNS RoCE Driver
From: Doug Ledford @ 2016-12-12 22:09 UTC (permalink / raw)
To: Salil Mehta
Cc: xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk, lijun_nudt,
linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>
On 11/29/2016 6:10 PM, Salil Mehta wrote:
> This patch-set contains bug fixes for the HNS RoCE driver.
>
> Lijun Ou (1):
> IB/hns: Fix the IB device name
>
> Shaobo Xu (2):
> IB/hns: Fix the bug when free mr
> IB/hns: Fix the bug when free cq
>
> Wei Hu (Xavier) (3):
> IB/hns: Fix the bug when destroy qp
> IB/hns: Fix the bug of setting port mtu
> IB/hns: Delete the redundant memset operation
>
> drivers/infiniband/hw/hns/hns_roce_cmd.h | 5 -
> drivers/infiniband/hw/hns/hns_roce_common.h | 42 ++
> drivers/infiniband/hw/hns/hns_roce_cq.c | 27 +-
> drivers/infiniband/hw/hns/hns_roce_device.h | 18 +
> drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 967 ++++++++++++++++++++++++---
> drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 57 ++
> drivers/infiniband/hw/hns/hns_roce_main.c | 26 +-
> drivers/infiniband/hw/hns/hns_roce_mr.c | 21 +-
> 8 files changed, 1026 insertions(+), 137 deletions(-)
>
Series applied, thanks.
--
Doug Ledford <dledford@redhat.com>
GPG Key ID: 0E572FDD
^ permalink raw reply
* Re: [PATCH V3 for-next 00/11] Code improvements & fixes for HNS RoCE driver
From: Doug Ledford @ 2016-12-12 22:09 UTC (permalink / raw)
To: Salil Mehta
Cc: xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk, lijun_nudt,
linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161123194109.420760-1-salil.mehta@huawei.com>
On 11/23/2016 2:40 PM, Salil Mehta wrote:
> This patchset introduces some code improvements and fixes
> for the identified problems in the HNS RoCE driver.
>
> Lijun Ou (4):
> IB/hns: Add the interface for querying QP1
> IB/hns: add self loopback for CM
> IB/hns: Modify the condition of notifying hardware loopback
> IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp()
>
> Salil Mehta (1):
> IB/hns: Fix for Checkpatch.pl comment style errors
>
> Shaobo Xu (1):
> IB/hns: Implement the add_gid/del_gid and optimize the GIDs
> management
>
> Wei Hu (Xavier) (5):
> IB/hns: Add code for refreshing CQ CI using TPTR
> IB/hns: Optimize the logic of allocating memory using APIs
> IB/hns: Modify the macro for the timeout when cmd process
> IB/hns: Modify query info named port_num when querying RC QP
> IB/hns: Change qpn allocation to round-robin mode.
>
> drivers/infiniband/hw/hns/hns_roce_alloc.c | 11 +-
> drivers/infiniband/hw/hns/hns_roce_cmd.c | 8 +-
> drivers/infiniband/hw/hns/hns_roce_cmd.h | 7 +-
> drivers/infiniband/hw/hns/hns_roce_common.h | 2 -
> drivers/infiniband/hw/hns/hns_roce_cq.c | 17 +-
> drivers/infiniband/hw/hns/hns_roce_device.h | 45 ++--
> drivers/infiniband/hw/hns/hns_roce_eq.c | 6 +-
> drivers/infiniband/hw/hns/hns_roce_hem.c | 6 +-
> drivers/infiniband/hw/hns/hns_roce_hw_v1.c | 267 +++++++++++++++++------
> drivers/infiniband/hw/hns/hns_roce_hw_v1.h | 17 +-
> drivers/infiniband/hw/hns/hns_roce_main.c | 311 +++++++--------------------
> drivers/infiniband/hw/hns/hns_roce_mr.c | 22 +-
> drivers/infiniband/hw/hns/hns_roce_pd.c | 5 +-
> drivers/infiniband/hw/hns/hns_roce_qp.c | 2 +-
> 14 files changed, 364 insertions(+), 362 deletions(-)
>
Series applied, thanks.
--
Doug Ledford <dledford@redhat.com>
GPG Key ID: 0E572FDD
^ permalink raw reply
* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-12 21:23 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Eric Dumazet, Tom Herbert, Linux Kernel Network Developers
In-Reply-To: <3c022731-e703-34ac-55f1-60f5b94b6d62@stressinduktion.org>
On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 12.12.2016 19:05, Josef Bacik wrote:
>> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet
>> <eric.dumazet@gmail.com>
>> wrote:
>>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>>
>>>>
>>>> Hmm... Is your ephemeral port range includes the port your load
>>>> balancing app is using ?
>>>
>>> I suspect that you might have processes doing bind( port = 0) that
>>> are
>>> trapped into the bind_conflict() scan ?
>>>
>>> With 100,000 + timewaits there, this possibly hurts.
>>>
>>> Can you try the following loop breaker ?
>>
>> It doesn't appear that the app is doing bind(port = 0) during normal
>> operation. I tested this patch and it made no difference. I'm
>> going to
>> test simply restarting the app without changing to the SO_REUSEPORT
>> option. Thanks,
>
> Would it be possible to trace the time the function uses with trace?
> If
> we don't see the number growing considerably over time we probably can
> rule out that we loop somewhere in there (I would instrument
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
>
> __inet_hash_connect -> __inet_check_established also takes a lock
> (inet_ehash_lockp) which can be locked from inet_diag code path during
> socket diag info dumping.
>
> Unfortunately we couldn't reproduce it so far. :/
Working on getting the timing info, will probably be tomorrow due to
meetings. I did test simply restarting the app without changing to the
config that enabled the use of SO_REUSEPORT and the problem didn't
occur, so it definitely has something to do with SO_REUSEPORT. Thanks,
Josef
^ permalink raw reply
* Re: Soft lockup in tc_classify
From: Or Gerlitz @ 2016-12-12 21:18 UTC (permalink / raw)
To: Daniel Borkmann, Cong Wang
Cc: Shahar Klein, Linux Netdev List, Roi Dayan, David Miller,
Jiri Pirko, John Fastabend, Hadar Hen Zion
In-Reply-To: <584EA60B.80803@iogearbox.net>
On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation is involved during that test?
What's missing to have the deletion race fixed? Writing a patch, or
testing a patch that was already sent?
^ permalink raw reply
* Re: [RFC PATCH net-next v3 1/2] macb: Add 1588 support in Cadence GEM.
From: Richard Cochran @ 2016-12-12 21:09 UTC (permalink / raw)
To: Andrei.Pistirica
Cc: harini.katakam, rafalo, netdev, linux-kernel, linux-arm-kernel,
davem, nicolas.ferre, harinikatakamlinux, punnaia, michals,
anirudh, boris.brezillon, alexandre.belloni, tbultel
In-Reply-To: <07C910AB6AC6C345A093D5A08F5AF568CB74D84D@CHN-SV-EXMX03.mchp-main.com>
On Mon, Dec 12, 2016 at 10:22:43AM +0000, Andrei.Pistirica@microchip.com wrote:
> Richard, do you agree with this?
Yes, but please trim your replies next time. Scrolling through pages
of quoted headers and stale content in order to read one line is very
annoying.
Thanks,
Richard
^ permalink raw reply
* Re: [PATCH V2 10/22] bnxt_re: Support for CQ verbs
From: Jonathan Toppins @ 2016-12-12 21:03 UTC (permalink / raw)
To: Selvin Xavier, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Eddie Wai, Devesh Sharma,
Somnath Kotur, Sriharsha Basavapatna
In-Reply-To: <1481266096-23331-11-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
On 12/09/2016 01:48 AM, Selvin Xavier wrote:
> This patch implements support for create_cq, destroy_cq and req_notify_cq
> verbs.
>
> Signed-off-by: Eddie Wai <eddie.wai-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Somnath Kotur <somnath.kotur-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> ---
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c | 183 ++++++++++++++++++++++++
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h | 47 ++++++
> drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 154 ++++++++++++++++++++
> drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h | 19 +++
> drivers/infiniband/hw/bnxtre/bnxt_re_main.c | 4 +
> include/uapi/rdma/bnxt_re_uverbs_abi.h | 11 ++
> 6 files changed, 418 insertions(+)
Something I just realized is that this patch series does not modify the
MAINTAINERS file. Who from Broadcom will be maintaining this driver?
You probably want to include this info in the v3 series.
[...]
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> index 3417829..f316598 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> @@ -60,6 +60,16 @@
> #include "bnxt_re_ib_verbs.h"
> #include <rdma/bnxt_re_uverbs_abi.h>
>
> +static int bnxt_re_copy_to_udata(struct bnxt_re_dev *rdev, void *data, int len,
> + struct ib_udata *udata)
> +{
> + int rc;
> +
> + rc = ib_copy_to_udata(udata, data, len);
> +
> + return rc ? -EFAULT : 0;
> +}
This function seems to provide no value by wrapping ib_copy_to_udata.
Is there any reason to keep it? From the two call sites for this function,
it appears it can be replaced with a direct call to ib_copy_to_udata.
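
To illustrate the point, here is a small user-space sketch. It assumes the wrapped helper already returns 0 on success and -EFAULT on failure, which is what the review comment implies. The struct fake_udata type and the fake_copy_to_udata() and wrapped_copy() functions are hypothetical stand-ins, not the RDMA core API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_udata; only what the sketch needs. */
struct fake_udata {
	void *outbuf;
	size_t outlen;
};

/*
 * Hypothetical stand-in for ib_copy_to_udata(), assumed to return 0 on
 * success and -EFAULT on failure.
 */
static int fake_copy_to_udata(struct fake_udata *udata, const void *src,
			      size_t len)
{
	if (!udata->outbuf || len > udata->outlen)
		return -EFAULT;
	memcpy(udata->outbuf, src, len);
	return 0;
}

/* Same shape as the wrapper under review: it only remaps nonzero to -EFAULT. */
static int wrapped_copy(void *unused_dev, void *data, int len,
			struct fake_udata *udata)
{
	int rc;

	(void)unused_dev;	/* the device argument is never used */
	assert(len >= 0);
	rc = fake_copy_to_udata(udata, data, (size_t)len);
	return rc ? -EFAULT : 0;
}
```

Under that assumption the wrapper returns exactly what the direct call returns in both the success and the failure case, so the call sites lose nothing by calling the helper directly.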
^ permalink raw reply
* Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
From: Ozgur Karatas @ 2016-12-12 20:24 UTC (permalink / raw)
To: Leon Romanovsky, Tariq Toukan; +Cc: yishaih@mellanox.com, netdev, linux-kernel
In-Reply-To: <20161212181838.GB8204@mtr-leonro.local>
12.12.2016, 20:18, "Leon Romanovsky" <leon@kernel.org>:
> On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
>> Dear Romanovsky;
>
> Please avoid top-posting in your replies.
> Thanks
Dear Leon;
Thanks for the information; I will pay attention to that.
>> I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call "WARN_ON" function, get a error to implicit declaration, right?
>>
>> Because, you will use to "BUG_ON" get a error implicit declaration of functions.
>
> I'm not sure that I followed you. mem->offset is set by sg_set_buf from
> buf variable returned by dma_alloc_coherent(). HW needs to get very
> precise size of this buf, in multiple of pages and aligned to pages
> boundaries.
I have studied your code below and I believe it is the right patch.
You are the expert in this matter; thank you for correcting me.
I will use your style as an example.
Regards,
Ozgur Karatas
> See the patch inline which removes this BUG_ON in proper and safe way.
>
> From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
> From: Leon Romanovsky <leonro@mellanox.com>
> Date: Mon, 12 Dec 2016 20:02:45 +0200
> Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine
>
> This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
> by checking DMA address aligment in advance and performing proper
> folding in case of error.
>
> Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
> Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 2a9dd46..e1f9e7c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem,
> if (!buf)
> return -ENOMEM;
>
> + if (offset_in_page(buf)) {
> + dma_free_coherent(dev, PAGE_SIZE << order,
> + buf, sg_dma_address(mem));
> + return -ENOMEM;
> + }
> +
> sg_set_buf(mem, buf, PAGE_SIZE << order);
> - BUG_ON(mem->offset);
> sg_dma_len(mem) = PAGE_SIZE << order;
> return 0;
> }
> --
> 2.10.2
^ permalink raw reply
* Re: [PATCH v2] audit: use proper refcount locking on audit_sock
From: Paul Moore @ 2016-12-12 20:18 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: netdev, linux-kernel, linux-audit, edumazet, xiyou.wangcong,
dvyukov
In-Reply-To: <5714bd7468cfec225407a6c367e658478d590495.1481534171.git.rgb@redhat.com>
On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a refrence in
> audit_sock and release it when we reset audit_sock. audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet <edumazet@google.com> and Cong Wang
> <xiyou.wangcong@gmail.com> on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues. This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
> kernel/audit.c | 34 ++++++++++++++++++++++++++++------
> 1 files changed, 28 insertions(+), 6 deletions(-)
My previous question about testing still stands, but I took a closer
look and have some additional comments, see below ...
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
> struct sk_buff *skb;
>
> /* break the connection */
> + sock_put(audit_sock);
> audit_pid = 0;
> + audit_nlk_portid = 0;
> audit_sock = NULL;
>
> /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> if (rc >= 0) {
> consume_skb(skb);
> rc = 0;
> + } else {
> + if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
I dislike the way you wrote this because instead of simply looking at
this to see if it is correct I need to sort out all the bits and find out
if there are other error codes that could run afoul of this check ...
make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
Actually, since EPERM is 1, -EPERM (-1, which in two's complement is
0xffffffff) is going to cause this to be true for pretty much any
value of rc, yes?
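As a quick user-space illustration of the point above (a sketch only, not kernel code; the helper names mask_check() and explicit_check() are made up here, and it assumes the Linux value EPERM == 1 on a two's complement machine): OR-ing the negated errno values together yields -1, so the '&' test passes for every non-zero rc.

```c
#include <errno.h>
#include <stdbool.h>

/* The check as written in the patch: treats errno values as bit flags. */
static bool mask_check(int rc)
{
	/* -ENOMEM | -EPERM | -ECONNREFUSED collapses to -1 (all bits set) */
	return (rc & (-ENOMEM | -EPERM | -ECONNREFUSED)) != 0;
}

/* The explicit comparison suggested above. */
static bool explicit_check(int rc)
{
	return rc == -ENOMEM || rc == -EPERM || rc == -ECONNREFUSED;
}
```

With these two helpers, an unrelated error such as -EAGAIN passes mask_check() but not explicit_check().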
> + mutex_lock(&audit_cmd_mutex);
> + auditd_reset();
> + mutex_unlock(&audit_cmd_mutex);
> + }
The code in audit#next handles netlink_unicast() errors in
kauditd_thread() and you are adding error handling code here in
kauditd_send_unicast_skb() ... that's messy. I don't care too much
where the auditd_reset() call is made, but let's only do it in one
function; FWIW, I originally put the error handling code in
kauditd_thread() because there was other error handling code that
needed to be done in that scope, so it resulted in cleaner code.
Related, I see you are now considering ENOMEM to be a fatal condition,
that differs from the AUDITD_BAD macro in kauditd_thread(); this
difference needs to be reconciled.
Finally, you should update the comment header block for auditd_reset()
to note that it needs to be called with the audit_cmd_mutex held.
> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
> return -EACCES;
> }
> if (audit_pid && new_pid &&
> - audit_replace(requesting_pid) != -ECONNREFUSED) {
> + (audit_replace(requesting_pid) & (-ECONNREFUSED|-EPERM|-ENOMEM))) {
Do we simply want to treat any error here as fatal, and not just
ECONN/EPERM/ENOMEM? If not, let's come up with a single macro to
handle the fatal netlink_unicast() return codes so we have some chance
to keep things consistent in the future.
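One possible shape for such a helper, as a sketch only: the name audit_unicast_fatal() is hypothetical, and the set of codes is simply the one under discussion in this thread (whether ENOMEM belongs in it is still the open question raised above).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true when a netlink_unicast() return code
 * should be treated as "the audit daemon is gone". The list of codes is an
 * assumption taken from this thread, not a settled policy.
 */
static inline bool audit_unicast_fatal(int rc)
{
	switch (rc) {
	case -ECONNREFUSED:
	case -EPERM:
	case -ENOMEM:
		return true;
	default:
		return false;
	}
}
```

Both call sites could then test the same predicate, so a later change to the list of fatal codes only has to be made in one place.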
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Vivien Didelot @ 2016-12-12 20:03 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli
Cc: Volodymyr Bendiuga, Volodymyr Bendiuga, netdev,
Volodymyr Bendiuga
In-Reply-To: <20161212190915.GA8885@lunn.ch>
Hi Andrew,
Andrew Lunn <andrew@lunn.ch> writes:
> Humm, it looks like we are doing the atu_get wrong. We are looking for
> a specific MAC address. Yet we seem to be walking the whole table to
> find it, rather than getting the hardware to do the search.
We are not doing it wrong, the hardware does the search. A classic dump
of an ATU database consists of starting from the broadcast address
ff:ff:ff:ff:ff:ff and issuing GetNext operations until we get back to the
broadcast address. Only addresses in use are returned by GetNext, thus
dumping an empty database is completed in a single operation.
I implemented atu_get intentionally this way because it provides simpler
code, rather than doing arithmetic on MAC addresses (Unless I am unaware
of simple increment/decrement code.)
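To illustrate the dump walk described above, here is a stand-alone sketch against a mock table. Everything in it is an assumption made for illustration: struct mock_atu, mock_getnext() and atu_dump() are not driver functions, and the mock only imitates the GetNext behaviour as described in this mail (next in-use address in ascending order, broadcast when the end is reached).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

static const uint8_t bcast[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Mock database: entries sorted ascending; n may be 0 (illustrative only). */
struct mock_atu {
	const uint8_t (*entries)[ETH_ALEN];
	size_t n;
};

/* Mock GetNext: next in-use address after *mac, or broadcast at the end. */
static void mock_getnext(const struct mock_atu *atu, uint8_t *mac)
{
	int from_start = !memcmp(mac, bcast, ETH_ALEN);
	size_t i;

	for (i = 0; i < atu->n; i++) {
		if (from_start || memcmp(atu->entries[i], mac, ETH_ALEN) > 0) {
			memcpy(mac, atu->entries[i], ETH_ALEN);
			return;
		}
	}
	memcpy(mac, bcast, ETH_ALEN);
}

/* Walk the whole database; returns the number of GetNext operations. */
static int atu_dump(const struct mock_atu *atu, size_t *found)
{
	uint8_t next[ETH_ALEN];
	int ops = 0;

	*found = 0;
	memcpy(next, bcast, ETH_ALEN);	/* start from the broadcast address */
	do {
		mock_getnext(atu, next);
		ops++;
		if (memcmp(next, bcast, ETH_ALEN))
			(*found)++;
	} while (memcmp(next, bcast, ETH_ALEN));

	return ops;
}
```

In this model an empty table costs one operation and a table with N entries costs N + 1, which is why a full walk per lookup gets expensive as the table fills up.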
> The current code is:
>
> static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
> const u8 *addr, struct mv88e6xxx_atu_entry *entry)
> {
> struct mv88e6xxx_atu_entry next;
> int err;
>
> eth_broadcast_addr(next.mac);
>
> err = _mv88e6xxx_atu_mac_write(chip, next.mac);
>
> We should be setting next.mac to one less than the address we are
> looking for.
>
> Volodymyr, please could you try that, and see how much of a speed up
> you get.
>
> There is another optimization which can be made. We only say there is
> no such entry once we have reached the end of the table. But it will
> return the entries in ascending order. So if the entry it returned is
> bigger than what we are looking for, we can immediately abort the
> search and say it does not exist.
However your two suggestions to optimize the lookup are correct. It'd be
interesting to see if that makes a significant difference or not.
Thanks,
Vivien
^ permalink raw reply
* Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries
From: Andrew Lunn @ 2016-12-12 19:09 UTC (permalink / raw)
To: Florian Fainelli
Cc: Volodymyr Bendiuga, Vivien Didelot, Volodymyr Bendiuga, netdev,
Volodymyr Bendiuga
In-Reply-To: <48ff1136-dd8f-7704-a512-c23b27989bf8@gmail.com>
On Mon, Dec 12, 2016 at 08:37:50AM -0800, Florian Fainelli wrote:
> On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> > Hi,
> >
> > I apologise for incorrectly formatted patch, I will fix and resend it.
> > The problem with the ATU right now is that it is too slow when inserting
> > entries.
> > When the OS boots up, it might insert some multicast entries into the
> > atu (if
> > they are preconfigured by user). I run a test with 10 mc entries being
> > configured for
> > each port (13 ports), and it took 15 seconds, which made system quite
> > slow on responding to
> > other commands, as it has been inserting mc entries. The implementation
> > with hashtable
> > made insert command for 13 ports and 10 entries per port about 700 msec
> > long.
Humm, it looks like we are doing the atu_get wrong. We are looking for
a specific MAC address. Yet we seem to be walking the whole table to
find it, rather than getting the hardware to do the search.
The current code is:
static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
			     const u8 *addr, struct mv88e6xxx_atu_entry *entry)
{
	struct mv88e6xxx_atu_entry next;
	int err;

	eth_broadcast_addr(next.mac);

	err = _mv88e6xxx_atu_mac_write(chip, next.mac);
We should be setting next.mac to one less than the address we are
looking for.
Volodymyr, please could you try that, and see how much of a speed up
you get.
There is another optimization which can be made. We only say there is
no such entry once we have reached the end of the table. But it will
return the entries in ascending order. So if the entry it returned is
bigger than what we are looking for, we can immediately abort the
search and say it does not exist.
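A rough stand-alone sketch of the two optimizations above, run against a mock table rather than real hardware. All names here (mock_atu, mock_getnext(), mac_dec(), atu_lookup()) are invented for illustration, and the mock assumes GetNext returns the next in-use address in ascending order, with broadcast marking the end of the table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Mock ATU contents, kept sorted in ascending order (illustrative data). */
static const uint8_t mock_atu[][ETH_ALEN] = {
	{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 },
	{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x05 },
	{ 0x01, 0x00, 0x5e, 0x00, 0x01, 0x00 },
};
static const uint8_t bcast[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
static int mock_ops;	/* counts simulated GetNext operations */

/* Subtract one from a MAC address, borrowing from right to left. */
static void mac_dec(uint8_t *mac)
{
	int i;

	for (i = ETH_ALEN - 1; i >= 0; i--)
		if (mac[i]-- != 0)
			break;
}

/* Mock GetNext: smallest entry strictly greater than *mac, else broadcast. */
static void mock_getnext(uint8_t *mac)
{
	bool from_start = !memcmp(mac, bcast, ETH_ALEN);
	size_t i;

	mock_ops++;
	for (i = 0; i < sizeof(mock_atu) / sizeof(mock_atu[0]); i++) {
		if (from_start || memcmp(mock_atu[i], mac, ETH_ALEN) > 0) {
			memcpy(mac, mock_atu[i], ETH_ALEN);
			return;
		}
	}
	memcpy(mac, bcast, ETH_ALEN);
}

/* Returns true if addr is in the mock table, using both optimizations. */
static bool atu_lookup(const uint8_t *addr)
{
	uint8_t next[ETH_ALEN];
	int cmp;

	memcpy(next, addr, ETH_ALEN);
	mac_dec(next);			/* start one below the target */

	for (;;) {
		mock_getnext(next);
		if (!memcmp(next, bcast, ETH_ALEN))
			return false;	/* end of table */
		cmp = memcmp(next, addr, ETH_ALEN);
		if (cmp == 0)
			return true;
		if (cmp > 0)
			return false;	/* entries ascend: target is absent */
	}
}
```

In this model both a hit and a miss cost a single GetNext, independent of how many entries are in the table. The mock treats broadcast purely as the end marker, so looking up the broadcast address itself is not modelled.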
Andrew
^ permalink raw reply
* Re: Soft lockup in tc_classify
From: Cong Wang @ 2016-12-12 19:07 UTC (permalink / raw)
To: Shahar Klein
Cc: Daniel Borkmann, Linux Kernel Network Developers, Roi Dayan,
David Miller, Jiri Pirko, John Fastabend, Or Gerlitz,
Hadar Hen Zion
In-Reply-To: <1e715873-34ba-0a76-c94e-064ca4cf895b@mellanox.com>
On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein <shahark@mellanox.com> wrote:
>
>
> On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
>>
>> Hi Shahar,
>>
>> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>>>
>>> Hi All,
>>>
>>> sorry for the spam, the first time was sent with html part and was
>>> rejected.
>>>
>>> We observed an issue where a classifier instance next member is
>>> pointing back to itself, causing a CPU soft lockup.
>>> We found it by running traffic on many udp connections and then adding
>>> a new flower rule using tc.
>>>
>>> We added a quick workaround to verify it:
>>>
>>> In tc_classify:
>>>
>>> for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>> int err;
>>> + if (tp == tp->next)
>>> + RCU_INIT_POINTER(tp->next, NULL);
>>>
>>>
>>> We also had a print here showing tp->next is pointing to tp. With this
>>> workaround we are not hitting the issue anymore.
>>> We are not sure we fully understand the mechanism here - with the rtnl
>>> and rcu locks.
>>> We'll appreciate your help solving this issue.
>>
>>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation in involved during that test?
Hmm, I thought RCU_INIT_POINTER() respects readers, but it seems not?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.
>>
>> Do you have a script and kernel .config for reproducing this?
>
>
> I'm using a user space socket app(https://github.com/shahar-klein/noodle)on
> a vm to push udp packets from ~2000 different udp src ports ramping up at
> ~100 per second towards another vm on the same Hypervisor. Once the traffic
> starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred,
> odd->drop) on the relevant representor in the Hypervisor.
Would you mind sharing your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, I just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?
Thanks!
^ permalink raw reply
* Re: Soft lockup in inet_put_port on 4.6
From: Hannes Frederic Sowa @ 2016-12-12 18:44 UTC (permalink / raw)
To: Josef Bacik, Eric Dumazet; +Cc: Tom Herbert, Linux Kernel Network Developers
In-Reply-To: <1481565929.24490.0@smtp.office365.com>
On 12.12.2016 19:05, Josef Bacik wrote:
> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>
>>>
>>> Hmm... Is your ephemeral port range includes the port your load
>>> balancing app is using ?
>>
>> I suspect that you might have processes doing bind( port = 0) that are
>> trapped into the bind_conflict() scan ?
>>
>> With 100,000 + timewaits there, this possibly hurts.
>>
>> Can you try the following loop breaker ?
>
> It doesn't appear that the app is doing bind(port = 0) during normal
> operation. I tested this patch and it made no difference. I'm going to
> test simply restarting the app without changing to the SO_REUSEPORT
> option. Thanks,
Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.
Unfortunately we couldn't reproduce it so far. :/
Thanks,
Hannes
^ permalink raw reply
* Re: [PATCH V2 13/22] bnxt_re: Support QP verbs
From: Leon Romanovsky @ 2016-12-12 18:27 UTC (permalink / raw)
To: Selvin Xavier
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
Eddie Wai, Devesh Sharma, Somnath Kotur, Sriharsha Basavapatna
In-Reply-To: <1481266096-23331-14-git-send-email-selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
On Thu, Dec 08, 2016 at 10:48:07PM -0800, Selvin Xavier wrote:
> This patch implements create_qp, destroy_qp, query_qp and modify_qp verbs.
>
> v2: Fixed sparse warnings
>
> Signed-off-by: Eddie Wai <eddie.wai-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Devesh Sharma <devesh.sharma-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Somnath Kotur <somnath.kotur-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Selvin Xavier <selvin.xavier-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> ---
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c | 873 ++++++++++++++++++++++++
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h | 250 +++++++
> drivers/infiniband/hw/bnxtre/bnxt_re.h | 14 +
> drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 762 +++++++++++++++++++++
> drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h | 21 +
> drivers/infiniband/hw/bnxtre/bnxt_re_main.c | 6 +
> include/uapi/rdma/bnxt_re_uverbs_abi.h | 10 +
> 7 files changed, 1936 insertions(+)
>
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> index 636306f..edc9411 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> @@ -50,6 +50,69 @@
> #include "bnxt_qplib_fp.h"
>
> static void bnxt_qplib_arm_cq_enable(struct bnxt_qplib_cq *cq);
> +
> +static void bnxt_qplib_free_qp_hdr_buf(struct bnxt_qplib_res *res,
> + struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_q *rq = &qp->rq;
> + struct bnxt_qplib_q *sq = &qp->sq;
> +
> + if (qp->rq_hdr_buf)
> + dma_free_coherent(&res->pdev->dev,
> + rq->hwq.max_elements * qp->rq_hdr_buf_size,
> + qp->rq_hdr_buf, qp->rq_hdr_buf_map);
> + if (qp->sq_hdr_buf)
> + dma_free_coherent(&res->pdev->dev,
> + sq->hwq.max_elements * qp->sq_hdr_buf_size,
> + qp->sq_hdr_buf, qp->sq_hdr_buf_map);
> + qp->rq_hdr_buf = NULL;
> + qp->sq_hdr_buf = NULL;
> + qp->rq_hdr_buf_map = 0;
> + qp->sq_hdr_buf_map = 0;
> + qp->sq_hdr_buf_size = 0;
> + qp->rq_hdr_buf_size = 0;
> +}
> +
> +static int bnxt_qplib_alloc_qp_hdr_buf(struct bnxt_qplib_res *res,
> + struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_q *rq = &qp->rq;
> + struct bnxt_qplib_q *sq = &qp->rq;
> + int rc = 0;
> +
> + if (qp->sq_hdr_buf_size && sq->hwq.max_elements) {
> + qp->sq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> + sq->hwq.max_elements *
> + qp->sq_hdr_buf_size,
> + &qp->sq_hdr_buf_map, GFP_KERNEL);
> + if (!qp->sq_hdr_buf) {
> + rc = -ENOMEM;
> + dev_err(&res->pdev->dev,
> + "QPLIB: Failed to create sq_hdr_buf");
> + goto fail;
> + }
> + }
> +
> + if (qp->rq_hdr_buf_size && rq->hwq.max_elements) {
> + qp->rq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> + rq->hwq.max_elements *
> + qp->rq_hdr_buf_size,
> + &qp->rq_hdr_buf_map,
> + GFP_KERNEL);
> + if (!qp->rq_hdr_buf) {
> + rc = -ENOMEM;
> + dev_err(&res->pdev->dev,
> + "QPLIB: Failed to create rq_hdr_buf");
> + goto fail;
> + }
> + }
> + return 0;
> +
> +fail:
> + bnxt_qplib_free_qp_hdr_buf(res, qp);
> + return rc;
> +}
> +
> static void bnxt_qplib_service_nq(unsigned long data)
> {
> struct bnxt_qplib_nq *nq = (struct bnxt_qplib_nq *)data;
> @@ -215,6 +278,816 @@ int bnxt_qplib_alloc_nq(struct pci_dev *pdev, struct bnxt_qplib_nq *nq)
> return 0;
> }
>
> +/* QP */
> +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct cmdq_create_qp1 req;
> + struct creq_create_qp1_resp *resp;
> + struct bnxt_qplib_pbl *pbl;
> + struct bnxt_qplib_q *sq = &qp->sq;
> + struct bnxt_qplib_q *rq = &qp->rq;
> + int rc;
> + u16 cmd_flags = 0;
> + u32 qp_flags = 0;
> +
> + RCFW_CMD_PREP(req, CREATE_QP1, cmd_flags);
> +
> + /* General */
> + req.type = qp->type;
> + req.dpi = cpu_to_le32(qp->dpi->dpi);
> + req.qp_handle = cpu_to_le64(qp->qp_handle);
> +
> + /* SQ */
> + sq->hwq.max_elements = sq->max_wqe;
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, NULL, 0,
> + &sq->hwq.max_elements,
> + BNXT_QPLIB_MAX_SQE_ENTRY_SIZE, 0,
> + PAGE_SIZE, HWQ_TYPE_QUEUE);
> + if (rc)
> + goto exit;
> +
> + sq->swq = kcalloc(sq->hwq.max_elements, sizeof(*sq->swq), GFP_KERNEL);
> + if (!sq->swq) {
> + rc = -ENOMEM;
> + goto fail_sq;
> + }
> + pbl = &sq->hwq.pbl[PBL_LVL_0];
> + req.sq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> + req.sq_pg_size_sq_lvl =
> + ((sq->hwq.level & CMDQ_CREATE_QP1_SQ_LVL_MASK)
> + << CMDQ_CREATE_QP1_SQ_LVL_SFT) |
> + (pbl->pg_size == ROCE_PG_SIZE_4K ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_4K :
> + pbl->pg_size == ROCE_PG_SIZE_8K ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_8K :
> + pbl->pg_size == ROCE_PG_SIZE_64K ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_64K :
> + pbl->pg_size == ROCE_PG_SIZE_2M ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_2M :
> + pbl->pg_size == ROCE_PG_SIZE_8M ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_8M :
> + pbl->pg_size == ROCE_PG_SIZE_1G ?
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_1G :
> + CMDQ_CREATE_QP1_SQ_PG_SIZE_PG_4K);
> +
> + if (qp->scq)
> + req.scq_cid = cpu_to_le32(qp->scq->id);
> +
> + qp_flags |= CMDQ_CREATE_QP1_QP_FLAGS_RESERVED_LKEY_ENABLE;
> +
> + /* RQ */
> + if (rq->max_wqe) {
> + rq->hwq.max_elements = qp->rq.max_wqe;
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &rq->hwq, NULL, 0,
> + &rq->hwq.max_elements,
> + BNXT_QPLIB_MAX_RQE_ENTRY_SIZE, 0,
> + PAGE_SIZE, HWQ_TYPE_QUEUE);
> + if (rc)
> + goto fail_sq;
> +
> + rq->swq = kcalloc(rq->hwq.max_elements, sizeof(*rq->swq),
> + GFP_KERNEL);
> + if (!rq->swq) {
> + rc = -ENOMEM;
> + goto fail_rq;
> + }
> + pbl = &rq->hwq.pbl[PBL_LVL_0];
> + req.rq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> + req.rq_pg_size_rq_lvl =
> + ((rq->hwq.level & CMDQ_CREATE_QP1_RQ_LVL_MASK) <<
> + CMDQ_CREATE_QP1_RQ_LVL_SFT) |
> + (pbl->pg_size == ROCE_PG_SIZE_4K ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_4K :
> + pbl->pg_size == ROCE_PG_SIZE_8K ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_8K :
> + pbl->pg_size == ROCE_PG_SIZE_64K ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_64K :
> + pbl->pg_size == ROCE_PG_SIZE_2M ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_2M :
> + pbl->pg_size == ROCE_PG_SIZE_8M ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_8M :
> + pbl->pg_size == ROCE_PG_SIZE_1G ?
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_1G :
> + CMDQ_CREATE_QP1_RQ_PG_SIZE_PG_4K);
> + if (qp->rcq)
> + req.rcq_cid = cpu_to_le32(qp->rcq->id);
> + }
> +
> + /* Header buffer - allow hdr_buf pass in */
> + rc = bnxt_qplib_alloc_qp_hdr_buf(res, qp);
> + if (rc) {
> + rc = -ENOMEM;
> + goto fail;
> + }
> + req.qp_flags = cpu_to_le32(qp_flags);
> + req.sq_size = cpu_to_le32(sq->hwq.max_elements);
> + req.rq_size = cpu_to_le32(rq->hwq.max_elements);
> +
> + req.sq_fwo_sq_sge =
> + cpu_to_le16((sq->max_sge & CMDQ_CREATE_QP1_SQ_SGE_MASK) <<
> + CMDQ_CREATE_QP1_SQ_SGE_SFT);
> + req.rq_fwo_rq_sge =
> + cpu_to_le16((rq->max_sge & CMDQ_CREATE_QP1_RQ_SGE_MASK) <<
> + CMDQ_CREATE_QP1_RQ_SGE_SFT);
> +
> + req.pd_id = cpu_to_le32(qp->pd->id);
> +
> + resp = (struct creq_create_qp1_resp *)
> + bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> + NULL, 0);
> + if (!resp) {
> + dev_err(&res->pdev->dev, "QPLIB: FP: CREATE_QP1 send failed");
> + rc = -EINVAL;
> + goto fail;
> + }
> + /**/
It looks like you forgot to add text to this comment.
> + if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> + /* Cmd timed out */
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP1 timed out");
> + rc = -ETIMEDOUT;
> + goto fail;
> + }
> + if (RCFW_RESP_STATUS(resp) ||
> + RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP1 failed ");
> + dev_err(&rcfw->pdev->dev,
> + "QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> + RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> + RCFW_RESP_COOKIE(resp));
> + rc = -EINVAL;
> + goto fail;
> + }
> + qp->id = le32_to_cpu(resp->xid);
> + qp->cur_qp_state = CMDQ_MODIFY_QP_NEW_STATE_RESET;
> + sq->flush_in_progress = false;
> + rq->flush_in_progress = false;
> +
> + return 0;
> +
> +fail:
> + bnxt_qplib_free_qp_hdr_buf(res, qp);
> +fail_rq:
> + bnxt_qplib_free_hwq(res->pdev, &rq->hwq);
> + kfree(rq->swq);
> +fail_sq:
> + bnxt_qplib_free_hwq(res->pdev, &sq->hwq);
> + kfree(sq->swq);
> +exit:
> + return rc;
> +}
> +
> +int bnxt_qplib_create_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct sq_send *hw_sq_send_hdr, **hw_sq_send_ptr;
> + struct cmdq_create_qp req;
> + struct creq_create_qp_resp *resp;
> + struct bnxt_qplib_pbl *pbl;
> + struct sq_psn_search **psn_search_ptr;
> + unsigned long long int psn_search, poff = 0;
> + struct bnxt_qplib_q *sq = &qp->sq;
> + struct bnxt_qplib_q *rq = &qp->rq;
> + struct bnxt_qplib_hwq *xrrq;
> + int i, rc, req_size, psn_sz;
> + u16 cmd_flags = 0, max_ssge;
> + u32 sw_prod, qp_flags = 0;
> +
> + RCFW_CMD_PREP(req, CREATE_QP, cmd_flags);
> +
> + /* General */
> + req.type = qp->type;
> + req.dpi = cpu_to_le32(qp->dpi->dpi);
> + req.qp_handle = cpu_to_le64(qp->qp_handle);
> +
> + /* SQ */
> + psn_sz = (qp->type == CMDQ_CREATE_QP_TYPE_RC) ?
> + sizeof(struct sq_psn_search) : 0;
> + sq->hwq.max_elements = sq->max_wqe;
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, sq->sglist,
> + sq->nmap, &sq->hwq.max_elements,
> + BNXT_QPLIB_MAX_SQE_ENTRY_SIZE,
> + psn_sz,
> + PAGE_SIZE, HWQ_TYPE_QUEUE);
> + if (rc)
> + goto exit;
> +
> + sq->swq = kcalloc(sq->hwq.max_elements, sizeof(*sq->swq), GFP_KERNEL);
> + if (!sq->swq) {
> + rc = -ENOMEM;
> + goto fail_sq;
> + }
> + hw_sq_send_ptr = (struct sq_send **)sq->hwq.pbl_ptr;
> + if (psn_sz) {
> + psn_search_ptr = (struct sq_psn_search **)
> + &hw_sq_send_ptr[SQE_PG(sq->hwq.max_elements)];
> + psn_search = (unsigned long long int)
> + &hw_sq_send_ptr[SQE_PG(sq->hwq.max_elements)]
> + [SQE_IDX(sq->hwq.max_elements)];
> + if (psn_search & ~PAGE_MASK) {
> + /* If the psn_search does not start on a page boundary,
> + * then calculate the offset
> + */
> + poff = (psn_search & ~PAGE_MASK) /
> + BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE;
> + }
> + for (i = 0; i < sq->hwq.max_elements; i++)
> + sq->swq[i].psn_search =
> + &psn_search_ptr[PSNE_PG(i + poff)]
> + [PSNE_IDX(i + poff)];
> + }
> + pbl = &sq->hwq.pbl[PBL_LVL_0];
> + req.sq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> + req.sq_pg_size_sq_lvl =
> + ((sq->hwq.level & CMDQ_CREATE_QP_SQ_LVL_MASK)
> + << CMDQ_CREATE_QP_SQ_LVL_SFT) |
> + (pbl->pg_size == ROCE_PG_SIZE_4K ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_4K :
> + pbl->pg_size == ROCE_PG_SIZE_8K ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_8K :
> + pbl->pg_size == ROCE_PG_SIZE_64K ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_64K :
> + pbl->pg_size == ROCE_PG_SIZE_2M ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_2M :
> + pbl->pg_size == ROCE_PG_SIZE_8M ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_8M :
> + pbl->pg_size == ROCE_PG_SIZE_1G ?
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_1G :
> + CMDQ_CREATE_QP_SQ_PG_SIZE_PG_4K);
> +
> + /* initialize all SQ WQEs to LOCAL_INVALID (sq prep for hw fetch) */
> + hw_sq_send_ptr = (struct sq_send **)sq->hwq.pbl_ptr;
> + for (sw_prod = 0; sw_prod < sq->hwq.max_elements; sw_prod++) {
> + hw_sq_send_hdr = &hw_sq_send_ptr[SQE_PG(sw_prod)]
> + [SQE_IDX(sw_prod)];
> + hw_sq_send_hdr->wqe_type = SQ_BASE_WQE_TYPE_LOCAL_INVALID;
> + }
> +
> + if (qp->scq)
> + req.scq_cid = cpu_to_le32(qp->scq->id);
> +
> + qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_RESERVED_LKEY_ENABLE;
> + qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_FR_PMR_ENABLED;
> + if (qp->sig_type)
> + qp_flags |= CMDQ_CREATE_QP_QP_FLAGS_FORCE_COMPLETION;
> +
> + /* RQ */
> + if (rq->max_wqe) {
> + rq->hwq.max_elements = rq->max_wqe;
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &rq->hwq, rq->sglist,
> + rq->nmap, &rq->hwq.max_elements,
> + BNXT_QPLIB_MAX_RQE_ENTRY_SIZE, 0,
> + PAGE_SIZE, HWQ_TYPE_QUEUE);
> + if (rc)
> + goto fail_sq;
> +
> + rq->swq = kcalloc(rq->hwq.max_elements, sizeof(*rq->swq),
> + GFP_KERNEL);
> + if (!rq->swq) {
> + rc = -ENOMEM;
> + goto fail_rq;
> + }
> + pbl = &rq->hwq.pbl[PBL_LVL_0];
> + req.rq_pbl = cpu_to_le64(pbl->pg_map_arr[0]);
> + req.rq_pg_size_rq_lvl =
> + ((rq->hwq.level & CMDQ_CREATE_QP_RQ_LVL_MASK) <<
> + CMDQ_CREATE_QP_RQ_LVL_SFT) |
> + (pbl->pg_size == ROCE_PG_SIZE_4K ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_4K :
> + pbl->pg_size == ROCE_PG_SIZE_8K ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_8K :
> + pbl->pg_size == ROCE_PG_SIZE_64K ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_64K :
> + pbl->pg_size == ROCE_PG_SIZE_2M ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_2M :
> + pbl->pg_size == ROCE_PG_SIZE_8M ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_8M :
> + pbl->pg_size == ROCE_PG_SIZE_1G ?
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_1G :
> + CMDQ_CREATE_QP_RQ_PG_SIZE_PG_4K);
> + }
> +
> + if (qp->rcq)
> + req.rcq_cid = cpu_to_le32(qp->rcq->id);
> + req.qp_flags = cpu_to_le32(qp_flags);
> + req.sq_size = cpu_to_le32(sq->hwq.max_elements);
> + req.rq_size = cpu_to_le32(rq->hwq.max_elements);
> + qp->sq_hdr_buf = NULL;
> + qp->rq_hdr_buf = NULL;
> +
> + rc = bnxt_qplib_alloc_qp_hdr_buf(res, qp);
> + if (rc)
> + goto fail_rq;
> +
> + /* CTRL-22434: Irrespective of the requested SGE count on the SQ
> + * always create the QP with max send sges possible if the requested
> + * inline size is greater than 0.
> + */
> + max_ssge = qp->max_inline_data ? 6 : sq->max_sge;
> + req.sq_fwo_sq_sge = cpu_to_le16(
> + ((max_ssge & CMDQ_CREATE_QP_SQ_SGE_MASK)
> + << CMDQ_CREATE_QP_SQ_SGE_SFT) | 0);
> + req.rq_fwo_rq_sge = cpu_to_le16(
> + ((rq->max_sge & CMDQ_CREATE_QP_RQ_SGE_MASK)
> + << CMDQ_CREATE_QP_RQ_SGE_SFT) | 0);
> + /* ORRQ and IRRQ */
> + if (psn_sz) {
> + xrrq = &qp->orrq;
> + xrrq->max_elements =
> + ORD_LIMIT_TO_ORRQ_SLOTS(qp->max_rd_atomic);
> + req_size = xrrq->max_elements *
> + BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE + PAGE_SIZE - 1;
> + req_size &= ~(PAGE_SIZE - 1);
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, xrrq, NULL, 0,
> + &xrrq->max_elements,
> + BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE,
> + 0, req_size, HWQ_TYPE_CTX);
> + if (rc)
> + goto fail_buf_free;
> + pbl = &xrrq->pbl[PBL_LVL_0];
> + req.orrq_addr = cpu_to_le64(pbl->pg_map_arr[0]);
> +
> + xrrq = &qp->irrq;
> + xrrq->max_elements = IRD_LIMIT_TO_IRRQ_SLOTS(
> + qp->max_dest_rd_atomic);
> + req_size = xrrq->max_elements *
> + BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE + PAGE_SIZE - 1;
> + req_size &= ~(PAGE_SIZE - 1);
> +
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, xrrq, NULL, 0,
> + &xrrq->max_elements,
> + BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE,
> + 0, req_size, HWQ_TYPE_CTX);
> + if (rc)
> + goto fail_orrq;
> +
> + pbl = &xrrq->pbl[PBL_LVL_0];
> + req.irrq_addr = cpu_to_le64(pbl->pg_map_arr[0]);
> + }
> + req.pd_id = cpu_to_le32(qp->pd->id);
> +
> + resp = (struct creq_create_qp_resp *)
> + bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> + NULL, 0);
> + if (!resp) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP send failed");
> + rc = -EINVAL;
> + goto fail;
> + }
> + /**/
> + if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> + /* Cmd timed out */
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP timed out");
> + rc = -ETIMEDOUT;
> + goto fail;
> + }
> + if (RCFW_RESP_STATUS(resp) ||
> + RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: CREATE_QP failed ");
> + dev_err(&rcfw->pdev->dev,
> + "QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> + RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> + RCFW_RESP_COOKIE(resp));
> + rc = -EINVAL;
> + goto fail;
> + }
> + qp->id = le32_to_cpu(resp->xid);
> + qp->cur_qp_state = CMDQ_MODIFY_QP_NEW_STATE_RESET;
> + sq->flush_in_progress = false;
> + rq->flush_in_progress = false;
> +
> + return 0;
> +
> +fail:
> + if (qp->irrq.max_elements)
> + bnxt_qplib_free_hwq(res->pdev, &qp->irrq);
> +fail_orrq:
> + if (qp->orrq.max_elements)
> + bnxt_qplib_free_hwq(res->pdev, &qp->orrq);
> +fail_buf_free:
> + bnxt_qplib_free_qp_hdr_buf(res, qp);
> +fail_rq:
> + bnxt_qplib_free_hwq(res->pdev, &rq->hwq);
> + kfree(rq->swq);
> +fail_sq:
> + bnxt_qplib_free_hwq(res->pdev, &sq->hwq);
> + kfree(sq->swq);
> +exit:
> + return rc;
> +}
> +
> +static void __filter_modify_flags(struct bnxt_qplib_qp *qp)
> +{
It would help review if you broke this function into smaller pieces and
got rid of the switch->switch->if construction.
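One possible shape, sketched in stand-alone C with made-up stand-in types (struct qp_sketch, the QS_* states and the MASK_* flags are all hypothetical and much smaller than the driver's real definitions): one small helper per state transition, and a dispatcher that only selects the helper.

```c
#include <stdint.h>

enum qs_state { QS_RESET, QS_INIT, QS_RTR, QS_RTS };

#define MASK_PATH_MTU	(1u << 0)
#define MASK_SGID_INDEX	(1u << 1)
#define MASK_VLAN_ID	(1u << 2)
#define MASK_SRC_MAC	(1u << 3)

struct qp_sketch {
	enum qs_state cur_state;	/* current state */
	enum qs_state new_state;	/* requested state */
	uint32_t modify_flags;
	uint32_t path_mtu;
	uint32_t sgid_index;
	uint32_t max_rd_atomic;
	uint32_t max_dest_rd_atomic;
};

/* INIT->RTR: fill in defaults and drop flags not wanted here. */
static void filter_init_to_rtr(struct qp_sketch *qp)
{
	if (!(qp->modify_flags & MASK_PATH_MTU)) {
		qp->modify_flags |= MASK_PATH_MTU;
		qp->path_mtu = 2048;	/* default when none was requested */
	}
	qp->modify_flags &= ~(MASK_VLAN_ID | MASK_SRC_MAC);
	if (qp->max_dest_rd_atomic < 1)
		qp->max_dest_rd_atomic = 1;
	if (!(qp->modify_flags & MASK_SGID_INDEX)) {
		qp->modify_flags |= MASK_SGID_INDEX;
		qp->sgid_index = 0;
	}
}

/* RTR->RTS: clamp max_rd_atomic and clear attributes not modifiable here. */
static void filter_rtr_to_rts(struct qp_sketch *qp)
{
	if (qp->max_rd_atomic < 1)
		qp->max_rd_atomic = 1;
	qp->modify_flags &= ~(MASK_PATH_MTU | MASK_SGID_INDEX);
}

/* Dispatcher: picks the per-transition helper, nothing else. */
static void filter_modify_flags(struct qp_sketch *qp)
{
	if (qp->cur_state == QS_INIT && qp->new_state == QS_RTR)
		filter_init_to_rtr(qp);
	else if (qp->cur_state == QS_RTR && qp->new_state == QS_RTS)
		filter_rtr_to_rts(qp);
	/* all other transitions need no filtering in this sketch */
}
```

The empty cases of the nested switch disappear, and each transition's rules can be read and reviewed on their own.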
> + switch (qp->cur_qp_state) {
> + case CMDQ_MODIFY_QP_NEW_STATE_RESET:
> + switch (qp->state) {
> + case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> + break;
> + default:
> + break;
> + }
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> + switch (qp->state) {
> + case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> + /* INIT->RTR, configure the path_mtu to the default
> + * 2048 if not being requested
> + */
> + if (!(qp->modify_flags &
> + CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU)) {
> + qp->modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> + qp->path_mtu = CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> + }
> + qp->modify_flags &=
> + ~CMDQ_MODIFY_QP_MODIFY_MASK_VLAN_ID;
> + /* Bono FW requires the max_dest_rd_atomic to be >= 1 */
> + if (qp->max_dest_rd_atomic < 1)
> + qp->max_dest_rd_atomic = 1;
> + qp->modify_flags &= ~CMDQ_MODIFY_QP_MODIFY_MASK_SRC_MAC;
> + /* Bono FW 20.6.5 requires SGID_INDEX configuration */
> + if (!(qp->modify_flags &
> + CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX)) {
> + qp->modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX;
> + qp->ah.sgid_index = 0;
> + }
> + break;
> + default:
> + break;
> + }
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> + switch (qp->state) {
> + case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> + /* Bono FW requires the max_rd_atomic to be >= 1 */
> + if (qp->max_rd_atomic < 1)
> + qp->max_rd_atomic = 1;
> + /* Bono FW does not allow PKEY_INDEX,
> + * DGID, FLOW_LABEL, SGID_INDEX, HOP_LIMIT,
> + * TRAFFIC_CLASS, DEST_MAC, PATH_MTU, RQ_PSN,
> + * MIN_RNR_TIMER, MAX_DEST_RD_ATOMIC, DEST_QP_ID
> + * modification
> + */
> + qp->modify_flags &=
> + ~(CMDQ_MODIFY_QP_MODIFY_MASK_PKEY |
> + CMDQ_MODIFY_QP_MODIFY_MASK_DGID |
> + CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL |
> + CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX |
> + CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT |
> + CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS |
> + CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC |
> + CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU |
> + CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN |
> + CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER |
> + CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC
> + | CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID);
> + break;
> + default:
> + break;
> + }
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_SQD:
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_SQE:
> + break;
> + case CMDQ_MODIFY_QP_NEW_STATE_ERR:
> + break;
> + default:
> + break;
> + }
> +}
> +
> +int bnxt_qplib_modify_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct cmdq_modify_qp req;
> + struct creq_modify_qp_resp *resp;
> + u16 cmd_flags = 0, pkey;
> + u32 temp32[4];
> + u32 bmask;
> +
> + RCFW_CMD_PREP(req, MODIFY_QP, cmd_flags);
> +
> + /* Filter out the qp_attr_mask based on the state->new transition */
> + __filter_modify_flags(qp);
> + bmask = qp->modify_flags;
> + req.modify_mask = cpu_to_le64(qp->modify_flags);
> + req.qp_cid = cpu_to_le32(qp->id);
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_STATE) {
> + req.network_type_en_sqd_async_notify_new_state =
> + (qp->state & CMDQ_MODIFY_QP_NEW_STATE_MASK) |
> + (qp->en_sqd_async_notify ?
> + CMDQ_MODIFY_QP_EN_SQD_ASYNC_NOTIFY : 0);
> + }
> + req.network_type_en_sqd_async_notify_new_state |= qp->nw_type;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS)
> + req.access = qp->access;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_PKEY) {
> + if (!bnxt_qplib_get_pkey(res, &res->pkey_tbl,
> + qp->pkey_index, &pkey))
> + req.pkey = cpu_to_le16(pkey);
> + }
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_QKEY)
> + req.qkey = cpu_to_le32(qp->qkey);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DGID) {
> + memcpy(temp32, qp->ah.dgid.data, sizeof(struct bnxt_qplib_gid));
> + req.dgid[0] = cpu_to_le32(temp32[0]);
> + req.dgid[1] = cpu_to_le32(temp32[1]);
> + req.dgid[2] = cpu_to_le32(temp32[2]);
> + req.dgid[3] = cpu_to_le32(temp32[3]);
> + }
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL)
> + req.flow_label = cpu_to_le32(qp->ah.flow_label);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX)
> + req.sgid_index = cpu_to_le16(res->sgid_tbl.hw_id
> + [qp->ah.sgid_index]);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT)
> + req.hop_limit = qp->ah.hop_limit;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS)
> + req.traffic_class = qp->ah.traffic_class;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC)
> + memcpy(req.dest_mac, qp->ah.dmac, 6);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU)
> + req.path_mtu = cpu_to_le16(qp->path_mtu);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_TIMEOUT)
> + req.timeout = qp->timeout;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RETRY_CNT)
> + req.retry_cnt = qp->retry_cnt;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RNR_RETRY)
> + req.rnr_retry = qp->rnr_retry;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER)
> + req.min_rnr_timer = qp->min_rnr_timer;
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN)
> + req.rq_psn = cpu_to_le32(qp->rq.psn);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN)
> + req.sq_psn = cpu_to_le32(qp->sq.psn);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MAX_RD_ATOMIC)
> + req.max_rd_atomic =
> + ORD_LIMIT_TO_ORRQ_SLOTS(qp->max_rd_atomic);
> +
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC)
> + req.max_dest_rd_atomic =
> + IRD_LIMIT_TO_IRRQ_SLOTS(qp->max_dest_rd_atomic);
> +
> + req.sq_size = cpu_to_le32(qp->sq.hwq.max_elements);
> + req.rq_size = cpu_to_le32(qp->rq.hwq.max_elements);
> + req.sq_sge = cpu_to_le16(qp->sq.max_sge);
> + req.rq_sge = cpu_to_le16(qp->rq.max_sge);
> + req.max_inline_data = cpu_to_le32(qp->max_inline_data);
> + if (bmask & CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID)
> + req.dest_qp_id = cpu_to_le32(qp->dest_qpn);
> +
> + req.vlan_pcp_vlan_dei_vlan_id = cpu_to_le16(qp->vlan_id);
> +
> + resp = (struct creq_modify_qp_resp *)
> + bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> + NULL, 0);
> + if (!resp) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP send failed");
> + return -EINVAL;
> + }
> + /**/
> + if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> + /* Cmd timed out */
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP timed out");
> + return -ETIMEDOUT;
> + }
> + if (RCFW_RESP_STATUS(resp) ||
> + RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: MODIFY_QP failed ");
> + dev_err(&rcfw->pdev->dev,
> + "QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> + RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> + RCFW_RESP_COOKIE(resp));
> + return -EINVAL;
> + }
> + qp->cur_qp_state = qp->state;
> + return 0;
> +}
> +
> +int bnxt_qplib_query_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct cmdq_query_qp req;
> + struct creq_query_qp_resp *resp;
> + struct creq_query_qp_resp_sb *sb;
> + u16 cmd_flags = 0;
> + u32 temp32[4];
> + int i;
> +
> + RCFW_CMD_PREP(req, QUERY_QP, cmd_flags);
> +
> + req.qp_cid = cpu_to_le32(qp->id);
> + req.resp_size = sizeof(*sb) / BNXT_QPLIB_CMDQE_UNITS;
> + resp = (struct creq_query_qp_resp *)
> + bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> + (void **)&sb, 0);
> + if (!resp) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP send failed");
> + return -EINVAL;
> + }
> + /**/
> + if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> + /* Cmd timed out */
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP timed out");
> + return -ETIMEDOUT;
> + }
> + if (RCFW_RESP_STATUS(resp) ||
> + RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: QUERY_QP failed ");
> + dev_err(&rcfw->pdev->dev,
> + "QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> + RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> + RCFW_RESP_COOKIE(resp));
> + return -EINVAL;
> + }
> + /* Extract the context from the side buffer */
> + qp->state = sb->en_sqd_async_notify_state &
> + CREQ_QUERY_QP_RESP_SB_STATE_MASK;
> + qp->en_sqd_async_notify = sb->en_sqd_async_notify_state &
> + CREQ_QUERY_QP_RESP_SB_EN_SQD_ASYNC_NOTIFY ?
> + true : false;
> + qp->access = sb->access;
> + qp->pkey_index = le16_to_cpu(sb->pkey);
> + qp->qkey = le32_to_cpu(sb->qkey);
> +
> + temp32[0] = le32_to_cpu(sb->dgid[0]);
> + temp32[1] = le32_to_cpu(sb->dgid[1]);
> + temp32[2] = le32_to_cpu(sb->dgid[2]);
> + temp32[3] = le32_to_cpu(sb->dgid[3]);
> + memcpy(qp->ah.dgid.data, temp32, sizeof(qp->ah.dgid.data));
> +
> + qp->ah.flow_label = le32_to_cpu(sb->flow_label);
> +
> + qp->ah.sgid_index = 0;
> + for (i = 0; i < res->sgid_tbl.max; i++) {
> + if (res->sgid_tbl.hw_id[i] == le16_to_cpu(sb->sgid_index)) {
> + qp->ah.sgid_index = i;
> + break;
> + }
> + }
> + if (i == res->sgid_tbl.max)
> + dev_warn(&res->pdev->dev, "QPLIB: SGID not found??");
> +
> + qp->ah.hop_limit = sb->hop_limit;
> + qp->ah.traffic_class = sb->traffic_class;
> + memcpy(qp->ah.dmac, sb->dest_mac, 6);
> + qp->ah.vlan_id = le16_to_cpu((sb->path_mtu_dest_vlan_id &
> + CREQ_QUERY_QP_RESP_SB_VLAN_ID_MASK) >>
> + CREQ_QUERY_QP_RESP_SB_VLAN_ID_SFT);
> + qp->path_mtu = sb->path_mtu_dest_vlan_id &
> + CREQ_QUERY_QP_RESP_SB_PATH_MTU_MASK;
> + qp->timeout = sb->timeout;
> + qp->retry_cnt = sb->retry_cnt;
> + qp->rnr_retry = sb->rnr_retry;
> + qp->min_rnr_timer = sb->min_rnr_timer;
> + qp->rq.psn = le32_to_cpu(sb->rq_psn);
> + qp->max_rd_atomic = ORRQ_SLOTS_TO_ORD_LIMIT(sb->max_rd_atomic);
> + qp->sq.psn = le32_to_cpu(sb->sq_psn);
> + qp->max_dest_rd_atomic =
> + IRRQ_SLOTS_TO_IRD_LIMIT(sb->max_dest_rd_atomic);
> + qp->sq.max_wqe = qp->sq.hwq.max_elements;
> + qp->rq.max_wqe = qp->rq.hwq.max_elements;
> + qp->sq.max_sge = le16_to_cpu(sb->sq_sge);
> + qp->rq.max_sge = le32_to_cpu(sb->rq_sge);
> + qp->max_inline_data = le32_to_cpu(sb->max_inline_data);
> + qp->dest_qpn = le32_to_cpu(sb->dest_qp_id);
> + memcpy(qp->smac, sb->src_mac, 6);
> + qp->vlan_id = le16_to_cpu(sb->vlan_pcp_vlan_dei_vlan_id);
> + return 0;
> +}
> +
> +static void __clean_cq(struct bnxt_qplib_cq *cq, u64 qp)
> +{
> + struct bnxt_qplib_hwq *cq_hwq = &cq->hwq;
> + struct cq_base *hw_cqe, **hw_cqe_ptr;
> + int i;
> +
> + for (i = 0; i < cq_hwq->max_elements; i++) {
> + hw_cqe_ptr = (struct cq_base **)cq_hwq->pbl_ptr;
> + hw_cqe = &hw_cqe_ptr[CQE_PG(i)][CQE_IDX(i)];
> + if (!CQE_CMP_VALID(hw_cqe, i, cq_hwq->max_elements))
> + continue;
> + switch (hw_cqe->cqe_type_toggle & CQ_BASE_CQE_TYPE_MASK) {
> + case CQ_BASE_CQE_TYPE_REQ:
> + case CQ_BASE_CQE_TYPE_TERMINAL:
> + {
> + struct cq_req *cqe = (struct cq_req *)hw_cqe;
> +
> + if (qp == le64_to_cpu(cqe->qp_handle))
> + cqe->qp_handle = 0;
> + break;
> + }
> + case CQ_BASE_CQE_TYPE_RES_RC:
> + case CQ_BASE_CQE_TYPE_RES_UD:
> + case CQ_BASE_CQE_TYPE_RES_RAWETH_QP1:
> + {
> + struct cq_res_rc *cqe = (struct cq_res_rc *)hw_cqe;
> +
> + if (qp == le64_to_cpu(cqe->qp_handle))
> + cqe->qp_handle = 0;
> + break;
> + }
> + default:
> + break;
> + }
> + }
> +}
> +
> +static unsigned long bnxt_qplib_lock_cqs(struct bnxt_qplib_qp *qp)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&qp->scq->hwq.lock, flags);
> + if (qp->rcq && qp->rcq != qp->scq)
> + spin_lock(&qp->rcq->hwq.lock);
> +
> + return flags;
> +}
> +
> +static void bnxt_qplib_unlock_cqs(struct bnxt_qplib_qp *qp,
> + unsigned long flags)
> +{
> + if (qp->rcq && qp->rcq != qp->scq)
> + spin_unlock(&qp->rcq->hwq.lock);
> + spin_unlock_irqrestore(&qp->scq->hwq.lock, flags);
> +}
> +
> +int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res,
> + struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct cmdq_destroy_qp req;
> + struct creq_destroy_qp_resp *resp;
> + unsigned long flags;
> + u16 cmd_flags = 0;
> +
> + RCFW_CMD_PREP(req, DESTROY_QP, cmd_flags);
> +
> + req.qp_cid = cpu_to_le32(qp->id);
> + resp = (struct creq_destroy_qp_resp *)
> + bnxt_qplib_rcfw_send_message(rcfw, (void *)&req,
> + NULL, 0);
> + if (!resp) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP send failed");
> + return -EINVAL;
> + }
> + /**/
> + if (!bnxt_qplib_rcfw_wait_for_resp(rcfw, le16_to_cpu(req.cookie))) {
> + /* Cmd timed out */
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP timed out");
> + return -ETIMEDOUT;
> + }
> + if (RCFW_RESP_STATUS(resp) ||
> + RCFW_RESP_COOKIE(resp) != RCFW_CMDQ_COOKIE(req)) {
> + dev_err(&rcfw->pdev->dev, "QPLIB: FP: DESTROY_QP failed ");
> + dev_err(&rcfw->pdev->dev,
> + "QPLIB: with status 0x%x cmdq 0x%x resp 0x%x",
> + RCFW_RESP_STATUS(resp), RCFW_CMDQ_COOKIE(req),
> + RCFW_RESP_COOKIE(resp));
> + return -EINVAL;
> + }
> +
> + /* Must walk the associated CQs to nullified the QP ptr */
> + flags = bnxt_qplib_lock_cqs(qp);
> + __clean_cq(qp->scq, (u64)qp);
> + if (qp->rcq != qp->scq)
> + __clean_cq(qp->rcq, (u64)qp);
> + bnxt_qplib_unlock_cqs(qp, flags);
> +
> + bnxt_qplib_free_qp_hdr_buf(res, qp);
> + bnxt_qplib_free_hwq(res->pdev, &qp->sq.hwq);
> + kfree(qp->sq.swq);
> +
> + bnxt_qplib_free_hwq(res->pdev, &qp->rq.hwq);
> + kfree(qp->rq.swq);
> +
> + if (qp->irrq.max_elements)
> + bnxt_qplib_free_hwq(res->pdev, &qp->irrq);
> + if (qp->orrq.max_elements)
> + bnxt_qplib_free_hwq(res->pdev, &qp->orrq);
> +
> + return 0;
> +}
> +
> /* CQ */
>
> /* Spinlock must be held */
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> index 1991eaa..f6d2be5 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h
> @@ -38,8 +38,246 @@
>
> #ifndef __BNXT_QPLIB_FP_H__
> #define __BNXT_QPLIB_FP_H__
> +struct bnxt_qplib_sge {
> + u64 addr;
> + u32 lkey;
> + u32 size;
> +};
> +
> +#define BNXT_QPLIB_MAX_SQE_ENTRY_SIZE sizeof(struct sq_send)
> +
> +#define SQE_CNT_PER_PG (PAGE_SIZE / BNXT_QPLIB_MAX_SQE_ENTRY_SIZE)
> +#define SQE_MAX_IDX_PER_PG (SQE_CNT_PER_PG - 1)
> +#define SQE_PG(x) (((x) & ~SQE_MAX_IDX_PER_PG) / SQE_CNT_PER_PG)
> +#define SQE_IDX(x) ((x) & SQE_MAX_IDX_PER_PG)
> +
> +#define BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE sizeof(struct sq_psn_search)
> +
> +#define PSNE_CNT_PER_PG (PAGE_SIZE / BNXT_QPLIB_MAX_PSNE_ENTRY_SIZE)
> +#define PSNE_MAX_IDX_PER_PG (PSNE_CNT_PER_PG - 1)
> +#define PSNE_PG(x) (((x) & ~PSNE_MAX_IDX_PER_PG) / PSNE_CNT_PER_PG)
> +#define PSNE_IDX(x) ((x) & PSNE_MAX_IDX_PER_PG)
> +
> +#define BNXT_QPLIB_QP_MAX_SGL 6
> +
> +struct bnxt_qplib_swq {
> + u64 wr_id;
> + u8 type;
> + u8 flags;
> + u32 start_psn;
> + u32 next_psn;
> + struct sq_psn_search *psn_search;
> +};
> +
> +struct bnxt_qplib_swqe {
> + /* General */
> + u64 wr_id;
> + u8 reqs_type;
> + u8 type;
> +#define BNXT_QPLIB_SWQE_TYPE_SEND 0
> +#define BNXT_QPLIB_SWQE_TYPE_SEND_WITH_IMM 1
> +#define BNXT_QPLIB_SWQE_TYPE_SEND_WITH_INV 2
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_WRITE 4
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_WRITE_WITH_IMM 5
> +#define BNXT_QPLIB_SWQE_TYPE_RDMA_READ 6
> +#define BNXT_QPLIB_SWQE_TYPE_ATOMIC_CMP_AND_SWP 8
> +#define BNXT_QPLIB_SWQE_TYPE_ATOMIC_FETCH_AND_ADD 11
> +#define BNXT_QPLIB_SWQE_TYPE_LOCAL_INV 12
> +#define BNXT_QPLIB_SWQE_TYPE_FAST_REG_MR 13
> +#define BNXT_QPLIB_SWQE_TYPE_REG_MR 13
> +#define BNXT_QPLIB_SWQE_TYPE_BIND_MW 14
> +#define BNXT_QPLIB_SWQE_TYPE_RECV 128
> +#define BNXT_QPLIB_SWQE_TYPE_RECV_RDMA_IMM 129
> + u8 flags;
> +#define BNXT_QPLIB_SWQE_FLAGS_SIGNAL_COMP BIT(0)
> +#define BNXT_QPLIB_SWQE_FLAGS_RD_ATOMIC_FENCE BIT(1)
> +#define BNXT_QPLIB_SWQE_FLAGS_UC_FENCE BIT(2)
> +#define BNXT_QPLIB_SWQE_FLAGS_SOLICIT_EVENT BIT(3)
> +#define BNXT_QPLIB_SWQE_FLAGS_INLINE BIT(4)
> + struct bnxt_qplib_sge sg_list[BNXT_QPLIB_QP_MAX_SGL];
> + int num_sge;
> + /* Max inline data is 96 bytes */
> + u32 inline_len;
> +#define BNXT_QPLIB_SWQE_MAX_INLINE_LENGTH 96
> + u8 inline_data[BNXT_QPLIB_SWQE_MAX_INLINE_LENGTH];
> +
> + union {
> + /* Send, with imm, inval key */
> + struct {
> + u32 imm_data_or_inv_key;
> + u32 q_key;
> + u32 dst_qp;
> + u16 avid;
> + } send;
> +
> + /* Send Raw Ethernet and QP1 */
> + struct {
> + u16 lflags;
> + u16 cfa_action;
> + u32 cfa_meta;
> + } rawqp1;
> +
> + /* RDMA write, with imm, read */
> + struct {
> + u32 imm_data_or_inv_key;
> + u64 remote_va;
> + u32 r_key;
> + } rdma;
> +
> + /* Atomic cmp/swap, fetch/add */
> + struct {
> + u64 remote_va;
> + u32 r_key;
> + u64 swap_data;
> + u64 cmp_data;
> + } atomic;
> +
> + /* Local Invalidate */
> + struct {
> + u32 inv_l_key;
> + } local_inv;
> +
> + /* FR-PMR */
> + struct {
> + u8 access_cntl;
> + u8 pg_sz_log;
> + bool zero_based;
> + u32 l_key;
> + u32 length;
> + u8 pbl_pg_sz_log;
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_4K 0
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_8K 1
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_64K 4
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_256K 6
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_1M 8
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_2M 9
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_4M 10
> +#define BNXT_QPLIB_SWQE_PAGE_SIZE_1G 18
> + u8 levels;
> +#define PAGE_SHIFT_4K 12
> + u64 *pbl_ptr;
> + dma_addr_t pbl_dma_ptr;
> + u64 *page_list;
> + u16 page_list_len;
> + u64 va;
> + } frmr;
> +
> + /* Bind */
> + struct {
> + u8 access_cntl;
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_LOCAL_WRITE BIT(0)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_READ BIT(1)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_WRITE BIT(2)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_REMOTE_ATOMIC BIT(3)
> +#define BNXT_QPLIB_BIND_SWQE_ACCESS_WINDOW_BIND BIT(4)
> + bool zero_based;
> + u8 mw_type;
> + u32 parent_l_key;
> + u32 r_key;
> + u64 va;
> + u32 length;
> + } bind;
> + };
> +};
> +
> +#define BNXT_QPLIB_MAX_RQE_ENTRY_SIZE sizeof(struct rq_wqe)
> +
> +#define RQE_CNT_PER_PG (PAGE_SIZE / BNXT_QPLIB_MAX_RQE_ENTRY_SIZE)
> +#define RQE_MAX_IDX_PER_PG (RQE_CNT_PER_PG - 1)
> +#define RQE_PG(x) (((x) & ~RQE_MAX_IDX_PER_PG) / RQE_CNT_PER_PG)
> +#define RQE_IDX(x) ((x) & RQE_MAX_IDX_PER_PG)
> +
> +struct bnxt_qplib_q {
> + struct bnxt_qplib_hwq hwq;
> + struct bnxt_qplib_swq *swq;
> + struct scatterlist *sglist;
> + u32 nmap;
> + u32 max_wqe;
> + u16 max_sge;
> + u32 psn;
> + bool flush_in_progress;
> +};
> +
> +struct bnxt_qplib_qp {
> + struct bnxt_qplib_pd *pd;
> + struct bnxt_qplib_dpi *dpi;
> + u64 qp_handle;
> + u32 id;
> + u8 type;
> + u8 sig_type;
> + u64 modify_flags;
> + u8 state;
> + u8 cur_qp_state;
> + u32 max_inline_data;
> + u32 mtu;
> + u32 path_mtu;
> + bool en_sqd_async_notify;
> + u16 pkey_index;
> + u32 qkey;
> + u32 dest_qp_id;
> + u8 access;
> + u8 timeout;
> + u8 retry_cnt;
> + u8 rnr_retry;
> + u32 min_rnr_timer;
> + u32 max_rd_atomic;
> + u32 max_dest_rd_atomic;
> + u32 dest_qpn;
> + u8 smac[6];
> + u16 vlan_id;
> + u8 nw_type;
> + struct bnxt_qplib_ah ah;
> +
> +#define BTH_PSN_MASK ((1 << 24) - 1)
> + /* SQ */
> + struct bnxt_qplib_q sq;
> + /* RQ */
> + struct bnxt_qplib_q rq;
> + /* SRQ */
> + struct bnxt_qplib_srq *srq;
> + /* CQ */
> + struct bnxt_qplib_cq *scq;
> + struct bnxt_qplib_cq *rcq;
> + /* IRRQ and ORRQ */
> + struct bnxt_qplib_hwq irrq;
> + struct bnxt_qplib_hwq orrq;
> + /* Header buffer for QP1 */
> + int sq_hdr_buf_size;
> + int rq_hdr_buf_size;
> +/*
> + * Buffer space for ETH(14), IP or GRH(40), UDP header(8)
> + * and ib_bth + ib_deth (20).
> + * Max required is 82 when RoCE V2 is enabled
> + */
> +#define BNXT_QPLIB_MAX_QP1_SQ_HDR_SIZE_V2 86
> + /* Ethernet header = 14 */
> + /* ib_grh = 40 (provided by MAD) */
> + /* ib_bth + ib_deth = 20 */
> + /* MAD = 256 (provided by MAD) */
> + /* iCRC = 4 */
> +#define BNXT_QPLIB_MAX_QP1_RQ_ETH_HDR_SIZE 14
> +#define BNXT_QPLIB_MAX_QP1_RQ_HDR_SIZE_V2 512
> +#define BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV4 20
> +#define BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV6 40
> +#define BNXT_QPLIB_MAX_QP1_RQ_BDETH_HDR_SIZE 20
> + void *sq_hdr_buf;
> + dma_addr_t sq_hdr_buf_map;
> + void *rq_hdr_buf;
> + dma_addr_t rq_hdr_buf_map;
> +};
> +
> #define BNXT_QPLIB_MAX_CQE_ENTRY_SIZE sizeof(struct cq_base)
>
> +#define CQE_CNT_PER_PG (PAGE_SIZE / BNXT_QPLIB_MAX_CQE_ENTRY_SIZE)
> +#define CQE_MAX_IDX_PER_PG (CQE_CNT_PER_PG - 1)
> +#define CQE_PG(x) (((x) & ~CQE_MAX_IDX_PER_PG) / CQE_CNT_PER_PG)
> +#define CQE_IDX(x) ((x) & CQE_MAX_IDX_PER_PG)
> +
> +#define ROCE_CQE_CMP_V 0
> +#define CQE_CMP_VALID(hdr, raw_cons, cp_bit) \
> + (!!((hdr)->cqe_type_toggle & CQ_BASE_TOGGLE) == \
> + !((raw_cons) & (cp_bit)))
> +
> struct bnxt_qplib_cqe {
> u8 status;
> u8 type;
> @@ -82,6 +320,13 @@ struct bnxt_qplib_cq {
> wait_queue_head_t waitq;
> };
>
> +#define BNXT_QPLIB_MAX_IRRQE_ENTRY_SIZE sizeof(struct xrrq_irrq)
> +#define BNXT_QPLIB_MAX_ORRQE_ENTRY_SIZE sizeof(struct xrrq_orrq)
> +#define IRD_LIMIT_TO_IRRQ_SLOTS(x) (2 * (x) + 2)
> +#define IRRQ_SLOTS_TO_IRD_LIMIT(s) (((s) >> 1) - 1)
> +#define ORD_LIMIT_TO_ORRQ_SLOTS(x) ((x) + 1)
> +#define ORRQ_SLOTS_TO_ORD_LIMIT(s) ((s) - 1)
> +
> #define BNXT_QPLIB_MAX_NQE_ENTRY_SIZE sizeof(struct nq_base)
>
> #define NQE_CNT_PER_PG (PAGE_SIZE / BNXT_QPLIB_MAX_NQE_ENTRY_SIZE)
> @@ -140,6 +385,11 @@ int bnxt_qplib_enable_nq(struct pci_dev *pdev, struct bnxt_qplib_nq *nq,
> int (*srqn_handler)(struct bnxt_qplib_nq *nq,
> void *srq,
> u8 event));
> +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_create_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_modify_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_query_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> +int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res, struct bnxt_qplib_qp *qp);
> int bnxt_qplib_create_cq(struct bnxt_qplib_res *res, struct bnxt_qplib_cq *cq);
> int bnxt_qplib_destroy_cq(struct bnxt_qplib_res *res, struct bnxt_qplib_cq *cq);
>
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re.h b/drivers/infiniband/hw/bnxtre/bnxt_re.h
> index 3a93a88..84af86b 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re.h
> @@ -64,6 +64,14 @@ struct bnxt_re_work {
> struct net_device *vlan_dev;
> };
>
> +struct bnxt_re_sqp_entries {
> + struct bnxt_qplib_sge sge;
> + u64 wrid;
> + /* For storing the actual qp1 cqe */
> + struct bnxt_qplib_cqe cqe;
> + struct bnxt_re_qp *qp1_qp;
> +};
> +
> #define BNXT_RE_MIN_MSIX 2
> #define BNXT_RE_MAX_MSIX 16
> #define BNXT_RE_AEQ_IDX 0
> @@ -112,6 +120,12 @@ struct bnxt_re_dev {
> atomic_t mw_count;
> /* Max of 2 lossless traffic class supported per port */
> u16 cosq[2];
> +
> + /* QP for for handling QP1 packets */
> + u32 sqp_id;
> + struct bnxt_re_qp *qp1_sqp;
> + struct bnxt_re_ah *sqp_ah;
> + struct bnxt_re_sqp_entries sqp_tbl[1024];
> };
>
> #define to_bnxt_re(ptr, type, member) \
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> index 5e41317..77860a2 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> @@ -649,6 +649,481 @@ int bnxt_re_query_ah(struct ib_ah *ib_ah, struct ib_ah_attr *ah_attr)
> return 0;
> }
>
> +/* Queue Pairs */
> +int bnxt_re_destroy_qp(struct ib_qp *ib_qp)
> +{
> + struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> + struct bnxt_re_dev *rdev = qp->rdev;
> + int rc;
> +
> + rc = bnxt_qplib_destroy_qp(&rdev->qplib_res, &qp->qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to destroy HW QP");
> + return rc;
> + }
> + if (ib_qp->qp_type == IB_QPT_GSI && rdev->qp1_sqp) {
> + rc = bnxt_qplib_destroy_ah(&rdev->qplib_res,
> + &rdev->sqp_ah->qplib_ah);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev),
> + "Failed to destroy HW AH for shadow QP");
> + return rc;
> + }
> +
> + rc = bnxt_qplib_destroy_qp(&rdev->qplib_res,
> + &rdev->qp1_sqp->qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev),
> + "Failed to destroy Shadow QP");
> + return rc;
> + }
> + mutex_lock(&rdev->qp_lock);
> + list_del(&rdev->qp1_sqp->list);
> + atomic_dec(&rdev->qp_count);
> + mutex_unlock(&rdev->qp_lock);
> +
> + kfree(rdev->sqp_ah);
> + kfree(rdev->qp1_sqp);
> + }
> +
> + if (qp->rumem && !IS_ERR(qp->rumem))
> + ib_umem_release(qp->rumem);
> + if (qp->sumem && !IS_ERR(qp->sumem))
> + ib_umem_release(qp->sumem);
> +
> + mutex_lock(&rdev->qp_lock);
> + list_del(&qp->list);
> + atomic_dec(&rdev->qp_count);
> + mutex_unlock(&rdev->qp_lock);
> + kfree(qp);
> + return 0;
> +}
> +
> +static u8 __from_ib_qp_type(enum ib_qp_type type)
> +{
> + switch (type) {
> + case IB_QPT_GSI:
> + return CMDQ_CREATE_QP1_TYPE_GSI;
> + case IB_QPT_RC:
> + return CMDQ_CREATE_QP_TYPE_RC;
> + case IB_QPT_UD:
> + return CMDQ_CREATE_QP_TYPE_UD;
> + case IB_QPT_RAW_ETHERTYPE:
> + return CMDQ_CREATE_QP_TYPE_RAW_ETHERTYPE;
> + default:
> + return IB_QPT_MAX;
> + }
> +}
> +
> +static int bnxt_re_init_user_qp(struct bnxt_re_dev *rdev, struct bnxt_re_pd *pd,
> + struct bnxt_re_qp *qp, struct ib_udata *udata)
> +{
> + struct bnxt_re_qp_req ureq;
> + struct bnxt_qplib_qp *qplib_qp = &qp->qplib_qp;
> + struct ib_umem *umem;
> + int bytes = 0;
> + struct ib_ucontext *context = pd->ib_pd.uobject->context;
> + struct bnxt_re_ucontext *cntx = to_bnxt_re(context,
> + struct bnxt_re_ucontext,
> + ib_uctx);
> + if (ib_copy_from_udata(&ureq, udata, sizeof(ureq)))
> + return -EFAULT;
> +
> + bytes = (qplib_qp->sq.max_wqe * BNXT_QPLIB_MAX_SQE_ENTRY_SIZE);
> + /* Consider mapping PSN search memory only for RC QPs. */
> + if (qplib_qp->type == CMDQ_CREATE_QP_TYPE_RC)
> + bytes += (qplib_qp->sq.max_wqe * sizeof(struct sq_psn_search));
> + bytes = PAGE_ALIGN(bytes);
> + umem = ib_umem_get(context, ureq.qpsva, bytes,
> + IB_ACCESS_LOCAL_WRITE, 1);
> + if (IS_ERR(umem))
> + return PTR_ERR(umem);
> +
> + qp->sumem = umem;
> + qplib_qp->sq.sglist = umem->sg_head.sgl;
> + qplib_qp->sq.nmap = umem->nmap;
> + qplib_qp->qp_handle = ureq.qp_handle;
> +
> + if (!qp->qplib_qp.srq) {
> + bytes = (qplib_qp->rq.max_wqe * BNXT_QPLIB_MAX_RQE_ENTRY_SIZE);
> + bytes = PAGE_ALIGN(bytes);
> + umem = ib_umem_get(context, ureq.qprva, bytes,
> + IB_ACCESS_LOCAL_WRITE, 1);
> + if (IS_ERR(umem))
> + goto rqfail;
> + qp->rumem = umem;
> + qplib_qp->rq.sglist = umem->sg_head.sgl;
> + qplib_qp->rq.nmap = umem->nmap;
> + }
> +
> + qplib_qp->dpi = cntx->dpi;
> + return 0;
> +rqfail:
> + ib_umem_release(qp->sumem);
> + qp->sumem = NULL;
> + qplib_qp->sq.sglist = NULL;
> + qplib_qp->sq.nmap = 0;
> +
> + return PTR_ERR(umem);
> +}
> +
> +static struct bnxt_re_ah *bnxt_re_create_shadow_qp_ah(struct bnxt_re_pd *pd,
> + struct bnxt_qplib_res *qp1_res,
> + struct bnxt_qplib_qp *qp1_qp)
> +{
> + struct bnxt_re_dev *rdev = pd->rdev;
> + struct bnxt_re_ah *ah;
> + union ib_gid sgid;
> + int rc;
> +
> + ah = kzalloc(sizeof(*ah), GFP_KERNEL);
> + if (!ah)
> + return NULL;
> +
> + memset(ah, 0, sizeof(*ah));
> + ah->rdev = rdev;
> + ah->qplib_ah.pd = &pd->qplib_pd;
> +
> + rc = bnxt_re_query_gid(&rdev->ibdev, 1, 0, &sgid);
> + if (rc)
> + goto fail;
> +
> + /* supply the dgid data same as sgid */
> + memcpy(ah->qplib_ah.dgid.data, &sgid.raw,
> + sizeof(union ib_gid));
> + ah->qplib_ah.sgid_index = 0;
> +
> + ah->qplib_ah.traffic_class = 0;
> + ah->qplib_ah.flow_label = 0;
> + ah->qplib_ah.hop_limit = 1;
> + ah->qplib_ah.sl = 0;
> + /* Have DMAC same as SMAC */
> + ether_addr_copy(ah->qplib_ah.dmac, rdev->netdev->dev_addr);
> +
> + rc = bnxt_qplib_create_ah(&rdev->qplib_res, &ah->qplib_ah);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev),
> + "Failed to allocate HW AH for Shadow QP");
> + goto fail;
> + }
> +
> + return ah;
> +
> +fail:
> + kfree(ah);
> + return NULL;
> +}
> +
> +static struct bnxt_re_qp *bnxt_re_create_shadow_qp(struct bnxt_re_pd *pd,
> + struct bnxt_qplib_res *qp1_res,
> + struct bnxt_qplib_qp *qp1_qp)
> +{
> + struct bnxt_re_dev *rdev = pd->rdev;
> + struct bnxt_re_qp *qp;
> + int rc;
> +
> + qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> + if (!qp)
> + return NULL;
> +
> + memset(qp, 0, sizeof(*qp));
> + qp->rdev = rdev;
> +
> + /* Initialize the shadow QP structure from the QP1 values */
> + ether_addr_copy(qp->qplib_qp.smac, rdev->netdev->dev_addr);
> +
> + qp->qplib_qp.pd = &pd->qplib_pd;
> + qp->qplib_qp.qp_handle = (u64)&qp->qplib_qp;
> + qp->qplib_qp.type = IB_QPT_UD;
> +
> + qp->qplib_qp.max_inline_data = 0;
> + qp->qplib_qp.sig_type = true;
> +
> + /* Shadow QP SQ depth should be same as QP1 RQ depth */
> + qp->qplib_qp.sq.max_wqe = qp1_qp->rq.max_wqe;
> + qp->qplib_qp.sq.max_sge = 2;
> +
> + qp->qplib_qp.scq = qp1_qp->scq;
> + qp->qplib_qp.rcq = qp1_qp->rcq;
> +
> + qp->qplib_qp.rq.max_wqe = qp1_qp->rq.max_wqe;
> + qp->qplib_qp.rq.max_sge = qp1_qp->rq.max_sge;
> +
> + qp->qplib_qp.mtu = qp1_qp->mtu;
> +
> + qp->qplib_qp.sq_hdr_buf_size = 0;
> + qp->qplib_qp.rq_hdr_buf_size = BNXT_QPLIB_MAX_GRH_HDR_SIZE_IPV6;
> + qp->qplib_qp.dpi = &rdev->dpi_privileged;
> +
> + rc = bnxt_qplib_create_qp(qp1_res, &qp->qplib_qp);
> + if (rc)
> + goto fail;
> +
> + rdev->sqp_id = qp->qplib_qp.id;
> +
> + spin_lock_init(&qp->sq_lock);
> + INIT_LIST_HEAD(&qp->list);
> + mutex_lock(&rdev->qp_lock);
> + list_add_tail(&qp->list, &rdev->qp_list);
> + atomic_inc(&rdev->qp_count);
> + mutex_unlock(&rdev->qp_lock);
> + return qp;
> +fail:
> + kfree(qp);
> + return NULL;
> +}
> +
> +struct ib_qp *bnxt_re_create_qp(struct ib_pd *ib_pd,
> + struct ib_qp_init_attr *qp_init_attr,
> + struct ib_udata *udata)
> +{
> + struct bnxt_re_pd *pd = to_bnxt_re(ib_pd, struct bnxt_re_pd, ib_pd);
> + struct bnxt_re_dev *rdev = pd->rdev;
> + struct bnxt_qplib_dev_attr *dev_attr = &rdev->dev_attr;
> + struct bnxt_re_qp *qp;
> + struct bnxt_re_srq *srq;
> + struct bnxt_re_cq *cq;
> + int rc, entries;
> +
> + if ((qp_init_attr->cap.max_send_wr > dev_attr->max_qp_wqes) ||
> + (qp_init_attr->cap.max_recv_wr > dev_attr->max_qp_wqes) ||
> + (qp_init_attr->cap.max_send_sge > dev_attr->max_qp_sges) ||
> + (qp_init_attr->cap.max_recv_sge > dev_attr->max_qp_sges) ||
> + (qp_init_attr->cap.max_inline_data > dev_attr->max_inline_data))
> + return ERR_PTR(-EINVAL);
> +
> + qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> + if (!qp)
> + return ERR_PTR(-ENOMEM);
> +
> + qp->rdev = rdev;
> + ether_addr_copy(qp->qplib_qp.smac, rdev->netdev->dev_addr);
> + qp->qplib_qp.pd = &pd->qplib_pd;
> + qp->qplib_qp.qp_handle = (u64)&qp->qplib_qp;
> + qp->qplib_qp.type = __from_ib_qp_type(qp_init_attr->qp_type);
> + if (qp->qplib_qp.type == IB_QPT_MAX) {
> + dev_err(rdev_to_dev(rdev), "QP type 0x%x not supported",
> + qp->qplib_qp.type);
> + rc = -EINVAL;
> + goto fail;
> + }
> + qp->qplib_qp.max_inline_data = qp_init_attr->cap.max_inline_data;
> + qp->qplib_qp.sig_type = ((qp_init_attr->sq_sig_type ==
> + IB_SIGNAL_ALL_WR) ? true : false);
> +
> + entries = roundup_pow_of_two(qp_init_attr->cap.max_send_wr + 1);
> + if (entries > dev_attr->max_qp_wqes + 1)
> + entries = dev_attr->max_qp_wqes + 1;
> + qp->qplib_qp.sq.max_wqe = entries;
> +
> + qp->qplib_qp.sq.max_sge = qp_init_attr->cap.max_send_sge;
> + if (qp->qplib_qp.sq.max_sge > dev_attr->max_qp_sges)
> + qp->qplib_qp.sq.max_sge = dev_attr->max_qp_sges;
> +
> + if (qp_init_attr->send_cq) {
> + cq = to_bnxt_re(qp_init_attr->send_cq, struct bnxt_re_cq,
> + ib_cq);
> + if (!cq) {
> + dev_err(rdev_to_dev(rdev), "Send CQ not found");
> + rc = -EINVAL;
> + goto fail;
> + }
> + qp->qplib_qp.scq = &cq->qplib_cq;
> + }
> +
> + if (qp_init_attr->recv_cq) {
> + cq = to_bnxt_re(qp_init_attr->recv_cq, struct bnxt_re_cq,
> + ib_cq);
> + if (!cq) {
> + dev_err(rdev_to_dev(rdev), "Receive CQ not found");
> + rc = -EINVAL;
> + goto fail;
> + }
> + qp->qplib_qp.rcq = &cq->qplib_cq;
> + }
> +
> + if (qp_init_attr->srq) {
> + dev_err(rdev_to_dev(rdev), "SRQ not supported");
> + rc = -ENOTSUPP;
> + goto fail;
> + } else {
> + /* Allocate 1 more than what's provided so posting max doesn't
> + * mean empty
> + */
> + entries = roundup_pow_of_two(qp_init_attr->cap.max_recv_wr + 1);
> + if (entries > dev_attr->max_qp_wqes + 1)
> + entries = dev_attr->max_qp_wqes + 1;
> + qp->qplib_qp.rq.max_wqe = entries;
> +
> + qp->qplib_qp.rq.max_sge = qp_init_attr->cap.max_recv_sge;
> + if (qp->qplib_qp.rq.max_sge > dev_attr->max_qp_sges)
> + qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> + }
> +
> + qp->qplib_qp.mtu = ib_mtu_enum_to_int(iboe_get_mtu(rdev->netdev->mtu));
> +
> + if (qp_init_attr->qp_type == IB_QPT_GSI) {
> + qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> + if (qp->qplib_qp.rq.max_sge > dev_attr->max_qp_sges)
> + qp->qplib_qp.rq.max_sge = dev_attr->max_qp_sges;
> + qp->qplib_qp.sq.max_sge++;
> + if (qp->qplib_qp.sq.max_sge > dev_attr->max_qp_sges)
> + qp->qplib_qp.sq.max_sge = dev_attr->max_qp_sges;
> +
> + qp->qplib_qp.rq_hdr_buf_size =
> + BNXT_QPLIB_MAX_QP1_RQ_HDR_SIZE_V2;
> +
> + qp->qplib_qp.sq_hdr_buf_size =
> + BNXT_QPLIB_MAX_QP1_SQ_HDR_SIZE_V2;
> + qp->qplib_qp.dpi = &rdev->dpi_privileged;
> + rc = bnxt_qplib_create_qp1(&rdev->qplib_res, &qp->qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to create HW QP1");
> + goto fail;
> + }
> + /* Create a shadow QP to handle the QP1 traffic */
> + rdev->qp1_sqp = bnxt_re_create_shadow_qp(pd, &rdev->qplib_res,
> + &qp->qplib_qp);
> + if (!rdev->qp1_sqp) {
> + rc = -EINVAL;
> + dev_err(rdev_to_dev(rdev),
> + "Failed to create Shadow QP for QP1");
> + goto qp_destroy;
> + }
> + rdev->sqp_ah = bnxt_re_create_shadow_qp_ah(pd, &rdev->qplib_res,
> + &qp->qplib_qp);
> + if (!rdev->sqp_ah) {
> + bnxt_qplib_destroy_qp(&rdev->qplib_res,
> + &rdev->qp1_sqp->qplib_qp);
> + rc = -EINVAL;
> + dev_err(rdev_to_dev(rdev),
> + "Failed to create AH entry for ShadowQP");
> + goto qp_destroy;
> + }
> +
> + } else {
> + qp->qplib_qp.max_rd_atomic = dev_attr->max_qp_rd_atom;
> + qp->qplib_qp.max_dest_rd_atomic = dev_attr->max_qp_init_rd_atom;
> + if (udata) {
> + rc = bnxt_re_init_user_qp(rdev, pd, qp, udata);
> + if (rc)
> + goto fail;
> + } else {
> + qp->qplib_qp.dpi = &rdev->dpi_privileged;
> + }
> +
> + rc = bnxt_qplib_create_qp(&rdev->qplib_res, &qp->qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to create HW QP");
> + goto fail;
> + }
> + }
> +
> + qp->ib_qp.qp_num = qp->qplib_qp.id;
> + spin_lock_init(&qp->sq_lock);
> +
> + if (udata) {
> + struct bnxt_re_qp_resp resp;
> +
> + resp.qpid = qp->ib_qp.qp_num;
> + rc = bnxt_re_copy_to_udata(rdev, &resp, sizeof(resp), udata);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to copy QP udata");
> + goto qp_destroy;
> + }
> + }
> + INIT_LIST_HEAD(&qp->list);
> + mutex_lock(&rdev->qp_lock);
> + list_add_tail(&qp->list, &rdev->qp_list);
> + atomic_inc(&rdev->qp_count);
> + mutex_unlock(&rdev->qp_lock);
> +
> + return &qp->ib_qp;
> +qp_destroy:
> + bnxt_qplib_destroy_qp(&rdev->qplib_res, &qp->qplib_qp);
> +fail:
> + kfree(qp);
> + return ERR_PTR(rc);
> +}
> +
> +static u8 __from_ib_qp_state(enum ib_qp_state state)
> +{
> + switch (state) {
> + case IB_QPS_RESET:
> + return CMDQ_MODIFY_QP_NEW_STATE_RESET;
> + case IB_QPS_INIT:
> + return CMDQ_MODIFY_QP_NEW_STATE_INIT;
> + case IB_QPS_RTR:
> + return CMDQ_MODIFY_QP_NEW_STATE_RTR;
> + case IB_QPS_RTS:
> + return CMDQ_MODIFY_QP_NEW_STATE_RTS;
> + case IB_QPS_SQD:
> + return CMDQ_MODIFY_QP_NEW_STATE_SQD;
> + case IB_QPS_SQE:
> + return CMDQ_MODIFY_QP_NEW_STATE_SQE;
> + case IB_QPS_ERR:
> + default:
> + return CMDQ_MODIFY_QP_NEW_STATE_ERR;
> + }
> +}
> +
> +static enum ib_qp_state __to_ib_qp_state(u8 state)
> +{
> + switch (state) {
> + case CMDQ_MODIFY_QP_NEW_STATE_RESET:
> + return IB_QPS_RESET;
> + case CMDQ_MODIFY_QP_NEW_STATE_INIT:
> + return IB_QPS_INIT;
> + case CMDQ_MODIFY_QP_NEW_STATE_RTR:
> + return IB_QPS_RTR;
> + case CMDQ_MODIFY_QP_NEW_STATE_RTS:
> + return IB_QPS_RTS;
> + case CMDQ_MODIFY_QP_NEW_STATE_SQD:
> + return IB_QPS_SQD;
> + case CMDQ_MODIFY_QP_NEW_STATE_SQE:
> + return IB_QPS_SQE;
> + case CMDQ_MODIFY_QP_NEW_STATE_ERR:
> + default:
> + return IB_QPS_ERR;
> + }
> +}
> +
> +static u32 __from_ib_mtu(enum ib_mtu mtu)
> +{
> + switch (mtu) {
> + case IB_MTU_256:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_256;
> + case IB_MTU_512:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_512;
> + case IB_MTU_1024:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_1024;
> + case IB_MTU_2048:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> + case IB_MTU_4096:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_4096;
> + default:
> + return CMDQ_MODIFY_QP_PATH_MTU_MTU_2048;
> + }
> +}
> +
> +static enum ib_mtu __to_ib_mtu(u32 mtu)
> +{
> + switch (mtu & CREQ_QUERY_QP_RESP_SB_PATH_MTU_MASK) {
> + case CMDQ_MODIFY_QP_PATH_MTU_MTU_256:
> + return IB_MTU_256;
> + case CMDQ_MODIFY_QP_PATH_MTU_MTU_512:
> + return IB_MTU_512;
> + case CMDQ_MODIFY_QP_PATH_MTU_MTU_1024:
> + return IB_MTU_1024;
> + case CMDQ_MODIFY_QP_PATH_MTU_MTU_2048:
> + return IB_MTU_2048;
> + case CMDQ_MODIFY_QP_PATH_MTU_MTU_4096:
> + return IB_MTU_4096;
> + default:
> + return IB_MTU_2048;
> + }
> +}
> +
> static int __from_ib_access_flags(int iflags)
> {
> int qflags = 0;
> @@ -690,6 +1165,293 @@ static enum ib_access_flags __to_ib_access_flags(int qflags)
> iflags |= IB_ACCESS_ON_DEMAND;
> return iflags;
> };
> +
> +static int bnxt_re_modify_shadow_qp(struct bnxt_re_dev *rdev,
> + struct bnxt_re_qp *qp1_qp,
> + int qp_attr_mask)
> +{
> + struct bnxt_re_qp *qp = rdev->qp1_sqp;
> + int rc = 0;
> +
> + if (qp_attr_mask & IB_QP_STATE) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_STATE;
> + qp->qplib_qp.state = qp1_qp->qplib_qp.state;
> + }
> + if (qp_attr_mask & IB_QP_PKEY_INDEX) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_PKEY;
> + qp->qplib_qp.pkey_index = qp1_qp->qplib_qp.pkey_index;
> + }
> +
> + if (qp_attr_mask & IB_QP_QKEY) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_QKEY;
> + /* Using a Random QKEY */
> + qp->qplib_qp.qkey = 0x81818181;
> + }
> + if (qp_attr_mask & IB_QP_SQ_PSN) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN;
> + qp->qplib_qp.sq.psn = qp1_qp->qplib_qp.sq.psn;
> + }
> +
> + rc = bnxt_qplib_modify_qp(&rdev->qplib_res, &qp->qplib_qp);
> + if (rc)
> + dev_err(rdev_to_dev(rdev),
> + "Failed to modify Shadow QP for QP1");
> + return rc;
> +}
> +
> +int bnxt_re_modify_qp(struct ib_qp *ib_qp, struct ib_qp_attr *qp_attr,
> + int qp_attr_mask, struct ib_udata *udata)
> +{
> + struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> + struct bnxt_re_dev *rdev = qp->rdev;
> + struct bnxt_qplib_dev_attr *dev_attr = &rdev->dev_attr;
> + enum ib_qp_state curr_qp_state, new_qp_state;
> + int rc, entries;
> + int status;
> + union ib_gid sgid;
> + struct ib_gid_attr sgid_attr;
> + u8 nw_type;
> +
> + qp->qplib_qp.modify_flags = 0;
> + if (qp_attr_mask & IB_QP_STATE) {
> + curr_qp_state = __to_ib_qp_state(qp->qplib_qp.cur_qp_state);
> + new_qp_state = qp_attr->qp_state;
> + if (!ib_modify_qp_is_ok(curr_qp_state, new_qp_state,
> + ib_qp->qp_type, qp_attr_mask,
> + IB_LINK_LAYER_ETHERNET)) {
> + dev_err(rdev_to_dev(rdev),
> + "Invalid attribute mask: %#x specified ",
> + qp_attr_mask);
> + dev_err(rdev_to_dev(rdev),
> + "for qpn: %#x type: %#x",
> + ib_qp->qp_num, ib_qp->qp_type);
> + dev_err(rdev_to_dev(rdev),
> + "curr_qp_state=0x%x, new_qp_state=0x%x\n",
> + curr_qp_state, new_qp_state);
> + return -EINVAL;
> + }
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_STATE;
> + qp->qplib_qp.state = __from_ib_qp_state(qp_attr->qp_state);
> + }
> + if (qp_attr_mask & IB_QP_EN_SQD_ASYNC_NOTIFY) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_EN_SQD_ASYNC_NOTIFY;
> + qp->qplib_qp.en_sqd_async_notify = true;
> + }
> + if (qp_attr_mask & IB_QP_ACCESS_FLAGS) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_ACCESS;
> + qp->qplib_qp.access =
> + __from_ib_access_flags(qp_attr->qp_access_flags);
> + /* LOCAL_WRITE access must be set to allow RC receive */
> + qp->qplib_qp.access |= BNXT_QPLIB_ACCESS_LOCAL_WRITE;
> + }
> + if (qp_attr_mask & IB_QP_PKEY_INDEX) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_PKEY;
> + qp->qplib_qp.pkey_index = qp_attr->pkey_index;
> + }
> + if (qp_attr_mask & IB_QP_QKEY) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_QKEY;
> + qp->qplib_qp.qkey = qp_attr->qkey;
> + }
> + if (qp_attr_mask & IB_QP_AV) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_DGID |
> + CMDQ_MODIFY_QP_MODIFY_MASK_FLOW_LABEL |
> + CMDQ_MODIFY_QP_MODIFY_MASK_SGID_INDEX |
> + CMDQ_MODIFY_QP_MODIFY_MASK_HOP_LIMIT |
> + CMDQ_MODIFY_QP_MODIFY_MASK_TRAFFIC_CLASS |
> + CMDQ_MODIFY_QP_MODIFY_MASK_DEST_MAC |
> + CMDQ_MODIFY_QP_MODIFY_MASK_VLAN_ID;
> + memcpy(qp->qplib_qp.ah.dgid.data, qp_attr->ah_attr.grh.dgid.raw,
> + sizeof(qp->qplib_qp.ah.dgid.data));
> + qp->qplib_qp.ah.flow_label = qp_attr->ah_attr.grh.flow_label;
> + /* If RoCE V2 is enabled, stack will have two entries for
> + * each GID entry. Avoiding this duplicate entry in HW. Dividing
> + * the GID index by 2 for RoCE V2
> + */
> + qp->qplib_qp.ah.sgid_index =
> + qp_attr->ah_attr.grh.sgid_index / 2;
> + qp->qplib_qp.ah.host_sgid_index =
> + qp_attr->ah_attr.grh.sgid_index;
> + qp->qplib_qp.ah.hop_limit = qp_attr->ah_attr.grh.hop_limit;
> + qp->qplib_qp.ah.traffic_class =
> + qp_attr->ah_attr.grh.traffic_class;
> + qp->qplib_qp.ah.sl = qp_attr->ah_attr.sl;
> + ether_addr_copy(qp->qplib_qp.ah.dmac, qp_attr->ah_attr.dmac);
> +
> + status = ib_get_cached_gid(&rdev->ibdev, 1,
> + qp_attr->ah_attr.grh.sgid_index,
> + &sgid, &sgid_attr);
> + if (!status && sgid_attr.ndev) {
> + memcpy(qp->qplib_qp.smac, sgid_attr.ndev->dev_addr,
> + ETH_ALEN);
> + dev_put(sgid_attr.ndev);
> + nw_type = ib_gid_to_network_type(sgid_attr.gid_type,
> + &sgid);
> + switch (nw_type) {
> + case RDMA_NETWORK_IPV4:
> + qp->qplib_qp.nw_type =
> + CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV2_IPV4;
> + break;
> + case RDMA_NETWORK_IPV6:
> + qp->qplib_qp.nw_type =
> + CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV2_IPV6;
> + break;
> + default:
> + qp->qplib_qp.nw_type =
> + CMDQ_MODIFY_QP_NETWORK_TYPE_ROCEV1;
> + break;
> + }
> + }
> + }
> +
> + if (qp_attr_mask & IB_QP_PATH_MTU) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> + qp->qplib_qp.path_mtu = __from_ib_mtu(qp_attr->path_mtu);
> + } else if (qp_attr->qp_state == IB_QPS_RTR) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_PATH_MTU;
> + qp->qplib_qp.path_mtu =
> + __from_ib_mtu(iboe_get_mtu(rdev->netdev->mtu));
> + }
> +
> + if (qp_attr_mask & IB_QP_TIMEOUT) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_TIMEOUT;
> + qp->qplib_qp.timeout = qp_attr->timeout;
> + }
> + if (qp_attr_mask & IB_QP_RETRY_CNT) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_RETRY_CNT;
> + qp->qplib_qp.retry_cnt = qp_attr->retry_cnt;
> + }
> + if (qp_attr_mask & IB_QP_RNR_RETRY) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_RNR_RETRY;
> + qp->qplib_qp.rnr_retry = qp_attr->rnr_retry;
> + }
> + if (qp_attr_mask & IB_QP_MIN_RNR_TIMER) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_MIN_RNR_TIMER;
> + qp->qplib_qp.min_rnr_timer = qp_attr->min_rnr_timer;
> + }
> + if (qp_attr_mask & IB_QP_RQ_PSN) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_RQ_PSN;
> + qp->qplib_qp.rq.psn = qp_attr->rq_psn;
> + }
> + if (qp_attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_MAX_RD_ATOMIC;
> + qp->qplib_qp.max_rd_atomic = qp_attr->max_rd_atomic;
> + }
> + if (qp_attr_mask & IB_QP_SQ_PSN) {
> + qp->qplib_qp.modify_flags |= CMDQ_MODIFY_QP_MODIFY_MASK_SQ_PSN;
> + qp->qplib_qp.sq.psn = qp_attr->sq_psn;
> + }
> + if (qp_attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_MAX_DEST_RD_ATOMIC;
> + qp->qplib_qp.max_dest_rd_atomic = qp_attr->max_dest_rd_atomic;
> + }
> + if (qp_attr_mask & IB_QP_CAP) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_SQ_SIZE |
> + CMDQ_MODIFY_QP_MODIFY_MASK_RQ_SIZE |
> + CMDQ_MODIFY_QP_MODIFY_MASK_SQ_SGE |
> + CMDQ_MODIFY_QP_MODIFY_MASK_RQ_SGE |
> + CMDQ_MODIFY_QP_MODIFY_MASK_MAX_INLINE_DATA;
> + if ((qp_attr->cap.max_send_wr >= dev_attr->max_qp_wqes) ||
> + (qp_attr->cap.max_recv_wr >= dev_attr->max_qp_wqes) ||
> + (qp_attr->cap.max_send_sge >= dev_attr->max_qp_sges) ||
> + (qp_attr->cap.max_recv_sge >= dev_attr->max_qp_sges) ||
> + (qp_attr->cap.max_inline_data >=
> + dev_attr->max_inline_data)) {
> + dev_err(rdev_to_dev(rdev),
> + "Create QP failed - max exceeded");
> + return -EINVAL;
> + }
> + entries = roundup_pow_of_two(qp_attr->cap.max_send_wr);
> + if (entries > dev_attr->max_qp_wqes)
> + entries = dev_attr->max_qp_wqes;
> + qp->qplib_qp.sq.max_wqe = entries;
> + qp->qplib_qp.sq.max_sge = qp_attr->cap.max_send_sge;
> + if (qp->qplib_qp.rq.max_wqe) {
> + entries = roundup_pow_of_two(qp_attr->cap.max_recv_wr);
> + if (entries > dev_attr->max_qp_wqes)
> + entries = dev_attr->max_qp_wqes;
> + qp->qplib_qp.rq.max_wqe = entries;
> + qp->qplib_qp.rq.max_sge = qp_attr->cap.max_recv_sge;
> + } else {
> + /* SRQ was used prior, just ignore the RQ caps */
> + }
> + }
> + if (qp_attr_mask & IB_QP_DEST_QPN) {
> + qp->qplib_qp.modify_flags |=
> + CMDQ_MODIFY_QP_MODIFY_MASK_DEST_QP_ID;
> + qp->qplib_qp.dest_qpn = qp_attr->dest_qp_num;
> + }
> + rc = bnxt_qplib_modify_qp(&rdev->qplib_res, &qp->qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to modify HW QP");
> + return rc;
> + }
> + if (ib_qp->qp_type == IB_QPT_GSI && rdev->qp1_sqp)
> + rc = bnxt_re_modify_shadow_qp(rdev, qp, qp_attr_mask);
> + return rc;
> +}
> +
> +int bnxt_re_query_qp(struct ib_qp *ib_qp, struct ib_qp_attr *qp_attr,
> + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr)
> +{
> + struct bnxt_re_qp *qp = to_bnxt_re(ib_qp, struct bnxt_re_qp, ib_qp);
> + struct bnxt_re_dev *rdev = qp->rdev;
> + struct bnxt_qplib_qp qplib_qp;
> + int rc;
> +
> + memset(&qplib_qp, 0, sizeof(struct bnxt_qplib_qp));
> + qplib_qp.id = qp->qplib_qp.id;
> + qplib_qp.ah.host_sgid_index = qp->qplib_qp.ah.host_sgid_index;
> +
> + rc = bnxt_qplib_query_qp(&rdev->qplib_res, &qplib_qp);
> + if (rc) {
> + dev_err(rdev_to_dev(rdev), "Failed to query HW QP");
> + return rc;
> + }
> + qp_attr->qp_state = __to_ib_qp_state(qplib_qp.state);
> + qp_attr->en_sqd_async_notify = qplib_qp.en_sqd_async_notify ? 1 : 0;
> + qp_attr->qp_access_flags = __to_ib_access_flags(qplib_qp.access);
> + qp_attr->pkey_index = qplib_qp.pkey_index;
> + qp_attr->qkey = qplib_qp.qkey;
> + memcpy(qp_attr->ah_attr.grh.dgid.raw, qplib_qp.ah.dgid.data,
> + sizeof(qplib_qp.ah.dgid.data));
> + qp_attr->ah_attr.grh.flow_label = qplib_qp.ah.flow_label;
> + qp_attr->ah_attr.grh.sgid_index = qplib_qp.ah.host_sgid_index;
> + qp_attr->ah_attr.grh.hop_limit = qplib_qp.ah.hop_limit;
> + qp_attr->ah_attr.grh.traffic_class = qplib_qp.ah.traffic_class;
> + qp_attr->ah_attr.sl = qplib_qp.ah.sl;
> + ether_addr_copy(qp_attr->ah_attr.dmac, qplib_qp.ah.dmac);
> + qp_attr->path_mtu = __to_ib_mtu(qplib_qp.path_mtu);
> + qp_attr->timeout = qplib_qp.timeout;
> + qp_attr->retry_cnt = qplib_qp.retry_cnt;
> + qp_attr->rnr_retry = qplib_qp.rnr_retry;
> + qp_attr->min_rnr_timer = qplib_qp.min_rnr_timer;
> + qp_attr->rq_psn = qplib_qp.rq.psn;
> + qp_attr->max_rd_atomic = qplib_qp.max_rd_atomic;
> + qp_attr->sq_psn = qplib_qp.sq.psn;
> + qp_attr->max_dest_rd_atomic = qplib_qp.max_dest_rd_atomic;
> + qp_init_attr->sq_sig_type = qplib_qp.sig_type ? IB_SIGNAL_ALL_WR :
> + IB_SIGNAL_REQ_WR;
> + qp_attr->dest_qp_num = qplib_qp.dest_qpn;
> +
> + qp_attr->cap.max_send_wr = qp->qplib_qp.sq.max_wqe;
> + qp_attr->cap.max_send_sge = qp->qplib_qp.sq.max_sge;
> + qp_attr->cap.max_recv_wr = qp->qplib_qp.rq.max_wqe;
> + qp_attr->cap.max_recv_sge = qp->qplib_qp.rq.max_sge;
> + qp_attr->cap.max_inline_data = qp->qplib_qp.max_inline_data;
> + qp_init_attr->cap = qp_attr->cap;
> +
> + return 0;
> +}
> +
> /* Completion Queues */
> int bnxt_re_destroy_cq(struct ib_cq *ib_cq)
> {
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> index ba9a4c9..75ee88a 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h
> @@ -57,6 +57,19 @@ struct bnxt_re_ah {
> struct bnxt_qplib_ah qplib_ah;
> };
>
> +struct bnxt_re_qp {
> + struct list_head list;
> + struct bnxt_re_dev *rdev;
> + struct ib_qp ib_qp;
> + spinlock_t sq_lock; /* protect sq */
> + struct bnxt_qplib_qp qplib_qp;
> + struct ib_umem *sumem;
> + struct ib_umem *rumem;
> + /* QP1 */
> + u32 send_psn;
> + struct ib_ud_header qp1_hdr;
> +};
> +
> struct bnxt_re_cq {
> struct bnxt_re_dev *rdev;
> spinlock_t cq_lock; /* protect cq */
> @@ -141,6 +154,14 @@ struct ib_ah *bnxt_re_create_ah(struct ib_pd *pd,
> int bnxt_re_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
> int bnxt_re_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr);
> int bnxt_re_destroy_ah(struct ib_ah *ah);
> +struct ib_qp *bnxt_re_create_qp(struct ib_pd *pd,
> + struct ib_qp_init_attr *qp_init_attr,
> + struct ib_udata *udata);
> +int bnxt_re_modify_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> + int qp_attr_mask, struct ib_udata *udata);
> +int bnxt_re_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> + int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr);
> +int bnxt_re_destroy_qp(struct ib_qp *qp);
> struct ib_cq *bnxt_re_create_cq(struct ib_device *ibdev,
> const struct ib_cq_init_attr *attr,
> struct ib_ucontext *context,
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_main.c b/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> index 3d1504e..5facacc 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_main.c
> @@ -445,6 +445,12 @@ static int bnxt_re_register_ib(struct bnxt_re_dev *rdev)
> ibdev->modify_ah = bnxt_re_modify_ah;
> ibdev->query_ah = bnxt_re_query_ah;
> ibdev->destroy_ah = bnxt_re_destroy_ah;
> +
> + ibdev->create_qp = bnxt_re_create_qp;
> + ibdev->modify_qp = bnxt_re_modify_qp;
> + ibdev->query_qp = bnxt_re_query_qp;
> + ibdev->destroy_qp = bnxt_re_destroy_qp;
> +
> ibdev->create_cq = bnxt_re_create_cq;
> ibdev->destroy_cq = bnxt_re_destroy_cq;
> ibdev->req_notify_cq = bnxt_re_req_notify_cq;
> diff --git a/include/uapi/rdma/bnxt_re_uverbs_abi.h b/include/uapi/rdma/bnxt_re_uverbs_abi.h
> index 5444eff..e6732f8 100644
> --- a/include/uapi/rdma/bnxt_re_uverbs_abi.h
> +++ b/include/uapi/rdma/bnxt_re_uverbs_abi.h
> @@ -66,6 +66,16 @@ struct bnxt_re_cq_resp {
> __u32 phase;
> } __packed;
>
> +struct bnxt_re_qp_req {
> + __u64 qpsva;
> + __u64 qprva;
> + __u64 qp_handle;
> +} __packed;
> +
> +struct bnxt_re_qp_resp {
> + __u32 qpid;
> +} __packed;
> +
> enum bnxt_re_shpg_offt {
> BNXT_RE_BEG_RESV_OFFT = 0x00,
> BNXT_RE_AVID_OFFT = 0x10,
> --
> 2.5.5
>
* Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def
From: Leon Romanovsky @ 2016-12-12 18:18 UTC (permalink / raw)
To: Ozgur Karatas, Tariq Toukan; +Cc: yishaih@mellanox.com, netdev, linux-kernel
In-Reply-To: <2090831481547868@web27h.yandex.ru>
On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
> Dear Romanovsky;
Please avoid top-posting in your replies.
Thanks
>
> I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call "WARN_ON" function, get a error to implicit declaration, right?
>
> Because, you will use to "BUG_ON" get a error implicit declaration of functions.
I'm not sure I follow you. mem->offset is set by sg_set_buf() from the
buf pointer returned by dma_alloc_coherent(). The HW needs the exact
size of this buffer, in multiples of pages and aligned to page
boundaries.
>
> sg_set_buf(mem, buf, PAGE_SIZE << order);
> WARN_ON(mem->offset);
See the inline patch below, which removes this BUG_ON in a proper and safe way.
From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
From: Leon Romanovsky <leonro@mellanox.com>
Date: Mon, 12 Dec 2016 20:02:45 +0200
Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine
This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address alignment in advance and performing proper
folding in case of error.
Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 2a9dd46..e1f9e7c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, struct scatterlist *mem,
if (!buf)
return -ENOMEM;
+ if (offset_in_page(buf)) {
+ dma_free_coherent(dev, PAGE_SIZE << order,
+ buf, sg_dma_address(mem));
+ return -ENOMEM;
+ }
+
sg_set_buf(mem, buf, PAGE_SIZE << order);
- BUG_ON(mem->offset);
sg_dma_len(mem) = PAGE_SIZE << order;
return 0;
}
--
2.10.2
* Re: Designing a safe RX-zero-copy Memory Model for Networking
From: Christoph Lameter @ 2016-12-12 18:06 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: John Fastabend, Mike Rapoport, netdev@vger.kernel.org, linux-mm,
Willem de Bruijn, Björn Töpel, Karlsson, Magnus,
Alexander Duyck, Mel Gorman, Tom Herbert, Brenden Blanco,
Tariq Toukan, Saeed Mahameed, Jesse Brandeburg, Kalman Meth,
Vladislav Yasevich
In-Reply-To: <20161212181344.3ddfa9c3@redhat.com>
On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings. In those cases, I guess, the "push"
> model could be a more direct API approach.
If the hardware does not support steering then one should be able to
provide those services in software.
> I was shooting for a model that worked without hardware support. And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.
The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem that then send and
receive packets from the registered memory regions. This is not strictly
remote memory access but this is a basic mode of operations supported by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.
What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack so that
memory regions can be registered with a network socket and work requests
then can be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not support as usual.
The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place the "Remote memory access" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.
* Re: Soft lockup in inet_put_port on 4.6
From: Josef Bacik @ 2016-12-12 18:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: Hannes Frederic Sowa, Tom Herbert,
Linux Kernel Network Developers
In-Reply-To: <1481343298.4930.208.camel@edumazet-glaptop3.roam.corp.google.com>
On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com>
wrote:
> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>
>>
>> Hmm... Does your ephemeral port range include the port your load
>> balancing app is using ?
>
> I suspect that you might have processes doing bind( port = 0) that are
> trapped into the bind_conflict() scan ?
>
> With 100,000 + timewaits there, this possibly hurts.
>
> Can you try the following loop breaker ?
It doesn't appear that the app is doing bind(port = 0) during normal
operation. I tested this patch and it made no difference. I'm going
to test simply restarting the app without changing to the SO_REUSEPORT
option. Thanks,
Josef