Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH iproute2-next v3] Add support for cake qdisc
From: Toke Høiland-Jørgensen @ 2018-04-24 12:30 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht
In-Reply-To: <20180424114407.5939-2-toke@toke.dk>

sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency variation.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changes since v2:
  - Accidentally sent a version of the tc patch from my testing branch
    which includes a test flag that shouldn't be in there. Sorry for the
    mistake.
  
 man/man8/tc-cake.8 | 632 +++++++++++++++++++++++++++++++++++++++++++
 man/man8/tc.8      |   1 +
 tc/Makefile        |   1 +
 tc/q_cake.c        | 778 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1412 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index 00000000..30e41bc9
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "23 November 2017" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will automatically burst precisely as much
+as required to maintain the configured throughput.  As such, it is very
+straightforward to configure.
+.PP
+.B unlimited
+(default)
+.br
+	No limit on the bandwidth.
+.PP
+.B bandwidth
+RATE
+.br
+	Set the shaper bandwidth.  See
+.BR tc(8)
+or examples below for details of the RATE value.
+.PP
+.B autorate_ingress
+.br
+	Automatic capacity estimation based on traffic arriving at this qdisc.
+This is most likely to be useful with cellular links, which tend to change
+quality randomly.  A
+.B bandwidth
+parameter can be used in conjunction to specify an initial estimate.  The shaper
+will periodically be set to a bandwidth slightly below the estimated rate.  This
+estimator cannot estimate the bandwidth of links downstream of itself.
+
+.SH OVERHEAD COMPENSATION PARAMETERS
+The size of each packet on the wire may differ from that seen by Linux.  The
+following parameters allow CAKE to compensate for this difference by internally
+considering each packet to be bigger than Linux informs it.  To assist users who
+are not expert network engineers, keywords have been provided to represent a
+number of common link technologies.
+
+.SS	Manual Overhead Specification
+.B overhead
+BYTES
+.br
+	Adds BYTES to the size of each packet.  BYTES may be negative; values
+between -64 and 256 (inclusive) are accepted.
+.PP
+.B mpu
+BYTES
+.br
+	Rounds each packet (including overhead) up to a minimum length
+BYTES. BYTES may not be negative; values between 0 and 256 (inclusive)
+are accepted.
+.PP
+.B atm
+.br
+	Compensates for ATM cell framing, which is normally found on ADSL links.
+This is performed after the
+.B overhead
+parameter above.  ATM uses fixed 53-byte cells, each of which can carry 48 bytes
+payload.
+.PP
+.B ptm
+.br
+	Compensates for PTM encoding, which is normally found on VDSL2 links and
+uses a 64b/65b encoding scheme. It is even more efficient to simply
+derate the specified shaper bandwidth by a factor of 64/65 or 0.984. See
+ITU G.992.3 Annex N and IEEE 802.3 Section 61.3 for details.
+.PP
+.B noatm
+.br
+	Disables ATM and PTM compensation.
+
+.SS	Failsafe Overhead Keywords
+These two keywords are provided for quick-and-dirty setup.  Use them if you
+can't be bothered to read the rest of this section.
+.PP
+.B raw
+(default)
+.br
+	Turns off all overhead compensation in CAKE.  The packet size reported
+by Linux will be used directly.
+.PP
+	Other overhead keywords may be added after "raw".  The effect of this is
+to make the overhead compensation operate relative to the reported packet size,
+not the underlying IP packet size.
+.PP
+.B conservative
+.br
+	Compensates for more overhead than is likely to occur on any
+widely-deployed link technology.
+.br
+	Equivalent to
+.B overhead 48 atm.
+
+.SS ADSL Overhead Keywords
+Most ADSL modems have a way to check which framing scheme is in use.  Often this
+is also specified in the settings document provided by the ISP.  The keywords in
+this section are intended to correspond with these sources of information.  All
+of them implicitly set the
+.B atm
+flag.
+.PP
+.B pppoa-vcmux
+.br
+	Equivalent to
+.B overhead 10 atm
+.PP
+.B pppoa-llc
+.br
+	Equivalent to
+.B overhead 14 atm
+.PP
+.B pppoe-vcmux
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B pppoe-llcsnap
+.br
+	Equivalent to
+.B overhead 40 atm
+.PP
+.B bridged-vcmux
+.br
+	Equivalent to
+.B overhead 24 atm
+.PP
+.B bridged-llcsnap
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B ipoa-vcmux
+.br
+	Equivalent to
+.B overhead 8 atm
+.PP
+.B ipoa-llcsnap
+.br
+	Equivalent to
+.B overhead 16 atm
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS VDSL2 Overhead Keywords
+ATM was dropped from VDSL2 in favour of PTM, which is a much more
+straightforward framing scheme.  Some ISPs retained PPPoE for compatibility with
+their existing back-end systems.
+.PP
+.B pppoe-ptm
+.br
+	Equivalent to
+.B overhead 30 ptm
+
+.br
+	PPPoE: 2B PPP + 6B PPPoE +
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+.B bridged-ptm
+.br
+	Equivalent to
+.B overhead 22 ptm
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS DOCSIS Cable Overhead Keyword
+DOCSIS is the universal standard for providing Internet service over cable-TV
+infrastructure.
+
+In this case, the actual on-wire overhead is less important than the packet size
+the head-end equipment uses for shaping and metering.  This is specified to be
+an Ethernet frame including the CRC (aka FCS).
+.PP
+.B docsis
+.br
+	Equivalent to
+.B overhead 18 mpu 64 noatm
+
+.SS Ethernet Overhead Keywords
+.PP
+.B ethernet
+.br
+	Accounts for Ethernet's preamble, inter-frame gap, and Frame Check
+Sequence.  Use this keyword when the bottleneck being shaped for is an
+actual Ethernet cable.
+.br
+	Equivalent to
+.B overhead 38 mpu 84 noatm
+.PP
+.B ether-vlan
+.br
+	Adds 4 bytes to the overhead compensation, accounting for an IEEE 802.1Q
+VLAN header appended to the Ethernet frame header.  NB: Some ISPs use one or
+even two of these within PPPoE; this keyword may be repeated as necessary to
+express this.
+
+.SH ROUND TRIP TIME PARAMETERS
+Active Queue Management (AQM) consists of embedding congestion signals in the
+packet flow, which receivers use to instruct senders to slow down when the queue
+is persistently occupied.  CAKE uses ECN signalling when available, and packet
+drops otherwise, according to a combination of the Codel and BLUE AQM algorithms
+called COBALT.
+
+Very short latencies require a very rapid AQM response to adequately control
+latency.  However, such a rapid response tends to impair throughput when the
+actual RTT is relatively long.  CAKE allows specifying the RTT it assumes for
+tuning various parameters.  Actual RTTs within an order of magnitude of this
+will generally work well for both throughput and latency management.
+
+At the 'lan' setting and below, the time constants are similar in magnitude to
+the jitter in the Linux kernel itself, so congestion might be signalled
+prematurely. The flows will then become sparse and total throughput reduced,
+leaving little or no back-pressure for the fairness logic to work against. Use
+the "metro" setting for local lans unless you have a custom kernel.
+.PP
+.B rtt
+TIME
+.br
+	Manually specify an RTT.
+.PP
+.B datacentre
+.br
+	For extremely high-performance 10GigE+ networks only.  Equivalent to
+.B rtt 100us.
+.PP
+.B lan
+.br
+	For pure Ethernet (not Wi-Fi) networks, at home or in the office.  Don't
+use this when shaping for an Internet access link.  Equivalent to
+.B rtt 1ms.
+.PP
+.B metro
+.br
+	For traffic mostly within a single city.  Equivalent to
+.B rtt 10ms.
+.PP
+.B regional
+.br
+	For traffic mostly within a European-sized country.  Equivalent to
+.B rtt 30ms.
+.PP
+.B internet
+(default)
+.br
+	This is suitable for most Internet traffic.  Equivalent to
+.B rtt 100ms.
+.PP
+.B oceanic
+.br
+	For Internet traffic with generally above-average latency, such as that
+suffered by Australasian residents.  Equivalent to
+.B rtt 300ms.
+.PP
+.B satellite
+.br
+	For traffic via geostationary satellites.  Equivalent to
+.B rtt 1000ms.
+.PP
+.B interplanetary
+.br
+	So named because Jupiter is about 1 light-hour from Earth.  Use this to
+(almost) completely disable AQM actions.  Equivalent to
+.B rtt 3600s.
+
+.SH FLOW ISOLATION PARAMETERS
+With flow isolation enabled, CAKE places packets from different flows into
+different queues, each of which carries its own AQM state.  Packets from each
+queue are then delivered fairly, according to a DRR++ algorithm which minimises
+latency for "sparse" flows.  CAKE uses a set-associative hashing algorithm to
+minimise flow collisions.
+
+These keywords specify whether fairness based on source address, destination
+address, individual flows, or any combination of those is desired.
+.PP
+.B flowblind
+.br
+	Disables flow isolation; all traffic passes through a single queue for
+each tin.
+.PP
+.B srchost
+.br
+	Flows are defined only by source address.  Could be useful on the egress
+path of an ISP backhaul.
+.PP
+.B dsthost
+.br
+	Flows are defined only by destination address.  Could be useful on the
+ingress path of an ISP backhaul.
+.PP
+.B hosts
+.br
+	Flows are defined by source-destination host pairs.  This is host
+isolation, rather than flow isolation.
+.PP
+.B flows
+.br
+	Flows are defined by the entire 5-tuple of source address, destination
+address, transport protocol, source port and destination port.  This is the type
+of flow isolation performed by SFQ and fq_codel.
+.PP
+.B dual-srchost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+source addresses, then over individual flows.  Good for use on egress traffic
+from a LAN to the internet, where it'll prevent any one LAN host from
+monopolising the uplink, regardless of the number of flows they use.
+.PP
+.B dual-dsthost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+destination addresses, then over individual flows.  Good for use on ingress
+traffic to a LAN from the internet, where it'll prevent any one LAN host from
+monopolising the downlink, regardless of the number of flows they use.
+.PP
+.B triple-isolate
+(default)
+.br
+	Flows are defined by the 5-tuple, and fairness is applied over source
+*and* destination addresses intelligently (ie. not merely by host-pairs), and
+also over individual flows.  Use this if you're not certain whether to use
+dual-srchost or dual-dsthost; it'll do both jobs at once, preventing any one
+host on *either* side of the link from monopolising it with a large number of
+flows.
+.PP
+.B nat
+.br
+	Instructs Cake to perform a NAT lookup before applying flow-isolation
+rules, to determine the true addresses and port numbers of the packet, to
+improve fairness between hosts "inside" the NAT.  This has no practical effect
+in "flowblind" or "flows" modes, or if NAT is performed on a different host.
+.PP
+.B nonat
+(default)
+.br
+	Cake will not perform a NAT lookup.  Flow isolation will be performed
+using the addresses and port numbers directly visible to the interface Cake is
+attached to.
+
+.SH PRIORITY QUEUE PARAMETERS
+CAKE can divide traffic into "tins" based on the Diffserv field.  Each tin has
+its own independent set of flow-isolation queues, and is serviced based on a WRR
+algorithm.  To avoid perverse Diffserv marking incentives, tin weights have a
+"priority sharing" value when bandwidth used by that tin is below a threshold,
+and a lower "bandwidth sharing" value when above.  Bandwidth is compared against
+the threshold using the same algorithm as the deficit-mode shaper.
+
+Detailed customisation of tin parameters is not provided.  The following presets
+perform all necessary tuning, relative to the current shaper bandwidth and RTT
+settings.
+.PP
+.B besteffort
+.br
+	Disables priority queuing by placing all traffic in one tin.
+.PP
+.B precedence
+.br
+	Enables legacy interpretation of TOS "Precedence" field.  Use of this
+preset on the modern Internet is firmly discouraged.
+.PP
+.B diffserv4
+.br
+	Provides a general-purpose Diffserv implementation with four tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Video (AF4x, AF3x, CS3, AF2x, CS2, TOS4, TOS1), 50% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, CS5, CS4), 25% threshold.
+.PP
+.B diffserv3
+(default)
+.br
+	Provides a simple, general-purpose Diffserv implementation with three tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, TOS4), 25% threshold, reduced Codel interval.
+
+.SH OTHER PARAMETERS
+.B memlimit
+LIMIT
+.br
+	Limit the memory consumed by Cake to LIMIT bytes. Note that this does
+not translate directly to queue size (so do not size this based on bandwidth
+delay product considerations, but rather on worst case acceptable memory
+consumption), as there is some overhead in the data structures containing the
+packets, especially for small packets.
+
+	By default, the limit is calculated based on the bandwidth and RTT
+settings.
+
+.PP
+.B wash
+
+.br
+	Traffic entering your diffserv domain is frequently mis-marked in
+transit from the perspective of your network, and traffic exiting yours may be
+mis-marked from the perspective of the transiting provider.
+
+Apply the wash option to clear all extra diffserv (but not ECN bits), after
+priority queuing has taken place.
+
+If you are shaping inbound, and cannot trust the diffserv markings (as is the
+case for Comcast Cable, among others), it is best to use a single queue
+"besteffort" mode with wash.
+
+.SH EXAMPLES
+# tc qdisc delete root dev eth0
+.br
+# tc qdisc add root dev eth0 cake bandwidth 100Mbit ethernet
+.br
+# tc -s qdisc show dev eth0
+.br
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
+ backlog 0b 0p requeues 0
+ memory used: 0b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:        65535 /       0
+ min/max overhead-adjusted size:    65535 /       0
+ average network hdr offset:            0
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay          0us          0us          0us
+  av_delay          0us          0us          0us
+  sp_delay          0us          0us          0us
+  pkts                0            0            0
+  bytes               0            0            0
+  way_inds            0            0            0
+  way_miss            0            0            0
+  way_cols            0            0            0
+  drops               0            0            0
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            0            0            0
+  bk_flows            0            0            0
+  un_flows            0            0            0
+  max_len             0            0            0
+  quantum           300         1514          762
+
+After some use:
+.br
+# tc -s qdisc show dev eth0
+
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 44709231 bytes 31931 pkt (dropped 45, overlimits 93782 requeues 0) 
+ backlog 33308b 22p requeues 0
+ memory used: 292352b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:           28 /    1500
+ min/max overhead-adjusted size:       84 /    1538
+ average network hdr offset:           14
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay        8.7ms        6.9ms        5.0ms
+  av_delay        4.9ms        5.3ms        3.8ms
+  sp_delay        727us        1.4ms        511us
+  pkts             2590        21271         8137
+  bytes         3081804     30302659     11426206
+  way_inds            0           46            0
+  way_miss            3           17            4
+  way_cols            0            0            0
+  drops              20           15           10
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            2            4            1
+  bk_flows            1            2            1
+  un_flows            0            0            0
+  max_len          1514         1514         1514
+  quantum           300         1514          762
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-codel (8),
+.BR tc-fq_codel (8),
+.BR tc-red (8)
+
+.SH AUTHORS
+Cake's principal author is Jonathan Morton, with contributions from
+Tony Ambardar, Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen,
+Sebastian Moeller, Ryan Mounce, Dean Scarff, Nils Andreas Svee, and Dave Täht.
+
+This manual page was written by Loganaden Velvindron. Please report corrections
+to the Linux Networking mailing list <netdev@vger.kernel.org>.
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fb..716dfec5 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -795,6 +795,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .BR tc-basic (8),
 .BR tc-bfifo (8),
 .BR tc-bpf (8),
+.BR tc-cake (8),
 .BR tc-cbq (8),
 .BR tc-cgroup (8),
 .BR tc-choke (8),
diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..d9a43568 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -66,6 +66,7 @@ TCMODULES += q_codel.o
 TCMODULES += q_fq_codel.o
 TCMODULES += q_fq.o
 TCMODULES += q_pie.o
+TCMODULES += q_cake.o
 TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
diff --git a/tc/q_cake.c b/tc/q_cake.c
new file mode 100644
index 00000000..12263361
--- /dev/null
+++ b/tc/q_cake.c
@@ -0,0 +1,778 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Common Applications Kept Enhanced  --  CAKE
+ *
+ *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions, and the following disclaimer,
+ *    without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ *    derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... cake [ bandwidth RATE | unlimited* | autorate_ingress ]\n"
+"                [ rtt TIME | datacentre | lan | metro | regional |\n"
+"                  internet* | oceanic | satellite | interplanetary ]\n"
+"                [ besteffort | diffserv8 | diffserv4 | diffserv3* ]\n"
+"                [ flowblind | srchost | dsthost | hosts | flows |\n"
+"                  dual-srchost | dual-dsthost | triple-isolate* ]\n"
+"                [ nat | nonat* ]\n"
+"                [ wash | nowash* ]\n"
+"                [ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
+"                [ memlimit LIMIT ]\n"
+"                [ ptm | atm | noatm* ] [ overhead N | conservative | raw* ]\n"
+"                [ mpu N ] [ ingress | egress* ]\n"
+"                (* marks defaults)\n");
+}
+
+static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			  struct nlmsghdr *n, const char *dev)
+{
+	int unlimited = 0;
+	unsigned bandwidth = 0;
+	unsigned interval = 0;
+	unsigned target = 0;
+	unsigned diffserv = 0;
+	unsigned memlimit = 0;
+	int  overhead = 0;
+	bool overhead_set = false;
+	bool overhead_override = false;
+	int mpu = 0;
+	int flowmode = -1;
+	int nat = -1;
+	int atm = -1;
+	int autorate = -1;
+	int wash = -1;
+	int ingress = -1;
+	int ack_filter = -1;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "bandwidth") == 0) {
+			NEXT_ARG();
+			if (get_rate(&bandwidth, *argv)) {
+				fprintf(stderr, "Illegal \"bandwidth\"\n");
+				return -1;
+			}
+			unlimited = 0;
+			autorate = 0;
+		} else if (strcmp(*argv, "unlimited") == 0) {
+			bandwidth = 0;
+			unlimited = 1;
+			autorate = 0;
+		} else if (strcmp(*argv, "autorate_ingress") == 0) {
+			autorate = 1;
+
+		} else if (strcmp(*argv, "rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&interval, *argv)) {
+				fprintf(stderr, "Illegal \"rtt\"\n");
+				return -1;
+			}
+			target = interval / 20;
+			if(!target)
+				target = 1;
+		} else if (strcmp(*argv, "datacentre") == 0) {
+			interval = 100;
+			target   =   5;
+		} else if (strcmp(*argv, "lan") == 0) {
+			interval = 1000;
+			target   =   50;
+		} else if (strcmp(*argv, "metro") == 0) {
+			interval = 10000;
+			target   =   500;
+		} else if (strcmp(*argv, "regional") == 0) {
+			interval = 30000;
+			target    = 1500;
+		} else if (strcmp(*argv, "internet") == 0) {
+			interval = 100000;
+			target   =   5000;
+		} else if (strcmp(*argv, "oceanic") == 0) {
+			interval = 300000;
+			target   =  15000;
+		} else if (strcmp(*argv, "satellite") == 0) {
+			interval = 1000000;
+			target   =   50000;
+		} else if (strcmp(*argv, "interplanetary") == 0) {
+			interval = 1000000000;
+			target   =   50000000;
+
+		} else if (strcmp(*argv, "besteffort") == 0) {
+			diffserv = CAKE_DIFFSERV_BESTEFFORT;
+		} else if (strcmp(*argv, "precedence") == 0) {
+			diffserv = CAKE_DIFFSERV_PRECEDENCE;
+		} else if (strcmp(*argv, "diffserv8") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV8;
+		} else if (strcmp(*argv, "diffserv4") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv3") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV3;
+
+		} else if (strcmp(*argv, "nowash") == 0) {
+			wash = 0;
+		} else if (strcmp(*argv, "wash") == 0) {
+			wash = 1;
+
+		} else if (strcmp(*argv, "flowblind") == 0) {
+			flowmode = CAKE_FLOW_NONE;
+		} else if (strcmp(*argv, "srchost") == 0) {
+			flowmode = CAKE_FLOW_SRC_IP;
+		} else if (strcmp(*argv, "dsthost") == 0) {
+			flowmode = CAKE_FLOW_DST_IP;
+		} else if (strcmp(*argv, "hosts") == 0) {
+			flowmode = CAKE_FLOW_HOSTS;
+		} else if (strcmp(*argv, "flows") == 0) {
+			flowmode = CAKE_FLOW_FLOWS;
+		} else if (strcmp(*argv, "dual-srchost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_SRC;
+		} else if (strcmp(*argv, "dual-dsthost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_DST;
+		} else if (strcmp(*argv, "triple-isolate") == 0) {
+			flowmode = CAKE_FLOW_TRIPLE;
+
+		} else if (strcmp(*argv, "nat") == 0) {
+			nat = 1;
+		} else if (strcmp(*argv, "nonat") == 0) {
+			nat = 0;
+
+		} else if (strcmp(*argv, "ptm") == 0) {
+			atm = CAKE_ATM_PTM;
+		} else if (strcmp(*argv, "atm") == 0) {
+			atm = CAKE_ATM_ATM;
+		} else if (strcmp(*argv, "noatm") == 0) {
+			atm = CAKE_ATM_NONE;
+
+		} else if (strcmp(*argv, "raw") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead = 0;
+			overhead_set = true;
+			overhead_override = true;
+		} else if (strcmp(*argv, "conservative") == 0) {
+			/*
+			 * Deliberately over-estimate overhead:
+			 * one whole ATM cell plus ATM framing.
+			 * A safe choice if the actual overhead is unknown.
+			 */
+			atm = CAKE_ATM_ATM;
+			overhead = 48;
+			overhead_set = true;
+
+		/* Various ADSL framing schemes, all over ATM cells */
+		} else if (strcmp(*argv, "ipoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 8;
+			overhead_set = true;
+		} else if (strcmp(*argv, "ipoa-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 16;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 24;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 10;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-llc") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 14;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 40;
+			overhead_set = true;
+
+		/* Typical VDSL2 framing schemes, both over PTM */
+		/* PTM has 64b/65b coding which absorbs some bandwidth */
+		} else if (strcmp(*argv, "pppoe-ptm") == 0) {
+			/* 2B PPP + 6B PPPoE + 6B dest MAC + 6B src MAC
+			 * + 2B ethertype + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 30B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 30;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-ptm") == 0) {
+			/* 6B dest MAC + 6B src MAC + 2B ethertype
+			 * + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 22B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 22;
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "via-ethernet") == 0) {
+			/*
+			 * We used to use this flag to manually compensate for
+			 * Linux including the Ethernet header on Ethernet-type
+			 * interfaces, but not on IP-type interfaces.
+			 *
+			 * It is no longer needed, because Cake now adjusts for
+			 * that automatically, and is thus ignored.
+			 *
+			 * It would be deleted entirely, but it appears in the
+			 * stats output when the automatic compensation is
+			 * active.
+			 */
+
+		} else if (strcmp(*argv, "ethernet") == 0) {
+			/* ethernet pre-amble & interframe gap & FCS
+			 * you may need to add vlan tag */
+			overhead += 38;
+			overhead_set = true;
+			mpu = 84;
+
+		/* Additional Ethernet-related overhead used by some ISPs */
+		} else if (strcmp(*argv, "ether-vlan") == 0) {
+			/* 802.1q VLAN tag - may be repeated */
+			overhead += 4;
+			overhead_set = true;
+
+		/*
+		 * DOCSIS cable shapers account for Ethernet frame with FCS,
+		 * but not interframe gap or preamble.
+		 */
+		} else if (strcmp(*argv, "docsis") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead += 18;
+			overhead_set = true;
+			mpu = 64;
+
+		} else if (strcmp(*argv, "overhead") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			overhead = strtol(*argv, &p, 10);
+			if(!p || *p || !*argv || overhead < -64 || overhead > 256) {
+				fprintf(stderr, "Illegal \"overhead\", valid range is -64 to 256\\n");
+				return -1;
+			}
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "mpu") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			mpu = strtol(*argv, &p, 10);
+			if(!p || *p || !*argv || mpu < 0 || mpu > 256) {
+				fprintf(stderr, "Illegal \"mpu\", valid range is 0 to 256\\n");
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "ingress") == 0) {
+			ingress = 1;
+		} else if (strcmp(*argv, "egress") == 0) {
+			ingress = 0;
+
+		} else if (strcmp(*argv, "no-ack-filter") == 0) {
+			ack_filter = CAKE_ACK_NONE;
+		} else if (strcmp(*argv, "ack-filter") == 0) {
+			ack_filter = CAKE_ACK_FILTER;
+		} else if (strcmp(*argv, "ack-filter-aggressive") == 0) {
+			ack_filter = CAKE_ACK_AGGRESSIVE;
+
+		} else if (strcmp(*argv, "memlimit") == 0) {
+			NEXT_ARG();
+			if(get_size(&memlimit, *argv)) {
+				fprintf(stderr, "Illegal value for \"memlimit\": \"%s\"\n", *argv);
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+	if (bandwidth || unlimited)
+		addattr_l(n, 1024, TCA_CAKE_BASE_RATE, &bandwidth, sizeof(bandwidth));
+	if (diffserv)
+		addattr_l(n, 1024, TCA_CAKE_DIFFSERV_MODE, &diffserv, sizeof(diffserv));
+	if (atm != -1)
+		addattr_l(n, 1024, TCA_CAKE_ATM, &atm, sizeof(atm));
+	if (flowmode != -1)
+		addattr_l(n, 1024, TCA_CAKE_FLOW_MODE, &flowmode, sizeof(flowmode));
+	if (overhead_set)
+		addattr_l(n, 1024, TCA_CAKE_OVERHEAD, &overhead, sizeof(overhead));
+	if (overhead_override) {
+		unsigned zero = 0;
+		addattr_l(n, 1024, TCA_CAKE_RAW, &zero, sizeof(zero));
+	}
+	if (mpu > 0)
+		addattr_l(n, 1024, TCA_CAKE_MPU, &mpu, sizeof(mpu));
+	if (interval)
+		addattr_l(n, 1024, TCA_CAKE_RTT, &interval, sizeof(interval));
+	if (target)
+		addattr_l(n, 1024, TCA_CAKE_TARGET, &target, sizeof(target));
+	if (autorate != -1)
+		addattr_l(n, 1024, TCA_CAKE_AUTORATE, &autorate, sizeof(autorate));
+	if (memlimit)
+		addattr_l(n, 1024, TCA_CAKE_MEMORY, &memlimit, sizeof(memlimit));
+	if (nat != -1)
+		addattr_l(n, 1024, TCA_CAKE_NAT, &nat, sizeof(nat));
+	if (wash != -1)
+		addattr_l(n, 1024, TCA_CAKE_WASH, &wash, sizeof(wash));
+	if (ingress != -1)
+		addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
+	if (ack_filter != -1)
+		addattr_l(n, 1024, TCA_CAKE_ACK_FILTER, &ack_filter, sizeof(ack_filter));
+
+	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+	return 0;
+}
+
+
+static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_CAKE_MAX + 1];
+	unsigned bandwidth = 0;
+	unsigned diffserv = 0;
+	unsigned flowmode = 0;
+	unsigned interval = 0;
+	unsigned memlimit = 0;
+	int overhead = 0;
+	int raw = 0;
+	int mpu = 0;
+	int atm = 0;
+	int nat = 0;
+	int autorate = 0;
+	int wash = 0;
+	int ingress = 0;
+	int ack_filter = 0;
+	SPRINT_BUF(b1);
+	SPRINT_BUF(b2);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CAKE_MAX, opt);
+
+	if (tb[TCA_CAKE_BASE_RATE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_BASE_RATE]) >= sizeof(__u32)) {
+		bandwidth = rta_getattr_u32(tb[TCA_CAKE_BASE_RATE]);
+		if(bandwidth) {
+			print_uint(PRINT_JSON, "bandwidth", NULL, bandwidth);
+			print_string(PRINT_FP, NULL, "bandwidth %s ", sprint_rate(bandwidth, b1));
+		} else
+			print_string(PRINT_ANY, "bandwidth", "bandwidth %s ", "unlimited");
+	}
+	if (tb[TCA_CAKE_AUTORATE] &&
+		RTA_PAYLOAD(tb[TCA_CAKE_AUTORATE]) >= sizeof(__u32)) {
+		autorate = rta_getattr_u32(tb[TCA_CAKE_AUTORATE]);
+		if(autorate == 1)
+			print_string(PRINT_ANY, "autorate", "autorate_%s ", "ingress");
+		else if(autorate)
+			print_string(PRINT_ANY, "autorate", "(?autorate?) ", "unknown");
+	}
+	if (tb[TCA_CAKE_DIFFSERV_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_DIFFSERV_MODE]) >= sizeof(__u32)) {
+		diffserv = rta_getattr_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+		switch(diffserv) {
+		case CAKE_DIFFSERV_DIFFSERV3:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv3");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV4:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv4");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV8:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv8");
+			break;
+		case CAKE_DIFFSERV_BESTEFFORT:
+			print_string(PRINT_ANY, "diffserv", "%s ", "besteffort");
+			break;
+		case CAKE_DIFFSERV_PRECEDENCE:
+			print_string(PRINT_ANY, "diffserv", "%s ", "precedence");
+			break;
+		default:
+			print_string(PRINT_ANY, "diffserv", "(?diffserv?) ", "unknown");
+			break;
+		};
+	}
+	if (tb[TCA_CAKE_FLOW_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_FLOW_MODE]) >= sizeof(__u32)) {
+		flowmode = rta_getattr_u32(tb[TCA_CAKE_FLOW_MODE]);
+		switch(flowmode) {
+		case CAKE_FLOW_NONE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flowblind");
+			break;
+		case CAKE_FLOW_SRC_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "srchost");
+			break;
+		case CAKE_FLOW_DST_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dsthost");
+			break;
+		case CAKE_FLOW_HOSTS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "hosts");
+			break;
+		case CAKE_FLOW_FLOWS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flows");
+			break;
+		case CAKE_FLOW_DUAL_SRC:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-srchost");
+			break;
+		case CAKE_FLOW_DUAL_DST:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-dsthost");
+			break;
+		case CAKE_FLOW_TRIPLE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "triple-isolate");
+			break;
+		default:
+			print_string(PRINT_ANY, "flowmode", "(?flowmode?) ", "unknown");
+			break;
+		};
+
+	}
+
+	if (tb[TCA_CAKE_NAT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_NAT]) >= sizeof(__u32)) {
+	    nat = rta_getattr_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if(nat)
+		print_string(PRINT_FP, NULL, "nat ", NULL);
+	print_bool(PRINT_JSON, "nat", NULL, nat);
+
+	if (tb[TCA_CAKE_WASH] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_WASH]) >= sizeof(__u32)) {
+		wash = rta_getattr_u32(tb[TCA_CAKE_WASH]);
+	}
+	if (tb[TCA_CAKE_ATM] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ATM]) >= sizeof(__u32)) {
+		atm = rta_getattr_u32(tb[TCA_CAKE_ATM]);
+	}
+	if (tb[TCA_CAKE_OVERHEAD] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_OVERHEAD]) >= sizeof(__s32)) {
+		overhead = *(__s32 *) RTA_DATA(tb[TCA_CAKE_OVERHEAD]);
+	}
+	if (tb[TCA_CAKE_MPU] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_MPU]) >= sizeof(__u32)) {
+		mpu = rta_getattr_u32(tb[TCA_CAKE_MPU]);
+	}
+	if (tb[TCA_CAKE_INGRESS] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_INGRESS]) >= sizeof(__u32)) {
+		ingress = rta_getattr_u32(tb[TCA_CAKE_INGRESS]);
+	}
+	if (tb[TCA_CAKE_ACK_FILTER] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ACK_FILTER]) >= sizeof(__u32)) {
+		ack_filter = rta_getattr_u32(tb[TCA_CAKE_ACK_FILTER]);
+	}
+	if (tb[TCA_CAKE_RAW]) {
+		raw = 1;
+	}
+	if (tb[TCA_CAKE_RTT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_RTT]) >= sizeof(__u32)) {
+		interval = rta_getattr_u32(tb[TCA_CAKE_RTT]);
+	}
+	if (wash)
+		print_string(PRINT_FP, NULL, "wash ", NULL);
+	print_bool(PRINT_JSON, "wash", NULL, wash);
+
+	if (ingress)
+		print_string(PRINT_FP, NULL, "ingress ", NULL);
+	print_bool(PRINT_JSON, "ingress", NULL, ingress);
+
+	if (ack_filter == CAKE_ACK_AGGRESSIVE)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter-%s ", "aggressive");
+	else if (ack_filter == CAKE_ACK_FILTER)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter ", "enabled");
+	else
+		print_string(PRINT_JSON, "ack-filter", NULL, "disabled");
+
+	if (interval)
+		print_string(PRINT_FP, NULL, "rtt %s ", sprint_time(interval, b2));
+	print_uint(PRINT_JSON, "rtt", NULL, interval);
+
+	if (raw)
+		print_string(PRINT_FP, NULL, "raw ", NULL);
+	print_bool(PRINT_JSON, "raw", NULL, raw);
+
+	if (atm == CAKE_ATM_ATM)
+		print_string(PRINT_ANY, "atm", "%s ", "atm");
+	else if (atm == CAKE_ATM_PTM)
+		print_string(PRINT_ANY, "atm", "%s ", "ptm");
+	else if (!raw)
+		print_string(PRINT_ANY, "atm", "%s ", "noatm");
+
+	print_int(PRINT_ANY, "overhead", "overhead %d ", overhead);
+
+	if (mpu)
+		print_uint(PRINT_ANY, "mpu", "mpu %" PRIu64 " ", mpu);
+
+	if (memlimit) {
+		print_uint(PRINT_JSON, "memlimit", NULL, memlimit);
+		print_string(PRINT_FP, NULL, "memlimit %s", sprint_size(memlimit, b1));
+	}
+
+	return 0;
+}
+
+#define FOR_EACH_TIN(xstats, tst, i)				\
+	for(tst = xstats->tin_stats, i = 0;			\
+	i < xstats->tin_cnt;						\
+	    i++, tst = ((void *) xstats->tin_stats) + xstats->tin_stats_size * i)
+
+static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
+{
+	open_json_object(NULL);
+	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
+	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
+	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
+	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
+	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
+	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
+	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
+	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
+	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
+	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
+	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
+	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
+	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
+	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
+	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
+	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
+	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
+	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
+	if (version >= 0x102)
+		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);
+	close_json_object();
+}
+
+static int cake_print_xstats(struct qdisc_util *qu, FILE *f,
+			     struct rtattr *xstats)
+{
+	struct tc_cake_xstats     *stnc;
+	struct tc_cake_tin_stats  *tst;
+	SPRINT_BUF(b1);
+	int i;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*stnc))
+		return -1;
+
+	stnc = RTA_DATA(xstats);
+
+	if (stnc->version < 0x101 ||
+	    RTA_PAYLOAD(xstats) < (sizeof(struct tc_cake_xstats) +
+				    stnc->tin_stats_size * stnc->tin_cnt))
+		return -1;
+
+	print_uint(PRINT_JSON, "memory_used", NULL, stnc->memory_used);
+	print_uint(PRINT_JSON, "memory_limit", NULL, stnc->memory_limit);
+	print_uint(PRINT_JSON, "capacity_estimate", NULL, stnc->capacity_estimate);
+
+	print_string(PRINT_FP, NULL, " memory used: %s",
+		sprint_size(stnc->memory_used, b1));
+	print_string(PRINT_FP, NULL, " of %s\n",
+		sprint_size(stnc->memory_limit, b1));
+	print_string(PRINT_FP, NULL, " capacity estimate: %s\n",
+		sprint_rate(stnc->capacity_estimate, b1));
+
+	print_uint(PRINT_ANY, "min_network_size", " min/max network layer size: %12" PRIu64,
+		stnc->min_netlen);
+	print_uint(PRINT_ANY, "max_network_size", " /%8" PRIu64 "\n", stnc->max_netlen);
+	print_uint(PRINT_ANY, "min_adj_size", " min/max overhead-adjusted size: %8" PRIu64,
+		stnc->min_adjlen);
+	print_uint(PRINT_ANY, "max_adj_size", " /%8" PRIu64 "\n", stnc->max_adjlen);
+	print_uint(PRINT_ANY, "avg_hdr_offset", " average network hdr offset: %12" PRIu64 "\n\n",
+		stnc->avg_trnoff);
+
+	if (is_json_context()) {
+		open_json_array(PRINT_JSON, "tins");
+		FOR_EACH_TIN(stnc, tst, i)
+			cake_print_json_tin(tst, stnc->version);
+		close_json_array(PRINT_JSON, NULL);
+		return 0;
+	}
+
+
+	switch(stnc->tin_cnt) {
+	case 3:
+		fprintf(f, "                   Bulk  Best Effort        Voice\n");
+		break;
+
+	case 4:
+		fprintf(f, "                   Bulk  Best Effort        Video        Voice\n");
+		break;
+
+	case 5:
+		fprintf(f, "               Low Loss  Best Effort    Low Delay         Bulk  Net Control\n");
+		break;
+
+	default:
+		fprintf(f, "          ");
+		for(i=0; i < stnc->tin_cnt; i++)
+			fprintf(f, "       Tin %u", i);
+		fprintf(f, "\n");
+	};
+
+	fprintf(f, "  thresh  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_rate(tst->threshold_rate, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  target  ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->target_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  interval");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->interval_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pk_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->peak_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  av_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->avge_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_delay");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12s", sprint_time(tst->base_delay_us, b1));
+	fprintf(f, "\n");
+
+	fprintf(f, "  pkts    ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sent.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bytes   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12llu", tst->sent.bytes);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_inds");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_indirect_hits);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_miss");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_misses);
+	fprintf(f, "\n");
+
+	fprintf(f, "  way_cols");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->way_collisions);
+	fprintf(f, "\n");
+
+	fprintf(f, "  drops   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->dropped.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  marks   ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ecn_marked.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  ack_drop");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->ack_drops.packets);
+	fprintf(f, "\n");
+
+	fprintf(f, "  sp_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->sparse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  bk_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->bulk_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  un_flows");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->unresponse_flows);
+	fprintf(f, "\n");
+
+	fprintf(f, "  max_len ");
+	FOR_EACH_TIN(stnc, tst, i)
+		fprintf(f, " %12u", tst->max_skblen);
+	fprintf(f, "\n");
+
+	if (stnc->version >= 0x102) {
+		fprintf(f, "  quantum ");
+		FOR_EACH_TIN(stnc, tst, i)
+			fprintf(f, " %12u", tst->flow_quantum);
+		fprintf(f, "\n");
+	}
+
+	return 0;
+}
+
+struct qdisc_util cake_qdisc_util = {
+	.id		= "cake",
+	.parse_qopt	= cake_parse_opt,
+	.print_qopt	= cake_print_opt,
+	.print_xstats	= cake_print_xstats,
+};
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next] qed: Fix copying 2 strings
From: Denis Bolotin @ 2018-04-24 12:32 UTC (permalink / raw)
  To: davem; +Cc: netdev, ariel.elior, Denis Bolotin

The strscpy() was a recent fix (net: qed: use correct strncpy() size) to
prevent passing the length of the source buffer to strncpy() and guarantee
null termination.
It misses the goal of overwriting only the first 3 characters in
"???_BIG_RAM" and "???_RAM" while keeping the rest of the string.
Use strncpy() with the length of 3, without null termination.

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
---
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index b3211c7..39124b5 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -419,6 +419,7 @@ struct phy_defs {
 #define NUM_RSS_MEM_TYPES		5
 
 #define NUM_BIG_RAM_TYPES		3
+#define BIG_RAM_NAME_LEN		3
 
 #define NUM_PHY_TBUS_ADDRESSES		2048
 #define PHY_DUMP_SIZE_DWORDS		(NUM_PHY_TBUS_ADDRESSES / 2)
@@ -3650,8 +3651,8 @@ static u32 qed_grc_dump_big_ram(struct qed_hwfn *p_hwfn,
 		     BIT(big_ram->is_256b_bit_offset[dev_data->chip_id]) ? 256
 									 : 128;
 
-	strscpy(type_name, big_ram->instance_name, sizeof(type_name));
-	strscpy(mem_name, big_ram->instance_name, sizeof(mem_name));
+	strncpy(type_name, big_ram->instance_name, BIG_RAM_NAME_LEN);
+	strncpy(mem_name, big_ram->instance_name, BIG_RAM_NAME_LEN);
 
 	/* Dump memory header */
 	offset += qed_grc_dump_mem_hdr(p_hwfn,
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH] net/tls: remove redundant second null check on sgout
From: Colin King @ 2018-04-24 12:36 UTC (permalink / raw)
  To: Ilya Lesokhin, Aviad Yehezkel, Dave Watson, David S . Miller,
	netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

A duplicated null check on sgout is redundant as it is known to be
already true because of the identical earlier check. Remove it.
Detected by cppcheck:

net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
true.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/tls/tls_sw.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 71e79597f940..6ed1c02cfc94 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -693,8 +693,7 @@ static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 	if (!sgout) {
 		nsg = skb_cow_data(skb, 0, &unused) + 1;
 		sgin = kmalloc_array(nsg, sizeof(*sgin), sk->sk_allocation);
-		if (!sgout)
-			sgout = sgin;
+		sgout = sgin;
 	}
 
 	sg_init_table(sgin, nsg);
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] net: init sk_cookie for inet socket
From: Eric Dumazet @ 2018-04-24 12:37 UTC (permalink / raw)
  To: Yafang Shao; +Cc: David Miller, Alexei Starovoitov, netdev, LKML
In-Reply-To: <CALOAHbDajHPN9Z3TTWFBwpd1H16RVRRYYiUMrUELE3JxeZo+qw@mail.gmail.com>

On 04/24/2018 04:47 AM, Yafang Shao wrote:

> 
> Could you pls. explain the issue to me ?

Just run a synflood test on your host, it will definitely show the atomic
consuming most cpu cycles in inet_reqsk_alloc(), because of huge contention
on a cache line shared by all cpus.

Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on my host.

atomic64_inc_return(&sock_net(sk)->cookie_gen) was not meant to be used in the
normal case (when a socket cookie is not ever requested/needed)

^ permalink raw reply

* Re: [PATCH net-next] Revert "net: init sk_cookie for inet socket"
From: Eric Dumazet @ 2018-04-24 12:38 UTC (permalink / raw)
  To: Yafang Shao, davem; +Cc: netdev
In-Reply-To: <1524571537-9781-1-git-send-email-laoar.shao@gmail.com>



On 04/24/2018 05:05 AM, Yafang Shao wrote:
> This revert commit <c6849a3ac17e> ("net: init sk_cookie for inet socket")
> 
> Per discussion with Eric.
> 

I suggest you include a bit more details, about cache line false sharing.

Thanks.

^ permalink raw reply

* Re: bpf: samples/bpf/sockex2: Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed.
From: Daniel Borkmann @ 2018-04-24 12:42 UTC (permalink / raw)
  To: shhuiw; +Cc: ast, netdev, yhs, peterz
In-Reply-To: <82072bf2-ef44-4755-47bf-bcba430b69a4@foxmail.com>

On 04/24/2018 09:29 AM, shhuiw wrote:
> On 04/23/18 17:53, Daniel Borkmann wrote:
>> On 04/23/2018 04:56 AM, Wang Sheng-Hui wrote:
>>> Sorry to trouble you!
>>>
>>> I run samples/bpf/sockex2 failed with
>>> "Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed."
>>>
>>> Then I run " strace ./sockex2" and got:
>>> ...
>>> bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("lo"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
>>> setsockopt(3, SOL_SOCKET, SO_ATTACH_BPF, [0], 4) = -1 EINVAL (Invalid argument)
>>> write(2, "sockex2: /root/linux/samples/bpf"..., 156sockex2: /root/linux/samples/bpf/sockex2_user.c:35: main: Assertion `setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd, sizeof(prog_fd[0])) == 0' failed.
>>> ) = 156
>>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb8ec4bf000
>>> rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
>>> rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
>>> getpid()                                = 3513
>>> gettid()                                = 3513
>>> tgkill(3513, 3513, SIGABRT)             = 0
>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=3513, si_uid=0} ---
>>> +++ killed by SIGABRT +++
>>> Aborted
>> [...]
>> bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=16, max_entries=1024}, 72) = 4
>> bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_SOCKET_FILTER, insn_cnt=201, insns=0x2676cd0, license="GPL", log_level=0, log_size=0, log_buf=0, kern_version=0}, 72) = 5
>> close(3)                                = 0
>> socket(AF_PACKET, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, 768) = 3
>> access("/proc/net", R_OK)               = 0
>> access("/proc/net/unix", R_OK)          = 0
>> socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 6
>> ioctl(6, SIOCGIFINDEX, {ifr_name="lo", }) = 0
>> close(6)                                = 0
>> bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("lo"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
>> setsockopt(3, SOL_SOCKET, SO_ATTACH_BPF, [5], 4) = 0
>> pipe2([6, 7], O_CLOEXEC)                = 0
>> [...]
>>
>> Works fine for me. The EINVAL in your case comes from the 'setsockopt(3, SOL_SOCKET,
>> SO_ATTACH_BPF, [0], 4)'. So you send fd 0 to SO_ATTACH_BPF, which is not a valid BPF
>> fd and therefore bails out here. You might want to debug bpf_load.c why it's failing
>> to load the program in your case, or check strace a bit further above where you do
>> the map and prog creation (as I copied in my case).
>>
>> Cheers,
>> Daniel
>>
> Daniel,
> 
> I built the sample code by run
>      "make O=../buildkernel samples/bpf/ LLC=/usr/bin/llc clang=/usr/bin/clang"
> while llc & clang are provided by debian:
>     ~/buildkernel/samples/bpf# ls -l /usr/bin/llc
>     lrwxrwxrwx 1 root root 23 Dec  4 21:34 /usr/bin/llc -> ../lib/llvm-4.0/bin/llc
>     ~/buildkernel/samples/bpf# ls -l /usr/bin/clang
>     lrwxrwxrwx 1 root root 25 Dec  4 21:34 /usr/bin/clang -> ../lib/llvm-4.0/bin/clang
> ~/buildkernel/samples/bpf# llc --version
> LLVM (http://llvm.org/):
>   LLVM version 4.0.1
> ...
>   Registered Targets:
> ...
>     bpf        - BPF (host endian)
>     bpfeb      - BPF (big endian)
>     bpfel      - BPF (little endian)
> ...
> 
> There are 3 sockex BPF programs under samples/bpf, but only 'socket' section in sockex1_kern.o can be
> detected, sockex[23]_kern.o failed.
> ------------------------------------------------------------------------
> ~/buildkernel/samples/bpf# llvm-objdump -h ./sockex1_kern.o
> ./sockex1_kern.o:    file format ELF64-BPF
> Sections:
> Idx Name          Size      Address          Type
>   0               00000000 0000000000000000
>   1 .strtab       00000057 0000000000000000
>   2 .text         00000000 0000000000000000 TEXT DATA
>   3 socket1       00000078 0000000000000000 TEXT DATA
>   4 .relsocket1   00000010 0000000000000000
>   5 maps          0000001c 0000000000000000 DATA
>   6 license       00000004 0000000000000000 DATA
>   7 .eh_frame     00000028 0000000000000000 DATA
>   8 .rel.eh_frame 00000010 0000000000000000
>   9 .symtab       00000090 0000000000000000
> 
> ~/buildkernel/samples/bpf# llvm-objdump -h ./sockex2_kern.o
> ./sockex2_kern.o:    file format ELF64-BPF
> Sections:
> Idx Name          Size      Address          Type
>   0               00000000 0000000000000000
>   1 .strtab       00000017 0000000000000000
>   2 .text         00000000 0000000000000000 TEXT DATA
>   3 .symtab       00000018 0000000000000000
> 
> ~/buildkernel/samples/bpf# llvm-objdump -h ./sockex3_kern.o
> ./sockex3_kern.o:    file format ELF64-BPF
> Sections:
> Idx Name          Size      Address          Type
>   0               00000000 0000000000000000
>   1 .strtab       00000017 0000000000000000
>   2 .text         00000000 0000000000000000 TEXT DATA
>   3 .symtab       00000018 0000000000000000
> 
> 
> Do you know how to fix it, please?

Did you by any chance hit an error like the below during compilation? The first
sockex1_kern.o compiled fine for me from the kernel samples dir, but the second
sockex2_kern.o and subsequent bail out with "error: 'asm goto' constructs are not
supported yet":

# make
[...]
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.4.1/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -I/home/foo/trees/bpf-next/samples/bpf \
	-I./tools/testing/selftests/bpf/ \
	-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
	-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
	-Wno-gnu-variable-sized-type-not-at-end \
	-Wno-address-of-packed-member -Wno-tautological-compare \
	-Wno-unknown-warning-option  \
	-O2 -emit-llvm -c /home/foo/trees/bpf-next/samples/bpf/sockex1_kern.c -o -| llc -march=bpf -filetype=obj -o /home/foo/trees/bpf-next/samples/bpf/sockex1_kern.o
clang  -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.4.1/include -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h  -I/home/foo/trees/bpf-next/samples/bpf \
	-I./tools/testing/selftests/bpf/ \
	-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
	-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
	-Wno-gnu-variable-sized-type-not-at-end \
	-Wno-address-of-packed-member -Wno-tautological-compare \
	-Wno-unknown-warning-option  \
	-O2 -emit-llvm -c /home/foo/trees/bpf-next/samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o /home/foo/trees/bpf-next/samples/bpf/sockex2_kern.o
In file included from /home/foo/trees/bpf-next/samples/bpf/sockex2_kern.c:3:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
        asm_volatile_goto("1: jmp 6f\n"
        ^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                         ^
1 error generated.
[...]

Which leads to:

# llvm-objdump -h ./sockex1_kern.o

./sockex1_kern.o:	file format ELF64-BPF

Sections:
Idx Name          Size      Address          Type
  0               00000000 0000000000000000
  1 .strtab       00000057 0000000000000000
  2 .text         00000000 0000000000000000 TEXT DATA
  3 socket1       00000078 0000000000000000 TEXT DATA
  4 .relsocket1   00000010 0000000000000000
  5 maps          0000001c 0000000000000000 DATA
  6 license       00000004 0000000000000000 DATA
  7 .eh_frame     00000028 0000000000000000 DATA
  8 .rel.eh_frame 00000010 0000000000000000
  9 .symtab       00000090 0000000000000000

But ...

# llvm-objdump -h ./sockex2_kern.o

./sockex2_kern.o:	file format ELF64-BPF

Sections:
Idx Name          Size      Address          Type
  0               00000000 0000000000000000
  1 .strtab       00000017 0000000000000000
  2 .text         00000000 0000000000000000 TEXT DATA
  3 .symtab       00000018 0000000000000000
#

There's a fix from Yonghong here:

  https://patchwork.kernel.org/patch/10341471/

Could you try with that patch?

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] mlxsw: spectrum_switchdev: remove duplicated check on multicast_enable
From: Ido Schimmel @ 2018-04-24 12:51 UTC (permalink / raw)
  To: Colin King
  Cc: Jiri Pirko, Ido Schimmel, netdev, kernel-janitors, linux-kernel,
	nogahf
In-Reply-To: <20180424115139.21134-1-colin.king@canonical.com>

On Tue, Apr 24, 2018 at 12:51:39PM +0100, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> The check on bridge_port->bridge_device->multicast_enable is performed
> twice in two nested if statements. I can't see any reason for this, the
> second check is redundant and can be removed.
> 
> Detected by cppcheck:
> drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:1722]:
> (warning) Identical inner 'if' condition is always true.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> index c11c9a635866..989ed19e25c9 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
> @@ -1719,12 +1719,9 @@ __mlxsw_sp_port_mdb_del(struct mlxsw_sp_port *mlxsw_sp_port,
>  	int err;
>  
>  	if (bridge_port->bridge_device->multicast_enabled) {
> -		if (bridge_port->bridge_device->multicast_enabled) {

This looks like a copy-paste error. I believe the intention was to check
'!bridge_port->mrouter' so that if port is an mrouter port packets
hitting the MDB entry would still be sent to it even after it was
removed from the MDB entry ports list.

I will send a patch later this week after I run it by Nogah who worked
on this part.

Thanks for the patch.

> -			err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid,
> -						     false);
> -			if (err)
> -				netdev_err(dev, "Unable to remove port from SMID\n");
> -		}
> +		err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
> +		if (err)
> +			netdev_err(dev, "Unable to remove port from SMID\n");
>  	}
>  
>  	err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
> -- 
> 2.17.0
> 

^ permalink raw reply

* Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
From: Michal Hocko @ 2018-04-24 12:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: dm-devel, eric.dumazet, mst, netdev, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, edumazet, Andrew Morton, David Miller,
	Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804232003100.2299@file01.intranet.prod.int.rdu2.redhat.com>

On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
[...]
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	 */
>  	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>  
> +#ifdef CONFIG_DEBUG_SG
> +	/* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> +	if (!(prandom_u32_max(2) & 1))
> +		goto do_vmalloc;
> +#endif

I really do not think there is anything DEBUG_SG specific here. Why you
simply do not follow should_failslab path or even reuse the function?

> +
>  	/*
>  	 * We want to attempt a large physically contiguous block first because
>  	 * it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
>  
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
>  	return __vmalloc_node_flags_caller(size, node, flags,
>  			__builtin_return_address(0));
>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH bpf-next,v2 1/2] bpf: add helper for getting xfrm states
From: Daniel Borkmann @ 2018-04-24 12:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Eyal Birger
  Cc: netdev, shmulik, ast, fw, steffen.klassert
In-Reply-To: <20180423003430.dxc3frkp3opbvl7f@ast-mbp>

On 04/23/2018 02:34 AM, Alexei Starovoitov wrote:
> On Fri, Apr 20, 2018 at 06:43:56AM +0300, Eyal Birger wrote:
>> On Wed, 18 Apr 2018 15:31:03 -0700
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> On Thu, Apr 19, 2018 at 12:58:22AM +0300, Eyal Birger wrote:
>>>> This commit introduces a helper which allows fetching xfrm state
>>>> parameters by eBPF programs attached to TC.
>>>>
>>>> Prototype:
>>>> bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
>>>>
>>>> skb: pointer to skb
>>>> index: the index in the skb xfrm_state secpath array
>>>> xfrm_state: pointer to 'struct bpf_xfrm_state'
>>>> size: size of 'struct bpf_xfrm_state'
>>>> flags: reserved for future extensions
>>
>> <snip>
>>  
>>>> +#ifdef CONFIG_XFRM
>>>> +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32,
>>>> index,
>>>> +	   struct bpf_xfrm_state *, to, u32, size, u64, flags)
>>>> +{
>>>> +	const struct sec_path *sp = skb_sec_path(skb);
>>>> +	const struct xfrm_state *x;
>>>> +
>>>> +	if (!sp || unlikely(index >= sp->len || flags))
>>>> +		goto err_clear;
>>>> +
>>>> +	x = sp->xvec[index];
>>>> +
>>>> +	if (unlikely(size != sizeof(struct bpf_xfrm_state)))
>>>> +		goto err_clear;
>>>> +
>>>> +	to->reqid = x->props.reqid;
>>>> +	to->spi = be32_to_cpu(x->id.spi);
>>>> +	to->family = x->props.family;
>>>> +	if (to->family == AF_INET6) {
>>>> +		memcpy(to->remote_ipv6, x->props.saddr.a6,
>>>> +		       sizeof(to->remote_ipv6));
>>>> +	} else {
>>>> +		to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
>>>> +	}  
>>>
>>> that looks inconsistent. Why v4 is cpu endian, but v6 not?
>>
>> I agree. I followed the reference in bpf_skb_get_tunnel_key(). 
>> I can keep v4 in net endianess too.
> 
> argh.
> On one side it makes sense to be consistent with bpf_skb_get_tunnel_key()
> but it's certainly confusing to have v4 and v6 in different endianness.
> Imagine man page that says that bpf folks made a mistake in that
> helper can kept repeating it in other helpers for consistency...
> Daniel, what do you think?
> Do you remember the history with bpf_skb_get_tunnel_key and
> why it happened that way?

Check out d3aa45ce6b94 ("bpf: add helpers to access tunnel metadata").
I presume there was no particular reason for doing it this way, perhaps
to mimic old ld_abs kind of behavior, I don't know.

>>> Why change endianness of the spi?
>>
>> I felt it was more consistent with other fields and usually helpful for
>> programs. I can keep it in network order.
>>
>> In which case, do you expect it to be typed as __be32 in bpf.h?
>> (I haven't seen other cases)?
> 
> It can be __u32 with a comment /* Stored in network byte order */
> like in bunch of other fields.

Yeah, agree. I guess I would have been fine either way given this is
the way things are with the get/set tunnel helpers, but on the other
hand this helper does not really have a concrete tie to them, so given
we start fresh on this one, we should make both v4/v6 consistent and
document it appropriately.

Eyal, please respin the series with that. The rest was good to go
from my pov.

Thank you,
Daniel

^ permalink raw reply

* Re: [PATCH] mlxsw: spectrum_switchdev: remove duplicated check on multicast_enable
From: Colin Ian King @ 2018-04-24 12:55 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, Ido Schimmel, netdev, kernel-janitors, linux-kernel,
	nogahf
In-Reply-To: <20180424125104.GA2461@splinter.mtl.com>

On 24/04/18 13:51, Ido Schimmel wrote:
> On Tue, Apr 24, 2018 at 12:51:39PM +0100, Colin King wrote:
>> From: Colin Ian King <colin.king@canonical.com>
>>
>> The check on bridge_port->bridge_device->multicast_enable is performed
>> twice in two nested if statements. I can't see any reason for this, the
>> second check is redundant and can be removed.
>>
>> Detected by cppcheck:
>> drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c:1722]:
>> (warning) Identical inner 'if' condition is always true.
>>
>> Signed-off-by: Colin Ian King <colin.king@canonical.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
>> index c11c9a635866..989ed19e25c9 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
>> @@ -1719,12 +1719,9 @@ __mlxsw_sp_port_mdb_del(struct mlxsw_sp_port *mlxsw_sp_port,
>>  	int err;
>>  
>>  	if (bridge_port->bridge_device->multicast_enabled) {
>> -		if (bridge_port->bridge_device->multicast_enabled) {
> 
> This looks like a copy-paste error. I believe the intention was to check
> '!bridge_port->mrouter' so that if port is an mrouter port packets
> hitting the MDB entry would still be sent to it even after it was
> removed from the MDB entry ports list.

Ah, that makes a lot more sense.

> 
> I will send a patch later this week after I run it by Nogah who worked
> on this part.

OK, thanks!
> 
> Thanks for the patch.
> 
>> -			err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid,
>> -						     false);
>> -			if (err)
>> -				netdev_err(dev, "Unable to remove port from SMID\n");
>> -		}
>> +		err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
>> +		if (err)
>> +			netdev_err(dev, "Unable to remove port from SMID\n");
>>  	}
>>  
>>  	err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
>> -- 
>> 2.17.0
>>

^ permalink raw reply

* Re: [PATCH net-next] Revert "net: init sk_cookie for inet socket"
From: Yafang Shao @ 2018-04-24 12:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <29dbd735-5fd7-d4e4-288f-220f2075b645@gmail.com>

On Tue, Apr 24, 2018 at 8:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 04/24/2018 05:05 AM, Yafang Shao wrote:
>> This revert commit <c6849a3ac17e> ("net: init sk_cookie for inet socket")
>>
>> Per discussion with Eric.
>>
>
> I suggest you include a bit more details, about cache line false sharing.
>
> Thanks.
>

OK.
will update it with your message shared with me.

Thanks
Yafang

^ permalink raw reply

* [PATCH bpf-next v5 0/2] bpf: allow map helpers access to map values directly
From: Paul Chaignon via iovisor-dev @ 2018-04-24 13:06 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy

Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
can only access stack and packet memory.  This patchset allows these
helpers to directly access map values by passing registers of type
PTR_TO_MAP_VALUE.

The first patch changes the verifier; the second adds new test cases.

The first three versions of this patchset were sent on the iovisor-dev
mailing list only.

Changelogs:
  Changes in v5:
    - Refactor using check_helper_mem_access.
  Changes in v4:
    - Rebase.
  Changes in v3:
    - Bug fixes.
    - Negative test cases.
  Changes in v2:
    - Additional test cases for adjusted maps.

Paul Chaignon (2):
  bpf: allow map helpers access to map values directly
  tools/bpf: add verifier tests for accesses to map values

 kernel/bpf/verifier.c                       |  24 +--
 tools/testing/selftests/bpf/test_verifier.c | 266 ++++++++++++++++++++++++++++
 2 files changed, 273 insertions(+), 17 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH bpf-next v5 1/2] bpf: allow map helpers access to map values directly
From: Paul Chaignon via iovisor-dev @ 2018-04-24 13:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann,
	netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy

Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only
access stack and packet memory.  Allow these helpers to directly access
map values by passing registers of type PTR_TO_MAP_VALUE.

This change removes the need for an extra copy to the stack when using a
map value to perform a second map lookup, as in the following:

struct bpf_map_def SEC("maps") infobyreq = {
    .type = BPF_MAP_TYPE_HASHMAP,
    .key_size = sizeof(struct request *),
    .value_size = sizeof(struct info_t),
    .max_entries = 1024,
};
struct bpf_map_def SEC("maps") counts = {
    .type = BPF_MAP_TYPE_HASHMAP,
    .key_size = sizeof(struct info_t),
    .value_size = sizeof(u64),
    .max_entries = 1024,
};
SEC("kprobe/blk_account_io_start")
int bpf_blk_account_io_start(struct pt_regs *ctx)
{
    struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di);
    u64 *count = bpf_map_lookup_elem(&counts, info);
    (*count)++;
}

Signed-off-by: Paul Chaignon <paul.chaignon-C0LM0jrOve7QT0dZR+AlfA@public.gmane.org>
---
 kernel/bpf/verifier.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5dd1dcb..eb1a596 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1914,7 +1914,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 	if (arg_type == ARG_PTR_TO_MAP_KEY ||
 	    arg_type == ARG_PTR_TO_MAP_VALUE) {
 		expected_type = PTR_TO_STACK;
-		if (!type_is_pkt_pointer(type) &&
+		if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
 		    type != expected_type)
 			goto err_type;
 	} else if (arg_type == ARG_CONST_SIZE ||
@@ -1966,14 +1966,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose(env, "invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
-		if (type_is_pkt_pointer(type))
-			err = check_packet_access(env, regno, reg->off,
-						  meta->map_ptr->key_size,
-						  false);
-		else
-			err = check_stack_boundary(env, regno,
-						   meta->map_ptr->key_size,
-						   false, NULL);
+		err = check_helper_mem_access(env, regno,
+					      meta->map_ptr->key_size, false,
+					      NULL);
 	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
 		/* bpf_map_xxx(..., map_ptr, ..., value) call:
 		 * check [value, value + map->value_size) validity
@@ -1983,14 +1978,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose(env, "invalid map_ptr to access map->value\n");
 			return -EACCES;
 		}
-		if (type_is_pkt_pointer(type))
-			err = check_packet_access(env, regno, reg->off,
-						  meta->map_ptr->value_size,
-						  false);
-		else
-			err = check_stack_boundary(env, regno,
-						   meta->map_ptr->value_size,
-						   false, NULL);
+		err = check_helper_mem_access(env, regno,
+					      meta->map_ptr->value_size, false,
+					      NULL);
 	} else if (arg_type_is_mem_size(arg_type)) {
 		bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH bpf-next v5 2/2] tools/bpf: add verifier tests for accesses to map values
From: Paul Chaignon @ 2018-04-24 13:08 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev; +Cc: iovisor-dev, paul.chaignon

This patch adds new test cases for accesses to map values from map
helpers.

Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 266 ++++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3e7718b..165e9dd 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -64,6 +64,7 @@ struct bpf_test {
 	struct bpf_insn	insns[MAX_INSNS];
 	int fixup_map1[MAX_FIXUPS];
 	int fixup_map2[MAX_FIXUPS];
+	int fixup_map3[MAX_FIXUPS];
 	int fixup_prog[MAX_FIXUPS];
 	int fixup_map_in_map[MAX_FIXUPS];
 	const char *errstr;
@@ -88,6 +89,11 @@ struct test_val {
 	int foo[MAX_ENTRIES];
 };
 
+struct other_val {
+	long long foo;
+	long long bar;
+};
+
 static struct bpf_test tests[] = {
 	{
 		"add+sub+mul",
@@ -5594,6 +5600,257 @@ static struct bpf_test tests[] = {
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
 	},
 	{
+		"map lookup helper access to map",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 8 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map update helper access to map",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_IMM(BPF_REG_4, 0),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_update_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 10 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map update helper access to map: wrong size",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_IMM(BPF_REG_4, 0),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_update_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map1 = { 3 },
+		.fixup_map3 = { 10 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=8 off=0 size=16",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const imm)",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2,
+				      offsetof(struct other_val, bar)),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 9 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const imm): out-of-bound 1",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2,
+				      sizeof(struct other_val) - 4),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 9 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=16 off=12 size=8",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const imm): out-of-bound 2",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 9 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=16 off=-4 size=8",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const reg)",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_MOV64_IMM(BPF_REG_3,
+				      offsetof(struct other_val, bar)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 10 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const reg): out-of-bound 1",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_MOV64_IMM(BPF_REG_3,
+				      sizeof(struct other_val) - 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 10 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=16 off=12 size=8",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via const reg): out-of-bound 2",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_MOV64_IMM(BPF_REG_3, -4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 10 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=16 off=-4 size=8",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via variable)",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 7),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+			BPF_JMP_IMM(BPF_JGT, BPF_REG_3,
+				    offsetof(struct other_val, bar), 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 11 },
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via variable): no max check",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 10 },
+		.result = REJECT,
+		.errstr = "R2 unbounded memory access, make sure to bounds check any array access into a map",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
+		"map helper access to adjusted map (via variable): wrong max check",
+		.insns = {
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_2, 0, 0),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 7),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_0, 0),
+			BPF_JMP_IMM(BPF_JGT, BPF_REG_3,
+				    offsetof(struct other_val, bar) + 1, 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map3 = { 3, 11 },
+		.result = REJECT,
+		.errstr = "invalid access to map value, value_size=16 off=9 size=8",
+		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
+	},
+	{
 		"map element value is preserved across register spilling",
 		.insns = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
@@ -11533,6 +11790,7 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
 {
 	int *fixup_map1 = test->fixup_map1;
 	int *fixup_map2 = test->fixup_map2;
+	int *fixup_map3 = test->fixup_map3;
 	int *fixup_prog = test->fixup_prog;
 	int *fixup_map_in_map = test->fixup_map_in_map;
 
@@ -11556,6 +11814,14 @@ static void do_test_fixup(struct bpf_test *test, struct bpf_insn *prog,
 		} while (*fixup_map2);
 	}
 
+	if (*fixup_map3) {
+		map_fds[1] = create_map(sizeof(struct other_val), 1);
+		do {
+			prog[*fixup_map3].imm = map_fds[1];
+			fixup_map3++;
+		} while (*fixup_map3);
+	}
+
 	if (*fixup_prog) {
 		map_fds[2] = create_prog_array();
 		do {
-- 
2.7.4

^ permalink raw reply related

* [PATCH V7 net-next 02/14] net: Rename and export copy_skb_header
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

copy_skb_header is renamed to skb_copy_header and
exported. Exposing this function give more flexibility
in copying SKBs.
skb_copy and skb_copy_expand do not give enough control
over which parts are copied.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 9 +++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d274059..b4fff17 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1032,6 +1032,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
 struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 				   gfp_t gfp_mask, bool fclone);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ff49e35..3f5dba1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1305,7 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_mac_header += off;
 }
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1313,6 +1313,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1355,7 +1356,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1419,7 +1420,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1599,7 +1600,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 01/14] tcp: Add clean acked data hook
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Called when a TCP segment is acknowledged.
Could be used by application protocols who hold additional
metadata associated with the stream data.

This is required by TLS device offload to release
metadata associated with acknowledged TLS records.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/net/inet_connection_sock.h |  2 ++
 include/net/tcp.h                  |  8 ++++++++
 net/ipv4/tcp_input.c               | 25 +++++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b68fea0..2ab6667 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops {
  * @icsk_af_ops		   Operations which are AF_INET{4,6} specific
  * @icsk_ulp_ops	   Pluggable ULP control hook
  * @icsk_ulp_data	   ULP private data
+ * @icsk_clean_acked	   Clean acked data hook
  * @icsk_listen_portaddr_node	hash to the portaddr listener hashtable
  * @icsk_ca_state:	   Congestion control state
  * @icsk_retransmits:	   Number of unrecovered [RTO] timeouts
@@ -102,6 +103,7 @@ struct inet_connection_sock {
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
 	void			  *icsk_ulp_data;
+	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
 	__u8			  icsk_ca_state:6,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..cf803fe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2105,4 +2105,12 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+			     void (*cad)(struct sock *sk, u32 ack_seq));
+void clean_acked_data_disable(struct inet_connection_sock *icsk);
+
+#endif
+
 #endif	/* _TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 17b7858..8b92cd2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -112,6 +112,25 @@
 #define REXMIT_LOST	1 /* retransmit packets marked lost */
 #define REXMIT_NEW	2 /* FRTO-style transmit of unsent/new packets */
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+static DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled);
+
+void clean_acked_data_enable(struct inet_connection_sock *icsk,
+			     void (*cad)(struct sock *sk, u32 ack_seq))
+{
+	icsk->icsk_clean_acked = cad;
+	static_branch_inc(&clean_acked_data_enabled);
+}
+EXPORT_SYMBOL_GPL(clean_acked_data_enable);
+
+void clean_acked_data_disable(struct inet_connection_sock *icsk)
+{
+	static_branch_dec(&clean_acked_data_enabled);
+	icsk->icsk_clean_acked = NULL;
+}
+EXPORT_SYMBOL_GPL(clean_acked_data_disable);
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -3561,6 +3580,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, prior_snd_una)) {
 		flag |= FLAG_SND_UNA_ADVANCED;
 		icsk->icsk_retransmits = 0;
+
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+		if (static_branch_unlikely(&clean_acked_data_enabled))
+			if (icsk->icsk_clean_acked)
+				icsk->icsk_clean_acked(sk, ack);
+#endif
 	}
 
 	prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 00/14] TLS offload, netdev & MLX5 support
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai

Hi Dave,

The following series provides TLS TX inline crypto offload.

v1->v2:
   - Added IS_ENABLED(CONFIG_TLS_DEVICE) and a STATIC_KEY for icsk_clean_acked
   - File license fix
   - Fix spelling, comment by DaveW
   - Move memory allocations out of tls_set_device_offload and other misc fixes,
	comments by Kiril.

v2->v3:
   - Reversed xmas tree where needed and style fixes
   - Removed the need for skb_page_frag_refill, per Eric's comment
   - IPv6 dependency fixes

v3->v4:
   - Remove "inline" from functions in C files
   - Make clean_acked_data_enabled a static variable and add enable/disable functions to control it.
   - Remove unnecessary variable initialization mentioned by ShannonN
   - Rebase over TLS RX
   - Refactor the tls_software_fallback to reduce the number of variables mentioned by KirilT

v4->v5:
   - Add missing CONFIG_TLS_DEVICE

v5->v6:
   - Move changes to the software implementation into a seperate patch
   - Fix some checkpatch warnings 
   - GPL export the enable/disable clean_acked_data functions

v6->v7:
   - Use the dst_entry to obtain the netdev in dev_get_by_index 
   - Remove the IPv6 patch since it is redundent now

This series adds a generic infrastructure to offload TLS crypto to a
network devices. It enables the kernel TLS socket to skip encryption and
authentication operations on the transmit side of the data path. Leaving
those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the
TCP layer just like the SW KTLS implementation and using the same API.
TCP segmentation is mostly unaffected. Currently the only exception is
that we prevent mixed SKBs where only part of the payload requires
offload. In the future we are likely to add a similar restriction
following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. The offloaded implementation builds "plaintext TLS record", those
records contain plaintext instead of ciphertext and place holder bytes
instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence
number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
TLS socket, we can use the tls NIC offload infrastructure to obtain
enough context to encrypt the payload of the SKB.
A TLS record is released when the last byte of the record is ack'ed,
this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload
implementations.  However it is currently written with the
implementation below in mind:
The NIC assumes that packets from each offloaded stream are sent as
plaintext and in-order. It keeps track of the TLS records in the TCP
stream. When a packet marked for offload is transmitted, the NIC
encrypts the payload in-place and puts authentication tags in the
relevant place holders.

The responsibility for handling out-of-order packets (i.e. TCP
retransmission, qdisc drops) falls on the netdev driver.

The netdev driver keeps track of the expected TCP SN from the NIC's
perspective.  If the next packet to transmit matches the expected TCP
SN, the driver advances the expected TCP SN, and transmits the packet
with TLS offload indication.

If the next packet to transmit does not match the expected TCP SN. The
driver calls the TLS layer to obtain the TLS record that includes the
TCP of the packet for transmission. Using this TLS record, the driver
posts a work entry on the transmit queue to reconstruct the NIC TLS
state required for the offload of the out-of-order packet. It updates
the expected TCP SN accordingly and transmit the now in-order packet.
The same queue is used for packet transmission and TLS context
reconstruction to avoid the need for flushing the transmit queue before
issuing the context reconstruction request.

Expected TCP SN is accessed without a lock, under the assumption that
TCP doesn't transmit SKBs from different TX queue concurrently.

If packets are rerouted to a different netdevice, then a software
fallback routine handles encryption.

Paper: https://www.netdevconf.org/1.2/papers/netdevconf-TLS.pdf

Boris Pismenny (3):
  net/tls: Split conf to rx + tx
  MAINTAINERS: Update mlx5 innova driver maintainers
  MAINTAINERS: Update TLS maintainers

Ilya Lesokhin (11):
  tcp: Add clean acked data hook
  net: Rename and export copy_skb_header
  net: Add Software fallback infrastructure for socket dependent
    offloads
  net: Add TLS offload netdev ops
  net: Add TLS TX offload features
  net/tls: Add generic NIC offload infrastructure
  net/mlx5e: Move defines out of ipsec code
  net/mlx5: Accel, Add TLS tx offload interface
  net/mlx5e: TLS, Add Innova TLS TX support
  net/mlx5e: TLS, Add Innova TLS TX offload data path
  net/mlx5e: TLS, Add error statistics

 MAINTAINERS                                        |  19 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 ++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  21 +
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 197 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  87 +++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 278 ++++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++
 .../mellanox/mlx5/core/en_accel/tls_stats.c        |  89 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  32 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 .../net/ethernet/mellanox/mlx5/core/fpga/ipsec.c   |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/sdk.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 +++
 include/linux/netdev_features.h                    |   2 +
 include/linux/netdevice.h                          |  24 +
 include/linux/skbuff.h                             |   1 +
 include/net/inet_connection_sock.h                 |   2 +
 include/net/sock.h                                 |  21 +
 include/net/tcp.h                                  |   8 +
 include/net/tls.h                                  | 120 +++-
 net/Kconfig                                        |   4 +
 net/core/dev.c                                     |   4 +
 net/core/ethtool.c                                 |   1 +
 net/core/skbuff.c                                  |   9 +-
 net/ipv4/tcp_input.c                               |  25 +
 net/tls/Kconfig                                    |  10 +
 net/tls/Makefile                                   |   2 +
 net/tls/tls_device.c                               | 765 +++++++++++++++++++++
 net/tls/tls_device_fallback.c                      | 450 ++++++++++++
 net/tls/tls_main.c                                 | 143 ++--
 net/tls/tls_sw.c                                   | 134 ++--
 43 files changed, 3356 insertions(+), 189 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
 create mode 100644 net/tls/tls_device.c
 create mode 100644 net/tls/tls_device_fallback.c

-- 
1.8.3.1

^ permalink raw reply

* [PATCH V7 net-next 03/14] net: Add Software fallback infrastructure for socket dependent offloads
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

With socket dependent offloads we rely on the netdev to transform
the transmitted packets before sending them to the wire.
When a packet from an offloaded socket is rerouted to a different
device we need to detect it and do the transformation in software.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/net/sock.h | 21 +++++++++++++++++++++
 net/Kconfig        |  4 ++++
 net/core/dev.c     |  4 ++++
 3 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 74d725f..3c568b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,6 +481,11 @@ struct sock {
 	void			(*sk_error_report)(struct sock *sk);
 	int			(*sk_backlog_rcv)(struct sock *sk,
 						  struct sk_buff *skb);
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sk_buff*		(*sk_validate_xmit_skb)(struct sock *sk,
+							struct net_device *dev,
+							struct sk_buff *skb);
+#endif
 	void                    (*sk_destruct)(struct sock *sk);
 	struct sock_reuseport __rcu	*sk_reuseport_cb;
 	struct rcu_head		sk_rcu;
@@ -2332,6 +2337,22 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+/* Checks if this SKB belongs to an HW offloaded socket
+ * and whether any SW fallbacks are required based on dev.
+ */
+static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
+						   struct net_device *dev)
+{
+#ifdef CONFIG_SOCK_VALIDATE_XMIT
+	struct sock *sk = skb->sk;
+
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#endif
+
+	return skb;
+}
+
 /* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
  * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
  */
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a44..c6a3f14 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -407,6 +407,10 @@ config GRO_CELLS
 	bool
 	default n
 
+config SOCK_VALIDATE_XMIT
+	bool
+	default n
+
 config NET_DEVLINK
 	tristate "Network physical/parent device Netlink interface"
 	help
diff --git a/net/core/dev.c b/net/core/dev.c
index c624a04..9fcf993 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3114,6 +3114,10 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device
 	if (unlikely(!skb))
 		goto out_null;
 
+	skb = sk_validate_xmit_skb(skb, dev);
+	if (unlikely(!skb))
+		goto out_null;
+
 	if (netif_needs_gso(skb, features)) {
 		struct sk_buff *segs;
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 05/14] net: Add TLS TX offload features
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

This patch adds a netdev feature to configure TLS TX offloads.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/linux/netdev_features.h | 2 ++
 net/core/ethtool.c              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 35b79f4..7103791 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,7 @@ enum {
 	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
 	NETIF_F_HW_ESP_TX_CSUM_BIT,	/* ESP with TX checksum offload */
 	NETIF_F_RX_UDP_TUNNEL_PORT_BIT, /* Offload of RX port for UDP tunnels */
+	NETIF_F_HW_TLS_TX_BIT,		/* Hardware TLS TX offload */
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
@@ -147,6 +148,7 @@ enum {
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
 #define	NETIF_F_RX_UDP_TUNNEL_PORT  __NETIF_F(RX_UDP_TUNNEL_PORT)
 #define NETIF_F_HW_TLS_RECORD	__NETIF_F(HW_TLS_RECORD)
+#define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6..cf08af9 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -109,6 +109,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
 	[NETIF_F_HW_ESP_TX_CSUM_BIT] =	 "esp-tx-csum-hw-offload",
 	[NETIF_F_RX_UDP_TUNNEL_PORT_BIT] =	 "rx-udp_tunnel-port-offload",
 	[NETIF_F_HW_TLS_RECORD_BIT] =	"tls-hw-record",
+	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 };
 
 static const char
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 10/14] net/mlx5e: TLS, Add Innova TLS TX support
From: Boris Pismenny @ 2018-04-24 13:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add NETIF_F_HW_TLS_TX capability and expose tlsdev_ops to work with the
TLS generic NIC offload infrastructure.
The NETIF_F_HW_TLS_TX capability will be added in the next patch.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 173 +++++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h |  65 ++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   3 +
 5 files changed, 254 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 1225703..ee66847 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -86,3 +86,14 @@ config MLX5_EN_IPSEC
 	  Build support for IPsec cryptography-offload accelaration in the NIC.
 	  Note: Support for hardware with this capability needs to be selected
 	  for this option to become available.
+
+config MLX5_EN_TLS
+	bool "TLS cryptography-offload accelaration"
+	depends on MLX5_CORE_EN
+	depends on TLS_DEVICE
+	depends on MLX5_ACCEL
+	default n
+	---help---
+	  Build support for TLS cryptography-offload accelaration in the NIC.
+	  Note: Support for hardware with this capability needs to be selected
+	  for this option to become available.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 9989e52..50872ed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,4 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
new file mode 100644
index 0000000..38d8810
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -0,0 +1,173 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/netdevice.h>
+#include <net/ipv6.h>
+#include "en_accel/tls.h"
+#include "accel/tls.h"
+
+static void mlx5e_tls_set_ipv4_flow(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 0);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_daddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv4_layout.ipv4),
+	       &inet->inet_rcv_saddr, MLX5_FLD_SZ_BYTES(ipv4_layout, ipv4));
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static void mlx5e_tls_set_ipv6_flow(void *flow, struct sock *sk)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	MLX5_SET(tls_flow, flow, ipv6, 1);
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_ipv4_dst_ipv6.ipv6_layout.ipv6),
+	       &sk->sk_v6_daddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_ipv4_src_ipv6.ipv6_layout.ipv6),
+	       &np->saddr, MLX5_FLD_SZ_BYTES(ipv6_layout, ipv6));
+}
+#endif
+
+static void mlx5e_tls_set_flow_tcp_ports(void *flow, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, src_port), &inet->inet_sport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, src_port));
+	memcpy(MLX5_ADDR_OF(tls_flow, flow, dst_port), &inet->inet_dport,
+	       MLX5_FLD_SZ_BYTES(tls_flow, dst_port));
+}
+
+static int mlx5e_tls_set_flow(void *flow, struct sock *sk, u32 caps)
+{
+	switch (sk->sk_family) {
+	case AF_INET:
+		mlx5e_tls_set_ipv4_flow(flow, sk);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (!sk->sk_ipv6only &&
+		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED) {
+			mlx5e_tls_set_ipv4_flow(flow, sk);
+			break;
+		}
+		if (!(caps & MLX5_ACCEL_TLS_IPV6))
+			goto error_out;
+
+		mlx5e_tls_set_ipv6_flow(flow, sk);
+		break;
+#endif
+	default:
+		goto error_out;
+	}
+
+	mlx5e_tls_set_flow_tcp_ports(flow, sk);
+	return 0;
+error_out:
+	return -EINVAL;
+}
+
+static int mlx5e_tls_add(struct net_device *netdev, struct sock *sk,
+			 enum tls_offload_ctx_dir direction,
+			 struct tls_crypto_info *crypto_info,
+			 u32 start_offload_tcp_sn)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct mlx5_core_dev *mdev = priv->mdev;
+	u32 caps = mlx5_accel_tls_device_caps(mdev);
+	int ret = -ENOMEM;
+	void *flow;
+
+	if (direction != TLS_OFFLOAD_CTX_DIR_TX)
+		return -EINVAL;
+
+	flow = kzalloc(MLX5_ST_SZ_BYTES(tls_flow), GFP_KERNEL);
+	if (!flow)
+		return ret;
+
+	ret = mlx5e_tls_set_flow(flow, sk, caps);
+	if (ret)
+		goto free_flow;
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		struct mlx5e_tls_offload_context *tx_ctx =
+		    mlx5e_get_tls_tx_context(tls_ctx);
+		u32 swid;
+
+		ret = mlx5_accel_tls_add_tx_flow(mdev, flow, crypto_info,
+						 start_offload_tcp_sn, &swid);
+		if (ret < 0)
+			goto free_flow;
+
+		tx_ctx->swid = htonl(swid);
+		tx_ctx->expected_seq = start_offload_tcp_sn;
+	}
+
+	return 0;
+free_flow:
+	kfree(flow);
+	return ret;
+}
+
+static void mlx5e_tls_del(struct net_device *netdev,
+			  struct tls_context *tls_ctx,
+			  enum tls_offload_ctx_dir direction)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+
+	if (direction == TLS_OFFLOAD_CTX_DIR_TX) {
+		u32 swid = ntohl(mlx5e_get_tls_tx_context(tls_ctx)->swid);
+
+		mlx5_accel_tls_del_tx_flow(priv->mdev, swid);
+	} else {
+		netdev_err(netdev, "unsupported direction %d\n", direction);
+	}
+}
+
+static const struct tlsdev_ops mlx5e_tls_ops = {
+	.tls_dev_add = mlx5e_tls_add,
+	.tls_dev_del = mlx5e_tls_del,
+};
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
+{
+	struct net_device *netdev = priv->netdev;
+
+	if (!mlx5_accel_is_tls_device(priv->mdev))
+		return;
+
+	netdev->tlsdev_ops = &mlx5e_tls_ops;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
new file mode 100644
index 0000000..f7216b9
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+#ifndef __MLX5E_TLS_H__
+#define __MLX5E_TLS_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <net/tls.h>
+#include "en.h"
+
+struct mlx5e_tls_offload_context {
+	struct tls_offload_context base;
+	u32 expected_seq;
+	__be32 swid;
+};
+
+static inline struct mlx5e_tls_offload_context *
+mlx5e_get_tls_tx_context(struct tls_context *tls_ctx)
+{
+	BUILD_BUG_ON(sizeof(struct mlx5e_tls_offload_context) >
+		     TLS_OFFLOAD_CONTEXT_SIZE);
+	return container_of(tls_offload_ctx(tls_ctx),
+			    struct mlx5e_tls_offload_context,
+			    base);
+}
+
+void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+
+#else
+
+static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+
+#endif
+
+#endif /* __MLX5E_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f100374..5a6aa18 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -42,7 +42,9 @@
 #include "en_rep.h"
 #include "en_accel/ipsec.h"
 #include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "vxlan.h"
 
 struct mlx5e_rq_param {
@@ -4355,6 +4357,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 #endif
 
 	mlx5e_ipsec_build_netdev(priv);
+	mlx5e_tls_build_netdev(priv);
 }
 
 static void mlx5e_create_q_counters(struct mlx5e_priv *priv)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 04/14] net: Add TLS offload netdev ops
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin,
	Aviad Yehezkel
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add new netdev ops to add and delete tls context

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
---
 include/linux/netdevice.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 14e0777..949a12ea 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -865,6 +865,26 @@ struct xfrmdev_ops {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+enum tls_offload_ctx_dir {
+	TLS_OFFLOAD_CTX_DIR_RX,
+	TLS_OFFLOAD_CTX_DIR_TX,
+};
+
+struct tls_crypto_info;
+struct tls_context;
+
+struct tlsdev_ops {
+	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
+			   enum tls_offload_ctx_dir direction,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn);
+	void (*tls_dev_del)(struct net_device *netdev,
+			    struct tls_context *ctx,
+			    enum tls_offload_ctx_dir direction);
+};
+#endif
+
 struct dev_ifalias {
 	struct rcu_head rcuhead;
 	char ifalias[];
@@ -1750,6 +1770,10 @@ struct net_device {
 	const struct xfrmdev_ops *xfrmdev_ops;
 #endif
 
+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	const struct tlsdev_ops *tlsdev_ops;
+#endif
+
 	const struct header_ops *header_ops;
 
 	unsigned int		flags;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 12/14] net/mlx5e: TLS, Add error statistics
From: Boris Pismenny @ 2018-04-24 13:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add statistics for rare TLS related errors.
Since the errors are rare we have a counter per netdev
rather then per SQ.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  3 +
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c | 22 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.h | 22 ++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 24 +++---
 .../mellanox/mlx5/core/en_accel/tls_stats.c        | 89 ++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 22 ++++++
 8 files changed, 178 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index ec785f5..a7135f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o en_accel/tls_stats.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index d72145f..a863907 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -798,6 +798,9 @@ struct mlx5e_priv {
 #ifdef CONFIG_MLX5_EN_IPSEC
 	struct mlx5e_ipsec        *ipsec;
 #endif
+#ifdef CONFIG_MLX5_EN_TLS
+	struct mlx5e_tls          *tls;
+#endif
 };
 
 struct mlx5e_profile {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index aa6981c..d167845 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -173,3 +173,25 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
+
+int mlx5e_tls_init(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+
+	if (!tls)
+		return -ENOMEM;
+
+	priv->tls = tls;
+	return 0;
+}
+
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv)
+{
+	struct mlx5e_tls *tls = priv->tls;
+
+	if (!tls)
+		return;
+
+	kfree(tls);
+	priv->tls = NULL;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
index f7216b9..b616217 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.h
@@ -38,6 +38,17 @@
 #include <net/tls.h>
 #include "en.h"
 
+struct mlx5e_tls_sw_stats {
+	atomic64_t tx_tls_drop_metadata;
+	atomic64_t tx_tls_drop_resync_alloc;
+	atomic64_t tx_tls_drop_no_sync_data;
+	atomic64_t tx_tls_drop_bypass_required;
+};
+
+struct mlx5e_tls {
+	struct mlx5e_tls_sw_stats sw_stats;
+};
+
 struct mlx5e_tls_offload_context {
 	struct tls_offload_context base;
 	u32 expected_seq;
@@ -55,10 +66,21 @@ struct mlx5e_tls_offload_context {
 }
 
 void mlx5e_tls_build_netdev(struct mlx5e_priv *priv);
+int mlx5e_tls_init(struct mlx5e_priv *priv);
+void mlx5e_tls_cleanup(struct mlx5e_priv *priv);
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv);
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data);
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data);
 
 #else
 
 static inline void mlx5e_tls_build_netdev(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_init(struct mlx5e_priv *priv) { return 0; }
+static inline void mlx5e_tls_cleanup(struct mlx5e_priv *priv) { }
+static inline int mlx5e_tls_get_count(struct mlx5e_priv *priv) { return 0; }
+static inline int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data) { return 0; }
+static inline int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data) { return 0; }
 
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index 49e8d45..ad2790f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -164,7 +164,8 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
 mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
 		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		     struct mlx5e_tx_wqe **wqe,
-		     u16 *pi)
+		     u16 *pi,
+		     struct mlx5e_tls *tls)
 {
 	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
 	struct sync_info info;
@@ -175,12 +176,14 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
 
 	sq->stats.tls_ooo++;
 
-	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info)) {
 		/* We might get here if a retransmission reaches the driver
 		 * after the relevant record is acked.
 		 * It should be safe to drop the packet in this case
 		 */
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_no_sync_data);
 		goto err_out;
+	}
 
 	if (unlikely(info.sync_len < 0)) {
 		u32 payload;
@@ -192,21 +195,22 @@ static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
 			 */
 			return skb;
 
-		netdev_err(skb->dev,
-			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
-			   tcp_seq, tcp_seq + payload + info.sync_len,
-			   tcp_seq + payload);
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_bypass_required);
 		goto err_out;
 	}
 
-	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_metadata);
 		goto err_out;
+	}
 
 	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
 	linear_len += headln + sizeof(info.rcd_sn);
 	nskb = alloc_skb(linear_len, GFP_ATOMIC);
-	if (unlikely(!nskb))
+	if (unlikely(!nskb)) {
+		atomic64_inc(&tls->sw_stats.tx_tls_drop_resync_alloc);
 		goto err_out;
+	}
 
 	context->expected_seq = tcp_seq + skb->len - headln;
 	skb_put(nskb, linear_len);
@@ -234,6 +238,7 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 					struct mlx5e_tx_wqe **wqe,
 					u16 *pi)
 {
+	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5e_tls_offload_context *context;
 	struct tls_context *tls_ctx;
 	u32 expected_seq;
@@ -256,11 +261,12 @@ struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
 	expected_seq = context->expected_seq;
 
 	if (unlikely(expected_seq != skb_seq)) {
-		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi, priv->tls);
 		goto out;
 	}
 
 	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		atomic64_inc(&priv->tls->sw_stats.tx_tls_drop_metadata);
 		dev_kfree_skb_any(skb);
 		skb = NULL;
 		goto out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
new file mode 100644
index 0000000..01468ec
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_stats.c
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/ethtool.h>
+#include <net/sock.h>
+
+#include "en.h"
+#include "accel/tls.h"
+#include "fpga/sdk.h"
+#include "en_accel/tls.h"
+
+static const struct counter_desc mlx5e_tls_sw_stats_desc[] = {
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_metadata) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_resync_alloc) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_no_sync_data) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_tls_sw_stats, tx_tls_drop_bypass_required) },
+};
+
+#define MLX5E_READ_CTR_ATOMIC64(ptr, dsc, i) \
+	atomic64_read((atomic64_t *)((char *)(ptr) + (dsc)[i].offset))
+
+#define NUM_TLS_SW_COUNTERS ARRAY_SIZE(mlx5e_tls_sw_stats_desc)
+
+int mlx5e_tls_get_count(struct mlx5e_priv *priv)
+{
+	if (!priv->tls)
+		return 0;
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_strings(struct mlx5e_priv *priv, uint8_t *data)
+{
+	unsigned int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		strcpy(data + (idx++) * ETH_GSTRING_LEN,
+		       mlx5e_tls_sw_stats_desc[i].format);
+
+	return NUM_TLS_SW_COUNTERS;
+}
+
+int mlx5e_tls_get_stats(struct mlx5e_priv *priv, u64 *data)
+{
+	int i, idx = 0;
+
+	if (!priv->tls)
+		return 0;
+
+	for (i = 0; i < NUM_TLS_SW_COUNTERS; i++)
+		data[idx++] =
+		    MLX5E_READ_CTR_ATOMIC64(&priv->tls->sw_stats,
+					    mlx5e_tls_sw_stats_desc, i);
+
+	return NUM_TLS_SW_COUNTERS;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b1a1bc8..0382b1e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4401,12 +4401,16 @@ static void mlx5e_nic_init(struct mlx5_core_dev *mdev,
 	err = mlx5e_ipsec_init(priv);
 	if (err)
 		mlx5_core_err(mdev, "IPSec initialization failed, %d\n", err);
+	err = mlx5e_tls_init(priv);
+	if (err)
+		mlx5_core_err(mdev, "TLS initialization failed, %d\n", err);
 	mlx5e_build_nic_netdev(netdev);
 	mlx5e_vxlan_init(priv);
 }
 
 static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 {
+	mlx5e_tls_cleanup(priv);
 	mlx5e_ipsec_cleanup(priv);
 	mlx5e_vxlan_cleanup(priv);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index c04cf2b..e17919c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -32,6 +32,7 @@
 
 #include "en.h"
 #include "en_accel/ipsec.h"
+#include "en_accel/tls.h"
 
 static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_packets) },
@@ -1075,6 +1076,22 @@ static void mlx5e_grp_ipsec_update_stats(struct mlx5e_priv *priv)
 	mlx5e_ipsec_update_stats(priv);
 }
 
+static int mlx5e_grp_tls_get_num_stats(struct mlx5e_priv *priv)
+{
+	return mlx5e_tls_get_count(priv);
+}
+
+static int mlx5e_grp_tls_fill_strings(struct mlx5e_priv *priv, u8 *data,
+				      int idx)
+{
+	return idx + mlx5e_tls_get_strings(priv, data + idx * ETH_GSTRING_LEN);
+}
+
+static int mlx5e_grp_tls_fill_stats(struct mlx5e_priv *priv, u64 *data, int idx)
+{
+	return idx + mlx5e_tls_get_stats(priv, data + idx);
+}
+
 static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, bytes) },
@@ -1278,6 +1295,11 @@ static int mlx5e_grp_channels_fill_stats(struct mlx5e_priv *priv, u64 *data,
 		.update_stats = mlx5e_grp_ipsec_update_stats,
 	},
 	{
+		.get_num_stats = mlx5e_grp_tls_get_num_stats,
+		.fill_strings = mlx5e_grp_tls_fill_strings,
+		.fill_stats = mlx5e_grp_tls_fill_stats,
+	},
+	{
 		.get_num_stats = mlx5e_grp_channels_get_num_stats,
 		.fill_strings = mlx5e_grp_channels_fill_strings,
 		.fill_stats = mlx5e_grp_channels_fill_stats,
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 09/14] net/mlx5: Accel, Add TLS tx offload interface
From: Boris Pismenny @ 2018-04-24 13:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Add routines for manipulating TLS TX offload contexts.

In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.

Add implementation for Innova TLS (FPGA-based) hardware.

These routines will be used by the TLS offload support in a later patch

mlx5/accel is a middle acceleration layer to allow mlx5e and other ULPs
to work directly with mlx5_core rather than Innova FPGA or other mlx5
acceleration providers.

In the future, when IPSec/TLS or any other acceleration gets integrated
into ConnectX chip, mlx5/accel layer will provide the integrated
acceleration, rather than the Innova one.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   4 +-
 .../net/ethernet/mellanox/mlx5/core/accel/tls.c    |  71 +++
 .../net/ethernet/mellanox/mlx5/core/accel/tls.h    |  86 ++++
 .../net/ethernet/mellanox/mlx5/core/fpga/core.h    |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c | 563 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h |  68 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  11 +
 include/linux/mlx5/mlx5_ifc.h                      |  16 -
 include/linux/mlx5/mlx5_ifc_fpga.h                 |  77 +++
 9 files changed, 879 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index c805769..9989e52 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -8,10 +8,10 @@ mlx5_core-y :=	main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 		fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o lib/clock.o \
 		diag/fs_tracepoint.o
 
-mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o
+mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o accel/tls.o
 
 mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
-		fpga/ipsec.o
+		fpga/ipsec.o fpga/tls.o
 
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en_stats.o vxlan.o \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
new file mode 100644
index 0000000..77ac19f
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.c
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+
+#include "accel/tls.h"
+#include "mlx5_core.h"
+#include "fpga/tls.h"
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	return mlx5_fpga_tls_add_tx_flow(mdev, flow, crypto_info,
+					 start_offload_tcp_sn, p_swid);
+}
+
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid)
+{
+	mlx5_fpga_tls_del_tx_flow(mdev, swid, GFP_KERNEL);
+}
+
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_is_tls_device(mdev);
+}
+
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_device_caps(mdev);
+}
+
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev)
+{
+	return mlx5_fpga_tls_init(mdev);
+}
+
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	mlx5_fpga_tls_cleanup(mdev);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
new file mode 100644
index 0000000..6f9c9f4
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/accel/tls.h
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_ACCEL_TLS_H__
+#define __MLX5_ACCEL_TLS_H__
+
+#include <linux/mlx5/driver.h>
+#include <linux/tls.h>
+
+#ifdef CONFIG_MLX5_ACCEL
+
+enum {
+	MLX5_ACCEL_TLS_TX = BIT(0),
+	MLX5_ACCEL_TLS_RX = BIT(1),
+	MLX5_ACCEL_TLS_V12 = BIT(2),
+	MLX5_ACCEL_TLS_V13 = BIT(3),
+	MLX5_ACCEL_TLS_LRO = BIT(4),
+	MLX5_ACCEL_TLS_IPV6 = BIT(5),
+	MLX5_ACCEL_TLS_AES_GCM128 = BIT(30),
+	MLX5_ACCEL_TLS_AES_GCM256 = BIT(31),
+};
+
+struct mlx5_ifc_tls_flow_bits {
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         reserved_at_2[0x1e];
+};
+
+int mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			       struct tls_crypto_info *crypto_info,
+			       u32 start_offload_tcp_sn, u32 *p_swid);
+void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid);
+bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev);
+u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev);
+int mlx5_accel_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev);
+
+#else
+
+static inline int
+mlx5_accel_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			   struct tls_crypto_info *crypto_info,
+			   u32 start_offload_tcp_sn, u32 *p_swid) { return 0; }
+static inline void mlx5_accel_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid) { }
+static inline bool mlx5_accel_is_tls_device(struct mlx5_core_dev *mdev) { return false; }
+static inline u32 mlx5_accel_tls_device_caps(struct mlx5_core_dev *mdev) { return 0; }
+static inline int mlx5_accel_tls_init(struct mlx5_core_dev *mdev) { return 0; }
+static inline void mlx5_accel_tls_cleanup(struct mlx5_core_dev *mdev) { }
+
+#endif
+
+#endif	/* __MLX5_ACCEL_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
index 82405ed..3e2355c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
@@ -53,6 +53,7 @@ struct mlx5_fpga_device {
 	} conn_res;
 
 	struct mlx5_fpga_ipsec *ipsec;
+	struct mlx5_fpga_tls *tls;
 };
 
 #define mlx5_fpga_dbg(__adev, format, ...) \
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
new file mode 100644
index 0000000..1abc217
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c
@@ -0,0 +1,563 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/mlx5/device.h>
+#include "fpga/tls.h"
+#include "fpga/cmd.h"
+#include "fpga/sdk.h"
+#include "fpga/core.h"
+#include "accel/tls.h"
+
+struct mlx5_fpga_tls_command_context;
+
+typedef void (*mlx5_fpga_tls_command_complete)
+	(struct mlx5_fpga_conn *conn, struct mlx5_fpga_device *fdev,
+	 struct mlx5_fpga_tls_command_context *ctx,
+	 struct mlx5_fpga_dma_buf *resp);
+
+struct mlx5_fpga_tls_command_context {
+	struct list_head list;
+	/* There is no guarantee on the order between the TX completion
+	 * and the command response.
+	 * The TX completion is going to touch cmd->buf even in
+	 * the case of successful transmission.
+	 * So instead of requiring separate allocations for cmd
+	 * and cmd->buf we've decided to use a reference counter
+	 */
+	refcount_t ref;
+	struct mlx5_fpga_dma_buf buf;
+	mlx5_fpga_tls_command_complete complete;
+};
+
+static void
+mlx5_fpga_tls_put_command_ctx(struct mlx5_fpga_tls_command_context *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		kfree(ctx);
+}
+
+static void mlx5_fpga_tls_cmd_complete(struct mlx5_fpga_device *fdev,
+				       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_fpga_conn *conn = fdev->tls->conn;
+	struct mlx5_fpga_tls_command_context *ctx;
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	ctx = list_first_entry(&tls->pending_cmds,
+			       struct mlx5_fpga_tls_command_context, list);
+	list_del(&ctx->list);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+	ctx->complete(conn, fdev, ctx, resp);
+}
+
+static void mlx5_fpga_cmd_send_complete(struct mlx5_fpga_conn *conn,
+					struct mlx5_fpga_device *fdev,
+					struct mlx5_fpga_dma_buf *buf,
+					u8 status)
+{
+	struct mlx5_fpga_tls_command_context *ctx =
+	    container_of(buf, struct mlx5_fpga_tls_command_context, buf);
+
+	mlx5_fpga_tls_put_command_ctx(ctx);
+
+	if (unlikely(status))
+		mlx5_fpga_tls_cmd_complete(fdev, NULL);
+}
+
+static void mlx5_fpga_tls_cmd_send(struct mlx5_fpga_device *fdev,
+				   struct mlx5_fpga_tls_command_context *cmd,
+				   mlx5_fpga_tls_command_complete complete)
+{
+	struct mlx5_fpga_tls *tls = fdev->tls;
+	unsigned long flags;
+	int ret;
+
+	refcount_set(&cmd->ref, 2);
+	cmd->complete = complete;
+	cmd->buf.complete = mlx5_fpga_cmd_send_complete;
+
+	spin_lock_irqsave(&tls->pending_cmds_lock, flags);
+	/* mlx5_fpga_sbu_conn_sendmsg is called under pending_cmds_lock
+	 * to make sure commands are inserted to the tls->pending_cmds list
+	 * and the command QP in the same order.
+	 */
+	ret = mlx5_fpga_sbu_conn_sendmsg(tls->conn, &cmd->buf);
+	if (likely(!ret))
+		list_add_tail(&cmd->list, &tls->pending_cmds);
+	else
+		complete(tls->conn, fdev, cmd, NULL);
+	spin_unlock_irqrestore(&tls->pending_cmds_lock, flags);
+}
+
+/* Start of context identifiers range (inclusive) */
+#define SWID_START	0
+/* End of context identifiers range (exclusive) */
+#define SWID_END	BIT(24)
+
+static int mlx5_fpga_tls_alloc_swid(struct idr *idr, spinlock_t *idr_spinlock,
+				    void *ptr)
+{
+	int ret;
+
+	/* TLS metadata format is 1 byte for syndrome followed
+	 * by 3 bytes of swid (software ID)
+	 * swid must not exceed 3 bytes.
+	 * See tls_rxtx.c:insert_pet() for details
+	 */
+	BUILD_BUG_ON((SWID_END - 1) & 0xFF000000);
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_irq(idr_spinlock);
+	ret = idr_alloc(idr, ptr, SWID_START, SWID_END, GFP_ATOMIC);
+	spin_unlock_irq(idr_spinlock);
+	idr_preload_end();
+
+	return ret;
+}
+
+static void mlx5_fpga_tls_release_swid(struct idr *idr,
+				       spinlock_t *idr_spinlock, u32 swid)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(idr_spinlock, flags);
+	idr_remove(idr, swid);
+	spin_unlock_irqrestore(idr_spinlock, flags);
+}
+
+struct mlx5_teardown_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	u32 swid;
+};
+
+static void
+mlx5_fpga_tls_teardown_completion(struct mlx5_fpga_conn *conn,
+				  struct mlx5_fpga_device *fdev,
+				  struct mlx5_fpga_tls_command_context *cmd,
+				  struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_teardown_stream_context *ctx =
+		    container_of(cmd, struct mlx5_teardown_stream_context, cmd);
+
+	if (resp) {
+		u32 syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+
+		if (syndrome)
+			mlx5_fpga_err(fdev,
+				      "Teardown stream failed with syndrome = %d",
+				      syndrome);
+		else
+			mlx5_fpga_tls_release_swid(&fdev->tls->tx_idr,
+						   &fdev->tls->idr_spinlock,
+						   ctx->swid);
+	}
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static void mlx5_fpga_tls_flow_to_cmd(void *flow, void *cmd)
+{
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, src_port), flow,
+	       MLX5_BYTE_OFF(tls_flow, ipv6));
+
+	MLX5_SET(tls_cmd, cmd, ipv6, MLX5_GET(tls_flow, flow, ipv6));
+	MLX5_SET(tls_cmd, cmd, direction_sx,
+		 MLX5_GET(tls_flow, flow, direction_sx));
+}
+
+void mlx5_fpga_tls_send_teardown_cmd(struct mlx5_core_dev *mdev, void *flow,
+				     u32 swid, gfp_t flags)
+{
+	struct mlx5_teardown_stream_context *ctx;
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd;
+
+	ctx = kzalloc(sizeof(*ctx) + MLX5_TLS_COMMAND_SIZE, flags);
+	if (!ctx)
+		return;
+
+	buf = &ctx->cmd.buf;
+	cmd = (ctx + 1);
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_TEARDOWN_STREAM);
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+	kfree(flow);
+
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+
+	ctx->swid = swid;
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_teardown_completion);
+}
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	void *flow;
+
+	rcu_read_lock();
+	flow = idr_find(&tls->tx_idr, swid);
+	rcu_read_unlock();
+
+	if (!flow) {
+		mlx5_fpga_err(mdev->fpga, "No flow information for swid %u\n",
+			      swid);
+		return;
+	}
+
+	mlx5_fpga_tls_send_teardown_cmd(mdev, flow, swid, flags);
+}
+
+enum mlx5_fpga_setup_stream_status {
+	MLX5_FPGA_CMD_PENDING,
+	MLX5_FPGA_CMD_SEND_FAILED,
+	MLX5_FPGA_CMD_RESPONSE_RECEIVED,
+	MLX5_FPGA_CMD_ABANDONED,
+};
+
+struct mlx5_setup_stream_context {
+	struct mlx5_fpga_tls_command_context cmd;
+	atomic_t status;
+	u32 syndrome;
+	struct completion comp;
+};
+
+static void
+mlx5_fpga_tls_setup_completion(struct mlx5_fpga_conn *conn,
+			       struct mlx5_fpga_device *fdev,
+			       struct mlx5_fpga_tls_command_context *cmd,
+			       struct mlx5_fpga_dma_buf *resp)
+{
+	struct mlx5_setup_stream_context *ctx =
+	    container_of(cmd, struct mlx5_setup_stream_context, cmd);
+	int status = MLX5_FPGA_CMD_SEND_FAILED;
+	void *tls_cmd = ctx + 1;
+
+	/* If we failed to send to command resp == NULL */
+	if (resp) {
+		ctx->syndrome = MLX5_GET(tls_resp, resp->sg[0].data, syndrome);
+		status = MLX5_FPGA_CMD_RESPONSE_RECEIVED;
+	}
+
+	status = atomic_xchg_release(&ctx->status, status);
+	if (likely(status != MLX5_FPGA_CMD_ABANDONED)) {
+		complete(&ctx->comp);
+		return;
+	}
+
+	mlx5_fpga_err(fdev, "Command was abandoned, syndrome = %u\n",
+		      ctx->syndrome);
+
+	if (!ctx->syndrome) {
+		/* The process was killed while waiting for the context to be
+		 * added, and the add completed successfully.
+		 * We need to destroy the HW context, and we can't can't reuse
+		 * the command context because we might not have received
+		 * the tx completion yet.
+		 */
+		mlx5_fpga_tls_del_tx_flow(fdev->mdev,
+					  MLX5_GET(tls_cmd, tls_cmd, swid),
+					  GFP_ATOMIC);
+	}
+
+	mlx5_fpga_tls_put_command_ctx(cmd);
+}
+
+static int mlx5_fpga_tls_setup_stream_cmd(struct mlx5_core_dev *mdev,
+					  struct mlx5_setup_stream_context *ctx)
+{
+	struct mlx5_fpga_dma_buf *buf;
+	void *cmd = ctx + 1;
+	int status, ret = 0;
+
+	buf = &ctx->cmd.buf;
+	buf->sg[0].data = cmd;
+	buf->sg[0].size = MLX5_TLS_COMMAND_SIZE;
+	MLX5_SET(tls_cmd, cmd, command_type, CMD_SETUP_STREAM);
+
+	init_completion(&ctx->comp);
+	atomic_set(&ctx->status, MLX5_FPGA_CMD_PENDING);
+	ctx->syndrome = -1;
+
+	mlx5_fpga_tls_cmd_send(mdev->fpga, &ctx->cmd,
+			       mlx5_fpga_tls_setup_completion);
+	wait_for_completion_killable(&ctx->comp);
+
+	status = atomic_xchg_acquire(&ctx->status, MLX5_FPGA_CMD_ABANDONED);
+	if (unlikely(status == MLX5_FPGA_CMD_PENDING))
+	/* ctx is going to be released in mlx5_fpga_tls_setup_completion */
+		return -EINTR;
+
+	if (unlikely(ctx->syndrome))
+		ret = -ENOMEM;
+
+	mlx5_fpga_tls_put_command_ctx(&ctx->cmd);
+	return ret;
+}
+
+static void mlx5_fpga_tls_hw_qp_recv_cb(void *cb_arg,
+					struct mlx5_fpga_dma_buf *buf)
+{
+	struct mlx5_fpga_device *fdev = (struct mlx5_fpga_device *)cb_arg;
+
+	mlx5_fpga_tls_cmd_complete(fdev, buf);
+}
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev)
+{
+	if (!mdev->fpga || !MLX5_CAP_GEN(mdev, fpga))
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, ieee_vendor_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_id) !=
+	    MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS)
+		return false;
+
+	if (MLX5_CAP_FPGA(mdev, sandbox_product_version) != 0)
+		return false;
+
+	return true;
+}
+
+static int mlx5_fpga_tls_get_caps(struct mlx5_fpga_device *fdev,
+				  u32 *p_caps)
+{
+	int err, cap_size = MLX5_ST_SZ_BYTES(tls_extended_cap);
+	u32 caps = 0;
+	void *buf;
+
+	buf = kzalloc(cap_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	err = mlx5_fpga_get_sbu_caps(fdev, cap_size, buf);
+	if (err)
+		goto out;
+
+	if (MLX5_GET(tls_extended_cap, buf, tx))
+		caps |= MLX5_ACCEL_TLS_TX;
+	if (MLX5_GET(tls_extended_cap, buf, rx))
+		caps |= MLX5_ACCEL_TLS_RX;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v12))
+		caps |= MLX5_ACCEL_TLS_V12;
+	if (MLX5_GET(tls_extended_cap, buf, tls_v13))
+		caps |= MLX5_ACCEL_TLS_V13;
+	if (MLX5_GET(tls_extended_cap, buf, lro))
+		caps |= MLX5_ACCEL_TLS_LRO;
+	if (MLX5_GET(tls_extended_cap, buf, ipv6))
+		caps |= MLX5_ACCEL_TLS_IPV6;
+
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_128))
+		caps |= MLX5_ACCEL_TLS_AES_GCM128;
+	if (MLX5_GET(tls_extended_cap, buf, aes_gcm_256))
+		caps |= MLX5_ACCEL_TLS_AES_GCM256;
+
+	*p_caps = caps;
+	err = 0;
+out:
+	kfree(buf);
+	return err;
+}
+
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+	struct mlx5_fpga_conn_attr init_attr = {0};
+	struct mlx5_fpga_conn *conn;
+	struct mlx5_fpga_tls *tls;
+	int err = 0;
+
+	if (!mlx5_fpga_is_tls_device(mdev) || !fdev)
+		return 0;
+
+	tls = kzalloc(sizeof(*tls), GFP_KERNEL);
+	if (!tls)
+		return -ENOMEM;
+
+	err = mlx5_fpga_tls_get_caps(fdev, &tls->caps);
+	if (err)
+		goto error;
+
+	if (!(tls->caps & (MLX5_ACCEL_TLS_TX | MLX5_ACCEL_TLS_V12 |
+				 MLX5_ACCEL_TLS_AES_GCM128))) {
+		err = -ENOTSUPP;
+		goto error;
+	}
+
+	init_attr.rx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.tx_size = SBU_QP_QUEUE_SIZE;
+	init_attr.recv_cb = mlx5_fpga_tls_hw_qp_recv_cb;
+	init_attr.cb_arg = fdev;
+	conn = mlx5_fpga_sbu_conn_create(fdev, &init_attr);
+	if (IS_ERR(conn)) {
+		err = PTR_ERR(conn);
+		mlx5_fpga_err(fdev, "Error creating TLS command connection %d\n",
+			      err);
+		goto error;
+	}
+
+	tls->conn = conn;
+	spin_lock_init(&tls->pending_cmds_lock);
+	INIT_LIST_HEAD(&tls->pending_cmds);
+
+	idr_init(&tls->tx_idr);
+	spin_lock_init(&tls->idr_spinlock);
+	fdev->tls = tls;
+	return 0;
+
+error:
+	kfree(tls);
+	return err;
+}
+
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev)
+{
+	struct mlx5_fpga_device *fdev = mdev->fpga;
+
+	if (!fdev || !fdev->tls)
+		return;
+
+	mlx5_fpga_sbu_conn_destroy(fdev->tls->conn);
+	kfree(fdev->tls);
+	fdev->tls = NULL;
+}
+
+static void mlx5_fpga_tls_set_aes_gcm128_ctx(void *cmd,
+					     struct tls_crypto_info *info,
+					     __be64 *rcd_sn)
+{
+	struct tls12_crypto_info_aes_gcm_128 *crypto_info =
+	    (struct tls12_crypto_info_aes_gcm_128 *)info;
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_rcd_sn), crypto_info->rec_seq,
+	       TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, tls_implicit_iv),
+	       crypto_info->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key),
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	/* in AES-GCM 128 we need to write the key twice */
+	memcpy(MLX5_ADDR_OF(tls_cmd, cmd, encryption_key) +
+		   TLS_CIPHER_AES_GCM_128_KEY_SIZE,
+	       crypto_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+
+	MLX5_SET(tls_cmd, cmd, alg, MLX5_TLS_ALG_AES_GCM_128);
+}
+
+static int mlx5_fpga_tls_set_key_material(void *cmd, u32 caps,
+					  struct tls_crypto_info *crypto_info)
+{
+	__be64 rcd_sn;
+
+	switch (crypto_info->cipher_type) {
+	case TLS_CIPHER_AES_GCM_128:
+		if (!(caps & MLX5_ACCEL_TLS_AES_GCM128))
+			return -EINVAL;
+		mlx5_fpga_tls_set_aes_gcm128_ctx(cmd, crypto_info, &rcd_sn);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int mlx5_fpga_tls_add_flow(struct mlx5_core_dev *mdev, void *flow,
+				  struct tls_crypto_info *crypto_info, u32 swid,
+				  u32 tcp_sn)
+{
+	u32 caps = mlx5_fpga_tls_device_caps(mdev);
+	struct mlx5_setup_stream_context *ctx;
+	int ret = -ENOMEM;
+	size_t cmd_size;
+	void *cmd;
+
+	cmd_size = MLX5_TLS_COMMAND_SIZE + sizeof(*ctx);
+	ctx = kzalloc(cmd_size, GFP_KERNEL);
+	if (!ctx)
+		goto out;
+
+	cmd = ctx + 1;
+	ret = mlx5_fpga_tls_set_key_material(cmd, caps, crypto_info);
+	if (ret)
+		goto free_ctx;
+
+	mlx5_fpga_tls_flow_to_cmd(flow, cmd);
+
+	MLX5_SET(tls_cmd, cmd, swid, swid);
+	MLX5_SET(tls_cmd, cmd, tcp_sn, tcp_sn);
+
+	return mlx5_fpga_tls_setup_stream_cmd(mdev, ctx);
+
+free_ctx:
+	kfree(ctx);
+out:
+	return ret;
+}
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid)
+{
+	struct mlx5_fpga_tls *tls = mdev->fpga->tls;
+	int ret = -ENOMEM;
+	u32 swid;
+
+	ret = mlx5_fpga_tls_alloc_swid(&tls->tx_idr, &tls->idr_spinlock, flow);
+	if (ret < 0)
+		return ret;
+
+	swid = ret;
+	MLX5_SET(tls_flow, flow, direction_sx, 1);
+
+	ret = mlx5_fpga_tls_add_flow(mdev, flow, crypto_info, swid,
+				     start_offload_tcp_sn);
+	if (ret && ret != -EINTR)
+		goto free_swid;
+
+	*p_swid = swid;
+	return 0;
+free_swid:
+	mlx5_fpga_tls_release_swid(&tls->tx_idr, &tls->idr_spinlock, swid);
+
+	return ret;
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
new file mode 100644
index 0000000..800a214
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5_FPGA_TLS_H__
+#define __MLX5_FPGA_TLS_H__
+
+#include <linux/mlx5/driver.h>
+
+#include <net/tls.h>
+#include "fpga/core.h"
+
+struct mlx5_fpga_tls {
+	struct list_head pending_cmds;
+	spinlock_t pending_cmds_lock; /* Protects pending_cmds */
+	u32 caps;
+	struct mlx5_fpga_conn *conn;
+
+	struct idr tx_idr;
+	spinlock_t idr_spinlock; /* protects the IDR */
+};
+
+int mlx5_fpga_tls_add_tx_flow(struct mlx5_core_dev *mdev, void *flow,
+			      struct tls_crypto_info *crypto_info,
+			      u32 start_offload_tcp_sn, u32 *p_swid);
+
+void mlx5_fpga_tls_del_tx_flow(struct mlx5_core_dev *mdev, u32 swid,
+			       gfp_t flags);
+
+bool mlx5_fpga_is_tls_device(struct mlx5_core_dev *mdev);
+int mlx5_fpga_tls_init(struct mlx5_core_dev *mdev);
+void mlx5_fpga_tls_cleanup(struct mlx5_core_dev *mdev);
+
+static inline u32 mlx5_fpga_tls_device_caps(struct mlx5_core_dev *mdev)
+{
+	return mdev->fpga->tls->caps;
+}
+
+#endif /* __MLX5_FPGA_TLS_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 63a8ea3..b865597 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -60,6 +60,7 @@
 #include "fpga/core.h"
 #include "fpga/ipsec.h"
 #include "accel/ipsec.h"
+#include "accel/tls.h"
 #include "lib/clock.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
@@ -1190,6 +1191,12 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_ipsec_start;
 	}
 
+	err = mlx5_accel_tls_init(dev);
+	if (err) {
+		dev_err(&pdev->dev, "TLS device start failed %d\n", err);
+		goto err_tls_start;
+	}
+
 	err = mlx5_init_fs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1231,6 +1238,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_cleanup_fs(dev);
 
 err_fs:
+	mlx5_accel_tls_cleanup(dev);
+
+err_tls_start:
 	mlx5_accel_ipsec_cleanup(dev);
 
 err_ipsec_start:
@@ -1306,6 +1316,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_sriov_detach(dev);
 	mlx5_cleanup_fs(dev);
 	mlx5_accel_ipsec_cleanup(dev);
+	mlx5_accel_tls_cleanup(dev);
 	mlx5_fpga_device_stop(dev);
 	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 1aad455..b8918a1 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -356,22 +356,6 @@ struct mlx5_ifc_odp_per_transport_service_cap_bits {
 	u8         reserved_at_6[0x1a];
 };
 
-struct mlx5_ifc_ipv4_layout_bits {
-	u8         reserved_at_0[0x60];
-
-	u8         ipv4[0x20];
-};
-
-struct mlx5_ifc_ipv6_layout_bits {
-	u8         ipv6[16][0x8];
-};
-
-union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
-	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
-	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
-	u8         reserved_at_0[0x80];
-};
-
 struct mlx5_ifc_fte_match_set_lyr_2_4_bits {
 	u8         smac_47_16[0x20];
 
diff --git a/include/linux/mlx5/mlx5_ifc_fpga.h b/include/linux/mlx5/mlx5_ifc_fpga.h
index ec05249..1930915 100644
--- a/include/linux/mlx5/mlx5_ifc_fpga.h
+++ b/include/linux/mlx5/mlx5_ifc_fpga.h
@@ -32,12 +32,29 @@
 #ifndef MLX5_IFC_FPGA_H
 #define MLX5_IFC_FPGA_H
 
+struct mlx5_ifc_ipv4_layout_bits {
+	u8         reserved_at_0[0x60];
+
+	u8         ipv4[0x20];
+};
+
+struct mlx5_ifc_ipv6_layout_bits {
+	u8         ipv6[16][0x8];
+};
+
+union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits {
+	struct mlx5_ifc_ipv6_layout_bits ipv6_layout;
+	struct mlx5_ifc_ipv4_layout_bits ipv4_layout;
+	u8         reserved_at_0[0x80];
+};
+
 enum {
 	MLX5_FPGA_CAP_SANDBOX_VENDOR_ID_MLNX = 0x2c9,
 };
 
 enum {
 	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_IPSEC    = 0x2,
+	MLX5_FPGA_CAP_SANDBOX_PRODUCT_ID_TLS      = 0x3,
 };
 
 struct mlx5_ifc_fpga_shell_caps_bits {
@@ -370,6 +387,27 @@ struct mlx5_ifc_fpga_destroy_qp_out_bits {
 	u8         reserved_at_40[0x40];
 };
 
+struct mlx5_ifc_tls_extended_cap_bits {
+	u8         aes_gcm_128[0x1];
+	u8         aes_gcm_256[0x1];
+	u8         reserved_at_2[0x1e];
+	u8         reserved_at_20[0x20];
+	u8         context_capacity_total[0x20];
+	u8         context_capacity_rx[0x20];
+	u8         context_capacity_tx[0x20];
+	u8         reserved_at_a0[0x10];
+	u8         tls_counter_size[0x10];
+	u8         tls_counters_addr_low[0x20];
+	u8         tls_counters_addr_high[0x20];
+	u8         rx[0x1];
+	u8         tx[0x1];
+	u8         tls_v12[0x1];
+	u8         tls_v13[0x1];
+	u8         lro[0x1];
+	u8         ipv6[0x1];
+	u8         reserved_at_106[0x1a];
+};
+
 struct mlx5_ifc_ipsec_extended_cap_bits {
 	u8         encapsulation[0x20];
 
@@ -519,4 +557,43 @@ struct mlx5_ifc_fpga_ipsec_sa {
 	__be16 reserved2;
 } __packed;
 
+enum fpga_tls_cmds {
+	CMD_SETUP_STREAM		= 0x1001,
+	CMD_TEARDOWN_STREAM		= 0x1002,
+};
+
+#define MLX5_TLS_1_2 (0)
+
+#define MLX5_TLS_ALG_AES_GCM_128 (0)
+#define MLX5_TLS_ALG_AES_GCM_256 (1)
+
+struct mlx5_ifc_tls_cmd_bits {
+	u8         command_type[0x20];
+	u8         ipv6[0x1];
+	u8         direction_sx[0x1];
+	u8         tls_version[0x2];
+	u8         reserved[0x1c];
+	u8         swid[0x20];
+	u8         src_port[0x10];
+	u8         dst_port[0x10];
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits src_ipv4_src_ipv6;
+	union mlx5_ifc_ipv6_layout_ipv4_layout_auto_bits dst_ipv4_dst_ipv6;
+	u8         tls_rcd_sn[0x40];
+	u8         tcp_sn[0x20];
+	u8         tls_implicit_iv[0x20];
+	u8         tls_xor_iv[0x40];
+	u8         encryption_key[0x100];
+	u8         alg[4];
+	u8         reserved2[0x1c];
+	u8         reserved3[0x4a0];
+};
+
+struct mlx5_ifc_tls_resp_bits {
+	u8         syndrome[0x20];
+	u8         stream_id[0x20];
+	u8         reserverd[0x40];
+};
+
+#define MLX5_TLS_COMMAND_SIZE (0x100)
+
 #endif /* MLX5_IFC_FPGA_H */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 06/14] net/tls: Split conf to rx + tx
From: Boris Pismenny @ 2018-04-24 13:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

In TLS inline crypto, we can have one direction in software
and another in hardware. Thus, we split the TLS configuration to separate
structures for receive and transmit.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
---
 include/net/tls.h  |  51 +++++++++++++-------
 net/tls/tls_main.c | 103 ++++++++++++++++++++--------------------
 net/tls/tls_sw.c   | 134 ++++++++++++++++++++++++++++++-----------------------
 3 files changed, 161 insertions(+), 127 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 3da8e13..95a8c60 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -83,21 +83,10 @@ struct tls_device {
 	void (*unhash)(struct tls_device *device, struct sock *sk);
 };
 
-struct tls_sw_context {
+struct tls_sw_context_tx {
 	struct crypto_aead *aead_send;
-	struct crypto_aead *aead_recv;
 	struct crypto_wait async_wait;
 
-	/* Receive context */
-	struct strparser strp;
-	void (*saved_data_ready)(struct sock *sk);
-	unsigned int (*sk_poll)(struct file *file, struct socket *sock,
-				struct poll_table_struct *wait);
-	struct sk_buff *recv_pkt;
-	u8 control;
-	bool decrypted;
-
-	/* Sending context */
 	char aad_space[TLS_AAD_SPACE_SIZE];
 
 	unsigned int sg_plaintext_size;
@@ -114,6 +103,19 @@ struct tls_sw_context {
 	struct scatterlist sg_aead_out[2];
 };
 
+struct tls_sw_context_rx {
+	struct crypto_aead *aead_recv;
+	struct crypto_wait async_wait;
+
+	struct strparser strp;
+	void (*saved_data_ready)(struct sock *sk);
+	unsigned int (*sk_poll)(struct file *file, struct socket *sock,
+				struct poll_table_struct *wait);
+	struct sk_buff *recv_pkt;
+	u8 control;
+	bool decrypted;
+};
+
 enum {
 	TLS_PENDING_CLOSED_RECORD
 };
@@ -138,9 +140,15 @@ struct tls_context {
 		struct tls12_crypto_info_aes_gcm_128 crypto_recv_aes_gcm_128;
 	};
 
-	void *priv_ctx;
+	struct list_head list;
+	struct net_device *netdev;
+	refcount_t refcount;
+
+	void *priv_ctx_tx;
+	void *priv_ctx_rx;
 
-	u8 conf:3;
+	u8 tx_conf:3;
+	u8 rx_conf:3;
 
 	struct cipher_context tx;
 	struct cipher_context rx;
@@ -177,7 +185,8 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
 void tls_sw_close(struct sock *sk, long timeout);
-void tls_sw_free_resources(struct sock *sk);
+void tls_sw_free_resources_tx(struct sock *sk);
+void tls_sw_free_resources_rx(struct sock *sk);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
 unsigned int tls_sw_poll(struct file *file, struct socket *sock,
@@ -297,16 +306,22 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk)
 	return icsk->icsk_ulp_data;
 }
 
-static inline struct tls_sw_context *tls_sw_ctx(
+static inline struct tls_sw_context_rx *tls_sw_ctx_rx(
+		const struct tls_context *tls_ctx)
+{
+	return (struct tls_sw_context_rx *)tls_ctx->priv_ctx_rx;
+}
+
+static inline struct tls_sw_context_tx *tls_sw_ctx_tx(
 		const struct tls_context *tls_ctx)
 {
-	return (struct tls_sw_context *)tls_ctx->priv_ctx;
+	return (struct tls_sw_context_tx *)tls_ctx->priv_ctx_tx;
 }
 
 static inline struct tls_offload_context *tls_offload_ctx(
 		const struct tls_context *tls_ctx)
 {
-	return (struct tls_offload_context *)tls_ctx->priv_ctx;
+	return (struct tls_offload_context *)tls_ctx->priv_ctx_tx;
 }
 
 int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 0d37997..545bf34 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -51,12 +51,9 @@ enum {
 	TLSV6,
 	TLS_NUM_PROTS,
 };
-
 enum {
 	TLS_BASE,
-	TLS_SW_TX,
-	TLS_SW_RX,
-	TLS_SW_RXTX,
+	TLS_SW,
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -65,14 +62,14 @@ enum {
 static DEFINE_MUTEX(tcpv6_prot_mutex);
 static LIST_HEAD(device_list);
 static DEFINE_MUTEX(device_mutex);
-static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG];
+static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG][TLS_NUM_CONFIG];
 static struct proto_ops tls_sw_proto_ops;
 
-static inline void update_sk_prot(struct sock *sk, struct tls_context *ctx)
+static void update_sk_prot(struct sock *sk, struct tls_context *ctx)
 {
 	int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4;
 
-	sk->sk_prot = &tls_prots[ip_ver][ctx->conf];
+	sk->sk_prot = &tls_prots[ip_ver][ctx->tx_conf][ctx->rx_conf];
 }
 
 int wait_on_pending_writer(struct sock *sk, long *timeo)
@@ -245,10 +242,10 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	lock_sock(sk);
 	sk_proto_close = ctx->sk_proto_close;
 
-	if (ctx->conf == TLS_HW_RECORD)
+	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
 		goto skip_tx_cleanup;
 
-	if (ctx->conf == TLS_BASE) {
+	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
 		kfree(ctx);
 		ctx = NULL;
 		goto skip_tx_cleanup;
@@ -270,15 +267,17 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 		}
 	}
 
-	kfree(ctx->tx.rec_seq);
-	kfree(ctx->tx.iv);
-	kfree(ctx->rx.rec_seq);
-	kfree(ctx->rx.iv);
+	/* We need these for tls_sw_fallback handling of other packets */
+	if (ctx->tx_conf == TLS_SW) {
+		kfree(ctx->tx.rec_seq);
+		kfree(ctx->tx.iv);
+		tls_sw_free_resources_tx(sk);
+	}
 
-	if (ctx->conf == TLS_SW_TX ||
-	    ctx->conf == TLS_SW_RX ||
-	    ctx->conf == TLS_SW_RXTX) {
-		tls_sw_free_resources(sk);
+	if (ctx->rx_conf == TLS_SW) {
+		kfree(ctx->rx.rec_seq);
+		kfree(ctx->rx.iv);
+		tls_sw_free_resources_rx(sk);
 	}
 
 skip_tx_cleanup:
@@ -287,7 +286,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	/* free ctx for TLS_HW_RECORD, used by tcp_set_state
 	 * for sk->sk_prot->unhash [tls_hw_unhash]
 	 */
-	if (ctx && ctx->conf == TLS_HW_RECORD)
+	if (ctx && ctx->tx_conf == TLS_HW_RECORD &&
+	    ctx->rx_conf == TLS_HW_RECORD)
 		kfree(ctx);
 }
 
@@ -441,25 +441,21 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		goto err_crypto_info;
 	}
 
-	/* currently SW is default, we will have ethtool in future */
 	if (tx) {
 		rc = tls_set_sw_offload(sk, ctx, 1);
-		if (ctx->conf == TLS_SW_RX)
-			conf = TLS_SW_RXTX;
-		else
-			conf = TLS_SW_TX;
+		conf = TLS_SW;
 	} else {
 		rc = tls_set_sw_offload(sk, ctx, 0);
-		if (ctx->conf == TLS_SW_TX)
-			conf = TLS_SW_RXTX;
-		else
-			conf = TLS_SW_RX;
+		conf = TLS_SW;
 	}
 
 	if (rc)
 		goto err_crypto_info;
 
-	ctx->conf = conf;
+	if (tx)
+		ctx->tx_conf = conf;
+	else
+		ctx->rx_conf = conf;
 	update_sk_prot(sk, ctx);
 	if (tx) {
 		ctx->sk_write_space = sk->sk_write_space;
@@ -535,7 +531,8 @@ static int tls_hw_prot(struct sock *sk)
 			ctx->hash = sk->sk_prot->hash;
 			ctx->unhash = sk->sk_prot->unhash;
 			ctx->sk_proto_close = sk->sk_prot->close;
-			ctx->conf = TLS_HW_RECORD;
+			ctx->rx_conf = TLS_HW_RECORD;
+			ctx->tx_conf = TLS_HW_RECORD;
 			update_sk_prot(sk, ctx);
 			rc = 1;
 			break;
@@ -579,29 +576,30 @@ static int tls_hw_hash(struct sock *sk)
 	return err;
 }
 
-static void build_protos(struct proto *prot, struct proto *base)
+static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
+			 struct proto *base)
 {
-	prot[TLS_BASE] = *base;
-	prot[TLS_BASE].setsockopt	= tls_setsockopt;
-	prot[TLS_BASE].getsockopt	= tls_getsockopt;
-	prot[TLS_BASE].close		= tls_sk_proto_close;
-
-	prot[TLS_SW_TX] = prot[TLS_BASE];
-	prot[TLS_SW_TX].sendmsg		= tls_sw_sendmsg;
-	prot[TLS_SW_TX].sendpage	= tls_sw_sendpage;
-
-	prot[TLS_SW_RX] = prot[TLS_BASE];
-	prot[TLS_SW_RX].recvmsg		= tls_sw_recvmsg;
-	prot[TLS_SW_RX].close		= tls_sk_proto_close;
-
-	prot[TLS_SW_RXTX] = prot[TLS_SW_TX];
-	prot[TLS_SW_RXTX].recvmsg	= tls_sw_recvmsg;
-	prot[TLS_SW_RXTX].close		= tls_sk_proto_close;
-
-	prot[TLS_HW_RECORD] = *base;
-	prot[TLS_HW_RECORD].hash	= tls_hw_hash;
-	prot[TLS_HW_RECORD].unhash	= tls_hw_unhash;
-	prot[TLS_HW_RECORD].close	= tls_sk_proto_close;
+	prot[TLS_BASE][TLS_BASE] = *base;
+	prot[TLS_BASE][TLS_BASE].setsockopt	= tls_setsockopt;
+	prot[TLS_BASE][TLS_BASE].getsockopt	= tls_getsockopt;
+	prot[TLS_BASE][TLS_BASE].close		= tls_sk_proto_close;
+
+	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
+	prot[TLS_SW][TLS_BASE].sendmsg		= tls_sw_sendmsg;
+	prot[TLS_SW][TLS_BASE].sendpage		= tls_sw_sendpage;
+
+	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
+	prot[TLS_BASE][TLS_SW].recvmsg		= tls_sw_recvmsg;
+	prot[TLS_BASE][TLS_SW].close		= tls_sk_proto_close;
+
+	prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE];
+	prot[TLS_SW][TLS_SW].recvmsg	= tls_sw_recvmsg;
+	prot[TLS_SW][TLS_SW].close	= tls_sk_proto_close;
+
+	prot[TLS_HW_RECORD][TLS_HW_RECORD] = *base;
+	prot[TLS_HW_RECORD][TLS_HW_RECORD].hash		= tls_hw_hash;
+	prot[TLS_HW_RECORD][TLS_HW_RECORD].unhash	= tls_hw_unhash;
+	prot[TLS_HW_RECORD][TLS_HW_RECORD].close	= tls_sk_proto_close;
 }
 
 static int tls_init(struct sock *sk)
@@ -643,7 +641,8 @@ static int tls_init(struct sock *sk)
 		mutex_unlock(&tcpv6_prot_mutex);
 	}
 
-	ctx->conf = TLS_BASE;
+	ctx->tx_conf = TLS_BASE;
+	ctx->rx_conf = TLS_BASE;
 	update_sk_prot(sk, ctx);
 out:
 	return rc;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 71e7959..f374cc2 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -52,7 +52,7 @@ static int tls_do_decryption(struct sock *sk,
 			     gfp_t flags)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct strp_msg *rxm = strp_msg(skb);
 	struct aead_request *aead_req;
 
@@ -122,7 +122,7 @@ static void trim_sg(struct sock *sk, struct scatterlist *sg,
 static void trim_both_sgl(struct sock *sk, int target_size)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 
 	trim_sg(sk, ctx->sg_plaintext_data,
 		&ctx->sg_plaintext_num_elem,
@@ -141,7 +141,7 @@ static void trim_both_sgl(struct sock *sk, int target_size)
 static int alloc_encrypted_sg(struct sock *sk, int len)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	int rc = 0;
 
 	rc = sk_alloc_sg(sk, len,
@@ -155,7 +155,7 @@ static int alloc_encrypted_sg(struct sock *sk, int len)
 static int alloc_plaintext_sg(struct sock *sk, int len)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	int rc = 0;
 
 	rc = sk_alloc_sg(sk, len, ctx->sg_plaintext_data, 0,
@@ -181,7 +181,7 @@ static void free_sg(struct sock *sk, struct scatterlist *sg,
 static void tls_free_both_sg(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 
 	free_sg(sk, ctx->sg_encrypted_data, &ctx->sg_encrypted_num_elem,
 		&ctx->sg_encrypted_size);
@@ -191,7 +191,7 @@ static void tls_free_both_sg(struct sock *sk)
 }
 
 static int tls_do_encryption(struct tls_context *tls_ctx,
-			     struct tls_sw_context *ctx, size_t data_len,
+			     struct tls_sw_context_tx *ctx, size_t data_len,
 			     gfp_t flags)
 {
 	unsigned int req_size = sizeof(struct aead_request) +
@@ -227,7 +227,7 @@ static int tls_push_record(struct sock *sk, int flags,
 			   unsigned char record_type)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	int rc;
 
 	sg_mark_end(ctx->sg_plaintext_data + ctx->sg_plaintext_num_elem - 1);
@@ -339,7 +339,7 @@ static int memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     int bytes)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	struct scatterlist *sg = ctx->sg_plaintext_data;
 	int copy, i, rc = 0;
 
@@ -367,7 +367,7 @@ static int memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	int ret = 0;
 	int required_size;
 	long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
@@ -522,7 +522,7 @@ int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 	int ret = 0;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
 	bool eor;
@@ -636,7 +636,7 @@ static struct sk_buff *tls_wait_data(struct sock *sk, int flags,
 				     long timeo, int *err)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct sk_buff *skb;
 	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
@@ -674,7 +674,7 @@ static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 		       struct scatterlist *sgout)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	char iv[TLS_CIPHER_AES_GCM_128_SALT_SIZE + MAX_IV_SIZE];
 	struct scatterlist sgin_arr[MAX_SKB_FRAGS + 2];
 	struct scatterlist *sgin = &sgin_arr[0];
@@ -724,7 +724,7 @@ static bool tls_sw_advance_skb(struct sock *sk, struct sk_buff *skb,
 			       unsigned int len)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct strp_msg *rxm = strp_msg(skb);
 
 	if (len < rxm->full_len) {
@@ -750,7 +750,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		   int *addr_len)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	unsigned char control;
 	struct strp_msg *rxm;
 	struct sk_buff *skb;
@@ -870,7 +870,7 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 			   size_t len, unsigned int flags)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sock->sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct strp_msg *rxm = NULL;
 	struct sock *sk = sock->sk;
 	struct sk_buff *skb;
@@ -923,7 +923,7 @@ unsigned int tls_sw_poll(struct file *file, struct socket *sock,
 	unsigned int ret;
 	struct sock *sk = sock->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
 	/* Grab POLLOUT and POLLHUP from the underlying socket */
 	ret = ctx->sk_poll(file, sock, wait);
@@ -939,7 +939,7 @@ unsigned int tls_sw_poll(struct file *file, struct socket *sock,
 static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(strp->sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	char header[tls_ctx->rx.prepend_size];
 	struct strp_msg *rxm = strp_msg(skb);
 	size_t cipher_overhead;
@@ -988,7 +988,7 @@ static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 static void tls_queue(struct strparser *strp, struct sk_buff *skb)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(strp->sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct strp_msg *rxm;
 
 	rxm = strp_msg(skb);
@@ -1004,18 +1004,28 @@ static void tls_queue(struct strparser *strp, struct sk_buff *skb)
 static void tls_data_ready(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
 	strp_data_ready(&ctx->strp);
 }
 
-void tls_sw_free_resources(struct sock *sk)
+void tls_sw_free_resources_tx(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
 
 	if (ctx->aead_send)
 		crypto_free_aead(ctx->aead_send);
+	tls_free_both_sg(sk);
+
+	kfree(ctx);
+}
+
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+
 	if (ctx->aead_recv) {
 		if (ctx->recv_pkt) {
 			kfree_skb(ctx->recv_pkt);
@@ -1031,10 +1041,7 @@ void tls_sw_free_resources(struct sock *sk)
 		lock_sock(sk);
 	}
 
-	tls_free_both_sg(sk);
-
 	kfree(ctx);
-	kfree(tls_ctx);
 }
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
@@ -1042,7 +1049,8 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 	char keyval[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
 	struct tls_crypto_info *crypto_info;
 	struct tls12_crypto_info_aes_gcm_128 *gcm_128_info;
-	struct tls_sw_context *sw_ctx;
+	struct tls_sw_context_tx *sw_ctx_tx;
+	struct tls_sw_context_rx *sw_ctx_rx;
 	struct cipher_context *cctx;
 	struct crypto_aead **aead;
 	struct strp_callbacks cb;
@@ -1055,27 +1063,32 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		goto out;
 	}
 
-	if (!ctx->priv_ctx) {
-		sw_ctx = kzalloc(sizeof(*sw_ctx), GFP_KERNEL);
-		if (!sw_ctx) {
+	if (tx) {
+		sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL);
+		if (!sw_ctx_tx) {
 			rc = -ENOMEM;
 			goto out;
 		}
-		crypto_init_wait(&sw_ctx->async_wait);
+		crypto_init_wait(&sw_ctx_tx->async_wait);
+		ctx->priv_ctx_tx = sw_ctx_tx;
 	} else {
-		sw_ctx = ctx->priv_ctx;
+		sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL);
+		if (!sw_ctx_rx) {
+			rc = -ENOMEM;
+			goto out;
+		}
+		crypto_init_wait(&sw_ctx_rx->async_wait);
+		ctx->priv_ctx_rx = sw_ctx_rx;
 	}
 
-	ctx->priv_ctx = (struct tls_offload_context *)sw_ctx;
-
 	if (tx) {
 		crypto_info = &ctx->crypto_send;
 		cctx = &ctx->tx;
-		aead = &sw_ctx->aead_send;
+		aead = &sw_ctx_tx->aead_send;
 	} else {
 		crypto_info = &ctx->crypto_recv;
 		cctx = &ctx->rx;
-		aead = &sw_ctx->aead_recv;
+		aead = &sw_ctx_rx->aead_recv;
 	}
 
 	switch (crypto_info->cipher_type) {
@@ -1123,21 +1136,23 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 	memcpy(cctx->rec_seq, rec_seq, rec_seq_size);
 
 	if (tx) {
-		sg_init_table(sw_ctx->sg_encrypted_data,
-			      ARRAY_SIZE(sw_ctx->sg_encrypted_data));
-		sg_init_table(sw_ctx->sg_plaintext_data,
-			      ARRAY_SIZE(sw_ctx->sg_plaintext_data));
-
-		sg_init_table(sw_ctx->sg_aead_in, 2);
-		sg_set_buf(&sw_ctx->sg_aead_in[0], sw_ctx->aad_space,
-			   sizeof(sw_ctx->aad_space));
-		sg_unmark_end(&sw_ctx->sg_aead_in[1]);
-		sg_chain(sw_ctx->sg_aead_in, 2, sw_ctx->sg_plaintext_data);
-		sg_init_table(sw_ctx->sg_aead_out, 2);
-		sg_set_buf(&sw_ctx->sg_aead_out[0], sw_ctx->aad_space,
-			   sizeof(sw_ctx->aad_space));
-		sg_unmark_end(&sw_ctx->sg_aead_out[1]);
-		sg_chain(sw_ctx->sg_aead_out, 2, sw_ctx->sg_encrypted_data);
+		sg_init_table(sw_ctx_tx->sg_encrypted_data,
+			      ARRAY_SIZE(sw_ctx_tx->sg_encrypted_data));
+		sg_init_table(sw_ctx_tx->sg_plaintext_data,
+			      ARRAY_SIZE(sw_ctx_tx->sg_plaintext_data));
+
+		sg_init_table(sw_ctx_tx->sg_aead_in, 2);
+		sg_set_buf(&sw_ctx_tx->sg_aead_in[0], sw_ctx_tx->aad_space,
+			   sizeof(sw_ctx_tx->aad_space));
+		sg_unmark_end(&sw_ctx_tx->sg_aead_in[1]);
+		sg_chain(sw_ctx_tx->sg_aead_in, 2,
+			 sw_ctx_tx->sg_plaintext_data);
+		sg_init_table(sw_ctx_tx->sg_aead_out, 2);
+		sg_set_buf(&sw_ctx_tx->sg_aead_out[0], sw_ctx_tx->aad_space,
+			   sizeof(sw_ctx_tx->aad_space));
+		sg_unmark_end(&sw_ctx_tx->sg_aead_out[1]);
+		sg_chain(sw_ctx_tx->sg_aead_out, 2,
+			 sw_ctx_tx->sg_encrypted_data);
 	}
 
 	if (!*aead) {
@@ -1168,16 +1183,16 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		cb.rcv_msg = tls_queue;
 		cb.parse_msg = tls_read_size;
 
-		strp_init(&sw_ctx->strp, sk, &cb);
+		strp_init(&sw_ctx_rx->strp, sk, &cb);
 
 		write_lock_bh(&sk->sk_callback_lock);
-		sw_ctx->saved_data_ready = sk->sk_data_ready;
+		sw_ctx_rx->saved_data_ready = sk->sk_data_ready;
 		sk->sk_data_ready = tls_data_ready;
 		write_unlock_bh(&sk->sk_callback_lock);
 
-		sw_ctx->sk_poll = sk->sk_socket->ops->poll;
+		sw_ctx_rx->sk_poll = sk->sk_socket->ops->poll;
 
-		strp_check_rcv(&sw_ctx->strp);
+		strp_check_rcv(&sw_ctx_rx->strp);
 	}
 
 	goto out;
@@ -1189,11 +1204,16 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 	kfree(cctx->rec_seq);
 	cctx->rec_seq = NULL;
 free_iv:
-	kfree(ctx->tx.iv);
-	ctx->tx.iv = NULL;
+	kfree(cctx->iv);
+	cctx->iv = NULL;
 free_priv:
-	kfree(ctx->priv_ctx);
-	ctx->priv_ctx = NULL;
+	if (tx) {
+		kfree(ctx->priv_ctx_tx);
+		ctx->priv_ctx_tx = NULL;
+	} else {
+		kfree(ctx->priv_ctx_rx);
+		ctx->priv_ctx_rx = NULL;
+	}
 out:
 	return rc;
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH V7 net-next 11/14] net/mlx5e: TLS, Add Innova TLS TX offload data path
From: Boris Pismenny @ 2018-04-24 13:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, saeedm, borisp, davejwatson, ktkhai, Ilya Lesokhin
In-Reply-To: <1524575585-49541-1-git-send-email-borisp@mellanox.com>

From: Ilya Lesokhin <ilyal@mellanox.com>

Implement the TLS tx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.

Special metadata ethertype is used to pass information to
the hardware.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  15 ++
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  72 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_accel/tls.c |   2 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         | 272 +++++++++++++++++++++
 .../mellanox/mlx5/core/en_accel/tls_rxtx.h         |  50 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  10 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  37 +--
 10 files changed, 455 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index 50872ed..ec785f5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -28,6 +28,6 @@ mlx5_core-$(CONFIG_MLX5_CORE_IPOIB) += ipoib/ipoib.o ipoib/ethtool.o ipoib/ipoib
 mlx5_core-$(CONFIG_MLX5_EN_IPSEC) += en_accel/ipsec.o en_accel/ipsec_rxtx.o \
 		en_accel/ipsec_stats.o
 
-mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o
+mlx5_core-$(CONFIG_MLX5_EN_TLS) +=  en_accel/tls.o en_accel/tls_rxtx.o
 
 CFLAGS_tracepoint.o := -I$(src)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b09625f..d72145f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -333,6 +333,7 @@ enum {
 	MLX5E_SQ_STATE_ENABLED,
 	MLX5E_SQ_STATE_RECOVERING,
 	MLX5E_SQ_STATE_IPSEC,
+	MLX5E_SQ_STATE_TLS,
 };
 
 struct mlx5e_sq_wqe_info {
@@ -827,6 +828,8 @@ struct mlx5e_profile {
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi);
 
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
@@ -942,6 +945,18 @@ static inline bool mlx5e_tunnel_inner_ft_supported(struct mlx5_core_dev *mdev)
 		MLX5_CAP_FLOWTABLE_NIC_RX(mdev, ft_field_support.inner_ip_version));
 }
 
+static inline void mlx5e_sq_fetch_wqe(struct mlx5e_txqsq *sq,
+				      struct mlx5e_tx_wqe **wqe,
+				      u16 *pi)
+{
+	struct mlx5_wq_cyc *wq;
+
+	wq = &sq->wq;
+	*pi = sq->pc & wq->sz_m1;
+	*wqe = mlx5_wq_cyc_get_wqe(wq, *pi);
+	memset(*wqe, 0, sizeof(**wqe));
+}
+
 static inline
 struct mlx5e_tx_wqe *mlx5e_post_nop(struct mlx5_wq_cyc *wq, u32 sqn, u16 *pc)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
new file mode 100644
index 0000000..68fcb40a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_EN_ACCEL_H__
+#define __MLX5E_EN_ACCEL_H__
+
+#ifdef CONFIG_MLX5_ACCEL
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/tls_rxtx.h"
+#include "en.h"
+
+static inline struct sk_buff *mlx5e_accel_handle_tx(struct sk_buff *skb,
+						    struct mlx5e_txqsq *sq,
+						    struct net_device *dev,
+						    struct mlx5e_tx_wqe **wqe,
+						    u16 *pi)
+{
+#ifdef CONFIG_MLX5_EN_TLS
+	if (sq->state & BIT(MLX5E_SQ_STATE_TLS)) {
+		skb = mlx5e_tls_handle_tx_skb(dev, sq, skb, wqe, pi);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+#ifdef CONFIG_MLX5_EN_IPSEC
+	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
+		skb = mlx5e_ipsec_handle_tx_skb(dev, *wqe, skb);
+		if (unlikely(!skb))
+			return NULL;
+	}
+#endif
+
+	return skb;
+}
+
+#endif /* CONFIG_MLX5_ACCEL */
+
+#endif /* __MLX5E_EN_ACCEL_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
index 38d8810..aa6981c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls.c
@@ -169,5 +169,7 @@ void mlx5e_tls_build_netdev(struct mlx5e_priv *priv)
 	if (!mlx5_accel_is_tls_device(priv->mdev))
 		return;
 
+	netdev->features |= NETIF_F_HW_TLS_TX;
+	netdev->hw_features |= NETIF_F_HW_TLS_TX;
 	netdev->tlsdev_ops = &mlx5e_tls_ops;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
new file mode 100644
index 0000000..49e8d45
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -0,0 +1,272 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include "en_accel/tls.h"
+#include "en_accel/tls_rxtx.h"
+
+#define SYNDROME_OFFLOAD_REQUIRED 32
+#define SYNDROME_SYNC 33
+
+struct sync_info {
+	u64 rcd_sn;
+	s32 sync_len;
+	int nr_frags;
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+struct mlx5e_tls_metadata {
+	/* One byte of syndrome followed by 3 bytes of swid */
+	__be32 syndrome_swid;
+	__be16 first_seq;
+	/* packet type ID field	*/
+	__be16 ethertype;
+} __packed;
+
+static int mlx5e_tls_add_metadata(struct sk_buff *skb, __be32 swid)
+{
+	struct mlx5e_tls_metadata *pet;
+	struct ethhdr *eth;
+
+	if (skb_cow_head(skb, sizeof(struct mlx5e_tls_metadata)))
+		return -ENOMEM;
+
+	eth = (struct ethhdr *)skb_push(skb, sizeof(struct mlx5e_tls_metadata));
+	skb->mac_header -= sizeof(struct mlx5e_tls_metadata);
+	pet = (struct mlx5e_tls_metadata *)(eth + 1);
+
+	memmove(skb->data, skb->data + sizeof(struct mlx5e_tls_metadata),
+		2 * ETH_ALEN);
+
+	eth->h_proto = cpu_to_be16(MLX5E_METADATA_ETHER_TYPE);
+	pet->syndrome_swid = htonl(SYNDROME_OFFLOAD_REQUIRED << 24) | swid;
+
+	return 0;
+}
+
+static int mlx5e_tls_get_sync_data(struct mlx5e_tls_offload_context *context,
+				   u32 tcp_seq, struct sync_info *info)
+{
+	int remaining, i = 0, ret = -EINVAL;
+	struct tls_record_info *record;
+	unsigned long flags;
+	s32 sync_size;
+
+	spin_lock_irqsave(&context->base.lock, flags);
+	record = tls_get_record(&context->base, tcp_seq, &info->rcd_sn);
+
+	if (unlikely(!record))
+		goto out;
+
+	sync_size = tcp_seq - tls_record_start_seq(record);
+	info->sync_len = sync_size;
+	if (unlikely(sync_size < 0)) {
+		if (tls_record_is_start_marker(record))
+			goto done;
+
+		goto out;
+	}
+
+	remaining = sync_size;
+	while (remaining > 0) {
+		info->frags[i] = record->frags[i];
+		__skb_frag_ref(&info->frags[i]);
+		remaining -= skb_frag_size(&info->frags[i]);
+
+		if (remaining < 0)
+			skb_frag_size_add(&info->frags[i], remaining);
+
+		i++;
+	}
+	info->nr_frags = i;
+done:
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&context->base.lock, flags);
+	return ret;
+}
+
+static void mlx5e_tls_complete_sync_skb(struct sk_buff *skb,
+					struct sk_buff *nskb, u32 tcp_seq,
+					int headln, __be64 rcd_sn)
+{
+	struct mlx5e_tls_metadata *pet;
+	u8 syndrome = SYNDROME_SYNC;
+	struct iphdr *iph;
+	struct tcphdr *th;
+	int data_len, mss;
+
+	nskb->dev = skb->dev;
+	skb_reset_mac_header(nskb);
+	skb_set_network_header(nskb, skb_network_offset(skb));
+	skb_set_transport_header(nskb, skb_transport_offset(skb));
+	memcpy(nskb->data, skb->data, headln);
+	memcpy(nskb->data + headln, &rcd_sn, sizeof(rcd_sn));
+
+	iph = ip_hdr(nskb);
+	iph->tot_len = htons(nskb->len - skb_network_offset(nskb));
+	th = tcp_hdr(nskb);
+	data_len = nskb->len - headln;
+	tcp_seq -= data_len;
+	th->seq = htonl(tcp_seq);
+
+	mss = nskb->dev->mtu - (headln - skb_network_offset(nskb));
+	skb_shinfo(nskb)->gso_size = 0;
+	if (data_len > mss) {
+		skb_shinfo(nskb)->gso_size = mss;
+		skb_shinfo(nskb)->gso_segs = DIV_ROUND_UP(data_len, mss);
+	}
+	skb_shinfo(nskb)->gso_type = skb_shinfo(skb)->gso_type;
+
+	pet = (struct mlx5e_tls_metadata *)(nskb->data + sizeof(struct ethhdr));
+	memcpy(pet, &syndrome, sizeof(syndrome));
+	pet->first_seq = htons(tcp_seq);
+
+	/* MLX5 devices don't care about the checksum partial start, offset
+	 * and pseudo header
+	 */
+	nskb->ip_summed = CHECKSUM_PARTIAL;
+
+	nskb->xmit_more = 1;
+	nskb->queue_mapping = skb->queue_mapping;
+}
+
+static struct sk_buff *
+mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context *context,
+		     struct mlx5e_txqsq *sq, struct sk_buff *skb,
+		     struct mlx5e_tx_wqe **wqe,
+		     u16 *pi)
+{
+	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
+	struct sync_info info;
+	struct sk_buff *nskb;
+	int linear_len = 0;
+	int headln;
+	int i;
+
+	sq->stats.tls_ooo++;
+
+	if (mlx5e_tls_get_sync_data(context, tcp_seq, &info))
+		/* We might get here if a retransmission reaches the driver
+		 * after the relevant record is acked.
+		 * It should be safe to drop the packet in this case
+		 */
+		goto err_out;
+
+	if (unlikely(info.sync_len < 0)) {
+		u32 payload;
+
+		headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+		payload = skb->len - headln;
+		if (likely(payload <= -info.sync_len))
+			/* SKB payload doesn't require offload
+			 */
+			return skb;
+
+		netdev_err(skb->dev,
+			   "Can't offload from the middle of an SKB [seq: %X, offload_seq: %X, end_seq: %X]\n",
+			   tcp_seq, tcp_seq + payload + info.sync_len,
+			   tcp_seq + payload);
+		goto err_out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid)))
+		goto err_out;
+
+	headln = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	linear_len += headln + sizeof(info.rcd_sn);
+	nskb = alloc_skb(linear_len, GFP_ATOMIC);
+	if (unlikely(!nskb))
+		goto err_out;
+
+	context->expected_seq = tcp_seq + skb->len - headln;
+	skb_put(nskb, linear_len);
+	for (i = 0; i < info.nr_frags; i++)
+		skb_shinfo(nskb)->frags[i] = info.frags[i];
+
+	skb_shinfo(nskb)->nr_frags = info.nr_frags;
+	nskb->data_len = info.sync_len;
+	nskb->len += info.sync_len;
+	sq->stats.tls_resync_bytes += nskb->len;
+	mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
+				    cpu_to_be64(info.rcd_sn));
+	mlx5e_sq_xmit(sq, nskb, *wqe, *pi);
+	mlx5e_sq_fetch_wqe(sq, wqe, pi);
+	return skb;
+
+err_out:
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi)
+{
+	struct mlx5e_tls_offload_context *context;
+	struct tls_context *tls_ctx;
+	u32 expected_seq;
+	int datalen;
+	u32 skb_seq;
+
+	if (!skb->sk || !tls_is_sk_tx_device_offloaded(skb->sk))
+		goto out;
+
+	datalen = skb->len - (skb_transport_offset(skb) + tcp_hdrlen(skb));
+	if (!datalen)
+		goto out;
+
+	tls_ctx = tls_get_ctx(skb->sk);
+	if (unlikely(tls_ctx->netdev != netdev))
+		goto out;
+
+	skb_seq = ntohl(tcp_hdr(skb)->seq);
+	context = mlx5e_get_tls_tx_context(tls_ctx);
+	expected_seq = context->expected_seq;
+
+	if (unlikely(expected_seq != skb_seq)) {
+		skb = mlx5e_tls_handle_ooo(context, sq, skb, wqe, pi);
+		goto out;
+	}
+
+	if (unlikely(mlx5e_tls_add_metadata(skb, context->swid))) {
+		dev_kfree_skb_any(skb);
+		skb = NULL;
+		goto out;
+	}
+
+	context->expected_seq = skb_seq + datalen;
+out:
+	return skb;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
new file mode 100644
index 0000000..405dfd3
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2018 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef __MLX5E_TLS_RXTX_H__
+#define __MLX5E_TLS_RXTX_H__
+
+#ifdef CONFIG_MLX5_EN_TLS
+
+#include <linux/skbuff.h>
+#include "en.h"
+
+struct sk_buff *mlx5e_tls_handle_tx_skb(struct net_device *netdev,
+					struct mlx5e_txqsq *sq,
+					struct sk_buff *skb,
+					struct mlx5e_tx_wqe **wqe,
+					u16 *pi);
+
+#endif /* CONFIG_MLX5_EN_TLS */
+
+#endif /* __MLX5E_TLS_RXTX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a6aa18..b1a1bc8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1016,6 +1016,8 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 	INIT_WORK(&sq->recover.recover_work, mlx5e_sq_recover);
 	if (MLX5_IPSEC_DEV(c->priv->mdev))
 		set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
+	if (mlx5_accel_is_tls_device(c->priv->mdev))
+		set_bit(MLX5E_SQ_STATE_TLS, &sq->state);
 
 	param->wq.db_numa_node = cpu_to_node(c->cpu);
 	err = mlx5_wq_cyc_create(mdev, &param->wq, sqc_wq, &sq->wq, &sq->wq_ctrl);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index b08c944..c04cf2b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -43,6 +43,12 @@
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_added_vlan_packets) },
+
+#ifdef CONFIG_MLX5_EN_TLS
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_ooo) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_resync_bytes) },
+#endif
+
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_lro_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_removed_vlan_packets) },
@@ -161,6 +167,10 @@ static void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
 			s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
 			s->tx_csum_none		+= sq_stats->csum_none;
 			s->tx_csum_partial	+= sq_stats->csum_partial;
+#ifdef CONFIG_MLX5_EN_TLS
+			s->tx_tls_ooo		+= sq_stats->tls_ooo;
+			s->tx_tls_resync_bytes	+= sq_stats->tls_resync_bytes;
+#endif
 		}
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 53111a2..a36e6a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -93,6 +93,11 @@ struct mlx5e_sw_stats {
 	u64 rx_cache_waive;
 	u64 ch_eq_rearm;
 
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tx_tls_ooo;
+	u64 tx_tls_resync_bytes;
+#endif
+
 	/* Special handling counters */
 	u64 link_down_events_phy;
 };
@@ -194,6 +199,10 @@ struct mlx5e_sq_stats {
 	u64 csum_partial_inner;
 	u64 added_vlan_packets;
 	u64 nop;
+#ifdef CONFIG_MLX5_EN_TLS
+	u64 tls_ooo;
+	u64 tls_resync_bytes;
+#endif
 	/* less likely accessed in data path */
 	u64 csum_none;
 	u64 stopped;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2029710..486b6bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -35,12 +35,21 @@
 #include <net/dsfield.h>
 #include "en.h"
 #include "ipoib/ipoib.h"
-#include "en_accel/ipsec_rxtx.h"
+#include "en_accel/en_accel.h"
 #include "lib/clock.h"
 
 #define MLX5E_SQ_NOPS_ROOM  MLX5_SEND_WQE_MAX_WQEBBS
+
+#ifndef CONFIG_MLX5_EN_TLS
 #define MLX5E_SQ_STOP_ROOM (MLX5_SEND_WQE_MAX_WQEBBS +\
 			    MLX5E_SQ_NOPS_ROOM)
+#else
+/* TLS offload requires MLX5E_SQ_STOP_ROOM to have
+ * enough room for a resync SKB, a normal SKB and a NOP
+ */
+#define MLX5E_SQ_STOP_ROOM (2 * MLX5_SEND_WQE_MAX_WQEBBS +\
+			    MLX5E_SQ_NOPS_ROOM)
+#endif
 
 static inline void mlx5e_tx_dma_unmap(struct device *pdev,
 				      struct mlx5e_sq_dma *dma)
@@ -325,8 +334,8 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs,
 	}
 }
 
-static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-				 struct mlx5e_tx_wqe *wqe, u16 pi)
+netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			  struct mlx5e_tx_wqe *wqe, u16 pi)
 {
 	struct mlx5e_tx_wqe_info *wi   = &sq->db.wqe_info[pi];
 
@@ -399,21 +408,19 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
-	struct mlx5e_txqsq *sq = priv->txq2sq[skb_get_queue_mapping(skb)];
-	struct mlx5_wq_cyc *wq = &sq->wq;
-	u16 pi = sq->pc & wq->sz_m1;
-	struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+	struct mlx5e_tx_wqe *wqe;
+	struct mlx5e_txqsq *sq;
+	u16 pi;
 
-	memset(wqe, 0, sizeof(*wqe));
+	sq = priv->txq2sq[skb_get_queue_mapping(skb)];
+	mlx5e_sq_fetch_wqe(sq, &wqe, &pi);
 
-#ifdef CONFIG_MLX5_EN_IPSEC
-	if (sq->state & BIT(MLX5E_SQ_STATE_IPSEC)) {
-		skb = mlx5e_ipsec_handle_tx_skb(dev, wqe, skb);
-		if (unlikely(!skb))
-			return NETDEV_TX_OK;
-	}
+#ifdef CONFIG_MLX5_ACCEL
+	/* might send skbs and update wqe and pi */
+	skb = mlx5e_accel_handle_tx(skb, sq, dev, &wqe, &pi);
+	if (unlikely(!skb))
+		return NETDEV_TX_OK;
 #endif
-
 	return mlx5e_sq_xmit(sq, skb, wqe, pi);
 }
 
-- 
1.8.3.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox