Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH bpf-next 0/2] tools: bpftool: work with frozen maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This is a simple set to add support for BPF map freezing to bpftool. First
patch makes bpftool indicate if a map is frozen when listing BPF maps.
Second patch adds a command to freeze a map loaded on the system.

Quentin Monnet (2):
  tools: bpftool: show frozen status for maps
  tools: bpftool: add "bpftool map freeze" subcommand

 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 64 +++++++++++++++++--
 3 files changed, 71 insertions(+), 6 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-21  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190819111546.35a8ed76@cakuba.netronome.com>

On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote:

I'm on vacation and replying slowly. Sorry for any inconvenience.

> On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
>> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
>>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>>>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>>>> There's a certain allure in bringing the in-kernel BPF translation
>>>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>>>> it does seem like a task best handed in user space. bpfilter can replace
>>>>> iptables completely, here we're looking at an acceleration relatively
>>>>> loosely coupled with flower.
>>>>
>>>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>>>> is not so easy.
>>>>
>>>> Think about recent multi-mask support in flower. Previously userspace could
>>>> assume there is one mask and hash table for each preference in TC. After the
>>>> change TC accepts different masks with the same pref. Such a change tends to
>>>> break userspace emulation. It may ignore masks passed from flow insertion
>>>> and use the mask remembered when the first flow of the pref is inserted. It
>>>> may override the mask of all existing flows with the pref. It may fail to
>>>> insert such flows. Any of them would result in unexpected wrong datapath
>>>> handling which is critical.
>>>> I think such an emulation layer needs to be updated in sync with TC.
>>>
>>> Oh, so you're saying that if xdp_flow is merged all patches to
>>> cls_flower and netfilter which affect flow offload will be required
>>> to update xdp_flow as well?
>>
>> Hmm... you are saying that we are allowed to break other in-kernel
>> subsystem by some change? Sounds strange...
> 
> No I'm not saying that, please don't put words in my mouth.

If we ignore xdp_flow when modifying something which affects flow 
offload, that may cause breakage. I showed such an example using 
multi-mask support. So I just wondered what you mean and guessed you 
think we can break other subsystem in some situation.

I admit I should not have used the wording "you are saying...?". If it 
was not unpleasant to you I'm sorry about that. But I think you should 
not use it as well. I did not say "cls_flower and netfilter which affect 
flow offload will be required to update xdp_flow". I guess most patches 
which affect flow offload core will not break xdp_flow. In some cases 
breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
> 
> Having an implementation nor support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a concious decision here, hence the clarifying question.

As I described above, breakage can happen in some case, and if the patch 
breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
xdp_flow does not support newly added features but it works for existing 
ones, it is OK. In the first place not all features can be offloaded to 
xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
> 
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is 
that this needs to emulate kernel behavior, and it is difficult.
I don't think userspace-defined datapath itself is not reliable, nor 
eBPF ecosystem.

Toshiaki Makita

^ permalink raw reply

* [PATCH iproute2-next v2] tc: add dualpi2 scheduler module
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:39 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, davem@davemloft.net,
	netdev@vger.kernel.org, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen

From: Olga Albisser <olga@albisser.org>

DualPI2 AQM is a combination of the DualQ Coupled-AQM with a PI2
base-AQM, able to control scalable congestion controls like DCTCP
and TCP-Prague, implemented as a Linux qdisc.

This patch adds support to tc to configure it through its netlink
interface.

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Oliver Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v1 -> v2
      - Update against revised Netlink defines
      - Aligned bound checks with kernel ones
      - Simplified set of parameters

 include/uapi/linux/pkt_sched.h |  33 +++
 include/utils.h                |   8 +
 man/man8/tc-dualpi2.8          | 203 +++++++++++++++
 tc/Makefile                    |   1 +
 tc/q_dualpi2.c                 | 439 +++++++++++++++++++++++++++++++++
 5 files changed, 684 insertions(+)
 create mode 100644 man/man8/tc-dualpi2.8
 create mode 100644 tc/q_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f18529..5f31ba46 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;             /* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;             /* maximum queue size */
+	__u32 ecn_mark;         /* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/include/utils.h b/include/utils.h
index 794d3605..0c8f43e8 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -279,6 +279,14 @@ unsigned int print_name_and_link(const char *fmt,
 	_min1 < _min2 ? _min1 : _min2; })
 #endif
 
+#ifndef max
+# define max(x, y) ({			\
+	typeof(x) _max1 = (x);		\
+	typeof(y) _max2 = (y);		\
+	(void) (&_max1 == &_max2);	\
+	_max1 > _max2 ? _max1 : _max2; })
+#endif
+
 #ifndef __check_format_string
 # define __check_format_string(pos_str, pos_args) \
 	__attribute__ ((format (printf, (pos_str), (pos_args))))
diff --git a/man/man8/tc-dualpi2.8 b/man/man8/tc-dualpi2.8
new file mode 100644
index 00000000..d325ccda
--- /dev/null
+++ b/man/man8/tc-dualpi2.8
@@ -0,0 +1,203 @@
+.TH DUALPI2 8 "13 December 2018" "iproute2" "Linux"
+
+.SH NAME
+DUALPI2 \- Dual Queue Proportional Integral Controller AQM - Improved with a square
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.BR tc " " qdisc " ... " dualpi2
+.br
+.RB "[ " limit
+.IR PACKETS " ]"
+.br
+.RB "[ " coupling_factor
+.IR NUMBER " ]"
+.br
+.RB "[ " step_thresh
+.IR TIME | PACKETS " ]"
+.br
+.RB "[ " drop_on_overload " | " overflow " ]"
+.br
+.RB "[ " drop_enqueue " | " drop_dequeue " ]"
+.br
+.RB "[ " l4s_ect " | " any_ect " ]"
+.br
+.RB "[ " classic_protection
+.IR PERCENTAGE " ] "
+.br
+.RB "[ " max_rtt
+.IR TIME 
+.RB " [ " typical_rtt 
+.IR TIME " ]] "
+.br
+.RB "[ " target
+.IR TIME " ]"
+.br
+.RB "[ " tupdate
+.IR TIME " ]"
+.br
+.RB "[ " alpha
+.IR float " ]"
+.br
+.RB "[ " beta
+.IR float " ] "
+
+.SH DESCRIPTION
+DUALPI2 AQM is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM. The PI2 AQM (details can be found in the paper cited below) is in turn both an extension and a simplification of the PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being able to control scalable congestion controls like DCTCP and TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with DCTCP, maintaining window fairness. DUALQ provides latency separation between low latency DCTCP flows and Reno/Cubic flows that need a bigger queue. The main design goals are:
+.PD 0
+.IP \(bu 4
+L4S - Low Loss, Low Latency and Scalable congestion control support
+.IP \(bu 4
+DualQ option to separate the L4S traffic in a low latency queue, without harming remaining traffic that is scheduled in classic queue due to congestion-coupling
+.IP \(bu 4
+Configurable overload strategies
+.IP \(bu 4
+Use of sojourn time to reliably estimate queue delay
+.IP \(bu 4
+Simple implementation
+.IP \(bu 4
+Guaranteed stability and fast responsiveness
+.PD
+
+.SH ALGORITHM
+DUALPI2 is designed to provide low loss and low latency to L4S traffic, without harming classic traffic. Every update interval a new internal base probability is calculated, based on queue delay. The base probability is updated with a delta based on the difference between the current queue delay and the 
+.I "" target
+delay, and the queue growth comparing with the queuing delay during the previous 
+.I "" tupdate
+interval. The integral gain factor 
+.RB "" alpha
+is used to correct slowly enough any persistent standing queue error to the user specified target delay, while the proportional gain factor
+.RB "" beta
+is used to quickly compensate for queue changes (growth or shrink).
+
+The updated base probability is used as input to decide to mark and drop packets. DUALPI2 scales the calculated probability for each of the two queues accordingly. For the L4S queue, the probability is multiplied by a 
+.RB "" coupling_factor
+, while for the classic queue, it is squared to compensate the squareroot rate equation of Reno/Cubic. The ECT identifier (
+.RB "" l4s_ect | any_ect
+) is used to classify traffic into respective queues.
+
+If DUALPI2 AQM has detected overload (when excessive non-responsive traffic is sent), it can signal congestion solely using 
+.RB "" drop
+, irrespective of the ECN field, or alternatively limit the drop probability and let the queue grow and eventually 
+.RB "" overflow
+(like tail-drop).
+
+Additional details can be found in the draft cited below.
+
+.SH PARAMETERS
+.TP
+.BI limit " PACKETS"
+Limit the number of packets that can be enqueued. Incoming packets are dropped when this limit
+is reached. This limit is common for the L4S and Classic queue. Defaults to
+.I 10000
+packets. This is about 125ms delay on a 1Gbps link.
+.TP
+.BI coupling_factor " NUMBER"
+Set the coupling rate factor between Classic and L4S. Defaults to
+.I 2
+.TP
+.B l4s_ect | any_ect
+Configures the ECT classifier. Packets whose ECT codepoint matches this are sent to the L4S queue where they receive a scalable marking. Defaults to
+.I l4s_ect
+, i.e., the L4S identifier ECT(1). Setting this to
+.I any_ect
+causes all packets whose ECN field is not zero to be sent to the L4S queue. This enables to be backward compatible with, e.g., DCTCP.
+.PD
+.BI step_thresh " TIME | PACKETS"
+Set the step threshold for the L4S queue. This will cause packets with a sojourn time exceeding the threshold to always be marked. This value can either be specified using time units (i.e., us, ms, s), or in packets (pkt, packet(s)). A velue without units is assumed to be in time (us). If defining the step in packets, be sure to disable GRO on the ingress interfaces. Defaults to
+.I 1ms
+.
+.TP
+.B drop_on_overload  |  overflow
+Control the overload strategy. 
+.I drop_on_overload
+preserves the delay in the L4S queue by dropping in both queues on overload.
+.I overflow
+sacrifices delay to avoid losses, eventually resulting in a taildrop behavior once
+.I limit
+is reached. Defaults to
+.I drop_on_overload.
+.PD
+.TP
+.B drop_enqueue | drop_dequeue
+Decide when packets are PI-based dropped or marked. The
+.I step_thresh 
+based L4S marking is always at dequeue. Defaults to
+.I drop_dequeue
+.PD
+.TP
+.BI classic_protection " PERCENTAGE
+Protects the classic queue from unresponsive traffic in the L4S queue. This bounds the maximal delay in the C queue to be
+.I (100 - PERCENTAGE)
+times greater than the one in the L queue. Defaults to
+.I 10
+.TP
+.BI typical_rtt " TIME"
+.PD 0
+.TP
+.PD
+.BI max_rtt " TIME"
+Specify the maximum round trip time (RTT) and/or the typical RTT of the traffic
+that will be controlled by dualpi2. If either of
+.I max_rtt
+or
+.I typical_rtt
+is not specified, the missing value will be computed from the following 
+relationship:
+.I max_rtt = typical_rtt * 6.
+If any of these parameters is given, it will be used to automatically compute
+suitable values for
+.I alpha, beta, target, and tupdate,
+according to the relationship from the appendix A.1 in the IETF draft, to
+achieve a stable control. Consequently, those derived values will override their
+eventual user-provided ones. The default range of operation for the qdisc uses
+.I max_rtt = 100ms
+and 
+.I typical_rtt = 15ms
+, which is suited to control internet traffic.
+.TP
+.BI target " TIME"
+Set the expected queue delay. Defaults to
+.I 15
+ms.
+.TP
+.BI tupdate " TIME"
+Set the frequency at which the system drop probability is calculated. Defaults to
+.I 16
+ms. This should be a third of the max RTT supported.
+.TP
+.BI alpha " float"
+.PD 0
+.TP
+.PD
+.BI beta " float"
+Set alpha and beta, the integral and proportional gain factors in Hz for the PI controller. These can be calculated based on control theory. Defaults are
+.I 0.16
+and
+.I 3.2
+Hz, which provide stable control for RTT's up to 100ms with tupdate of 16. Be aware, unlike with PIE, these are the real unscaled gain factors.
+
+.SH EXAMPLES
+Setting DUALPI2 for the Internet with default parameters:
+ # sudo tc qdisc add dev eth0 root dualpi2
+
+Setting DUALPI2 for datacenter with legacy DCTCP using ECT(0):
+ # sudo tc qdisc add dev eth0 root dualpi2 any_ect
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-pie (8)
+
+.SH SOURCES
+.IP \(bu 4
+IETF draft submission is at https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+.IP \(bu 4
+CoNEXT '16 Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies : "PI2: A
+Linearized AQM for both Classic and Scalable TCP"
+
+.SH AUTHORS
+DUALPI2 was implemented by Koen De Schepper, Olga Albisser, Henrik Steen, and Olivier Tilmans also the authors of
+this man page. Please report bugs and corrections to the Linux networking
+development mailing list at <netdev@vger.kernel.org>.
diff --git a/tc/Makefile b/tc/Makefile
index 14171a28..35f02da3 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ SHARED_LIBS ?= y
 
 TCMODULES :=
 TCMODULES += q_fifo.o
+TCMODULES += q_dualpi2.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
diff --git a/tc/q_dualpi2.c b/tc/q_dualpi2.c
new file mode 100644
index 00000000..1fda2ec3
--- /dev/null
+++ b/tc/q_dualpi2.c
@@ -0,0 +1,439 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 Nokia.
+ *
+ * DualQ PI Improved with a Square (dualpi2)
+ * Supports controlling scalable congestion controls (DCTCP, etc...)
+ * Supports DualQ with PI2
+ * Supports L4S ECN identifier
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <math.h>
+#include <errno.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define MAX_PROB ((uint32_t)(~((uint32_t)0)))
+#define DEFAULT_ALPHA_BETA ((uint32_t)(~((uint32_t)0)))
+#define ALPHA_BETA_MAX ((2 << 23) - 1) /* see net/sched/sch_dualpi2.c */
+#define ALPHA_BETA_SCALE (1 << 8)
+#define RTT_TYP_TO_MAX 6
+
+enum {
+	INET_ECN_NOT_ECT = 0,
+	INET_ECN_ECT_1 = 1,
+	INET_ECN_ECT_0 = 2,
+	INET_ECN_CE = 3,
+	INET_ECN_MASK = 3,
+};
+
+static const char *get_ecn_type(uint8_t ect)
+{
+	switch (ect & INET_ECN_MASK) {
+		case INET_ECN_ECT_1: return "l4s_ect";
+		case INET_ECN_ECT_0:
+		case INET_ECN_MASK: return "any_ect";
+		default:
+			fprintf(stderr,
+				"Warning: Unexpected ecn type %u!\n", ect);
+			return "";
+	}
+}
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... dualpi2\n");
+	fprintf(stderr, "               [limit PACKETS]\n");
+	fprintf(stderr, "               [coupling_factor NUMBER]\n");
+	fprintf(stderr, "               [step_thresh TIME|PACKETS]\n");
+	fprintf(stderr, "               [drop_on_overload|overflow]\n");
+	fprintf(stderr, "               [drop_enqueue|drop_dequeue]\n");
+	fprintf(stderr, "               [classic_protection PERCENTAGE]\n");
+	fprintf(stderr, "               [max_rtt TIME [typical_rtt TIME]]\n");
+	fprintf(stderr, "               [target TIME] [tupdate TIME]\n");
+	fprintf(stderr, "               [alpha ALPHA] [beta BETA]\n");
+}
+
+static int get_float(float *val, const char *arg, float min, float max)
+{
+        float res;
+        char *ptr;
+
+        if (!arg || !*arg)
+                return -1;
+        res = strtof(arg, &ptr);
+        if (!ptr || ptr == arg || *ptr)
+                return -1;
+	if (res < min || res > max)
+		return -1;
+        *val = res;
+        return 0;
+}
+
+static int get_packets(uint32_t *val, const char *arg)
+{
+	unsigned long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtoul(arg, &ptr, 10);
+	if (!ptr || ptr == arg ||
+	    (strcmp(ptr, "pkt") && strcmp(ptr, "packet") &&
+	     strcmp(ptr, "packets")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res > 0xFFFFFFFFUL)
+		return -1;
+	*val = res;
+	return 0;
+}
+
+static int parse_alpha_beta(const char *name, char *argv, uint32_t *field)
+{
+
+	float field_f;
+
+	if (get_float(&field_f, argv, 0.0, ALPHA_BETA_MAX)) {
+		fprintf(stderr, "Illegal \"%s\"\n", name);
+		return -1;
+	}
+	else if (field_f < 1.0f / ALPHA_BETA_SCALE)
+		fprintf(stderr, "Warning: \"%s\" is too small and will be "
+			"rounded to zero.\n", name);
+	*field = (uint32_t)(field_f * ALPHA_BETA_SCALE);
+	return 0;
+}
+
+static int try_get_percentage(int *val, const char *arg, int base)
+{
+	long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtol(arg, &ptr, base);
+	if (!ptr || ptr == arg || (*ptr && strcmp(ptr, "%")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res < 0 || res > 100)
+		return -1;
+
+	*val = res;
+	return 0;
+}
+
+static int dualpi2_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			 struct nlmsghdr *n, const char* dev)
+{
+	uint32_t limit = 0;
+	uint32_t target = 0;
+	uint32_t tupdate = 0;
+	uint32_t alpha = DEFAULT_ALPHA_BETA;
+	uint32_t beta = DEFAULT_ALPHA_BETA;
+	int32_t coupling_factor = -1;
+	uint8_t ecn_mask = INET_ECN_NOT_ECT;
+	bool step_packets = false;
+	uint32_t step_thresh = 0;
+	int c_protection = -1;
+	int drop_early = -1;
+	int drop_overload = -1;
+	uint32_t rtt_max = 0;
+	uint32_t rtt_typ = 0;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "limit") == 0) {
+			NEXT_ARG();
+			if (get_u32(&limit, *argv, 10)) {
+				fprintf(stderr, "Illegal \"limit\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "target") == 0) {
+			NEXT_ARG();
+			if (get_time(&target, *argv)) {
+				fprintf(stderr, "Illegal \"target\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "tupdate") == 0) {
+			NEXT_ARG();
+			if (get_time(&tupdate, *argv)) {
+				fprintf(stderr, "Illegal \"tupdate\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "alpha") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("alpha", *argv, &alpha))
+				return -1;
+		} else if (strcmp(*argv, "beta") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("beta", *argv, &beta))
+				return -1;
+		} else if (strcmp(*argv, "coupling_factor") == 0) {
+			NEXT_ARG();
+			if (get_s32(&coupling_factor, *argv, 0) ||
+			    coupling_factor > 0xFFUL ||coupling_factor < 0) {
+				fprintf(stderr,
+					"Illegal \"coupling_factor\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "l4s_ect") == 0)
+			ecn_mask = INET_ECN_ECT_1;
+		else if (strcmp(*argv, "any_ect") == 0)
+			ecn_mask = INET_ECN_MASK;
+		else if (strcmp(*argv, "step_thresh") == 0) {
+			NEXT_ARG();
+			/* First assume that this is specified in time */
+			if (get_time(&step_thresh, *argv)) {
+				/* Then packets */
+				if (get_packets(&step_thresh, *argv)) {
+					fprintf(stderr,
+						"Illegal \"step_thresh\"\n");
+					return -1;
+				}
+				step_packets = true;
+			}
+		} else if (strcmp(*argv, "overflow") == 0) {
+                        drop_overload = 0;
+		} else if (strcmp(*argv, "drop_on_overload") == 0) {
+                        drop_overload = 1;
+		} else if (strcmp(*argv, "drop_enqueue") == 0) {
+			drop_early = 1;
+		} else if (strcmp(*argv, "drop_dequeue") == 0) {
+			drop_early = 0;
+		} else if (strcmp(*argv, "classic_protection") == 0) {
+                        NEXT_ARG();
+                        if (try_get_percentage(&c_protection, *argv, 10) ||
+			    c_protection > 100 ||
+			    c_protection < 0) {
+                                fprintf(stderr,
+					"Illegal \"classic_protection\"\n");
+                                return -1;
+                        }
+		} else if (strcmp(*argv, "max_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_max, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_max\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "typical_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_typ, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_typ\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		--argc;
+		++argv;
+	}
+
+	if (rtt_max || rtt_typ) {
+		float alpha_f, beta_f;
+		SPRINT_BUF(max_rtt_t);
+		SPRINT_BUF(typ_rtt_t);
+		SPRINT_BUF(tupdate_t);
+		SPRINT_BUF(target_t);
+
+		if (!rtt_typ)
+			rtt_typ = max(rtt_max / RTT_TYP_TO_MAX, 1U);
+		else if (!rtt_max)
+			rtt_max = rtt_typ * RTT_TYP_TO_MAX;
+		else if (rtt_typ > rtt_max) {
+			fprintf(stderr, "typical_rtt must be >= max_rtt!\n");
+			return -1;
+		}
+		if (alpha != DEFAULT_ALPHA_BETA || beta != DEFAULT_ALPHA_BETA ||
+		    tupdate || target)
+			fprintf(stderr, "rtt_max is specified, ignoring values "
+				"specified for alpha/beta/tupdate/target\n");
+		target = rtt_typ;
+		tupdate = max(min(rtt_typ, rtt_max / 3), 1U);
+		alpha_f = (double)tupdate / ((double)rtt_max * rtt_max)
+			* TIME_UNITS_PER_SEC * 0.1f;
+		beta_f = 0.3f / (float)rtt_max * TIME_UNITS_PER_SEC;
+		if (beta_f > ALPHA_BETA_MAX) {
+			fprintf(stderr, "max_rtt=%s is too low and cause beta "
+				"to overflow!\n",
+				sprint_time(rtt_max, max_rtt_t));
+			return -1;
+		}
+		if (alpha_f < 1.0f / ALPHA_BETA_SCALE ||
+		    beta_f < 1.0f / ALPHA_BETA_SCALE) {
+			fprintf(stderr, "max_rtt=%s is too large and will "
+				"cause alpha=%f and/or beta=%f to be rounded "
+				"down to 0!\n", sprint_time(rtt_max, max_rtt_t),
+				alpha_f, beta_f);
+			return -1;
+		}
+		fprintf(stderr, "Auto-configuring parameters using "
+			"[max_rtt: %s, typical_rtt: %s]: "
+			"target=%s tupdate=%s alpha=%f beta=%f\n",
+			sprint_time(rtt_max, max_rtt_t),
+			sprint_time(rtt_typ, typ_rtt_t),
+			sprint_time(target, target_t),
+			sprint_time(tupdate, tupdate_t), alpha_f, beta_f);
+		alpha = alpha_f * ALPHA_BETA_SCALE;
+		beta = beta * ALPHA_BETA_SCALE;
+	}
+
+	tail = addattr_nest(n, 1024, TCA_OPTIONS);
+	if (limit)
+		addattr32(n, 1024, TCA_DUALPI2_LIMIT, limit);
+	if (tupdate)
+		addattr32(n, 1024, TCA_DUALPI2_TUPDATE, tupdate);
+	if (target)
+		addattr32(n, 1024, TCA_DUALPI2_TARGET, target);
+	if (alpha != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_ALPHA, alpha);
+	if (beta != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_BETA, beta);
+	if (ecn_mask != INET_ECN_NOT_ECT)
+		addattr8(n, 1024, TCA_DUALPI2_ECN_MASK, ecn_mask);
+	if (drop_overload != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_OVERLOAD, drop_overload);
+	if (coupling_factor != -1)
+		addattr8(n, 1024, TCA_DUALPI2_COUPLING, coupling_factor);
+	if (step_thresh) {
+		addattr32(n, 1024, TCA_DUALPI2_STEP_THRESH, step_thresh);
+                addattr8(n, 1024, TCA_DUALPI2_STEP_PACKETS, step_packets);
+	}
+	if (drop_early != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_EARLY, drop_early);
+	if (c_protection != -1)
+		addattr8(n, 1024, TCA_DUALPI2_C_PROTECTION, c_protection);
+	addattr_nest_end(n, tail);
+	return 0;
+}
+
+static int dualpi2_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_DUALPI2_MAX + 1];
+	uint32_t tupdate;
+	uint32_t target;
+	uint32_t step_thresh;
+	bool step_packets = false;
+	SPRINT_BUF(b1);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_DUALPI2_MAX, opt);
+
+	if (tb[TCA_DUALPI2_LIMIT] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_LIMIT]) >= sizeof(__uint32_t))
+		fprintf(f, "limit %up ",
+			rta_getattr_u32(tb[TCA_DUALPI2_LIMIT]));
+	if (tb[TCA_DUALPI2_TARGET] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TARGET]) >= sizeof(__uint32_t)) {
+		target = rta_getattr_u32(tb[TCA_DUALPI2_TARGET]);
+		fprintf(f, "target %s ", sprint_time(target, b1));
+	}
+	if (tb[TCA_DUALPI2_TUPDATE] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TUPDATE]) >= sizeof(__uint32_t)) {
+		tupdate = rta_getattr_u32(tb[TCA_DUALPI2_TUPDATE]);
+		fprintf(f, "tupdate %s ", sprint_time(tupdate, b1));
+	}
+	if (tb[TCA_DUALPI2_ALPHA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ALPHA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "alpha %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_ALPHA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_BETA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_BETA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "beta %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_BETA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_ECN_MASK] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ECN_MASK]) >= sizeof(__u8))
+		fprintf(f, "%s ",
+			get_ecn_type(rta_getattr_u8(tb[TCA_DUALPI2_ECN_MASK])));
+	if (tb[TCA_DUALPI2_COUPLING] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_COUPLING]) >= sizeof(__u8))
+		fprintf(f, "coupling_factor %u ",
+			rta_getattr_u8(tb[TCA_DUALPI2_COUPLING]));
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_OVERLOAD]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]))
+			fprintf(f, "drop_on_overload ");
+		else
+			fprintf(f, "overflow ");
+	}
+	if (tb[TCA_DUALPI2_STEP_PACKETS] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_PACKETS]) >= sizeof(__u8) &&
+	    rta_getattr_u8(tb[TCA_DUALPI2_STEP_PACKETS]))
+                        step_packets = true;
+	if (tb[TCA_DUALPI2_STEP_THRESH] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_THRESH]) >= sizeof(__uint32_t)) {
+		step_thresh = rta_getattr_u32(tb[TCA_DUALPI2_STEP_THRESH]);
+		if (step_packets)
+			fprintf(f, "step_thresh %upkt ", step_thresh);
+		else
+			fprintf(f, "step_thresh %s ",
+				sprint_time(step_thresh, b1));
+	}
+	if (tb[TCA_DUALPI2_DROP_EARLY] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_EARLY]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_EARLY]))
+			fprintf(f, "drop_enqueue ");
+		else
+			fprintf(f, "drop_dequeue ");
+	}
+	if (tb[TCA_DUALPI2_C_PROTECTION] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_C_PROTECTION]) >= sizeof(__u8))
+                fprintf(f, "classic_protection %u%% ",
+			rta_getattr_u8(tb[TCA_DUALPI2_C_PROTECTION]));
+
+	return 0;
+}
+
+static int dualpi2_print_xstats(struct qdisc_util *qu, FILE *f,
+			    struct rtattr *xstats)
+{
+	struct tc_dualpi2_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+
+	st = RTA_DATA(xstats);
+	fprintf(f, "prob %f delay_c %uus delay_l %uus\n",
+		(double)st->prob / (double)MAX_PROB, st->delay_c, st->delay_l);
+	fprintf(f, "pkts_in_c %u pkts_in_l %u maxq %u\n",
+		st->packets_in_c, st->packets_in_l, st->maxq);
+	fprintf(f, "ecn_mark %u step_marks %u\n", st->ecn_mark, st->step_marks);
+	fprintf(f, "credit %d (%c)\n", st->credit, st->credit > 0 ? 'C' : 'L');
+	return 0;
+
+}
+
+struct qdisc_util dualpi2_qdisc_util = {
+	.id = "dualpi2",
+	.parse_qopt	= dualpi2_parse_opt,
+	.print_qopt	= dualpi2_print_opt,
+	.print_xstats	= dualpi2_print_xstats,
+};
-- 
2.22.0


^ permalink raw reply related

* [PATCH net-next v4] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:31 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org

From: Olga Albisser <olga@albisser.org>

DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
   more responsive and simpler. Classic traffic requires a decent
   target queue (default 15ms for Internet deployment) to fully
   utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
   threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given as a parameter, or they can be
calculated by tc from alternative typical and maximum RTT parameters.

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.

If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

Additional details can be found in the draft:
https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v3-> v4
      - Replaced license boiletplate with SPDX identifier
      - Fix missing pskb_may_pull() calls when accessing ECN bits
      - Move timestamp computation at enqueue to happen after drop check
      - Use NMI-safe time keeping function, i.e., ktime_get_ns()
      - Switched from deprecated PSCHED_NS2TICKS/... to raw nanoseconds clocks
      - Validate netlink parameters properly (ranges, error reporting)
      - Expanded the statistics tracked/reported to better reflect the behavior of
        both queues
      - Simplified the qdisc structure:
        o Reworked classification logic to only depend on an ECN mask
        o Renamed most parameters to better reflect their usage
        o Removed unused/experimental features (e.g., TS-FIFO)
        o Restructured the skb->cb
        o Extracted helper functions
      - Fix compilation issues for ARM
      - Updated defaults parameter values to latest IETF ID
      - Fix the step AQM being applied on empty queues, causing excess marking on
        slower links
    * v2 -> v3
      - Fix compilation issues
      - Replaced the classic queue starvation protection from time-shifted FIFO
        to WRR, as it gives better results (e.g., prevents leaking burst in the C
        queue to the L queue)
    * v1 -> v2
      - Store enqueue timestamp in skb->cb to avoid conflict with EDT

 include/uapi/linux/pkt_sched.h |  33 ++
 net/sched/Kconfig              |  22 +-
 net/sched/Makefile             |   1 +
 net/sched/sch_dualpi2.c        | 744 +++++++++++++++++++++++++++++++++
 4 files changed, 799 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..e2ad4a8d2059 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;		/* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;		/* maximum queue size */
+	__u32 ecn_mark;		/* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..f9340c18c3a2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,26 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_DUALPI2
+	tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+	help
+	  Say Y here if you want to use the DualPI2 AQM.
+	  This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+	  The PI2 AQM is in turn both an extension and a simplification of the
+	  PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being
+	  able to control scalable congestion controls like DCTCP and
+	  TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with
+	  DCTCP, maintaining window fairness. DUALQ provides latency separation
+	  between low latency DCTCP flows and Reno/Cubic flows that need a
+	  bigger queue.
+	  For more information, please see
+	  https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+
+	  To compile this code as a module, choose M here: the module
+	  will be called sch_dualpi2.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
@@ -418,7 +438,7 @@ menuconfig NET_SCH_DEFAULT
 	  of pfifo_fast will be used. Many distributions already set
 	  the default value via /proc/sys/net/core/default_qdisc.
 
-	  If unsure, say N.
+
 
 if NET_SCH_DEFAULT
 
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..8e3bd4459eb4 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2)   += sch_dualpi2.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..a6452aa82018
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2019 Nokia.
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ *   Supports scalable congestion controls (e.g., DCTCP)
+ *   Supports coupled dual-queue with PI2
+ *   Supports L4S ECN identifier
+ *
+ * References:
+ *   draft-ietf-tsvwg-aqm-dualq-coupled:
+ *     http://tools.ietf.org/html/draft-ietf-tsvwg-aqm-dualq-coupled-08
+ *   De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/string.h>
+
+/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB ((u32)(~((u32)0)))
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
+ * control flows whose maximal RTTs can be in usec up to few secs.
+ */
+#define ALPHA_BETA_MAX ((2 << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This enables to use all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+	struct Qdisc *l_queue;	/* The L4S LL queue */
+	struct Qdisc *sch;	/* The classic queue (owner of this struct) */
+
+	struct { /* PI2 parameters */
+		u64	target;	/* Target delay in nanoseconds */
+		u32	tupdate;/* timer frequency (in jiffies) */
+		u32	prob;	/* Base PI2 probability */
+		u32	alpha;	/* Gain factor for the integral rate response */
+		u32	beta;	/* Gain factor for the proportional response */
+		struct timer_list timer; /* prob update timer */
+	} pi2;
+
+	struct { /* Step AQM (L4S queue only) parameters */
+		u32 thresh;	/* Step threshold */
+		bool in_packets;/* Whether the step is in packets or time */
+	} step;
+
+	struct { /* Classic queue starvation protection */
+		s32	credit; /* Credit (sign indicates which queue) */
+		s32	init;	/* Reset value of the credit */
+		u8	wc;	/* C queue weight (between 0 and MAX_WC) */
+		u8	wl;	/* L queue weight (MAX_WC - wc) */
+	} c_protection;
+
+	/* General dualQ parameters */
+	u8	coupling_factor;/* Coupling factor (k) between both queues */
+	u8	ecn_mask;	/* Mask to match L4S packets */
+	bool	drop_early;	/* Drop at enqueue instead of dequeue if true */
+	bool	drop_overload;	/* Drop (1) on overload, or overflow (0) */
+
+	/* Statistics */
+	u64	qdelay_c;	/* Classic Q delay */
+	u64	qdelay_l;	/* L4S Q delay */
+	u32	packets_in_c;	/* Number of packets enqueued in C queue */
+	u32	packets_in_l;	/* Number of packets enqueued in L queue */
+	u32	maxq;		/* maximum queue size */
+	u32	ecn_mark;	/* packets marked with ECN */
+	u32	step_marks;	/* ECN marks due to the step AQM */
+
+	struct { /* Deferred drop statistics */
+		u32 cnt;	/* Packets dropped */
+		u32 len;	/* Bytes dropped */
+	} deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+	u64 ts;		/* Timestamp at enqueue */
+	u8 apply_step:1,/* Can we apply the step threshold */
+	   l4s:1,	/* Packet has been classified as L4S */
+	   ect:2;	/* Packet ECT codepoint */
+};
+
+static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+	return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+	return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static inline u64 qdelay_in_ns(struct Qdisc *q, u64 now)
+{
+	struct sk_buff *skb = qdisc_peek_head(q);
+
+	return skb ? skb_sojourn_time(skb, now) : 0;
+}
+
+static inline u32 dualpi2_scale_alpha_beta(u32 param)
+{
+	u64 tmp  = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+	do_div(tmp, NSEC_PER_SEC);
+	return tmp;
+}
+
+static inline u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+	do_div(tmp, MAX_PROB);
+	return tmp;
+}
+
+static inline bool skb_is_l4s(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->l4s != 0;
+}
+
+static inline void dualpi2_mark(struct dualpi2_sched_data *q,
+				struct sk_buff *skb)
+{
+	if (INET_ECN_set_ce(skb))
+		q->ecn_mark++;
+}
+
+static inline void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+	q->c_protection.credit = q->c_protection.init;
+}
+
+static inline void dualpi2_calculate_c_protection(struct Qdisc *sch,
+						  struct dualpi2_sched_data *q,
+						  u32 wc)
+{
+	q->c_protection.wc = wc;
+	q->c_protection.wl = MAX_WC - wc;
+	/* Start with L queue if wl > wc */
+	q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+		((int)q->c_protection.wc - (int)q->c_protection.wl);
+	dualpi2_reset_c_protection(q);
+}
+
+static inline bool dualpi2_roll(u32 prob)
+{
+	return prandom_u32() <= prob;
+}
+
+static inline bool dualpi2_squared_roll(struct dualpi2_sched_data *q)
+{
+	return dualpi2_roll(q->pi2.prob) && dualpi2_roll(q->pi2.prob);
+}
+
+static inline bool dualpi2_is_overloaded(u64 prob)
+{
+	return prob > MAX_PROB;
+}
+
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+		      struct sk_buff *skb)
+{
+	u64 local_l_prob;
+
+	/* Never drop if we have fewer than 2 mtu-sized packets;
+	 * similar to min_th in RED.
+	 */
+	if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+		return false;
+
+	local_l_prob = (u64)q->pi2.prob * q->coupling_factor;
+
+	if (skb_is_l4s(skb)) {
+		if (dualpi2_is_overloaded(local_l_prob)) {
+			/* On overload, preserve delay by doing a classic drop
+			 * in the L queue. Otherwise, let both queues grow until
+			 * we reach the limit and cannot enqueue anymore
+			 * (sacrifice delay to avoid drops).
+			 */
+			if (q->drop_overload && dualpi2_squared_roll(q))
+				goto drop;
+			else
+				goto mark;
+			/* Scalable marking has a  (prob * k) probability */
+		} else if (dualpi2_roll(local_l_prob)) {
+			goto mark;
+		}
+		/* Apply classic marking with a (prob * prob) probability.
+		 * Force drops for ECN-capable traffic on overload.
+		 */
+	} else if (dualpi2_squared_roll(q)) {
+		if (dualpi2_skb_cb(skb)->ect &&
+		    !dualpi2_is_overloaded(local_l_prob))
+			goto mark;
+		else
+			goto drop;
+	}
+	return false;
+
+mark:
+	dualpi2_mark(q, skb);
+	return false;
+
+drop:
+	return true;
+}
+
+static void dualpi2_skb_classify(struct dualpi2_sched_data *q,
+				 struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	int wlen = skb_network_offset(skb);
+
+	switch (tc_skb_protocol(skb)) {
+	case htons(ETH_P_IP):
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+		break;
+	case htons(ETH_P_IPV6):
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+		break;
+	default:
+		goto not_ecn;
+	}
+	cb->l4s = (cb->ect & q->ecn_mask) != 0;
+	return;
+
+not_ecn:
+	/* Not ECN capable or not non pullable/writable packets can only be
+	 * dropped hence go the the classic queue.
+	 */
+	cb->ect = INET_ECN_NOT_ECT;
+	cb->l4s = 0;
+}
+
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+		qdisc_qstats_overlimit(sch);
+		err = NET_XMIT_DROP;
+		goto drop;
+	}
+
+	dualpi2_skb_classify(q, skb);
+
+	/* drop early if configured */
+	if (q->drop_early && must_drop(sch, q, skb)) {
+		err = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+		goto drop;
+	}
+
+	dualpi2_skb_cb(skb)->ts = ktime_get_ns();
+
+	if (qdisc_qlen(sch) > q->maxq)
+		q->maxq = qdisc_qlen(sch);
+
+	if (skb_is_l4s(skb)) {
+		/* Only apply the step if a queue is building up */
+		dualpi2_skb_cb(skb)->apply_step = qdisc_qlen(q->l_queue) > 1;
+		/* Keep the overall qdisc stats consistent */
+		++sch->q.qlen;
+		qdisc_qstats_backlog_inc(sch, skb);
+		++q->packets_in_l;
+		return qdisc_enqueue_tail(skb, q->l_queue);
+	}
+	++q->packets_in_c;
+	return qdisc_enqueue_tail(skb, sch);
+
+drop:
+	qdisc_drop(skb, sch, to_free);
+	return err;
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int qlen_c, credit_change;
+
+pick_packet:
+	/* L queue packets are also accounted for in qdisc_qlen(sch)! */
+	qlen_c = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+	skb = NULL;
+	/* We can drop after qdisc_dequeue_head() calls.
+	 * Manage statistics by hand to keep them consistent if that happens.
+	 */
+	if (qdisc_qlen(q->l_queue) > 0 &&
+	    (qlen_c <= 0 || q->c_protection.credit <= 0)) {
+		/* Dequeue and increase the credit by wc if qlen_c != 0 */
+		skb = __qdisc_dequeue_head(&q->l_queue->q);
+		credit_change = qlen_c ?
+			q->c_protection.wc * qdisc_pkt_len(skb) : 0;
+		/* The global backlog will be updated later. */
+		qdisc_qstats_backlog_dec(q->l_queue, skb);
+		/* Propagate the dequeue to the global stats. */
+		--sch->q.qlen;
+	} else if (qlen_c > 0) {
+		/* Dequeue and decrease the credit by wl if qlen_l != 0 */
+		skb = __qdisc_dequeue_head(&sch->q);
+		credit_change = qdisc_qlen(q->l_queue) ?
+			(s32)(-1) * q->c_protection.wl * qdisc_pkt_len(skb) : 0;
+	} else {
+		dualpi2_reset_c_protection(q);
+		goto exit;
+	}
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	/* Drop on dequeue? */
+	if (!q->drop_early && must_drop(sch, q, skb)) {
+		++q->deferred_drops.cnt;
+		q->deferred_drops.len += qdisc_pkt_len(skb);
+		consume_skb(skb);
+		qdisc_qstats_drop(sch);
+		/* try next packet */
+		goto pick_packet;
+	}
+
+	/* Apply the Step AQM to packets coming out of the L queue. */
+	if (skb_is_l4s(skb)) {
+		u64 qdelay = 0;
+
+		if (q->step.in_packets)
+			qdelay = qdisc_qlen(q->l_queue);
+		else
+			qdelay = skb_sojourn_time(skb, ktime_get_ns());
+		/* Apply the step */
+		if (likely(dualpi2_skb_cb(skb)->apply_step) &&
+		    qdelay > q->step.thresh) {
+			dualpi2_mark(q, skb);
+			++q->step_marks;
+		}
+		qdisc_bstats_update(q->l_queue, skb);
+	}
+
+	q->c_protection.credit += credit_change;
+	qdisc_bstats_update(sch, skb);
+
+exit:
+	/* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+	 * or HTB crashes.
+	 */
+	if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+		qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+					  q->deferred_drops.len);
+		q->deferred_drops.cnt = 0;
+		q->deferred_drops.len = 0;
+	}
+	return skb;
+}
+
+static s64 __scale_delta(s64 diff)
+{
+	do_div(diff, (1 << (ALPHA_BETA_GRANULARITY + 1)) - 1);
+	return diff;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay, qdelay_old, now;
+	u32 new_prob;
+	s64 delta;
+
+	qdelay_old = max_t(u64, q->qdelay_c, q->qdelay_l);
+	now = ktime_get_ns();
+	q->qdelay_l = qdelay_in_ns(q->l_queue, now);
+	q->qdelay_c = qdelay_in_ns(sch, now);
+	qdelay = max_t(u64, q->qdelay_c, q->qdelay_l);
+	/* Alpha and beta take at most 32b, i.e, the delay difference would
+	 * overflow for queueing delay differences > ~4.2sec.
+	 */
+	delta = __scale_delta(((s64)qdelay - q->pi2.target) * q->pi2.alpha);
+	delta += __scale_delta(((s64)qdelay - qdelay_old) * q->pi2.beta);
+	new_prob = delta + q->pi2.prob;
+	/* Prevent overflow */
+	if (delta > 0) {
+		if (new_prob < q->pi2.prob)
+			new_prob = MAX_PROB;
+		/* Prevent underflow */
+	} else if (new_prob > q->pi2.prob) {
+		new_prob = 0;
+	}
+	/* If we do not drop on overload, ensure we cap the L4S probability to
+	 * 100% to keep window fairness when overflowing.
+	 */
+	if (!q->drop_overload)
+		return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+	return new_prob;
+}
+
+static void dualpi2_timer(struct timer_list *timer)
+{
+	struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+	struct Qdisc *sch = q->sch;
+	spinlock_t *root_lock; /* Lock to access the head of both queues. */
+
+	root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+	spin_lock(root_lock);
+
+	q->pi2.prob = calculate_probability(sch);
+	mod_timer(&q->pi2.timer, jiffies + q->pi2.tupdate);
+
+	spin_unlock(root_lock);
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+	[TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+	[TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+	[TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+	[TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+	[TCA_DUALPI2_BETA] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+	[TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+	[TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+	[TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+	unsigned int old_qlen, dropped = 0;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+	err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+					  dualpi2_policy, extack);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	if (tb[TCA_DUALPI2_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+		if (!limit) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+					    "limit must be greater than 0 !");
+			return -EINVAL;
+		}
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_DUALPI2_TARGET])
+		q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_TUPDATE]) {
+		u64 tupdate =
+			usecs_to_jiffies(nla_get_u32(tb[TCA_DUALPI2_TUPDATE]));
+
+		if (!tupdate) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+					    "tupdate cannot be 0 jiffies!");
+			return -EINVAL;
+		}
+		q->pi2.tupdate = tupdate;
+	}
+
+	if (tb[TCA_DUALPI2_ALPHA]) {
+		u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+		if (alpha > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+					    "alpha is too large!");
+			return -EINVAL;
+		}
+		q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+	}
+
+	if (tb[TCA_DUALPI2_BETA]) {
+		u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+		if (beta > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+					    "beta is too large!");
+			return -EINVAL;
+		}
+		q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+	}
+
+	if (tb[TCA_DUALPI2_STEP_THRESH])
+		q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_COUPLING]) {
+		u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+		if (!coupling) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+					    "Must use a non-zero coupling!");
+			return -EINVAL;
+		}
+		q->coupling_factor = coupling;
+	}
+
+	if (tb[TCA_DUALPI2_STEP_PACKETS])
+		q->step.in_packets = nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+		q->drop_overload = nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+	if (tb[TCA_DUALPI2_DROP_EARLY])
+		q->drop_early = nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+	if (tb[TCA_DUALPI2_C_PROTECTION]) {
+		u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+		if (wc > MAX_WC) {
+			NL_SET_ERR_MSG_ATTR(extack,
+					    tb[TCA_DUALPI2_C_PROTECTION],
+					    "c_protection must be <= 100!");
+			return -EINVAL;
+		}
+		dualpi2_calculate_c_protection(sch, q, wc);
+	}
+
+	if (tb[TCA_DUALPI2_ECN_MASK])
+		q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+	/* Drop excess packets if new limit is lower */
+	old_qlen = qdisc_qlen(sch);
+	while (qdisc_qlen(sch) > sch->limit) {
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+		dropped += qdisc_pkt_len(skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		rtnl_qdisc_drop(skb, sch);
+	}
+	qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), dropped);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+	q->sch->limit = 10000; /* Holds 125ms at 1G */
+
+	q->pi2.target = 15 * NSEC_PER_MSEC;
+	q->pi2.tupdate = usecs_to_jiffies(16 * USEC_PER_MSEC);
+	q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz in 1/256th */
+	q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.2 Hz in 1/256th */
+	/* These values give a 10dB stability margin with max_rtt=100ms */
+
+	q->step.thresh = 1 * NSEC_PER_MSEC; /* 1ms */
+	q->step.in_packets = false; /* Step in time not packets */
+
+	dualpi2_calculate_c_protection(q->sch, q, 10); /* Defaults to wc = 10 */
+
+	q->ecn_mask = INET_ECN_ECT_1; /* l4s-id */
+	q->coupling_factor = 2; /* window fairness for equal RTTs */
+	q->drop_overload = true; /* Preserve latency by dropping on overload */
+	q->drop_early = false; /* PI2 drop on dequeue */
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+				       TC_H_MAKE(sch->handle, 1), extack);
+	if (!q->l_queue)
+		return -ENOMEM;
+
+	q->sch = sch;
+	dualpi2_reset_default(q);
+	timer_setup(&q->pi2.timer, dualpi2_timer, 0);
+
+	if (opt) {
+		int err = dualpi2_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	mod_timer(&q->pi2.timer, (jiffies + HZ) >> 1);
+	return 0;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct nlattr *opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 step_thresh = q->step.thresh;
+	u64 target_usec = q->pi2.target;
+
+	if (!opts)
+		goto nla_put_failure;
+
+	do_div(target_usec, NSEC_PER_USEC);
+	if (!q->step.in_packets)
+		do_div(step_thresh, NSEC_PER_USEC);
+
+	if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TARGET, target_usec) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+			jiffies_to_usecs(q->pi2.tupdate)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+			dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_BETA,
+			dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, step_thresh) ||
+	    nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+	    nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+	    nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+	    nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay_c_usec = q->qdelay_c;
+	u64 qdelay_l_usec = q->qdelay_l;
+	struct tc_dualpi2_xstats st = {
+		.prob		= q->pi2.prob,
+		.packets_in_c	= q->packets_in_c,
+		.packets_in_l	= q->packets_in_l,
+		.maxq		= q->maxq,
+		.ecn_mark	= q->ecn_mark,
+		.credit		= q->c_protection.credit,
+		.step_marks	= q->step_marks,
+	};
+
+	do_div(qdelay_c_usec, NSEC_PER_USEC);
+	do_div(qdelay_l_usec, NSEC_PER_USEC);
+	st.delay_c = qdelay_c_usec;
+	st.delay_l = qdelay_l_usec;
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset_queue(sch);
+	qdisc_reset_queue(q->l_queue);
+	q->qdelay_c = 0;
+	q->qdelay_l = 0;
+	q->pi2.prob = 0;
+	q->packets_in_c = 0;
+	q->packets_in_l = 0;
+	q->maxq = 0;
+	q->ecn_mark = 0;
+	q->step_marks = 0;
+	dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->pi2.tupdate = 0;
+	del_timer_sync(&q->pi2.timer);
+	if (q->l_queue)
+		qdisc_put(q->l_queue);
+}
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+	.id = "dualpi2",
+	.priv_size	= sizeof(struct dualpi2_sched_data),
+	.enqueue	= dualpi2_qdisc_enqueue,
+	.dequeue	= dualpi2_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= dualpi2_init,
+	.destroy	= dualpi2_destroy,
+	.reset		= dualpi2_reset,
+	.change		= dualpi2_change,
+	.dump		= dualpi2_dump,
+	.dump_stats	= dualpi2_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+	return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+	unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper");
+MODULE_AUTHOR("Olga Albisser");
+MODULE_AUTHOR("Henrik Steen");
+MODULE_AUTHOR("Olivier Tilmans");
+MODULE_LICENSE("GPL");
-- 
2.22.0


^ permalink raw reply related

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Naveen N. Rao @ 2019-08-21  8:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Jiong Wang
  Cc: bpf, linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <20190813171018.28221-1-naveen.n.rao@linux.vnet.ibm.com>

Naveen N. Rao wrote:
> Since BPF constant blinding is performed after the verifier pass, there
> are certain ALU32 instructions inserted which don't have a corresponding
> zext instruction inserted after. This is causing a kernel oops on
> powerpc and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
> 
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
> 
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> This approach (the location where zext is being introduced below, in 
> particular) works for powerpc, but I am not entirely sure if this is 
> sufficient for other architectures as well. This is broken on v5.3-rc4.

Alexie, Daniel, Jiong,
Any feedback on this?

- Naveen


^ permalink raw reply

* RE: [PATCH v3] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-08-21  8:16 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo, David S . Miller
  Cc: linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com
In-Reply-To: <20190820045934.24841-1-jian-hong@endlessm.com>

Hi,

> From: Jian-Hong Pan [mailto:jian-hong@endlessm.com]
> 
> There is a mass of jobs between spin lock and unlock in the hardware
> IRQ which will occupy much time originally. To make system work more
> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> reduce the time in hardware IRQ.
> 
> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> ---
> v2:
>  Change the spin_lock_irqsave/unlock_irqrestore to spin_lock/unlock in
>  rtw_pci_interrupt_handler. Because the interrupts are already disabled
>  in the hardware interrupt handler.
> 
> v3:
>  Extend the spin lock protecting area for the TX path in
>  rtw_pci_interrupt_threadfn by Realtek's suggestion
> 
>  drivers/net/wireless/realtek/rtw88/pci.c | 33 +++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c
> b/drivers/net/wireless/realtek/rtw88/pci.c
> index 00ef229552d5..a8c17a01f318 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -866,12 +866,29 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  {
>  	struct rtw_dev *rtwdev = dev;
>  	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> -	u32 irq_status[4];
> 
>  	spin_lock(&rtwpci->irq_lock);
>  	if (!rtwpci->irq_enabled)
>  		goto out;
> 
> +	/* disable RTW PCI interrupt to avoid more interrupts before the end of
> +	 * thread function
> +	 */
> +	rtw_pci_disable_interrupt(rtwdev, rtwpci);
> +out:
> +	spin_unlock(&rtwpci->irq_lock);
> +
> +	return IRQ_WAKE_THREAD;
> +}
> +
> +static irqreturn_t rtw_pci_interrupt_threadfn(int irq, void *dev)
> +{
> +	struct rtw_dev *rtwdev = dev;
> +	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> +	unsigned long flags;
> +	u32 irq_status[4];
> +
> +	spin_lock_irqsave(&rtwpci->irq_lock, flags);
>  	rtw_pci_irq_recognized(rtwdev, rtwpci, irq_status);
> 
>  	if (irq_status[0] & IMR_MGNTDOK)
> @@ -891,8 +908,10 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  	if (irq_status[0] & IMR_ROK)
>  		rtw_pci_rx_isr(rtwdev, rtwpci, RTW_RX_QUEUE_MPDU);
> 
> -out:
> -	spin_unlock(&rtwpci->irq_lock);
> +	/* all of the jobs for this interrupt have been done */
> +	if (rtw_flag_check(rtwdev, RTW_FLAG_RUNNING))
> +		rtw_pci_enable_interrupt(rtwdev, rtwpci);

I've applied this patch and tested it.
But I failed to connect to AP, it seems that the
scan_result is empty. And when I failed to connect
to the AP, I found that the IMR is not enabled.
I guess the check bypass the interrupt enable function.

And I also found that *without MSI*, the driver is
able to connect to the AP. Could you please verify
this patch again with MSI interrupt enabled and
send a v4?

You can find my MSI patch on
https://patchwork.kernel.org/patch/11081539/


> +	spin_unlock_irqrestore(&rtwpci->irq_lock, flags);
> 
>  	return IRQ_HANDLED;
>  }
> @@ -1152,8 +1171,10 @@ static int rtw_pci_probe(struct pci_dev *pdev,
>  		goto err_destroy_pci;
>  	}
> 
> -	ret = request_irq(pdev->irq, &rtw_pci_interrupt_handler,
> -			  IRQF_SHARED, KBUILD_MODNAME, rtwdev);
> +	ret = devm_request_threaded_irq(rtwdev->dev, pdev->irq,
> +					rtw_pci_interrupt_handler,
> +					rtw_pci_interrupt_threadfn,
> +					IRQF_SHARED, KBUILD_MODNAME, rtwdev);
>  	if (ret) {
>  		ieee80211_unregister_hw(hw);
>  		goto err_destroy_pci;
> @@ -1192,7 +1213,7 @@ static void rtw_pci_remove(struct pci_dev *pdev)
>  	rtw_pci_disable_interrupt(rtwdev, rtwpci);
>  	rtw_pci_destroy(rtwdev, pdev);
>  	rtw_pci_declaim(rtwdev, pdev);
> -	free_irq(rtwpci->pdev->irq, rtwdev);
> +	devm_free_irq(rtwdev->dev, rtwpci->pdev->irq, rtwdev);
>  	rtw_core_deinit(rtwdev);
>  	ieee80211_free_hw(hw);
>  }
> --


NACK
Need to verify it with MSI https://patchwork.kernel.org/patch/11081539/
And hope v4 could fix it.

Yan-Hsuan


^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Miroslav Lichvar @ 2019-08-21  8:07 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <CAFfN3gUgpzMebyUt8_-9e+5vpm3q-DVVszWdkUEFAgZQ8ex73w@mail.gmail.com>

On Tue, Aug 20, 2019 at 06:56:56PM +0200, Hubert Feurstein wrote:
> Am Di., 20. Aug. 2019 um 17:40 Uhr schrieb Miroslav Lichvar
> > I think a large jitter is ok in this case. We just need to timestamp
> > something that we know for sure happened after the PHC timestamp. It
> > should have no impact on the offset and its stability, just the
> > reported delay. A test with phc2sys should be able to confirm that.
> > phc2sys selects the measurement with the shortest delay, which has
> > least uncertainty. I'd say that applies to both interrupt and polling.
> >
> > If it is difficult to specify the minimum interrupt delay, I'd still
> > prefer an overly pessimistic interval assuming a zero delay.
> >
> Currently I do not see the benefit from this. The original intention was to
> compensate for the remaining offset as good as possible.

That's ok, but IMHO the change should not break the assumptions of
existing application and users.

> The current code
> of phc2sys uses the delay only for the filtering of the measurement record
> with the shortest delay and for reporting and statistics. Why not simple shift
> the timestamps with the offset to the point where we expect the PHC timestamp
> to be captured, and we have a very good result compared to where we came
> from.

Because those reports/statistics are important in calculation of
maximum error. If someone had a requirement for a clock to be accurate
to 1.5 microseconds and the ioctl returned a delay indicating a
sufficient accuracy when in reality it could be worse, that would be a
problem.

BTW, phc2sys is not the only user of the ioctl.

-- 
Miroslav Lichvar

^ permalink raw reply

* Re: [PATCH] `iwlist scan` fails with many networks available
From: Johannes Berg @ 2019-08-21  7:59 UTC (permalink / raw)
  To: James Nylen; +Cc: David S. Miller, linux-wireless, netdev, linux-kernel
In-Reply-To: <CABVa4Nga1vyvyWNpTTJLa44rZo8wu4-bE=mXX1nZgvzktbSq6A@mail.gmail.com>

On Tue, 2019-08-13 at 00:43 +0000, James Nylen wrote:
> > I suppose we could consider applying a workaround like this if it has a
> > condition checking that the buffer passed in is the maximum possible
> > buffer (65535 bytes, due to iw_point::length being u16)
> 
> This is what the latest patch does (attached to my email from
> yesterday / https://lkml.org/lkml/2019/8/10/452 ).

Hmm, yes, you're right. I evidently missed the comparisons to 0xFFFF
there, sorry about that.

> If you'd like to apply it, I'm happy to make any needed revisions.
> Otherwise I'm going to have to keep patching my kernels for this
> issue, unfortunately I don't have the time to try to get wicd to
> migrate to a better solution.

Not sure which would be easier, but ok :-)

Can you please fix the patch to
 1) use /* */ style comments (see
    https://www.kernel.org/doc/html/latest/process/coding-style.html)

 2) remove extra braces (also per coding style)

 3) use U16_MAX instead of 0xFFFF

I'd also consider renaming "maybe_current_ev" to "next_ev" or something
shorter anyway, and would probably argue that rewriting this

> +		if (IS_ERR(maybe_current_ev)) {
> +			err = PTR_ERR(maybe_current_ev);
> +			if (err == -E2BIG) {
> +				// Last BSS failed to copy into buffer.  As
> +				// above, only report an error if `iwlist` will
> +				// retry again with a larger buffer.
> +				if (len >= 0xFFFF) {
> +					err = 0;
> +				}
> +			}
>  			break;
> +		} else {
> +			current_ev = maybe_current_ev;
>  		}


to something like

	next_ev = ...
	if (IS_ERR(next_ev)) {
		err = PTR_ERR(next_ev);
		/* mask error and truncate in case buffer cannot be
                 * increased
                 */
		if (err == -E2BIG && len < U16_MAX)
			err = 0;
		break;
	}

	current_ev = next_ev;


could be more readable, but that's just editorial really.

Thanks,
johannes


^ permalink raw reply

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-21  7:37 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620B6ACB01BFA338ADAED048BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>



> -----Original Message-----
> From: Vakul Garg
> Sent: Tuesday, August 20, 2019 4:08 PM
> To: Florian Westphal <fw@strlen.de>
> Cc: netdev@vger.kernel.org
> Subject: RE: Help needed - Kernel lockup while running ipsec
> 
> 
> 
> >
> > > -----Original Message-----
> > > From: Florian Westphal <fw@strlen.de>
> > > Sent: Tuesday, August 20, 2019 3:08 PM
> > > To: Vakul Garg <vakul.garg@nxp.com>
> > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > >
> > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Florian Westphal <fw@strlen.de>
> > > > > Sent: Tuesday, August 20, 2019 2:53 PM
> > > > > To: Vakul Garg <vakul.garg@nxp.com>
> > > > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > > > >
> > > > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > > > > running single
> > > > > > > static ipsec tunnel.
> > > > > > > > The problem reproduces mostly after running 8-10 hours of
> > > > > > > > ipsec encap
> > > > > > > test (on my dual core arm board).
> > > > > > > >
> > > > > > > > I found that in function xfrm_policy_lookup_bytype(), the
> > > > > > > > policy in variable
> > > > > > > 'ret' shows refcnt=0 under problem situation.
> > > > > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype()
> > > > > > > > and hence the
> > > > > > > lockup.
> > > > > > > >
> > > > > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > > > > Is it legitimate for 'refcnt' to become '0'? Under what
> > > > > > > > condition can it
> > > > > > > become '0'?
> > > > > > >
> > > > > > > Yes, when policy is destroyed and the last user calls
> > > > > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > > > > >
> > > > > > It seems that policy reference count never gets decremented during
> > > > > > packet
> > > > > ipsec encap.
> > > > > > It is getting incremented for every frame that hits the policy.
> > > > > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> > > > >
> > > > > Thats a bug.  Does this affect 4.14 only or does this happen on
> > > > > current tree as well?
> > > >
> > > > I am yet to try it on 4.19.
> > > > Can you help me with the right fix? Which part of code should it get
> > > decremented?
> > > > I am not conversant with xfrm code.
> > >
> > > Normally policy reference counts get decremented when the skb is
> free'd,
> > via
> > > dst destruction (xfrm_dst_destroy()).
> > >
> > > Do you see a dst leak as well?
> >
> > Can you please guide me how to detect it?
> >
> > (I am checking refcount on recent kernel and will let you know.)
> 
> Policy refcount is decreasing properly on 4.19.
> Same should be on the latest kernel too.

On kernel-4.14, I find dst_release() is getting called through xfrm_output_one().
However since dst->__refcnt gets decremented to '1', 
the call_rcu(&dst->rcu_head, dst_destroy_rcu) is not invoked. 

On kernel-4.19, dst->__refcnt gets decremented to '0', hence things fall in place and 
dst_destroy_rcu() eventually executes.

Any further help/pointers for kernel-4.14 would be deeply appreciated.




^ permalink raw reply

* [PATCH net-next 7/7] selftests: mlxsw: Add a test case for devlink-trap
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test generic devlink-trap functionality over mlxsw. These tests are not
specific to a single trap, but do not check the devlink-trap common
infrastructure either.

Currently, the only test case is device deletion (by reloading the
driver) while packets are being trapped.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../drivers/net/mlxsw/devlink_trap.sh         | 129 ++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
new file mode 100755
index 000000000000..89b55e946eed
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
@@ -0,0 +1,129 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test generic devlink-trap functionality over mlxsw. These tests are not
+# specific to a single trap, but do not check the devlink-trap common
+# infrastructure either.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	dev_del_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+}
+
+switch_destroy()
+{
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+dev_del_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local num_iter=5
+	local mz_pid
+	local i
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -q &
+	mz_pid=$!
+
+	# The purpose of this test is to make sure we correctly dismantle a
+	# port while packets are trapped from it. This is done by reloading the
+	# the driver while the 'ingress_smac_mc_drop' trap is triggered.
+	RET=0
+
+	for i in $(seq 1 $num_iter); do
+		log_info "Iteration $i / $num_iter"
+
+		devlink_trap_action_set $trap_name "trap"
+		sleep 1
+
+		devlink_reload
+		# Allow netdevices to be re-created following the reload
+		sleep 20
+
+		cleanup
+		setup_prepare
+		setup_wait
+	done
+
+	log_test "Device delete"
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 6/7] selftests: mlxsw: Add test cases for devlink-trap L2 drops
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test that each supported packet trap is triggered under the right
conditions and that packets are indeed dropped and not forwarded.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../net/mlxsw/devlink_trap_l2_drops.sh        | 484 ++++++++++++++++++
 1 file changed, 484 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
new file mode 100755
index 000000000000..5dcdfa20fc6c
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
@@ -0,0 +1,484 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test devlink-trap L2 drops functionality over mlxsw. Each registered L2 drop
+# packet trap is tested to make sure it is triggered under the right
+# conditions.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	source_mac_is_multicast_test
+	vlan_tag_mismatch_test
+	ingress_vlan_filter_test
+	ingress_stp_filter_test
+	port_list_is_empty_test
+	port_loopback_filter_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+
+	tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+	tc qdisc del dev $swp2 clsact
+
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+l2_drops_test()
+{
+	local trap_name=$1; shift
+	local group_name=$1; shift
+
+	# This is the common part of all the tests. It checks that stats are
+	# initially idle, then non-idle after changing the trap action and
+	# finally idle again. It also makes sure the packets are dropped and
+	# never forwarded.
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle with initial drop action"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with initial drop action"
+
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_fail $? "Trap stats idle after setting action to trap"
+	devlink_trap_group_stats_idle_test $group_name
+	check_fail $? "Trap group stats idle after setting action to trap"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle after setting action to drop"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle after setting action to drop"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_err $? "Packets were not dropped"
+}
+
+l2_drops_cleanup()
+{
+	local mz_pid=$1; shift
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+	tc filter del dev $swp2 egress protocol ip pref 1 handle 101 flower
+}
+
+source_mac_is_multicast_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local group_name="l2_drops"
+	local mz_pid
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower src_mac $smac action drop
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -d 1msec -q &
+	mz_pid=$!
+
+	RET=0
+
+	l2_drops_test $trap_name $group_name
+
+	log_test "Source MAC is multicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+__vlan_tag_mismatch_test()
+{
+	local trap_name="vlan_tag_mismatch"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local opt=$1; shift
+	local mz_pid
+
+	# Remove PVID flag. This should prevent untagged and prio-tagged
+	# packets from entering the bridge.
+	bridge vlan add vid 1 dev $swp1 untagged master
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 "$opt" -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add PVID and make sure packets are no longer dropped.
+	bridge vlan add vid 1 dev $swp1 pvid untagged master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+}
+
+vlan_tag_mismatch_untagged_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test
+
+	log_test "VLAN tag mismatch - untagged packets"
+}
+
+vlan_tag_mismatch_vid_0_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test "-Q 0"
+
+	log_test "VLAN tag mismatch - prio-tagged packets"
+}
+
+vlan_tag_mismatch_test()
+{
+	vlan_tag_mismatch_untagged_test
+	vlan_tag_mismatch_vid_0_test
+}
+
+ingress_vlan_filter_test()
+{
+	local trap_name="ingress_vlan_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+	local vid=10
+
+	bridge vlan add vid $vid dev $swp2 master
+	# During initialization the firmware enables all the VLAN filters and
+	# the driver does not turn them off since the traffic will be discarded
+	# by the STP filter whose default is DISCARD state. Add the VID on the
+	# ingress bridge port and then remove it to make sure it is not member
+	# in the VLAN.
+	bridge vlan add vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp1 master
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add the VLAN on the bridge port and make sure packets are no longer
+	# dropped.
+	bridge vlan add vid $vid dev $swp1 master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Ingress VLAN filter"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+__ingress_stp_filter_test()
+{
+	local trap_name="ingress_spanning_tree_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local state=$1; shift
+	local mz_pid
+	local vid=20
+
+	bridge vlan add vid $vid dev $swp2 master
+	bridge vlan add vid $vid dev $swp1 master
+	ip link set dev $swp1 type bridge_slave state $state
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Change STP state to forwarding and make sure packets are no longer
+	# dropped.
+	ip link set dev $swp1 type bridge_slave state 3
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+ingress_stp_filter_listening_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - listening state"
+}
+
+ingress_stp_filter_learning_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - learning state"
+}
+
+ingress_stp_filter_test()
+{
+	ingress_stp_filter_listening_test 1
+	ingress_stp_filter_learning_test 2
+}
+
+port_list_is_empty_uc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Disable unicast flooding on both ports, so that packets cannot egress
+	# any port.
+	ip link set dev $swp1 type bridge_slave flood off
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - unicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave flood on
+}
+
+port_list_is_empty_mc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=01:00:5e:00:00:01
+	local group_name="l2_drops"
+	local dip=239.0.0.1
+	local mz_pid
+
+	# Disable multicast flooding on both ports, so that packets cannot
+	# egress any port. We also need to flush IP addresses from the bridge
+	# in order to prevent packets from being flooded to the router port.
+	ip link set dev $swp1 type bridge_slave mcast_flood off
+	ip link set dev $swp2 type bridge_slave mcast_flood off
+	ip address flush dev br0
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -B $dip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave mcast_flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - multicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave mcast_flood on
+}
+
+port_list_is_empty_test()
+{
+	port_list_is_empty_uc_test
+	port_list_is_empty_mc_test
+}
+
+port_loopback_filter_uc_test()
+{
+	local trap_name="port_loopback_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Make sure packets can only egress the input port.
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port loopback filter - unicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+port_loopback_filter_test()
+{
+	port_loopback_filter_uc_test
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 5/7] mlxsw: spectrum: Add devlink-trap support
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Register supported packet traps (layer 2 drops only, currently) and
associated trap group with devlink during driver initialization.

The amount of traffic generated by these packet drop traps is capped at
10Kpps to ensure the CPU is not overwhelmed by incoming packets.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/core.c    |  52 ++++
 drivers/net/ethernet/mellanox/mlxsw/core.h    |   9 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  21 ++
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  13 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 267 ++++++++++++++++++
 6 files changed, 363 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index 171b36bd8a4e..0e86a581d45b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -29,7 +29,7 @@ mlxsw_spectrum-objs		:= spectrum.o spectrum_buffers.o \
 				   spectrum_mr_tcam.o spectrum_mr.o \
 				   spectrum_qdisc.o spectrum_span.o \
 				   spectrum_nve.o spectrum_nve_vxlan.o \
-				   spectrum_dpipe.o
+				   spectrum_dpipe.o spectrum_trap.o
 mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB)	+= spectrum_dcb.o
 mlxsw_spectrum-$(CONFIG_PTP_1588_CLOCK)		+= spectrum_ptp.o
 obj-$(CONFIG_MLXSW_MINIMAL)	+= mlxsw_minimal.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 6ec07ecfb5f6..963a2b4b61b1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1017,6 +1017,54 @@ static int mlxsw_devlink_flash_update(struct devlink *devlink,
 					  component, extack);
 }
 
+static int mlxsw_devlink_trap_init(struct devlink *devlink,
+				   const struct devlink_trap *trap,
+				   void *trap_ctx)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_init)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_init(mlxsw_core, trap, trap_ctx);
+}
+
+static void mlxsw_devlink_trap_fini(struct devlink *devlink,
+				    const struct devlink_trap *trap,
+				    void *trap_ctx)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_fini)
+		return;
+	mlxsw_driver->trap_fini(mlxsw_core, trap, trap_ctx);
+}
+
+static int mlxsw_devlink_trap_action_set(struct devlink *devlink,
+					 const struct devlink_trap *trap,
+					 enum devlink_trap_action action)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_action_set)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_action_set(mlxsw_core, trap, action);
+}
+
+static int
+mlxsw_devlink_trap_group_init(struct devlink *devlink,
+			      const struct devlink_trap_group *group)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_group_init)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_group_init(mlxsw_core, group);
+}
+
 static const struct devlink_ops mlxsw_devlink_ops = {
 	.reload				= mlxsw_devlink_core_bus_device_reload,
 	.port_type_set			= mlxsw_devlink_port_type_set,
@@ -1034,6 +1082,10 @@ static const struct devlink_ops mlxsw_devlink_ops = {
 	.sb_occ_tc_port_bind_get	= mlxsw_devlink_sb_occ_tc_port_bind_get,
 	.info_get			= mlxsw_devlink_info_get,
 	.flash_update			= mlxsw_devlink_flash_update,
+	.trap_init			= mlxsw_devlink_trap_init,
+	.trap_fini			= mlxsw_devlink_trap_fini,
+	.trap_action_set		= mlxsw_devlink_trap_action_set,
+	.trap_group_init		= mlxsw_devlink_trap_group_init,
 };
 
 static int
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 19cea16c30bb..b65a17d49e43 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -292,6 +292,15 @@ struct mlxsw_driver {
 	int (*flash_update)(struct mlxsw_core *mlxsw_core,
 			    const char *file_name, const char *component,
 			    struct netlink_ext_ack *extack);
+	int (*trap_init)(struct mlxsw_core *mlxsw_core,
+			 const struct devlink_trap *trap, void *trap_ctx);
+	void (*trap_fini)(struct mlxsw_core *mlxsw_core,
+			  const struct devlink_trap *trap, void *trap_ctx);
+	int (*trap_action_set)(struct mlxsw_core *mlxsw_core,
+			       const struct devlink_trap *trap,
+			       enum devlink_trap_action action);
+	int (*trap_group_init)(struct mlxsw_core *mlxsw_core,
+			       const struct devlink_trap_group *group);
 	void (*txhdr_construct)(struct sk_buff *skb,
 				const struct mlxsw_tx_info *tx_info);
 	int (*resources_register)(struct mlxsw_core *mlxsw_core);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 389861ece418..7de9833fc60b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4665,6 +4665,12 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
 		goto err_traps_init;
 	}
 
+	err = mlxsw_sp_devlink_traps_init(mlxsw_sp);
+	if (err) {
+		dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize devlink traps\n");
+		goto err_devlink_traps_init;
+	}
+
 	err = mlxsw_sp_buffers_init(mlxsw_sp);
 	if (err) {
 		dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize buffers\n");
@@ -4798,6 +4804,8 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
 err_lag_init:
 	mlxsw_sp_buffers_fini(mlxsw_sp);
 err_buffers_init:
+	mlxsw_sp_devlink_traps_fini(mlxsw_sp);
+err_devlink_traps_init:
 	mlxsw_sp_traps_fini(mlxsw_sp);
 err_traps_init:
 	mlxsw_sp_fids_fini(mlxsw_sp);
@@ -4870,6 +4878,7 @@ static void mlxsw_sp_fini(struct mlxsw_core *mlxsw_core)
 	mlxsw_sp_span_fini(mlxsw_sp);
 	mlxsw_sp_lag_fini(mlxsw_sp);
 	mlxsw_sp_buffers_fini(mlxsw_sp);
+	mlxsw_sp_devlink_traps_fini(mlxsw_sp);
 	mlxsw_sp_traps_fini(mlxsw_sp);
 	mlxsw_sp_fids_fini(mlxsw_sp);
 	mlxsw_sp_kvdl_fini(mlxsw_sp);
@@ -5251,6 +5260,10 @@ static struct mlxsw_driver mlxsw_sp1_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp1_resources_register,
 	.kvd_sizes_get			= mlxsw_sp_kvd_sizes_get,
@@ -5281,6 +5294,10 @@ static struct mlxsw_driver mlxsw_sp2_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp2_resources_register,
 	.params_register		= mlxsw_sp2_params_register,
@@ -5310,6 +5327,10 @@ static struct mlxsw_driver mlxsw_sp3_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp2_resources_register,
 	.params_register		= mlxsw_sp2_params_register,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index db17ba35ec84..20c14bba9ccb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -958,4 +958,17 @@ void mlxsw_sp_nve_fini(struct mlxsw_sp *mlxsw_sp);
 int mlxsw_sp_nve_inc_parsing_depth_get(struct mlxsw_sp *mlxsw_sp);
 void mlxsw_sp_nve_inc_parsing_depth_put(struct mlxsw_sp *mlxsw_sp);
 
+/* spectrum_trap.c */
+int mlxsw_sp_devlink_traps_init(struct mlxsw_sp *mlxsw_sp);
+void mlxsw_sp_devlink_traps_fini(struct mlxsw_sp *mlxsw_sp);
+int mlxsw_sp_trap_init(struct mlxsw_core *mlxsw_core,
+		       const struct devlink_trap *trap, void *trap_ctx);
+void mlxsw_sp_trap_fini(struct mlxsw_core *mlxsw_core,
+			const struct devlink_trap *trap, void *trap_ctx);
+int mlxsw_sp_trap_action_set(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap *trap,
+			     enum devlink_trap_action action);
+int mlxsw_sp_trap_group_init(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap_group *group);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c
new file mode 100644
index 000000000000..899450b28621
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2019 Mellanox Technologies. All rights reserved */
+
+#include <linux/kernel.h>
+#include <net/devlink.h>
+#include <uapi/linux/devlink.h>
+
+#include "core.h"
+#include "reg.h"
+#include "spectrum.h"
+
+#define MLXSW_SP_TRAP_METADATA DEVLINK_TRAP_METADATA_TYPE_F_IN_PORT
+
+static void mlxsw_sp_rx_drop_listener(struct sk_buff *skb, u8 local_port,
+				      void *priv);
+
+#define MLXSW_SP_TRAP_DROP(_id, _group_id)				      \
+	DEVLINK_TRAP_GENERIC(DROP, DROP, _id,				      \
+			     DEVLINK_TRAP_GROUP_GENERIC(_group_id),	      \
+			     MLXSW_SP_TRAP_METADATA)
+
+#define MLXSW_SP_RXL_DISCARD(_id, _group_id)				      \
+	MLXSW_RXL(mlxsw_sp_rx_drop_listener, DISCARD_##_id, SET_FW_DEFAULT,   \
+		  false, SP_##_group_id, DISCARD)
+
+static struct devlink_trap mlxsw_sp_traps_arr[] = {
+	MLXSW_SP_TRAP_DROP(SMAC_MC, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(VLAN_TAG_MISMATCH, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(INGRESS_VLAN_FILTER, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(INGRESS_STP_FILTER, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(EMPTY_TX_LIST, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(PORT_LOOPBACK_FILTER, L2_DROPS),
+};
+
+static struct mlxsw_listener mlxsw_sp_listeners_arr[] = {
+	MLXSW_SP_RXL_DISCARD(ING_PACKET_SMAC_MC, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_VTAG_ALLOW, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_VLAN, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_STP, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_UC, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_MC_NULL, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_LB, L2_DISCARDS),
+};
+
+/* Mapping between hardware trap and devlink trap. Multiple hardware traps can
+ * be mapped to the same devlink trap. Order is according to
+ * 'mlxsw_sp_listeners_arr'.
+ */
+static u16 mlxsw_sp_listener_devlink_map[] = {
+	DEVLINK_TRAP_GENERIC_ID_SMAC_MC,
+	DEVLINK_TRAP_GENERIC_ID_VLAN_TAG_MISMATCH,
+	DEVLINK_TRAP_GENERIC_ID_INGRESS_VLAN_FILTER,
+	DEVLINK_TRAP_GENERIC_ID_INGRESS_STP_FILTER,
+	DEVLINK_TRAP_GENERIC_ID_EMPTY_TX_LIST,
+	DEVLINK_TRAP_GENERIC_ID_EMPTY_TX_LIST,
+	DEVLINK_TRAP_GENERIC_ID_PORT_LOOPBACK_FILTER,
+};
+
+static int mlxsw_sp_rx_listener(struct mlxsw_sp *mlxsw_sp, struct sk_buff *skb,
+				u8 local_port,
+				struct mlxsw_sp_port *mlxsw_sp_port)
+{
+	struct mlxsw_sp_port_pcpu_stats *pcpu_stats;
+
+	if (unlikely(!mlxsw_sp_port)) {
+		dev_warn_ratelimited(mlxsw_sp->bus_info->dev, "Port %d: skb received for non-existent port\n",
+				     local_port);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	skb->dev = mlxsw_sp_port->dev;
+
+	pcpu_stats = this_cpu_ptr(mlxsw_sp_port->pcpu_stats);
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->rx_packets++;
+	pcpu_stats->rx_bytes += skb->len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
+	skb->protocol = eth_type_trans(skb, skb->dev);
+
+	return 0;
+}
+
+static void mlxsw_sp_rx_drop_listener(struct sk_buff *skb, u8 local_port,
+				      void *trap_ctx)
+{
+	struct devlink_port *in_devlink_port;
+	struct mlxsw_sp_port *mlxsw_sp_port;
+	struct mlxsw_sp *mlxsw_sp;
+	struct devlink *devlink;
+
+	mlxsw_sp = devlink_trap_ctx_priv(trap_ctx);
+	mlxsw_sp_port = mlxsw_sp->ports[local_port];
+
+	if (mlxsw_sp_rx_listener(mlxsw_sp, skb, local_port, mlxsw_sp_port))
+		return;
+
+	devlink = priv_to_devlink(mlxsw_sp->core);
+	in_devlink_port = mlxsw_core_port_devlink_port_get(mlxsw_sp->core,
+							   local_port);
+	devlink_trap_report(devlink, skb, trap_ctx, in_devlink_port);
+	consume_skb(skb);
+}
+
+int mlxsw_sp_devlink_traps_init(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	if (WARN_ON(ARRAY_SIZE(mlxsw_sp_listener_devlink_map) !=
+		    ARRAY_SIZE(mlxsw_sp_listeners_arr)))
+		return -EINVAL;
+
+	return devlink_traps_register(devlink, mlxsw_sp_traps_arr,
+				      ARRAY_SIZE(mlxsw_sp_traps_arr),
+				      mlxsw_sp);
+}
+
+void mlxsw_sp_devlink_traps_fini(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	devlink_traps_unregister(devlink, mlxsw_sp_traps_arr,
+				 ARRAY_SIZE(mlxsw_sp_traps_arr));
+}
+
+int mlxsw_sp_trap_init(struct mlxsw_core *mlxsw_core,
+		       const struct devlink_trap *trap, void *trap_ctx)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		struct mlxsw_listener *listener;
+		int err;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		err = mlxsw_core_trap_register(mlxsw_core, listener, trap_ctx);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+void mlxsw_sp_trap_fini(struct mlxsw_core *mlxsw_core,
+			const struct devlink_trap *trap, void *trap_ctx)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		struct mlxsw_listener *listener;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		mlxsw_core_trap_unregister(mlxsw_core, listener, trap_ctx);
+	}
+}
+
+int mlxsw_sp_trap_action_set(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap *trap,
+			     enum devlink_trap_action action)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		enum mlxsw_reg_hpkt_action hw_action;
+		struct mlxsw_listener *listener;
+		int err;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		switch (action) {
+		case DEVLINK_TRAP_ACTION_DROP:
+			hw_action = MLXSW_REG_HPKT_ACTION_SET_FW_DEFAULT;
+			break;
+		case DEVLINK_TRAP_ACTION_TRAP:
+			hw_action = MLXSW_REG_HPKT_ACTION_TRAP_EXCEPTION_TO_CPU;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		err = mlxsw_core_trap_action_set(mlxsw_core, listener,
+						 hw_action);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+#define MLXSW_SP_DISCARD_POLICER_ID	(MLXSW_REG_HTGT_TRAP_GROUP_MAX + 1)
+
+static int
+mlxsw_sp_trap_group_policer_init(struct mlxsw_sp *mlxsw_sp,
+				 const struct devlink_trap_group *group)
+{
+	enum mlxsw_reg_qpcr_ir_units ir_units;
+	char qpcr_pl[MLXSW_REG_QPCR_LEN];
+	u16 policer_id;
+	u8 burst_size;
+	bool is_bytes;
+	u32 rate;
+
+	switch (group->id) {
+	case DEVLINK_TRAP_GROUP_GENERIC_ID_L2_DROPS:
+		policer_id = MLXSW_SP_DISCARD_POLICER_ID;
+		ir_units = MLXSW_REG_QPCR_IR_UNITS_M;
+		is_bytes = false;
+		rate = 10 * 1024; /* 10Kpps */
+		burst_size = 7;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	mlxsw_reg_qpcr_pack(qpcr_pl, policer_id, ir_units, is_bytes, rate,
+			    burst_size);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(qpcr), qpcr_pl);
+}
+
+static int
+__mlxsw_sp_trap_group_init(struct mlxsw_sp *mlxsw_sp,
+			   const struct devlink_trap_group *group)
+{
+	char htgt_pl[MLXSW_REG_HTGT_LEN];
+	u8 priority, tc, group_id;
+	u16 policer_id;
+
+	switch (group->id) {
+	case DEVLINK_TRAP_GROUP_GENERIC_ID_L2_DROPS:
+		group_id = MLXSW_REG_HTGT_TRAP_GROUP_SP_L2_DISCARDS;
+		policer_id = MLXSW_SP_DISCARD_POLICER_ID;
+		priority = 0;
+		tc = 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	mlxsw_reg_htgt_pack(htgt_pl, group_id, policer_id, priority, tc);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(htgt), htgt_pl);
+}
+
+int mlxsw_sp_trap_group_init(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap_group *group)
+{
+	struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
+	int err;
+
+	err = mlxsw_sp_trap_group_policer_init(mlxsw_sp, group);
+	if (err)
+		return err;
+
+	err = __mlxsw_sp_trap_group_init(mlxsw_sp, group);
+	if (err)
+		return err;
+
+	return 0;
+}
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 4/7] mlxsw: Add trap group for layer 2 discards
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Discard trap groups are defined in a different enum so that they could
all share the same policer ID: MLXSW_REG_HTGT_TRAP_GROUP_MAX + 1.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 59e296562b5a..baa20cdd65df 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5422,6 +5422,14 @@ enum mlxsw_reg_htgt_trap_group {
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_LBERROR,
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_PTP0,
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_PTP1,
+
+	__MLXSW_REG_HTGT_TRAP_GROUP_MAX,
+	MLXSW_REG_HTGT_TRAP_GROUP_MAX = __MLXSW_REG_HTGT_TRAP_GROUP_MAX - 1
+};
+
+enum mlxsw_reg_htgt_discard_trap_group {
+	MLXSW_REG_HTGT_DISCARD_TRAP_GROUP_BASE = MLXSW_REG_HTGT_TRAP_GROUP_MAX,
+	MLXSW_REG_HTGT_TRAP_GROUP_SP_L2_DISCARDS,
 };
 
 /* reg_htgt_trap_group
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 3/7] mlxsw: Add layer 2 discard trap IDs
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Add the trap IDs used to report layer 2 drops.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/trap.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index 19202bdb5105..7618f084cae9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -66,6 +66,13 @@ enum {
 	MLXSW_TRAP_ID_NVE_ENCAP_ARP = 0xBD,
 	MLXSW_TRAP_ID_ROUTER_ALERT_IPV4 = 0xD6,
 	MLXSW_TRAP_ID_ROUTER_ALERT_IPV6 = 0xD7,
+	MLXSW_TRAP_ID_DISCARD_ING_PACKET_SMAC_MC = 0x140,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_VTAG_ALLOW = 0x148,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_VLAN = 0x149,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_STP = 0x14A,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_UC = 0x150,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_MC_NULL = 0x151,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_LB = 0x152,
 	MLXSW_TRAP_ID_ACL0 = 0x1C0,
 	/* Multicast trap used for routes with trap action */
 	MLXSW_TRAP_ID_ACL1 = 0x1C1,
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 2/7] mlxsw: reg: Add new trap actions
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Subsequent patches will add discard traps support in mlxsw. The driver
cannot configure such traps with a normal trap action, but needs to use
exception trap action, which also increments an error counter.

On the other hand, when these traps are initialized or set to drop
action, they should use the default drop action set by the firmware.
This guarantees that when the feature is disabled we get the exact same
behavior as before the feature was introduced.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index ead36702549a..59e296562b5a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5559,6 +5559,8 @@ enum mlxsw_reg_hpkt_action {
 	MLXSW_REG_HPKT_ACTION_DISCARD,
 	MLXSW_REG_HPKT_ACTION_SOFT_DISCARD,
 	MLXSW_REG_HPKT_ACTION_TRAP_AND_SOFT_DISCARD,
+	MLXSW_REG_HPKT_ACTION_TRAP_EXCEPTION_TO_CPU,
+	MLXSW_REG_HPKT_ACTION_SET_FW_DEFAULT = 15,
 };
 
 /* reg_hpkt_action
@@ -5569,6 +5571,8 @@ enum mlxsw_reg_hpkt_action {
  * 3 - Discard.
  * 4 - Soft discard (allow other traps to act on the packet).
  * 5 - Trap and soft discard (allow other traps to overwrite this trap).
+ * 6 - Trap to CPU (CPU receives sole copy) and count it as error.
+ * 15 - Restore the firmware's default action.
  * Access: RW
  *
  * Note: Must be set to 0 (forward) for event trap IDs, as they are already
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 1/7] mlxsw: core: Add API to set trap action
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Up until now the action of a trap was never changed during its lifetime.
This is going to change by subsequent patches that will allow devlink to
control the action of certain traps.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 12 ++++++++++++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  3 +++
 2 files changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 17ceac7505e5..6ec07ecfb5f6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1477,6 +1477,18 @@ void mlxsw_core_trap_unregister(struct mlxsw_core *mlxsw_core,
 }
 EXPORT_SYMBOL(mlxsw_core_trap_unregister);
 
+int mlxsw_core_trap_action_set(struct mlxsw_core *mlxsw_core,
+			       const struct mlxsw_listener *listener,
+			       enum mlxsw_reg_hpkt_action action)
+{
+	char hpkt_pl[MLXSW_REG_HPKT_LEN];
+
+	mlxsw_reg_hpkt_pack(hpkt_pl, action, listener->trap_id,
+			    listener->trap_group, listener->is_ctrl);
+	return mlxsw_reg_write(mlxsw_core, MLXSW_REG(hpkt), hpkt_pl);
+}
+EXPORT_SYMBOL(mlxsw_core_trap_action_set);
+
 static u64 mlxsw_core_tid_get(struct mlxsw_core *mlxsw_core)
 {
 	return atomic64_inc_return(&mlxsw_core->emad.tid);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 8efcff4b59cb..19cea16c30bb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -128,6 +128,9 @@ int mlxsw_core_trap_register(struct mlxsw_core *mlxsw_core,
 void mlxsw_core_trap_unregister(struct mlxsw_core *mlxsw_core,
 				const struct mlxsw_listener *listener,
 				void *priv);
+int mlxsw_core_trap_action_set(struct mlxsw_core *mlxsw_core,
+			       const struct mlxsw_listener *listener,
+			       enum mlxsw_reg_hpkt_action action);
 
 typedef void mlxsw_reg_trans_cb_t(struct mlxsw_core *mlxsw_core, char *payload,
 				  size_t payload_len, unsigned long cb_priv);
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 0/7] mlxsw: Add devlink-trap support
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@mellanox.com>

This patchset adds devlink-trap support in mlxsw.

Patches #1-#4 add the necessary APIs and defines in mlxsw.

Patch #5 implements devlink-trap support for layer 2 drops. More drops
will be added in the future.

Patches #6-#7 add selftests to make sure that all the new code paths are
exercised and that the feature is working as expected.

Ido Schimmel (7):
  mlxsw: core: Add API to set trap action
  mlxsw: reg: Add new trap actions
  mlxsw: Add layer 2 discard trap IDs
  mlxsw: Add trap group for layer 2 discards
  mlxsw: spectrum: Add devlink-trap support
  selftests: mlxsw: Add test cases for devlink-trap L2 drops
  selftests: mlxsw: Add a test case for devlink-trap

 drivers/net/ethernet/mellanox/mlxsw/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/core.c    |  64 +++
 drivers/net/ethernet/mellanox/mlxsw/core.h    |  12 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h     |  12 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  21 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  13 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 267 ++++++++++
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |   7 +
 .../drivers/net/mlxsw/devlink_trap.sh         | 129 +++++
 .../net/mlxsw/devlink_trap_l2_drops.sh        | 484 ++++++++++++++++++
 10 files changed, 1010 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh

-- 
2.21.0


^ permalink raw reply

* Re: [PATCH RFC ipsec-next 0/7] ipsec: add TCP encapsulation support (RFC 8229)
From: Steffen Klassert @ 2019-08-21  6:59 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: netdev, Herbert Xu
In-Reply-To: <20190816141814.GA12002@bistromath.localdomain>

On Fri, Aug 16, 2019 at 04:18:14PM +0200, Sabrina Dubroca wrote:
> Hi Steffen,
> 
> 2019-06-25, 12:11:33 +0200, Sabrina Dubroca wrote:
> > This patchset introduces support for TCP encapsulation of IKE and ESP
> > messages, as defined by RFC 8229 [0]. It is an evolution of what
> > Herbert Xu proposed in January 2018 [1] that addresses the main
> > criticism against it, by not interfering with the TCP implementation
> > at all. The networking stack now has infrastructure for this: TCP ULPs
> > and Stream Parsers.
> 
> Have you had a chance to look at this?  I was going to rebase and
> resend, but the patches still apply to ipsec-next and net-next (patch
> 2 is already in net-next as commit bd95e678e0f6).

I had a look and I have no general objection against this. If you
think the patchset is ready for inclusion, just remove the RFC and
resend it. I'll have a closer on it look then.

^ permalink raw reply

* Re: various TLS bug fixes...
From: Jakub Kicinski @ 2019-08-21  6:51 UTC (permalink / raw)
  To: John Fastabend; +Cc: David Miller, netdev
In-Reply-To: <5d5cd426e18be_67732ae0ef5705bc4@john-XPS-13-9370.notmuch>

On Tue, 20 Aug 2019 22:18:30 -0700, John Fastabend wrote:
> > > I suspect you've triaged through this already on your side for other
> > > reasons, so perhaps you could help come up with a sane set of TLS
> > > bug fix backports that would be appropriate for -stable?  
> > 
> > I'm planning to spend tomorrow working exactly on v4.19 backport. 
> > I have internal reports of openssl failing on v4.19 while v4.20 
> > works fine.. Hopefully I'll be able to figure that one out, test the
> > above and see if there are any other missing fixes.
> > 
> > Is it okay if I come back to this tomorrow?  
> 
> Is the failure with hw offload or sw case? 

SW case, strangely enough. Large file transfer, I think with openssl
client..

> If its sendpage related looks like we also need to push the following
> patch back to 4.19,
>
> commit 648ee6cea7dde4a5cdf817e5d964fd60b22006a4
> Author: John Fastabend <john.fastabend@gmail.com>
> Date:   Wed Jun 12 17:23:57 2019 +0000
> 
>     net: tls, correctly account for copied bytes with multiple sk_msgs

I had a quick look at that, but the commit in Fixes is not in v4.19.

> If you have more details I can also spend some cycles looking into it.

Awesome, I'll let you know what the details are as soon as I get them.

^ permalink raw reply

* Re: [PATCH ipsec] xfrm: policy: avoid warning splat when merging nodes
From: Steffen Klassert @ 2019-08-21  6:41 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, syzkaller-bugs, syzbot+8cc27ace5f6972910b31
In-Reply-To: <20190812083213.5010-1-fw@strlen.de>

On Mon, Aug 12, 2019 at 10:32:13AM +0200, Florian Westphal wrote:
> syzbot reported a splat:
>  xfrm_policy_inexact_list_reinsert+0x625/0x6e0 net/xfrm/xfrm_policy.c:877
>  CPU: 1 PID: 6756 Comm: syz-executor.1 Not tainted 5.3.0-rc2+ #57
>  Call Trace:
>   xfrm_policy_inexact_node_reinsert net/xfrm/xfrm_policy.c:922 [inline]
>   xfrm_policy_inexact_node_merge net/xfrm/xfrm_policy.c:958 [inline]
>   xfrm_policy_inexact_insert_node+0x537/0xb50 net/xfrm/xfrm_policy.c:1023
>   xfrm_policy_inexact_alloc_chain+0x62b/0xbd0 net/xfrm/xfrm_policy.c:1139
>   xfrm_policy_inexact_insert+0xe8/0x1540 net/xfrm/xfrm_policy.c:1182
>   xfrm_policy_insert+0xdf/0xce0 net/xfrm/xfrm_policy.c:1574
>   xfrm_add_policy+0x4cf/0x9b0 net/xfrm/xfrm_user.c:1670
>   xfrm_user_rcv_msg+0x46b/0x720 net/xfrm/xfrm_user.c:2676
>   netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2477
>   xfrm_netlink_rcv+0x74/0x90 net/xfrm/xfrm_user.c:2684
>   netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
>   netlink_unicast+0x809/0x9a0 net/netlink/af_netlink.c:1328
>   netlink_sendmsg+0xa70/0xd30 net/netlink/af_netlink.c:1917
>   sock_sendmsg_nosec net/socket.c:637 [inline]
>   sock_sendmsg net/socket.c:657 [inline]
> 
> There is no reproducer, however, the warning can be reproduced
> by adding rules with ever smaller prefixes.
> 
> The sanity check ("does the policy match the node") uses the prefix value
> of the node before its updated to the smaller value.
> 
> To fix this, update the prefix earlier.  The bug has no impact on tree
> correctness, this is only to prevent a false warning.
> 
> Reported-by: syzbot+8cc27ace5f6972910b31@syzkaller.appspotmail.com
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied, thanks Florian!

^ permalink raw reply

* Re: INFO: task hung in tls_sw_release_resources_tx
From: Steffen Klassert @ 2019-08-21  6:37 UTC (permalink / raw)
  To: Jakub Kicinski, syzbot, ast, aviadye, borisp, bpf, daniel,
	davejwatson, davem, hdanton, john.fastabend, netdev,
	syzkaller-bugs, herbert, linux-crypto
In-Reply-To: <20190817054743.GE8209@sol.localdomain>

On Fri, Aug 16, 2019 at 10:47:43PM -0700, Eric Biggers wrote:
> [+Steffen, who is the maintainer of pcrypt]
> 
> On Fri, Aug 16, 2019 at 07:02:34PM -0700, Jakub Kicinski wrote:
> > On Thu, 15 Aug 2019 11:06:00 -0700, syzbot wrote:
> > > syzbot has bisected this bug to:
> > > 
> > > commit 130b392c6cd6b2aed1b7eb32253d4920babb4891
> > > Author: Dave Watson <davejwatson@fb.com>
> > > Date:   Wed Jan 30 21:58:31 2019 +0000
> > > 
> > >      net: tls: Add tls 1.3 support
> > > 
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=118e8dee600000
> > > start commit:   6d5afe20 sctp: fix memleak in sctp_send_reset_streams
> > > git tree:       net
> > > final crash:    https://syzkaller.appspot.com/x/report.txt?x=138e8dee600000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=158e8dee600000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=a4c9e9f08e9e8960
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=6a9ff159672dfbb41c95
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17cb0502600000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14d5dc22600000
> > > 
> > > Reported-by: syzbot+6a9ff159672dfbb41c95@syzkaller.appspotmail.com
> > > Fixes: 130b392c6cd6 ("net: tls: Add tls 1.3 support")
> > > 
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> > 
> > CC Herbert, linux-crypto
> > 
> > This is got to be something in the crypto code :S 
> > 
> > The test case opens a ktls socket and back log writes to it.
> > Then it opens a AF_ALG socket, binds "pcrypt(gcm(aes))" and dies.
> > 
> > The ktls socket upon close waits for async crypto callbacks, but they
> > never come. If I unset CRYPTO_USER_API_AEAD or change the alg to bind
> > to "gcm(aes)" the bug does not trigger.
> > 
> > Any suggestions?
> 
> Seeing as pcrypt is involved and this is a "task hung" bug, this is probably
> caused by the recursive pcrypt deadlock, which is yet to be fixed.
> 
> See the original thread for more info:
> 
> 	https://groups.google.com/forum/#!msg/syzkaller-bugs/1_CXUd3gBcg/BvsRLH0lAgAJ
> 
> And the syzbot dashboard link:
> 
> 	https://syzkaller.appspot.com/bug?id=178f2528d10720d563091fb51dceb4cb20f75525
> 
> Let's tell syzbot this is a duplicate:
> 
> #syz dup: INFO: task hung in aead_recvmsg
> 
> 
> Steffen, do you have any plan to fix this?

I've tried to use different padata instances for each pcrypt template,
but then each pcrypt template needs to expose its cpumask configuration
to a new file in /sys/kernel/pcrypt/. Currently we have one file
there for the encrytion and on for the decryption cpumask. If we have
more than these two files, we need some naming convention to now which
pcrypt template we want to configure. That would be a bit odd because
a such a nested pcrypt in pcrypt algorithm would not make sense at all.

I still think we should somehow forbid these nested configurations.
If I remember correct, the only objection to your original patch
was that it would still deadlock if an underlying algorithm uses
pcrypt as a fallback.

Maybe we can use your patch and also refuse instanitating if an
underlying algorithm needs a fallback.

The patch would look like this then:

Subject: [PATCH] crypto: pcrypt - forbid recursive instantiation

If the pcrypt template is used multiple times in an algorithm, then a
deadlock occurs because all pcrypt instances share the same
padata_instance, which completes requests in the order submitted.  That
is, the inner pcrypt request waits for the outer pcrypt request while
the outer request is already waiting for the inner.

Fix this by making pcrypt forbid instantiation if pcrypt appears in the
underlying ->cra_driver_name and if an underlying algorithm needs a
fallback.  This is somewhat of a hack, but it's a simple fix that should
be sufficient to prevent the deadlock.

Reproducer:

	#include <linux/if_alg.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main()
	{
		struct sockaddr_alg addr = {
			.salg_type = "aead",
			.salg_name = "pcrypt(pcrypt(rfc4106-gcm-aesni))"
		};
		int algfd, reqfd;
		char buf[32] = { 0 };

		algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
		bind(algfd, (void *)&addr, sizeof(addr));
		setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, 20);
		reqfd = accept(algfd, 0, 0);
		write(reqfd, buf, 32);
		read(reqfd, buf, 16);
	}

Reported-by: syzbot+56c7151cad94eec37c521f0e47d2eee53f9361c4@syzkaller.appspotmail.com
Fixes: 5068c7a883d1 ("crypto: pcrypt - Add pcrypt crypto parallelization wrapper")
Cc: <stable@vger.kernel.org> # v2.6.34+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 crypto/pcrypt.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 543792e0ebf0..932a77b61b47 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -198,6 +198,12 @@ static void pcrypt_free(struct aead_instance *inst)
 static int pcrypt_init_instance(struct crypto_instance *inst,
 				struct crypto_alg *alg)
 {
+	/* Recursive pcrypt deadlocks due to the shared padata_instance */
+	if (!strncmp(alg->cra_driver_name, "pcrypt(", 7) ||
+	    strstr(alg->cra_driver_name, "(pcrypt(") ||
+	    strstr(alg->cra_driver_name, ",pcrypt("))
+		return -EINVAL;
+
 	if (snprintf(inst->alg.cra_driver_name, CRYPTO_MAX_ALG_NAME,
 		     "pcrypt(%s)", alg->cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
 		return -ENAMETOOLONG;
@@ -236,7 +242,7 @@ static int pcrypt_create_aead(struct crypto_template *tmpl, struct rtattr **tb,
 	ctx = aead_instance_ctx(inst);
 	crypto_set_aead_spawn(&ctx->spawn, aead_crypto_instance(inst));
 
-	err = crypto_grab_aead(&ctx->spawn, name, 0, 0);
+	err = crypto_grab_aead(&ctx->spawn, name, 0, CRYPTO_ALG_NEED_FALLBACK);
 	if (err)
 		goto out_free_inst;
 
-- 
2.17.1


^ permalink raw reply related

* Re: [pull request][net-next v2 00/16] Mellanox, mlx5 devlink RX health reporters
From: David Miller @ 2019-08-21  6:36 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20190820202352.2995-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 20 Aug 2019 20:24:10 +0000

> This series is adding a new devlink health reporter for RX related
> errors from Aya.
> 
> Last two patches from Vlad and Gavi, are trivial fixes for previously
> submitted patches on this release cycle.
> 
> v1->v2:
>  - Improve reversed xmas tree variable declaration.
>  - Rebase on top of net-next to avoid a new conflict due to latest 
>    merge with net.
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there is any problem.

Pulled, thanks Saeed.

^ permalink raw reply

* [PATCH net-next] net: dsa: remove bitmap operations
From: Vivien Didelot @ 2019-08-21  6:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, f.fainelli, andrew, olteanv, Vivien Didelot

The bitmap operations were introduced to simplify the switch drivers
in the future, since most of them could implement the common VLAN and
MDB operations (add, del, dump) with simple functions taking all target
ports at once, and thus limiting the number of hardware accesses.

Programming an MDB or VLAN this way in a single operation would clearly
simplify the drivers a lot but would require a new get-set interface
in DSA. The usage of such bitmap from the stack also raised concerned
in the past, leading to the dynamic allocation of a new ds->_bitmap
member in the dsa_switch structure. So let's get rid of them for now.

This commit nicely wraps the ds->ops->port_{mdb,vlan}_{prepare,add}
switch operations into new dsa_switch_{mdb,vlan}_{prepare,add}
variants not using any bitmap argument anymore.

New dsa_switch_{mdb,vlan}_match helpers have been introduced to make
clear which local port of a switch must be programmed with the target
object. While the targeted user port is an obvious candidate, the
DSA links must also be programmed, as well as the CPU port for VLANs.

While at it, also remove local variables that are only used once.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
---
 include/net/dsa.h |   3 --
 net/dsa/dsa2.c    |  14 -----
 net/dsa/switch.c  | 132 +++++++++++++++++++++-------------------------
 3 files changed, 59 insertions(+), 90 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 147b757ef8ea..96acb14ec1a8 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -275,9 +275,6 @@ struct dsa_switch {
 	 */
 	bool			vlan_filtering;
 
-	unsigned long		*bitmap;
-	unsigned long		_bitmap;
-
 	/* Dynamically allocated ports, keep last */
 	size_t num_ports;
 	struct dsa_port ports[];
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 8c4eccb0cfe6..f8445fa73448 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -834,20 +834,6 @@ struct dsa_switch *dsa_switch_alloc(struct device *dev, size_t n)
 	if (!ds)
 		return NULL;
 
-	/* We avoid allocating memory outside dsa_switch
-	 * if it is not needed.
-	 */
-	if (n <= sizeof(ds->_bitmap) * 8) {
-		ds->bitmap = &ds->_bitmap;
-	} else {
-		ds->bitmap = devm_kcalloc(dev,
-					  BITS_TO_LONGS(n),
-					  sizeof(unsigned long),
-					  GFP_KERNEL);
-		if (unlikely(!ds->bitmap))
-			return NULL;
-	}
-
 	ds->dev = dev;
 	ds->num_ports = n;
 
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 09d9286b27cc..489eb7b430a4 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -128,57 +128,51 @@ static int dsa_switch_fdb_del(struct dsa_switch *ds,
 	return ds->ops->port_fdb_del(ds, port, info->addr, info->vid);
 }
 
-static int
-dsa_switch_mdb_prepare_bitmap(struct dsa_switch *ds,
-			      const struct switchdev_obj_port_mdb *mdb,
-			      const unsigned long *bitmap)
+static bool dsa_switch_mdb_match(struct dsa_switch *ds, int port,
+				 struct dsa_notifier_mdb_info *info)
+{
+	if (ds->index == info->sw_index && port == info->port)
+		return true;
+
+	if (dsa_is_dsa_port(ds, port))
+		return true;
+
+	return false;
+}
+
+static int dsa_switch_mdb_prepare(struct dsa_switch *ds,
+				  struct dsa_notifier_mdb_info *info)
 {
 	int port, err;
 
 	if (!ds->ops->port_mdb_prepare || !ds->ops->port_mdb_add)
 		return -EOPNOTSUPP;
 
-	for_each_set_bit(port, bitmap, ds->num_ports) {
-		err = ds->ops->port_mdb_prepare(ds, port, mdb);
-		if (err)
-			return err;
+	for (port = 0; port < ds->num_ports; port++) {
+		if (dsa_switch_mdb_match(ds, port, info)) {
+			err = ds->ops->port_mdb_prepare(ds, port, info->mdb);
+			if (err)
+				return err;
+		}
 	}
 
 	return 0;
 }
 
-static void dsa_switch_mdb_add_bitmap(struct dsa_switch *ds,
-				      const struct switchdev_obj_port_mdb *mdb,
-				      const unsigned long *bitmap)
-{
-	int port;
-
-	if (!ds->ops->port_mdb_add)
-		return;
-
-	for_each_set_bit(port, bitmap, ds->num_ports)
-		ds->ops->port_mdb_add(ds, port, mdb);
-}
-
 static int dsa_switch_mdb_add(struct dsa_switch *ds,
 			      struct dsa_notifier_mdb_info *info)
 {
-	const struct switchdev_obj_port_mdb *mdb = info->mdb;
-	struct switchdev_trans *trans = info->trans;
 	int port;
 
-	/* Build a mask of Multicast group members */
-	bitmap_zero(ds->bitmap, ds->num_ports);
-	if (ds->index == info->sw_index)
-		set_bit(info->port, ds->bitmap);
-	for (port = 0; port < ds->num_ports; port++)
-		if (dsa_is_dsa_port(ds, port))
-			set_bit(port, ds->bitmap);
+	if (switchdev_trans_ph_prepare(info->trans))
+		return dsa_switch_mdb_prepare(ds, info);
 
-	if (switchdev_trans_ph_prepare(trans))
-		return dsa_switch_mdb_prepare_bitmap(ds, mdb, ds->bitmap);
+	if (!ds->ops->port_mdb_add)
+		return 0;
 
-	dsa_switch_mdb_add_bitmap(ds, mdb, ds->bitmap);
+	for (port = 0; port < ds->num_ports; port++)
+		if (dsa_switch_mdb_match(ds, port, info))
+			ds->ops->port_mdb_add(ds, port, info->mdb);
 
 	return 0;
 }
@@ -186,13 +180,11 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
 static int dsa_switch_mdb_del(struct dsa_switch *ds,
 			      struct dsa_notifier_mdb_info *info)
 {
-	const struct switchdev_obj_port_mdb *mdb = info->mdb;
-
 	if (!ds->ops->port_mdb_del)
 		return -EOPNOTSUPP;
 
 	if (ds->index == info->sw_index)
-		return ds->ops->port_mdb_del(ds, info->port, mdb);
+		return ds->ops->port_mdb_del(ds, info->port, info->mdb);
 
 	return 0;
 }
@@ -234,59 +226,55 @@ static int dsa_port_vlan_check(struct dsa_switch *ds, int port,
 			     (void *)vlan);
 }
 
-static int
-dsa_switch_vlan_prepare_bitmap(struct dsa_switch *ds,
-			       const struct switchdev_obj_port_vlan *vlan,
-			       const unsigned long *bitmap)
+static bool dsa_switch_vlan_match(struct dsa_switch *ds, int port,
+				  struct dsa_notifier_vlan_info *info)
+{
+	if (ds->index == info->sw_index && port == info->port)
+		return true;
+
+	if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+		return true;
+
+	return false;
+}
+
+static int dsa_switch_vlan_prepare(struct dsa_switch *ds,
+				   struct dsa_notifier_vlan_info *info)
 {
 	int port, err;
 
 	if (!ds->ops->port_vlan_prepare || !ds->ops->port_vlan_add)
 		return -EOPNOTSUPP;
 
-	for_each_set_bit(port, bitmap, ds->num_ports) {
-		err = dsa_port_vlan_check(ds, port, vlan);
-		if (err)
-			return err;
+	for (port = 0; port < ds->num_ports; port++) {
+		if (dsa_switch_vlan_match(ds, port, info)) {
+			err = dsa_port_vlan_check(ds, port, info->vlan);
+			if (err)
+				return err;
 
-		err = ds->ops->port_vlan_prepare(ds, port, vlan);
-		if (err)
-			return err;
+			err = ds->ops->port_vlan_prepare(ds, port, info->vlan);
+			if (err)
+				return err;
+		}
 	}
 
 	return 0;
 }
 
-static void
-dsa_switch_vlan_add_bitmap(struct dsa_switch *ds,
-			   const struct switchdev_obj_port_vlan *vlan,
-			   const unsigned long *bitmap)
-{
-	int port;
-
-	for_each_set_bit(port, bitmap, ds->num_ports)
-		ds->ops->port_vlan_add(ds, port, vlan);
-}
-
 static int dsa_switch_vlan_add(struct dsa_switch *ds,
 			       struct dsa_notifier_vlan_info *info)
 {
-	const struct switchdev_obj_port_vlan *vlan = info->vlan;
-	struct switchdev_trans *trans = info->trans;
 	int port;
 
-	/* Build a mask of VLAN members */
-	bitmap_zero(ds->bitmap, ds->num_ports);
-	if (ds->index == info->sw_index)
-		set_bit(info->port, ds->bitmap);
-	for (port = 0; port < ds->num_ports; port++)
-		if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
-			set_bit(port, ds->bitmap);
+	if (switchdev_trans_ph_prepare(info->trans))
+		return dsa_switch_vlan_prepare(ds, info);
 
-	if (switchdev_trans_ph_prepare(trans))
-		return dsa_switch_vlan_prepare_bitmap(ds, vlan, ds->bitmap);
+	if (!ds->ops->port_vlan_add)
+		return 0;
 
-	dsa_switch_vlan_add_bitmap(ds, vlan, ds->bitmap);
+	for (port = 0; port < ds->num_ports; port++)
+		if (dsa_switch_vlan_match(ds, port, info))
+			ds->ops->port_vlan_add(ds, port, info->vlan);
 
 	return 0;
 }
@@ -294,13 +282,11 @@ static int dsa_switch_vlan_add(struct dsa_switch *ds,
 static int dsa_switch_vlan_del(struct dsa_switch *ds,
 			       struct dsa_notifier_vlan_info *info)
 {
-	const struct switchdev_obj_port_vlan *vlan = info->vlan;
-
 	if (!ds->ops->port_vlan_del)
 		return -EOPNOTSUPP;
 
 	if (ds->index == info->sw_index)
-		return ds->ops->port_vlan_del(ds, info->port, vlan);
+		return ds->ops->port_vlan_del(ds, info->port, info->vlan);
 
 	return 0;
 }
-- 
2.23.0


^ permalink raw reply related

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-21  6:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jiri Pirko, David S . Miller, Kirti Wankhede, Cornelia Huck,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org, cjia,
	netdev@vger.kernel.org
In-Reply-To: <20190820232622.164962d3@x1.home>



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, August 21, 2019 10:56 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> <cohuck@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> > > > > Just an example of the alias, not proposing how it's set.  In
> > > > > fact, proposing that the user does not set it, mdev-core
> > > > > provides one
> > > automatically.
> > > > >
> > > > > > > Since there seems to be some prefix overhead, as I ask about
> > > > > > > above in how many characters we actually have to work with
> > > > > > > in IFNAMESZ, maybe we start with 8 characters (matching your
> > > > > > > "index" namespace) and expand as necessary for disambiguation.
> > > > > > > If we can eliminate overhead in IFNAMESZ, let's start with 12.
> > > > > > > Thanks,
> > > > > > >
> > > > > > If user is going to choose the alias, why does it have to be limited to
> sha1?
> > > > > > Or you just told it as an example?
> > > > > >
> > > > > > It can be an alpha-numeric string.
> > > > >
> > > > > No, I'm proposing a different solution where mdev-core creates
> > > > > an alias based on an abbreviated sha1.  The user does not provide the
> alias.
> > > > >
> > > > > > Instead of mdev imposing number of characters on the alias, it
> > > > > > should be best
> > > > > left to the user.
> > > > > > Because in future if netdev improves on the naming scheme,
> > > > > > mdev will be
> > > > > limiting it, which is not right.
> > > > > > So not restricting alias size seems right to me.
> > > > > > User configuring mdev for networking devices in a given kernel
> > > > > > knows what
> > > > > user is doing.
> > > > > > So user can choose alias name size as it finds suitable.
> > > > >
> > > > > That's not what I'm proposing, please read again.  Thanks,
> > > >
> > > > I understood your point. But mdev doesn't know how user is going
> > > > to use
> > > udev/systemd to name the netdev.
> > > > So even if mdev chose to pick 12 characters, it could result in collision.
> > > > Hence the proposal to provide the alias by the user, as user know
> > > > the best
> > > policy for its use case in the environment its using.
> > > > So 12 character sha1 method will still work by user.
> > >
> > > Haven't you already provided examples where certain drivers or
> > > subsystems have unique netdev prefixes?  If mdev provides a unique
> > > alias within the subsystem, couldn't we simply define a netdev
> > > prefix for the mdev subsystem and avoid all other collisions?  I'm
> > > not in favor of the user providing both a uuid and an
> > > alias/instance.  Thanks,
> > >
> > For a given prefix, say ens2f0, can two UUID->sha1 first 9 characters have
> collision?
> 
> I think it would be a mistake to waste so many chars on a prefix, but 9
> characters of sha1 likely wouldn't have a collision before we have 10s of
> thousands of devices.  Thanks,
> 
> Alex

Jiri, Dave,
Are you ok with it for devlink/netdev part?
Mdev core will create an alias from a UUID.

This will be supplied during devlink port attr set such as,

devlink_port_attrs_mdev_set(struct devlink_port *port, const char *mdev_alias);

This alias is used to generate representor netdev's phys_port_name.
This alias from the mdev device's sysfs will be used by the udev/systemd to generate predicable netdev's name.
Example: enm<mdev_alias_first_12_chars>
I took Ethernet mdev as an example.
New prefix 'm' stands for mediated device.
Remaining 12 characters are first 12 chars of the mdev alias.

^ permalink raw reply

* pull-request: wireless-drivers 2019-08-21
From: Kalle Valo @ 2019-08-21  6:07 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel, Johannes Berg

Hi Dave,

here's a pull request to net for 5.3, more info below. I will be offline
the next week, but Johannes should be able to help if there are any
issues.

Kalle


The following changes since commit d1abaeb3be7b5fa6d7a1fbbd2e14e3310005c4c1:

  Linux 5.3-rc5 (2019-08-18 14:31:08 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git tags/wireless-drivers-for-davem-2019-08-21

for you to fetch changes up to 5a8c31aa63578cb0ff390a57537f1cb4b312a1ed:

  iwlwifi: pcie: fix recognition of QuZ devices (2019-08-20 17:00:42 +0300)

----------------------------------------------------------------
wireless-drivers fixes for 5.3

Third set of fixes for 5.3, and most likely the last one. The rt2x00
regression has been reported multiple times, others are of lower
priority.

mt76

* fix hang on resume on certain machines

rt2x00

* fix AP mode regression related to encryption

iwlwifi

* avoid unnecessary error messages due to multicast frames when not
  associated

* fix configuration for ax201 devices

* fix recognition of QuZ devices

----------------------------------------------------------------
Emmanuel Grumbach (1):
      iwlwifi: pcie: fix the byte count table format for 22560 devices

Ilan Peer (1):
      iwlwifi: mvm: Allow multicast data frames only when associated

Luca Coelho (2):
      iwlwifi: pcie: don't switch FW to qnj when ax201 is detected
      iwlwifi: pcie: fix recognition of QuZ devices

Stanislaw Gruszka (2):
      mt76: mt76x0u: do not reset radio on resume
      rt2x00: clear IV's on start to fix AP mode regression

 drivers/net/wireless/intel/iwlwifi/mvm/mac-ctxt.c | 33 ++++++++++++++++++++---
 drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c | 10 +++++++
 drivers/net/wireless/intel/iwlwifi/pcie/drv.c     | 17 ++++++++++++
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c   |  1 +
 drivers/net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 20 +++++++++-----
 drivers/net/wireless/mediatek/mt76/mt76x0/usb.c   |  8 +++---
 drivers/net/wireless/ralink/rt2x00/rt2800lib.c    |  9 +++++++
 drivers/net/wireless/ralink/rt2x00/rt2x00.h       |  1 +
 drivers/net/wireless/ralink/rt2x00/rt2x00dev.c    | 13 +++++----
 9 files changed, 93 insertions(+), 19 deletions(-)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox