Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 2/7] mlxsw: reg: Add new trap actions
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Subsequent patches will add discard traps support in mlxsw. The driver
cannot configure such traps with a normal trap action, but needs to use
exception trap action, which also increments an error counter.

On the other hand, when these traps are initialized or set to drop
action, they should use the default drop action set by the firmware.
This guarantees that when the feature is disabled we get the exact same
behavior as before the feature was introduced.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index ead36702549a..59e296562b5a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5559,6 +5559,8 @@ enum mlxsw_reg_hpkt_action {
 	MLXSW_REG_HPKT_ACTION_DISCARD,
 	MLXSW_REG_HPKT_ACTION_SOFT_DISCARD,
 	MLXSW_REG_HPKT_ACTION_TRAP_AND_SOFT_DISCARD,
+	MLXSW_REG_HPKT_ACTION_TRAP_EXCEPTION_TO_CPU,
+	MLXSW_REG_HPKT_ACTION_SET_FW_DEFAULT = 15,
 };
 
 /* reg_hpkt_action
@@ -5569,6 +5571,8 @@ enum mlxsw_reg_hpkt_action {
  * 3 - Discard.
  * 4 - Soft discard (allow other traps to act on the packet).
  * 5 - Trap and soft discard (allow other traps to overwrite this trap).
+ * 6 - Trap to CPU (CPU receives sole copy) and count it as error.
+ * 15 - Restore the firmware's default action.
  * Access: RW
  *
  * Note: Must be set to 0 (forward) for event trap IDs, as they are already
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 3/7] mlxsw: Add layer 2 discard trap IDs
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Add the trap IDs used to report layer 2 drops.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/trap.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index 19202bdb5105..7618f084cae9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -66,6 +66,13 @@ enum {
 	MLXSW_TRAP_ID_NVE_ENCAP_ARP = 0xBD,
 	MLXSW_TRAP_ID_ROUTER_ALERT_IPV4 = 0xD6,
 	MLXSW_TRAP_ID_ROUTER_ALERT_IPV6 = 0xD7,
+	MLXSW_TRAP_ID_DISCARD_ING_PACKET_SMAC_MC = 0x140,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_VTAG_ALLOW = 0x148,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_VLAN = 0x149,
+	MLXSW_TRAP_ID_DISCARD_ING_SWITCH_STP = 0x14A,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_UC = 0x150,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_MC_NULL = 0x151,
+	MLXSW_TRAP_ID_DISCARD_LOOKUP_SWITCH_LB = 0x152,
 	MLXSW_TRAP_ID_ACL0 = 0x1C0,
 	/* Multicast trap used for routes with trap action */
 	MLXSW_TRAP_ID_ACL1 = 0x1C1,
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 4/7] mlxsw: Add trap group for layer 2 discards
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Discard trap groups are defined in a different enum so that they could
all share the same policer ID: MLXSW_REG_HTGT_TRAP_GROUP_MAX + 1.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 59e296562b5a..baa20cdd65df 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5422,6 +5422,14 @@ enum mlxsw_reg_htgt_trap_group {
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_LBERROR,
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_PTP0,
 	MLXSW_REG_HTGT_TRAP_GROUP_SP_PTP1,
+
+	__MLXSW_REG_HTGT_TRAP_GROUP_MAX,
+	MLXSW_REG_HTGT_TRAP_GROUP_MAX = __MLXSW_REG_HTGT_TRAP_GROUP_MAX - 1
+};
+
+enum mlxsw_reg_htgt_discard_trap_group {
+	MLXSW_REG_HTGT_DISCARD_TRAP_GROUP_BASE = MLXSW_REG_HTGT_TRAP_GROUP_MAX,
+	MLXSW_REG_HTGT_TRAP_GROUP_SP_L2_DISCARDS,
 };
 
 /* reg_htgt_trap_group
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 5/7] mlxsw: spectrum: Add devlink-trap support
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Register supported packet traps (layer 2 drops only, currently) and
associated trap group with devlink during driver initialization.

The amount of traffic generated by these packet drop traps is capped at
10Kpps to ensure the CPU is not overwhelmed by incoming packets.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/core.c    |  52 ++++
 drivers/net/ethernet/mellanox/mlxsw/core.h    |   9 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  21 ++
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  13 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 267 ++++++++++++++++++
 6 files changed, 363 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index 171b36bd8a4e..0e86a581d45b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -29,7 +29,7 @@ mlxsw_spectrum-objs		:= spectrum.o spectrum_buffers.o \
 				   spectrum_mr_tcam.o spectrum_mr.o \
 				   spectrum_qdisc.o spectrum_span.o \
 				   spectrum_nve.o spectrum_nve_vxlan.o \
-				   spectrum_dpipe.o
+				   spectrum_dpipe.o spectrum_trap.o
 mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB)	+= spectrum_dcb.o
 mlxsw_spectrum-$(CONFIG_PTP_1588_CLOCK)		+= spectrum_ptp.o
 obj-$(CONFIG_MLXSW_MINIMAL)	+= mlxsw_minimal.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 6ec07ecfb5f6..963a2b4b61b1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1017,6 +1017,54 @@ static int mlxsw_devlink_flash_update(struct devlink *devlink,
 					  component, extack);
 }
 
+static int mlxsw_devlink_trap_init(struct devlink *devlink,
+				   const struct devlink_trap *trap,
+				   void *trap_ctx)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_init)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_init(mlxsw_core, trap, trap_ctx);
+}
+
+static void mlxsw_devlink_trap_fini(struct devlink *devlink,
+				    const struct devlink_trap *trap,
+				    void *trap_ctx)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_fini)
+		return;
+	mlxsw_driver->trap_fini(mlxsw_core, trap, trap_ctx);
+}
+
+static int mlxsw_devlink_trap_action_set(struct devlink *devlink,
+					 const struct devlink_trap *trap,
+					 enum devlink_trap_action action)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_action_set)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_action_set(mlxsw_core, trap, action);
+}
+
+static int
+mlxsw_devlink_trap_group_init(struct devlink *devlink,
+			      const struct devlink_trap_group *group)
+{
+	struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
+	struct mlxsw_driver *mlxsw_driver = mlxsw_core->driver;
+
+	if (!mlxsw_driver->trap_group_init)
+		return -EOPNOTSUPP;
+	return mlxsw_driver->trap_group_init(mlxsw_core, group);
+}
+
 static const struct devlink_ops mlxsw_devlink_ops = {
 	.reload				= mlxsw_devlink_core_bus_device_reload,
 	.port_type_set			= mlxsw_devlink_port_type_set,
@@ -1034,6 +1082,10 @@ static const struct devlink_ops mlxsw_devlink_ops = {
 	.sb_occ_tc_port_bind_get	= mlxsw_devlink_sb_occ_tc_port_bind_get,
 	.info_get			= mlxsw_devlink_info_get,
 	.flash_update			= mlxsw_devlink_flash_update,
+	.trap_init			= mlxsw_devlink_trap_init,
+	.trap_fini			= mlxsw_devlink_trap_fini,
+	.trap_action_set		= mlxsw_devlink_trap_action_set,
+	.trap_group_init		= mlxsw_devlink_trap_group_init,
 };
 
 static int
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 19cea16c30bb..b65a17d49e43 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -292,6 +292,15 @@ struct mlxsw_driver {
 	int (*flash_update)(struct mlxsw_core *mlxsw_core,
 			    const char *file_name, const char *component,
 			    struct netlink_ext_ack *extack);
+	int (*trap_init)(struct mlxsw_core *mlxsw_core,
+			 const struct devlink_trap *trap, void *trap_ctx);
+	void (*trap_fini)(struct mlxsw_core *mlxsw_core,
+			  const struct devlink_trap *trap, void *trap_ctx);
+	int (*trap_action_set)(struct mlxsw_core *mlxsw_core,
+			       const struct devlink_trap *trap,
+			       enum devlink_trap_action action);
+	int (*trap_group_init)(struct mlxsw_core *mlxsw_core,
+			       const struct devlink_trap_group *group);
 	void (*txhdr_construct)(struct sk_buff *skb,
 				const struct mlxsw_tx_info *tx_info);
 	int (*resources_register)(struct mlxsw_core *mlxsw_core);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 389861ece418..7de9833fc60b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4665,6 +4665,12 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
 		goto err_traps_init;
 	}
 
+	err = mlxsw_sp_devlink_traps_init(mlxsw_sp);
+	if (err) {
+		dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize devlink traps\n");
+		goto err_devlink_traps_init;
+	}
+
 	err = mlxsw_sp_buffers_init(mlxsw_sp);
 	if (err) {
 		dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize buffers\n");
@@ -4798,6 +4804,8 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
 err_lag_init:
 	mlxsw_sp_buffers_fini(mlxsw_sp);
 err_buffers_init:
+	mlxsw_sp_devlink_traps_fini(mlxsw_sp);
+err_devlink_traps_init:
 	mlxsw_sp_traps_fini(mlxsw_sp);
 err_traps_init:
 	mlxsw_sp_fids_fini(mlxsw_sp);
@@ -4870,6 +4878,7 @@ static void mlxsw_sp_fini(struct mlxsw_core *mlxsw_core)
 	mlxsw_sp_span_fini(mlxsw_sp);
 	mlxsw_sp_lag_fini(mlxsw_sp);
 	mlxsw_sp_buffers_fini(mlxsw_sp);
+	mlxsw_sp_devlink_traps_fini(mlxsw_sp);
 	mlxsw_sp_traps_fini(mlxsw_sp);
 	mlxsw_sp_fids_fini(mlxsw_sp);
 	mlxsw_sp_kvdl_fini(mlxsw_sp);
@@ -5251,6 +5260,10 @@ static struct mlxsw_driver mlxsw_sp1_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp1_resources_register,
 	.kvd_sizes_get			= mlxsw_sp_kvd_sizes_get,
@@ -5281,6 +5294,10 @@ static struct mlxsw_driver mlxsw_sp2_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp2_resources_register,
 	.params_register		= mlxsw_sp2_params_register,
@@ -5310,6 +5327,10 @@ static struct mlxsw_driver mlxsw_sp3_driver = {
 	.sb_occ_port_pool_get		= mlxsw_sp_sb_occ_port_pool_get,
 	.sb_occ_tc_port_bind_get	= mlxsw_sp_sb_occ_tc_port_bind_get,
 	.flash_update			= mlxsw_sp_flash_update,
+	.trap_init			= mlxsw_sp_trap_init,
+	.trap_fini			= mlxsw_sp_trap_fini,
+	.trap_action_set		= mlxsw_sp_trap_action_set,
+	.trap_group_init		= mlxsw_sp_trap_group_init,
 	.txhdr_construct		= mlxsw_sp_txhdr_construct,
 	.resources_register		= mlxsw_sp2_resources_register,
 	.params_register		= mlxsw_sp2_params_register,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index db17ba35ec84..20c14bba9ccb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -958,4 +958,17 @@ void mlxsw_sp_nve_fini(struct mlxsw_sp *mlxsw_sp);
 int mlxsw_sp_nve_inc_parsing_depth_get(struct mlxsw_sp *mlxsw_sp);
 void mlxsw_sp_nve_inc_parsing_depth_put(struct mlxsw_sp *mlxsw_sp);
 
+/* spectrum_trap.c */
+int mlxsw_sp_devlink_traps_init(struct mlxsw_sp *mlxsw_sp);
+void mlxsw_sp_devlink_traps_fini(struct mlxsw_sp *mlxsw_sp);
+int mlxsw_sp_trap_init(struct mlxsw_core *mlxsw_core,
+		       const struct devlink_trap *trap, void *trap_ctx);
+void mlxsw_sp_trap_fini(struct mlxsw_core *mlxsw_core,
+			const struct devlink_trap *trap, void *trap_ctx);
+int mlxsw_sp_trap_action_set(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap *trap,
+			     enum devlink_trap_action action);
+int mlxsw_sp_trap_group_init(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap_group *group);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c
new file mode 100644
index 000000000000..899450b28621
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2019 Mellanox Technologies. All rights reserved */
+
+#include <linux/kernel.h>
+#include <net/devlink.h>
+#include <uapi/linux/devlink.h>
+
+#include "core.h"
+#include "reg.h"
+#include "spectrum.h"
+
+#define MLXSW_SP_TRAP_METADATA DEVLINK_TRAP_METADATA_TYPE_F_IN_PORT
+
+static void mlxsw_sp_rx_drop_listener(struct sk_buff *skb, u8 local_port,
+				      void *priv);
+
+#define MLXSW_SP_TRAP_DROP(_id, _group_id)				      \
+	DEVLINK_TRAP_GENERIC(DROP, DROP, _id,				      \
+			     DEVLINK_TRAP_GROUP_GENERIC(_group_id),	      \
+			     MLXSW_SP_TRAP_METADATA)
+
+#define MLXSW_SP_RXL_DISCARD(_id, _group_id)				      \
+	MLXSW_RXL(mlxsw_sp_rx_drop_listener, DISCARD_##_id, SET_FW_DEFAULT,   \
+		  false, SP_##_group_id, DISCARD)
+
+static struct devlink_trap mlxsw_sp_traps_arr[] = {
+	MLXSW_SP_TRAP_DROP(SMAC_MC, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(VLAN_TAG_MISMATCH, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(INGRESS_VLAN_FILTER, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(INGRESS_STP_FILTER, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(EMPTY_TX_LIST, L2_DROPS),
+	MLXSW_SP_TRAP_DROP(PORT_LOOPBACK_FILTER, L2_DROPS),
+};
+
+static struct mlxsw_listener mlxsw_sp_listeners_arr[] = {
+	MLXSW_SP_RXL_DISCARD(ING_PACKET_SMAC_MC, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_VTAG_ALLOW, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_VLAN, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(ING_SWITCH_STP, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_UC, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_MC_NULL, L2_DISCARDS),
+	MLXSW_SP_RXL_DISCARD(LOOKUP_SWITCH_LB, L2_DISCARDS),
+};
+
+/* Mapping between hardware trap and devlink trap. Multiple hardware traps can
+ * be mapped to the same devlink trap. Order is according to
+ * 'mlxsw_sp_listeners_arr'.
+ */
+static u16 mlxsw_sp_listener_devlink_map[] = {
+	DEVLINK_TRAP_GENERIC_ID_SMAC_MC,
+	DEVLINK_TRAP_GENERIC_ID_VLAN_TAG_MISMATCH,
+	DEVLINK_TRAP_GENERIC_ID_INGRESS_VLAN_FILTER,
+	DEVLINK_TRAP_GENERIC_ID_INGRESS_STP_FILTER,
+	DEVLINK_TRAP_GENERIC_ID_EMPTY_TX_LIST,
+	DEVLINK_TRAP_GENERIC_ID_EMPTY_TX_LIST,
+	DEVLINK_TRAP_GENERIC_ID_PORT_LOOPBACK_FILTER,
+};
+
+static int mlxsw_sp_rx_listener(struct mlxsw_sp *mlxsw_sp, struct sk_buff *skb,
+				u8 local_port,
+				struct mlxsw_sp_port *mlxsw_sp_port)
+{
+	struct mlxsw_sp_port_pcpu_stats *pcpu_stats;
+
+	if (unlikely(!mlxsw_sp_port)) {
+		dev_warn_ratelimited(mlxsw_sp->bus_info->dev, "Port %d: skb received for non-existent port\n",
+				     local_port);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	skb->dev = mlxsw_sp_port->dev;
+
+	pcpu_stats = this_cpu_ptr(mlxsw_sp_port->pcpu_stats);
+	u64_stats_update_begin(&pcpu_stats->syncp);
+	pcpu_stats->rx_packets++;
+	pcpu_stats->rx_bytes += skb->len;
+	u64_stats_update_end(&pcpu_stats->syncp);
+
+	skb->protocol = eth_type_trans(skb, skb->dev);
+
+	return 0;
+}
+
+static void mlxsw_sp_rx_drop_listener(struct sk_buff *skb, u8 local_port,
+				      void *trap_ctx)
+{
+	struct devlink_port *in_devlink_port;
+	struct mlxsw_sp_port *mlxsw_sp_port;
+	struct mlxsw_sp *mlxsw_sp;
+	struct devlink *devlink;
+
+	mlxsw_sp = devlink_trap_ctx_priv(trap_ctx);
+	mlxsw_sp_port = mlxsw_sp->ports[local_port];
+
+	if (mlxsw_sp_rx_listener(mlxsw_sp, skb, local_port, mlxsw_sp_port))
+		return;
+
+	devlink = priv_to_devlink(mlxsw_sp->core);
+	in_devlink_port = mlxsw_core_port_devlink_port_get(mlxsw_sp->core,
+							   local_port);
+	devlink_trap_report(devlink, skb, trap_ctx, in_devlink_port);
+	consume_skb(skb);
+}
+
+int mlxsw_sp_devlink_traps_init(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	if (WARN_ON(ARRAY_SIZE(mlxsw_sp_listener_devlink_map) !=
+		    ARRAY_SIZE(mlxsw_sp_listeners_arr)))
+		return -EINVAL;
+
+	return devlink_traps_register(devlink, mlxsw_sp_traps_arr,
+				      ARRAY_SIZE(mlxsw_sp_traps_arr),
+				      mlxsw_sp);
+}
+
+void mlxsw_sp_devlink_traps_fini(struct mlxsw_sp *mlxsw_sp)
+{
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	devlink_traps_unregister(devlink, mlxsw_sp_traps_arr,
+				 ARRAY_SIZE(mlxsw_sp_traps_arr));
+}
+
+int mlxsw_sp_trap_init(struct mlxsw_core *mlxsw_core,
+		       const struct devlink_trap *trap, void *trap_ctx)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		struct mlxsw_listener *listener;
+		int err;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		err = mlxsw_core_trap_register(mlxsw_core, listener, trap_ctx);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+void mlxsw_sp_trap_fini(struct mlxsw_core *mlxsw_core,
+			const struct devlink_trap *trap, void *trap_ctx)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		struct mlxsw_listener *listener;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		mlxsw_core_trap_unregister(mlxsw_core, listener, trap_ctx);
+	}
+}
+
+int mlxsw_sp_trap_action_set(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap *trap,
+			     enum devlink_trap_action action)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mlxsw_sp_listener_devlink_map); i++) {
+		enum mlxsw_reg_hpkt_action hw_action;
+		struct mlxsw_listener *listener;
+		int err;
+
+		if (mlxsw_sp_listener_devlink_map[i] != trap->id)
+			continue;
+		listener = &mlxsw_sp_listeners_arr[i];
+
+		switch (action) {
+		case DEVLINK_TRAP_ACTION_DROP:
+			hw_action = MLXSW_REG_HPKT_ACTION_SET_FW_DEFAULT;
+			break;
+		case DEVLINK_TRAP_ACTION_TRAP:
+			hw_action = MLXSW_REG_HPKT_ACTION_TRAP_EXCEPTION_TO_CPU;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		err = mlxsw_core_trap_action_set(mlxsw_core, listener,
+						 hw_action);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+#define MLXSW_SP_DISCARD_POLICER_ID	(MLXSW_REG_HTGT_TRAP_GROUP_MAX + 1)
+
+static int
+mlxsw_sp_trap_group_policer_init(struct mlxsw_sp *mlxsw_sp,
+				 const struct devlink_trap_group *group)
+{
+	enum mlxsw_reg_qpcr_ir_units ir_units;
+	char qpcr_pl[MLXSW_REG_QPCR_LEN];
+	u16 policer_id;
+	u8 burst_size;
+	bool is_bytes;
+	u32 rate;
+
+	switch (group->id) {
+	case DEVLINK_TRAP_GROUP_GENERIC_ID_L2_DROPS:
+		policer_id = MLXSW_SP_DISCARD_POLICER_ID;
+		ir_units = MLXSW_REG_QPCR_IR_UNITS_M;
+		is_bytes = false;
+		rate = 10 * 1024; /* 10Kpps */
+		burst_size = 7;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	mlxsw_reg_qpcr_pack(qpcr_pl, policer_id, ir_units, is_bytes, rate,
+			    burst_size);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(qpcr), qpcr_pl);
+}
+
+static int
+__mlxsw_sp_trap_group_init(struct mlxsw_sp *mlxsw_sp,
+			   const struct devlink_trap_group *group)
+{
+	char htgt_pl[MLXSW_REG_HTGT_LEN];
+	u8 priority, tc, group_id;
+	u16 policer_id;
+
+	switch (group->id) {
+	case DEVLINK_TRAP_GROUP_GENERIC_ID_L2_DROPS:
+		group_id = MLXSW_REG_HTGT_TRAP_GROUP_SP_L2_DISCARDS;
+		policer_id = MLXSW_SP_DISCARD_POLICER_ID;
+		priority = 0;
+		tc = 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	mlxsw_reg_htgt_pack(htgt_pl, group_id, policer_id, priority, tc);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(htgt), htgt_pl);
+}
+
+int mlxsw_sp_trap_group_init(struct mlxsw_core *mlxsw_core,
+			     const struct devlink_trap_group *group)
+{
+	struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
+	int err;
+
+	err = mlxsw_sp_trap_group_policer_init(mlxsw_sp, group);
+	if (err)
+		return err;
+
+	err = __mlxsw_sp_trap_group_init(mlxsw_sp, group);
+	if (err)
+		return err;
+
+	return 0;
+}
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 6/7] selftests: mlxsw: Add test cases for devlink-trap L2 drops
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test that each supported packet trap is triggered under the right
conditions and that packets are indeed dropped and not forwarded.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../net/mlxsw/devlink_trap_l2_drops.sh        | 484 ++++++++++++++++++
 1 file changed, 484 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
new file mode 100755
index 000000000000..5dcdfa20fc6c
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
@@ -0,0 +1,484 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test devlink-trap L2 drops functionality over mlxsw. Each registered L2 drop
+# packet trap is tested to make sure it is triggered under the right
+# conditions.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	source_mac_is_multicast_test
+	vlan_tag_mismatch_test
+	ingress_vlan_filter_test
+	ingress_stp_filter_test
+	port_list_is_empty_test
+	port_loopback_filter_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+
+	tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+	tc qdisc del dev $swp2 clsact
+
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+l2_drops_test()
+{
+	local trap_name=$1; shift
+	local group_name=$1; shift
+
+	# This is the common part of all the tests. It checks that stats are
+	# initially idle, then non-idle after changing the trap action and
+	# finally idle again. It also makes sure the packets are dropped and
+	# never forwarded.
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle with initial drop action"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with initial drop action"
+
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_fail $? "Trap stats idle after setting action to trap"
+	devlink_trap_group_stats_idle_test $group_name
+	check_fail $? "Trap group stats idle after setting action to trap"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle after setting action to drop"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle after setting action to drop"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_err $? "Packets were not dropped"
+}
+
+l2_drops_cleanup()
+{
+	local mz_pid=$1; shift
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+	tc filter del dev $swp2 egress protocol ip pref 1 handle 101 flower
+}
+
+source_mac_is_multicast_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local group_name="l2_drops"
+	local mz_pid
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower src_mac $smac action drop
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -d 1msec -q &
+	mz_pid=$!
+
+	RET=0
+
+	l2_drops_test $trap_name $group_name
+
+	log_test "Source MAC is multicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+__vlan_tag_mismatch_test()
+{
+	local trap_name="vlan_tag_mismatch"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local opt=$1; shift
+	local mz_pid
+
+	# Remove PVID flag. This should prevent untagged and prio-tagged
+	# packets from entering the bridge.
+	bridge vlan add vid 1 dev $swp1 untagged master
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 "$opt" -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add PVID and make sure packets are no longer dropped.
+	bridge vlan add vid 1 dev $swp1 pvid untagged master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+}
+
+vlan_tag_mismatch_untagged_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test
+
+	log_test "VLAN tag mismatch - untagged packets"
+}
+
+vlan_tag_mismatch_vid_0_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test "-Q 0"
+
+	log_test "VLAN tag mismatch - prio-tagged packets"
+}
+
+vlan_tag_mismatch_test()
+{
+	vlan_tag_mismatch_untagged_test
+	vlan_tag_mismatch_vid_0_test
+}
+
+ingress_vlan_filter_test()
+{
+	local trap_name="ingress_vlan_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+	local vid=10
+
+	bridge vlan add vid $vid dev $swp2 master
+	# During initialization the firmware enables all the VLAN filters and
+	# the driver does not turn them off since the traffic will be discarded
+	# by the STP filter whose default is DISCARD state. Add the VID on the
+	# ingress bridge port and then remove it to make sure it is not member
+	# in the VLAN.
+	bridge vlan add vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp1 master
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add the VLAN on the bridge port and make sure packets are no longer
+	# dropped.
+	bridge vlan add vid $vid dev $swp1 master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Ingress VLAN filter"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+__ingress_stp_filter_test()
+{
+	local trap_name="ingress_spanning_tree_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local state=$1; shift
+	local mz_pid
+	local vid=20
+
+	bridge vlan add vid $vid dev $swp2 master
+	bridge vlan add vid $vid dev $swp1 master
+	ip link set dev $swp1 type bridge_slave state $state
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Change STP state to forwarding and make sure packets are no longer
+	# dropped.
+	ip link set dev $swp1 type bridge_slave state 3
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+ingress_stp_filter_listening_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - listening state"
+}
+
+ingress_stp_filter_learning_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - learning state"
+}
+
+ingress_stp_filter_test()
+{
+	ingress_stp_filter_listening_test 1
+	ingress_stp_filter_learning_test 2
+}
+
+port_list_is_empty_uc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Disable unicast flooding on both ports, so that packets cannot egress
+	# any port.
+	ip link set dev $swp1 type bridge_slave flood off
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - unicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave flood on
+}
+
+port_list_is_empty_mc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=01:00:5e:00:00:01
+	local group_name="l2_drops"
+	local dip=239.0.0.1
+	local mz_pid
+
+	# Disable multicast flooding on both ports, so that packets cannot
+	# egress any port. We also need to flush IP addresses from the bridge
+	# in order to prevent packets from being flooded to the router port.
+	ip link set dev $swp1 type bridge_slave mcast_flood off
+	ip link set dev $swp2 type bridge_slave mcast_flood off
+	ip address flush dev br0
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -B $dip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave mcast_flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - multicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave mcast_flood on
+}
+
+port_list_is_empty_test()
+{
+	port_list_is_empty_uc_test
+	port_list_is_empty_mc_test
+}
+
+port_loopback_filter_uc_test()
+{
+	local trap_name="port_loopback_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Make sure packets can only egress the input port.
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port loopback filter - unicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+port_loopback_filter_test()
+{
+	port_loopback_filter_uc_test
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 7/7] selftests: mlxsw: Add a test case for devlink-trap
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test generic devlink-trap functionality over mlxsw. These tests are not
specific to a single trap, but do not check the devlink-trap common
infrastructure either.

Currently, the only test case is device deletion (by reloading the
driver) while packets are being trapped.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../drivers/net/mlxsw/devlink_trap.sh         | 129 ++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
new file mode 100755
index 000000000000..89b55e946eed
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
@@ -0,0 +1,129 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test generic devlink-trap functionality over mlxsw. These tests are not
+# specific to a single trap, but do not check the devlink-trap common
+# infrastructure either.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	dev_del_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+}
+
+switch_destroy()
+{
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+dev_del_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local num_iter=5
+	local mz_pid
+	local i
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -q &
+	mz_pid=$!
+
+	# The purpose of this test is to make sure we correctly dismantle a
+	# port while packets are trapped from it. This is done by reloading the
+	# the driver while the 'ingress_smac_mc_drop' trap is triggered.
+	RET=0
+
+	for i in $(seq 1 $num_iter); do
+		log_info "Iteration $i / $num_iter"
+
+		devlink_trap_action_set $trap_name "trap"
+		sleep 1
+
+		devlink_reload
+		# Allow netdevices to be re-created following the reload
+		sleep 20
+
+		cleanup
+		setup_prepare
+		setup_wait
+	done
+
+	log_test "Device delete"
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-21  7:37 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620B6ACB01BFA338ADAED048BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>



> -----Original Message-----
> From: Vakul Garg
> Sent: Tuesday, August 20, 2019 4:08 PM
> To: Florian Westphal <fw@strlen.de>
> Cc: netdev@vger.kernel.org
> Subject: RE: Help needed - Kernel lockup while running ipsec
> 
> 
> 
> >
> > > -----Original Message-----
> > > From: Florian Westphal <fw@strlen.de>
> > > Sent: Tuesday, August 20, 2019 3:08 PM
> > > To: Vakul Garg <vakul.garg@nxp.com>
> > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > >
> > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Florian Westphal <fw@strlen.de>
> > > > > Sent: Tuesday, August 20, 2019 2:53 PM
> > > > > To: Vakul Garg <vakul.garg@nxp.com>
> > > > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > > > >
> > > > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > > > > running single
> > > > > > > static ipsec tunnel.
> > > > > > > > The problem reproduces mostly after running 8-10 hours of
> > > > > > > > ipsec encap
> > > > > > > test (on my dual core arm board).
> > > > > > > >
> > > > > > > > I found that in function xfrm_policy_lookup_bytype(), the
> > > > > > > > policy in variable
> > > > > > > 'ret' shows refcnt=0 under problem situation.
> > > > > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype()
> > > > > > > > and hence the
> > > > > > > lockup.
> > > > > > > >
> > > > > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > > > > Is it legitimate for 'refcnt' to become '0'? Under what
> > > > > > > > condition can it
> > > > > > > become '0'?
> > > > > > >
> > > > > > > Yes, when policy is destroyed and the last user calls
> > > > > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > > > > >
> > > > > > It seems that policy reference count never gets decremented during
> > > > > > packet
> > > > > ipsec encap.
> > > > > > It is getting incremented for every frame that hits the policy.
> > > > > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> > > > >
> > > > > Thats a bug.  Does this affect 4.14 only or does this happen on
> > > > > current tree as well?
> > > >
> > > > I am yet to try it on 4.19.
> > > > Can you help me with the right fix? Which part of code should it get
> > > decremented?
> > > > I am not conversant with xfrm code.
> > >
> > > Normally policy reference counts get decremented when the skb is
> free'd,
> > via
> > > dst destruction (xfrm_dst_destroy()).
> > >
> > > Do you see a dst leak as well?
> >
> > Can you please guide me how to detect it?
> >
> > (I am checking refcount on recent kernel and will let you know.)
> 
> Policy refcount is decreasing properly on 4.19.
> Same should be on the latest kernel too.

On kernel-4.14, I find dst_release() is getting called through xfrm_output_one().
However since dst->__refcnt gets decremented to '1', 
the call_rcu(&dst->rcu_head, dst_destroy_rcu) is not invoked. 

On kernel-4.19, dst->__refcnt gets decremented to '0', hence things fall in place and 
dst_destroy_rcu() eventually executes.

Any further help/pointers for kernel-4.14 would be deeply appreciated.




^ permalink raw reply

* Re: [PATCH] `iwlist scan` fails with many networks available
From: Johannes Berg @ 2019-08-21  7:59 UTC (permalink / raw)
  To: James Nylen; +Cc: David S. Miller, linux-wireless, netdev, linux-kernel
In-Reply-To: <CABVa4Nga1vyvyWNpTTJLa44rZo8wu4-bE=mXX1nZgvzktbSq6A@mail.gmail.com>

On Tue, 2019-08-13 at 00:43 +0000, James Nylen wrote:
> > I suppose we could consider applying a workaround like this if it has a
> > condition checking that the buffer passed in is the maximum possible
> > buffer (65535 bytes, due to iw_point::length being u16)
> 
> This is what the latest patch does (attached to my email from
> yesterday / https://lkml.org/lkml/2019/8/10/452 ).

Hmm, yes, you're right. I evidently missed the comparisons to 0xFFFF
there, sorry about that.

> If you'd like to apply it, I'm happy to make any needed revisions.
> Otherwise I'm going to have to keep patching my kernels for this
> issue, unfortunately I don't have the time to try to get wicd to
> migrate to a better solution.

Not sure which would be easier, but ok :-)

Can you please fix the patch to
 1) use /* */ style comments (see
    https://www.kernel.org/doc/html/latest/process/coding-style.html)

 2) remove extra braces (also per coding style)

 3) use U16_MAX instead of 0xFFFF

I'd also consider renaming "maybe_current_ev" to "next_ev" or something
shorter anyway, and would probably argue that rewriting this

> +		if (IS_ERR(maybe_current_ev)) {
> +			err = PTR_ERR(maybe_current_ev);
> +			if (err == -E2BIG) {
> +				// Last BSS failed to copy into buffer.  As
> +				// above, only report an error if `iwlist` will
> +				// retry again with a larger buffer.
> +				if (len >= 0xFFFF) {
> +					err = 0;
> +				}
> +			}
>  			break;
> +		} else {
> +			current_ev = maybe_current_ev;
>  		}


to something like

	next_ev = ...
	if (IS_ERR(next_ev)) {
		err = PTR_ERR(next_ev);
		/* mask error and truncate in case buffer cannot be
                 * increased
                 */
		if (err == -E2BIG && len < U16_MAX)
			err = 0;
		break;
	}

	current_ev = next_ev;


could be more readable, but that's just editorial really.

Thanks,
johannes


^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Miroslav Lichvar @ 2019-08-21  8:07 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <CAFfN3gUgpzMebyUt8_-9e+5vpm3q-DVVszWdkUEFAgZQ8ex73w@mail.gmail.com>

On Tue, Aug 20, 2019 at 06:56:56PM +0200, Hubert Feurstein wrote:
> Am Di., 20. Aug. 2019 um 17:40 Uhr schrieb Miroslav Lichvar
> > I think a large jitter is ok in this case. We just need to timestamp
> > something that we know for sure happened after the PHC timestamp. It
> > should have no impact on the offset and its stability, just the
> > reported delay. A test with phc2sys should be able to confirm that.
> > phc2sys selects the measurement with the shortest delay, which has
> > least uncertainty. I'd say that applies to both interrupt and polling.
> >
> > If it is difficult to specify the minimum interrupt delay, I'd still
> > prefer an overly pessimistic interval assuming a zero delay.
> >
> Currently I do not see the benefit from this. The original intention was to
> compensate for the remaining offset as good as possible.

That's ok, but IMHO the change should not break the assumptions of
existing application and users.

> The current code
> of phc2sys uses the delay only for the filtering of the measurement record
> with the shortest delay and for reporting and statistics. Why not simple shift
> the timestamps with the offset to the point where we expect the PHC timestamp
> to be captured, and we have a very good result compared to where we came
> from.

Because those reports/statistics are important in calculation of
maximum error. If someone had a requirement for a clock to be accurate
to 1.5 microseconds and the ioctl returned a delay indicating a
sufficient accuracy when in reality it could be worse, that would be a
problem.

BTW, phc2sys is not the only user of the ioctl.

-- 
Miroslav Lichvar

^ permalink raw reply

* RE: [PATCH v3] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-08-21  8:16 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo, David S . Miller
  Cc: linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com
In-Reply-To: <20190820045934.24841-1-jian-hong@endlessm.com>

Hi,

> From: Jian-Hong Pan [mailto:jian-hong@endlessm.com]
> 
> There is a mass of jobs between spin lock and unlock in the hardware
> IRQ which will occupy much time originally. To make system work more
> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> reduce the time in hardware IRQ.
> 
> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> ---
> v2:
>  Change the spin_lock_irqsave/unlock_irqrestore to spin_lock/unlock in
>  rtw_pci_interrupt_handler. Because the interrupts are already disabled
>  in the hardware interrupt handler.
> 
> v3:
>  Extend the spin lock protecting area for the TX path in
>  rtw_pci_interrupt_threadfn by Realtek's suggestion
> 
>  drivers/net/wireless/realtek/rtw88/pci.c | 33 +++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c
> b/drivers/net/wireless/realtek/rtw88/pci.c
> index 00ef229552d5..a8c17a01f318 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -866,12 +866,29 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  {
>  	struct rtw_dev *rtwdev = dev;
>  	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> -	u32 irq_status[4];
> 
>  	spin_lock(&rtwpci->irq_lock);
>  	if (!rtwpci->irq_enabled)
>  		goto out;
> 
> +	/* disable RTW PCI interrupt to avoid more interrupts before the end of
> +	 * thread function
> +	 */
> +	rtw_pci_disable_interrupt(rtwdev, rtwpci);
> +out:
> +	spin_unlock(&rtwpci->irq_lock);
> +
> +	return IRQ_WAKE_THREAD;
> +}
> +
> +static irqreturn_t rtw_pci_interrupt_threadfn(int irq, void *dev)
> +{
> +	struct rtw_dev *rtwdev = dev;
> +	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> +	unsigned long flags;
> +	u32 irq_status[4];
> +
> +	spin_lock_irqsave(&rtwpci->irq_lock, flags);
>  	rtw_pci_irq_recognized(rtwdev, rtwpci, irq_status);
> 
>  	if (irq_status[0] & IMR_MGNTDOK)
> @@ -891,8 +908,10 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  	if (irq_status[0] & IMR_ROK)
>  		rtw_pci_rx_isr(rtwdev, rtwpci, RTW_RX_QUEUE_MPDU);
> 
> -out:
> -	spin_unlock(&rtwpci->irq_lock);
> +	/* all of the jobs for this interrupt have been done */
> +	if (rtw_flag_check(rtwdev, RTW_FLAG_RUNNING))
> +		rtw_pci_enable_interrupt(rtwdev, rtwpci);

I've applied this patch and tested it.
But I failed to connect to AP, it seems that the
scan_result is empty. And when I failed to connect
to the AP, I found that the IMR is not enabled.
I guess the check bypass the interrupt enable function.

And I also found that *without MSI*, the driver is
able to connect to the AP. Could you please verify
this patch again with MSI interrupt enabled and
send a v4?

You can find my MSI patch on
https://patchwork.kernel.org/patch/11081539/


> +	spin_unlock_irqrestore(&rtwpci->irq_lock, flags);
> 
>  	return IRQ_HANDLED;
>  }
> @@ -1152,8 +1171,10 @@ static int rtw_pci_probe(struct pci_dev *pdev,
>  		goto err_destroy_pci;
>  	}
> 
> -	ret = request_irq(pdev->irq, &rtw_pci_interrupt_handler,
> -			  IRQF_SHARED, KBUILD_MODNAME, rtwdev);
> +	ret = devm_request_threaded_irq(rtwdev->dev, pdev->irq,
> +					rtw_pci_interrupt_handler,
> +					rtw_pci_interrupt_threadfn,
> +					IRQF_SHARED, KBUILD_MODNAME, rtwdev);
>  	if (ret) {
>  		ieee80211_unregister_hw(hw);
>  		goto err_destroy_pci;
> @@ -1192,7 +1213,7 @@ static void rtw_pci_remove(struct pci_dev *pdev)
>  	rtw_pci_disable_interrupt(rtwdev, rtwpci);
>  	rtw_pci_destroy(rtwdev, pdev);
>  	rtw_pci_declaim(rtwdev, pdev);
> -	free_irq(rtwpci->pdev->irq, rtwdev);
> +	devm_free_irq(rtwdev->dev, rtwpci->pdev->irq, rtwdev);
>  	rtw_core_deinit(rtwdev);
>  	ieee80211_free_hw(hw);
>  }
> --


NACK
Need to verify it with MSI https://patchwork.kernel.org/patch/11081539/
And hope v4 could fix it.

Yan-Hsuan


^ permalink raw reply

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Naveen N. Rao @ 2019-08-21  8:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Jiong Wang
  Cc: bpf, linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <20190813171018.28221-1-naveen.n.rao@linux.vnet.ibm.com>

Naveen N. Rao wrote:
> Since BPF constant blinding is performed after the verifier pass, there
> are certain ALU32 instructions inserted which don't have a corresponding
> zext instruction inserted after. This is causing a kernel oops on
> powerpc and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
> 
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
> 
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> This approach (the location where zext is being introduced below, in 
> particular) works for powerpc, but I am not entirely sure if this is 
> sufficient for other architectures as well. This is broken on v5.3-rc4.

Alexie, Daniel, Jiong,
Any feedback on this?

- Naveen


^ permalink raw reply

* [PATCH net-next v4] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:31 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org

From: Olga Albisser <olga@albisser.org>

DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
   more responsive and simpler. Classic traffic requires a decent
   target queue (default 15ms for Internet deployment) to fully
   utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
   threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given as a parameter, or they can be
calculated by tc from alternative typical and maximum RTT parameters.

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.

If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

Additional details can be found in the draft:
https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v3-> v4
      - Replaced license boiletplate with SPDX identifier
      - Fix missing pskb_may_pull() calls when accessing ECN bits
      - Move timestamp computation at enqueue to happen after drop check
      - Use NMI-safe time keeping function, i.e., ktime_get_ns()
      - Switched from deprecated PSCHED_NS2TICKS/... to raw nanoseconds clocks
      - Validate netlink parameters properly (ranges, error reporting)
      - Expanded the statistics tracked/reported to better reflect the behavior of
        both queues
      - Simplified the qdisc structure:
        o Reworked classification logic to only depend on an ECN mask
        o Renamed most parameters to better reflect their usage
        o Removed unused/experimental features (e.g., TS-FIFO)
        o Restructured the skb->cb
        o Extracted helper functions
      - Fix compilation issues for ARM
      - Updated defaults parameter values to latest IETF ID
      - Fix the step AQM being applied on empty queues, causing excess marking on
        slower links
    * v2 -> v3
      - Fix compilation issues
      - Replaced the classic queue starvation protection from time-shifted FIFO
        to WRR, as it gives better results (e.g., prevents leaking burst in the C
        queue to the L queue)
    * v1 -> v2
      - Store enqueue timestamp in skb->cb to avoid conflict with EDT

 include/uapi/linux/pkt_sched.h |  33 ++
 net/sched/Kconfig              |  22 +-
 net/sched/Makefile             |   1 +
 net/sched/sch_dualpi2.c        | 744 +++++++++++++++++++++++++++++++++
 4 files changed, 799 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..e2ad4a8d2059 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;		/* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;		/* maximum queue size */
+	__u32 ecn_mark;		/* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..f9340c18c3a2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,26 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_DUALPI2
+	tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+	help
+	  Say Y here if you want to use the DualPI2 AQM.
+	  This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+	  The PI2 AQM is in turn both an extension and a simplification of the
+	  PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being
+	  able to control scalable congestion controls like DCTCP and
+	  TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with
+	  DCTCP, maintaining window fairness. DUALQ provides latency separation
+	  between low latency DCTCP flows and Reno/Cubic flows that need a
+	  bigger queue.
+	  For more information, please see
+	  https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+
+	  To compile this code as a module, choose M here: the module
+	  will be called sch_dualpi2.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
@@ -418,7 +438,7 @@ menuconfig NET_SCH_DEFAULT
 	  of pfifo_fast will be used. Many distributions already set
 	  the default value via /proc/sys/net/core/default_qdisc.
 
-	  If unsure, say N.
+
 
 if NET_SCH_DEFAULT
 
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..8e3bd4459eb4 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2)   += sch_dualpi2.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..a6452aa82018
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2019 Nokia.
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ *   Supports scalable congestion controls (e.g., DCTCP)
+ *   Supports coupled dual-queue with PI2
+ *   Supports L4S ECN identifier
+ *
+ * References:
+ *   draft-ietf-tsvwg-aqm-dualq-coupled:
+ *     http://tools.ietf.org/html/draft-ietf-tsvwg-aqm-dualq-coupled-08
+ *   De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/string.h>
+
+/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB ((u32)(~((u32)0)))
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
+ * control flows whose maximal RTTs can be in usec up to few secs.
+ */
+#define ALPHA_BETA_MAX ((2 << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This enables to use all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+	struct Qdisc *l_queue;	/* The L4S LL queue */
+	struct Qdisc *sch;	/* The classic queue (owner of this struct) */
+
+	struct { /* PI2 parameters */
+		u64	target;	/* Target delay in nanoseconds */
+		u32	tupdate;/* timer frequency (in jiffies) */
+		u32	prob;	/* Base PI2 probability */
+		u32	alpha;	/* Gain factor for the integral rate response */
+		u32	beta;	/* Gain factor for the proportional response */
+		struct timer_list timer; /* prob update timer */
+	} pi2;
+
+	struct { /* Step AQM (L4S queue only) parameters */
+		u32 thresh;	/* Step threshold */
+		bool in_packets;/* Whether the step is in packets or time */
+	} step;
+
+	struct { /* Classic queue starvation protection */
+		s32	credit; /* Credit (sign indicates which queue) */
+		s32	init;	/* Reset value of the credit */
+		u8	wc;	/* C queue weight (between 0 and MAX_WC) */
+		u8	wl;	/* L queue weight (MAX_WC - wc) */
+	} c_protection;
+
+	/* General dualQ parameters */
+	u8	coupling_factor;/* Coupling factor (k) between both queues */
+	u8	ecn_mask;	/* Mask to match L4S packets */
+	bool	drop_early;	/* Drop at enqueue instead of dequeue if true */
+	bool	drop_overload;	/* Drop (1) on overload, or overflow (0) */
+
+	/* Statistics */
+	u64	qdelay_c;	/* Classic Q delay */
+	u64	qdelay_l;	/* L4S Q delay */
+	u32	packets_in_c;	/* Number of packets enqueued in C queue */
+	u32	packets_in_l;	/* Number of packets enqueued in L queue */
+	u32	maxq;		/* maximum queue size */
+	u32	ecn_mark;	/* packets marked with ECN */
+	u32	step_marks;	/* ECN marks due to the step AQM */
+
+	struct { /* Deferred drop statistics */
+		u32 cnt;	/* Packets dropped */
+		u32 len;	/* Bytes dropped */
+	} deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+	u64 ts;		/* Timestamp at enqueue */
+	u8 apply_step:1,/* Can we apply the step threshold */
+	   l4s:1,	/* Packet has been classified as L4S */
+	   ect:2;	/* Packet ECT codepoint */
+};
+
+static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+	return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+	return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static inline u64 qdelay_in_ns(struct Qdisc *q, u64 now)
+{
+	struct sk_buff *skb = qdisc_peek_head(q);
+
+	return skb ? skb_sojourn_time(skb, now) : 0;
+}
+
+static inline u32 dualpi2_scale_alpha_beta(u32 param)
+{
+	u64 tmp  = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+	do_div(tmp, NSEC_PER_SEC);
+	return tmp;
+}
+
+static inline u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+	do_div(tmp, MAX_PROB);
+	return tmp;
+}
+
+static inline bool skb_is_l4s(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->l4s != 0;
+}
+
+static inline void dualpi2_mark(struct dualpi2_sched_data *q,
+				struct sk_buff *skb)
+{
+	if (INET_ECN_set_ce(skb))
+		q->ecn_mark++;
+}
+
+static inline void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+	q->c_protection.credit = q->c_protection.init;
+}
+
+static inline void dualpi2_calculate_c_protection(struct Qdisc *sch,
+						  struct dualpi2_sched_data *q,
+						  u32 wc)
+{
+	q->c_protection.wc = wc;
+	q->c_protection.wl = MAX_WC - wc;
+	/* Start with L queue if wl > wc */
+	q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+		((int)q->c_protection.wc - (int)q->c_protection.wl);
+	dualpi2_reset_c_protection(q);
+}
+
+static inline bool dualpi2_roll(u32 prob)
+{
+	return prandom_u32() <= prob;
+}
+
+static inline bool dualpi2_squared_roll(struct dualpi2_sched_data *q)
+{
+	return dualpi2_roll(q->pi2.prob) && dualpi2_roll(q->pi2.prob);
+}
+
+static inline bool dualpi2_is_overloaded(u64 prob)
+{
+	return prob > MAX_PROB;
+}
+
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+		      struct sk_buff *skb)
+{
+	u64 local_l_prob;
+
+	/* Never drop if we have fewer than 2 mtu-sized packets;
+	 * similar to min_th in RED.
+	 */
+	if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+		return false;
+
+	local_l_prob = (u64)q->pi2.prob * q->coupling_factor;
+
+	if (skb_is_l4s(skb)) {
+		if (dualpi2_is_overloaded(local_l_prob)) {
+			/* On overload, preserve delay by doing a classic drop
+			 * in the L queue. Otherwise, let both queues grow until
+			 * we reach the limit and cannot enqueue anymore
+			 * (sacrifice delay to avoid drops).
+			 */
+			if (q->drop_overload && dualpi2_squared_roll(q))
+				goto drop;
+			else
+				goto mark;
+			/* Scalable marking has a  (prob * k) probability */
+		} else if (dualpi2_roll(local_l_prob)) {
+			goto mark;
+		}
+		/* Apply classic marking with a (prob * prob) probability.
+		 * Force drops for ECN-capable traffic on overload.
+		 */
+	} else if (dualpi2_squared_roll(q)) {
+		if (dualpi2_skb_cb(skb)->ect &&
+		    !dualpi2_is_overloaded(local_l_prob))
+			goto mark;
+		else
+			goto drop;
+	}
+	return false;
+
+mark:
+	dualpi2_mark(q, skb);
+	return false;
+
+drop:
+	return true;
+}
+
+static void dualpi2_skb_classify(struct dualpi2_sched_data *q,
+				 struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	int wlen = skb_network_offset(skb);
+
+	switch (tc_skb_protocol(skb)) {
+	case htons(ETH_P_IP):
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+		break;
+	case htons(ETH_P_IPV6):
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+		break;
+	default:
+		goto not_ecn;
+	}
+	cb->l4s = (cb->ect & q->ecn_mask) != 0;
+	return;
+
+not_ecn:
+	/* Not ECN capable or not non pullable/writable packets can only be
+	 * dropped hence go the the classic queue.
+	 */
+	cb->ect = INET_ECN_NOT_ECT;
+	cb->l4s = 0;
+}
+
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+		qdisc_qstats_overlimit(sch);
+		err = NET_XMIT_DROP;
+		goto drop;
+	}
+
+	dualpi2_skb_classify(q, skb);
+
+	/* drop early if configured */
+	if (q->drop_early && must_drop(sch, q, skb)) {
+		err = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+		goto drop;
+	}
+
+	dualpi2_skb_cb(skb)->ts = ktime_get_ns();
+
+	if (qdisc_qlen(sch) > q->maxq)
+		q->maxq = qdisc_qlen(sch);
+
+	if (skb_is_l4s(skb)) {
+		/* Only apply the step if a queue is building up */
+		dualpi2_skb_cb(skb)->apply_step = qdisc_qlen(q->l_queue) > 1;
+		/* Keep the overall qdisc stats consistent */
+		++sch->q.qlen;
+		qdisc_qstats_backlog_inc(sch, skb);
+		++q->packets_in_l;
+		return qdisc_enqueue_tail(skb, q->l_queue);
+	}
+	++q->packets_in_c;
+	return qdisc_enqueue_tail(skb, sch);
+
+drop:
+	qdisc_drop(skb, sch, to_free);
+	return err;
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int qlen_c, credit_change;
+
+pick_packet:
+	/* L queue packets are also accounted for in qdisc_qlen(sch)! */
+	qlen_c = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+	skb = NULL;
+	/* We can drop after qdisc_dequeue_head() calls.
+	 * Manage statistics by hand to keep them consistent if that happens.
+	 */
+	if (qdisc_qlen(q->l_queue) > 0 &&
+	    (qlen_c <= 0 || q->c_protection.credit <= 0)) {
+		/* Dequeue and increase the credit by wc if qlen_c != 0 */
+		skb = __qdisc_dequeue_head(&q->l_queue->q);
+		credit_change = qlen_c ?
+			q->c_protection.wc * qdisc_pkt_len(skb) : 0;
+		/* The global backlog will be updated later. */
+		qdisc_qstats_backlog_dec(q->l_queue, skb);
+		/* Propagate the dequeue to the global stats. */
+		--sch->q.qlen;
+	} else if (qlen_c > 0) {
+		/* Dequeue and decrease the credit by wl if qlen_l != 0 */
+		skb = __qdisc_dequeue_head(&sch->q);
+		credit_change = qdisc_qlen(q->l_queue) ?
+			(s32)(-1) * q->c_protection.wl * qdisc_pkt_len(skb) : 0;
+	} else {
+		dualpi2_reset_c_protection(q);
+		goto exit;
+	}
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	/* Drop on dequeue? */
+	if (!q->drop_early && must_drop(sch, q, skb)) {
+		++q->deferred_drops.cnt;
+		q->deferred_drops.len += qdisc_pkt_len(skb);
+		consume_skb(skb);
+		qdisc_qstats_drop(sch);
+		/* try next packet */
+		goto pick_packet;
+	}
+
+	/* Apply the Step AQM to packets coming out of the L queue. */
+	if (skb_is_l4s(skb)) {
+		u64 qdelay = 0;
+
+		if (q->step.in_packets)
+			qdelay = qdisc_qlen(q->l_queue);
+		else
+			qdelay = skb_sojourn_time(skb, ktime_get_ns());
+		/* Apply the step */
+		if (likely(dualpi2_skb_cb(skb)->apply_step) &&
+		    qdelay > q->step.thresh) {
+			dualpi2_mark(q, skb);
+			++q->step_marks;
+		}
+		qdisc_bstats_update(q->l_queue, skb);
+	}
+
+	q->c_protection.credit += credit_change;
+	qdisc_bstats_update(sch, skb);
+
+exit:
+	/* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+	 * or HTB crashes.
+	 */
+	if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+		qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+					  q->deferred_drops.len);
+		q->deferred_drops.cnt = 0;
+		q->deferred_drops.len = 0;
+	}
+	return skb;
+}
+
+static s64 __scale_delta(s64 diff)
+{
+	do_div(diff, (1 << (ALPHA_BETA_GRANULARITY + 1)) - 1);
+	return diff;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay, qdelay_old, now;
+	u32 new_prob;
+	s64 delta;
+
+	qdelay_old = max_t(u64, q->qdelay_c, q->qdelay_l);
+	now = ktime_get_ns();
+	q->qdelay_l = qdelay_in_ns(q->l_queue, now);
+	q->qdelay_c = qdelay_in_ns(sch, now);
+	qdelay = max_t(u64, q->qdelay_c, q->qdelay_l);
+	/* Alpha and beta take at most 32b, i.e, the delay difference would
+	 * overflow for queueing delay differences > ~4.2sec.
+	 */
+	delta = __scale_delta(((s64)qdelay - q->pi2.target) * q->pi2.alpha);
+	delta += __scale_delta(((s64)qdelay - qdelay_old) * q->pi2.beta);
+	new_prob = delta + q->pi2.prob;
+	/* Prevent overflow */
+	if (delta > 0) {
+		if (new_prob < q->pi2.prob)
+			new_prob = MAX_PROB;
+		/* Prevent underflow */
+	} else if (new_prob > q->pi2.prob) {
+		new_prob = 0;
+	}
+	/* If we do not drop on overload, ensure we cap the L4S probability to
+	 * 100% to keep window fairness when overflowing.
+	 */
+	if (!q->drop_overload)
+		return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+	return new_prob;
+}
+
+static void dualpi2_timer(struct timer_list *timer)
+{
+	struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+	struct Qdisc *sch = q->sch;
+	spinlock_t *root_lock; /* Lock to access the head of both queues. */
+
+	root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+	spin_lock(root_lock);
+
+	q->pi2.prob = calculate_probability(sch);
+	mod_timer(&q->pi2.timer, jiffies + q->pi2.tupdate);
+
+	spin_unlock(root_lock);
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+	[TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+	[TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+	[TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+	[TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+	[TCA_DUALPI2_BETA] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+	[TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+	[TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+	[TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+	unsigned int old_qlen, dropped = 0;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+	err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+					  dualpi2_policy, extack);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	if (tb[TCA_DUALPI2_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+		if (!limit) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+					    "limit must be greater than 0 !");
+			return -EINVAL;
+		}
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_DUALPI2_TARGET])
+		q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_TUPDATE]) {
+		u64 tupdate =
+			usecs_to_jiffies(nla_get_u32(tb[TCA_DUALPI2_TUPDATE]));
+
+		if (!tupdate) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+					    "tupdate cannot be 0 jiffies!");
+			return -EINVAL;
+		}
+		q->pi2.tupdate = tupdate;
+	}
+
+	if (tb[TCA_DUALPI2_ALPHA]) {
+		u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+		if (alpha > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+					    "alpha is too large!");
+			return -EINVAL;
+		}
+		q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+	}
+
+	if (tb[TCA_DUALPI2_BETA]) {
+		u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+		if (beta > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+					    "beta is too large!");
+			return -EINVAL;
+		}
+		q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+	}
+
+	if (tb[TCA_DUALPI2_STEP_THRESH])
+		q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_COUPLING]) {
+		u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+		if (!coupling) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+					    "Must use a non-zero coupling!");
+			return -EINVAL;
+		}
+		q->coupling_factor = coupling;
+	}
+
+	if (tb[TCA_DUALPI2_STEP_PACKETS])
+		q->step.in_packets = nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+		q->drop_overload = nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+	if (tb[TCA_DUALPI2_DROP_EARLY])
+		q->drop_early = nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+	if (tb[TCA_DUALPI2_C_PROTECTION]) {
+		u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+		if (wc > MAX_WC) {
+			NL_SET_ERR_MSG_ATTR(extack,
+					    tb[TCA_DUALPI2_C_PROTECTION],
+					    "c_protection must be <= 100!");
+			return -EINVAL;
+		}
+		dualpi2_calculate_c_protection(sch, q, wc);
+	}
+
+	if (tb[TCA_DUALPI2_ECN_MASK])
+		q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+	/* Drop excess packets if new limit is lower */
+	old_qlen = qdisc_qlen(sch);
+	while (qdisc_qlen(sch) > sch->limit) {
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+		dropped += qdisc_pkt_len(skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		rtnl_qdisc_drop(skb, sch);
+	}
+	qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), dropped);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+	q->sch->limit = 10000; /* Holds 125ms at 1G */
+
+	q->pi2.target = 15 * NSEC_PER_MSEC;
+	q->pi2.tupdate = usecs_to_jiffies(16 * USEC_PER_MSEC);
+	q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz in 1/256th */
+	q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.2 Hz in 1/256th */
+	/* These values give a 10dB stability margin with max_rtt=100ms */
+
+	q->step.thresh = 1 * NSEC_PER_MSEC; /* 1ms */
+	q->step.in_packets = false; /* Step in time not packets */
+
+	dualpi2_calculate_c_protection(q->sch, q, 10); /* Defaults to wc = 10 */
+
+	q->ecn_mask = INET_ECN_ECT_1; /* l4s-id */
+	q->coupling_factor = 2; /* window fairness for equal RTTs */
+	q->drop_overload = true; /* Preserve latency by dropping on overload */
+	q->drop_early = false; /* PI2 drop on dequeue */
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+				       TC_H_MAKE(sch->handle, 1), extack);
+	if (!q->l_queue)
+		return -ENOMEM;
+
+	q->sch = sch;
+	dualpi2_reset_default(q);
+	timer_setup(&q->pi2.timer, dualpi2_timer, 0);
+
+	if (opt) {
+		int err = dualpi2_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	mod_timer(&q->pi2.timer, (jiffies + HZ) >> 1);
+	return 0;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct nlattr *opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 step_thresh = q->step.thresh;
+	u64 target_usec = q->pi2.target;
+
+	if (!opts)
+		goto nla_put_failure;
+
+	do_div(target_usec, NSEC_PER_USEC);
+	if (!q->step.in_packets)
+		do_div(step_thresh, NSEC_PER_USEC);
+
+	if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TARGET, target_usec) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+			jiffies_to_usecs(q->pi2.tupdate)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+			dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_BETA,
+			dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, step_thresh) ||
+	    nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+	    nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+	    nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+	    nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay_c_usec = q->qdelay_c;
+	u64 qdelay_l_usec = q->qdelay_l;
+	struct tc_dualpi2_xstats st = {
+		.prob		= q->pi2.prob,
+		.packets_in_c	= q->packets_in_c,
+		.packets_in_l	= q->packets_in_l,
+		.maxq		= q->maxq,
+		.ecn_mark	= q->ecn_mark,
+		.credit		= q->c_protection.credit,
+		.step_marks	= q->step_marks,
+	};
+
+	do_div(qdelay_c_usec, NSEC_PER_USEC);
+	do_div(qdelay_l_usec, NSEC_PER_USEC);
+	st.delay_c = qdelay_c_usec;
+	st.delay_l = qdelay_l_usec;
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset_queue(sch);
+	qdisc_reset_queue(q->l_queue);
+	q->qdelay_c = 0;
+	q->qdelay_l = 0;
+	q->pi2.prob = 0;
+	q->packets_in_c = 0;
+	q->packets_in_l = 0;
+	q->maxq = 0;
+	q->ecn_mark = 0;
+	q->step_marks = 0;
+	dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->pi2.tupdate = 0;
+	del_timer_sync(&q->pi2.timer);
+	if (q->l_queue)
+		qdisc_put(q->l_queue);
+}
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+	.id = "dualpi2",
+	.priv_size	= sizeof(struct dualpi2_sched_data),
+	.enqueue	= dualpi2_qdisc_enqueue,
+	.dequeue	= dualpi2_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= dualpi2_init,
+	.destroy	= dualpi2_destroy,
+	.reset		= dualpi2_reset,
+	.change		= dualpi2_change,
+	.dump		= dualpi2_dump,
+	.dump_stats	= dualpi2_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+	return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+	unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper");
+MODULE_AUTHOR("Olga Albisser");
+MODULE_AUTHOR("Henrik Steen");
+MODULE_AUTHOR("Olivier Tilmans");
+MODULE_LICENSE("GPL");
-- 
2.22.0


^ permalink raw reply related

* [PATCH iproute2-next v2] tc: add dualpi2 scheduler module
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:39 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, davem@davemloft.net,
	netdev@vger.kernel.org, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen

From: Olga Albisser <olga@albisser.org>

DualPI2 AQM is a combination of the DualQ Coupled-AQM with a PI2
base-AQM, able to control scalable congestion controls like DCTCP
and TCP-Prague, implemented as a Linux qdisc.

This patch adds support to tc to configure it through its netlink
interface.

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Oliver Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v1 -> v2
      - Update against revised Netlink defines
      - Aligned bound checks with kernel ones
      - Simplified set of parameters

 include/uapi/linux/pkt_sched.h |  33 +++
 include/utils.h                |   8 +
 man/man8/tc-dualpi2.8          | 203 +++++++++++++++
 tc/Makefile                    |   1 +
 tc/q_dualpi2.c                 | 439 +++++++++++++++++++++++++++++++++
 5 files changed, 684 insertions(+)
 create mode 100644 man/man8/tc-dualpi2.8
 create mode 100644 tc/q_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f18529..5f31ba46 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;             /* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;             /* maximum queue size */
+	__u32 ecn_mark;         /* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/include/utils.h b/include/utils.h
index 794d3605..0c8f43e8 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -279,6 +279,14 @@ unsigned int print_name_and_link(const char *fmt,
 	_min1 < _min2 ? _min1 : _min2; })
 #endif
 
+#ifndef max
+# define max(x, y) ({			\
+	typeof(x) _max1 = (x);		\
+	typeof(y) _max2 = (y);		\
+	(void) (&_max1 == &_max2);	\
+	_max1 > _max2 ? _max1 : _max2; })
+#endif
+
 #ifndef __check_format_string
 # define __check_format_string(pos_str, pos_args) \
 	__attribute__ ((format (printf, (pos_str), (pos_args))))
diff --git a/man/man8/tc-dualpi2.8 b/man/man8/tc-dualpi2.8
new file mode 100644
index 00000000..d325ccda
--- /dev/null
+++ b/man/man8/tc-dualpi2.8
@@ -0,0 +1,203 @@
+.TH DUALPI2 8 "13 December 2018" "iproute2" "Linux"
+
+.SH NAME
+DUALPI2 \- Dual Queue Proportional Integral Controller AQM - Improved with a square
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.BR tc " " qdisc " ... " dualpi2
+.br
+.RB "[ " limit
+.IR PACKETS " ]"
+.br
+.RB "[ " coupling_factor
+.IR NUMBER " ]"
+.br
+.RB "[ " step_thresh
+.IR TIME | PACKETS " ]"
+.br
+.RB "[ " drop_on_overload " | " overflow " ]"
+.br
+.RB "[ " drop_enqueue " | " drop_dequeue " ]"
+.br
+.RB "[ " l4s_ect " | " any_ect " ]"
+.br
+.RB "[ " classic_protection
+.IR PERCENTAGE " ] "
+.br
+.RB "[ " max_rtt
+.IR TIME 
+.RB " [ " typical_rtt 
+.IR TIME " ]] "
+.br
+.RB "[ " target
+.IR TIME " ]"
+.br
+.RB "[ " tupdate
+.IR TIME " ]"
+.br
+.RB "[ " alpha
+.IR float " ]"
+.br
+.RB "[ " beta
+.IR float " ] "
+
+.SH DESCRIPTION
+DUALPI2 AQM is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM. The PI2 AQM (details can be found in the paper cited below) is in turn both an extension and a simplification of the PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being able to control scalable congestion controls like DCTCP and TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with DCTCP, maintaining window fairness. DUALQ provides latency separation between low latency DCTCP flows and Reno/Cubic flows that need a bigger queue. The main design goals are:
+.PD 0
+.IP \(bu 4
+L4S - Low Loss, Low Latency and Scalable congestion control support
+.IP \(bu 4
+DualQ option to separate the L4S traffic in a low latency queue, without harming remaining traffic that is scheduled in classic queue due to congestion-coupling
+.IP \(bu 4
+Configurable overload strategies
+.IP \(bu 4
+Use of sojourn time to reliably estimate queue delay
+.IP \(bu 4
+Simple implementation
+.IP \(bu 4
+Guaranteed stability and fast responsiveness
+.PD
+
+.SH ALGORITHM
+DUALPI2 is designed to provide low loss and low latency to L4S traffic, without harming classic traffic. Every update interval a new internal base probability is calculated, based on queue delay. The base probability is updated with a delta based on the difference between the current queue delay and the 
+.I "" target
+delay, and the queue growth comparing with the queuing delay during the previous 
+.I "" tupdate
+interval. The integral gain factor 
+.RB "" alpha
+is used to correct slowly enough any persistent standing queue error to the user specified target delay, while the proportional gain factor
+.RB "" beta
+is used to quickly compensate for queue changes (growth or shrink).
+
+The updated base probability is used as input to decide to mark and drop packets. DUALPI2 scales the calculated probability for each of the two queues accordingly. For the L4S queue, the probability is multiplied by a 
+.RB "" coupling_factor
+, while for the classic queue, it is squared to compensate the squareroot rate equation of Reno/Cubic. The ECT identifier (
+.RB "" l4s_ect | any_ect
+) is used to classify traffic into respective queues.
+
+If DUALPI2 AQM has detected overload (when excessive non-responsive traffic is sent), it can signal congestion solely using 
+.RB "" drop
+, irrespective of the ECN field, or alternatively limit the drop probability and let the queue grow and eventually 
+.RB "" overflow
+(like tail-drop).
+
+Additional details can be found in the draft cited below.
+
+.SH PARAMETERS
+.TP
+.BI limit " PACKETS"
+Limit the number of packets that can be enqueued. Incoming packets are dropped when this limit
+is reached. This limit is common for the L4S and Classic queue. Defaults to
+.I 10000
+packets. This is about 125ms delay on a 1Gbps link.
+.TP
+.BI coupling_factor " NUMBER"
+Set the coupling rate factor between Classic and L4S. Defaults to
+.I 2
+.TP
+.B l4s_ect | any_ect
+Configures the ECT classifier. Packets whose ECT codepoint matches this are sent to the L4S queue where they receive a scalable marking. Defaults to
+.I l4s_ect
+, i.e., the L4S identifier ECT(1). Setting this to
+.I any_ect
+causes all packets whose ECN field is not zero to be sent to the L4S queue. This enables to be backward compatible with, e.g., DCTCP.
+.PD
+.BI step_thresh " TIME | PACKETS"
+Set the step threshold for the L4S queue. This will cause packets with a sojourn time exceeding the threshold to always be marked. This value can either be specified using time units (i.e., us, ms, s), or in packets (pkt, packet(s)). A velue without units is assumed to be in time (us). If defining the step in packets, be sure to disable GRO on the ingress interfaces. Defaults to
+.I 1ms
+.
+.TP
+.B drop_on_overload  |  overflow
+Control the overload strategy. 
+.I drop_on_overload
+preserves the delay in the L4S queue by dropping in both queues on overload.
+.I overflow
+sacrifices delay to avoid losses, eventually resulting in a taildrop behavior once
+.I limit
+is reached. Defaults to
+.I drop_on_overload.
+.PD
+.TP
+.B drop_enqueue | drop_dequeue
+Decide when packets are PI-based dropped or marked. The
+.I step_thresh 
+based L4S marking is always at dequeue. Defaults to
+.I drop_dequeue
+.PD
+.TP
+.BI classic_protection " PERCENTAGE
+Protects the classic queue from unresponsive traffic in the L4S queue. This bounds the maximal delay in the C queue to be
+.I (100 - PERCENTAGE)
+times greater than the one in the L queue. Defaults to
+.I 10
+.TP
+.BI typical_rtt " TIME"
+.PD 0
+.TP
+.PD
+.BI max_rtt " TIME"
+Specify the maximum round trip time (RTT) and/or the typical RTT of the traffic
+that will be controlled by dualpi2. If either of
+.I max_rtt
+or
+.I typical_rtt
+is not specified, the missing value will be computed from the following 
+relationship:
+.I max_rtt = typical_rtt * 6.
+If any of these parameters is given, it will be used to automatically compute
+suitable values for
+.I alpha, beta, target, and tupdate,
+according to the relationship from the appendix A.1 in the IETF draft, to
+achieve a stable control. Consequently, those derived values will override their
+eventual user-provided ones. The default range of operation for the qdisc uses
+.I max_rtt = 100ms
+and 
+.I typical_rtt = 15ms
+, which is suited to control internet traffic.
+.TP
+.BI target " TIME"
+Set the expected queue delay. Defaults to
+.I 15
+ms.
+.TP
+.BI tupdate " TIME"
+Set the frequency at which the system drop probability is calculated. Defaults to
+.I 16
+ms. This should be a third of the max RTT supported.
+.TP
+.BI alpha " float"
+.PD 0
+.TP
+.PD
+.BI beta " float"
+Set alpha and beta, the integral and proportional gain factors in Hz for the PI controller. These can be calculated based on control theory. Defaults are
+.I 0.16
+and
+.I 3.2
+Hz, which provide stable control for RTT's up to 100ms with tupdate of 16. Be aware, unlike with PIE, these are the real unscaled gain factors.
+
+.SH EXAMPLES
+Setting DUALPI2 for the Internet with default parameters:
+ # sudo tc qdisc add dev eth0 root dualpi2
+
+Setting DUALPI2 for datacenter with legacy DCTCP using ECT(0):
+ # sudo tc qdisc add dev eth0 root dualpi2 any_ect
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-pie (8)
+
+.SH SOURCES
+.IP \(bu 4
+IETF draft submission is at https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+.IP \(bu 4
+CoNEXT '16 Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies : "PI2: A
+Linearized AQM for both Classic and Scalable TCP"
+
+.SH AUTHORS
+DUALPI2 was implemented by Koen De Schepper, Olga Albisser, Henrik Steen, and Olivier Tilmans also the authors of
+this man page. Please report bugs and corrections to the Linux networking
+development mailing list at <netdev@vger.kernel.org>.
diff --git a/tc/Makefile b/tc/Makefile
index 14171a28..35f02da3 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ SHARED_LIBS ?= y
 
 TCMODULES :=
 TCMODULES += q_fifo.o
+TCMODULES += q_dualpi2.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
diff --git a/tc/q_dualpi2.c b/tc/q_dualpi2.c
new file mode 100644
index 00000000..1fda2ec3
--- /dev/null
+++ b/tc/q_dualpi2.c
@@ -0,0 +1,439 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 Nokia.
+ *
+ * DualQ PI Improved with a Square (dualpi2)
+ * Supports controlling scalable congestion controls (DCTCP, etc...)
+ * Supports DualQ with PI2
+ * Supports L4S ECN identifier
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <math.h>
+#include <errno.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define MAX_PROB ((uint32_t)(~((uint32_t)0)))
+#define DEFAULT_ALPHA_BETA ((uint32_t)(~((uint32_t)0)))
+#define ALPHA_BETA_MAX ((2 << 23) - 1) /* see net/sched/sch_dualpi2.c */
+#define ALPHA_BETA_SCALE (1 << 8)
+#define RTT_TYP_TO_MAX 6
+
+enum {
+	INET_ECN_NOT_ECT = 0,
+	INET_ECN_ECT_1 = 1,
+	INET_ECN_ECT_0 = 2,
+	INET_ECN_CE = 3,
+	INET_ECN_MASK = 3,
+};
+
+static const char *get_ecn_type(uint8_t ect)
+{
+	switch (ect & INET_ECN_MASK) {
+		case INET_ECN_ECT_1: return "l4s_ect";
+		case INET_ECN_ECT_0:
+		case INET_ECN_MASK: return "any_ect";
+		default:
+			fprintf(stderr,
+				"Warning: Unexpected ecn type %u!\n", ect);
+			return "";
+	}
+}
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... dualpi2\n");
+	fprintf(stderr, "               [limit PACKETS]\n");
+	fprintf(stderr, "               [coupling_factor NUMBER]\n");
+	fprintf(stderr, "               [step_thresh TIME|PACKETS]\n");
+	fprintf(stderr, "               [drop_on_overload|overflow]\n");
+	fprintf(stderr, "               [drop_enqueue|drop_dequeue]\n");
+	fprintf(stderr, "               [classic_protection PERCENTAGE]\n");
+	fprintf(stderr, "               [max_rtt TIME [typical_rtt TIME]]\n");
+	fprintf(stderr, "               [target TIME] [tupdate TIME]\n");
+	fprintf(stderr, "               [alpha ALPHA] [beta BETA]\n");
+}
+
+static int get_float(float *val, const char *arg, float min, float max)
+{
+        float res;
+        char *ptr;
+
+        if (!arg || !*arg)
+                return -1;
+        res = strtof(arg, &ptr);
+        if (!ptr || ptr == arg || *ptr)
+                return -1;
+	if (res < min || res > max)
+		return -1;
+        *val = res;
+        return 0;
+}
+
+static int get_packets(uint32_t *val, const char *arg)
+{
+	unsigned long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtoul(arg, &ptr, 10);
+	if (!ptr || ptr == arg ||
+	    (strcmp(ptr, "pkt") && strcmp(ptr, "packet") &&
+	     strcmp(ptr, "packets")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res > 0xFFFFFFFFUL)
+		return -1;
+	*val = res;
+	return 0;
+}
+
+static int parse_alpha_beta(const char *name, char *argv, uint32_t *field)
+{
+
+	float field_f;
+
+	if (get_float(&field_f, argv, 0.0, ALPHA_BETA_MAX)) {
+		fprintf(stderr, "Illegal \"%s\"\n", name);
+		return -1;
+	}
+	else if (field_f < 1.0f / ALPHA_BETA_SCALE)
+		fprintf(stderr, "Warning: \"%s\" is too small and will be "
+			"rounded to zero.\n", name);
+	*field = (uint32_t)(field_f * ALPHA_BETA_SCALE);
+	return 0;
+}
+
+static int try_get_percentage(int *val, const char *arg, int base)
+{
+	long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtol(arg, &ptr, base);
+	if (!ptr || ptr == arg || (*ptr && strcmp(ptr, "%")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res < 0 || res > 100)
+		return -1;
+
+	*val = res;
+	return 0;
+}
+
+static int dualpi2_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			 struct nlmsghdr *n, const char* dev)
+{
+	uint32_t limit = 0;
+	uint32_t target = 0;
+	uint32_t tupdate = 0;
+	uint32_t alpha = DEFAULT_ALPHA_BETA;
+	uint32_t beta = DEFAULT_ALPHA_BETA;
+	int32_t coupling_factor = -1;
+	uint8_t ecn_mask = INET_ECN_NOT_ECT;
+	bool step_packets = false;
+	uint32_t step_thresh = 0;
+	int c_protection = -1;
+	int drop_early = -1;
+	int drop_overload = -1;
+	uint32_t rtt_max = 0;
+	uint32_t rtt_typ = 0;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "limit") == 0) {
+			NEXT_ARG();
+			if (get_u32(&limit, *argv, 10)) {
+				fprintf(stderr, "Illegal \"limit\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "target") == 0) {
+			NEXT_ARG();
+			if (get_time(&target, *argv)) {
+				fprintf(stderr, "Illegal \"target\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "tupdate") == 0) {
+			NEXT_ARG();
+			if (get_time(&tupdate, *argv)) {
+				fprintf(stderr, "Illegal \"tupdate\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "alpha") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("alpha", *argv, &alpha))
+				return -1;
+		} else if (strcmp(*argv, "beta") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("beta", *argv, &beta))
+				return -1;
+		} else if (strcmp(*argv, "coupling_factor") == 0) {
+			NEXT_ARG();
+			if (get_s32(&coupling_factor, *argv, 0) ||
+			    coupling_factor > 0xFFUL ||coupling_factor < 0) {
+				fprintf(stderr,
+					"Illegal \"coupling_factor\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "l4s_ect") == 0)
+			ecn_mask = INET_ECN_ECT_1;
+		else if (strcmp(*argv, "any_ect") == 0)
+			ecn_mask = INET_ECN_MASK;
+		else if (strcmp(*argv, "step_thresh") == 0) {
+			NEXT_ARG();
+			/* First assume that this is specified in time */
+			if (get_time(&step_thresh, *argv)) {
+				/* Then packets */
+				if (get_packets(&step_thresh, *argv)) {
+					fprintf(stderr,
+						"Illegal \"step_thresh\"\n");
+					return -1;
+				}
+				step_packets = true;
+			}
+		} else if (strcmp(*argv, "overflow") == 0) {
+                        drop_overload = 0;
+		} else if (strcmp(*argv, "drop_on_overload") == 0) {
+                        drop_overload = 1;
+		} else if (strcmp(*argv, "drop_enqueue") == 0) {
+			drop_early = 1;
+		} else if (strcmp(*argv, "drop_dequeue") == 0) {
+			drop_early = 0;
+		} else if (strcmp(*argv, "classic_protection") == 0) {
+                        NEXT_ARG();
+                        if (try_get_percentage(&c_protection, *argv, 10) ||
+			    c_protection > 100 ||
+			    c_protection < 0) {
+                                fprintf(stderr,
+					"Illegal \"classic_protection\"\n");
+                                return -1;
+                        }
+		} else if (strcmp(*argv, "max_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_max, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_max\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "typical_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_typ, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_typ\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		--argc;
+		++argv;
+	}
+
+	if (rtt_max || rtt_typ) {
+		float alpha_f, beta_f;
+		SPRINT_BUF(max_rtt_t);
+		SPRINT_BUF(typ_rtt_t);
+		SPRINT_BUF(tupdate_t);
+		SPRINT_BUF(target_t);
+
+		if (!rtt_typ)
+			rtt_typ = max(rtt_max / RTT_TYP_TO_MAX, 1U);
+		else if (!rtt_max)
+			rtt_max = rtt_typ * RTT_TYP_TO_MAX;
+		else if (rtt_typ > rtt_max) {
+			fprintf(stderr, "typical_rtt must be >= max_rtt!\n");
+			return -1;
+		}
+		if (alpha != DEFAULT_ALPHA_BETA || beta != DEFAULT_ALPHA_BETA ||
+		    tupdate || target)
+			fprintf(stderr, "rtt_max is specified, ignoring values "
+				"specified for alpha/beta/tupdate/target\n");
+		target = rtt_typ;
+		tupdate = max(min(rtt_typ, rtt_max / 3), 1U);
+		alpha_f = (double)tupdate / ((double)rtt_max * rtt_max)
+			* TIME_UNITS_PER_SEC * 0.1f;
+		beta_f = 0.3f / (float)rtt_max * TIME_UNITS_PER_SEC;
+		if (beta_f > ALPHA_BETA_MAX) {
+			fprintf(stderr, "max_rtt=%s is too low and cause beta "
+				"to overflow!\n",
+				sprint_time(rtt_max, max_rtt_t));
+			return -1;
+		}
+		if (alpha_f < 1.0f / ALPHA_BETA_SCALE ||
+		    beta_f < 1.0f / ALPHA_BETA_SCALE) {
+			fprintf(stderr, "max_rtt=%s is too large and will "
+				"cause alpha=%f and/or beta=%f to be rounded "
+				"down to 0!\n", sprint_time(rtt_max, max_rtt_t),
+				alpha_f, beta_f);
+			return -1;
+		}
+		fprintf(stderr, "Auto-configuring parameters using "
+			"[max_rtt: %s, typical_rtt: %s]: "
+			"target=%s tupdate=%s alpha=%f beta=%f\n",
+			sprint_time(rtt_max, max_rtt_t),
+			sprint_time(rtt_typ, typ_rtt_t),
+			sprint_time(target, target_t),
+			sprint_time(tupdate, tupdate_t), alpha_f, beta_f);
+		alpha = alpha_f * ALPHA_BETA_SCALE;
+		beta = beta * ALPHA_BETA_SCALE;
+	}
+
+	tail = addattr_nest(n, 1024, TCA_OPTIONS);
+	if (limit)
+		addattr32(n, 1024, TCA_DUALPI2_LIMIT, limit);
+	if (tupdate)
+		addattr32(n, 1024, TCA_DUALPI2_TUPDATE, tupdate);
+	if (target)
+		addattr32(n, 1024, TCA_DUALPI2_TARGET, target);
+	if (alpha != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_ALPHA, alpha);
+	if (beta != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_BETA, beta);
+	if (ecn_mask != INET_ECN_NOT_ECT)
+		addattr8(n, 1024, TCA_DUALPI2_ECN_MASK, ecn_mask);
+	if (drop_overload != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_OVERLOAD, drop_overload);
+	if (coupling_factor != -1)
+		addattr8(n, 1024, TCA_DUALPI2_COUPLING, coupling_factor);
+	if (step_thresh) {
+		addattr32(n, 1024, TCA_DUALPI2_STEP_THRESH, step_thresh);
+                addattr8(n, 1024, TCA_DUALPI2_STEP_PACKETS, step_packets);
+	}
+	if (drop_early != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_EARLY, drop_early);
+	if (c_protection != -1)
+		addattr8(n, 1024, TCA_DUALPI2_C_PROTECTION, c_protection);
+	addattr_nest_end(n, tail);
+	return 0;
+}
+
+static int dualpi2_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_DUALPI2_MAX + 1];
+	uint32_t tupdate;
+	uint32_t target;
+	uint32_t step_thresh;
+	bool step_packets = false;
+	SPRINT_BUF(b1);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_DUALPI2_MAX, opt);
+
+	if (tb[TCA_DUALPI2_LIMIT] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_LIMIT]) >= sizeof(__uint32_t))
+		fprintf(f, "limit %up ",
+			rta_getattr_u32(tb[TCA_DUALPI2_LIMIT]));
+	if (tb[TCA_DUALPI2_TARGET] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TARGET]) >= sizeof(__uint32_t)) {
+		target = rta_getattr_u32(tb[TCA_DUALPI2_TARGET]);
+		fprintf(f, "target %s ", sprint_time(target, b1));
+	}
+	if (tb[TCA_DUALPI2_TUPDATE] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TUPDATE]) >= sizeof(__uint32_t)) {
+		tupdate = rta_getattr_u32(tb[TCA_DUALPI2_TUPDATE]);
+		fprintf(f, "tupdate %s ", sprint_time(tupdate, b1));
+	}
+	if (tb[TCA_DUALPI2_ALPHA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ALPHA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "alpha %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_ALPHA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_BETA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_BETA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "beta %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_BETA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_ECN_MASK] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ECN_MASK]) >= sizeof(__u8))
+		fprintf(f, "%s ",
+			get_ecn_type(rta_getattr_u8(tb[TCA_DUALPI2_ECN_MASK])));
+	if (tb[TCA_DUALPI2_COUPLING] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_COUPLING]) >= sizeof(__u8))
+		fprintf(f, "coupling_factor %u ",
+			rta_getattr_u8(tb[TCA_DUALPI2_COUPLING]));
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_OVERLOAD]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]))
+			fprintf(f, "drop_on_overload ");
+		else
+			fprintf(f, "overflow ");
+	}
+	if (tb[TCA_DUALPI2_STEP_PACKETS] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_PACKETS]) >= sizeof(__u8) &&
+	    rta_getattr_u8(tb[TCA_DUALPI2_STEP_PACKETS]))
+                        step_packets = true;
+	if (tb[TCA_DUALPI2_STEP_THRESH] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_THRESH]) >= sizeof(__uint32_t)) {
+		step_thresh = rta_getattr_u32(tb[TCA_DUALPI2_STEP_THRESH]);
+		if (step_packets)
+			fprintf(f, "step_thresh %upkt ", step_thresh);
+		else
+			fprintf(f, "step_thresh %s ",
+				sprint_time(step_thresh, b1));
+	}
+	if (tb[TCA_DUALPI2_DROP_EARLY] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_EARLY]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_EARLY]))
+			fprintf(f, "drop_enqueue ");
+		else
+			fprintf(f, "drop_dequeue ");
+	}
+	if (tb[TCA_DUALPI2_C_PROTECTION] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_C_PROTECTION]) >= sizeof(__u8))
+                fprintf(f, "classic_protection %u%% ",
+			rta_getattr_u8(tb[TCA_DUALPI2_C_PROTECTION]));
+
+	return 0;
+}
+
+static int dualpi2_print_xstats(struct qdisc_util *qu, FILE *f,
+			    struct rtattr *xstats)
+{
+	struct tc_dualpi2_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+
+	st = RTA_DATA(xstats);
+	fprintf(f, "prob %f delay_c %uus delay_l %uus\n",
+		(double)st->prob / (double)MAX_PROB, st->delay_c, st->delay_l);
+	fprintf(f, "pkts_in_c %u pkts_in_l %u maxq %u\n",
+		st->packets_in_c, st->packets_in_l, st->maxq);
+	fprintf(f, "ecn_mark %u step_marks %u\n", st->ecn_mark, st->step_marks);
+	fprintf(f, "credit %d (%c)\n", st->credit, st->credit > 0 ? 'C' : 'L');
+	return 0;
+
+}
+
+struct qdisc_util dualpi2_qdisc_util = {
+	.id = "dualpi2",
+	.parse_qopt	= dualpi2_parse_opt,
+	.print_qopt	= dualpi2_print_opt,
+	.print_xstats	= dualpi2_print_xstats,
+};
-- 
2.22.0


^ permalink raw reply related

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-21  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190819111546.35a8ed76@cakuba.netronome.com>

On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote:

I'm on vacation and replying slowly. Sorry for any inconvenience.

> On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
>> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
>>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>>>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>>>> There's a certain allure in bringing the in-kernel BPF translation
>>>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>>>> it does seem like a task best handed in user space. bpfilter can replace
>>>>> iptables completely, here we're looking at an acceleration relatively
>>>>> loosely coupled with flower.
>>>>
>>>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>>>> is not so easy.
>>>>
>>>> Think about recent multi-mask support in flower. Previously userspace could
>>>> assume there is one mask and hash table for each preference in TC. After the
>>>> change TC accepts different masks with the same pref. Such a change tends to
>>>> break userspace emulation. It may ignore masks passed from flow insertion
>>>> and use the mask remembered when the first flow of the pref is inserted. It
>>>> may override the mask of all existing flows with the pref. It may fail to
>>>> insert such flows. Any of them would result in unexpected wrong datapath
>>>> handling which is critical.
>>>> I think such an emulation layer needs to be updated in sync with TC.
>>>
>>> Oh, so you're saying that if xdp_flow is merged all patches to
>>> cls_flower and netfilter which affect flow offload will be required
>>> to update xdp_flow as well?
>>
>> Hmm... you are saying that we are allowed to break other in-kernel
>> subsystem by some change? Sounds strange...
> 
> No I'm not saying that, please don't put words in my mouth.

If we ignore xdp_flow when modifying something which affects flow 
offload, that may cause breakage. I showed such an example using 
multi-mask support. So I just wondered what you mean and guessed you 
think we can break other subsystem in some situation.

I admit I should not have used the wording "you are saying...?". If it 
was not unpleasant to you I'm sorry about that. But I think you should 
not use it as well. I did not say "cls_flower and netfilter which affect 
flow offload will be required to update xdp_flow". I guess most patches 
which affect flow offload core will not break xdp_flow. In some cases 
breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
> 
> Having an implementation nor support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a concious decision here, hence the clarifying question.

As I described above, breakage can happen in some case, and if the patch 
breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
xdp_flow does not support newly added features but it works for existing 
ones, it is OK. In the first place not all features can be offloaded to 
xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
> 
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is 
that this needs to emulate kernel behavior, and it is difficult.
I don't think userspace-defined datapath itself is not reliable, nor 
eBPF ecosystem.

Toshiaki Makita

^ permalink raw reply

* [PATCH bpf-next 0/2] tools: bpftool: work with frozen maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This is a simple set to add support for BPF map freezing to bpftool. First
patch makes bpftool indicate if a map is frozen when listing BPF maps.
Second patch adds a command to freeze a map loaded on the system.

Quentin Monnet (2):
  tools: bpftool: show frozen status for maps
  tools: bpftool: add "bpftool map freeze" subcommand

 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 64 +++++++++++++++++--
 3 files changed, 71 insertions(+), 6 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH bpf-next 1/2] tools: bpftool: show frozen status for maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

When listing maps, read their "frozen" status from procfs, and tell if
maps are frozen.

As commit log for map freezing command mentions that the feature might
be extended with flags (e.g. for write-only instead of read-only) in the
future, use an integer and not a boolean for JSON output.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index bfbbc6b4cb83..af2e9eb9747b 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -481,9 +481,11 @@ static int parse_elem(char **argv, struct bpf_map_info *info,
 
 static int show_map_close_json(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	jsonw_start_object(json_wtr);
 
@@ -533,6 +535,12 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 	}
 	close(fd);
 
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+	jsonw_int_field(json_wtr, "frozen", frozen);
+
 	if (info->btf_id)
 		jsonw_int_field(json_wtr, "btf_id", info->btf_id);
 
@@ -555,9 +563,11 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 
 static int show_map_close_plain(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	printf("%u: ", info->id);
 	if (info->type < ARRAY_SIZE(map_type_name))
@@ -610,9 +620,23 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 				printf("\n\tpinned %s", obj->path);
 		}
 	}
+	printf("\n");
+
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+
+	if (!info->btf_id && !frozen)
+		return 0;
+
+	printf("\t");
 
 	if (info->btf_id)
-		printf("\n\tbtf_id %d", info->btf_id);
+		printf("btf_id %d", info->btf_id);
+
+	if (frozen)
+		printf("%sfrozen", info->btf_id ? "  " : "");
 
 	printf("\n");
 	return 0;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 2/2] tools: bpftool: add "bpftool map freeze" subcommand
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

Add a new subcommand to freeze maps from user space.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +--
 tools/bpf/bpftool/map.c                       | 34 ++++++++++++++++++-
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 61d1d270eb5e..1c0f7146aab0 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -36,6 +36,7 @@ MAP COMMANDS
 |	**bpftool** **map pop**        *MAP*
 |	**bpftool** **map enqueue**    *MAP* **value** *VALUE*
 |	**bpftool** **map dequeue**    *MAP*
+|	**bpftool** **map freeze**     *MAP*
 |	**bpftool** **map help**
 |
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -127,6 +128,14 @@ DESCRIPTION
 	**bpftool map dequeue**  *MAP*
 		  Dequeue and print **value** from the queue.
 
+	**bpftool map freeze**  *MAP*
+		  Freeze the map as read-only from user space. Entries from a
+		  frozen map can not longer be updated or deleted with the
+		  **bpf\ ()** system call. This operation is not reversible,
+		  and the map remains immutable from user space until its
+		  destruction. However, read and write permissions for BPF
+		  programs to the map remain unchanged.
+
 	**bpftool map help**
 		  Print short help message.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 2ffd351f9dbf..70493a6da206 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -449,7 +449,7 @@ _bpftool()
         map)
             local MAP_TYPE='id pinned'
             case $command in
-                show|list|dump|peek|pop|dequeue)
+                show|list|dump|peek|pop|dequeue|freeze)
                     case $prev in
                         $command)
                             COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
@@ -638,7 +638,7 @@ _bpftool()
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'delete dump getnext help \
                             lookup pin event_pipe show list update create \
-                            peek push enqueue pop dequeue' -- \
+                            peek push enqueue pop dequeue freeze' -- \
                             "$cur" ) )
                     ;;
             esac
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index af2e9eb9747b..de61d73b9030 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1262,6 +1262,35 @@ static int do_pop_dequeue(int argc, char **argv)
 	return err;
 }
 
+static int do_freeze(int argc, char **argv)
+{
+	int err, fd;
+
+	if (!REQ_ARGS(2))
+		return -1;
+
+	fd = map_parse_fd(&argc, &argv);
+	if (fd < 0)
+		return -1;
+
+	if (argc) {
+		close(fd);
+		return BAD_ARG();
+	}
+
+	err = bpf_map_freeze(fd);
+	close(fd);
+	if (err) {
+		p_err("failed to freeze map: %s", strerror(errno));
+		return err;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_help(int argc, char **argv)
 {
 	if (json_output) {
@@ -1286,6 +1315,7 @@ static int do_help(int argc, char **argv)
 		"       %s %s pop        MAP\n"
 		"       %s %s enqueue    MAP value VALUE\n"
 		"       %s %s dequeue    MAP\n"
+		"       %s %s freeze     MAP\n"
 		"       %s %s help\n"
 		"\n"
 		"       " HELP_SPEC_MAP "\n"
@@ -1304,7 +1334,8 @@ static int do_help(int argc, char **argv)
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -1326,6 +1357,7 @@ static const struct cmd cmds[] = {
 	{ "enqueue",	do_update },
 	{ "pop",	do_pop_dequeue },
 	{ "dequeue",	do_pop_dequeue },
+	{ "freeze",	do_freeze },
 	{ 0 }
 };
 
-- 
2.17.1


^ permalink raw reply related

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Jiong Wang @ 2019-08-21  9:05 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Alexei Starovoitov, Daniel Borkmann, Jiong Wang, bpf,
	linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <1566376025.68ldwx3wc7.naveen@linux.ibm.com>


Naveen N. Rao writes:

> Naveen N. Rao wrote:
>> Since BPF constant blinding is performed after the verifier pass, there
>> are certain ALU32 instructions inserted which don't have a corresponding
>> zext instruction inserted after. This is causing a kernel oops on
>> powerpc and can be reproduced by running 'test_cgroup_storage' with
>> bpf_jit_harden=2.
>> 
>> Fix this by emitting BPF_ZEXT during constant blinding if
>> prog->aux->verifier_zext is set.
>> 
>> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
>> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>> This approach (the location where zext is being introduced below, in 
>> particular) works for powerpc, but I am not entirely sure if this is 
>> sufficient for other architectures as well. This is broken on v5.3-rc4.
>
> Alexie, Daniel, Jiong,
> Any feedback on this?

The fix on BPF_LD | BPF_IMM | BPF_DW looks correct to me, but the two other
places looks to me is unnecessary, as those destinations are exposed to
external and if they are used as 64-bit then there will be zext inserted
for them.

Have you verified removing those two fixes will still cause the bug?

Regards,
Jiong

>
> - Naveen


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Igor Russkikh @ 2019-08-21  9:20 UTC (permalink / raw)
  To: Sabrina Dubroca, Antoine Tenart
  Cc: Andrew Lunn, davem@davemloft.net, f.fainelli@gmail.com,
	hkallweit1@gmail.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, thomas.petazzoni@bootlin.com,
	alexandre.belloni@bootlin.com, allan.nielsen@microchip.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>


> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).  At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.

I think there should be driver specific implementation of this functionality.
As an example, our macsec HW issues an interrupt towards the host to indicate
PN threshold has reached and it's time for userspace to change the keys.

In contrast, current SW macsec implementation just stops this SA/secy.

> I don't see how this implementation handles non-macsec traffic (on TX,
> that would be packets sent directly through the real interface, for
> example by wpa_supplicant - on RX, incoming MKA traffic for
> wpa_supplicant). Unless I missed something, incoming MKA traffic will
> end up on the macsec interface as well as the lower interface (not
> entirely critical, as long as wpa_supplicant can grab it on the lower
> device, but not consistent with the software implementation). How does
> the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?

I can comment on our HW engine - where it has special bypass rules
for configured ethertypes. This way macsec engine skips encryption on TX and
passes in RX unencrypted for the selected control packets.

But thats true, realdev driver is hard to distinguish encrypted/unencrypted
packets. In case realdev should make a decision where to put RX packet,
it only may do some heuristic (since after macsec decription all the
macsec related info is dropped. Thats true at least for our HW implementation).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).

I don't think this way is good, as driver have to do per packet header mangling.
That'll harm linerate performance heavily.

Regards,
   Igor

^ permalink raw reply

* Re: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Oliver Neukum @ 2019-08-21  9:08 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <1566339522507.45056@Dellteam.com>

Am Dienstag, den 20.08.2019, 22:18 +0000 schrieb
Charles.Hyde@dellteam.com:
> The core USB driver message.c is missing get/set address functionality

This should go into usbnet. The CDC parser is where it is because
it is needed for serial and network devices. As serial devices
do not have a MAC, this can go into usbnet.

> that stops ifconfig from being able to push MAC addresses out to USB
> based ethernet devices.  Without this functionality, some USB devices
> stop responding to ethernet packets when using ifconfig to change MAC
> addresses.  This has been tested with a Dell Universal Dock D6000.
> 
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-usb@vger.kernel.org
> ---
>  drivers/usb/core/message.c | 59 ++++++++++++++++++++++++++++++++++++++
>  include/linux/usb.h        |  3 ++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 5adf489428aa..eea775234b09 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1085,6 +1085,65 @@ int usb_clear_halt(struct usb_device *dev, int pipe)
>  }
>  EXPORT_SYMBOL_GPL(usb_clear_halt);
>  
> +/**
> + * usb_get_address - 
> + * @dev: device whose endpoint is halted

Which endpoint?

> + * @mac: buffer for containing 
> + * Context: !in_interrupt ()
> + *
> + * This will attempt to get the six byte MAC address from a USB device's
> + * ethernet controller using GET_NET_ADDRESS command.
> + *
> + * This call is synchronous, and may not be used in an interrupt context.
> + *
> + * Return: Zero on success, or else the status code returned by the

Well, I am afraid it will return 6 on success.

> + * underlying usb_control_msg() call.
> + */
> +int usb_get_address(struct usb_device *dev, unsigned char * mac)
> +{
> +	int ret = -ENOMEM;

Initialization is unnecessary here.

> +	unsigned char *tbuf = kmalloc(256, GFP_NOIO);

If you intentionally picked a safety margin of 42 times, this
is cool. Otherwise it is a litttle much.

> +
> +	if (!tbuf)
> +		return -ENOMEM;
> +
> +	ret = usb_control_msg(dev, usb_sndctrlpipe(dev, 0),
> +			USB_CDC_GET_NET_ADDRESS,
> +			USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE,
> +			0, USB_REQ_SET_ADDRESS, tbuf, 256,
> +			USB_CTRL_GET_TIMEOUT);
> +	if (ret == 6)
> +		memcpy(mac, tbuf, 6);

You cannot ignore the case of devices sending more or less than 6
bytes.

	Regards
		Oliver


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:24 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Antoine Tenart, Igor Russkikh, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>

Hi,

I can add some information to the HW Antoine is working on, general design of it
and the thoughts behind it. See below.

The 08/20/2019 16:41, Sabrina Dubroca wrote:
> 2019-08-20, 12:01:40 +0200, Antoine Tenart wrote:
> > So it seems the ability to enable or disable the offloading on a given
> > interface is the main missing feature. I'll add that, however I'll
> > probably (at least at first):
> > 
> > - Have the interface to be fully offloaded or fully handled in s/w (with
> >   errors being thrown if a given configuration isn't supported). Having
> >   both at the same time on a given interface would be tricky because of
> >   the MACsec validation parameter.
> > 
> > - Won't allow to enable/disable the offloading of there are rules in
> >   place, as we're not sure the same rules would be accepted by the other
> >   implementation.
> 
> That's probably quite problematic actually, because to do that you
> need to be able to resync the state between software and hardware,
> particularly packet numbers. So, yeah, we're better off having it
> completely blocked until we have a working implementation (if that
> ever happens).
> 
> However, that means in case the user wants to set up something that's
> not offloadable, they'll have to:
>  - configure the offloaded version until EOPNOTSUPP
>  - tear everything down
>  - restart from scratch without offloading
> 
> That's inconvenient.
> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).
New SA's are suppose to be installed ahead of time. The HW will automatic move
to the next SA and reset the PN.

> At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.
> 
> > I'm not sure if we should allow to mix the implementations on a given
> > physical interface (by having two MACsec interfaces attached) as the
> > validation would be impossible to do (we would have no idea if a
> > packet was correctly handled by the offloading part or just not being
> > a MACsec packet in the first place, in Rx).
> 
> That's something that really bothers me with this proposal. Quoting
> from the commit message:
> 
> > the packets seen by the networking stack on both the physical and
> > MACsec virtual interface are exactly the same
> 
> If the HW/driver is expected to strip the sectag, I don't see how we
> could ever have multiple secy's at the same time and demultiplex
> properly between them. That's part of the reason why I chose to have
> virtual interfaces: without them, picking the right secy on TX gets
> weird.

The HW does frame clasification, and use the claisfication to associate frames
to a given secy.

We we in SW have eth0, with 2 vlan-sub interfaces, and enable macsec on those,
then we have:

eth0
eth0.10
eth0.10.macsec
eth0.20
eth0.20.macsec

In this case the HW needs to be configured to match vlan 10 to one secy, and
vlan 20 to an other one.

This is nor supported in the current patch, but is something we can add later.
We just wanted to get the basic functionallity done right before moving on to
this.

But in the current design, there is nothing that prevent us from adding this.

If anyone is interested in the details of this then it is described in section
3.6.3 in http://ww1.microchip.com/downloads/en/DeviceDoc/VMDS-10455.pdf

It is possible to construct an encapsulation that the HW can not classify
correctly. If that is the case, then we should reject the HW offload, and use
the SW.

But it is a good point, which I think is missing. We should properly reject
MACsec HW offload (with this driver, in this state) on virtual interfaces, if
the encapsulation can not be handled (could start by reject all virtual
interfaces)

> AFAICT, it means that any filters (tc, tcpdump) need to be different
> between offloaded and non-offloaded cases.
It will see the result of the offloaded operation. But I guess that this is no
different from when 'tc' operations are offloaded to HW.

> How does the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?
It relay on frame classification (I think Antoine is missing a flow
configuration to bypass all MKA traffic).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).
I do not think this is possible with this HW, nor do I think this is desirable.

One of the big differences between MACsec and IPsec, is the fact that it is a L2
protocol, designed to be running on switches (this is how MACsec claims to limit
key-distributions issues, as all frames are decrypted when entering a switch,
and encrypted with a new key when leaving).

If a SW stack needs to add a tag to the frame, then we cannot offload traffic
being bridged in HW, and never goes to the CPU.

/Allan

^ permalink raw reply

* Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Andy Shevchenko @ 2019-08-21  9:25 UTC (permalink / raw)
  To: Joseph Qi
  Cc: Mark Fasheh, Joel Becker, ocfs2-devel, Ariel Elior,
	Sudarsana Kalluru, GR-everest-linux-l2, David S. Miller, netdev,
	Colin Ian King
In-Reply-To: <1a3e6660-10d2-e66c-2880-24af64c7f120@linux.alibaba.com>

On Wed, Aug 21, 2019 at 09:29:04AM +0800, Joseph Qi wrote:
> On 19/8/21 00:31, Andy Shevchenko wrote:
> > There are users already and will be more of BITS_TO_BYTES() macro.
> > Move it to bitops.h for wider use.
> > 
> > Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h | 1 -
> >  fs/ocfs2/dlm/dlmcommon.h                         | 4 ----
> >  include/linux/bitops.h                           | 1 +
> >  3 files changed, 1 insertion(+), 5 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > index 066765fbef06..0a59a09ef82f 100644
> > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > @@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(struct bnx2x *bp, enum cos_mode mode,
> >   *    possible, the driver should only write the valid vnics into the internal
> >   *    ram according to the appropriate port mode.
> >   */
> > -#define BITS_TO_BYTES(x) ((x)/8)>
> I don't think this is a equivalent replace, or it is in fact
> wrong before?

I was thinking about this one and there are two applications:
- calculus of the amount of structures of certain type per PAGE
  (obviously off-by-one error in the original code IIUC purpose of STRUCT_SIZE)
- calculus of some threshold based on line speed in bytes per second
  (I dunno it will have any difference on the Gbs / 100 MBs speeds)

> >  /* CMNG constants, as derived from system spec calculations */
> >  
> > diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
> > index aaf24548b02a..0463dce65bb2 100644
> > --- a/fs/ocfs2/dlm/dlmcommon.h
> > +++ b/fs/ocfs2/dlm/dlmcommon.h
> > @@ -688,10 +688,6 @@ struct dlm_begin_reco
> >  	__be32 pad2;
> >  };
> >  
> > -
> > -#define BITS_PER_BYTE 8
> > -#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE)
> > -
> For ocfs2 part, it looks good to me.
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Thanks!

> 
> >  struct dlm_query_join_request
> >  {
> >  	u8 node_idx;
> > diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> > index cf074bce3eb3..79d80f5ddf7b 100644
> > --- a/include/linux/bitops.h
> > +++ b/include/linux/bitops.h
> > @@ -5,6 +5,7 @@
> >  #include <linux/bits.h>
> >  
> >  #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
> > +#define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_BYTE)
> >  #define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
> >  
> >  extern unsigned int __sw_hweight8(unsigned int w);
> > 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:27 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: Sabrina Dubroca, Antoine Tenart, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <81ec0497-58cd-1f4c-faa3-c057693cd50e@aquantia.com>

The 08/21/2019 09:20, Igor Russkikh wrote:
> > Talking about packet numbers, can you describe how PN exhaustion is
> > handled?  I couldn't find much about packet numbers at all in the
> > driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> > the same SA).  At some point userspace needs to know that we're
> > getting close to 2^32 and that it's time to re-key.  Since the whole
> > TX path of the software implementation is bypassed, it looks like the
> > PN (as far as drivers/net/macsec.c is concerned) never increases, so
> > userspace can't know when to negotiate a new SA.
> 
> I think there should be driver specific implementation of this functionality.
> As an example, our macsec HW issues an interrupt towards the host to indicate
> PN threshold has reached and it's time for userspace to change the keys.
> 
> In contrast, current SW macsec implementation just stops this SA/secy.
> 
> > I don't see how this implementation handles non-macsec traffic (on TX,
> > that would be packets sent directly through the real interface, for
> > example by wpa_supplicant - on RX, incoming MKA traffic for
> > wpa_supplicant). Unless I missed something, incoming MKA traffic will
> > end up on the macsec interface as well as the lower interface (not
> > entirely critical, as long as wpa_supplicant can grab it on the lower
> > device, but not consistent with the software implementation). How does
> > the driver distinguish traffic that should pass through unmodified
> > from traffic that the HW needs to encapsulate and encrypt?
> 
> I can comment on our HW engine - where it has special bypass rules
> for configured ethertypes. This way macsec engine skips encryption on TX and
> passes in RX unencrypted for the selected control packets.
In our case it is a TCAM which can look at various fields (including ethertype),
but is sounds like we have a vary similar design.

> But thats true, realdev driver is hard to distinguish encrypted/unencrypted
> packets. In case realdev should make a decision where to put RX packet,
> it only may do some heuristic (since after macsec decription all the
> macsec related info is dropped. Thats true at least for our HW implementation).
Same for ours.

> > If you look at IPsec offloading, the networking stack builds up the
> > ESP header, and passes the unencrypted data down to the driver. I'm
> > wondering if the same would be possible with MACsec offloading: the
> > macsec virtual interface adds the header (and maybe a dummy ICV), and
> > then the HW does the encryption. In case of HW that needs to add the
> > sectag itself, the driver would first strip the headers that the stack
> > created. On receive, the driver would recreate a sectag and the macsec
> > interface would just skip all verification (decrypt, PN).
> 
> I don't think this way is good, as driver have to do per packet header mangling.
> That'll harm linerate performance heavily.
Agree, and it will also prevent MACsec offload in offloaded bridge devices.

/Allan

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: Delete the VID from the upstream port as well
From: Vladimir Oltean @ 2019-08-21  9:51 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
	nikolay, David S. Miller, netdev
In-Reply-To: <20190820233026.GC21067@t480s.localdomain>

On Wed, 21 Aug 2019 at 06:30, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> On Wed, 21 Aug 2019 01:09:39 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> > I mean I made an argument already for the hack in 4/6 ("Don't program
> > the VLAN as pvid on the upstream port"). If the hack gets accepted
> > like that, I have no further need of any change in the implicit VLAN
> > configuration. But it's still a hack, so in that sense it would be
> > nicer to not need it and have a better amount of control.
>
> How come you simply cannot ignore the PVID flag for the CPU port in the
> driver directly, as mv88e6xxx does in preference of the Marvell specific
> "unmodified" mode? What PVID are you programming on the CPU port already?
>
>
> Thanks,
>
>         Vivien

sja1105 has no such thing as an "unmodified" egress policy.
And ignoring the flags in the switch driver for the CPU port is just
as hacky as fixing it up in the DSA core.
I fail to see any reason to change the pvid for the CPU/DSA ports,
maybe you can clarify.

Actually I gave a second thought to the patchset and in a weird,
convoluted way, I can get away with just:
- 2/6: net: bridge: Populate the pvid flag in br_vlan_get_info
- 5/6 net: dsa: Allow proper internal use of VLANs
- 6/6 net: dsa: tag_8021q: Restore bridge pvid when enabling vlan_filtering
A side effect of running dsa_port_restore_pvid on a user port will be
that it is going to also restore the pvid on the CPU port, via the
bitmap operations. I had not thought of this initially when I first
jumped to remove the BRIDGE_VLAN_INFO_PVID flag in 4/6. And the fact
that it would work would just be "programming by coincidence".

I guess my aversion against the VLAN bitmap operations (and, well,
"match" in your new change) stems from the fact that the DSA core sits
in the way of doing custom configuration of the CPU port VLAN
settings. Yes, you can work around it (dsa_8021q first configures the
user ports as pvid and egress untagged, then configures the CPU port
as egress tagged, so that the pvid setting from user ports doesn't
"stick" to the CPU via the bitmap), but it's like pouring water that's
half hot and half cold from a water cooler, when all you want is water
that's at room temperature.

-Vladimir

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Hubert Feurstein @ 2019-08-21  9:53 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190821080709.GO891@localhost>

Am Mi., 21. Aug. 2019 um 10:07 Uhr schrieb Miroslav Lichvar
<mlichvar@redhat.com>:
> > Currently I do not see the benefit from this. The original intention was to
> > compensate for the remaining offset as good as possible.
>
> That's ok, but IMHO the change should not break the assumptions of
> existing application and users.
>
> > The current code
> > of phc2sys uses the delay only for the filtering of the measurement record
> > with the shortest delay and for reporting and statistics. Why not simple shift
> > the timestamps with the offset to the point where we expect the PHC timestamp
> > to be captured, and we have a very good result compared to where we came
> > from.
>
> Because those reports/statistics are important in calculation of
> maximum error. If someone had a requirement for a clock to be accurate
> to 1.5 microseconds and the ioctl returned a delay indicating a
> sufficient accuracy when in reality it could be worse, that would be a
> problem.
>
Ok, I understand your point. But including the MDIO completion into
delay calculation
will indicate a much wore precision as it actually is. When the MDIO
driver implements
the PTP system timestamping as follows ...

  ptp_read_system_prets(bus->ptp_sts);
  writel(value, mdio-reg)
  ptp_read_system_postts(bus->ptp_sts);

... then we catch already the error caused by interrupts which hit the
pre/post_ts section.
Now we only have the additional error of one MDIO clock cycle
(~400ns). Because I expect
the MDIO controller to shift out the MDIO frame on the next MDIO clock
cycle. So if I subtract
one MDIO clock cycle from pre_ts and add one MDIO clock cycle to
post_ts the error indication
would be sufficiently corrected IMHO. And then we can shift both
timestamps in the switch driver
(in the gettimex handler) to compensate the switch depending offset.
What do you think?

Hubert

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox