Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next 6/7] selftests: mlxsw: Add test cases for devlink-trap L2 drops
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test that each supported packet trap is triggered under the right
conditions and that packets are indeed dropped and not forwarded.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../net/mlxsw/devlink_trap_l2_drops.sh        | 484 ++++++++++++++++++
 1 file changed, 484 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
new file mode 100755
index 000000000000..5dcdfa20fc6c
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_l2_drops.sh
@@ -0,0 +1,484 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test devlink-trap L2 drops functionality over mlxsw. Each registered L2 drop
+# packet trap is tested to make sure it is triggered under the right
+# conditions.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	source_mac_is_multicast_test
+	vlan_tag_mismatch_test
+	ingress_vlan_filter_test
+	ingress_stp_filter_test
+	port_list_is_empty_test
+	port_loopback_filter_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+
+	tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+	tc qdisc del dev $swp2 clsact
+
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+l2_drops_test()
+{
+	local trap_name=$1; shift
+	local group_name=$1; shift
+
+	# This is the common part of all the tests. It checks that stats are
+	# initially idle, then non-idle after changing the trap action and
+	# finally idle again. It also makes sure the packets are dropped and
+	# never forwarded.
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle with initial drop action"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with initial drop action"
+
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_fail $? "Trap stats idle after setting action to trap"
+	devlink_trap_group_stats_idle_test $group_name
+	check_fail $? "Trap group stats idle after setting action to trap"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle after setting action to drop"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle after setting action to drop"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_err $? "Packets were not dropped"
+}
+
+l2_drops_cleanup()
+{
+	local mz_pid=$1; shift
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+	tc filter del dev $swp2 egress protocol ip pref 1 handle 101 flower
+}
+
+source_mac_is_multicast_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local group_name="l2_drops"
+	local mz_pid
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower src_mac $smac action drop
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -d 1msec -q &
+	mz_pid=$!
+
+	RET=0
+
+	l2_drops_test $trap_name $group_name
+
+	log_test "Source MAC is multicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+__vlan_tag_mismatch_test()
+{
+	local trap_name="vlan_tag_mismatch"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local opt=$1; shift
+	local mz_pid
+
+	# Remove PVID flag. This should prevent untagged and prio-tagged
+	# packets from entering the bridge.
+	bridge vlan add vid 1 dev $swp1 untagged master
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 "$opt" -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add PVID and make sure packets are no longer dropped.
+	bridge vlan add vid 1 dev $swp1 pvid untagged master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+}
+
+vlan_tag_mismatch_untagged_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test
+
+	log_test "VLAN tag mismatch - untagged packets"
+}
+
+vlan_tag_mismatch_vid_0_test()
+{
+	RET=0
+
+	__vlan_tag_mismatch_test "-Q 0"
+
+	log_test "VLAN tag mismatch - prio-tagged packets"
+}
+
+vlan_tag_mismatch_test()
+{
+	vlan_tag_mismatch_untagged_test
+	vlan_tag_mismatch_vid_0_test
+}
+
+ingress_vlan_filter_test()
+{
+	local trap_name="ingress_vlan_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+	local vid=10
+
+	bridge vlan add vid $vid dev $swp2 master
+	# During initialization the firmware enables all the VLAN filters and
+	# the driver does not turn them off since the traffic will be discarded
+	# by the STP filter whose default is DISCARD state. Add the VID on the
+	# ingress bridge port and then remove it to make sure it is not member
+	# in the VLAN.
+	bridge vlan add vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp1 master
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Add the VLAN on the bridge port and make sure packets are no longer
+	# dropped.
+	bridge vlan add vid $vid dev $swp1 master
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Ingress VLAN filter"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+__ingress_stp_filter_test()
+{
+	local trap_name="ingress_spanning_tree_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local state=$1; shift
+	local mz_pid
+	local vid=20
+
+	bridge vlan add vid $vid dev $swp2 master
+	bridge vlan add vid $vid dev $swp1 master
+	ip link set dev $swp1 type bridge_slave state $state
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -Q $vid -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Change STP state to forwarding and make sure packets are no longer
+	# dropped.
+	ip link set dev $swp1 type bridge_slave state 3
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	l2_drops_cleanup $mz_pid
+
+	bridge vlan del vid $vid dev $swp1 master
+	bridge vlan del vid $vid dev $swp2 master
+}
+
+ingress_stp_filter_listening_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - listening state"
+}
+
+ingress_stp_filter_learning_test()
+{
+	local state=$1; shift
+
+	RET=0
+
+	__ingress_stp_filter_test $state
+
+	log_test "Ingress STP filter - learning state"
+}
+
+ingress_stp_filter_test()
+{
+	ingress_stp_filter_listening_test 1
+	ingress_stp_filter_learning_test 2
+}
+
+port_list_is_empty_uc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Disable unicast flooding on both ports, so that packets cannot egress
+	# any port.
+	ip link set dev $swp1 type bridge_slave flood off
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - unicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave flood on
+}
+
+port_list_is_empty_mc_test()
+{
+	local trap_name="port_list_is_empty"
+	local dmac=01:00:5e:00:00:01
+	local group_name="l2_drops"
+	local dip=239.0.0.1
+	local mz_pid
+
+	# Disable multicast flooding on both ports, so that packets cannot
+	# egress any port. We also need to flush IP addresses from the bridge
+	# in order to prevent packets from being flooded to the router port.
+	ip link set dev $swp1 type bridge_slave mcast_flood off
+	ip link set dev $swp2 type bridge_slave mcast_flood off
+	ip address flush dev br0
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -B $dip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded to one port.
+	ip link set dev $swp2 type bridge_slave mcast_flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port list is empty - multicast"
+
+	l2_drops_cleanup $mz_pid
+
+	ip link set dev $swp1 type bridge_slave mcast_flood on
+}
+
+port_list_is_empty_test()
+{
+	port_list_is_empty_uc_test
+	port_list_is_empty_mc_test
+}
+
+port_loopback_filter_uc_test()
+{
+	local trap_name="port_loopback_filter"
+	local dmac=de:ad:be:ef:13:37
+	local group_name="l2_drops"
+	local mz_pid
+
+	# Make sure packets can only egress the input port.
+	ip link set dev $swp2 type bridge_slave flood off
+
+	RET=0
+
+	tc filter add dev $swp2 egress protocol ip pref 1 handle 101 \
+		flower dst_mac $dmac action drop
+
+	$MZ $h1 -c 0 -p 100 -a own -b $dmac -t ip -d 1msec -q &
+	mz_pid=$!
+
+	l2_drops_test $trap_name $group_name
+
+	# Allow packets to be flooded.
+	ip link set dev $swp2 type bridge_slave flood on
+	devlink_trap_action_set $trap_name "trap"
+
+	devlink_trap_stats_idle_test $trap_name
+	check_err $? "Trap stats not idle when packets should not be dropped"
+	devlink_trap_group_stats_idle_test $group_name
+	check_err $? "Trap group stats not idle with when packets should not be dropped"
+
+	tc_check_packets "dev $swp2 egress" 101 0
+	check_fail $? "Packets not forwarded when should"
+
+	devlink_trap_action_set $trap_name "drop"
+
+	log_test "Port loopback filter - unicast"
+
+	l2_drops_cleanup $mz_pid
+}
+
+port_loopback_filter_test()
+{
+	port_loopback_filter_uc_test
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 7/7] selftests: mlxsw: Add a test case for devlink-trap
From: Ido Schimmel @ 2019-08-21  7:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20190821071937.13622-1-idosch@idosch.org>

From: Ido Schimmel <idosch@mellanox.com>

Test generic devlink-trap functionality over mlxsw. These tests are not
specific to a single trap, but do not check the devlink-trap common
infrastructure either.

Currently, the only test case is device deletion (by reloading the
driver) while packets are being trapped.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 .../drivers/net/mlxsw/devlink_trap.sh         | 129 ++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
new file mode 100755
index 000000000000..89b55e946eed
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap.sh
@@ -0,0 +1,129 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test generic devlink-trap functionality over mlxsw. These tests are not
+# specific to a single trap, but do not check the devlink-trap common
+# infrastructure either.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	dev_del_test
+"
+NUM_NETIFS=4
+source $lib_dir/tc_common.sh
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	simple_if_init $h2
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2
+}
+
+switch_create()
+{
+	ip link add dev br0 type bridge vlan_filtering 1 mcast_snooping 0
+
+	ip link set dev $swp1 master br0
+	ip link set dev $swp2 master br0
+
+	ip link set dev br0 up
+	ip link set dev $swp1 up
+	ip link set dev $swp2 up
+}
+
+switch_destroy()
+{
+	ip link set dev $swp2 down
+	ip link set dev $swp1 down
+
+	ip link del dev br0
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+dev_del_test()
+{
+	local trap_name="source_mac_is_multicast"
+	local smac=01:02:03:04:05:06
+	local num_iter=5
+	local mz_pid
+	local i
+
+	$MZ $h1 -c 0 -p 100 -a $smac -b bcast -t ip -q &
+	mz_pid=$!
+
+	# The purpose of this test is to make sure we correctly dismantle a
+	# port while packets are trapped from it. This is done by reloading the
+	# the driver while the 'ingress_smac_mc_drop' trap is triggered.
+	RET=0
+
+	for i in $(seq 1 $num_iter); do
+		log_info "Iteration $i / $num_iter"
+
+		devlink_trap_action_set $trap_name "trap"
+		sleep 1
+
+		devlink_reload
+		# Allow netdevices to be re-created following the reload
+		sleep 20
+
+		cleanup
+		setup_prepare
+		setup_wait
+	done
+
+	log_test "Device delete"
+
+	kill $mz_pid && wait $mz_pid &> /dev/null
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.21.0


^ permalink raw reply related

* RE: Help needed - Kernel lockup while running ipsec
From: Vakul Garg @ 2019-08-21  7:37 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620B6ACB01BFA338ADAED048BAB0@DB7PR04MB4620.eurprd04.prod.outlook.com>



> -----Original Message-----
> From: Vakul Garg
> Sent: Tuesday, August 20, 2019 4:08 PM
> To: Florian Westphal <fw@strlen.de>
> Cc: netdev@vger.kernel.org
> Subject: RE: Help needed - Kernel lockup while running ipsec
> 
> 
> 
> >
> > > -----Original Message-----
> > > From: Florian Westphal <fw@strlen.de>
> > > Sent: Tuesday, August 20, 2019 3:08 PM
> > > To: Vakul Garg <vakul.garg@nxp.com>
> > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > >
> > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Florian Westphal <fw@strlen.de>
> > > > > Sent: Tuesday, August 20, 2019 2:53 PM
> > > > > To: Vakul Garg <vakul.garg@nxp.com>
> > > > > Cc: Florian Westphal <fw@strlen.de>; netdev@vger.kernel.org
> > > > > Subject: Re: Help needed - Kernel lockup while running ipsec
> > > > >
> > > > > Vakul Garg <vakul.garg@nxp.com> wrote:
> > > > > > > > With kernel 4.14.122, I am getting a kernel softlockup while
> > > > > > > > running single
> > > > > > > static ipsec tunnel.
> > > > > > > > The problem reproduces mostly after running 8-10 hours of
> > > > > > > > ipsec encap
> > > > > > > test (on my dual core arm board).
> > > > > > > >
> > > > > > > > I found that in function xfrm_policy_lookup_bytype(), the
> > > > > > > > policy in variable
> > > > > > > 'ret' shows refcnt=0 under problem situation.
> > > > > > > > This creates an infinite loop in  xfrm_policy_lookup_bytype()
> > > > > > > > and hence the
> > > > > > > lockup.
> > > > > > > >
> > > > > > > > Can some body please provide me pointers about 'refcnt'?
> > > > > > > > Is it legitimate for 'refcnt' to become '0'? Under what
> > > > > > > > condition can it
> > > > > > > become '0'?
> > > > > > >
> > > > > > > Yes, when policy is destroyed and the last user calls
> > > > > > > xfrm_pol_put() which will invoke call_rcu to free the structure.
> > > > > >
> > > > > > It seems that policy reference count never gets decremented during
> > > > > > packet
> > > > > ipsec encap.
> > > > > > It is getting incremented for every frame that hits the policy.
> > > > > > In setkey -DP output, I see refcnt to be wrapping around after '0'.
> > > > >
> > > > > Thats a bug.  Does this affect 4.14 only or does this happen on
> > > > > current tree as well?
> > > >
> > > > I am yet to try it on 4.19.
> > > > Can you help me with the right fix? Which part of code should it get
> > > decremented?
> > > > I am not conversant with xfrm code.
> > >
> > > Normally policy reference counts get decremented when the skb is
> free'd,
> > via
> > > dst destruction (xfrm_dst_destroy()).
> > >
> > > Do you see a dst leak as well?
> >
> > Can you please guide me how to detect it?
> >
> > (I am checking refcount on recent kernel and will let you know.)
> 
> Policy refcount is decreasing properly on 4.19.
> Same should be on the latest kernel too.

On kernel-4.14, I find dst_release() is getting called through xfrm_output_one().
However since dst->__refcnt gets decremented to '1', 
the call_rcu(&dst->rcu_head, dst_destroy_rcu) is not invoked. 

On kernel-4.19, dst->__refcnt gets decremented to '0', hence things fall in place and 
dst_destroy_rcu() eventually executes.

Any further help/pointers for kernel-4.14 would be deeply appreciated.




^ permalink raw reply

* Re: [PATCH] `iwlist scan` fails with many networks available
From: Johannes Berg @ 2019-08-21  7:59 UTC (permalink / raw)
  To: James Nylen; +Cc: David S. Miller, linux-wireless, netdev, linux-kernel
In-Reply-To: <CABVa4Nga1vyvyWNpTTJLa44rZo8wu4-bE=mXX1nZgvzktbSq6A@mail.gmail.com>

On Tue, 2019-08-13 at 00:43 +0000, James Nylen wrote:
> > I suppose we could consider applying a workaround like this if it has a
> > condition checking that the buffer passed in is the maximum possible
> > buffer (65535 bytes, due to iw_point::length being u16)
> 
> This is what the latest patch does (attached to my email from
> yesterday / https://lkml.org/lkml/2019/8/10/452 ).

Hmm, yes, you're right. I evidently missed the comparisons to 0xFFFF
there, sorry about that.

> If you'd like to apply it, I'm happy to make any needed revisions.
> Otherwise I'm going to have to keep patching my kernels for this
> issue, unfortunately I don't have the time to try to get wicd to
> migrate to a better solution.

Not sure which would be easier, but ok :-)

Can you please fix the patch to
 1) use /* */ style comments (see
    https://www.kernel.org/doc/html/latest/process/coding-style.html)

 2) remove extra braces (also per coding style)

 3) use U16_MAX instead of 0xFFFF

I'd also consider renaming "maybe_current_ev" to "next_ev" or something
shorter anyway, and would probably argue that rewriting this

> +		if (IS_ERR(maybe_current_ev)) {
> +			err = PTR_ERR(maybe_current_ev);
> +			if (err == -E2BIG) {
> +				// Last BSS failed to copy into buffer.  As
> +				// above, only report an error if `iwlist` will
> +				// retry again with a larger buffer.
> +				if (len >= 0xFFFF) {
> +					err = 0;
> +				}
> +			}
>  			break;
> +		} else {
> +			current_ev = maybe_current_ev;
>  		}


to something like

	next_ev = ...
	if (IS_ERR(next_ev)) {
		err = PTR_ERR(next_ev);
		/* mask error and truncate in case buffer cannot be
                 * increased
                 */
		if (err == -E2BIG && len < U16_MAX)
			err = 0;
		break;
	}

	current_ev = next_ev;


could be more readable, but that's just editorial really.

Thanks,
johannes


^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Miroslav Lichvar @ 2019-08-21  8:07 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <CAFfN3gUgpzMebyUt8_-9e+5vpm3q-DVVszWdkUEFAgZQ8ex73w@mail.gmail.com>

On Tue, Aug 20, 2019 at 06:56:56PM +0200, Hubert Feurstein wrote:
> Am Di., 20. Aug. 2019 um 17:40 Uhr schrieb Miroslav Lichvar
> > I think a large jitter is ok in this case. We just need to timestamp
> > something that we know for sure happened after the PHC timestamp. It
> > should have no impact on the offset and its stability, just the
> > reported delay. A test with phc2sys should be able to confirm that.
> > phc2sys selects the measurement with the shortest delay, which has
> > least uncertainty. I'd say that applies to both interrupt and polling.
> >
> > If it is difficult to specify the minimum interrupt delay, I'd still
> > prefer an overly pessimistic interval assuming a zero delay.
> >
> Currently I do not see the benefit from this. The original intention was to
> compensate for the remaining offset as good as possible.

That's ok, but IMHO the change should not break the assumptions of
existing application and users.

> The current code
> of phc2sys uses the delay only for the filtering of the measurement record
> with the shortest delay and for reporting and statistics. Why not simple shift
> the timestamps with the offset to the point where we expect the PHC timestamp
> to be captured, and we have a very good result compared to where we came
> from.

Because those reports/statistics are important in calculation of
maximum error. If someone had a requirement for a clock to be accurate
to 1.5 microseconds and the ioctl returned a delay indicating a
sufficient accuracy when in reality it could be worse, that would be a
problem.

BTW, phc2sys is not the only user of the ioctl.

-- 
Miroslav Lichvar

^ permalink raw reply

* RE: [PATCH v3] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-08-21  8:16 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo, David S . Miller
  Cc: linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com
In-Reply-To: <20190820045934.24841-1-jian-hong@endlessm.com>

Hi,

> From: Jian-Hong Pan [mailto:jian-hong@endlessm.com]
> 
> There is a mass of jobs between spin lock and unlock in the hardware
> IRQ which will occupy much time originally. To make system work more
> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> reduce the time in hardware IRQ.
> 
> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> ---
> v2:
>  Change the spin_lock_irqsave/unlock_irqrestore to spin_lock/unlock in
>  rtw_pci_interrupt_handler. Because the interrupts are already disabled
>  in the hardware interrupt handler.
> 
> v3:
>  Extend the spin lock protecting area for the TX path in
>  rtw_pci_interrupt_threadfn by Realtek's suggestion
> 
>  drivers/net/wireless/realtek/rtw88/pci.c | 33 +++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c
> b/drivers/net/wireless/realtek/rtw88/pci.c
> index 00ef229552d5..a8c17a01f318 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -866,12 +866,29 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  {
>  	struct rtw_dev *rtwdev = dev;
>  	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> -	u32 irq_status[4];
> 
>  	spin_lock(&rtwpci->irq_lock);
>  	if (!rtwpci->irq_enabled)
>  		goto out;
> 
> +	/* disable RTW PCI interrupt to avoid more interrupts before the end of
> +	 * thread function
> +	 */
> +	rtw_pci_disable_interrupt(rtwdev, rtwpci);
> +out:
> +	spin_unlock(&rtwpci->irq_lock);
> +
> +	return IRQ_WAKE_THREAD;
> +}
> +
> +static irqreturn_t rtw_pci_interrupt_threadfn(int irq, void *dev)
> +{
> +	struct rtw_dev *rtwdev = dev;
> +	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> +	unsigned long flags;
> +	u32 irq_status[4];
> +
> +	spin_lock_irqsave(&rtwpci->irq_lock, flags);
>  	rtw_pci_irq_recognized(rtwdev, rtwpci, irq_status);
> 
>  	if (irq_status[0] & IMR_MGNTDOK)
> @@ -891,8 +908,10 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  	if (irq_status[0] & IMR_ROK)
>  		rtw_pci_rx_isr(rtwdev, rtwpci, RTW_RX_QUEUE_MPDU);
> 
> -out:
> -	spin_unlock(&rtwpci->irq_lock);
> +	/* all of the jobs for this interrupt have been done */
> +	if (rtw_flag_check(rtwdev, RTW_FLAG_RUNNING))
> +		rtw_pci_enable_interrupt(rtwdev, rtwpci);

I've applied this patch and tested it.
But I failed to connect to AP, it seems that the
scan_result is empty. And when I failed to connect
to the AP, I found that the IMR is not enabled.
I guess the check bypass the interrupt enable function.

And I also found that *without MSI*, the driver is
able to connect to the AP. Could you please verify
this patch again with MSI interrupt enabled and
send a v4?

You can find my MSI patch on
https://patchwork.kernel.org/patch/11081539/


> +	spin_unlock_irqrestore(&rtwpci->irq_lock, flags);
> 
>  	return IRQ_HANDLED;
>  }
> @@ -1152,8 +1171,10 @@ static int rtw_pci_probe(struct pci_dev *pdev,
>  		goto err_destroy_pci;
>  	}
> 
> -	ret = request_irq(pdev->irq, &rtw_pci_interrupt_handler,
> -			  IRQF_SHARED, KBUILD_MODNAME, rtwdev);
> +	ret = devm_request_threaded_irq(rtwdev->dev, pdev->irq,
> +					rtw_pci_interrupt_handler,
> +					rtw_pci_interrupt_threadfn,
> +					IRQF_SHARED, KBUILD_MODNAME, rtwdev);
>  	if (ret) {
>  		ieee80211_unregister_hw(hw);
>  		goto err_destroy_pci;
> @@ -1192,7 +1213,7 @@ static void rtw_pci_remove(struct pci_dev *pdev)
>  	rtw_pci_disable_interrupt(rtwdev, rtwpci);
>  	rtw_pci_destroy(rtwdev, pdev);
>  	rtw_pci_declaim(rtwdev, pdev);
> -	free_irq(rtwpci->pdev->irq, rtwdev);
> +	devm_free_irq(rtwdev->dev, rtwpci->pdev->irq, rtwdev);
>  	rtw_core_deinit(rtwdev);
>  	ieee80211_free_hw(hw);
>  }
> --


NACK
Need to verify it with MSI https://patchwork.kernel.org/patch/11081539/
And hope v4 could fix it.

Yan-Hsuan


^ permalink raw reply

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Naveen N. Rao @ 2019-08-21  8:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Jiong Wang
  Cc: bpf, linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <20190813171018.28221-1-naveen.n.rao@linux.vnet.ibm.com>

Naveen N. Rao wrote:
> Since BPF constant blinding is performed after the verifier pass, there
> are certain ALU32 instructions inserted which don't have a corresponding
> zext instruction inserted after. This is causing a kernel oops on
> powerpc and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
> 
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
> 
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> ---
> This approach (the location where zext is being introduced below, in 
> particular) works for powerpc, but I am not entirely sure if this is 
> sufficient for other architectures as well. This is broken on v5.3-rc4.

Alexie, Daniel, Jiong,
Any feedback on this?

- Naveen


^ permalink raw reply

* [PATCH net-next v4] sched: Add dualpi2 qdisc
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:31 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org

From: Olga Albisser <olga@albisser.org>

DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
   more responsive and simpler. Classic traffic requires a decent
   target queue (default 15ms for Internet deployment) to fully
   utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
   threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given as a parameter, or they can be
calculated by tc from alternative typical and maximum RTT parameters.

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.

If DUALPI2 AQM has detected overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

Additional details can be found in the draft:
https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v3-> v4
      - Replaced license boiletplate with SPDX identifier
      - Fix missing pskb_may_pull() calls when accessing ECN bits
      - Move timestamp computation at enqueue to happen after drop check
      - Use NMI-safe time keeping function, i.e., ktime_get_ns()
      - Switched from deprecated PSCHED_NS2TICKS/... to raw nanoseconds clocks
      - Validate netlink parameters properly (ranges, error reporting)
      - Expanded the statistics tracked/reported to better reflect the behavior of
        both queues
      - Simplified the qdisc structure:
        o Reworked classification logic to only depend on an ECN mask
        o Renamed most parameters to better reflect their usage
        o Removed unused/experimental features (e.g., TS-FIFO)
        o Restructured the skb->cb
        o Extracted helper functions
      - Fix compilation issues for ARM
      - Updated defaults parameter values to latest IETF ID
      - Fix the step AQM being applied on empty queues, causing excess marking on
        slower links
    * v2 -> v3
      - Fix compilation issues
      - Replaced the classic queue starvation protection from time-shifted FIFO
        to WRR, as it gives better results (e.g., prevents leaking burst in the C
        queue to the L queue)
    * v1 -> v2
      - Store enqueue timestamp in skb->cb to avoid conflict with EDT

 include/uapi/linux/pkt_sched.h |  33 ++
 net/sched/Kconfig              |  22 +-
 net/sched/Makefile             |   1 +
 net/sched/sch_dualpi2.c        | 744 +++++++++++++++++++++++++++++++++
 4 files changed, 799 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..e2ad4a8d2059 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;		/* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;		/* maximum queue size */
+	__u32 ecn_mark;		/* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..f9340c18c3a2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,26 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_DUALPI2
+	tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+	help
+	  Say Y here if you want to use the DualPI2 AQM.
+	  This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+	  The PI2 AQM is in turn both an extension and a simplification of the
+	  PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being
+	  able to control scalable congestion controls like DCTCP and
+	  TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with
+	  DCTCP, maintaining window fairness. DUALQ provides latency separation
+	  between low latency DCTCP flows and Reno/Cubic flows that need a
+	  bigger queue.
+	  For more information, please see
+	  https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+
+	  To compile this code as a module, choose M here: the module
+	  will be called sch_dualpi2.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
@@ -418,7 +438,7 @@ menuconfig NET_SCH_DEFAULT
 	  of pfifo_fast will be used. Many distributions already set
 	  the default value via /proc/sys/net/core/default_qdisc.
 
-	  If unsure, say N.
+
 
 if NET_SCH_DEFAULT
 
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..8e3bd4459eb4 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2)   += sch_dualpi2.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..a6452aa82018
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2019 Nokia.
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ *   Supports scalable congestion controls (e.g., DCTCP)
+ *   Supports coupled dual-queue with PI2
+ *   Supports L4S ECN identifier
+ *
+ * References:
+ *   draft-ietf-tsvwg-aqm-dualq-coupled:
+ *     http://tools.ietf.org/html/draft-ietf-tsvwg-aqm-dualq-coupled-08
+ *   De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/string.h>
+
+/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB ((u32)(~((u32)0)))
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
+ * control flows whose maximal RTTs can be in usec up to few secs.
+ */
+#define ALPHA_BETA_MAX ((2 << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This enables to use all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+	struct Qdisc *l_queue;	/* The L4S LL queue */
+	struct Qdisc *sch;	/* The classic queue (owner of this struct) */
+
+	struct { /* PI2 parameters */
+		u64	target;	/* Target delay in nanoseconds */
+		u32	tupdate;/* timer frequency (in jiffies) */
+		u32	prob;	/* Base PI2 probability */
+		u32	alpha;	/* Gain factor for the integral rate response */
+		u32	beta;	/* Gain factor for the proportional response */
+		struct timer_list timer; /* prob update timer */
+	} pi2;
+
+	struct { /* Step AQM (L4S queue only) parameters */
+		u32 thresh;	/* Step threshold */
+		bool in_packets;/* Whether the step is in packets or time */
+	} step;
+
+	struct { /* Classic queue starvation protection */
+		s32	credit; /* Credit (sign indicates which queue) */
+		s32	init;	/* Reset value of the credit */
+		u8	wc;	/* C queue weight (between 0 and MAX_WC) */
+		u8	wl;	/* L queue weight (MAX_WC - wc) */
+	} c_protection;
+
+	/* General dualQ parameters */
+	u8	coupling_factor;/* Coupling factor (k) between both queues */
+	u8	ecn_mask;	/* Mask to match L4S packets */
+	bool	drop_early;	/* Drop at enqueue instead of dequeue if true */
+	bool	drop_overload;	/* Drop (1) on overload, or overflow (0) */
+
+	/* Statistics */
+	u64	qdelay_c;	/* Classic Q delay */
+	u64	qdelay_l;	/* L4S Q delay */
+	u32	packets_in_c;	/* Number of packets enqueued in C queue */
+	u32	packets_in_l;	/* Number of packets enqueued in L queue */
+	u32	maxq;		/* maximum queue size */
+	u32	ecn_mark;	/* packets marked with ECN */
+	u32	step_marks;	/* ECN marks due to the step AQM */
+
+	struct { /* Deferred drop statistics */
+		u32 cnt;	/* Packets dropped */
+		u32 len;	/* Bytes dropped */
+	} deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+	u64 ts;		/* Timestamp at enqueue */
+	u8 apply_step:1,/* Can we apply the step threshold */
+	   l4s:1,	/* Packet has been classified as L4S */
+	   ect:2;	/* Packet ECT codepoint */
+};
+
+static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+	return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+	return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static inline u64 qdelay_in_ns(struct Qdisc *q, u64 now)
+{
+	struct sk_buff *skb = qdisc_peek_head(q);
+
+	return skb ? skb_sojourn_time(skb, now) : 0;
+}
+
+static inline u32 dualpi2_scale_alpha_beta(u32 param)
+{
+	u64 tmp  = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+	do_div(tmp, NSEC_PER_SEC);
+	return tmp;
+}
+
+static inline u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+	do_div(tmp, MAX_PROB);
+	return tmp;
+}
+
+static inline bool skb_is_l4s(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->l4s != 0;
+}
+
+static inline void dualpi2_mark(struct dualpi2_sched_data *q,
+				struct sk_buff *skb)
+{
+	if (INET_ECN_set_ce(skb))
+		q->ecn_mark++;
+}
+
+static inline void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+	q->c_protection.credit = q->c_protection.init;
+}
+
+static inline void dualpi2_calculate_c_protection(struct Qdisc *sch,
+						  struct dualpi2_sched_data *q,
+						  u32 wc)
+{
+	q->c_protection.wc = wc;
+	q->c_protection.wl = MAX_WC - wc;
+	/* Start with L queue if wl > wc */
+	q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+		((int)q->c_protection.wc - (int)q->c_protection.wl);
+	dualpi2_reset_c_protection(q);
+}
+
+static inline bool dualpi2_roll(u32 prob)
+{
+	return prandom_u32() <= prob;
+}
+
+static inline bool dualpi2_squared_roll(struct dualpi2_sched_data *q)
+{
+	return dualpi2_roll(q->pi2.prob) && dualpi2_roll(q->pi2.prob);
+}
+
+static inline bool dualpi2_is_overloaded(u64 prob)
+{
+	return prob > MAX_PROB;
+}
+
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+		      struct sk_buff *skb)
+{
+	u64 local_l_prob;
+
+	/* Never drop if we have fewer than 2 mtu-sized packets;
+	 * similar to min_th in RED.
+	 */
+	if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+		return false;
+
+	local_l_prob = (u64)q->pi2.prob * q->coupling_factor;
+
+	if (skb_is_l4s(skb)) {
+		if (dualpi2_is_overloaded(local_l_prob)) {
+			/* On overload, preserve delay by doing a classic drop
+			 * in the L queue. Otherwise, let both queues grow until
+			 * we reach the limit and cannot enqueue anymore
+			 * (sacrifice delay to avoid drops).
+			 */
+			if (q->drop_overload && dualpi2_squared_roll(q))
+				goto drop;
+			else
+				goto mark;
+			/* Scalable marking has a  (prob * k) probability */
+		} else if (dualpi2_roll(local_l_prob)) {
+			goto mark;
+		}
+		/* Apply classic marking with a (prob * prob) probability.
+		 * Force drops for ECN-capable traffic on overload.
+		 */
+	} else if (dualpi2_squared_roll(q)) {
+		if (dualpi2_skb_cb(skb)->ect &&
+		    !dualpi2_is_overloaded(local_l_prob))
+			goto mark;
+		else
+			goto drop;
+	}
+	return false;
+
+mark:
+	dualpi2_mark(q, skb);
+	return false;
+
+drop:
+	return true;
+}
+
+static void dualpi2_skb_classify(struct dualpi2_sched_data *q,
+				 struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	int wlen = skb_network_offset(skb);
+
+	switch (tc_skb_protocol(skb)) {
+	case htons(ETH_P_IP):
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+		break;
+	case htons(ETH_P_IPV6):
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+		break;
+	default:
+		goto not_ecn;
+	}
+	cb->l4s = (cb->ect & q->ecn_mask) != 0;
+	return;
+
+not_ecn:
+	/* Not ECN capable or not non pullable/writable packets can only be
+	 * dropped hence go the the classic queue.
+	 */
+	cb->ect = INET_ECN_NOT_ECT;
+	cb->l4s = 0;
+}
+
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+		qdisc_qstats_overlimit(sch);
+		err = NET_XMIT_DROP;
+		goto drop;
+	}
+
+	dualpi2_skb_classify(q, skb);
+
+	/* drop early if configured */
+	if (q->drop_early && must_drop(sch, q, skb)) {
+		err = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+		goto drop;
+	}
+
+	dualpi2_skb_cb(skb)->ts = ktime_get_ns();
+
+	if (qdisc_qlen(sch) > q->maxq)
+		q->maxq = qdisc_qlen(sch);
+
+	if (skb_is_l4s(skb)) {
+		/* Only apply the step if a queue is building up */
+		dualpi2_skb_cb(skb)->apply_step = qdisc_qlen(q->l_queue) > 1;
+		/* Keep the overall qdisc stats consistent */
+		++sch->q.qlen;
+		qdisc_qstats_backlog_inc(sch, skb);
+		++q->packets_in_l;
+		return qdisc_enqueue_tail(skb, q->l_queue);
+	}
+	++q->packets_in_c;
+	return qdisc_enqueue_tail(skb, sch);
+
+drop:
+	qdisc_drop(skb, sch, to_free);
+	return err;
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int qlen_c, credit_change;
+
+pick_packet:
+	/* L queue packets are also accounted for in qdisc_qlen(sch)! */
+	qlen_c = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+	skb = NULL;
+	/* We can drop after qdisc_dequeue_head() calls.
+	 * Manage statistics by hand to keep them consistent if that happens.
+	 */
+	if (qdisc_qlen(q->l_queue) > 0 &&
+	    (qlen_c <= 0 || q->c_protection.credit <= 0)) {
+		/* Dequeue and increase the credit by wc if qlen_c != 0 */
+		skb = __qdisc_dequeue_head(&q->l_queue->q);
+		credit_change = qlen_c ?
+			q->c_protection.wc * qdisc_pkt_len(skb) : 0;
+		/* The global backlog will be updated later. */
+		qdisc_qstats_backlog_dec(q->l_queue, skb);
+		/* Propagate the dequeue to the global stats. */
+		--sch->q.qlen;
+	} else if (qlen_c > 0) {
+		/* Dequeue and decrease the credit by wl if qlen_l != 0 */
+		skb = __qdisc_dequeue_head(&sch->q);
+		credit_change = qdisc_qlen(q->l_queue) ?
+			(s32)(-1) * q->c_protection.wl * qdisc_pkt_len(skb) : 0;
+	} else {
+		dualpi2_reset_c_protection(q);
+		goto exit;
+	}
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	/* Drop on dequeue? */
+	if (!q->drop_early && must_drop(sch, q, skb)) {
+		++q->deferred_drops.cnt;
+		q->deferred_drops.len += qdisc_pkt_len(skb);
+		consume_skb(skb);
+		qdisc_qstats_drop(sch);
+		/* try next packet */
+		goto pick_packet;
+	}
+
+	/* Apply the Step AQM to packets coming out of the L queue. */
+	if (skb_is_l4s(skb)) {
+		u64 qdelay = 0;
+
+		if (q->step.in_packets)
+			qdelay = qdisc_qlen(q->l_queue);
+		else
+			qdelay = skb_sojourn_time(skb, ktime_get_ns());
+		/* Apply the step */
+		if (likely(dualpi2_skb_cb(skb)->apply_step) &&
+		    qdelay > q->step.thresh) {
+			dualpi2_mark(q, skb);
+			++q->step_marks;
+		}
+		qdisc_bstats_update(q->l_queue, skb);
+	}
+
+	q->c_protection.credit += credit_change;
+	qdisc_bstats_update(sch, skb);
+
+exit:
+	/* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+	 * or HTB crashes.
+	 */
+	if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+		qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+					  q->deferred_drops.len);
+		q->deferred_drops.cnt = 0;
+		q->deferred_drops.len = 0;
+	}
+	return skb;
+}
+
+static s64 __scale_delta(s64 diff)
+{
+	do_div(diff, (1 << (ALPHA_BETA_GRANULARITY + 1)) - 1);
+	return diff;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay, qdelay_old, now;
+	u32 new_prob;
+	s64 delta;
+
+	qdelay_old = max_t(u64, q->qdelay_c, q->qdelay_l);
+	now = ktime_get_ns();
+	q->qdelay_l = qdelay_in_ns(q->l_queue, now);
+	q->qdelay_c = qdelay_in_ns(sch, now);
+	qdelay = max_t(u64, q->qdelay_c, q->qdelay_l);
+	/* Alpha and beta take at most 32b, i.e, the delay difference would
+	 * overflow for queueing delay differences > ~4.2sec.
+	 */
+	delta = __scale_delta(((s64)qdelay - q->pi2.target) * q->pi2.alpha);
+	delta += __scale_delta(((s64)qdelay - qdelay_old) * q->pi2.beta);
+	new_prob = delta + q->pi2.prob;
+	/* Prevent overflow */
+	if (delta > 0) {
+		if (new_prob < q->pi2.prob)
+			new_prob = MAX_PROB;
+		/* Prevent underflow */
+	} else if (new_prob > q->pi2.prob) {
+		new_prob = 0;
+	}
+	/* If we do not drop on overload, ensure we cap the L4S probability to
+	 * 100% to keep window fairness when overflowing.
+	 */
+	if (!q->drop_overload)
+		return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+	return new_prob;
+}
+
+static void dualpi2_timer(struct timer_list *timer)
+{
+	struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+	struct Qdisc *sch = q->sch;
+	spinlock_t *root_lock; /* Lock to access the head of both queues. */
+
+	root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+	spin_lock(root_lock);
+
+	q->pi2.prob = calculate_probability(sch);
+	mod_timer(&q->pi2.timer, jiffies + q->pi2.tupdate);
+
+	spin_unlock(root_lock);
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+	[TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+	[TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+	[TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+	[TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+	[TCA_DUALPI2_BETA] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+	[TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+	[TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+	[TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+	unsigned int old_qlen, dropped = 0;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+	err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+					  dualpi2_policy, extack);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	if (tb[TCA_DUALPI2_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+		if (!limit) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+					    "limit must be greater than 0 !");
+			return -EINVAL;
+		}
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_DUALPI2_TARGET])
+		q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_TUPDATE]) {
+		u64 tupdate =
+			usecs_to_jiffies(nla_get_u32(tb[TCA_DUALPI2_TUPDATE]));
+
+		if (!tupdate) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+					    "tupdate cannot be 0 jiffies!");
+			return -EINVAL;
+		}
+		q->pi2.tupdate = tupdate;
+	}
+
+	if (tb[TCA_DUALPI2_ALPHA]) {
+		u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+		if (alpha > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+					    "alpha is too large!");
+			return -EINVAL;
+		}
+		q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+	}
+
+	if (tb[TCA_DUALPI2_BETA]) {
+		u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+		if (beta > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+					    "beta is too large!");
+			return -EINVAL;
+		}
+		q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+	}
+
+	if (tb[TCA_DUALPI2_STEP_THRESH])
+		q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_COUPLING]) {
+		u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+		if (!coupling) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+					    "Must use a non-zero coupling!");
+			return -EINVAL;
+		}
+		q->coupling_factor = coupling;
+	}
+
+	if (tb[TCA_DUALPI2_STEP_PACKETS])
+		q->step.in_packets = nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+		q->drop_overload = nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+	if (tb[TCA_DUALPI2_DROP_EARLY])
+		q->drop_early = nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+	if (tb[TCA_DUALPI2_C_PROTECTION]) {
+		u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+		if (wc > MAX_WC) {
+			NL_SET_ERR_MSG_ATTR(extack,
+					    tb[TCA_DUALPI2_C_PROTECTION],
+					    "c_protection must be <= 100!");
+			return -EINVAL;
+		}
+		dualpi2_calculate_c_protection(sch, q, wc);
+	}
+
+	if (tb[TCA_DUALPI2_ECN_MASK])
+		q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+	/* Drop excess packets if new limit is lower */
+	old_qlen = qdisc_qlen(sch);
+	while (qdisc_qlen(sch) > sch->limit) {
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+		dropped += qdisc_pkt_len(skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		rtnl_qdisc_drop(skb, sch);
+	}
+	qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), dropped);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+	q->sch->limit = 10000; /* Holds 125ms at 1G */
+
+	q->pi2.target = 15 * NSEC_PER_MSEC;
+	q->pi2.tupdate = usecs_to_jiffies(16 * USEC_PER_MSEC);
+	q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz in 1/256th */
+	q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.2 Hz in 1/256th */
+	/* These values give a 10dB stability margin with max_rtt=100ms */
+
+	q->step.thresh = 1 * NSEC_PER_MSEC; /* 1ms */
+	q->step.in_packets = false; /* Step in time not packets */
+
+	dualpi2_calculate_c_protection(q->sch, q, 10); /* Defaults to wc = 10 */
+
+	q->ecn_mask = INET_ECN_ECT_1; /* l4s-id */
+	q->coupling_factor = 2; /* window fairness for equal RTTs */
+	q->drop_overload = true; /* Preserve latency by dropping on overload */
+	q->drop_early = false; /* PI2 drop on dequeue */
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+				       TC_H_MAKE(sch->handle, 1), extack);
+	if (!q->l_queue)
+		return -ENOMEM;
+
+	q->sch = sch;
+	dualpi2_reset_default(q);
+	timer_setup(&q->pi2.timer, dualpi2_timer, 0);
+
+	if (opt) {
+		int err = dualpi2_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	mod_timer(&q->pi2.timer, (jiffies + HZ) >> 1);
+	return 0;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct nlattr *opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 step_thresh = q->step.thresh;
+	u64 target_usec = q->pi2.target;
+
+	if (!opts)
+		goto nla_put_failure;
+
+	do_div(target_usec, NSEC_PER_USEC);
+	if (!q->step.in_packets)
+		do_div(step_thresh, NSEC_PER_USEC);
+
+	if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TARGET, target_usec) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+			jiffies_to_usecs(q->pi2.tupdate)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+			dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_BETA,
+			dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, step_thresh) ||
+	    nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+	    nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+	    nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+	    nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u64 qdelay_c_usec = q->qdelay_c;
+	u64 qdelay_l_usec = q->qdelay_l;
+	struct tc_dualpi2_xstats st = {
+		.prob		= q->pi2.prob,
+		.packets_in_c	= q->packets_in_c,
+		.packets_in_l	= q->packets_in_l,
+		.maxq		= q->maxq,
+		.ecn_mark	= q->ecn_mark,
+		.credit		= q->c_protection.credit,
+		.step_marks	= q->step_marks,
+	};
+
+	do_div(qdelay_c_usec, NSEC_PER_USEC);
+	do_div(qdelay_l_usec, NSEC_PER_USEC);
+	st.delay_c = qdelay_c_usec;
+	st.delay_l = qdelay_l_usec;
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset_queue(sch);
+	qdisc_reset_queue(q->l_queue);
+	q->qdelay_c = 0;
+	q->qdelay_l = 0;
+	q->pi2.prob = 0;
+	q->packets_in_c = 0;
+	q->packets_in_l = 0;
+	q->maxq = 0;
+	q->ecn_mark = 0;
+	q->step_marks = 0;
+	dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->pi2.tupdate = 0;
+	del_timer_sync(&q->pi2.timer);
+	if (q->l_queue)
+		qdisc_put(q->l_queue);
+}
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+	.id = "dualpi2",
+	.priv_size	= sizeof(struct dualpi2_sched_data),
+	.enqueue	= dualpi2_qdisc_enqueue,
+	.dequeue	= dualpi2_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= dualpi2_init,
+	.destroy	= dualpi2_destroy,
+	.reset		= dualpi2_reset,
+	.change		= dualpi2_change,
+	.dump		= dualpi2_dump,
+	.dump_stats	= dualpi2_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+	return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+	unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper");
+MODULE_AUTHOR("Olga Albisser");
+MODULE_AUTHOR("Henrik Steen");
+MODULE_AUTHOR("Olivier Tilmans");
+MODULE_LICENSE("GPL");
-- 
2.22.0


^ permalink raw reply related

* [PATCH iproute2-next v2] tc: add dualpi2 scheduler module
From: Tilmans, Olivier (Nokia - BE/Antwerp) @ 2019-08-21  8:39 UTC (permalink / raw)
  Cc: Eric Dumazet, Stephen Hemminger, davem@davemloft.net,
	netdev@vger.kernel.org, Olga Albisser,
	De Schepper, Koen (Nokia - BE/Antwerp),
	Tilmans, Olivier (Nokia - BE/Antwerp), Bob Briscoe, Henrik Steen

From: Olga Albisser <olga@albisser.org>

DualPI2 AQM is a combination of the DualQ Coupled-AQM with a PI2
base-AQM, able to control scalable congestion controls like DCTCP
and TCP-Prague, implemented as a Linux qdisc.

This patch adds support to tc to configure it through its netlink
interface.

Signed-off-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Signed-off-by: Oliver Tilmans <olivier.tilmans@nokia-bell-labs.com>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
---

Notes:
    Changelog:
    * v1 -> v2
      - Update against revised Netlink defines
      - Aligned bound checks with kernel ones
      - Simplified set of parameters

 include/uapi/linux/pkt_sched.h |  33 +++
 include/utils.h                |   8 +
 man/man8/tc-dualpi2.8          | 203 +++++++++++++++
 tc/Makefile                    |   1 +
 tc/q_dualpi2.c                 | 439 +++++++++++++++++++++++++++++++++
 5 files changed, 684 insertions(+)
 create mode 100644 man/man8/tc-dualpi2.8
 create mode 100644 tc/q_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f18529..5f31ba46 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;             /* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;             /* maximum queue size */
+	__u32 ecn_mark;         /* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/include/utils.h b/include/utils.h
index 794d3605..0c8f43e8 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -279,6 +279,14 @@ unsigned int print_name_and_link(const char *fmt,
 	_min1 < _min2 ? _min1 : _min2; })
 #endif
 
+#ifndef max
+# define max(x, y) ({			\
+	typeof(x) _max1 = (x);		\
+	typeof(y) _max2 = (y);		\
+	(void) (&_max1 == &_max2);	\
+	_max1 > _max2 ? _max1 : _max2; })
+#endif
+
 #ifndef __check_format_string
 # define __check_format_string(pos_str, pos_args) \
 	__attribute__ ((format (printf, (pos_str), (pos_args))))
diff --git a/man/man8/tc-dualpi2.8 b/man/man8/tc-dualpi2.8
new file mode 100644
index 00000000..d325ccda
--- /dev/null
+++ b/man/man8/tc-dualpi2.8
@@ -0,0 +1,203 @@
+.TH DUALPI2 8 "13 December 2018" "iproute2" "Linux"
+
+.SH NAME
+DUALPI2 \- Dual Queue Proportional Integral Controller AQM - Improved with a square
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.BR tc " " qdisc " ... " dualpi2
+.br
+.RB "[ " limit
+.IR PACKETS " ]"
+.br
+.RB "[ " coupling_factor
+.IR NUMBER " ]"
+.br
+.RB "[ " step_thresh
+.IR TIME | PACKETS " ]"
+.br
+.RB "[ " drop_on_overload " | " overflow " ]"
+.br
+.RB "[ " drop_enqueue " | " drop_dequeue " ]"
+.br
+.RB "[ " l4s_ect " | " any_ect " ]"
+.br
+.RB "[ " classic_protection
+.IR PERCENTAGE " ] "
+.br
+.RB "[ " max_rtt
+.IR TIME 
+.RB " [ " typical_rtt 
+.IR TIME " ]] "
+.br
+.RB "[ " target
+.IR TIME " ]"
+.br
+.RB "[ " tupdate
+.IR TIME " ]"
+.br
+.RB "[ " alpha
+.IR float " ]"
+.br
+.RB "[ " beta
+.IR float " ] "
+
+.SH DESCRIPTION
+DUALPI2 AQM is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM. The PI2 AQM (details can be found in the paper cited below) is in turn both an extension and a simplification of the PIE AQM. PI2 makes quite some PIE heuristics unnecessary, while being able to control scalable congestion controls like DCTCP and TCP-Prague. With PI2, both Reno/Cubic can be used in parallel with DCTCP, maintaining window fairness. DUALQ provides latency separation between low latency DCTCP flows and Reno/Cubic flows that need a bigger queue. The main design goals are:
+.PD 0
+.IP \(bu 4
+L4S - Low Loss, Low Latency and Scalable congestion control support
+.IP \(bu 4
+DualQ option to separate the L4S traffic in a low latency queue, without harming remaining traffic that is scheduled in classic queue due to congestion-coupling
+.IP \(bu 4
+Configurable overload strategies
+.IP \(bu 4
+Use of sojourn time to reliably estimate queue delay
+.IP \(bu 4
+Simple implementation
+.IP \(bu 4
+Guaranteed stability and fast responsiveness
+.PD
+
+.SH ALGORITHM
+DUALPI2 is designed to provide low loss and low latency to L4S traffic, without harming classic traffic. Every update interval a new internal base probability is calculated, based on queue delay. The base probability is updated with a delta based on the difference between the current queue delay and the 
+.I "" target
+delay, and the queue growth comparing with the queuing delay during the previous 
+.I "" tupdate
+interval. The integral gain factor 
+.RB "" alpha
+is used to correct slowly enough any persistent standing queue error to the user specified target delay, while the proportional gain factor
+.RB "" beta
+is used to quickly compensate for queue changes (growth or shrink).
+
+The updated base probability is used as input to decide to mark and drop packets. DUALPI2 scales the calculated probability for each of the two queues accordingly. For the L4S queue, the probability is multiplied by a 
+.RB "" coupling_factor
+, while for the classic queue, it is squared to compensate the squareroot rate equation of Reno/Cubic. The ECT identifier (
+.RB "" l4s_ect | any_ect
+) is used to classify traffic into respective queues.
+
+If DUALPI2 AQM has detected overload (when excessive non-responsive traffic is sent), it can signal congestion solely using 
+.RB "" drop
+, irrespective of the ECN field, or alternatively limit the drop probability and let the queue grow and eventually 
+.RB "" overflow
+(like tail-drop).
+
+Additional details can be found in the draft cited below.
+
+.SH PARAMETERS
+.TP
+.BI limit " PACKETS"
+Limit the number of packets that can be enqueued. Incoming packets are dropped when this limit
+is reached. This limit is common for the L4S and Classic queue. Defaults to
+.I 10000
+packets. This is about 125ms delay on a 1Gbps link.
+.TP
+.BI coupling_factor " NUMBER"
+Set the coupling rate factor between Classic and L4S. Defaults to
+.I 2
+.TP
+.B l4s_ect | any_ect
+Configures the ECT classifier. Packets whose ECT codepoint matches this are sent to the L4S queue where they receive a scalable marking. Defaults to
+.I l4s_ect
+, i.e., the L4S identifier ECT(1). Setting this to
+.I any_ect
+causes all packets whose ECN field is not zero to be sent to the L4S queue. This enables to be backward compatible with, e.g., DCTCP.
+.PD
+.BI step_thresh " TIME | PACKETS"
+Set the step threshold for the L4S queue. This will cause packets with a sojourn time exceeding the threshold to always be marked. This value can either be specified using time units (i.e., us, ms, s), or in packets (pkt, packet(s)). A velue without units is assumed to be in time (us). If defining the step in packets, be sure to disable GRO on the ingress interfaces. Defaults to
+.I 1ms
+.
+.TP
+.B drop_on_overload  |  overflow
+Control the overload strategy. 
+.I drop_on_overload
+preserves the delay in the L4S queue by dropping in both queues on overload.
+.I overflow
+sacrifices delay to avoid losses, eventually resulting in a taildrop behavior once
+.I limit
+is reached. Defaults to
+.I drop_on_overload.
+.PD
+.TP
+.B drop_enqueue | drop_dequeue
+Decide when packets are PI-based dropped or marked. The
+.I step_thresh 
+based L4S marking is always at dequeue. Defaults to
+.I drop_dequeue
+.PD
+.TP
+.BI classic_protection " PERCENTAGE
+Protects the classic queue from unresponsive traffic in the L4S queue. This bounds the maximal delay in the C queue to be
+.I (100 - PERCENTAGE)
+times greater than the one in the L queue. Defaults to
+.I 10
+.TP
+.BI typical_rtt " TIME"
+.PD 0
+.TP
+.PD
+.BI max_rtt " TIME"
+Specify the maximum round trip time (RTT) and/or the typical RTT of the traffic
+that will be controlled by dualpi2. If either of
+.I max_rtt
+or
+.I typical_rtt
+is not specified, the missing value will be computed from the following 
+relationship:
+.I max_rtt = typical_rtt * 6.
+If any of these parameters is given, it will be used to automatically compute
+suitable values for
+.I alpha, beta, target, and tupdate,
+according to the relationship from the appendix A.1 in the IETF draft, to
+achieve a stable control. Consequently, those derived values will override their
+eventual user-provided ones. The default range of operation for the qdisc uses
+.I max_rtt = 100ms
+and 
+.I typical_rtt = 15ms
+, which is suited to control internet traffic.
+.TP
+.BI target " TIME"
+Set the expected queue delay. Defaults to
+.I 15
+ms.
+.TP
+.BI tupdate " TIME"
+Set the frequency at which the system drop probability is calculated. Defaults to
+.I 16
+ms. This should be a third of the max RTT supported.
+.TP
+.BI alpha " float"
+.PD 0
+.TP
+.PD
+.BI beta " float"
+Set alpha and beta, the integral and proportional gain factors in Hz for the PI controller. These can be calculated based on control theory. Defaults are
+.I 0.16
+and
+.I 3.2
+Hz, which provide stable control for RTT's up to 100ms with tupdate of 16. Be aware, unlike with PIE, these are the real unscaled gain factors.
+
+.SH EXAMPLES
+Setting DUALPI2 for the Internet with default parameters:
+ # sudo tc qdisc add dev eth0 root dualpi2
+
+Setting DUALPI2 for datacenter with legacy DCTCP using ECT(0):
+ # sudo tc qdisc add dev eth0 root dualpi2 any_ect
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-pie (8)
+
+.SH SOURCES
+.IP \(bu 4
+IETF draft submission is at https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+.IP \(bu 4
+CoNEXT '16 Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies : "PI2: A
+Linearized AQM for both Classic and Scalable TCP"
+
+.SH AUTHORS
+DUALPI2 was implemented by Koen De Schepper, Olga Albisser, Henrik Steen, and Olivier Tilmans also the authors of
+this man page. Please report bugs and corrections to the Linux networking
+development mailing list at <netdev@vger.kernel.org>.
diff --git a/tc/Makefile b/tc/Makefile
index 14171a28..35f02da3 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ SHARED_LIBS ?= y
 
 TCMODULES :=
 TCMODULES += q_fifo.o
+TCMODULES += q_dualpi2.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
diff --git a/tc/q_dualpi2.c b/tc/q_dualpi2.c
new file mode 100644
index 00000000..1fda2ec3
--- /dev/null
+++ b/tc/q_dualpi2.c
@@ -0,0 +1,439 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 Nokia.
+ *
+ * DualQ PI Improved with a Square (dualpi2)
+ * Supports controlling scalable congestion controls (DCTCP, etc...)
+ * Supports DualQ with PI2
+ * Supports L4S ECN identifier
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <math.h>
+#include <errno.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define MAX_PROB ((uint32_t)(~((uint32_t)0)))
+#define DEFAULT_ALPHA_BETA ((uint32_t)(~((uint32_t)0)))
+#define ALPHA_BETA_MAX ((2 << 23) - 1) /* see net/sched/sch_dualpi2.c */
+#define ALPHA_BETA_SCALE (1 << 8)
+#define RTT_TYP_TO_MAX 6
+
+enum {
+	INET_ECN_NOT_ECT = 0,
+	INET_ECN_ECT_1 = 1,
+	INET_ECN_ECT_0 = 2,
+	INET_ECN_CE = 3,
+	INET_ECN_MASK = 3,
+};
+
+static const char *get_ecn_type(uint8_t ect)
+{
+	switch (ect & INET_ECN_MASK) {
+		case INET_ECN_ECT_1: return "l4s_ect";
+		case INET_ECN_ECT_0:
+		case INET_ECN_MASK: return "any_ect";
+		default:
+			fprintf(stderr,
+				"Warning: Unexpected ecn type %u!\n", ect);
+			return "";
+	}
+}
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... dualpi2\n");
+	fprintf(stderr, "               [limit PACKETS]\n");
+	fprintf(stderr, "               [coupling_factor NUMBER]\n");
+	fprintf(stderr, "               [step_thresh TIME|PACKETS]\n");
+	fprintf(stderr, "               [drop_on_overload|overflow]\n");
+	fprintf(stderr, "               [drop_enqueue|drop_dequeue]\n");
+	fprintf(stderr, "               [classic_protection PERCENTAGE]\n");
+	fprintf(stderr, "               [max_rtt TIME [typical_rtt TIME]]\n");
+	fprintf(stderr, "               [target TIME] [tupdate TIME]\n");
+	fprintf(stderr, "               [alpha ALPHA] [beta BETA]\n");
+}
+
+static int get_float(float *val, const char *arg, float min, float max)
+{
+        float res;
+        char *ptr;
+
+        if (!arg || !*arg)
+                return -1;
+        res = strtof(arg, &ptr);
+        if (!ptr || ptr == arg || *ptr)
+                return -1;
+	if (res < min || res > max)
+		return -1;
+        *val = res;
+        return 0;
+}
+
+static int get_packets(uint32_t *val, const char *arg)
+{
+	unsigned long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtoul(arg, &ptr, 10);
+	if (!ptr || ptr == arg ||
+	    (strcmp(ptr, "pkt") && strcmp(ptr, "packet") &&
+	     strcmp(ptr, "packets")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res > 0xFFFFFFFFUL)
+		return -1;
+	*val = res;
+	return 0;
+}
+
+static int parse_alpha_beta(const char *name, char *argv, uint32_t *field)
+{
+
+	float field_f;
+
+	if (get_float(&field_f, argv, 0.0, ALPHA_BETA_MAX)) {
+		fprintf(stderr, "Illegal \"%s\"\n", name);
+		return -1;
+	}
+	else if (field_f < 1.0f / ALPHA_BETA_SCALE)
+		fprintf(stderr, "Warning: \"%s\" is too small and will be "
+			"rounded to zero.\n", name);
+	*field = (uint32_t)(field_f * ALPHA_BETA_SCALE);
+	return 0;
+}
+
+static int try_get_percentage(int *val, const char *arg, int base)
+{
+	long res;
+	char *ptr;
+
+	if (!arg || !*arg)
+		return -1;
+	res = strtol(arg, &ptr, base);
+	if (!ptr || ptr == arg || (*ptr && strcmp(ptr, "%")))
+		return -1;
+	if (res == ULONG_MAX && errno == ERANGE)
+		return -1;
+	if (res < 0 || res > 100)
+		return -1;
+
+	*val = res;
+	return 0;
+}
+
+static int dualpi2_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			 struct nlmsghdr *n, const char* dev)
+{
+	uint32_t limit = 0;
+	uint32_t target = 0;
+	uint32_t tupdate = 0;
+	uint32_t alpha = DEFAULT_ALPHA_BETA;
+	uint32_t beta = DEFAULT_ALPHA_BETA;
+	int32_t coupling_factor = -1;
+	uint8_t ecn_mask = INET_ECN_NOT_ECT;
+	bool step_packets = false;
+	uint32_t step_thresh = 0;
+	int c_protection = -1;
+	int drop_early = -1;
+	int drop_overload = -1;
+	uint32_t rtt_max = 0;
+	uint32_t rtt_typ = 0;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "limit") == 0) {
+			NEXT_ARG();
+			if (get_u32(&limit, *argv, 10)) {
+				fprintf(stderr, "Illegal \"limit\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "target") == 0) {
+			NEXT_ARG();
+			if (get_time(&target, *argv)) {
+				fprintf(stderr, "Illegal \"target\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "tupdate") == 0) {
+			NEXT_ARG();
+			if (get_time(&tupdate, *argv)) {
+				fprintf(stderr, "Illegal \"tupdate\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "alpha") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("alpha", *argv, &alpha))
+				return -1;
+		} else if (strcmp(*argv, "beta") == 0) {
+			NEXT_ARG();
+			if (parse_alpha_beta("beta", *argv, &beta))
+				return -1;
+		} else if (strcmp(*argv, "coupling_factor") == 0) {
+			NEXT_ARG();
+			if (get_s32(&coupling_factor, *argv, 0) ||
+			    coupling_factor > 0xFFUL ||coupling_factor < 0) {
+				fprintf(stderr,
+					"Illegal \"coupling_factor\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "l4s_ect") == 0)
+			ecn_mask = INET_ECN_ECT_1;
+		else if (strcmp(*argv, "any_ect") == 0)
+			ecn_mask = INET_ECN_MASK;
+		else if (strcmp(*argv, "step_thresh") == 0) {
+			NEXT_ARG();
+			/* First assume that this is specified in time */
+			if (get_time(&step_thresh, *argv)) {
+				/* Then packets */
+				if (get_packets(&step_thresh, *argv)) {
+					fprintf(stderr,
+						"Illegal \"step_thresh\"\n");
+					return -1;
+				}
+				step_packets = true;
+			}
+		} else if (strcmp(*argv, "overflow") == 0) {
+                        drop_overload = 0;
+		} else if (strcmp(*argv, "drop_on_overload") == 0) {
+                        drop_overload = 1;
+		} else if (strcmp(*argv, "drop_enqueue") == 0) {
+			drop_early = 1;
+		} else if (strcmp(*argv, "drop_dequeue") == 0) {
+			drop_early = 0;
+		} else if (strcmp(*argv, "classic_protection") == 0) {
+                        NEXT_ARG();
+                        if (try_get_percentage(&c_protection, *argv, 10) ||
+			    c_protection > 100 ||
+			    c_protection < 0) {
+                                fprintf(stderr,
+					"Illegal \"classic_protection\"\n");
+                                return -1;
+                        }
+		} else if (strcmp(*argv, "max_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_max, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_max\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "typical_rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&rtt_typ, *argv)) {
+				fprintf(stderr, "Illegal \"rtt_typ\"\n");
+				return -1;
+			}
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		--argc;
+		++argv;
+	}
+
+	if (rtt_max || rtt_typ) {
+		float alpha_f, beta_f;
+		SPRINT_BUF(max_rtt_t);
+		SPRINT_BUF(typ_rtt_t);
+		SPRINT_BUF(tupdate_t);
+		SPRINT_BUF(target_t);
+
+		if (!rtt_typ)
+			rtt_typ = max(rtt_max / RTT_TYP_TO_MAX, 1U);
+		else if (!rtt_max)
+			rtt_max = rtt_typ * RTT_TYP_TO_MAX;
+		else if (rtt_typ > rtt_max) {
+			fprintf(stderr, "typical_rtt must be >= max_rtt!\n");
+			return -1;
+		}
+		if (alpha != DEFAULT_ALPHA_BETA || beta != DEFAULT_ALPHA_BETA ||
+		    tupdate || target)
+			fprintf(stderr, "rtt_max is specified, ignoring values "
+				"specified for alpha/beta/tupdate/target\n");
+		target = rtt_typ;
+		tupdate = max(min(rtt_typ, rtt_max / 3), 1U);
+		alpha_f = (double)tupdate / ((double)rtt_max * rtt_max)
+			* TIME_UNITS_PER_SEC * 0.1f;
+		beta_f = 0.3f / (float)rtt_max * TIME_UNITS_PER_SEC;
+		if (beta_f > ALPHA_BETA_MAX) {
+			fprintf(stderr, "max_rtt=%s is too low and cause beta "
+				"to overflow!\n",
+				sprint_time(rtt_max, max_rtt_t));
+			return -1;
+		}
+		if (alpha_f < 1.0f / ALPHA_BETA_SCALE ||
+		    beta_f < 1.0f / ALPHA_BETA_SCALE) {
+			fprintf(stderr, "max_rtt=%s is too large and will "
+				"cause alpha=%f and/or beta=%f to be rounded "
+				"down to 0!\n", sprint_time(rtt_max, max_rtt_t),
+				alpha_f, beta_f);
+			return -1;
+		}
+		fprintf(stderr, "Auto-configuring parameters using "
+			"[max_rtt: %s, typical_rtt: %s]: "
+			"target=%s tupdate=%s alpha=%f beta=%f\n",
+			sprint_time(rtt_max, max_rtt_t),
+			sprint_time(rtt_typ, typ_rtt_t),
+			sprint_time(target, target_t),
+			sprint_time(tupdate, tupdate_t), alpha_f, beta_f);
+		alpha = alpha_f * ALPHA_BETA_SCALE;
+		beta = beta * ALPHA_BETA_SCALE;
+	}
+
+	tail = addattr_nest(n, 1024, TCA_OPTIONS);
+	if (limit)
+		addattr32(n, 1024, TCA_DUALPI2_LIMIT, limit);
+	if (tupdate)
+		addattr32(n, 1024, TCA_DUALPI2_TUPDATE, tupdate);
+	if (target)
+		addattr32(n, 1024, TCA_DUALPI2_TARGET, target);
+	if (alpha != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_ALPHA, alpha);
+	if (beta != DEFAULT_ALPHA_BETA)
+		addattr32(n, 1024, TCA_DUALPI2_BETA, beta);
+	if (ecn_mask != INET_ECN_NOT_ECT)
+		addattr8(n, 1024, TCA_DUALPI2_ECN_MASK, ecn_mask);
+	if (drop_overload != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_OVERLOAD, drop_overload);
+	if (coupling_factor != -1)
+		addattr8(n, 1024, TCA_DUALPI2_COUPLING, coupling_factor);
+	if (step_thresh) {
+		addattr32(n, 1024, TCA_DUALPI2_STEP_THRESH, step_thresh);
+                addattr8(n, 1024, TCA_DUALPI2_STEP_PACKETS, step_packets);
+	}
+	if (drop_early != -1)
+		addattr8(n, 1024, TCA_DUALPI2_DROP_EARLY, drop_early);
+	if (c_protection != -1)
+		addattr8(n, 1024, TCA_DUALPI2_C_PROTECTION, c_protection);
+	addattr_nest_end(n, tail);
+	return 0;
+}
+
+static int dualpi2_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_DUALPI2_MAX + 1];
+	uint32_t tupdate;
+	uint32_t target;
+	uint32_t step_thresh;
+	bool step_packets = false;
+	SPRINT_BUF(b1);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_DUALPI2_MAX, opt);
+
+	if (tb[TCA_DUALPI2_LIMIT] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_LIMIT]) >= sizeof(__uint32_t))
+		fprintf(f, "limit %up ",
+			rta_getattr_u32(tb[TCA_DUALPI2_LIMIT]));
+	if (tb[TCA_DUALPI2_TARGET] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TARGET]) >= sizeof(__uint32_t)) {
+		target = rta_getattr_u32(tb[TCA_DUALPI2_TARGET]);
+		fprintf(f, "target %s ", sprint_time(target, b1));
+	}
+	if (tb[TCA_DUALPI2_TUPDATE] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_TUPDATE]) >= sizeof(__uint32_t)) {
+		tupdate = rta_getattr_u32(tb[TCA_DUALPI2_TUPDATE]);
+		fprintf(f, "tupdate %s ", sprint_time(tupdate, b1));
+	}
+	if (tb[TCA_DUALPI2_ALPHA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ALPHA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "alpha %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_ALPHA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_BETA] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_BETA]) >= sizeof(__uint32_t)) {
+		fprintf(f, "beta %f ",
+			(float)rta_getattr_u32(tb[TCA_DUALPI2_BETA]) /
+			ALPHA_BETA_SCALE);
+	}
+	if (tb[TCA_DUALPI2_ECN_MASK] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_ECN_MASK]) >= sizeof(__u8))
+		fprintf(f, "%s ",
+			get_ecn_type(rta_getattr_u8(tb[TCA_DUALPI2_ECN_MASK])));
+	if (tb[TCA_DUALPI2_COUPLING] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_COUPLING]) >= sizeof(__u8))
+		fprintf(f, "coupling_factor %u ",
+			rta_getattr_u8(tb[TCA_DUALPI2_COUPLING]));
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_OVERLOAD]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]))
+			fprintf(f, "drop_on_overload ");
+		else
+			fprintf(f, "overflow ");
+	}
+	if (tb[TCA_DUALPI2_STEP_PACKETS] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_PACKETS]) >= sizeof(__u8) &&
+	    rta_getattr_u8(tb[TCA_DUALPI2_STEP_PACKETS]))
+                        step_packets = true;
+	if (tb[TCA_DUALPI2_STEP_THRESH] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_STEP_THRESH]) >= sizeof(__uint32_t)) {
+		step_thresh = rta_getattr_u32(tb[TCA_DUALPI2_STEP_THRESH]);
+		if (step_packets)
+			fprintf(f, "step_thresh %upkt ", step_thresh);
+		else
+			fprintf(f, "step_thresh %s ",
+				sprint_time(step_thresh, b1));
+	}
+	if (tb[TCA_DUALPI2_DROP_EARLY] &&
+	    RTA_PAYLOAD(tb[TCA_DUALPI2_DROP_EARLY]) >= sizeof(__u8)) {
+		if (rta_getattr_u8(tb[TCA_DUALPI2_DROP_EARLY]))
+			fprintf(f, "drop_enqueue ");
+		else
+			fprintf(f, "drop_dequeue ");
+	}
+	if (tb[TCA_DUALPI2_C_PROTECTION] &&
+            RTA_PAYLOAD(tb[TCA_DUALPI2_C_PROTECTION]) >= sizeof(__u8))
+                fprintf(f, "classic_protection %u%% ",
+			rta_getattr_u8(tb[TCA_DUALPI2_C_PROTECTION]));
+
+	return 0;
+}
+
+static int dualpi2_print_xstats(struct qdisc_util *qu, FILE *f,
+			    struct rtattr *xstats)
+{
+	struct tc_dualpi2_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+
+	st = RTA_DATA(xstats);
+	fprintf(f, "prob %f delay_c %uus delay_l %uus\n",
+		(double)st->prob / (double)MAX_PROB, st->delay_c, st->delay_l);
+	fprintf(f, "pkts_in_c %u pkts_in_l %u maxq %u\n",
+		st->packets_in_c, st->packets_in_l, st->maxq);
+	fprintf(f, "ecn_mark %u step_marks %u\n", st->ecn_mark, st->step_marks);
+	fprintf(f, "credit %d (%c)\n", st->credit, st->credit > 0 ? 'C' : 'L');
+	return 0;
+
+}
+
+struct qdisc_util dualpi2_qdisc_util = {
+	.id = "dualpi2",
+	.parse_qopt	= dualpi2_parse_opt,
+	.print_qopt	= dualpi2_print_opt,
+	.print_xstats	= dualpi2_print_xstats,
+};
-- 
2.22.0


^ permalink raw reply related

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-21  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190819111546.35a8ed76@cakuba.netronome.com>

On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote:

I'm on vacation and replying slowly. Sorry for any inconvenience.

> On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
>> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
>>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>>>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>>>> There's a certain allure in bringing the in-kernel BPF translation
>>>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>>>> it does seem like a task best handed in user space. bpfilter can replace
>>>>> iptables completely, here we're looking at an acceleration relatively
>>>>> loosely coupled with flower.
>>>>
>>>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>>>> is not so easy.
>>>>
>>>> Think about recent multi-mask support in flower. Previously userspace could
>>>> assume there is one mask and hash table for each preference in TC. After the
>>>> change TC accepts different masks with the same pref. Such a change tends to
>>>> break userspace emulation. It may ignore masks passed from flow insertion
>>>> and use the mask remembered when the first flow of the pref is inserted. It
>>>> may override the mask of all existing flows with the pref. It may fail to
>>>> insert such flows. Any of them would result in unexpected wrong datapath
>>>> handling which is critical.
>>>> I think such an emulation layer needs to be updated in sync with TC.
>>>
>>> Oh, so you're saying that if xdp_flow is merged all patches to
>>> cls_flower and netfilter which affect flow offload will be required
>>> to update xdp_flow as well?
>>
>> Hmm... you are saying that we are allowed to break other in-kernel
>> subsystem by some change? Sounds strange...
> 
> No I'm not saying that, please don't put words in my mouth.

If we ignore xdp_flow when modifying something which affects flow 
offload, that may cause breakage. I showed such an example using 
multi-mask support. So I just wondered what you mean and guessed you 
think we can break other subsystem in some situation.

I admit I should not have used the wording "you are saying...?". If it 
was not unpleasant to you I'm sorry about that. But I think you should 
not use it as well. I did not say "cls_flower and netfilter which affect 
flow offload will be required to update xdp_flow". I guess most patches 
which affect flow offload core will not break xdp_flow. In some cases 
breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
> 
> Having an implementation nor support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a concious decision here, hence the clarifying question.

As I described above, breakage can happen in some case, and if the patch 
breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
xdp_flow does not support newly added features but it works for existing 
ones, it is OK. In the first place not all features can be offloaded to 
xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
> 
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is 
that this needs to emulate kernel behavior, and it is difficult.
I don't think userspace-defined datapath itself is not reliable, nor 
eBPF ecosystem.

Toshiaki Makita

^ permalink raw reply

* [PATCH bpf-next 0/2] tools: bpftool: work with frozen maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

Hi,
This is a simple set to add support for BPF map freezing to bpftool. First
patch makes bpftool indicate if a map is frozen when listing BPF maps.
Second patch adds a command to freeze a map loaded on the system.

Quentin Monnet (2):
  tools: bpftool: show frozen status for maps
  tools: bpftool: add "bpftool map freeze" subcommand

 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 64 +++++++++++++++++--
 3 files changed, 71 insertions(+), 6 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH bpf-next 1/2] tools: bpftool: show frozen status for maps
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

When listing maps, read their "frozen" status from procfs, and tell if
maps are frozen.

As commit log for map freezing command mentions that the feature might
be extended with flags (e.g. for write-only instead of read-only) in the
future, use an integer and not a boolean for JSON output.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index bfbbc6b4cb83..af2e9eb9747b 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -481,9 +481,11 @@ static int parse_elem(char **argv, struct bpf_map_info *info,
 
 static int show_map_close_json(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	jsonw_start_object(json_wtr);
 
@@ -533,6 +535,12 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 	}
 	close(fd);
 
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+	jsonw_int_field(json_wtr, "frozen", frozen);
+
 	if (info->btf_id)
 		jsonw_int_field(json_wtr, "btf_id", info->btf_id);
 
@@ -555,9 +563,11 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 
 static int show_map_close_plain(int fd, struct bpf_map_info *info)
 {
-	char *memlock;
+	char *memlock, *frozen_str;
+	int frozen = 0;
 
 	memlock = get_fdinfo(fd, "memlock");
+	frozen_str = get_fdinfo(fd, "frozen");
 
 	printf("%u: ", info->id);
 	if (info->type < ARRAY_SIZE(map_type_name))
@@ -610,9 +620,23 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 				printf("\n\tpinned %s", obj->path);
 		}
 	}
+	printf("\n");
+
+	if (frozen_str) {
+		frozen = atoi(frozen_str);
+		free(frozen_str);
+	}
+
+	if (!info->btf_id && !frozen)
+		return 0;
+
+	printf("\t");
 
 	if (info->btf_id)
-		printf("\n\tbtf_id %d", info->btf_id);
+		printf("btf_id %d", info->btf_id);
+
+	if (frozen)
+		printf("%sfrozen", info->btf_id ? "  " : "");
 
 	printf("\n");
 	return 0;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 2/2] tools: bpftool: add "bpftool map freeze" subcommand
From: Quentin Monnet @ 2019-08-21  8:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190821085219.30387-1-quentin.monnet@netronome.com>

Add a new subcommand to freeze maps from user space.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst |  9 +++++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +--
 tools/bpf/bpftool/map.c                       | 34 ++++++++++++++++++-
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 61d1d270eb5e..1c0f7146aab0 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -36,6 +36,7 @@ MAP COMMANDS
 |	**bpftool** **map pop**        *MAP*
 |	**bpftool** **map enqueue**    *MAP* **value** *VALUE*
 |	**bpftool** **map dequeue**    *MAP*
+|	**bpftool** **map freeze**     *MAP*
 |	**bpftool** **map help**
 |
 |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
@@ -127,6 +128,14 @@ DESCRIPTION
 	**bpftool map dequeue**  *MAP*
 		  Dequeue and print **value** from the queue.
 
+	**bpftool map freeze**  *MAP*
+		  Freeze the map as read-only from user space. Entries from a
+		  frozen map can not longer be updated or deleted with the
+		  **bpf\ ()** system call. This operation is not reversible,
+		  and the map remains immutable from user space until its
+		  destruction. However, read and write permissions for BPF
+		  programs to the map remain unchanged.
+
 	**bpftool map help**
 		  Print short help message.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 2ffd351f9dbf..70493a6da206 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -449,7 +449,7 @@ _bpftool()
         map)
             local MAP_TYPE='id pinned'
             case $command in
-                show|list|dump|peek|pop|dequeue)
+                show|list|dump|peek|pop|dequeue|freeze)
                     case $prev in
                         $command)
                             COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
@@ -638,7 +638,7 @@ _bpftool()
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'delete dump getnext help \
                             lookup pin event_pipe show list update create \
-                            peek push enqueue pop dequeue' -- \
+                            peek push enqueue pop dequeue freeze' -- \
                             "$cur" ) )
                     ;;
             esac
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index af2e9eb9747b..de61d73b9030 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1262,6 +1262,35 @@ static int do_pop_dequeue(int argc, char **argv)
 	return err;
 }
 
+static int do_freeze(int argc, char **argv)
+{
+	int err, fd;
+
+	if (!REQ_ARGS(2))
+		return -1;
+
+	fd = map_parse_fd(&argc, &argv);
+	if (fd < 0)
+		return -1;
+
+	if (argc) {
+		close(fd);
+		return BAD_ARG();
+	}
+
+	err = bpf_map_freeze(fd);
+	close(fd);
+	if (err) {
+		p_err("failed to freeze map: %s", strerror(errno));
+		return err;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_help(int argc, char **argv)
 {
 	if (json_output) {
@@ -1286,6 +1315,7 @@ static int do_help(int argc, char **argv)
 		"       %s %s pop        MAP\n"
 		"       %s %s enqueue    MAP value VALUE\n"
 		"       %s %s dequeue    MAP\n"
+		"       %s %s freeze     MAP\n"
 		"       %s %s help\n"
 		"\n"
 		"       " HELP_SPEC_MAP "\n"
@@ -1304,7 +1334,8 @@ static int do_help(int argc, char **argv)
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -1326,6 +1357,7 @@ static const struct cmd cmds[] = {
 	{ "enqueue",	do_update },
 	{ "pop",	do_pop_dequeue },
 	{ "dequeue",	do_pop_dequeue },
+	{ "freeze",	do_freeze },
 	{ 0 }
 };
 
-- 
2.17.1


^ permalink raw reply related

* Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding
From: Jiong Wang @ 2019-08-21  9:05 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Alexei Starovoitov, Daniel Borkmann, Jiong Wang, bpf,
	linux-kernel, linuxppc-dev, Michael Ellerman, netdev
In-Reply-To: <1566376025.68ldwx3wc7.naveen@linux.ibm.com>


Naveen N. Rao writes:

> Naveen N. Rao wrote:
>> Since BPF constant blinding is performed after the verifier pass, there
>> are certain ALU32 instructions inserted which don't have a corresponding
>> zext instruction inserted after. This is causing a kernel oops on
>> powerpc and can be reproduced by running 'test_cgroup_storage' with
>> bpf_jit_harden=2.
>> 
>> Fix this by emitting BPF_ZEXT during constant blinding if
>> prog->aux->verifier_zext is set.
>> 
>> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis result")
>> Reported-by: Michael Ellerman <mpe@ellerman.id.au>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> ---
>> This approach (the location where zext is being introduced below, in 
>> particular) works for powerpc, but I am not entirely sure if this is 
>> sufficient for other architectures as well. This is broken on v5.3-rc4.
>
> Alexie, Daniel, Jiong,
> Any feedback on this?

The fix on BPF_LD | BPF_IMM | BPF_DW looks correct to me, but the two other
places looks to me is unnecessary, as those destinations are exposed to
external and if they are used as 64-bit then there will be zext inserted
for them.

Have you verified removing those two fixes will still cause the bug?

Regards,
Jiong

>
> - Naveen


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Igor Russkikh @ 2019-08-21  9:20 UTC (permalink / raw)
  To: Sabrina Dubroca, Antoine Tenart
  Cc: Andrew Lunn, davem@davemloft.net, f.fainelli@gmail.com,
	hkallweit1@gmail.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, thomas.petazzoni@bootlin.com,
	alexandre.belloni@bootlin.com, allan.nielsen@microchip.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>


> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).  At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.

I think there should be driver specific implementation of this functionality.
As an example, our macsec HW issues an interrupt towards the host to indicate
PN threshold has reached and it's time for userspace to change the keys.

In contrast, current SW macsec implementation just stops this SA/secy.

> I don't see how this implementation handles non-macsec traffic (on TX,
> that would be packets sent directly through the real interface, for
> example by wpa_supplicant - on RX, incoming MKA traffic for
> wpa_supplicant). Unless I missed something, incoming MKA traffic will
> end up on the macsec interface as well as the lower interface (not
> entirely critical, as long as wpa_supplicant can grab it on the lower
> device, but not consistent with the software implementation). How does
> the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?

I can comment on our HW engine - where it has special bypass rules
for configured ethertypes. This way macsec engine skips encryption on TX and
passes in RX unencrypted for the selected control packets.

But thats true, realdev driver is hard to distinguish encrypted/unencrypted
packets. In case realdev should make a decision where to put RX packet,
it only may do some heuristic (since after macsec decription all the
macsec related info is dropped. Thats true at least for our HW implementation).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).

I don't think this way is good, as driver have to do per packet header mangling.
That'll harm linerate performance heavily.

Regards,
   Igor

^ permalink raw reply

* Re: [RFC 1/4] Add usb_get_address and usb_set_address support
From: Oliver Neukum @ 2019-08-21  9:08 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <1566339522507.45056@Dellteam.com>

Am Dienstag, den 20.08.2019, 22:18 +0000 schrieb
Charles.Hyde@dellteam.com:
> The core USB driver message.c is missing get/set address functionality

This should go into usbnet. The CDC parser is where it is because
it is needed for serial and network devices. As serial devices
do not have a MAC, this can go into usbnet.

> that stops ifconfig from being able to push MAC addresses out to USB
> based ethernet devices.  Without this functionality, some USB devices
> stop responding to ethernet packets when using ifconfig to change MAC
> addresses.  This has been tested with a Dell Universal Dock D6000.
> 
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: linux-usb@vger.kernel.org
> ---
>  drivers/usb/core/message.c | 59 ++++++++++++++++++++++++++++++++++++++
>  include/linux/usb.h        |  3 ++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 5adf489428aa..eea775234b09 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1085,6 +1085,65 @@ int usb_clear_halt(struct usb_device *dev, int pipe)
>  }
>  EXPORT_SYMBOL_GPL(usb_clear_halt);
>  
> +/**
> + * usb_get_address - 
> + * @dev: device whose endpoint is halted

Which endpoint?

> + * @mac: buffer for containing 
> + * Context: !in_interrupt ()
> + *
> + * This will attempt to get the six byte MAC address from a USB device's
> + * ethernet controller using GET_NET_ADDRESS command.
> + *
> + * This call is synchronous, and may not be used in an interrupt context.
> + *
> + * Return: Zero on success, or else the status code returned by the

Well, I am afraid it will return 6 on success.

> + * underlying usb_control_msg() call.
> + */
> +int usb_get_address(struct usb_device *dev, unsigned char * mac)
> +{
> +	int ret = -ENOMEM;

Initialization is unnecessary here.

> +	unsigned char *tbuf = kmalloc(256, GFP_NOIO);

If you intentionally picked a safety margin of 42 times, this
is cool. Otherwise it is a litttle much.

> +
> +	if (!tbuf)
> +		return -ENOMEM;
> +
> +	ret = usb_control_msg(dev, usb_sndctrlpipe(dev, 0),
> +			USB_CDC_GET_NET_ADDRESS,
> +			USB_DIR_IN | USB_TYPE_CLASS | USB_RECIP_INTERFACE,
> +			0, USB_REQ_SET_ADDRESS, tbuf, 256,
> +			USB_CTRL_GET_TIMEOUT);
> +	if (ret == 6)
> +		memcpy(mac, tbuf, 6);

You cannot ignore the case of devices sending more or less than 6
bytes.

	Regards
		Oliver


^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:24 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Antoine Tenart, Igor Russkikh, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <20190820144119.GA28714@bistromath.localdomain>

Hi,

I can add some information to the HW Antoine is working on, general design of it
and the thoughts behind it. See below.

The 08/20/2019 16:41, Sabrina Dubroca wrote:
> 2019-08-20, 12:01:40 +0200, Antoine Tenart wrote:
> > So it seems the ability to enable or disable the offloading on a given
> > interface is the main missing feature. I'll add that, however I'll
> > probably (at least at first):
> > 
> > - Have the interface to be fully offloaded or fully handled in s/w (with
> >   errors being thrown if a given configuration isn't supported). Having
> >   both at the same time on a given interface would be tricky because of
> >   the MACsec validation parameter.
> > 
> > - Won't allow to enable/disable the offloading of there are rules in
> >   place, as we're not sure the same rules would be accepted by the other
> >   implementation.
> 
> That's probably quite problematic actually, because to do that you
> need to be able to resync the state between software and hardware,
> particularly packet numbers. So, yeah, we're better off having it
> completely blocked until we have a working implementation (if that
> ever happens).
> 
> However, that means in case the user wants to set up something that's
> not offloadable, they'll have to:
>  - configure the offloaded version until EOPNOTSUPP
>  - tear everything down
>  - restart from scratch without offloading
> 
> That's inconvenient.
> 
> Talking about packet numbers, can you describe how PN exhaustion is
> handled?  I couldn't find much about packet numbers at all in the
> driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> the same SA).
New SA's are suppose to be installed ahead of time. The HW will automatic move
to the next SA and reset the PN.

> At some point userspace needs to know that we're
> getting close to 2^32 and that it's time to re-key.  Since the whole
> TX path of the software implementation is bypassed, it looks like the
> PN (as far as drivers/net/macsec.c is concerned) never increases, so
> userspace can't know when to negotiate a new SA.
> 
> > I'm not sure if we should allow to mix the implementations on a given
> > physical interface (by having two MACsec interfaces attached) as the
> > validation would be impossible to do (we would have no idea if a
> > packet was correctly handled by the offloading part or just not being
> > a MACsec packet in the first place, in Rx).
> 
> That's something that really bothers me with this proposal. Quoting
> from the commit message:
> 
> > the packets seen by the networking stack on both the physical and
> > MACsec virtual interface are exactly the same
> 
> If the HW/driver is expected to strip the sectag, I don't see how we
> could ever have multiple secy's at the same time and demultiplex
> properly between them. That's part of the reason why I chose to have
> virtual interfaces: without them, picking the right secy on TX gets
> weird.

The HW does frame clasification, and use the claisfication to associate frames
to a given secy.

We we in SW have eth0, with 2 vlan-sub interfaces, and enable macsec on those,
then we have:

eth0
eth0.10
eth0.10.macsec
eth0.20
eth0.20.macsec

In this case the HW needs to be configured to match vlan 10 to one secy, and
vlan 20 to an other one.

This is nor supported in the current patch, but is something we can add later.
We just wanted to get the basic functionallity done right before moving on to
this.

But in the current design, there is nothing that prevent us from adding this.

If anyone is interested in the details of this then it is described in section
3.6.3 in http://ww1.microchip.com/downloads/en/DeviceDoc/VMDS-10455.pdf

It is possible to construct an encapsulation that the HW can not classify
correctly. If that is the case, then we should reject the HW offload, and use
the SW.

But it is a good point, which I think is missing. We should properly reject
MACsec HW offload (with this driver, in this state) on virtual interfaces, if
the encapsulation can not be handled (could start by reject all virtual
interfaces)

> AFAICT, it means that any filters (tc, tcpdump) need to be different
> between offloaded and non-offloaded cases.
It will see the result of the offloaded operation. But I guess that this is no
different from when 'tc' operations are offloaded to HW.

> How does the driver distinguish traffic that should pass through unmodified
> from traffic that the HW needs to encapsulate and encrypt?
It relay on frame classification (I think Antoine is missing a flow
configuration to bypass all MKA traffic).

> If you look at IPsec offloading, the networking stack builds up the
> ESP header, and passes the unencrypted data down to the driver. I'm
> wondering if the same would be possible with MACsec offloading: the
> macsec virtual interface adds the header (and maybe a dummy ICV), and
> then the HW does the encryption. In case of HW that needs to add the
> sectag itself, the driver would first strip the headers that the stack
> created. On receive, the driver would recreate a sectag and the macsec
> interface would just skip all verification (decrypt, PN).
I do not think this is possible with this HW, nor do I think this is desirable.

One of the big differences between MACsec and IPsec, is the fact that it is a L2
protocol, designed to be running on switches (this is how MACsec claims to limit
key-distributions issues, as all frames are decrypted when entering a switch,
and encrypted with a new key when leaving).

If a SW stack needs to add a tag to the frame, then we cannot offload traffic
being bridged in HW, and never goes to the CPU.

/Allan

^ permalink raw reply

* Re: [PATCH v1] ocfs2/dlm: Move BITS_TO_BYTES() to bitops.h for wider use
From: Andy Shevchenko @ 2019-08-21  9:25 UTC (permalink / raw)
  To: Joseph Qi
  Cc: Mark Fasheh, Joel Becker, ocfs2-devel, Ariel Elior,
	Sudarsana Kalluru, GR-everest-linux-l2, David S. Miller, netdev,
	Colin Ian King
In-Reply-To: <1a3e6660-10d2-e66c-2880-24af64c7f120@linux.alibaba.com>

On Wed, Aug 21, 2019 at 09:29:04AM +0800, Joseph Qi wrote:
> On 19/8/21 00:31, Andy Shevchenko wrote:
> > There are users already and will be more of BITS_TO_BYTES() macro.
> > Move it to bitops.h for wider use.
> > 
> > Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h | 1 -
> >  fs/ocfs2/dlm/dlmcommon.h                         | 4 ----
> >  include/linux/bitops.h                           | 1 +
> >  3 files changed, 1 insertion(+), 5 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > index 066765fbef06..0a59a09ef82f 100644
> > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
> > @@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(struct bnx2x *bp, enum cos_mode mode,
> >   *    possible, the driver should only write the valid vnics into the internal
> >   *    ram according to the appropriate port mode.
> >   */
> > -#define BITS_TO_BYTES(x) ((x)/8)>
> I don't think this is a equivalent replace, or it is in fact
> wrong before?

I was thinking about this one and there are two applications:
- calculus of the amount of structures of certain type per PAGE
  (obviously off-by-one error in the original code IIUC purpose of STRUCT_SIZE)
- calculus of some threshold based on line speed in bytes per second
  (I dunno it will have any difference on the Gbs / 100 MBs speeds)

> >  /* CMNG constants, as derived from system spec calculations */
> >  
> > diff --git a/fs/ocfs2/dlm/dlmcommon.h b/fs/ocfs2/dlm/dlmcommon.h
> > index aaf24548b02a..0463dce65bb2 100644
> > --- a/fs/ocfs2/dlm/dlmcommon.h
> > +++ b/fs/ocfs2/dlm/dlmcommon.h
> > @@ -688,10 +688,6 @@ struct dlm_begin_reco
> >  	__be32 pad2;
> >  };
> >  
> > -
> > -#define BITS_PER_BYTE 8
> > -#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE)
> > -
> For ocfs2 part, it looks good to me.
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

Thanks!

> 
> >  struct dlm_query_join_request
> >  {
> >  	u8 node_idx;
> > diff --git a/include/linux/bitops.h b/include/linux/bitops.h
> > index cf074bce3eb3..79d80f5ddf7b 100644
> > --- a/include/linux/bitops.h
> > +++ b/include/linux/bitops.h
> > @@ -5,6 +5,7 @@
> >  #include <linux/bits.h>
> >  
> >  #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
> > +#define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_BYTE)
> >  #define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
> >  
> >  extern unsigned int __sw_hweight8(unsigned int w);
> > 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: allan.nielsen @ 2019-08-21  9:27 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: Sabrina Dubroca, Antoine Tenart, Andrew Lunn, davem@davemloft.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	camelia.groza@nxp.com, Simon Edelhaus, Pavel Belous
In-Reply-To: <81ec0497-58cd-1f4c-faa3-c057693cd50e@aquantia.com>

The 08/21/2019 09:20, Igor Russkikh wrote:
> > Talking about packet numbers, can you describe how PN exhaustion is
> > handled?  I couldn't find much about packet numbers at all in the
> > driver patches (I hope the hw doesn't wrap around from 2^32-1 to 0 on
> > the same SA).  At some point userspace needs to know that we're
> > getting close to 2^32 and that it's time to re-key.  Since the whole
> > TX path of the software implementation is bypassed, it looks like the
> > PN (as far as drivers/net/macsec.c is concerned) never increases, so
> > userspace can't know when to negotiate a new SA.
> 
> I think there should be driver specific implementation of this functionality.
> As an example, our macsec HW issues an interrupt towards the host to indicate
> PN threshold has reached and it's time for userspace to change the keys.
> 
> In contrast, current SW macsec implementation just stops this SA/secy.
> 
> > I don't see how this implementation handles non-macsec traffic (on TX,
> > that would be packets sent directly through the real interface, for
> > example by wpa_supplicant - on RX, incoming MKA traffic for
> > wpa_supplicant). Unless I missed something, incoming MKA traffic will
> > end up on the macsec interface as well as the lower interface (not
> > entirely critical, as long as wpa_supplicant can grab it on the lower
> > device, but not consistent with the software implementation). How does
> > the driver distinguish traffic that should pass through unmodified
> > from traffic that the HW needs to encapsulate and encrypt?
> 
> I can comment on our HW engine - where it has special bypass rules
> for configured ethertypes. This way macsec engine skips encryption on TX and
> passes in RX unencrypted for the selected control packets.
In our case it is a TCAM which can look at various fields (including ethertype),
but is sounds like we have a vary similar design.

> But thats true, realdev driver is hard to distinguish encrypted/unencrypted
> packets. In case realdev should make a decision where to put RX packet,
> it only may do some heuristic (since after macsec decription all the
> macsec related info is dropped. Thats true at least for our HW implementation).
Same for ours.

> > If you look at IPsec offloading, the networking stack builds up the
> > ESP header, and passes the unencrypted data down to the driver. I'm
> > wondering if the same would be possible with MACsec offloading: the
> > macsec virtual interface adds the header (and maybe a dummy ICV), and
> > then the HW does the encryption. In case of HW that needs to add the
> > sectag itself, the driver would first strip the headers that the stack
> > created. On receive, the driver would recreate a sectag and the macsec
> > interface would just skip all verification (decrypt, PN).
> 
> I don't think this way is good, as driver have to do per packet header mangling.
> That'll harm linerate performance heavily.
Agree, and it will also prevent MACsec offload in offloaded bridge devices.

/Allan

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: Delete the VID from the upstream port as well
From: Vladimir Oltean @ 2019-08-21  9:51 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
	nikolay, David S. Miller, netdev
In-Reply-To: <20190820233026.GC21067@t480s.localdomain>

On Wed, 21 Aug 2019 at 06:30, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> On Wed, 21 Aug 2019 01:09:39 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> > I mean I made an argument already for the hack in 4/6 ("Don't program
> > the VLAN as pvid on the upstream port"). If the hack gets accepted
> > like that, I have no further need of any change in the implicit VLAN
> > configuration. But it's still a hack, so in that sense it would be
> > nicer to not need it and have a better amount of control.
>
> How come you simply cannot ignore the PVID flag for the CPU port in the
> driver directly, as mv88e6xxx does in preference of the Marvell specific
> "unmodified" mode? What PVID are you programming on the CPU port already?
>
>
> Thanks,
>
>         Vivien

sja1105 has no such thing as an "unmodified" egress policy.
And ignoring the flags in the switch driver for the CPU port is just
as hacky as fixing it up in the DSA core.
I fail to see any reason to change the pvid for the CPU/DSA ports,
maybe you can clarify.

Actually I gave a second thought to the patchset and in a weird,
convoluted way, I can get away with just:
- 2/6: net: bridge: Populate the pvid flag in br_vlan_get_info
- 5/6 net: dsa: Allow proper internal use of VLANs
- 6/6 net: dsa: tag_8021q: Restore bridge pvid when enabling vlan_filtering
A side effect of running dsa_port_restore_pvid on a user port will be
that it is going to also restore the pvid on the CPU port, via the
bitmap operations. I had not thought of this initially when I first
jumped to remove the BRIDGE_VLAN_INFO_PVID flag in 4/6. And the fact
that it would work would just be "programming by coincidence".

I guess my aversion against the VLAN bitmap operations (and, well,
"match" in your new change) stems from the fact that the DSA core sits
in the way of doing custom configuration of the CPU port VLAN
settings. Yes, you can work around it (dsa_8021q first configures the
user ports as pvid and egress untagged, then configures the CPU port
as egress tagged, so that the pvid setting from user ports doesn't
"stick" to the CPU via the bitmap), but it's like pouring water that's
half hot and half cold from a water cooler, when all you want is water
that's at room temperature.

-Vladimir

^ permalink raw reply

* Re: [PATCH net-next v3 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Hubert Feurstein @ 2019-08-21  9:53 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Andrew Lunn, netdev, lkml, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190821080709.GO891@localhost>

Am Mi., 21. Aug. 2019 um 10:07 Uhr schrieb Miroslav Lichvar
<mlichvar@redhat.com>:
> > Currently I do not see the benefit from this. The original intention was to
> > compensate for the remaining offset as good as possible.
>
> That's ok, but IMHO the change should not break the assumptions of
> existing application and users.
>
> > The current code
> > of phc2sys uses the delay only for the filtering of the measurement record
> > with the shortest delay and for reporting and statistics. Why not simple shift
> > the timestamps with the offset to the point where we expect the PHC timestamp
> > to be captured, and we have a very good result compared to where we came
> > from.
>
> Because those reports/statistics are important in calculation of
> maximum error. If someone had a requirement for a clock to be accurate
> to 1.5 microseconds and the ioctl returned a delay indicating a
> sufficient accuracy when in reality it could be worse, that would be a
> problem.
>
Ok, I understand your point. But including the MDIO completion into
delay calculation
will indicate a much wore precision as it actually is. When the MDIO
driver implements
the PTP system timestamping as follows ...

  ptp_read_system_prets(bus->ptp_sts);
  writel(value, mdio-reg)
  ptp_read_system_postts(bus->ptp_sts);

... then we catch already the error caused by interrupts which hit the
pre/post_ts section.
Now we only have the additional error of one MDIO clock cycle
(~400ns). Because I expect
the MDIO controller to shift out the MDIO frame on the next MDIO clock
cycle. So if I subtract
one MDIO clock cycle from pre_ts and add one MDIO clock cycle to
post_ts the error indication
would be sufficiently corrected IMHO. And then we can shift both
timestamps in the switch driver
(in the gettimex handler) to compensate the switch depending offset.
What do you think?

Hubert

^ permalink raw reply

* Re: [RFC 2/4] Allow cdc_ncm to set MAC address in hardware
From: Oliver Neukum @ 2019-08-21  9:39 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <1566339663476.54366@Dellteam.com>

Am Dienstag, den 20.08.2019, 22:21 +0000 schrieb
Charles.Hyde@dellteam.com:
> This patch adds support for pushing a MAC address out to USB based
> ethernet controllers driven by cdc_ncm.  With this change, ifconfig can
> now set the device's MAC address.  For example, the Dell Universal Dock
> D6000 is driven by cdc_ncm.  The D6000 can now have its MAC address set
> by ifconfig, as it can be done in Windows.  This was tested with a D6000
> using ifconfig.

On a design note, it looks like you broke S4 with the driver.
Suspend To Disk will cut power and hence reset the MAC to default.
You need to reset it to the user's setting in reset_resume().
Please add that to usbnet.

> 
> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
> Cc: Mario Limonciello <mario.limonciello@dell.com>
> Cc: Oliver Neukum <oliver@neukum.org>
> Cc: netdev@vger.kernel.org
> Cc: linux-usb@vger.kernel.org
> ---
>  drivers/net/usb/cdc_ncm.c  | 20 +++++++++++++++++++-
>  drivers/net/usb/usbnet.c   | 37 ++++++++++++++++++++++++++++---------
>  include/linux/usb/usbnet.h |  1 +
>  3 files changed, 48 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
> index 50c05d0f44cb..f77c8672f972 100644
> --- a/drivers/net/usb/cdc_ncm.c
> +++ b/drivers/net/usb/cdc_ncm.c
> @@ -750,6 +750,24 @@ int cdc_ncm_change_mtu(struct net_device *net, int new_mtu)
>  }
>  EXPORT_SYMBOL_GPL(cdc_ncm_change_mtu);
>  
> +/* Provide method to push MAC address to the USB device's ethernet controller.
> + */
> +int cdc_ncm_set_mac_addr(struct net_device *net, void *p)
> +{
> +	struct usbnet *dev = netdev_priv(net);
> +	struct sockaddr *addr = p;
> +
> +	memcpy(dev->net->dev_addr, addr->sa_data, ETH_ALEN);
> +	/*
> +	 * Try to push the MAC address out to the device.  Ignore any errors,
> +	 * to be compatible with prior versions of this source.
> +	 */
> +	usbnet_set_ethernet_addr(dev);
> +
> +	return eth_mac_addr(net, p);
> +}
> +EXPORT_SYMBOL_GPL(cdc_ncm_set_mac_addr);
> +
>  static const struct net_device_ops cdc_ncm_netdev_ops = {
>  	.ndo_open	     = usbnet_open,
>  	.ndo_stop	     = usbnet_stop,
> @@ -757,7 +775,7 @@ static const struct net_device_ops cdc_ncm_netdev_ops = {
>  	.ndo_tx_timeout	     = usbnet_tx_timeout,
>  	.ndo_get_stats64     = usbnet_get_stats64,
>  	.ndo_change_mtu	     = cdc_ncm_change_mtu,
> -	.ndo_set_mac_address = eth_mac_addr,
> +	.ndo_set_mac_address = cdc_ncm_set_mac_addr,

Why can't this fully go into usbnet?

>  	.ndo_validate_addr   = eth_validate_addr,
>  };
>  
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index 72514c46b478..72bdac34b0ee 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -149,20 +149,39 @@ int usbnet_get_ethernet_addr(struct usbnet *dev, int iMACAddress)
>  	int 		tmp = -1, ret;
>  	unsigned char	buf [13];
>  
> -	ret = usb_string(dev->udev, iMACAddress, buf, sizeof buf);
> -	if (ret == 12)
> -		tmp = hex2bin(dev->net->dev_addr, buf, 6);
> -	if (tmp < 0) {
> -		dev_dbg(&dev->udev->dev,
> -			"bad MAC string %d fetch, %d\n", iMACAddress, tmp);
> -		if (ret >= 0)
> -			ret = -EINVAL;
> -		return ret;
> +	ret = usb_get_address(dev->udev, buf);
> +	if (ret == 6)

If you mean ETH_ALEN, you should use it.

> +		memcpy(dev->net->dev_addr, buf, 6);
> +	else if (ret < 0) {
> +		ret = usb_string(dev->udev, iMACAddress, buf, sizeof buf);
> +		if (ret == 12)
> +			tmp = hex2bin(dev->net->dev_addr, buf, 6);
> +		if (tmp < 0) {
> +			dev_dbg(&dev->udev->dev,
> +				"bad MAC string %d fetch, %d\n", iMACAddress,
> +				tmp);
> +			if (ret >= 0)
> +				ret = -EINVAL;
> +			return ret;

Again, you cannot ignore the possibility of getting fewer or more than
6 bytes.

> +		}
>  	}
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(usbnet_get_ethernet_addr);
>  
> +int usbnet_set_ethernet_addr(struct usbnet *dev)
> +{
> +	int ret;
> +
> +	ret = usb_set_address(dev->udev, dev->net->dev_addr);
> +	if (ret < 0) {
> +		dev_dbg(&dev->udev->dev,
> +			"bad MAC address put, %d\n", ret);
> +	}
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(usbnet_set_ethernet_addr);

What is the purpose of this wrapper?

	Regards
		Oliver


^ permalink raw reply

* Re: [RFC 0/4] Add support into cdc_ncm for MAC address pass through
From: Oliver Neukum @ 2019-08-21  9:41 UTC (permalink / raw)
  To: Charles.Hyde, linux-acpi, linux-usb
  Cc: Mario.Limonciello, gregkh, nic_swsd, netdev
In-Reply-To: <89a5f8ea30b240babd8750d236ca9ef4@AUSX13MPS303.AMER.DELL.COM>

Am Dienstag, den 20.08.2019, 22:15 +0000 schrieb
Charles.Hyde@dellteam.com:
> In recent testing of a Dell Universal Dock D6000, I found that MAC address pass through is not supported in the Linux drivers.

Hi,

thank you for writing this code. It adds cool functionality.
There are, however, a few design issues and minor flaws with it.
I will comment on the patches in the series to point them out.
Could you correct them?

	Regards
		Oliver


^ permalink raw reply

* Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Pablo Neira Ayuso @ 2019-08-21  9:58 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Florian Westphal, netfilter-devel, coreteam, netdev, linux-kernel,
	Jozsef Kadlecsik, David S. Miller
In-Reply-To: <793ce2e9b6200a033d44716749acc837aaf5e4e7.camel@linux.ibm.com>

On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > Wouldn't fib_netdev.c have the same problem?
> Probably, but I haven't hit this issue yet.
> 
> > If so, might be better to place this test in both
> > nft_fib6_eval_type and nft_fib6_eval.
>
> I think that is possible, and not very hard to do.
> 
> But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> nft_fib_netdev_eval() have the responsibility to choose a valid
> protocol or drop the package. 
> I am not sure if it would be a good move to transfer this
> responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> rather add the same test to nft_fib_netdev_eval().
> 
> Does it make sense?

Please, update common code to netdev and ip6 extensions as Florian
suggests.

Thanks.

^ permalink raw reply

* pull-request: mac80211 2019-08-21
From: Johannes Berg @ 2019-08-21 10:00 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

I have here for you a few fixes; three, to be specific. Nothing that
warrants real discussion or urgency though.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit a1c4cd67840ef80f6ca5f73326fa9a6719303a95:

  net: fix __ip_mc_inc_group usage (2019-08-20 12:48:06 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2019-08-21

for you to fetch changes up to 0d31d4dbf38412f5b8b11b4511d07b840eebe8cb:

  Revert "cfg80211: fix processing world regdomain when non modular" (2019-08-21 10:43:03 +0200)

----------------------------------------------------------------
Just three fixes:
 * extended key ID key installation
 * regulatory processing
 * possible memory leak in an error path

----------------------------------------------------------------
Alexander Wetzel (1):
      cfg80211: Fix Extended Key ID key install checks

Hodaszi, Robert (1):
      Revert "cfg80211: fix processing world regdomain when non modular"

Johannes Berg (1):
      mac80211: fix possible sta leak

 net/mac80211/cfg.c  |  9 +++++----
 net/wireless/reg.c  |  2 +-
 net/wireless/util.c | 23 ++++++++++++++---------
 3 files changed, 20 insertions(+), 14 deletions(-)


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox