Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH ipsec-next 02/11] xfrm: security: iterate all, not inexact lists
From: Florian Westphal @ 2018-11-07 22:00 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20181107220041.26205-1-fw@strlen.de>

currently all non-socket policies are either hashed in the dst table,
or placed on the 'inexact list'.  When flushing, we first walk the
table, then the (per-direction) inexact lists.

When we try and get rid of the inexact lists to having "n" inexact
lists (e.g. per-af inexact lists, or sorted into a tree), this walk
would become more complicated.

Simplify this: walk the 'all' list and skip socket policies during
traversal so we don't need to handle exact and inexact policies
separately anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/xfrm/xfrm_policy.c | 93 ++++++++++++------------------------------
 1 file changed, 26 insertions(+), 67 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 119a427d9b2b..39d0db2a50d9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -892,36 +892,19 @@ EXPORT_SYMBOL(xfrm_policy_byid);
 static inline int
 xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 {
-	int dir, err = 0;
+	struct xfrm_policy *pol;
+	int err = 0;
 
-	for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-		struct xfrm_policy *pol;
-		int i;
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		if (pol->walk.dead ||
+		    xfrm_policy_id2dir(pol->index) >= XFRM_POLICY_MAX ||
+		    pol->type != type)
+			continue;
 
-		hlist_for_each_entry(pol,
-				     &net->xfrm.policy_inexact[dir], bydst) {
-			if (pol->type != type)
-				continue;
-			err = security_xfrm_policy_delete(pol->security);
-			if (err) {
-				xfrm_audit_policy_delete(pol, 0, task_valid);
-				return err;
-			}
-		}
-		for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-			hlist_for_each_entry(pol,
-					     net->xfrm.policy_bydst[dir].table + i,
-					     bydst) {
-				if (pol->type != type)
-					continue;
-				err = security_xfrm_policy_delete(
-								pol->security);
-				if (err) {
-					xfrm_audit_policy_delete(pol, 0,
-								 task_valid);
-					return err;
-				}
-			}
+		err = security_xfrm_policy_delete(pol->security);
+		if (err) {
+			xfrm_audit_policy_delete(pol, 0, task_valid);
+			return err;
 		}
 	}
 	return err;
@@ -937,6 +920,7 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 {
 	int dir, err = 0, cnt = 0;
+	struct xfrm_policy *pol;
 
 	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 
@@ -944,46 +928,21 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 	if (err)
 		goto out;
 
-	for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-		struct xfrm_policy *pol;
-		int i;
-
-	again1:
-		hlist_for_each_entry(pol,
-				     &net->xfrm.policy_inexact[dir], bydst) {
-			if (pol->type != type)
-				continue;
-			__xfrm_policy_unlink(pol, dir);
-			spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-			cnt++;
-
-			xfrm_audit_policy_delete(pol, 1, task_valid);
-
-			xfrm_policy_kill(pol);
-
-			spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-			goto again1;
-		}
-
-		for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-	again2:
-			hlist_for_each_entry(pol,
-					     net->xfrm.policy_bydst[dir].table + i,
-					     bydst) {
-				if (pol->type != type)
-					continue;
-				__xfrm_policy_unlink(pol, dir);
-				spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-				cnt++;
-
-				xfrm_audit_policy_delete(pol, 1, task_valid);
-				xfrm_policy_kill(pol);
-
-				spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-				goto again2;
-			}
-		}
+again:
+	list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+		dir = xfrm_policy_id2dir(pol->index);
+		if (pol->walk.dead ||
+		    dir >= XFRM_POLICY_MAX ||
+		    pol->type != type)
+			continue;
 
+		__xfrm_policy_unlink(pol, dir);
+		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+		cnt++;
+		xfrm_audit_policy_delete(pol, 1, task_valid);
+		xfrm_policy_kill(pol);
+		spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+		goto again;
 	}
 	if (!cnt)
 		err = -ESRCH;
-- 
2.18.1

^ permalink raw reply related

* [PATCH ipsec-next 01/11] selftests: add xfrm policy test script
From: Florian Westphal @ 2018-11-07 22:00 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20181107220041.26205-1-fw@strlen.de>

add a script that adds a ipsec tunnel between two network
namespaces plus following policies:

.0/24 -> ipsec tunnel
.240/28 -> bypass
.253/32 -> ipsec tunnel

Then check that .254 bypasses tunnel (match /28 exception),
and .2 (match /24) and .253 (match direct policy) pass through the
tunnel.

Abuses iptables to check if ping did resolve an ipsec policy or not.

Also adds a bunch of 'block' rules that are not supposed to match.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 tools/testing/selftests/net/Makefile       |   3 +-
 tools/testing/selftests/net/xfrm_policy.sh | 276 +++++++++++++++++++++
 2 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/xfrm_policy.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..4c7df17d4f42 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,8 @@
 CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
-TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
+TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \
+	      rtnetlink.sh xfrm_policy.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
diff --git a/tools/testing/selftests/net/xfrm_policy.sh b/tools/testing/selftests/net/xfrm_policy.sh
new file mode 100755
index 000000000000..72f9e3dfbea1
--- /dev/null
+++ b/tools/testing/selftests/net/xfrm_policy.sh
@@ -0,0 +1,276 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check xfrm policy resolution.  Topology:
+#
+# 1.2   1.1   3.1  3.10    2.1   2.2
+# eth1  eth1 veth0 veth0 eth1   eth1
+# ns1 ---- ns3 ----- ns4 ---- ns2
+#
+# ns3 and ns4 are connected via ipsec tunnel.
+# pings from ns1 to ns2 (and vice versa) are supposed to work like this:
+# ns1: ping 10.0.2.2: passes via ipsec tunnel.
+# ns2: ping 10.0.1.2: passes via ipsec tunnel.
+
+# ns1: ping 10.0.1.253: passes via ipsec tunnel (direct policy)
+# ns2: ping 10.0.2.253: passes via ipsec tunnel (direct policy)
+#
+# ns1: ping 10.0.2.254: does NOT pass via ipsec tunnel (exception)
+# ns2: ping 10.0.1.254: does NOT pass via ipsec tunnel (exception)
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+ret=0
+
+KEY_SHA=0xdeadbeef1234567890abcdefabcdefabcdefabcd
+KEY_AES=0x0123456789abcdef0123456789012345
+SPI1=0x1
+SPI2=0x2
+
+do_esp() {
+    local ns=$1
+    local me=$2
+    local remote=$3
+    local lnet=$4
+    local rnet=$5
+    local spi_out=$6
+    local spi_in=$7
+
+    ip -net $ns xfrm state add src $remote dst $me proto esp spi $spi_in  enc aes $KEY_AES  auth sha1 $KEY_SHA  mode tunnel sel src $rnet dst $lnet
+    ip -net $ns xfrm state add src $me  dst $remote proto esp spi $spi_out enc aes $KEY_AES auth sha1 $KEY_SHA mode tunnel sel src $lnet dst $rnet
+
+    # to encrypt packets as they go out (includes forwarded packets that need encapsulation)
+    ip -net $ns xfrm policy add src $lnet dst $rnet dir out tmpl src $me dst $remote proto esp mode tunnel priority 100 action allow
+    # to fwd decrypted packets after esp processing:
+    ip -net $ns xfrm policy add src $rnet dst $lnet dir fwd tmpl src $remote dst $me proto esp mode tunnel priority 100 action allow
+}
+
+do_exception() {
+    local ns=$1
+    local me=$2
+    local remote=$3
+    local encryptip=$4
+    local plain=$5
+
+    # network $plain passes without tunnel
+    ip -net $ns xfrm policy add dst $plain dir out priority 10 action allow
+
+    # direct policy for $encryptip, use tunnel, higher prio takes precedence
+    ip -net $ns xfrm policy add dst $encryptip dir out tmpl src $me dst $remote proto esp mode tunnel priority 1 action allow
+}
+
+# policies that are not supposed to match any packets generated in this test.
+do_dummies4() {
+    local ns=$1
+
+    for i in $(seq 10 16);do
+      # dummy policy with wildcard src/dst.
+      echo netns exec $ns ip xfrm policy add src 0.0.0.0/0 dst 10.$i.99.0/30 dir out action block
+      echo netns exec $ns ip xfrm policy add src 10.$i.99.0/30 dst 0.0.0.0/0 dir out action block
+      for j in $(seq 32 64);do
+        echo netns exec $ns ip xfrm policy add src 10.$i.1.0/30 dst 10.$i.$j.0/30 dir out action block
+        # silly, as it encompasses the one above too, but its allowed:
+        echo netns exec $ns ip xfrm policy add src 10.$i.1.0/29 dst 10.$i.$j.0/29 dir out action block
+        # and yet again, even more broad one.
+        echo netns exec $ns ip xfrm policy add src 10.$i.1.0/24 dst 10.$i.$j.0/24 dir out action block
+        echo netns exec $ns ip xfrm policy add src 10.$i.$j.0/24 dst 10.$i.1.0/24 dir fwd action block
+      done
+    done | ip -batch /dev/stdin
+}
+
+do_dummies6() {
+    local ns=$1
+
+    for i in $(seq 10 16);do
+      for j in $(seq 32 64);do
+       echo netns exec $ns ip xfrm policy add src dead:$i::/64 dst dead:$i:$j::/64 dir out action block
+       echo netns exec $ns ip xfrm policy add src dead:$i:$j::/64 dst dead:$i::/24 dir fwd action block
+      done
+    done | ip -batch /dev/stdin
+}
+
+check_ipt_policy_count()
+{
+	ns=$1
+
+	ip netns exec $ns iptables-save -c |grep policy | ( read c rest
+		ip netns exec $ns iptables -Z
+		if [ x"$c" = x'[0:0]' ]; then
+			exit 0
+		elif [ x"$c" = x ]; then
+			echo "ERROR: No counters"
+			ret=1
+			exit 111
+		else
+			exit 1
+		fi
+	)
+}
+
+check_xfrm() {
+	# 0: iptables -m policy rule count == 0
+	# 1: iptables -m policy rule count != 0
+	rval=$1
+	ip=$2
+	ret=0
+
+	ip netns exec ns1 ping -q -c 1 10.0.2.$ip > /dev/null
+
+	check_ipt_policy_count ns3
+	if [ $? -ne $rval ] ; then
+		ret=1
+	fi
+	check_ipt_policy_count ns4
+	if [ $? -ne $rval ] ; then
+		ret=1
+	fi
+
+	ip netns exec ns2 ping -q -c 1 10.0.1.$ip > /dev/null
+
+	check_ipt_policy_count ns3
+	if [ $? -ne $rval ] ; then
+		ret=1
+	fi
+	check_ipt_policy_count ns4
+	if [ $? -ne $rval ] ; then
+		ret=1
+	fi
+
+	return $ret
+}
+
+#check for needed privileges
+if [ "$(id -u)" -ne 0 ];then
+	echo "SKIP: Need root privileges"
+	exit $ksft_skip
+fi
+
+ip -Version 2>/dev/null >/dev/null
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without the ip tool"
+	exit $ksft_skip
+fi
+
+# needed to check if policy lookup got valid ipsec result
+iptables --version 2>/dev/null >/dev/null
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not run test without iptables tool"
+	exit $ksft_skip
+fi
+
+for i in 1 2 3 4; do
+    ip netns add ns$i
+    ip -net ns$i link set lo up
+done
+
+DEV=veth0
+ip link add $DEV netns ns1 type veth peer name eth1 netns ns3
+ip link add $DEV netns ns2 type veth peer name eth1 netns ns4
+
+ip link add $DEV netns ns3 type veth peer name veth0 netns ns4
+
+DEV=veth0
+for i in 1 2; do
+    ip -net ns$i link set $DEV up
+    ip -net ns$i addr add 10.0.$i.2/24 dev $DEV
+    ip -net ns$i addr add dead:$i::2/64 dev $DEV
+
+    ip -net ns$i addr add 10.0.$i.253 dev $DEV
+    ip -net ns$i addr add 10.0.$i.254 dev $DEV
+    ip -net ns$i addr add dead:$i::fd dev $DEV
+    ip -net ns$i addr add dead:$i::fe dev $DEV
+done
+
+for i in 3 4; do
+ip -net ns$i link set eth1 up
+ip -net ns$i link set veth0 up
+done
+
+ip -net ns1 route add default via 10.0.1.1
+ip -net ns2 route add default via 10.0.2.1
+
+ip -net ns3 addr add 10.0.1.1/24 dev eth1
+ip -net ns3 addr add 10.0.3.1/24 dev veth0
+ip -net ns3 addr add 2001:1::1/64 dev eth1
+ip -net ns3 addr add 2001:3::1/64 dev veth0
+
+ip -net ns3 route add default via 10.0.3.10
+
+ip -net ns4 addr add 10.0.2.1/24 dev eth1
+ip -net ns4 addr add 10.0.3.10/24 dev veth0
+ip -net ns4 addr add 2001:2::1/64 dev eth1
+ip -net ns4 addr add 2001:3::10/64 dev veth0
+ip -net ns4 route add default via 10.0.3.1
+
+for j in 4 6; do
+	for i in 3 4;do
+		ip netns exec ns$i sysctl net.ipv$j.conf.eth1.forwarding=1 > /dev/null
+		ip netns exec ns$i sysctl net.ipv$j.conf.veth0.forwarding=1 > /dev/null
+	done
+done
+
+# abuse iptables rule counter to check if ping matches a policy
+ip netns exec ns3 iptables -p icmp -A FORWARD -m policy --dir out --pol ipsec
+ip netns exec ns4 iptables -p icmp -A FORWARD -m policy --dir out --pol ipsec
+if [ $? -ne 0 ];then
+	echo "SKIP: Could not insert iptables rule"
+	for i in 1 2 3 4;do ip netns del ns$i;done
+	exit $ksft_skip
+fi
+
+#          localip  remoteip  localnet    remotenet
+do_esp ns3 10.0.3.1 10.0.3.10 10.0.1.0/24 10.0.2.0/24 $SPI1 $SPI2
+do_esp ns3 dead:3::1 dead:3::10 dead:1::/64 dead:2::/64 $SPI1 $SPI2
+do_esp ns4 10.0.3.10 10.0.3.1 10.0.2.0/24 10.0.1.0/24 $SPI2 $SPI1
+do_esp ns4 dead:3::10 dead:3::1 dead:2::/64 dead:1::/64 $SPI2 $SPI1
+
+do_dummies4 ns3
+do_dummies6 ns4
+
+# ping to .254 should use ipsec, exception is not installed.
+check_xfrm 1 254
+if [ $? -ne 0 ]; then
+	echo "FAIL: expected ping to .254 to use ipsec tunnel"
+	ret=1
+else
+	echo "PASS: policy before exception matches"
+fi
+
+# installs exceptions
+#                localip  remoteip   encryptdst  plaindst
+do_exception ns3 10.0.3.1 10.0.3.10 10.0.2.253 10.0.2.240/28
+do_exception ns4 10.0.3.10 10.0.3.1 10.0.1.253 10.0.1.240/28
+
+do_exception ns3 dead:3::1 dead:3::10 dead:2::fd  dead:2:f0::/96
+do_exception ns4 dead:3::10 dead:3::1 dead:1::fd  dead:1:f0::/96
+
+# ping to .254 should now be excluded from the tunnel
+check_xfrm 0 254
+if [ $? -ne 0 ]; then
+	echo "FAIL: expected ping to .254 to fail"
+	ret=1
+else
+	echo "PASS: ping to .254 bypassed ipsec tunnel"
+fi
+
+# ping to .253 should use use ipsec due to direct policy exception.
+check_xfrm 1 253
+if [ $? -ne 0 ]; then
+	echo "FAIL: expected ping to .253 to use ipsec tunnel"
+	ret=1
+else
+	echo "PASS: direct policy matches"
+fi
+
+# ping to .2 should use ipsec.
+check_xfrm 1 2
+if [ $? -ne 0 ]; then
+	echo "FAIL: expected ping to .2 to use ipsec tunnel"
+	ret=1
+else
+	echo "PASS: policy matches"
+fi
+
+for i in 1 2 3 4;do ip netns del ns$i;done
+
+exit $ret
-- 
2.18.1

^ permalink raw reply related

* [PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree
From: Florian Westphal @ 2018-11-07 22:00 UTC (permalink / raw)
  To: netdev

This series attempts to improve xfrm policy lookup performance when
a lot of (several hundred or even thousands) inexact policies exist
on a system.

On insert, a policy is either placed in hash table (all direct (/32 for
ipv4, /128 policies, or all policies matching a user-configured threshold).
All other policies get inserted into inexact list as per priority.

Lookup then scans inexact list for first matching entry.

This series instead makes it so that inexact policy is added to exactly
one of four different search list classes.

1. "Any:Any" list, containing policies where both saddr and daddr are
   wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
   but without destination restrictions.
   These lists are stored in rbtree nodes; each node contains those
   policies matching saddr/prefixlen.
3. "Any:daddr" list. Similar to 2), except for policies where only the
   destinations are specified.
4. "saddr:daddr" lists, containing policies that match the given
   source/destination network.

   The root of the saddr/daddr tree is stored in the nodes of the
   'daddr' tree.

This diagram illustrates the list classes, and their
placement in the lookup hierarchy:

xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
 |
 +---- root_d: sorted by daddr:prefix
 |                 |
 |        xfrm_pol_inexact_node
 |                 |
 |                 +- root: sorted by saddr/prefix
 |                 |              |
 |                 |         xfrm_pol_inexact_node
 |                 |              |
 |                 |              + root: unused
 |                 |              |
 |                 |              + hhead: saddr:daddr policies
 |                 |
 |                 +- coarse policies and all any:daddr policies
 |
 +---- root_s: sorted by saddr:prefix
 |                 |
 |        xfrm_pol_inexact_node
 |                 |
 |                 + root: unused
 |                 |
 |                 + hhead: saddr:any policies
 |
 +---- coarse policies and all any:any policies

First we obtain inexact bin by hashing direction, family and other 'must
match' criteria.

Next, we search root_d using packets' destination ip address.
If we find a matching node, we search that nodes' source address tree
using packets src address.

We also search the bins root_s tree using packets source address.

This initial lookup now has created a 'candidate set' consisting
of up to four hlist heads to the four search classes.

Each list gets scanned for first matching entry.
Out of those up the one with the highest priority is chosen.
In case several policies had same priority, most recent one is used to
match old behaviour.

Locking rules:
insert requires net->xfrm.xfrm_policy_lock spinlock held + BH off
(no change to current kernel).
Insert/removal of a tree node needs increment of its sequence count.

lookup requires rcu read lock, lookups in a tree need to sample
the sequence count of the bin that tree is located in as well and
re-try in case it has changed.

Results on my kvm test setup, using 4 netns:

# 1.2   1.1   3.1  3.10    2.1   2.2
# eth1  eth1 veth0 veth0 eth1   eth1
# ns1 ---- ns3 ----- ns4 ---- ns2

ns3 and ns4 are connected via ipsec tunnel.
'Dummy policies' below means that i created 1k non-matching
policies in both ns3 and ns4.

20 parallel netperf (mbits, unidirectional TCP_STREAM):
normal:
993.518000 (no dummy policies)
796.32 (with dummy policies)

patched:
991.335 (no dummy policies)
989.684 (with dummy policies)

20 parallel TCP_RR, trans. per second:
normal:
32413.930000 (no dummy policies)
17725.970000 (with dummy policies)

patched:
32314.51 (no dummy policies)
32326.190000 (with dummy policies)

caveats:

1. Won't work unless large part of policies have no common saddr/daddr pairs.
   Extreme example: adding 10k 0.0.0.0/0 -> 0.0.0.0/0
   policies may require search of all policies, just like current kernel.

2. The policy add sequence
 a daddr 10.0.0.0/26
 b daddr 10.0.0.64/26
 c daddr 10.0.0.0/24

When c gets added, search nodes for a and b have to be merged with
the new subnet, as it contains both.

This also means that a single policy rule with a small subnet value
can cause severe tree degradation and thus longer search lists.

3. Causes (small) performance degradation when setup only has few policies.
4. Policies need an additional hlist, as MIGRATE doesn't consider if_id,
   but the patchset makes the if_id part of the initial hash lookup.
   Could be avoided by not considering if_id as part of initial bin lookup,
   but then increases list length a lot when we have lots of xfrm interfaces.
5. In oder to solve 'two matching candidates have same priority, which
   takes precedence?' problem I added a number to xfrm_policy struct that
   maps to its index in the complete inexact list.
   This was necessary to make sure same sequence of policy-add commands
   keeps on returning same lookup result compared to older kernel.

This work doesn't prevent us from re-adding a new flow caching scheme,
but so far i found no good solution (all have various DoS or degradation
issues).

Comments or questions welcome.

Florian Westphal (11):
      selftests: add xfrm policy test script
      xfrm: security: iterate all, not inexact lists
      xfrm: policy: split list insertion into a helper
      xfrm: policy: return NULL when inexact search needed
      xfrm: policy: store inexact policies in an rhashtable
      xfrm: policy: consider if_id when hashing inexact policy
      xfrm: policy: add inexact policy search tree infrastructure
      xfrm: policy: store inexact policies in a tree ordered by destination address
      xfrm: policy: check reinserted policies match their node
      xfrm: policy: store inexact policies in a tree ordered by source address
      xfrm: policy: add 2nd-level saddr trees for inexact policies

 include/net/netns/xfrm.h                   |    2
 include/net/xfrm.h                         |    3
 net/xfrm/xfrm_policy.c                     | 1234 ++++++++++++++++++++++++++---
 tools/testing/selftests/net/Makefile       |    3
 tools/testing/selftests/net/xfrm_policy.sh |  276 ++++++
 5 files changed, 1391 insertions(+), 127 deletions(-)

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp
From: Martin Lau @ 2018-11-07 21:55 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	joe@wand.net.nz, Kernel Team
In-Reply-To: <20181107213607.2658130-1-rdna@fb.com>

On Wed, Nov 07, 2018 at 01:36:07PM -0800, Andrey Ignatov wrote:
> Lookup functions in sk_lookup have different expectations about byte
> order of provided arguments.
> 
> Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
> expect dport to be in network byte order and do ntohs(dport) internally.
> 
> At the same time __inet6_lookup expects dport to be in host byte order
> and correspondingly name the argument hnum.
> 
> sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
> __inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
> uses host instead of expected network byte order. It makes result
> returned by bpf_sk_lookup_udp for IPv6 incorrect.
> 
> The patch fixes byte order of dport passed to __udp6_lib_lookup.
> 
> Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f02a
> fixes TCPv6 but breaks UDPv6.
> 
> Fixes: 5ef0ae84f02a ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
> Signed-off-by: Andrey Ignatov <rdna@fb.com>
Good catch.

Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  net/core/filter.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e521c5ebc7d1..9a1327eb25fa 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4852,18 +4852,17 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
>  	} else {
>  		struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
>  		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
> -		u16 hnum = ntohs(tuple->ipv6.dport);
>  		int sdif = inet6_sdif(skb);
>  
>  		if (proto == IPPROTO_TCP)
>  			sk = __inet6_lookup(net, &tcp_hashinfo, skb, 0,
>  					    src6, tuple->ipv6.sport,
> -					    dst6, hnum,
> +					    dst6, ntohs(tuple->ipv6.dport),
>  					    dif, sdif, &refcounted);
>  		else if (likely(ipv6_bpf_stub))
>  			sk = ipv6_bpf_stub->udp6_lib_lookup(net,
>  							    src6, tuple->ipv6.sport,
> -							    dst6, hnum,
> +							    dst6, tuple->ipv6.dport,
>  							    dif, sdif,
>  							    &udp_table, skb);
>  #endif
> -- 
> 2.17.1
> 

^ permalink raw reply

* Re: [PATCH bpf] bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp
From: Joe Stringer @ 2018-11-07 21:51 UTC (permalink / raw)
  To: rdna; +Cc: netdev, ast, daniel, Joe Stringer, kernel-team
In-Reply-To: <20181107213607.2658130-1-rdna@fb.com>

On Wed, 7 Nov 2018 at 13:37, Andrey Ignatov <rdna@fb.com> wrote:
>
> Lookup functions in sk_lookup have different expectations about byte
> order of provided arguments.
>
> Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
> expect dport to be in network byte order and do ntohs(dport) internally.
>
> At the same time __inet6_lookup expects dport to be in host byte order
> and correspondingly name the argument hnum.
>
> sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
> __inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
> uses host instead of expected network byte order. It makes result
> returned by bpf_sk_lookup_udp for IPv6 incorrect.
>
> The patch fixes byte order of dport passed to __udp6_lib_lookup.
>
> Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f02a
> fixes TCPv6 but breaks UDPv6.
>
> Fixes: 5ef0ae84f02a ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
> Signed-off-by: Andrey Ignatov <rdna@fb.com>

Thanks for the fix, makes sense.

Acked-by: Joe Stringer <joe@wand.net.nz>

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Alexei Starovoitov @ 2018-11-07 21:49 UTC (permalink / raw)
  To: Edward Cree
  Cc: Martin Lau, Yonghong Song, Alexei Starovoitov,
	daniel@iogearbox.net, netdev@vger.kernel.org, Kernel Team
In-Reply-To: <2ea1e869-5f28-6beb-9fb1-15ae5a396077@solarflare.com>

On Wed, Nov 07, 2018 at 07:29:31PM +0000, Edward Cree wrote:
> Whereas I don't, and I don't feel like my core criticisms have
>  been addressed _at all_.  The only answer I get to "BTF should
>  store type and instance information in separate records" is
>  "it's a debuginfo",
...
>  I am just trying to organise
>  BTF to consist of separate _parts_ for types and instances,
>  rather than forcing both into the same Procrustean bed.

BTF does not have and should not have instances.
BTF is debug info only.
This is not negotiable.

Loading BTF into the kernel does not create an instance of anything.
The kernel verifies BTF and BTF blob sits there waiting to be pointed at
by the actual instance.
Today these instances are maps and programs.
A single program can have multiple instances of the functions
which debug info is described in BTF.
Multiple prograrms (with multiple functions) can point
to the same BTF blob description.
Consider that in single C file there will be multliple functions,
many maps, global variables (in the future) they will be created
by individual prog_load/map_create syscalls.
Before that the _single_ blob of BTF data for the whole C file
will be loaded, since functions, maps share most of
the type and name information.

In elf files the instances of functions have names.
These names go into symbol table of the elf. Dynamic linker
is using these symbols to connect different instances.
In BPF world we use IDs to connect instances.
There is btd if, map id, prog id.
You can argue that bpf_prog_load has prog_name attribute.
So the instance of the program sort of have name already,
but this name is for debug only. It doesn't have to be unique
and the kernel is not relying on the name to connect instances.

Example:
1: int foo(struct __sk_buff *skb, char arg2)
2: {
3:  return skb->len;
4: }

Line 1 goes into BTF to describe the function (with names and types).
Line 1 also goes into line_info .BTF.ext elf section as a string.
Line 3 and 4 go into line_info as well as strings.

llvm compiles it into something:
  .text
  .globl  foo
  .type   foo, @function
foo:
  r0 = *(u32 *)(r1 + 0)
  exit
  .section .BTF
  .byte X
  .byte Y
  .section .BTF.ext
  .byte X
  .byte Y

That 'foo' name goes into symbol table and together with the body
form the future instance of the function.
libbpf takes 'foo' from symbol table of elf and passes it
as bpf_attr.prog_name to the kernel along with bpf instructions
to create an instance.
The name 'foo' in symbol table doesn't have to match with
BTF 'foo' name and '.*foo.*' substring in line_info.

If you decide to use the sequence:
"""
  .globl  foo
  .type   foo, @function
foo:
""
in your ebpf_asm to parse and store into BTF as a function name
I would strongly argue against, since it would be incorrectly
conflating the instance of the function with debug purpose
of BTF.

Right now we do not have a concrete proposal for assembler text of BTF.
When LLVM emits BTF into .s it emits them as raw bytes.
So I'm looking forward to your ideas how to describe BTF in .s
Note such .s must have freedom to describe 'int bar(struct __sk_buff *a1, char a2)'
as debug info while having '.globl foo; foo:' as symbol name.

Your other 'criticism' was about libbpf's bpf_map_find_btf_info()
and ____btf_map_* hack. Yes. It is a hack and I'm open to change it
if there are better suggestions. It's a convention between
libbpf and program writers that produce elf. It's not a kernel abi.
Nothing to do with BTF and this instance vs debug info discussion.

> (I don't feel like we're making progress in understanding one
>  another here; maybe we should have a telephone discussion?
>  Sadly I'm not going to Plumbers, else that would be the
>  perfect place to thrash this out.)

Happy to jump on the call to explain it again.
10:30am pacific time works for me tomorrow.

^ permalink raw reply

* Re: [PATCH bpf-next] xdp: sample code for redirecting vlan packets to specific cpus
From: Daniel Borkmann @ 2018-11-07 21:44 UTC (permalink / raw)
  To: John Fastabend, Shannon Nelson, ast
  Cc: netdev, eric.dumazet, silviu.smarandache, shannon.lee.nelson
In-Reply-To: <499cc6e8-f533-3b90-b053-b3a1c7fa3a15@gmail.com>

On 10/29/2018 11:11 PM, John Fastabend wrote:
> On 10/29/2018 02:19 PM, Shannon Nelson wrote:
>> This is an example of using XDP to redirect the processing of
>> particular vlan packets to specific CPUs.  This is in response
>> to comments received on a kernel patch put forth previously
>> to do something similar using RPS.
>>      https://www.spinics.net/lists/netdev/msg528210.html
>>      [PATCH net-next] net: enable RPS on vlan devices
>>
>> This XDP application watches for inbound vlan-tagged packets
>> and redirects those packets to be processed on a specific CPU
>> as configured in a BPF map.  The BPF map can be modified by
>> this user program, which can also load and unload the kernel
>> XDP code.
>>
>> One example use is for supporting VMs where we can't control the
>> OS being used: we'd like to separate the VM CPU processing from
>> the host's CPUs as a way to help mitigate L1TF related issues.
>> When running the VM's traffic on a vlan we can stick the host's
>> Rx processing on one set of CPUs separate from the VM's CPUs.
>>
>> This example currently uses a vlan key and cpu value in the
>> BPF map, so only can do one CPU per vlan.  This could easily
>> be modified to use a bitpattern of CPUs rather than a CPU id
>> to allow multiple CPUs per vlan.
> 
> Great, so does this solve your use case then? At least on drivers
> with XDP support?
> 
>>
>> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
>> ---
> 
> Some really small and trivial nits below.
> 
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> 
> [...]
> 
>> +	if (install) {
>> +
> 
> new line probably not needed. 
> 
>> +		/* check to see if already installed */
>> +		errno = 0;
>> +		access(pin_prog_name, R_OK);
>> +		if (errno != ENOENT) {
>> +			fprintf(stderr, "ERR: %s is already installed\n", argv[0]);
>> +			return -1;
>> +		}
>> +
>> +		/* load the XDP program and maps with the convenient library */
>> +		if (load_bpf_file(filename)) {
>> +			fprintf(stderr, "ERR: load_bpf_file(%s): \n%s",
>> +				filename, bpf_log_buf);
>> +			return -1;
>> +		}
>> +		if (!prog_fd[0]) {
>> +			fprintf(stderr, "ERR: load_bpf_file(%s): %d %s\n",
>> +				filename, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		/* pin the XDP program and maps */
>> +		if (bpf_obj_pin(prog_fd[0], pin_prog_name) < 0) {
>> +			fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
>> +				pin_prog_name, errno, strerror(errno));
>> +			if (errno == 2)
>> +				fprintf(stderr, "     (is the BPF fs mounted on /sys/fs/bpf?)\n");
>> +			return -1;
>> +		}
>> +		if (bpf_obj_pin(map_fd[0], pin_vlanmap_name) < 0) {
>> +			fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
>> +				pin_vlanmap_name, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +		if (bpf_obj_pin(map_fd[2], pin_countermap_name) < 0) {
>> +			fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
>> +				pin_countermap_name, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		/* prep the vlan map with "not used" values */
>> +		c64 = UNDEF_CPU;
>> +		for (v64 = 0; v64 < 4096; v64++) {
> 
> maybe #define MAX_VLANS 4096 just to avoid constants.
> 
>> +			if (bpf_map_update_elem(map_fd[0], &v64, &c64, 0)) {
>> +				fprintf(stderr, "ERR: preping vlan map failed on v=%llu: %d %s\n",
>> +					v64, errno, strerror(errno));
>> +				return -1;
>> +			}
>> +		}
>> +
>> +		/* prep the cpumap with queue sizes */
>> +		c64 = 128+64;  /* see note in xdp_redirect_cpu_user.c */
>> +		for (v64 = 0; v64 < MAX_CPUS; v64++) {
>> +			if (bpf_map_update_elem(map_fd[1], &v64, &c64, 0)) {
>> +				if (errno == ENODEV) {
>> +					/* Save the last CPU number attempted
>> +					 * into the counters map
>> +					 */
>> +					c64 = CPU_COUNT;
>> +					ret = bpf_map_update_elem(map_fd[2], &c64, &v64, 0);
>> +					break;
>> +				}
>> +
>> +				fprintf(stderr, "ERR: preping cpu map failed on v=%llu: %d %s\n",
>> +					v64, errno, strerror(errno));
>> +				return -1;
>> +			}
>> +		}
>> +
>> +		/* wire the XDP program to the device */
>> +		if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], 0) < 0) {
>> +			fprintf(stderr, "ERR: bpf_set_link_xdp_fd(): %d %s\n",
>> +				errno, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		return 0;
>> +	}
>> +
>> +	if (remove) {
>> +

Ditto

>> +		/* unlink the program from the device */
>> +		if (bpf_set_link_xdp_fd(ifindex, -1, 0) < 0)
>> +			fprintf(stderr, "ERR: bpf_set_link_xdp_fd(): %d %s\n",
>> +				errno, strerror(errno));
>> +
>> +		/* unlink pinned files */
>> +		if (unlink(pin_prog_name))
>> +			fprintf(stderr, "ERR: unlink(%s): %d %s\n",
>> +				pin_prog_name, errno, strerror(errno));
>> +		if (unlink(pin_vlanmap_name))
>> +			fprintf(stderr, "ERR: unlink(%s): %d %s\n",
>> +				pin_vlanmap_name, errno, strerror(errno));
>> +		if (unlink(pin_countermap_name))
>> +			fprintf(stderr, "ERR: unlink(%s): %d %s\n",
>> +				pin_countermap_name, errno, strerror(errno));
>> +
>> +		return 0;
>> +	}
>> +
>> +	if (vlan == 0) {
>> +		fprintf(stderr, "ERR: required option --vlan missing\n");
>> +		goto error;
>> +	}
>> +
>> +	if (cpu == MAX_CPUS && vlan > 0) {
>> +		fprintf(stderr, "ERR: required option --cpu missing\n");
>> +		goto error;
>> +	}
>> +
>> +	vfd = bpf_obj_get(pin_vlanmap_name);
>> +	if (vfd < 0) {
>> +		fprintf(stderr, "ERR: can't find pinned map %s: %d %s\n",
>> +			pin_vlanmap_name, errno, strerror(errno));
>> +		if (errno == ENOENT)
>> +			fprintf(stderr, "   (has %s been installed yet?)\n", argv[0]);
>> +		return -1;
>> +	}
>> +
>> +	/* decode the requested action */
>> +	if (vlan > 0) {
>> +		/* check cpu against the max value found */
>> +		cfd = bpf_obj_get(pin_countermap_name);
>> +		if (cfd < 0) {
>> +			fprintf(stderr, "ERR: can't find pinned map %s: %d %s\n",
>> +				pin_countermap_name, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +		c64 = CPU_COUNT;
>> +		ret = bpf_map_lookup_elem(cfd, &c64, &v64);
>> +		if (cpu >= v64) {
> 
> Need to check ret code?

Yeah, I expect this would have another respin, thanks!

>> +			fprintf(stderr, "ERR: cpu %d greater than max %llu\n", cpu, v64);
>> +			return -1;
>> +		}
>> +
>> +		/* Note that the value and key pointers really do need to be
>> +		 * pointers to 64-bit values, else things get a bit muddled.
>> +		 */
>> +		v64 = vlan;
>> +		c64 = cpu;
>> +		ret = bpf_map_update_elem(vfd, &v64, &c64, 0);
>> +		if (ret) {
>> +			fprintf(stderr, "Adding vlan %d CPU %d failed: %d %s\n",
>> +				vlan, cpu, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +	} else {
>> +		v64 = -vlan;
>> +		c64 = UNDEF_CPU;
>> +
>> +		/* We can't actually delete from a TYPE_ARRAY map, so we
>> +		 * simply set it to an undefined value.
>> +		 */
>> +		ret = bpf_map_update_elem(vfd, &v64, &c64, 0);
>> +		if (ret) {
>> +			fprintf(stderr, "Delete of vlan %llu failed: %d %s\n",
>> +				v64, errno, strerror(errno));
>> +			return -1;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +
>> +error:
>> +	usage(argv);
>> +	return -1;
>> +}
>>
> 

^ permalink raw reply

* Re: [PATCH bpf-next] bpf_load: add map name to load_maps error message
From: Daniel Borkmann @ 2018-11-07 21:41 UTC (permalink / raw)
  To: Song Liu, John Fastabend
  Cc: shannon.nelson, Alexei Starovoitov, Networking,
	shannon.lee.nelson
In-Reply-To: <CAPhsuW4-zJ=2QPUW7rz-rq3KHm_YJ3SnPTHVbddY5tZi3TjEwg@mail.gmail.com>

On 11/07/2018 01:26 AM, Song Liu wrote:
> On Mon, Oct 29, 2018 at 3:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>
>> On 10/29/2018 02:14 PM, Shannon Nelson wrote:
>>> To help when debugging bpf/xdp load issues, have the load_map()
>>> error message include the number and name of the map that
>>> failed.
>>>
>>> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
>>> ---
>>>  samples/bpf/bpf_load.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
>>> index 89161c9..5de0357 100644
>>> --- a/samples/bpf/bpf_load.c
>>> +++ b/samples/bpf/bpf_load.c
>>> @@ -282,8 +282,8 @@ static int load_maps(struct bpf_map_data *maps, int nr_maps,
>>>                                                       numa_node);
>>>               }
>>>               if (map_fd[i] < 0) {
>>> -                     printf("failed to create a map: %d %s\n",
>>> -                            errno, strerror(errno));
>>> +                     printf("failed to create map %d (%s): %d %s\n",
>>> +                            i, maps[i].name, errno, strerror(errno));
>>>                       return 1;
>>>               }
>>>               maps[i].fd = map_fd[i];
>>>
>>
>> LGTM
>>
>> Acked-by: John Fastabend <john.fastabend@gmail.com>
> 
> Acked-by: Song Liu <songliubraving@fb.com>

Applied to bpf-next, thanks!

^ permalink raw reply

* [PATCH bpf] bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp
From: Andrey Ignatov @ 2018-11-07 21:36 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team

Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.

Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.

At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly name the argument hnum.

sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in __udp6_lib_lookup case it
uses host instead of expected network byte order. It makes result
returned by bpf_sk_lookup_udp for IPv6 incorrect.

The patch fixes byte order of dport passed to __udp6_lib_lookup.

Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f02a
fixes TCPv6 but breaks UDPv6.

Fixes: 5ef0ae84f02a ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
---
 net/core/filter.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5ebc7d1..9a1327eb25fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4852,18 +4852,17 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 	} else {
 		struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
-		u16 hnum = ntohs(tuple->ipv6.dport);
 		int sdif = inet6_sdif(skb);

 		if (proto == IPPROTO_TCP)
 			sk = __inet6_lookup(net, &tcp_hashinfo, skb, 0,
 					    src6, tuple->ipv6.sport,
-					    dst6, hnum,
+					    dst6, ntohs(tuple->ipv6.dport),
 					    dif, sdif, &refcounted);
 		else if (likely(ipv6_bpf_stub))
 			sk = ipv6_bpf_stub->udp6_lib_lookup(net,
 							    src6, tuple->ipv6.sport,
-							    dst6, hnum,
+							    dst6, tuple->ipv6.dport,
 							    dif, sdif,
 							    &udp_table, skb);
 #endif
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-next 2/2] bpftool: support loading flow dissector
From: Quentin Monnet @ 2018-11-07 21:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	guro, jiong.wang, bhole_prashant_q7, john.fastabend, jbenc,
	treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107123231.6d2f4782@cakuba.netronome.com>

2018-11-07 12:32 UTC-0800 ~ Jakub Kicinski <jakub.kicinski@netronome.com>
> On Wed, 7 Nov 2018 20:08:53 +0000, Quentin Monnet wrote:
>>> +		err = bpf_obj_pin(bpf_program__fd(prog), pinfile);
>>> +		if (err) {
>>> +			p_err("failed to pin program %s",
>>> +			      bpf_program__title(prog, false));
>>> +			goto err_close_obj;
>>> +		}
>>
>> I don't have the same opinion as Jakub for pinning :). I was hoping we
>> could also load additional programs (for tail calls) for
>> non-flow_dissector programs. Could this be an occasion to update the
>> code in that direction?
> 
> Do you mean having the bpftool construct an array for tail calling
> automatically when loading an object?  Or do a "mass pin" of all
> programs in an object file?
> 
> I'm not convinced about this strategy of auto assembling a tail call
> array by assuming that a flow dissector object carries programs for
> protocols in order (apart from the main program which doesn't have to
> be first, for some reason).

Not constructing the prog array, I don't think this should be the role 
of bpftool either. Much more a "mass pin", so that you have a link to 
each program loaded from the object file and can later add them to a 
prog array map with subsequent calls to bpftool.

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification
From: Daniel Borkmann @ 2018-11-07 21:31 UTC (permalink / raw)
  To: Sowmini Varadhan, netdev, davem
In-Reply-To: <09992f2af846ab5186cf5d8c3e4e1c5d20d1b747.1541535172.git.sowmini.varadhan@oracle.com>

On 11/06/2018 09:28 PM, Sowmini Varadhan wrote:
> This patch provides a tcp_bpf based eBPF sample. The test
> - ncat(1) as the TCP client program to connect() to a port
>   with the intention of triggerring SYN retransmissions: we
>   first install an iptables DROP rule to make sure ncat SYNs are
>   resent (instead of aborting instantly after a TCP RST)
> - has a bpf kernel module that sends a perf-event notification for
>   each TCP retransmit, and also tracks the number of such notifications
>   sent in the global_map
> The test passes when the number of event notifications intercepted
> in user-space matches the value in the global_map.
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  tools/testing/selftests/bpf/Makefile              |    4 +-
>  tools/testing/selftests/bpf/perf-sys.h            |   74 ++++++++
>  tools/testing/selftests/bpf/test_tcpnotify.h      |   19 ++
>  tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++++++++++
>  tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +++++++++++++++++++++
>  5 files changed, 377 insertions(+), 1 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/perf-sys.h
>  create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
>  create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
>  create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index e39dfb4..6c94048 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
>  	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
>  	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
>  	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
> -	test_netcnt
> +	test_netcnt test_tcpnotify_user
>  
>  TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
>  	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
>  	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
>  	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
> +	test_tcpnotify_kern.o \
>  	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
>  	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
>  	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
> @@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c
>  $(OUTPUT)/test_socket_cookie: cgroup_helpers.c
>  $(OUTPUT)/test_sockmap: cgroup_helpers.c
>  $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c
> +$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c
>  $(OUTPUT)/test_progs: trace_helpers.c
>  $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
>  $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
> diff --git a/tools/testing/selftests/bpf/perf-sys.h b/tools/testing/selftests/bpf/perf-sys.h
> new file mode 100644
> index 0000000..3eb7a39
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/perf-sys.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _PERF_SYS_H
> +#define _PERF_SYS_H
> +
> +#include <unistd.h>
> +#include <sys/types.h>
> +#include <sys/syscall.h>
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +#include <linux/perf_event.h>
> +#include <asm/barrier.h>
> +
> +#ifdef __powerpc__
> +#define CPUINFO_PROC	{"cpu"}
> +#endif
> +
> +#ifdef __s390__
> +#define CPUINFO_PROC	{"vendor_id"}
> +#endif
> +
> +#ifdef __sh__
> +#define CPUINFO_PROC	{"cpu type"}
> +#endif
> +
> +#ifdef __hppa__
> +#define CPUINFO_PROC	{"cpu"}
> +#endif
> +
> +#ifdef __sparc__
> +#define CPUINFO_PROC	{"cpu"}
> +#endif
> +
> +#ifdef __alpha__
> +#define CPUINFO_PROC	{"cpu model"}
> +#endif
> +
> +#ifdef __arm__
> +#define CPUINFO_PROC	{"model name", "Processor"}
> +#endif
> +
> +#ifdef __mips__
> +#define CPUINFO_PROC	{"cpu model"}
> +#endif
> +
> +#ifdef __arc__
> +#define CPUINFO_PROC	{"Processor"}
> +#endif
> +
> +#ifdef __xtensa__
> +#define CPUINFO_PROC	{"core ID"}
> +#endif
> +
> +#ifndef CPUINFO_PROC
> +#define CPUINFO_PROC	{ "model name", }
> +#endif
> +
> +static inline int
> +sys_perf_event_open(struct perf_event_attr *attr,
> +		      pid_t pid, int cpu, int group_fd,
> +		      unsigned long flags)
> +{
> +	int fd;
> +
> +	fd = syscall(__NR_perf_event_open, attr, pid, cpu,
> +		     group_fd, flags);
> +
> +#ifdef HAVE_ATTR_TEST
> +	if (unlikely(test_attr__enabled))
> +		test_attr__open(attr, pid, cpu, fd, group_fd, flags);
> +#endif
> +	return fd;
> +}

I would prefer if we could avoid adding whole perf-sys duplicate right
into BPF kselftest directory. Agree it would be nice to have the mini
wrapper somewhere, but then lets make that a separate commit and place
the wrapper-only somewhere as tools/include/linux/perf.h that all the
remaining occurrences below can be replaced with.

$ git grep -n __NR_perf_event_open tools/
[...]
tools/testing/selftests/bpf/get_cgroup_id_user.c:112:   pmu_fd = syscall(__NR_perf_event_open, &attr, getpid(), -1, -1, 0);
tools/testing/selftests/bpf/test_progs.c:799:           pmu_fd[i] = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
tools/testing/selftests/bpf/test_progs.c:977:   pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
tools/testing/selftests/bpf/test_progs.c:1163:  pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
tools/testing/selftests/bpf/test_progs.c:1297:  pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
tools/testing/selftests/bpf/test_progs.c:1510:  pmu_fd = syscall(__NR_perf_event_open, &attr, getpid()/*pid*/, -1/*cpu*/,
tools/testing/selftests/bpf/test_progs.c:1653:  pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
tools/testing/selftests/powerpc/pmu/event.c:19: return syscall(__NR_perf_event_open, attr, pid, cpu,
tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c:46:       return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
tools/testing/selftests/powerpc/utils.c:177:    return syscall(__NR_perf_event_open, hw_event, pid, cpu,

> +#endif /* _PERF_SYS_H */
> diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h
> new file mode 100644
> index 0000000..8b6cea0
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_tcpnotify.h
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#ifndef _TEST_TCPBPF_H
> +#define _TEST_TCPBPF_H
> +
> +struct tcpnotify_globals {
> +	__u32 total_retrans;
> +	__u32 ncalls;
> +};
> +
> +struct tcp_notifier {
> +	__u8    type;
> +	__u8    subtype;
> +	__u8    source;
> +	__u8    hash;
> +};
> +
> +#define	TESTPORT	12877
> +#endif
> diff --git a/tools/testing/selftests/bpf/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
> new file mode 100644
> index 0000000..edbca20
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stddef.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +#include <linux/tcp.h>
> +#include <netinet/in.h>
> +#include "bpf_helpers.h"
> +#include "bpf_endian.h"
> +#include "test_tcpnotify.h"
> +
> +struct bpf_map_def SEC("maps") global_map = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(struct tcpnotify_globals),
> +	.max_entries = 4,
> +};
> +
> +struct bpf_map_def SEC("maps") perf_event_map = {
> +	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> +	.key_size = sizeof(int),
> +	.value_size = sizeof(__u32),
> +	.max_entries = 2,
> +};
> +
> +int _version SEC("version") = 1;
> +
> +SEC("sockops")
> +int bpf_testcb(struct bpf_sock_ops *skops)
> +{
> +	int rv = -1;
> +	int op;
> +
> +	op = (int) skops->op;
> +
> +	if (bpf_ntohl(skops->remote_port) != TESTPORT) {
> +		skops->reply = -1;
> +		return 0;
> +	}
> +
> +	switch (op) {
> +	case BPF_SOCK_OPS_TIMEOUT_INIT:
> +	case BPF_SOCK_OPS_RWND_INIT:
> +	case BPF_SOCK_OPS_NEEDS_ECN:
> +	case BPF_SOCK_OPS_BASE_RTT:
> +	case BPF_SOCK_OPS_RTO_CB:
> +		rv = 1;
> +		break;
> +
> +	case BPF_SOCK_OPS_TCP_CONNECT_CB:
> +	case BPF_SOCK_OPS_TCP_LISTEN_CB:
> +	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
> +	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
> +		bpf_sock_ops_cb_flags_set(skops, (BPF_SOCK_OPS_RETRANS_CB_FLAG|
> +					  BPF_SOCK_OPS_RTO_CB_FLAG));
> +		rv = 1;
> +		break;
> +	case BPF_SOCK_OPS_RETRANS_CB: {
> +			__u32 key = 0;
> +			struct tcpnotify_globals g, *gp;
> +			struct tcp_notifier msg = {
> +				.type = 0xde,
> +				.subtype = 0xad,
> +				.source = 0xbe,
> +				.hash = 0xef,
> +			};
> +
> +			rv = 1;
> +
> +			/* Update results */
> +			gp = bpf_map_lookup_elem(&global_map, &key);
> +			if (!gp)
> +				break;
> +			g = *gp;
> +			g.total_retrans = skops->total_retrans;
> +			g.ncalls++;
> +			bpf_map_update_elem(&global_map, &key, &g,
> +					    BPF_ANY);
> +			bpf_perf_event_output(skops, &perf_event_map,
> +					      BPF_F_CURRENT_CPU,
> +					      &msg, sizeof(msg));
> +		}
> +		break;
> +	default:
> +		rv = -1;
> +	}
> +	skops->reply = rv;
> +	return 1;
> +}
> +char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/test_tcpnotify_user.c b/tools/testing/selftests/bpf/test_tcpnotify_user.c
> new file mode 100644
> index 0000000..8f88cb9
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/test_tcpnotify_user.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <pthread.h>
> +#include <inttypes.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <asm/types.h>
> +#include <errno.h>
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <sys/socket.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +#include <sys/ioctl.h>
> +#include <linux/rtnetlink.h>
> +#include <signal.h>
> +#include <linux/perf_event.h>
> +#include "perf-sys.h"
> +
> +#include "bpf_rlimit.h"
> +#include "bpf_util.h"
> +#include "cgroup_helpers.h"
> +
> +#include "test_tcpnotify.h"
> +#include "trace_helpers.h"
> +
> +#define SOCKET_BUFFER_SIZE (getpagesize() < 8192L ? getpagesize() : 8192L)
> +
> +pthread_t tid;
> +int rx_callbacks;
> +
> +static int dummyfn(void *data, int size)
> +{
> +	struct tcp_notifier *t = data;
> +
> +	if (t->type != 0xde || t->subtype != 0xad ||
> +	    t->source != 0xbe || t->hash != 0xef)
> +		return 1;
> +	rx_callbacks++;
> +	return 0;
> +}
> +
> +void tcp_notifier_poller(int fd)
> +{
> +	while (1)
> +		perf_event_poller(fd, dummyfn);
> +}
> +
> +static void *poller_thread(void *arg)
> +{
> +	int fd = *(int *)arg;
> +
> +	tcp_notifier_poller(fd);
> +	return arg;
> +}
> +
> +int verify_result(const struct tcpnotify_globals *result)
> +{
> +	return (result->ncalls > 0 && result->ncalls == rx_callbacks ? 0 : 1);
> +}
> +
> +static int bpf_find_map(const char *test, struct bpf_object *obj,
> +			const char *name)
> +{
> +	struct bpf_map *map;
> +
> +	map = bpf_object__find_map_by_name(obj, name);
> +	if (!map) {
> +		printf("%s:FAIL:map '%s' not found\n", test, name);
> +		return -1;
> +	}
> +	return bpf_map__fd(map);
> +}
> +
> +static int setup_bpf_perf_event(int mapfd)
> +{
> +	struct perf_event_attr attr = {
> +		.sample_type = PERF_SAMPLE_RAW,
> +		.type = PERF_TYPE_SOFTWARE,
> +		.config = PERF_COUNT_SW_BPF_OUTPUT,
> +	};
> +	int key = 0;
> +	int pmu_fd;
> +
> +	pmu_fd = sys_perf_event_open(&attr, -1, 0, -1, 0);
> +	if (pmu_fd < 0)
> +		return pmu_fd;
> +	bpf_map_update_elem(mapfd, &key, &pmu_fd, BPF_ANY);
> +
> +	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
> +	return pmu_fd;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	const char *file = "test_tcpnotify_kern.o";
> +	int prog_fd, map_fd, perf_event_fd;
> +	struct tcpnotify_globals g = {0};
> +	const char *cg_path = "/foo";
> +	int error = EXIT_FAILURE;
> +	struct bpf_object *obj;
> +	int cg_fd = -1;
> +	__u32 key = 0;
> +	int rv;
> +	char test_script[80];
> +	int pmu_fd;
> +	cpu_set_t cpuset;
> +
> +	CPU_ZERO(&cpuset);
> +	CPU_SET(0, &cpuset);
> +	pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
> +
> +	if (setup_cgroup_environment())
> +		goto err;
> +
> +	cg_fd = create_and_get_cgroup(cg_path);
> +	if (!cg_fd)
> +		goto err;
> +
> +	if (join_cgroup(cg_path))
> +		goto err;
> +
> +	if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) {
> +		printf("FAILED: load_bpf_file failed for: %s\n", file);
> +		goto err;
> +	}
> +
> +	rv = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SOCK_OPS, 0);
> +	if (rv) {
> +		printf("FAILED: bpf_prog_attach: %d (%s)\n",
> +		       error, strerror(errno));
> +		goto err;
> +	}
> +
> +	perf_event_fd = bpf_find_map(__func__, obj, "perf_event_map");
> +	if (perf_event_fd < 0)
> +		goto err;
> +
> +	map_fd = bpf_find_map(__func__, obj, "global_map");
> +	if (map_fd < 0)
> +		goto err;
> +
> +	pmu_fd = setup_bpf_perf_event(perf_event_fd);
> +	if (pmu_fd < 0 || perf_event_mmap(pmu_fd) < 0)
> +		goto err;
> +
> +	pthread_create(&tid, NULL, poller_thread, (void *)&pmu_fd);
> +
> +	sprintf(test_script,
> +		"/usr/sbin/iptables -A INPUT -p tcp --dport %d -j DROP",
> +		TESTPORT);
> +	system(test_script);
> +
> +	sprintf(test_script,
> +		"/usr/bin/nc 127.0.0.1 %d < /etc/passwd > /dev/null 2>&1 ",
> +		TESTPORT);
> +	system(test_script);
> +
> +	sprintf(test_script,
> +		"/usr/sbin/iptables -D INPUT -p tcp --dport %d -j DROP",
> +		TESTPORT);
> +	system(test_script);
> +
> +	rv = bpf_map_lookup_elem(map_fd, &key, &g);
> +	if (rv != 0) {
> +		printf("FAILED: bpf_map_lookup_elem returns %d\n", rv);
> +		goto err;
> +	}
> +
> +	sleep(10);
> +
> +	if (verify_result(&g)) {
> +		printf("FAILED: Wrong stats Expected %d calls, got %d\n",
> +			g.ncalls, rx_callbacks);
> +		goto err;
> +	}
> +
> +	printf("PASSED!\n");
> +	error = 0;
> +err:
> +	bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS);
> +	close(cg_fd);
> +	cleanup_cgroup_environment();
> +	return error;
> +}
> 

^ permalink raw reply

* Re: [PATCH bpf-next] tools: bpftool: adjust rlimit RLIMIT_MEMLOCK when loading programs, maps
From: Daniel Borkmann @ 2018-11-07 21:24 UTC (permalink / raw)
  To: Quentin Monnet, Alexei Starovoitov; +Cc: netdev, oss-drivers
In-Reply-To: <1541593770-28089-1-git-send-email-quentin.monnet@netronome.com>

On 11/07/2018 01:29 PM, Quentin Monnet wrote:
> The limit for memory locked in the kernel by a process is usually set to
> 64 bytes by default. This can be an issue when creating large BPF maps
> and/or loading many programs. A workaround is to raise this limit for
> the current process before trying to create a new BPF map. Changing the
> hard limit requires the CAP_SYS_RESOURCE and can usually only be done by
> root user (for non-root users, a call to setrlimit fails (and sets
> errno) and the program simply goes on with its rlimit unchanged).
> 
> There is no API to get the current amount of memory locked for a user,
> therefore we cannot raise the limit only when required. One solution,
> used by bcc, is to try to create the map, and on getting a EPERM error,
> raising the limit to infinity before giving another try. Another
> approach, used in iproute2, is to raise the limit in all cases, before
> trying to create the map.
> 
> Here we do the same as in iproute2: the rlimit is raised to infinity
> before trying to load programs or to create maps with bpftool.
> 
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied to bpf-next and fixed commit log, thanks!

^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: enable (uncomment) all tests in test_libbpf.sh
From: Daniel Borkmann @ 2018-11-07 21:24 UTC (permalink / raw)
  To: Quentin Monnet, Alexei Starovoitov
  Cc: netdev, oss-drivers, Jesper Dangaard Brouer
In-Reply-To: <1541593725-27703-1-git-send-email-quentin.monnet@netronome.com>

On 11/07/2018 01:28 PM, Quentin Monnet wrote:
> libbpf is now able to load successfully test_l4lb_noinline.o and
> samples/bpf/tracex3_kern.o.
> 
> For the test_l4lb_noinline, uncomment related tests from test_libbpf.c
> and remove the associated "TODO".
> 
> For tracex3_kern.o, instead of loading a program from samples/bpf/ that
> might not have been compiled at this stage, try loading a program from
> BPF selftests. Since this test case is about loading a program compiled
> without the "-target bpf" flag, change the Makefile to compile one
> program accordingly (instead of passing the flag for compiling all
> programs).
> 
> Regarding test_xdp_noinline.o: in its current shape the program fails to
> load because it provides no version section, but the loader needs one.
> The test was added to make sure that libbpf could load XDP programs even
> if they do not provide a version number in a dedicated section. But
> libbpf is already capable of doing that: in our case loading fails
> because the loader does not know that this is an XDP program (it does
> not need to, since it does not attach the program). So trying to load
> test_xdp_noinline.o does not bring much here: just delete this subtest.
> 
> For the record, the error message obtained with tracex3_kern.o was
> fixed by commit e3d91b0ca523 ("tools/libbpf: handle issues with bpf ELF
> objects containing .eh_frames")
> 
> I have not been abled to reproduce the "libbpf: incorrect bpf_call
> opcode" error for test_l4lb_noinline.o, even with the version of libbpf
> present at the time when test_libbpf.sh and test_libbpf_open.c were
> created.
> 
> RFC -> v1:
> - Compile test_xdp without the "-target bpf" flag, and try to load it
>   instead of ../../samples/bpf/tracex3_kern.o.
> - Delete test_xdp_noinline.o subtest.
> 
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied to bpf-next, thanks!

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] bpftool: support loading flow dissector
From: Stanislav Fomichev @ 2018-11-07 21:17 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	jakub.kicinski, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <484ea7f5-0d4c-b828-c5fe-1fd4d9bf847d@netronome.com>

On 11/07, Quentin Monnet wrote:
> Hi Stanislav,
> 
> 2018-11-07 11:35 UTC-0800 ~ Stanislav Fomichev <sdf@google.com>
> > This commit adds support for loading/attaching/detaching flow
> > dissector program. The structure of the flow dissector program is
> > assumed to be the same as in the selftests:
> > 
> > * flow_dissector section with the main entry point
> > * a bunch of tail call progs
> > * a jmp_table map that is populated with the tail call progs
> > 
> > When `bpftool load` is called with a flow_dissector prog (i.e. when the
> > first section is flow_dissector of 'type flow_dissector' argument is
> > passed), we load and pin all the programs and build the jump table.
> > 
> > The last argument of `bpftool attach` is made optional for this use
> > case.
> > 
> > Example:
> > bpftool prog load tools/testing/selftests/bpf/bpf_flow.o \
> > 	/sys/fs/bpf/flow type flow_dissector
> > bpftool prog attach pinned /sys/fs/bpf/flow/flow_dissector/0 flow_dissector
> > 
> > Tested by using the above two lines to load the prog in
> > the test_flow_dissector.sh selftest.
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  .../bpftool/Documentation/bpftool-prog.rst    |  16 ++-
> >  tools/bpf/bpftool/common.c                    |  32 +++--
> >  tools/bpf/bpftool/main.h                      |   1 +
> >  tools/bpf/bpftool/prog.c                      | 135 +++++++++++++++---
> >  4 files changed, 141 insertions(+), 43 deletions(-)
> > 
> > diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > index ac4e904b10fb..3caa9153435b 100644
> > --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > @@ -25,8 +25,8 @@ MAP COMMANDS
> >  |	**bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
> >  |	**bpftool** **prog pin** *PROG* *FILE*
> >  |	**bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> > -|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
> > -|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
> > +|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> > +|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
> >  |	**bpftool** **prog help**
> >  |
> >  |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
> > @@ -39,7 +39,9 @@ MAP COMMANDS
> >  |		**cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
> >  |		**cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
> >  |	}
> > -|       *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
> > +|       *ATTACH_TYPE* := {
> > +|	|	**msg_verdict** | **skb_verdict** | **skb_parse** | **flow_dissector**
> 
>         ^
> Nitpick: Could you please remove the above pipe?
Ah, I missed that one, my bad, thanks.

> > +|	}
> >  
> >  
> >  DESCRIPTION
> > @@ -97,13 +99,13 @@ DESCRIPTION
> >  		  contain a dot character ('.'), which is reserved for future
> >  		  extensions of *bpffs*.
> >  
> > -        **bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
> > +        **bpftool prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> >                    Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> > -                  to the map *MAP*.
> > +                  to the optional map *MAP*.
> >  
> > -        **bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
> > +        **bpftool prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
> >                    Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> > -                  from the map *MAP*.
> > +                  from the optional map *MAP*.
> >  
> >  	**bpftool prog help**
> >  		  Print short help message.
> > diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> > index 25af85304ebe..963881142dfb 100644
> > --- a/tools/bpf/bpftool/common.c
> > +++ b/tools/bpf/bpftool/common.c
> > @@ -169,34 +169,24 @@ int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type)
> >  	return fd;
> >  }
> >  
> > -int do_pin_fd(int fd, const char *name)
> > +int mount_bpffs_for_pin(const char *name)
> >  {
> >  	char err_str[ERR_MAX_LEN];
> >  	char *file;
> >  	char *dir;
> >  	int err = 0;
> >  
> > -	err = bpf_obj_pin(fd, name);
> > -	if (!err)
> > -		goto out;
> > -
> >  	file = malloc(strlen(name) + 1);
> >  	strcpy(file, name);
> >  	dir = dirname(file);
> >  
> > -	if (errno != EPERM || is_bpffs(dir)) {
> > -		p_err("can't pin the object (%s): %s", name, strerror(errno));
> > +	if (is_bpffs(dir)) {
> > +		/* nothing to do if already mounted */
> >  		goto out_free;
> >  	}
> >  
> > -	/* Attempt to mount bpffs, then retry pinning. */
> >  	err = mnt_bpffs(dir, err_str, ERR_MAX_LEN);
> > -	if (!err) {
> > -		err = bpf_obj_pin(fd, name);
> > -		if (err)
> > -			p_err("can't pin the object (%s): %s", name,
> > -			      strerror(errno));
> > -	} else {
> > +	if (err) {
> >  		err_str[ERR_MAX_LEN - 1] = '\0';
> >  		p_err("can't mount BPF file system to pin the object (%s): %s",
> >  		      name, err_str);
> > @@ -204,10 +194,22 @@ int do_pin_fd(int fd, const char *name)
> >  
> >  out_free:
> >  	free(file);
> > -out:
> >  	return err;
> >  }
> >  
> > +int do_pin_fd(int fd, const char *name)
> > +{
> > +	int err = mount_bpffs_for_pin(name);
> > +
> > +	if (err) {
> > +		p_err("can't mount bpffs for pin %s: %s",
> > +		      name, strerror(errno));
> > +		return err;
> > +	}
> > +
> > +	return bpf_obj_pin(fd, name);
> > +}
> > +
> >  int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
> >  {
> >  	unsigned int id;
> > diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
> > index 28322ace2856..1383824c9baf 100644
> > --- a/tools/bpf/bpftool/main.h
> > +++ b/tools/bpf/bpftool/main.h
> > @@ -129,6 +129,7 @@ const char *get_fd_type_name(enum bpf_obj_type type);
> >  char *get_fdinfo(int fd, const char *key);
> >  int open_obj_pinned(char *path);
> >  int open_obj_pinned_any(char *path, enum bpf_obj_type exp_type);
> > +int mount_bpffs_for_pin(const char *name);
> >  int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32));
> >  int do_pin_fd(int fd, const char *name);
> >  
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index 5302ee282409..f3a07ec3a444 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -81,6 +81,7 @@ static const char * const attach_type_strings[] = {
> >  	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
> >  	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
> >  	[BPF_SK_MSG_VERDICT] = "msg_verdict",
> > +	[BPF_FLOW_DISSECTOR] = "flow_dissector",
> >  	[__MAX_BPF_ATTACH_TYPE] = NULL,
> >  };
> >  
> > @@ -724,9 +725,10 @@ int map_replace_compar(const void *p1, const void *p2)
> >  static int do_attach(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type attach_type;
> > -	int err, mapfd, progfd;
> > +	int err, progfd;
> > +	int mapfd = 0;
> >  
> > -	if (!REQ_ARGS(5)) {
> > +	if (!REQ_ARGS(3)) {
> >  		p_err("too few parameters for map attach");
> >  		return -EINVAL;
> >  	}
> > @@ -741,10 +743,11 @@ static int do_attach(int argc, char **argv)
> >  		return -EINVAL;
> >  	}
> >  	NEXT_ARG();
> > -
> > -	mapfd = map_parse_fd(&argc, &argv);
> > -	if (mapfd < 0)
> > -		return mapfd;
> > +	if (argc > 0) {
> > +		mapfd = map_parse_fd(&argc, &argv);
> > +		if (mapfd < 0)
> > +			return mapfd;
> > +	}
> >  
> >  	err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
> >  	if (err) {
> > @@ -760,9 +763,10 @@ static int do_attach(int argc, char **argv)
> >  static int do_detach(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type attach_type;
> > -	int err, mapfd, progfd;
> > +	int err, progfd;
> > +	int mapfd = 0;
> >  
> > -	if (!REQ_ARGS(5)) {
> > +	if (!REQ_ARGS(3)) {
> >  		p_err("too few parameters for map detach");
> >  		return -EINVAL;
> >  	}
> > @@ -777,10 +781,11 @@ static int do_detach(int argc, char **argv)
> >  		return -EINVAL;
> >  	}
> >  	NEXT_ARG();
> > -
> > -	mapfd = map_parse_fd(&argc, &argv);
> > -	if (mapfd < 0)
> > -		return mapfd;
> > +	if (argc > 0) {
> > +		mapfd = map_parse_fd(&argc, &argv);
> > +		if (mapfd < 0)
> > +			return mapfd;
> > +	}
> >  
> >  	err = bpf_prog_detach2(progfd, mapfd, attach_type);
> >  	if (err) {
> > @@ -792,6 +797,56 @@ static int do_detach(int argc, char **argv)
> >  		jsonw_null(json_wtr);
> >  	return 0;
> >  }
> > +
> > +/* Flow dissector consists of a main program and a jump table for each
> > + * supported protocol. The assumption here is that the first prog is the main
> > + * one and the other progs are used in the tail calls. In this routine we
> > + * build the jump table for the non-main progs.
> > + */
> > +static int build_flow_dissector_jmp_table(struct bpf_object *obj,
> > +					  struct bpf_program *prog,
> > +					  const char *jmp_table_map)
> > +{
> > +	struct bpf_map *jmp_table;
> > +	struct bpf_program *pos;
> > +	int i = 0;
> > +	int prog_fd, jmp_table_fd, fd;
> > +
> > +	prog_fd = bpf_program__fd(prog);
> > +	if (prog_fd < 0) {
> > +		p_err("failed to get fd of main prog");
> > +		return prog_fd;
> > +	}
> > +
> > +	jmp_table = bpf_object__find_map_by_name(obj, jmp_table_map);
> > +	if (jmp_table == NULL) {
> > +		p_err("failed to find '%s' map", jmp_table_map);
> > +		return -1;
> > +	}
> > +
> > +	jmp_table_fd = bpf_map__fd(jmp_table);
> > +	if (jmp_table_fd < 0) {
> > +		p_err("failed to get fd of jmp_table");
> > +		return jmp_table_fd;
> > +	}
> > +
> > +	bpf_object__for_each_program(pos, obj) {
> > +		fd = bpf_program__fd(pos);
> > +		if (fd < 0) {
> > +			p_err("failed to get fd of '%s'",
> > +			      bpf_program__title(pos, false));
> > +			return fd;
> > +		}
> > +
> > +		if (fd != prog_fd) {
> > +			bpf_map_update_elem(jmp_table_fd, &i, &fd, BPF_ANY);
> > +			++i;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int do_load(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type expected_attach_type;
> > @@ -800,7 +855,7 @@ static int do_load(int argc, char **argv)
> >  	};
> >  	struct map_replace *map_replace = NULL;
> >  	unsigned int old_map_fds = 0;
> > -	struct bpf_program *prog;
> > +	struct bpf_program *prog, *pos;
> >  	struct bpf_object *obj;
> >  	struct bpf_map *map;
> >  	const char *pinfile;
> > @@ -918,13 +973,19 @@ static int do_load(int argc, char **argv)
> >  		goto err_free_reuse_maps;
> >  	}
> >  
> > -	prog = bpf_program__next(NULL, obj);
> > +	if (attr.prog_type == BPF_PROG_TYPE_FLOW_DISSECTOR) {
> > +		/* for the flow dissector type, the entry point is in the
> > +		 * section flow_dissector; other progs are tail calls
> > +		 */
> > +		prog = bpf_object__find_program_by_title(obj, "flow_dissector");
> 
> Is the section always called like this? Or could we use instead the same
> assumption as above, i.e. the main program is the first one?
Ideally yes, the main section should always be called
flow_dissector (to not break libbpf_prog_type_by_name assumptions).

We can call bpftool with and without the ATTACH_TYPE argument:

1. when 'type flow_dissector' is specified, we try to find section called
  'flow_dissector' and assume, that's the main prog (libbpf_prog_type_by_name)
2. when 'type flow_dissector' is _not_ specified, we try to detect
  program type from the first section

This check here is for the cases where first section is not
flow_dissector (but some tail call) and we want to force proper type.


> > +	} else {
> > +		prog = bpf_program__next(NULL, obj);
> > +	}
> >  	if (!prog) {
> >  		p_err("object file doesn't contain any bpf program");
> >  		goto err_close_obj;
> >  	}
> >  
> > -	bpf_program__set_ifindex(prog, ifindex);
> >  	if (attr.prog_type == BPF_PROG_TYPE_UNSPEC) {
> >  		const char *sec_name = bpf_program__title(prog, false);
> >  
> > @@ -936,8 +997,13 @@ static int do_load(int argc, char **argv)
> >  			goto err_close_obj;
> >  		}
> >  	}
> > -	bpf_program__set_type(prog, attr.prog_type);
> > -	bpf_program__set_expected_attach_type(prog, expected_attach_type);
> > +
> > +	bpf_object__for_each_program(pos, obj) {
> > +		bpf_program__set_ifindex(pos, ifindex);
> > +		bpf_program__set_type(pos, attr.prog_type);
> > +		bpf_program__set_expected_attach_type(pos,
> > +						      expected_attach_type);
> > +	}
> 
> I believe it is possible to load program of several types or attach
> types from a same object file? If so, we would need to get individual
> prog types and attach types for each program with
> libbpf_prog_type_by_name().
I don't think it works actually. I also assumed that initially, but as
you can see from the patch, I had to loop over the programs and set
correct type. Current implementation will fail if you have two progs
in the same obj AFAICT.

> >  	qsort(map_replace, old_map_fds, sizeof(*map_replace),
> >  	      map_replace_compar);
> > @@ -1001,8 +1067,34 @@ static int do_load(int argc, char **argv)
> >  		goto err_close_obj;
> >  	}
> >  
> > -	if (do_pin_fd(bpf_program__fd(prog), pinfile))
> > +	err = mount_bpffs_for_pin(pinfile);
> > +	if (err) {
> > +		p_err("failed to mount bpffs for pin '%s'", pinfile);
> >  		goto err_close_obj;
> > +	}
> > +
> > +	if (attr.prog_type == BPF_PROG_TYPE_FLOW_DISSECTOR) {
> > +		err = build_flow_dissector_jmp_table(obj, prog, "jmp_table");
> > +		if (err) {
> > +			p_err("failed to build flow dissector jump table");
> > +			goto err_close_obj;
> > +		}
> > +		/* flow dissector consist of multiple programs,
> > +		 * we want to pin them all
> > +		 */
> > +		err = bpf_object__pin(obj, pinfile);
> > +		if (err) {
> > +			p_err("failed to pin flow dissector object");
> > +			goto err_close_obj;
> > +		}
> 
> What happens for the programs previously loaded and pinned in one of the
> program fails? Do they remain loaded and pinned even if all programs
> were not successfully loaded? Or should we consider removing all links
> we added instead and go back to a "clean" state?
They remain loaded. I agree, we should go back to the clean state;
that probably should be achieved by adding proper cleanup to libbpf's
bpf_object__pin when it fails. I can look into that for v2.

> > +	} else {
> > +		err = bpf_obj_pin(bpf_program__fd(prog), pinfile);
> > +		if (err) {
> > +			p_err("failed to pin program %s",
> > +			      bpf_program__title(prog, false));
> > +			goto err_close_obj;
> > +		}
> 
> I don't have the same opinion as Jakub for pinning :). I was hoping we
> could also load additional programs (for tail calls) for
> non-flow_dissector programs. Could this be an occasion to update the
> code in that direction?
(See above for loading multiple progs from the same object).

If we want to be able to load non-flow_dissector programs from the same file,
we should have some way to distinguish between flow_dissector and
non-flow_dissector ones; I don't see an easy way to do that.

> > +	}
> >  
> >  	if (json_output)
> >  		jsonw_null(json_wtr);
> > @@ -1037,8 +1129,8 @@ static int do_help(int argc, char **argv)
> >  		"       %s %s pin   PROG FILE\n"
> >  		"       %s %s load  OBJ  FILE [type TYPE] [dev NAME] \\\n"
> >  		"                         [map { idx IDX | name NAME } MAP]\n"
> > -		"       %s %s attach PROG ATTACH_TYPE MAP\n"
> > -		"       %s %s detach PROG ATTACH_TYPE MAP\n"
> > +		"       %s %s attach PROG ATTACH_TYPE [MAP]\n"
> > +		"       %s %s detach PROG ATTACH_TYPE [MAP]\n"
> >  		"       %s %s help\n"
> >  		"\n"
> >  		"       " HELP_SPEC_MAP "\n"
> > @@ -1050,7 +1142,8 @@ static int do_help(int argc, char **argv)
> >  		"                 cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n"
> >  		"                 cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n"
> >  		"                 cgroup/sendmsg4 | cgroup/sendmsg6 }\n"
> > -		"       ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse }\n"
> > +		"       ATTACH_TYPE := { msg_verdict | skb_verdict | skb_parse |\n"
> > +		"                        flow_dissector }\n"
> >  		"       " HELP_SPEC_OPTIONS "\n"
> >  		"",
> >  		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> > 
> 
> Thanks a lot for the set!
Thank you for a review as well!
> Quentin

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-07 21:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107210009.y6jucu5oehlladkq@mini-arch>

On Wed, 7 Nov 2018 13:00:09 -0800, Stanislav Fomichev wrote:
> > > +	}
> > > +
> > > +	if (attr.prog_type == BPF_PROG_TYPE_FLOW_DISSECTOR) {
> > > +		err = build_flow_dissector_jmp_table(obj, prog, "jmp_table");
> > > +		if (err) {
> > > +			p_err("failed to build flow dissector jump table");
> > > +			goto err_close_obj;
> > > +		}
> > > +		/* flow dissector consist of multiple programs,
> > > +		 * we want to pin them all  
> > 
> > Why pin them all shouldn't the main program be the only one pinned?  
> If I pin only the main program, the tail ones disappear from the jmp_table map
> when bpftool exits.
> Am I missing something?
> Should BPF_MAP_TYPE_PROG_ARRAY hold the referenced progs?

It does.

# bpftool map create /sys/fs/bpf/map \
	type prog_array key 4 value 4 entries 4 \
	name tail_call_map
# bpftool prog load perf_event_output_stack.o /sys/fs/bpf/prog \
	 type xdp 
# bpftool map update pinned /sys/fs/bpf/map \
	key 0 0 0 0 \
	value pinned /sys/fs/bpf/prog
# bpftool prog 
11: xdp  name xdp_prog1  tag 6f1b482b27443edf  gpl
	loaded_at 2018-11-07T13:09:20-0800  uid 0
	xlated 144B  jited 130B  memlock 4096B  map_ids 14
# rm /sys/fs/bpf/prog
# bpftool prog 
11: xdp  name xdp_prog1  tag 6f1b482b27443edf  gpl
	loaded_at 2018-11-07T13:09:20-0800  uid 0
	xlated 144B  jited 130B  memlock 4096B  map_ids 14
# rm /sys/fs/bpf/map 
# bpftool prog 
# bpftool prog show id 11
Error: get by id (11): No such file or directory


I think we should remove all this auto tail call construction unless
we have solid annotations in the elf file which can clearly guide the
loader.

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-07 21:06 UTC (permalink / raw)
  To: Paweł Staszewski, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <394a0bf2-fa97-1085-2eda-98ddf476895c@itcare.pl>

On 11/3/18 6:24 PM, Paweł Staszewski wrote:
>> Does your setup have any other device types besides physical ports with
>> VLANs (e.g., any macvlans or bonds)?
>>
>>
> no.
> just
> phy(mlnx)->vlans only config

VLAN and non-VLAN (and a mix) seem to work ok. Patches are here:
   https://github.com/dsahern/linux.git bpf/kernel-tables-wip

I got lazy with the vlan exports; right now it requires 8021q to be
builtin (CONFIG_VLAN_8021Q=y)

You can use the xdp_fwd sample:
  make O=kbuild -C samples/bpf -j 8

Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server
and run:
   ./xdp_fwd <list of NIC ports>

e.g., in my testing I run:
   xdp_fwd eth1 eth2 eth3 eth4

All of the relevant forwarding ports need to be on the same command
line. This version populates a second map to verify the egress port has
XDP enabled.


> 
> And today again after allpy patch for page allocator - reached again
> 64/64 Gbit/s
> 
> with only 50-60% cpu load

you should see the cpu load drop considerably.

^ permalink raw reply

* Re: [PATCH] dpaa_eth: add ethtool coalesce control
From: David Miller @ 2018-11-08  6:33 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, linux-kernel
In-Reply-To: <1541598823-22533-1-git-send-email-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Wed,  7 Nov 2018 15:53:43 +0200

> +static int dpaa_set_coalesce(struct net_device *dev,
> +			     struct ethtool_coalesce *c)
> +{
> +	const cpumask_t *cpus = qman_affine_cpus();
> +	struct qman_portal *portal;
> +	u32 period;
> +	u8 thresh;
> +	int cpu;
> +
> +	period = c->rx_coalesce_usecs;
> +	thresh = c->rx_max_coalesced_frames;
> +
> +	for_each_cpu(cpu, cpus) {
> +		portal = qman_get_affine_portal(cpu);
> +		qman_portal_set_iperiod(portal, period);
> +		qman_dqrr_set_ithresh(portal, thresh);
> +	}

You really have to check to see if the user is trying to configure
a setting you don't support, for example if the user tries
to set ->use_adative_rx_coalesce or uses a period or threshold
value which is out of range.

Thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] bpftool: support loading flow dissector
From: Stanislav Fomichev @ 2018-11-07 21:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107120340.27001209@cakuba.netronome.com>

On 11/07, Jakub Kicinski wrote:
> On Wed,  7 Nov 2018 11:35:52 -0800, Stanislav Fomichev wrote:
> > This commit adds support for loading/attaching/detaching flow
> > dissector program. The structure of the flow dissector program is
> > assumed to be the same as in the selftests:
> > 
> > * flow_dissector section with the main entry point
> > * a bunch of tail call progs
> > * a jmp_table map that is populated with the tail call progs
> > 
> > When `bpftool load` is called with a flow_dissector prog (i.e. when the
> > first section is flow_dissector of 'type flow_dissector' argument is
> > passed), we load and pin all the programs and build the jump table.
> > 
> > The last argument of `bpftool attach` is made optional for this use
> > case.
> > 
> > Example:
> > bpftool prog load tools/testing/selftests/bpf/bpf_flow.o \
> > 	/sys/fs/bpf/flow type flow_dissector
> > bpftool prog attach pinned /sys/fs/bpf/flow/flow_dissector/0 flow_dissector
> > 
> > Tested by using the above two lines to load the prog in
> > the test_flow_dissector.sh selftest.
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  .../bpftool/Documentation/bpftool-prog.rst    |  16 ++-
> >  tools/bpf/bpftool/common.c                    |  32 +++--
> >  tools/bpf/bpftool/main.h                      |   1 +
> >  tools/bpf/bpftool/prog.c                      | 135 +++++++++++++++---
> 
> Please add the new attach type to bash completions.
Thanks for a quick review! Will address everything in the v2.
Answered some of your questions below.

> >  4 files changed, 141 insertions(+), 43 deletions(-)
> > 
> > diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > index ac4e904b10fb..3caa9153435b 100644
> > --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
> > @@ -25,8 +25,8 @@ MAP COMMANDS
> >  |	**bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
> >  |	**bpftool** **prog pin** *PROG* *FILE*
> >  |	**bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> > -|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* *MAP*
> > -|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* *MAP*
> > +|       **bpftool** **prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> > +|       **bpftool** **prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
> >  |	**bpftool** **prog help**
> >  |
> >  |	*MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
> > @@ -39,7 +39,9 @@ MAP COMMANDS
> >  |		**cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
> >  |		**cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
> >  |	}
> > -|       *ATTACH_TYPE* := { **msg_verdict** | **skb_verdict** | **skb_parse** }
> > +|       *ATTACH_TYPE* := {
> > +|	|	**msg_verdict** | **skb_verdict** | **skb_parse** | **flow_dissector**
> > +|	}
> >  
> >  
> >  DESCRIPTION
> > @@ -97,13 +99,13 @@ DESCRIPTION
> >  		  contain a dot character ('.'), which is reserved for future
> >  		  extensions of *bpffs*.
> >  
> > -        **bpftool prog attach** *PROG* *ATTACH_TYPE* *MAP*
> > +        **bpftool prog attach** *PROG* *ATTACH_TYPE* [*MAP*]
> >                    Attach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> > -                  to the map *MAP*.
> > +                  to the optional map *MAP*.
> 
> Perhaps we can do better on help?  Attach BPF program *PROG* (with type
> specified by *ATTACH_TYPE*).  Most *ATTACH_TYPEs* require a *MAP*
> parameter, with the exception of *flow_dissector* which is attached to
> current networking name space.
> 
> > -        **bpftool prog detach** *PROG* *ATTACH_TYPE* *MAP*
> > +        **bpftool prog detach** *PROG* *ATTACH_TYPE* [*MAP*]
> >                    Detach bpf program *PROG* (with type specified by *ATTACH_TYPE*)
> > -                  from the map *MAP*.
> > +                  from the optional map *MAP*.
> >  
> >  	**bpftool prog help**
> >  		  Print short help message.
> > diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> > index 25af85304ebe..963881142dfb 100644
> > --- a/tools/bpf/bpftool/common.c
> > +++ b/tools/bpf/bpftool/common.c
> 
> > @@ -204,10 +194,22 @@ int do_pin_fd(int fd, const char *name)
> >  
> >  out_free:
> >  	free(file);
> > -out:
> >  	return err;
> >  }
> >  
> > +int do_pin_fd(int fd, const char *name)
> > +{
> > +	int err = mount_bpffs_for_pin(name);
> 
> Please don't initialize the error variable with a non-trivial function
> call.
> 
> > +	if (err) {
> > +		p_err("can't mount bpffs for pin %s: %s",
> > +		      name, strerror(errno));
> 
> I think mount_bpffs_for_pin() will already print an error.  We can't
> print two errors, because it will break JSON output.
> 
> > +		return err;
> > +	}
> > +
> > +	return bpf_obj_pin(fd, name);
> > +}
> > +
> >  int do_pin_any(int argc, char **argv, int (*get_fd_by_id)(__u32))
> >  {
> >  	unsigned int id;
> 
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index 5302ee282409..f3a07ec3a444 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -81,6 +81,7 @@ static const char * const attach_type_strings[] = {
> >  	[BPF_SK_SKB_STREAM_PARSER] = "stream_parser",
> >  	[BPF_SK_SKB_STREAM_VERDICT] = "stream_verdict",
> >  	[BPF_SK_MSG_VERDICT] = "msg_verdict",
> > +	[BPF_FLOW_DISSECTOR] = "flow_dissector",
> >  	[__MAX_BPF_ATTACH_TYPE] = NULL,
> >  };
> >  
> > @@ -724,9 +725,10 @@ int map_replace_compar(const void *p1, const void *p2)
> >  static int do_attach(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type attach_type;
> > -	int err, mapfd, progfd;
> > +	int err, progfd;
> > +	int mapfd = 0;
> >  
> > -	if (!REQ_ARGS(5)) {
> > +	if (!REQ_ARGS(3)) {
> >  		p_err("too few parameters for map attach");
> >  		return -EINVAL;
> >  	}
> > @@ -741,10 +743,11 @@ static int do_attach(int argc, char **argv)
> >  		return -EINVAL;
> >  	}
> >  	NEXT_ARG();
> > -
> > -	mapfd = map_parse_fd(&argc, &argv);
> > -	if (mapfd < 0)
> > -		return mapfd;
> > +	if (argc > 0) {
> 
> Flow dissector can't need a map right?  I think explicitly checking for
> the correct number of arguments once attach type is known would be good.
Makes sense. I initially didn't want to depend on the attach_type too
much, but it might be more readable actually. Will chain in v2.

> > +		mapfd = map_parse_fd(&argc, &argv);
> > +		if (mapfd < 0)
> > +			return mapfd;
> > +	}
> >  
> >  	err = bpf_prog_attach(progfd, mapfd, attach_type, 0);
> >  	if (err) {
> > @@ -760,9 +763,10 @@ static int do_attach(int argc, char **argv)
> >  static int do_detach(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type attach_type;
> > -	int err, mapfd, progfd;
> > +	int err, progfd;
> > +	int mapfd = 0;
> >  
> > -	if (!REQ_ARGS(5)) {
> > +	if (!REQ_ARGS(3)) {
> >  		p_err("too few parameters for map detach");
> >  		return -EINVAL;
> >  	}
> > @@ -777,10 +781,11 @@ static int do_detach(int argc, char **argv)
> >  		return -EINVAL;
> >  	}
> >  	NEXT_ARG();
> > -
> > -	mapfd = map_parse_fd(&argc, &argv);
> > -	if (mapfd < 0)
> > -		return mapfd;
> > +	if (argc > 0) {
> > +		mapfd = map_parse_fd(&argc, &argv);
> > +		if (mapfd < 0)
> > +			return mapfd;
> > +	}
> >  
> >  	err = bpf_prog_detach2(progfd, mapfd, attach_type);
> >  	if (err) {
> > @@ -792,6 +797,56 @@ static int do_detach(int argc, char **argv)
> >  		jsonw_null(json_wtr);
> >  	return 0;
> >  }
> > +
> > +/* Flow dissector consists of a main program and a jump table for each
> > + * supported protocol. The assumption here is that the first prog is the main
> > + * one and the other progs are used in the tail calls. In this routine we
> > + * build the jump table for the non-main progs.
> > + */
> > +static int build_flow_dissector_jmp_table(struct bpf_object *obj,
> > +					  struct bpf_program *prog,
> > +					  const char *jmp_table_map)
> > +{
> > +	struct bpf_map *jmp_table;
> > +	struct bpf_program *pos;
> > +	int i = 0;
> > +	int prog_fd, jmp_table_fd, fd;
> 
> Please order variables longest to shortest.
> 
> > +	prog_fd = bpf_program__fd(prog);
> > +	if (prog_fd < 0) {
> > +		p_err("failed to get fd of main prog");
> > +		return prog_fd;
> > +	}
> > +
> > +	jmp_table = bpf_object__find_map_by_name(obj, jmp_table_map);
> > +	if (jmp_table == NULL) {
> 
> nit: !jmp_table
> 
> > +		p_err("failed to find '%s' map", jmp_table_map);
> > +		return -1;
> > +	}
> > +
> > +	jmp_table_fd = bpf_map__fd(jmp_table);
> > +	if (jmp_table_fd < 0) {
> > +		p_err("failed to get fd of jmp_table");
> > +		return jmp_table_fd;
> > +	}
> > +
> > +	bpf_object__for_each_program(pos, obj) {
> > +		fd = bpf_program__fd(pos);
> > +		if (fd < 0) {
> > +			p_err("failed to get fd of '%s'",
> > +			      bpf_program__title(pos, false));
> > +			return fd;
> > +		}
> > +
> > +		if (fd != prog_fd) {
> > +			bpf_map_update_elem(jmp_table_fd, &i, &fd, BPF_ANY);
> > +			++i;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int do_load(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type expected_attach_type;
> > @@ -800,7 +855,7 @@ static int do_load(int argc, char **argv)
> >  	};
> >  	struct map_replace *map_replace = NULL;
> >  	unsigned int old_map_fds = 0;
> > -	struct bpf_program *prog;
> > +	struct bpf_program *prog, *pos;
> 
> variable order
> 
> >  	struct bpf_object *obj;
> >  	struct bpf_map *map;
> >  	const char *pinfile;
> > @@ -918,13 +973,19 @@ static int do_load(int argc, char **argv)
> >  		goto err_free_reuse_maps;
> >  	}
> >  
> > -	prog = bpf_program__next(NULL, obj);
> > +	if (attr.prog_type == BPF_PROG_TYPE_FLOW_DISSECTOR) {
> > +		/* for the flow dissector type, the entry point is in the
> > +		 * section flow_dissector; other progs are tail calls
> > +		 */
> > +		prog = bpf_object__find_program_by_title(obj, "flow_dissector");
> > +	} else {
> > +		prog = bpf_program__next(NULL, obj);
> > +	}
> >  	if (!prog) {
> >  		p_err("object file doesn't contain any bpf program");
> >  		goto err_close_obj;
> >  	}
> >  
> > -	bpf_program__set_ifindex(prog, ifindex);
> >  	if (attr.prog_type == BPF_PROG_TYPE_UNSPEC) {
> >  		const char *sec_name = bpf_program__title(prog, false);
> >  
> > @@ -936,8 +997,13 @@ static int do_load(int argc, char **argv)
> >  			goto err_close_obj;
> >  		}
> >  	}
> > -	bpf_program__set_type(prog, attr.prog_type);
> > -	bpf_program__set_expected_attach_type(prog, expected_attach_type);
> > +
> > +	bpf_object__for_each_program(pos, obj) {
> > +		bpf_program__set_ifindex(pos, ifindex);
> > +		bpf_program__set_type(pos, attr.prog_type);
> > +		bpf_program__set_expected_attach_type(pos,
> > +						      expected_attach_type);
> > +	}
> >  
> >  	qsort(map_replace, old_map_fds, sizeof(*map_replace),
> >  	      map_replace_compar);
> > @@ -1001,8 +1067,34 @@ static int do_load(int argc, char **argv)
> >  		goto err_close_obj;
> >  	}
> >  
> > -	if (do_pin_fd(bpf_program__fd(prog), pinfile))
> > +	err = mount_bpffs_for_pin(pinfile);
> > +	if (err) {
> > +		p_err("failed to mount bpffs for pin '%s'", pinfile);
> 
> Probably would be a duplicated error again?
> 
> >  		goto err_close_obj;
> > +	}
> > +
> > +	if (attr.prog_type == BPF_PROG_TYPE_FLOW_DISSECTOR) {
> > +		err = build_flow_dissector_jmp_table(obj, prog, "jmp_table");
> > +		if (err) {
> > +			p_err("failed to build flow dissector jump table");
> > +			goto err_close_obj;
> > +		}
> > +		/* flow dissector consist of multiple programs,
> > +		 * we want to pin them all
> 
> Why pin them all shouldn't the main program be the only one pinned?
If I pin only the main program, the tail ones disappear from the jmp_table map
when bpftool exits.
Am I missing something?
Should BPF_MAP_TYPE_PROG_ARRAY hold the referenced progs?

> > +		 */
> > +		err = bpf_object__pin(obj, pinfile);
> > +		if (err) {
> > +			p_err("failed to pin flow dissector object");
> > +			goto err_close_obj;
> > +		}
> > +	} else {
> > +		err = bpf_obj_pin(bpf_program__fd(prog), pinfile);
> > +		if (err) {
> > +			p_err("failed to pin program %s",
> > +			      bpf_program__title(prog, false));
> > +			goto err_close_obj;
> > +		}
> > +	}
> 

^ permalink raw reply

* Re: [PATCH net-next] inet: minor optimization for backlog setting in listen(2)
From: David Miller @ 2018-11-08  6:31 UTC (permalink / raw)
  To: laoar.shao; +Cc: edumazet, netdev, linux-kernel
In-Reply-To: <1541589617-1607-1-git-send-email-laoar.shao@gmail.com>

From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed,  7 Nov 2018 19:20:16 +0800

> Set the backlog earlier in inet_dccp_listen() and inet_listen(),
> then we can avoid the redundant setting.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Applied.

^ permalink raw reply

* [PATCH net-next 2/2] net: phy: realtek: remove boilerplate code from driver configs
From: Heiner Kallweit @ 2018-11-07 20:54 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <08d52675-f308-4ebb-4ec4-f6c7ac0b6b06@gmail.com>

After a recent change phy_id_mask value 0 means "exact match". This
allows to remove setting phy_id_mask values from the driver configs.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/realtek.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index 7b1c89b38..abff4cdc9 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -215,13 +215,11 @@ static struct phy_driver realtek_drvs[] = {
 	{
 		.phy_id         = 0x00008201,
 		.name           = "RTL8201CP Ethernet",
-		.phy_id_mask    = 0x0000ffff,
 		.features       = PHY_BASIC_FEATURES,
 		.flags          = PHY_HAS_INTERRUPT,
 	}, {
 		.phy_id		= 0x001cc816,
 		.name		= "RTL8201F Fast Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_BASIC_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.ack_interrupt	= &rtl8201_ack_interrupt,
@@ -233,7 +231,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc910,
 		.name		= "RTL8211 Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.config_aneg	= rtl8211_config_aneg,
 		.read_mmd	= &genphy_read_mmd_unsupported,
@@ -241,7 +238,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc912,
 		.name		= "RTL8211B Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.ack_interrupt	= &rtl821x_ack_interrupt,
@@ -253,7 +249,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc913,
 		.name		= "RTL8211C Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.config_init	= rtl8211c_config_init,
 		.read_mmd	= &genphy_read_mmd_unsupported,
@@ -261,7 +256,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc914,
 		.name		= "RTL8211DN Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.ack_interrupt	= rtl821x_ack_interrupt,
@@ -271,7 +265,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc915,
 		.name		= "RTL8211E Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.ack_interrupt	= &rtl821x_ack_interrupt,
@@ -281,7 +274,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc916,
 		.name		= "RTL8211F Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.config_init	= &rtl8211f_config_init,
@@ -294,7 +286,6 @@ static struct phy_driver realtek_drvs[] = {
 	}, {
 		.phy_id		= 0x001cc961,
 		.name		= "RTL8366RB Gigabit Ethernet",
-		.phy_id_mask	= 0x001fffff,
 		.features	= PHY_GBIT_FEATURES,
 		.flags		= PHY_HAS_INTERRUPT,
 		.config_init	= &rtl8366rb_config_init,
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 1/2] net: phy: use phy_id_mask value zero for exact match
From: Heiner Kallweit @ 2018-11-07 20:53 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <08d52675-f308-4ebb-4ec4-f6c7ac0b6b06@gmail.com>

A phy_id_mask value zero means every PHYID matches, therefore
value zero isn't used. So we can safely redefine the semantics
of value zero to mean "exact match". This allows to avoid some
boilerplate code in PHY driver configs.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy_device.c | 21 +++++++++++++++------
 include/linux/phy.h          |  2 +-
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ab33d1777..d165a2c82 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -483,15 +483,24 @@ static int phy_bus_match(struct device *dev, struct device_driver *drv)
 			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
 				continue;
 
-			if ((phydrv->phy_id & phydrv->phy_id_mask) ==
-			    (phydev->c45_ids.device_ids[i] &
-			     phydrv->phy_id_mask))
-				return 1;
+			if (!phydrv->phy_id_mask) {
+				if (phydrv->phy_id ==
+				    phydev->c45_ids.device_ids[i])
+					return 1;
+			} else {
+				if ((phydrv->phy_id & phydrv->phy_id_mask) ==
+				    (phydev->c45_ids.device_ids[i] &
+				     phydrv->phy_id_mask))
+					return 1;
+			}
 		}
 		return 0;
 	} else {
-		return (phydrv->phy_id & phydrv->phy_id_mask) ==
-			(phydev->phy_id & phydrv->phy_id_mask);
+		if (!phydrv->phy_id_mask)
+			return phydrv->phy_id == phydev->phy_id;
+		else
+			return (phydrv->phy_id & phydrv->phy_id_mask) ==
+			       (phydev->phy_id & phydrv->phy_id_mask);
 	}
 }
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 2090277ea..e30ca2fdd 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -500,7 +500,7 @@ struct phy_driver {
 	struct mdio_driver_common mdiodrv;
 	u32 phy_id;
 	char *name;
-	u32 phy_id_mask;
+	u32 phy_id_mask; /* value 0 means exact match */
 	const unsigned long * const features;
 	u32 flags;
 	const void *driver_data;
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 0/2] net: phy: use phy_id_mask value zero for exact match
From: Heiner Kallweit @ 2018-11-07 20:52 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org

A phy_id_mask value zero means every PHYID matches, therefore
value zero isn't used. So we can safely redefine the semantics
of value zero to mean "exact match". This allows to avoid some
boilerplate code in PHY driver configs.

Realtek PHY driver is the first user of this change.

Heiner Kallweit (2):
  net: phy: use phy_id_mask value zero for exact match
  net: phy: realtek: remove boilerplate code from PHY driver definitions

 drivers/net/phy/phy_device.c | 21 +++++++++++++++------
 drivers/net/phy/realtek.c    |  9 ---------
 include/linux/phy.h          |  2 +-
 3 files changed, 16 insertions(+), 16 deletions(-)

-- 
2.19.1

^ permalink raw reply

* Re: [PATCH net-next 0/5] net: phy: improve and simplify phylib state machine
From: Heiner Kallweit @ 2018-11-07 20:45 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20181107202156.GD9599@lunn.ch>

On 07.11.2018 21:21, Andrew Lunn wrote:
> On Wed, Nov 07, 2018 at 09:05:49PM +0100, Heiner Kallweit wrote:
>> On 07.11.2018 20:48, Andrew Lunn wrote:
>>> On Wed, Nov 07, 2018 at 08:41:52PM +0100, Heiner Kallweit wrote:
>>>> This patch series is based on two axioms:
>>>>
>>>> - During autoneg a PHY always reports the link being down
>>>
>>> Hi Heiner
>>>
>>> I think that is a risky assumption to make.
>>>
>> I wasn't sure initially too (found no clear rule in 802.3 clause 22)
>> and therefore asked around. Florian agrees to the assumption,
>> see here: https://www.spinics.net/lists/netdev/msg519242.html
>>
>> If a PHY reports the link as up then every user would assume that
>> data can be transferred. But that's not the case during aneg.
>> Therefore reporting the link as up during aneg wouldn't make sense.
> 
> Hi Heiner
> 
> If auto-neg has already been completed once before, i can see a lazy
> hardware designed not reporting link down, or at least, not until
> auto-neg actually fails.
> 
"aneg finished" flag means that the aneg parameters in the register set
are valid. Once the link goes down that's not necessarily the case any
longer. E.g. some PHYs have an "auto speed down" feature and reduce
the speed to save power once they detect the link is down.
Of course I can not rule out that there are broken designs (or as you
stated more politely: lazy designs) out there. But in this case I assume
we would see issues already. And we would have to think about whether we
want to support such broken / lazy designs in phylib.

> And what about if link is down for too short a time for us to notice?
> I've seen some code fail because the kernel went off and did something
> else for too long, and a state change was missed. 
> 
This is a case we have already, independent of my change.
genphy_update_link() reads BMSR twice, thus ignoring potential latched
info about a temporary link failure. When polling phylib ignores
everything that happens between two poll intervals.

>>> What happens if this assumption is incorrect?
>>>
>> Then we have to flush this patch series down the drain ;)
>> At least I would have to check in detail which parts need to be
>> changed. I clearly mention the assumptions so that every
>> reviewer can check whether he agrees.
> 
> Thanks for doing that. I want to be happy this is safe, and not going
> to introduce regressions.
> 
>    Andrew
> 

^ permalink raw reply

* Re: [PATCH] net: dsa: bcm_sf2: fix semicolon.cocci warnings
From: David Miller @ 2018-11-08  6:14 UTC (permalink / raw)
  To: fengguang.wu
  Cc: f.fainelli, kbuild-all, netdev, andrew, vivien.didelot,
	linux-kernel
In-Reply-To: <20181107005034.GA41962@lkp-ivb-ep02>

From: kbuild test robot <fengguang.wu@intel.com>
Date: Wed, 7 Nov 2018 08:50:34 +0800

> From: kbuild test robot <fengguang.wu@intel.com>
> 
> drivers/net/dsa/bcm_sf2_cfp.c:1168:2-3: Unneeded semicolon
> drivers/net/dsa/bcm_sf2_cfp.c:532:2-3: Unneeded semicolon
> 
> 
>  Remove unneeded semicolon.
> 
> Generated by: scripts/coccinelle/misc/semicolon.cocci
> 
> Fixes: ae7a5aff783c ("net: dsa: bcm_sf2: Keep copy of inserted rules")
> CC: Florian Fainelli <f.fainelli@gmail.com>
> Signed-off-by: kbuild test robot <fengguang.wu@intel.com>

Applied.

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-07 20:32 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	guro, jiong.wang, bhole_prashant_q7, john.fastabend, jbenc,
	treeze.taeung, yhs, osk, sandipan
In-Reply-To: <484ea7f5-0d4c-b828-c5fe-1fd4d9bf847d@netronome.com>

On Wed, 7 Nov 2018 20:08:53 +0000, Quentin Monnet wrote:
> > +		err = bpf_obj_pin(bpf_program__fd(prog), pinfile);
> > +		if (err) {
> > +			p_err("failed to pin program %s",
> > +			      bpf_program__title(prog, false));
> > +			goto err_close_obj;
> > +		}  
> 
> I don't have the same opinion as Jakub for pinning :). I was hoping we
> could also load additional programs (for tail calls) for
> non-flow_dissector programs. Could this be an occasion to update the
> code in that direction?

Do you mean having the bpftool construct an array for tail calling
automatically when loading an object?  Or do a "mass pin" of all
programs in an object file?

I'm not convinced about this strategy of auto assembling a tail call
array by assuming that a flow dissector object carries programs for
protocols in order (apart from the main program which doesn't have to
be first, for some reason).

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox