Netdev List
* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Eric Dumazet @ 2018-11-01  0:53 UTC (permalink / raw)
  To: Christoph Paasch, netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181031232635.33750-1-cpaasch@apple.com>



On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
> 

Nice proposal, but I doubt a QUIC server can afford one UDP socket per connection.

It would add a huge overhead in terms of kernel memory usage, and lots of
epoll events to manage (say, a QUIC server with one million flows, each
receiving very few packets per second).

Could you elaborate on why one UDP socket per connection is needed?

> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand. Only after the connect() a full 4-tuple match
> is happening. So, one can't really create this kind of a server that has
> a connected UDP-socket per Quic connection.
> 
> So, what is needed is an "atomic bind & connect" that basically
> prevents any incoming traffic until the connect() call has been issued
> at which point the full 4-tuple is known.
> 
> 
> This patchset implements this functionality and exposes a socket-option
> to do this.
> 
> Usage would be:
> 
>         int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> 
>         int val = 1;
>         setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> 
>         bind(fd, (struct sockaddr *)&src, sizeof(src));
> 
> 	/* At this point, incoming traffic will never match on this socket */
> 
>         connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> 
> 	/* Only now incoming traffic will reach the socket */
> 
> 
> 
> There is literally an infinite number of ways on how to implement it,
> which is why I first send it out as an RFC. With this approach here I
> chose the least invasive one, just preventing the match on the incoming
> path.
> 
> 
> The reason for choosing a SOL_SOCKET socket-option and not at the
> SOL_UDP-level is because that functionality actually could be useful for
> other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> by binding to the source-IP and the destination-IP at the same time.

Passive TCP flows cannot benefit from this idea.

Active TCP flows can already do that, so I do not really understand what you are suggesting.
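
For reference, the conventional bind()-then-connect() sequence discussed above can be sketched in userspace as follows. This is only an illustration under stated assumptions: loopback_addr() and connected_udp_socket() are hypothetical helper names, SO_REUSEADDR is assumed as the way to share the local address with a master socket, and the proposed SO_DELAYED_BIND option is deliberately not used because it exists only in this RFC.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not from the patchset): loopback IPv4 address. */
struct sockaddr_in loopback_addr(unsigned short port)
{
	struct sockaddr_in sa;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	return sa;
}

/*
 * Hypothetical helper: create a connected UDP socket the conventional way.
 * Returns the file descriptor, or -1 on error.
 */
int connected_udp_socket(const struct sockaddr_in *src,
			 const struct sockaddr_in *dst)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

	if (fd < 0)
		return -1;

	/* Assumption: lets the socket share a local address with a master socket. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
		goto fail;

	if (bind(fd, (const struct sockaddr *)src, sizeof(*src)) < 0)
		goto fail;

	/*
	 * Window described in the RFC: the socket is bound but not yet
	 * connected, so it can be matched on the local address and port alone.
	 */

	if (connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0)
		goto fail;

	/* From here on, only datagrams from dst are delivered to fd. */
	return fd;

fail:
	close(fd);
	return -1;
}
```

A caller would pass the server's local address as src and the client's address as dst; any datagram that arrives between the two system calls may be queued on the new socket even though it belongs to another flow.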


* Re: pull-request: bpf 2018-11-01
From: David Miller @ 2018-11-01  0:39 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20181101002841.6267-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu,  1 Nov 2018 01:28:41 +0100

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
>    case when no data is available yet, from John.
> 
> 2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
>    with clang 3.8, from Andrey.
> 
> 3) Fix a partial copy of map pointer on scalar alu and remove id
>    generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.
> 
> 4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
>    program, from Yonghong.
> 
> 5) Fix ping for some BPF shell based kselftests where distro does not
>    ship "ping -6" anymore, from Li.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.
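
Item 1 matters to userspace because a return value of 0 from a stream receive conventionally means end-of-stream, while "no data yet" on a non-blocking socket is reported as -1 with errno set to EAGAIN. A minimal sketch of how a caller tells the cases apart, assuming a hypothetical helper named classify_recv() (nothing below comes from the pulled patches):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

enum recv_state { RECV_DATA, RECV_WOULD_BLOCK, RECV_EOF, RECV_ERROR };

/*
 * Hypothetical helper: perform one non-blocking receive and classify
 * the result. The raw return value of recv() is stored in *nread.
 */
enum recv_state classify_recv(int fd, void *buf, size_t len, ssize_t *nread)
{
	ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

	*nread = n;
	if (n > 0)
		return RECV_DATA;		/* payload available */
	if (n == 0)
		return RECV_EOF;		/* peer closed the stream */
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return RECV_WOULD_BLOCK;	/* nothing queued yet, retry later */
	return RECV_ERROR;
}
```

With a receive path that returned 0 when no data was queued, a caller written this way would treat an idle connection as closed, which is the behaviour the fix avoids.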


* pull-request: bpf 2018-11-01
From: Daniel Borkmann @ 2018-11-01  0:28 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix tcp_bpf_recvmsg() to return -EAGAIN instead of 0 in non-blocking
   case when no data is available yet, from John.

2) Fix a compilation error in libbpf_attach_type_by_name() when compiled
   with clang 3.8, from Andrey.

3) Fix a partial copy of map pointer on scalar alu and remove id
   generation for RET_PTR_TO_MAP_VALUE return types, from Daniel.

4) Add unlimited memlock limit for kernel selftest's flow_dissector_load
   program, from Yonghong.

5) Fix ping for some BPF shell based kselftests where distro does not
   ship "ping -6" anymore, from Li.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit a6b3a3fa042343e29ffaf9169f5ba3c819d4f9a2:

  net: mvpp2: Fix affinity hint allocation (2018-10-30 11:34:41 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to dfeb8f4c9692fd5e6c3eef19c2e4ae5338dbdb01:

  Merge branch 'verifier-fixes' (2018-10-31 16:53:18 -0700)

----------------------------------------------------------------
Alexei Starovoitov (1):
      Merge branch 'verifier-fixes'

Andrey Ignatov (1):
      libbpf: Fix compile error in libbpf_attach_type_by_name

Daniel Borkmann (4):
      bpf: fix partial copy of map_ptr when dst is scalar
      bpf: don't set id on after map lookup with ptr_to_map_val return
      bpf: add various test cases to test_verifier
      bpf: test make sure to run unpriv test cases in test_verifier

John Fastabend (1):
      bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

Li Zhijian (1):
      kselftests/bpf: use ping6 as the default ipv6 ping binary if it exists

Yonghong Song (1):
      tools/bpf: add unlimited rlimit for flow_dissector_load

 include/linux/bpf_verifier.h                      |   3 +
 kernel/bpf/verifier.c                             |  21 +-
 net/ipv4/tcp_bpf.c                                |   1 +
 tools/lib/bpf/libbpf.c                            |  13 +-
 tools/testing/selftests/bpf/flow_dissector_load.c |   2 +
 tools/testing/selftests/bpf/test_skb_cgroup_id.sh |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.sh     |   3 +-
 tools/testing/selftests/bpf/test_verifier.c       | 321 +++++++++++++++++++---
 8 files changed, 319 insertions(+), 48 deletions(-)
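
Item 4 concerns the locked-memory limit, against which BPF map and program memory is charged on kernels of this era. A test program can lift the limit before loading anything. The following is a sketch of that idea, not a copy of the selftest change; raise_memlock_unlimited() is a hypothetical name.

```c
#include <errno.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: raise RLIMIT_MEMLOCK to "unlimited".
 * Returns 0 on success or a negative errno value on failure
 * (typically -EPERM when the caller lacks CAP_SYS_RESOURCE).
 */
int raise_memlock_unlimited(void)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	if (setrlimit(RLIMIT_MEMLOCK, &r) < 0)
		return -errno;
	return 0;
}
```

Selftests normally run as root, so the call is expected to succeed there; an unprivileged caller should be prepared for -EPERM.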


* [PATCH net v6] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-11-01  0:14 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
In-Reply-To: <CAL6e_pfzPzt=rAxjWKAWHQqdrqejZ5e6vA1YoB3nGyc3_jeJeA@mail.gmail.com>

icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
 include/net/addrconf.h |  2 ++
 include/net/if_inet6.h |  2 ++
 net/ipv6/af_inet6.c    |  5 ++++
 net/ipv6/anycast.c     | 80 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 			     const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
 	struct in6_addr		aca_addr;
 	struct fib6_info	*aca_rt;
 	struct ifacaddr6	*aca_next;
+	struct hlist_node	aca_addr_lst;
 	int			aca_users;
 	refcount_t		aca_refcnt;
 	unsigned long		aca_cstamp;
 	unsigned long		aca_tstamp;
+	struct rcu_head		rcu;
 };
 
 #define	IFA_HOST	IPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
 	err = ip6_flowlabel_init();
 	if (err)
 		goto ip6_flowlabel_fail;
+	err = anycast_init();
+	if (err)
+		goto anycast_fail;
 	err = addrconf_init();
 	if (err)
 		goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
 	addrconf_cleanup();
 addrconf_fail:
+	anycast_cleanup();
+anycast_fail:
 	ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
 	ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..f6c4c8ac184c 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include <net/checksum.h>
 
+#define IN6_ADDR_HSIZE_SHIFT	8
+#define IN6_ADDR_HSIZE		BIT(IN6_ADDR_HSIZE_SHIFT)
+/*	anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+	return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  *	socket join an anycast group
  */
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
 	rtnl_unlock();
 }
 
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+	unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+	spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+	spin_lock(&acaddr_hash_lock);
+	hlist_del_init_rcu(&aca->aca_addr_lst);
+	spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
 	refcount_inc(&aca->aca_refcnt);
 }
 
+static void aca_free_rcu(struct rcu_head *h)
+{
+	struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+	fib6_info_release(aca->aca_rt);
+	kfree(aca);
+}
+
 static void aca_put(struct ifacaddr6 *ac)
 {
 	if (refcount_dec_and_test(&ac->aca_refcnt)) {
-		fib6_info_release(ac->aca_rt);
-		kfree(ac);
+		call_rcu(&ac->rcu, aca_free_rcu);
 	}
 }
 
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
 	aca->aca_addr = *addr;
 	fib6_info_hold(f6i);
 	aca->aca_rt = f6i;
+	INIT_HLIST_NODE(&aca->aca_addr_lst);
 	aca->aca_users = 1;
 	/* aca_tstamp should be updated upon changes */
 	aca->aca_cstamp = aca->aca_tstamp = jiffies;
@@ -285,6 +323,8 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 	aca_get(aca);
 	write_unlock_bh(&idev->lock);
 
+	ipv6_add_acaddr_hash(net, aca);
+
 	ip6_ins_rt(net, f6i);
 
 	addrconf_join_solict(idev->dev, &aca->aca_addr);
@@ -325,6 +365,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 	else
 		idev->ac_list = aca->aca_next;
 	write_unlock_bh(&idev->lock);
+	ipv6_del_acaddr_hash(aca);
 	addrconf_leave_solict(idev, &aca->aca_addr);
 
 	ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -352,6 +393,8 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
 		idev->ac_list = aca->aca_next;
 		write_unlock_bh(&idev->lock);
 
+		ipv6_del_acaddr_hash(aca);
+
 		addrconf_leave_solict(idev, &aca->aca_addr);
 
 		ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -390,17 +433,25 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct net_device *nh_dev;
+	struct ifacaddr6 *aca;
 	bool found = false;
 
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
 	else
-		for_each_netdev_rcu(net, dev)
-			if (ipv6_chk_acast_dev(dev, addr)) {
+		hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
+					 aca_addr_lst) {
+			nh_dev = fib6_info_nh_dev(aca->aca_rt);
+			if (!nh_dev || !net_eq(dev_net(nh_dev), net))
+				continue;
+			if (ipv6_addr_equal(&aca->aca_addr, addr)) {
 				found = true;
 				break;
 			}
+		}
 	rcu_read_unlock();
 	return found;
 }
@@ -539,4 +590,25 @@ void ac6_proc_exit(struct net *net)
 {
 	remove_proc_entry("anycast6", net->proc_net);
 }
+
+/*	Init / cleanup code
+ */
+int __init anycast_init(void)
+{
+	int i;
+
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		INIT_HLIST_HEAD(&inet6_acaddr_lst[i]);
+	return 0;
+}
+
+void anycast_cleanup(void)
+{
+	int i;
+
+	spin_lock(&acaddr_hash_lock);
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
+	spin_unlock(&acaddr_hash_lock);
+}
 #endif
-- 
2.14.1
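
The lookup scheme in this patch can be illustrated outside the kernel with a simplified bucket table. The sketch below is an assumption-laden model, not the patch itself: it has no locking or RCU, uses a plain integer netns_id in place of struct net, uses singly linked buckets instead of hlist, and mixes the address words with the 32-bit golden-ratio multiply as a stand-in for ipv6_addr_hash() and hash_32(). All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ACADDR_HSIZE_SHIFT	8
#define ACADDR_HSIZE		(1u << ACADDR_HSIZE_SHIFT)

/* Hypothetical, simplified stand-in for struct ifacaddr6. */
struct acaddr {
	uint8_t addr[16];	/* IPv6 anycast address */
	uint32_t netns_id;	/* stand-in for the owning network namespace */
	struct acaddr *next;	/* bucket chain */
};

static struct acaddr *acaddr_table[ACADDR_HSIZE];

/* Fold the address and namespace into a bucket index in [0, ACADDR_HSIZE). */
uint32_t acaddr_hash(uint32_t netns_id, const uint8_t addr[16])
{
	uint32_t w[4], val;

	memcpy(w, addr, sizeof(w));
	val = w[0] ^ w[1] ^ w[2] ^ w[3] ^ netns_id;
	return (val * 0x61C88647u) >> (32 - ACADDR_HSIZE_SHIFT);
}

/* Insert at the head of the matching bucket. */
void acaddr_add(struct acaddr *aca)
{
	uint32_t h = acaddr_hash(aca->netns_id, aca->addr);

	aca->next = acaddr_table[h];
	acaddr_table[h] = aca;
}

/* Walk one bucket instead of every device and every address on it. */
bool acaddr_exists(uint32_t netns_id, const uint8_t addr[16])
{
	const struct acaddr *aca;

	for (aca = acaddr_table[acaddr_hash(netns_id, addr)]; aca; aca = aca->next) {
		if (aca->netns_id != netns_id)
			continue;
		if (memcmp(aca->addr, addr, sizeof(aca->addr)) == 0)
			return true;
	}
	return false;
}
```

The point of the structure is that a lookup touches only the entries that hash to one of 256 buckets, so its cost no longer grows with the number of interfaces.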


* Re: [PATCH bpf 0/4] BPF fixes and tests
From: Alexei Starovoitov @ 2018-11-01  0:08 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, netdev
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>

On Thu, Nov 01, 2018 at 12:05:51AM +0100, Daniel Borkmann wrote:
> The series contains two fixes in BPF core and test cases. For details
> please see individual patches. Thanks!
> 
> Daniel Borkmann (4):
>   bpf: fix partial copy of map_ptr when dst is scalar
>   bpf: don't set id on after map lookup with ptr_to_map_val return
>   bpf: add various test cases to test_verifier
>   bpf: test make sure to run unpriv test cases in test_verifier
> 
>  include/linux/bpf_verifier.h                |   3 +
>  kernel/bpf/verifier.c                       |  21 +-
>  tools/testing/selftests/bpf/test_verifier.c | 321 +++++++++++++++++++++++++---
>  3 files changed, 305 insertions(+), 40 deletions(-)

Applied to bpf tree, Thanks

... and we reached a very nice milestone: test_verifier now has more than 1000 tests :)

Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED


* Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-11-01  0:02 UTC (permalink / raw)
  To: davem; +Cc: David Ahern, netdev, Alexey Kuznetsov, yoshfuji
In-Reply-To: <20181030.161916.2155476722804506340.davem@davemloft.net>

I'll follow this email with a new patch using ifacaddr6 instead of
creating a new struct. I ended up using fib6_nh.nh_dev to get the net,
instead of adding a back pointer to idev.  It seems that idev was
recently removed in favor of this, so if this is incorrect, please let
me know. Hopefully, I got the locking correct.

Thanks,
Jeff

On Tue, Oct 30, 2018 at 7:19 PM David Miller <davem@davemloft.net> wrote:
>
> From: David Ahern <dsahern@gmail.com>
> Date: Tue, 30 Oct 2018 16:06:46 -0600
>
> > or make the table per namespace.
>
> This will increase namespace create/destroy cost, so I'd rather not
> for something like this.


* [PATCH v3 2/2] trace: remove kretprobed checks
From: Aleksa Sarai @ 2018-11-01  8:35 UTC (permalink / raw)
  To: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
	Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
	Daniel Borkmann
  Cc: Aleksa Sarai, Aleksa Sarai, Christian Brauner, Brendan Gregg,
	netdev, linux-doc, linux-kernel, linux-kselftest
In-Reply-To: <20181101083551.3805-1-cyphar@cyphar.com>

This is effectively a reversion of commit 76094a2cf46e ("ftrace:
distinguish kretprobe'd functions in trace logs"), as the checking of
kretprobe_trampoline *for tracing* is no longer necessary with the new
kretprobe stack trace changes.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 kernel/trace/trace_output.c | 34 ++++------------------------------
 1 file changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 6e6cc64faa38..951de16bd4fd 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -321,36 +321,14 @@ int trace_output_call(struct trace_iterator *iter, char *name, char *fmt, ...)
 }
 EXPORT_SYMBOL_GPL(trace_output_call);
 
-#ifdef CONFIG_KRETPROBES
-static inline const char *kretprobed(const char *name)
-{
-	static const char tramp_name[] = "kretprobe_trampoline";
-	int size = sizeof(tramp_name);
-
-	if (strncmp(tramp_name, name, size) == 0)
-		return "[unknown/kretprobe'd]";
-	return name;
-}
-#else
-static inline const char *kretprobed(const char *name)
-{
-	return name;
-}
-#endif /* CONFIG_KRETPROBES */
-
 static void
 seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address)
 {
 	char str[KSYM_SYMBOL_LEN];
 #ifdef CONFIG_KALLSYMS
-	const char *name;
-
 	kallsyms_lookup(address, NULL, NULL, NULL, str);
-
-	name = kretprobed(str);
-
-	if (name && strlen(name)) {
-		trace_seq_printf(s, fmt, name);
+	if (strlen(str)) {
+		trace_seq_printf(s, fmt, str);
 		return;
 	}
 #endif
@@ -364,13 +342,9 @@ seq_print_sym_offset(struct trace_seq *s, const char *fmt,
 {
 	char str[KSYM_SYMBOL_LEN];
 #ifdef CONFIG_KALLSYMS
-	const char *name;
-
 	sprint_symbol(str, address);
-	name = kretprobed(str);
-
-	if (name && strlen(name)) {
-		trace_seq_printf(s, fmt, name);
+	if (strlen(str)) {
+		trace_seq_printf(s, fmt, str);
 		return;
 	}
 #endif
-- 
2.19.1


* [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Aleksa Sarai @ 2018-11-01  8:35 UTC (permalink / raw)
  To: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
	Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
	Daniel Borkmann
  Cc: Aleksa Sarai, Brendan Gregg, Christian Brauner, Aleksa Sarai,
	netdev, linux-doc, linux-kernel, linux-kselftest
In-Reply-To: <20181101083551.3805-1-cyphar@cyphar.com>

Historically, kretprobe has always produced unusable stack traces
(kretprobe_trampoline is the only entry in most cases, because of the
funky stack pointer overwriting). This has caused quite a few annoyances
when using tracing to debug problems[1] -- since return values are only
available with kretprobes but stack traces were only usable for kprobes,
users had to probe both and then manually associate them.

With the advent of bpf_trace, users would have been able to do this
association in bpf, but this was less than ideal (because
bpf_get_stackid would still produce rubbish and programs that didn't
know better would get silly results). The main usecase for stack traces
(at least with bpf_trace) is for DTrace-style aggregation on stack
traces (both entry and exit). Therefore we cannot simply correct the
stack trace on exit -- we must stash away the stack trace and return the
entry stack trace when it is requested.

[1]: https://github.com/iovisor/bpftrace/issues/101

Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Christian Brauner <christian@brauner.io>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/kprobes.txt                     |   6 +-
 include/linux/kprobes.h                       |  27 +++++
 kernel/events/callchain.c                     |   8 +-
 kernel/kprobes.c                              | 101 +++++++++++++++++-
 kernel/trace/trace.c                          |  11 +-
 .../test.d/kprobe/kretprobe_stacktrace.tc     |  25 +++++
 6 files changed, 173 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc

diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 10f4499e677c..1965585848f4 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -597,7 +597,11 @@ address with the trampoline's address, stack backtraces and calls
 to __builtin_return_address() will typically yield the trampoline's
 address instead of the real return address for kretprobed functions.
 (As far as we can tell, __builtin_return_address() is used only
-for instrumentation and error reporting.)
+for instrumentation and error reporting.) However, since return probes
+are used extensively in tracing (where stack backtraces are useful),
+return probes will stash away the stack backtrace during function entry
+so that return probe handlers can use the entry backtrace instead of
+having a trace with just kretprobe_trampoline.
 
 If the number of times a function is called does not match the number
 of times it returns, registering a return probe on that function may
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index e909413e4e38..1a1629544e56 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -40,6 +40,8 @@
 #include <linux/rcupdate.h>
 #include <linux/mutex.h>
 #include <linux/ftrace.h>
+#include <linux/stacktrace.h>
+#include <linux/perf_event.h>
 #include <asm/kprobes.h>
 
 #ifdef CONFIG_KPROBES
@@ -168,11 +170,18 @@ struct kretprobe {
 	raw_spinlock_t lock;
 };
 
+#define KRETPROBE_TRACE_SIZE 127
+struct kretprobe_trace {
+	int nr_entries;
+	unsigned long entries[KRETPROBE_TRACE_SIZE];
+};
+
 struct kretprobe_instance {
 	struct hlist_node hlist;
 	struct kretprobe *rp;
 	kprobe_opcode_t *ret_addr;
 	struct task_struct *task;
+	struct kretprobe_trace entry;
 	char data[0];
 };
 
@@ -371,6 +380,12 @@ void unregister_kretprobe(struct kretprobe *rp);
 int register_kretprobes(struct kretprobe **rps, int num);
 void unregister_kretprobes(struct kretprobe **rps, int num);
 
+struct kretprobe_instance *current_kretprobe_instance(void);
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+				struct stack_trace *trace);
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+				     struct perf_callchain_entry_ctx *ctx);
+
 void kprobe_flush_task(struct task_struct *tk);
 void recycle_rp_inst(struct kretprobe_instance *ri, struct hlist_head *head);
 
@@ -397,6 +412,18 @@ static inline struct kprobe *kprobe_running(void)
 {
 	return NULL;
 }
+static inline struct kretprobe_instance *current_kretprobe_instance(void)
+{
+	return NULL;
+}
+static inline void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+					      struct stack_trace *trace)
+{
+}
+static inline void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+						   struct perf_callchain_entry_ctx *ctx)
+{
+}
 static inline int register_kprobe(struct kprobe *p)
 {
 	return -ENOSYS;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 24a77c34e9ad..98edcd8a6987 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -12,6 +12,7 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <linux/sched/task_stack.h>
+#include <linux/kprobes.h>
 
 #include "internal.h"
 
@@ -197,9 +198,14 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 	ctx.contexts_maxed = false;
 
 	if (kernel && !user_mode(regs)) {
+		struct kretprobe_instance *ri = current_kretprobe_instance();
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
-		perf_callchain_kernel(&ctx, regs);
+		if (ri)
+			kretprobe_perf_callchain_kernel(ri, &ctx);
+		else
+			perf_callchain_kernel(&ctx, regs);
 	}
 
 	if (user) {
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 90e98e233647..fca3964d18cd 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1206,6 +1206,16 @@ __releases(hlist_lock)
 }
 NOKPROBE_SYMBOL(kretprobe_table_unlock);
 
+static bool kretprobe_hash_is_locked(struct task_struct *tsk)
+{
+	unsigned long hash = hash_ptr(tsk, KPROBE_HASH_BITS);
+	raw_spinlock_t *hlist_lock;
+
+	hlist_lock = kretprobe_table_lock_ptr(hash);
+	return raw_spin_is_locked(hlist_lock);
+}
+NOKPROBE_SYMBOL(kretprobe_hash_is_locked);
+
 /*
  * This function is called from finish_task_switch when task tk becomes dead,
  * so that we can recycle any function-return probe instances associated
@@ -1800,6 +1810,13 @@ unsigned long __weak arch_deref_entry_point(void *entry)
 	return (unsigned long)entry;
 }
 
+static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs);
+
+static inline bool kprobe_is_retprobe(struct kprobe *kp)
+{
+	return kp->pre_handler == pre_handler_kretprobe;
+}
+
 #ifdef CONFIG_KRETPROBES
 /*
  * This kprobe pre_handler is registered with every kretprobe. When probe
@@ -1826,6 +1843,8 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 	hash = hash_ptr(current, KPROBE_HASH_BITS);
 	raw_spin_lock_irqsave(&rp->lock, flags);
 	if (!hlist_empty(&rp->free_instances)) {
+		struct stack_trace trace = {};
+
 		ri = hlist_entry(rp->free_instances.first,
 				struct kretprobe_instance, hlist);
 		hlist_del(&ri->hlist);
@@ -1834,6 +1853,11 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 		ri->rp = rp;
 		ri->task = current;
 
+		trace.entries = &ri->entry.entries[0];
+		trace.max_entries = KRETPROBE_TRACE_SIZE;
+		save_stack_trace_regs(regs, &trace);
+		ri->entry.nr_entries = trace.nr_entries;
+
 		if (rp->entry_handler && rp->entry_handler(ri, regs)) {
 			raw_spin_lock_irqsave(&rp->lock, flags);
 			hlist_add_head(&ri->hlist, &rp->free_instances);
@@ -1856,6 +1880,65 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(pre_handler_kretprobe);
 
+/*
+ * Return the kretprobe_instance associated with the current_kprobe. Calling
+ * this is only reasonable from within a kretprobe handler context (otherwise
+ * return NULL).
+ *
+ * Must be called within a kretprobe_hash_lock(current, ...) context.
+ */
+struct kretprobe_instance *current_kretprobe_instance(void)
+{
+	struct kprobe *kp;
+	struct kretprobe *rp;
+	struct kretprobe_instance *ri;
+	struct hlist_head *head;
+	unsigned long hash = hash_ptr(current, KPROBE_HASH_BITS);
+
+	kp = kprobe_running();
+	if (!kp || !kprobe_is_retprobe(kp))
+		return NULL;
+	if (WARN_ON(!kretprobe_hash_is_locked(current)))
+		return NULL;
+
+	rp = container_of(kp, struct kretprobe, kp);
+	head = &kretprobe_inst_table[hash];
+
+	hlist_for_each_entry(ri, head, hlist) {
+		if (ri->task == current && ri->rp == rp)
+			return ri;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(current_kretprobe_instance);
+NOKPROBE_SYMBOL(current_kretprobe_instance);
+
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+				struct stack_trace *trace)
+{
+	int i;
+	struct kretprobe_trace *krt = &ri->entry;
+
+	for (i = trace->skip; i < krt->nr_entries; i++) {
+		if (trace->nr_entries >= trace->max_entries)
+			break;
+		trace->entries[trace->nr_entries++] = krt->entries[i];
+	}
+}
+
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+				     struct perf_callchain_entry_ctx *ctx)
+{
+	int i;
+	struct kretprobe_trace *krt = &ri->entry;
+
+	for (i = 0; i < krt->nr_entries; i++) {
+		if (krt->entries[i] == ULONG_MAX)
+			break;
+		perf_callchain_store(ctx, (u64) krt->entries[i]);
+	}
+}
+
 bool __weak arch_kprobe_on_func_entry(unsigned long offset)
 {
 	return !offset;
@@ -2005,6 +2088,22 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(pre_handler_kretprobe);
 
+struct kretprobe_instance *current_kretprobe_instance(void)
+{
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(current_kretprobe_instance);
+NOKPROBE_SYMBOL(current_kretprobe_instance);
+
+void kretprobe_save_stack_trace(struct kretprobe_instance *ri,
+				struct stack_trace *trace)
+{
+}
+
+void kretprobe_perf_callchain_kernel(struct kretprobe_instance *ri,
+				     struct perf_callchain_entry_ctx *ctx)
+{
+}
 #endif /* CONFIG_KRETPROBES */
 
 /* Set the kprobe gone and remove its instruction buffer. */
@@ -2241,7 +2340,7 @@ static void report_probe(struct seq_file *pi, struct kprobe *p,
 	char *kprobe_type;
 	void *addr = p->addr;
 
-	if (p->pre_handler == pre_handler_kretprobe)
+	if (kprobe_is_retprobe(p))
 		kprobe_type = "r";
 	else
 		kprobe_type = "k";
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bf6f1d70484d..2210d38a4dbf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -42,6 +42,7 @@
 #include <linux/nmi.h>
 #include <linux/fs.h>
 #include <linux/trace.h>
+#include <linux/kprobes.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/rt.h>
 
@@ -2590,6 +2591,7 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
 	struct ring_buffer_event *event;
 	struct stack_entry *entry;
 	struct stack_trace trace;
+	struct kretprobe_instance *ri = current_kretprobe_instance();
 	int use_stack;
 	int size = FTRACE_STACK_ENTRIES;
 
@@ -2626,7 +2628,9 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
 		trace.entries		= this_cpu_ptr(ftrace_stack.calls);
 		trace.max_entries	= FTRACE_STACK_MAX_ENTRIES;
 
-		if (regs)
+		if (ri)
+			kretprobe_save_stack_trace(ri, &trace);
+		else if (regs)
 			save_stack_trace_regs(regs, &trace);
 		else
 			save_stack_trace(&trace);
@@ -2653,7 +2657,10 @@ static void __ftrace_trace_stack(struct ring_buffer *buffer,
 	else {
 		trace.max_entries	= FTRACE_STACK_ENTRIES;
 		trace.entries		= entry->caller;
-		if (regs)
+
+		if (ri)
+			kretprobe_save_stack_trace(ri, &trace);
+		else if (regs)
 			save_stack_trace_regs(regs, &trace);
 		else
 			save_stack_trace(&trace);
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
new file mode 100644
index 000000000000..03146c6a1a3c
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
@@ -0,0 +1,25 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0+
+# description: Kretprobe dynamic event with a stacktrace
+
+[ -f kprobe_events ] || exit_unsupported # this is configurable
+
+echo 0 > events/enable
+echo 1 > options/stacktrace
+
+echo 'r:teststackprobe sched_fork $retval' > kprobe_events
+grep teststackprobe kprobe_events
+test -d events/kprobes/teststackprobe
+
+clear_trace
+echo 1 > events/kprobes/teststackprobe/enable
+( echo "forked")
+echo 0 > events/kprobes/teststackprobe/enable
+
+# Make sure we don't see kretprobe_trampoline and we see _do_fork.
+! grep 'kretprobe' trace
+grep '_do_fork' trace
+
+echo '-:teststackprobe' >> kprobe_events
+clear_trace
+test -d events/kprobes/teststackprobe && exit_fail || exit_pass
-- 
2.19.1
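
The stash-on-entry, replay-on-return idea in this patch can be modelled in plain C. The sketch below is a simplified userspace illustration under assumptions: the struct and function names are hypothetical, the frames are passed in as an array rather than unwound from registers, and the copy loop only mirrors the shape of kretprobe_save_stack_trace() (honour skip, stop at max_entries).

```c
#include <stddef.h>

#define TRACE_STASH_SIZE 127

/* Hypothetical stand-in for struct kretprobe_trace. */
struct trace_stash {
	int nr_entries;
	unsigned long entries[TRACE_STASH_SIZE];
};

/* Hypothetical stand-in for the fields of struct stack_trace used here. */
struct trace_out {
	unsigned int nr_entries;
	unsigned int max_entries;
	unsigned int skip;
	unsigned long *entries;
};

/* Function entry: remember the caller's frames, truncating if needed. */
void stash_entry_trace(struct trace_stash *st, const unsigned long *frames,
		       int nr)
{
	int i;

	if (nr > TRACE_STASH_SIZE)
		nr = TRACE_STASH_SIZE;
	for (i = 0; i < nr; i++)
		st->entries[i] = frames[i];
	st->nr_entries = nr;
}

/* Function return: hand back the stashed frames instead of unwinding. */
void replay_entry_trace(const struct trace_stash *st, struct trace_out *out)
{
	int i;

	for (i = (int)out->skip; i < st->nr_entries; i++) {
		if (out->nr_entries >= out->max_entries)
			break;
		out->entries[out->nr_entries++] = st->entries[i];
	}
}
```

Because the return handler reads the stashed copy, it reports the same frames the entry probe would have seen, which is what makes aggregation across entry and exit possible.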


* [PATCH v3 0/2] kretprobe: produce sane stack traces
From: Aleksa Sarai @ 2018-11-01  8:35 UTC (permalink / raw)
  To: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
	Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
	Daniel Borkmann
  Cc: Aleksa Sarai, Aleksa Sarai, Christian Brauner, Brendan Gregg,
	netdev, linux-doc, linux-kernel, linux-kselftest

Historically, kretprobe has always produced unusable stack traces
(kretprobe_trampoline is the only entry in most cases, because of the
funky stack pointer overwriting). This has caused quite a few annoyances
when using tracing to debug problems[1] -- since return values are only
available with kretprobes but stack traces were only usable for kprobes,
users had to probe both and then manually associate them.

This patch series stores the stack trace within kretprobe_instance on
the kprobe entry used to set up the kretprobe. This allows for
DTrace-style stack aggregation between function entry and exit with
tools like BPFtrace -- which would not really be doable if the stack
unwinder understood kretprobe_trampoline.

We also revert commit 76094a2cf46e ("ftrace: distinguish kretprobe'd
functions in trace logs") and any follow-up changes because that code is
no longer necessary now that stack traces are sane. *However* this patch
might be a bit contentious since the original usecase (that ftrace
returns shouldn't show kretprobe_trampoline) is arguably still an
issue. Feel free to drop it if you think it is wrong.

Patch changelog:
 v3:
   * kprobe: fix build on !CONFIG_KPROBES
 v2:
   * documentation: mention kretprobe stack-stashing
   * ftrace: add self-test for fixed kretprobe stacktraces
   * ftrace: remove [unknown/kretprobe'd] handling
   * kprobe: remove needless EXPORT statements
   * kprobe: minor corrections to current_kretprobe_instance (switch
     away from hlist_for_each_entry_safe)
   * kprobe: make maximum stack size 127, which is the ftrace default

Aleksa Sarai (2):
  kretprobe: produce sane stack traces
  trace: remove kretprobed checks

 Documentation/kprobes.txt                     |   6 +-
 include/linux/kprobes.h                       |  27 +++++
 kernel/events/callchain.c                     |   8 +-
 kernel/kprobes.c                              | 101 +++++++++++++++++-
 kernel/trace/trace.c                          |  11 +-
 kernel/trace/trace_output.c                   |  34 +-----
 .../test.d/kprobe/kretprobe_stacktrace.tc     |  25 +++++
 7 files changed, 177 insertions(+), 35 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc

-- 
2.19.1

* [RFC 2/2] udp: Support SO_DELAYED_BIND
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181031232635.33750-1-cpaasch@apple.com>

For UDP, there is only a single socket-hash table, the udptable.

We want to prevent incoming datagrams from matching this socket while
SO_DELAYED_BIND is set. Thus, when computing the score for unconnected
sockets, we simply reject the match as long as the flag is set.
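
To illustrate the idea outside the kernel, here is a minimal user-space
sketch of the lookup rule. The names toy_sock, toy_compute_score,
toy_connect and toy_lookup are made up for this example, and the scoring
is heavily simplified compared to the real compute_score() in
net/ipv4/udp.c. Only the early "no match while the flag is set" step is
meant to mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a UDP socket. */
struct toy_sock {
	uint32_t laddr;     /* bound local address */
	uint16_t lport;     /* bound local port */
	uint32_t raddr;     /* connected peer address, 0 if unconnected */
	uint16_t rport;     /* connected peer port, 0 if unconnected */
	bool delayed_bind;  /* models the SOCK_DELAYED_BIND flag */
};

/* Return -1 for "no match", otherwise a score (higher is more specific). */
int toy_compute_score(const struct toy_sock *sk, uint32_t saddr,
		      uint16_t sport, uint32_t daddr, uint16_t dport)
{
	int score = 1;

	assert(sk != NULL);
	if (sk->laddr != daddr || sk->lport != dport)
		return -1;
	if (sk->raddr) {
		if (sk->raddr != saddr)
			return -1;
		score += 4;
	}
	if (sk->rport) {
		if (sk->rport != sport)
			return -1;
		score += 4;
	}
	/* Models the patch: a delayed-bind socket never matches. */
	if (sk->delayed_bind)
		return -1;
	return score;
}

/* Models connect(): record the peer and clear the flag. */
void toy_connect(struct toy_sock *sk, uint32_t raddr, uint16_t rport)
{
	assert(sk != NULL);
	sk->raddr = raddr;
	sk->rport = rport;
	sk->delayed_bind = false;
}

/* Return the best-scoring socket for a datagram, or NULL if none matches. */
const struct toy_sock *toy_lookup(const struct toy_sock *socks, size_t n,
				  uint32_t saddr, uint16_t sport,
				  uint32_t daddr, uint16_t dport)
{
	const struct toy_sock *best = NULL;
	int best_score = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int score = toy_compute_score(&socks[i], saddr, sport,
					      daddr, dport);
		if (score > best_score) {
			best_score = score;
			best = &socks[i];
		}
	}
	return best;
}
```

In this model, a per-connection socket that is bound to the same
address and port as the master socket stays invisible to the lookup
until toy_connect() has run. After that it only wins for datagrams from
its own peer.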

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 net/ipv4/datagram.c | 1 +
 net/ipv4/udp.c      | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 300921417f89..9bf0e0d2ea33 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -78,6 +78,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	inet->inet_id = jiffies;
 
 	sk_dst_set(sk, &rt->dst);
+	sock_reset_flag(sk, SOCK_DELAYED_BIND);
 	err = 0;
 out:
 	return err;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..fb55f925342b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -408,6 +408,9 @@ static int compute_score(struct sock *sk, struct net *net,
 			score += 4;
 	}
 
+	if (sock_flag(sk, SOCK_DELAYED_BIND))
+		return -1;
+
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
 	return score;
-- 
2.16.2

* [RFC 1/2] net: Add new socket-option SO_DELAYED_BIND
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar
In-Reply-To: <20181031232635.33750-1-cpaasch@apple.com>

Add the SO_DELAYED_BIND socket option and store it as a flag in sk_flags.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h                    |  1 +
 include/uapi/asm-generic/socket.h     |  2 ++
 net/core/sock.c                       | 21 +++++++++++++++++++++
 10 files changed, 38 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..add6aca13b53 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -115,4 +115,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..98a86f406601 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -117,4 +117,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..f84bd74d58ee 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -126,4 +126,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..8fe20a7abf6e 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -107,4 +107,6 @@
 #define SO_TXTIME		0x4036
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		0x4037
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..c00b10909a72 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -114,4 +114,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..0825db0c9f46 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -104,6 +104,8 @@
 #define SO_TXTIME		0x003f
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		0x0040
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 1de07a7f7680..cd4d91e982d5 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -119,4 +119,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index f665d74ae509..16fbe54cf519 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -801,6 +801,7 @@ enum sock_flags {
 	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
 	SOCK_TXTIME,
 	SOCK_XDP, /* XDP is attached */
+	SOCK_DELAYED_BIND,
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index a12692e5f7a8..653f1f65a311 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -110,4 +110,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 6fcc4bc07d19..343baa820cf2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1047,6 +1047,23 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		}
 		break;
 
+	case SO_DELAYED_BIND:
+		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
+			if (sk->sk_protocol != IPPROTO_UDP)
+				ret = -ENOTSUPP;
+		} else {
+			ret = -ENOTSUPP;
+		}
+
+		if (!ret) {
+			if (val < 0 || val > 1)
+				ret = -EINVAL;
+			else
+				sock_valbool_flag(sk, SOCK_DELAYED_BIND, valbool);
+		}
+
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1391,6 +1408,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 				  SOF_TXTIME_REPORT_ERRORS : 0;
 		break;
 
+	case SO_DELAYED_BIND:
+		v.val = sock_flag(sk, SOCK_DELAYED_BIND);
+		break;
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).
-- 
2.16.2

* [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar

Implementations of Quic might want to create a separate socket for each
Quic-connection by creating a connected UDP-socket.

To achieve that on the server-side, a "master-socket" needs to wait for
incoming new connections and then create a new socket that will be a
connected UDP-socket. To create the latter, the server needs to
first bind() and then connect(). However, after the bind() the server
might already receive traffic on that new socket that is unrelated to the
Quic-connection at hand. A full 4-tuple match only happens after the
connect(). So, one can't really create this kind of server that has
a connected UDP-socket per Quic connection.

So, what is needed is an "atomic bind & connect" that basically
prevents any incoming traffic until the connect() call has been issued
at which point the full 4-tuple is known.


This patchset implements this functionality and exposes a socket-option
to do this.

Usage would be:

        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        int val = 1;
        setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));

        bind(fd, (struct sockaddr *)&src, sizeof(src));

        /* At this point, incoming traffic will never match on this socket */

        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        /* Only now incoming traffic will reach the socket */



There are many possible ways to implement this, which is why I am
sending it out as an RFC first. With the approach here I chose the
least invasive one: just preventing the match on the incoming path.


The reason for choosing a SOL_SOCKET-level socket-option rather than a
SOL_UDP-level one is that this functionality could be useful for
other protocols as well. E.g., TCP could make better use of the full
4-tuple space by binding to the source-IP and the destination-IP at the
same time.


Feedback is very welcome!


Christoph Paasch (2):
  net: Add new socket-option SO_DELAYED_BIND
  udp: Support SO_DELAYED_BIND

 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h                    |  1 +
 include/uapi/asm-generic/socket.h     |  2 ++
 net/core/sock.c                       | 21 +++++++++++++++++++++
 net/ipv4/datagram.c                   |  1 +
 net/ipv4/udp.c                        |  3 +++
 12 files changed, 42 insertions(+)

-- 
2.16.2

* [PATCH bpf 3/4] bpf: add various test cases to test_verifier
From: Daniel Borkmann @ 2018-10-31 23:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>

Add some more map-related test cases to the test_verifier kselftest
to improve test coverage. Summary: 1012 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c | 250 ++++++++++++++++++++++++++++
 1 file changed, 250 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 36f3d30..4c7445d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -6455,6 +6455,256 @@ static struct bpf_test tests[] = {
 		.prog_type = BPF_PROG_TYPE_TRACEPOINT,
 	},
 	{
+		"map access: known scalar += value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+			BPF_MOV64_IMM(BPF_REG_1, 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: value_ptr += known scalar",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+			BPF_MOV64_IMM(BPF_REG_1, 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: unknown scalar += value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: value_ptr += unknown scalar",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: value_ptr += value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R0 pointer += pointer prohibited",
+	},
+	{
+		"map access: known scalar -= value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+			BPF_MOV64_IMM(BPF_REG_1, 4),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R1 tried to subtract pointer from scalar",
+	},
+	{
+		"map access: value_ptr -= known scalar",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+			BPF_MOV64_IMM(BPF_REG_1, 4),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R0 min value is outside of the array range",
+	},
+	{
+		"map access: value_ptr -= known scalar, 2",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 5),
+			BPF_MOV64_IMM(BPF_REG_1, 6),
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_2),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: unknown scalar -= value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R1 tried to subtract pointer from scalar",
+	},
+	{
+		"map access: value_ptr -= unknown scalar",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R0 min value is negative",
+	},
+	{
+		"map access: value_ptr -= unknown scalar, 2",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 8),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0xf),
+			BPF_ALU64_IMM(BPF_OR, BPF_REG_1, 0x7),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0x7),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = ACCEPT,
+		.retval = 1,
+	},
+	{
+		"map access: value_ptr -= value_ptr",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+				     BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_0),
+			BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+			BPF_MOV64_IMM(BPF_REG_0, 1),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map_array_48b = { 3 },
+		.result = REJECT,
+		.errstr = "R0 invalid mem access 'inv'",
+		.errstr_unpriv = "R0 pointer -= pointer prohibited",
+	},
+	{
 		"map lookup helper access to map",
 		.insns = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-- 
2.9.5

* [PATCH bpf 4/4] bpf: test make sure to run unpriv test cases in test_verifier
From: Daniel Borkmann @ 2018-10-31 23:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>

Right now unprivileged tests are never executed as a BPF test run,
only loaded. Allow for running them as well so that we can check
the outcome and probe for regressions.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c | 71 ++++++++++++++++-------------
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 4c7445d..6f61df6 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -76,7 +76,7 @@ struct bpf_test {
 	int fixup_percpu_cgroup_storage[MAX_FIXUPS];
 	const char *errstr;
 	const char *errstr_unpriv;
-	uint32_t retval;
+	uint32_t retval, retval_unpriv;
 	enum {
 		UNDEF,
 		ACCEPT,
@@ -3084,6 +3084,8 @@ static struct bpf_test tests[] = {
 		.fixup_prog1 = { 2 },
 		.result = ACCEPT,
 		.retval = 42,
+		/* Verifier rewrite for unpriv skips tail call here. */
+		.retval_unpriv = 2,
 	},
 	{
 		"stack pointer arithmetic",
@@ -14149,6 +14151,33 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_map_type prog_type,
 	}
 }
 
+static int set_admin(bool admin)
+{
+	cap_t caps;
+	const cap_value_t cap_val = CAP_SYS_ADMIN;
+	int ret = -1;
+
+	caps = cap_get_proc();
+	if (!caps) {
+		perror("cap_get_proc");
+		return -1;
+	}
+	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
+				admin ? CAP_SET : CAP_CLEAR)) {
+		perror("cap_set_flag");
+		goto out;
+	}
+	if (cap_set_proc(caps)) {
+		perror("cap_set_proc");
+		goto out;
+	}
+	ret = 0;
+out:
+	if (cap_free(caps))
+		perror("cap_free");
+	return ret;
+}
+
 static void do_test_single(struct bpf_test *test, bool unpriv,
 			   int *passes, int *errors)
 {
@@ -14157,6 +14186,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 	struct bpf_insn *prog = test->insns;
 	int map_fds[MAX_NR_MAPS];
 	const char *expected_err;
+	uint32_t expected_val;
 	uint32_t retval;
 	int i, err;
 
@@ -14176,6 +14206,8 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 		       test->result_unpriv : test->result;
 	expected_err = unpriv && test->errstr_unpriv ?
 		       test->errstr_unpriv : test->errstr;
+	expected_val = unpriv && test->retval_unpriv ?
+		       test->retval_unpriv : test->retval;
 
 	reject_from_alignment = fd_prog < 0 &&
 				(test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -14209,16 +14241,20 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 		__u8 tmp[TEST_DATA_LEN << 2];
 		__u32 size_tmp = sizeof(tmp);
 
+		if (unpriv)
+			set_admin(true);
 		err = bpf_prog_test_run(fd_prog, 1, test->data,
 					sizeof(test->data), tmp, &size_tmp,
 					&retval, NULL);
+		if (unpriv)
+			set_admin(false);
 		if (err && errno != 524/*ENOTSUPP*/ && errno != EPERM) {
 			printf("Unexpected bpf_prog_test_run error\n");
 			goto fail_log;
 		}
-		if (!err && retval != test->retval &&
-		    test->retval != POINTER_VALUE) {
-			printf("FAIL retval %d != %d\n", retval, test->retval);
+		if (!err && retval != expected_val &&
+		    expected_val != POINTER_VALUE) {
+			printf("FAIL retval %d != %d\n", retval, expected_val);
 			goto fail_log;
 		}
 	}
@@ -14261,33 +14297,6 @@ static bool is_admin(void)
 	return (sysadmin == CAP_SET);
 }
 
-static int set_admin(bool admin)
-{
-	cap_t caps;
-	const cap_value_t cap_val = CAP_SYS_ADMIN;
-	int ret = -1;
-
-	caps = cap_get_proc();
-	if (!caps) {
-		perror("cap_get_proc");
-		return -1;
-	}
-	if (cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_val,
-				admin ? CAP_SET : CAP_CLEAR)) {
-		perror("cap_set_flag");
-		goto out;
-	}
-	if (cap_set_proc(caps)) {
-		perror("cap_set_proc");
-		goto out;
-	}
-	ret = 0;
-out:
-	if (cap_free(caps))
-		perror("cap_free");
-	return ret;
-}
-
 static void get_unpriv_disabled()
 {
 	char buf[2];
-- 
2.9.5

* [PATCH bpf 2/4] bpf: don't set id on after map lookup with ptr_to_map_val return
From: Daniel Borkmann @ 2018-10-31 23:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Roman Gushchin
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>

The verifier has no semantics under which registers of type
PTR_TO_MAP_VALUE have an id assigned to them. The id is only
used for PTR_TO_MAP_VALUE_OR_NULL and is reset later on, once the
test against NULL has been pattern matched and the type transformed
into PTR_TO_MAP_VALUE.
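
A small user-space model of that life cycle may help. All names here
(toy_env, toy_lookup_reg, toy_mark_map_lookup_ret, toy_mark_non_null)
are invented for this sketch, and it only models the type and id
fields, not the rest of the verifier state:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_ret_type {
	TOY_RET_PTR_TO_MAP_VALUE,
	TOY_RET_PTR_TO_MAP_VALUE_OR_NULL,
};

enum toy_reg_type {
	TOY_PTR_TO_MAP_VALUE,
	TOY_PTR_TO_MAP_VALUE_OR_NULL,
};

/* Hypothetical reduced register state: just the type and the id. */
struct toy_lookup_reg {
	enum toy_reg_type type;
	uint32_t id;
};

/* Hypothetical reduced verifier environment: just the id generator. */
struct toy_env {
	uint32_t id_gen;
};

/* Set up R0 after a helper call; only the OR_NULL case gets a fresh id. */
void toy_mark_map_lookup_ret(struct toy_env *env, struct toy_lookup_reg *r0,
			     enum toy_ret_type ret)
{
	assert(env != NULL && r0 != NULL);
	r0->id = 0;
	if (ret == TOY_RET_PTR_TO_MAP_VALUE) {
		r0->type = TOY_PTR_TO_MAP_VALUE;
	} else {
		r0->type = TOY_PTR_TO_MAP_VALUE_OR_NULL;
		r0->id = ++env->id_gen;
	}
}

/* Non-NULL branch of the NULL test: convert every register sharing the id. */
void toy_mark_non_null(struct toy_lookup_reg *regs, size_t n, uint32_t id)
{
	size_t i;

	assert(regs != NULL);
	for (i = 0; i < n; i++) {
		if (regs[i].type == TOY_PTR_TO_MAP_VALUE_OR_NULL &&
		    regs[i].id == id) {
			regs[i].type = TOY_PTR_TO_MAP_VALUE;
			regs[i].id = 0;
		}
	}
}
```

In this model a register of type TOY_PTR_TO_MAP_VALUE never carries a
non-zero id, which is the invariant the patch restores.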

Fixes: 3e6a4b3e0289 ("bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roman Gushchin <guro@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 774fa40..1971ca32 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2852,10 +2852,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		regs[BPF_REG_0].type = NOT_INIT;
 	} else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL ||
 		   fn->ret_type == RET_PTR_TO_MAP_VALUE) {
-		if (fn->ret_type == RET_PTR_TO_MAP_VALUE)
-			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
-		else
-			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
 		/* There is no offset yet applied, variable or fixed */
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		/* remember map_ptr, so that check_map_access()
@@ -2868,7 +2864,12 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			return -EINVAL;
 		}
 		regs[BPF_REG_0].map_ptr = meta.map_ptr;
-		regs[BPF_REG_0].id = ++env->id_gen;
+		if (fn->ret_type == RET_PTR_TO_MAP_VALUE) {
+			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
+		} else {
+			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
+			regs[BPF_REG_0].id = ++env->id_gen;
+		}
 	} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
 		int id = acquire_reference_state(env, insn_idx);
 		if (id < 0)
-- 
2.9.5

* [PATCH bpf 1/4] bpf: fix partial copy of map_ptr when dst is scalar
From: Daniel Borkmann @ 2018-10-31 23:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Edward Cree
In-Reply-To: <20181031230555.3371-1-daniel@iogearbox.net>

ALU operations on pointers such as scalar_reg += map_value_ptr are
handled in adjust_ptr_min_max_vals(). The problem, however, is that
map_ptr and range in the register state share a union, so transferring
state through dst_reg->range = ptr_reg->range is buggy: any new
map_ptr in the dst_reg is then truncated (or null) for subsequent
checks. Fix this by adding a raw member and using it for copying state
over to dst_reg.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf_verifier.h |  3 +++
 kernel/bpf/verifier.c        | 10 ++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9e8056e..d93e897 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -51,6 +51,9 @@ struct bpf_reg_state {
 		 *   PTR_TO_MAP_VALUE_OR_NULL
 		 */
 		struct bpf_map *map_ptr;
+
+		/* Max size from any of the above. */
+		unsigned long raw;
 	};
 	/* Fixed part of pointer offset, pointer types only */
 	s32 off;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 171a2c8..774fa40 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3046,7 +3046,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 			dst_reg->umax_value = umax_ptr;
 			dst_reg->var_off = ptr_reg->var_off;
 			dst_reg->off = ptr_reg->off + smin_val;
-			dst_reg->range = ptr_reg->range;
+			dst_reg->raw = ptr_reg->raw;
 			break;
 		}
 		/* A new variable offset is created.  Note that off_reg->off
@@ -3076,10 +3076,11 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		}
 		dst_reg->var_off = tnum_add(ptr_reg->var_off, off_reg->var_off);
 		dst_reg->off = ptr_reg->off;
+		dst_reg->raw = ptr_reg->raw;
 		if (reg_is_pkt_pointer(ptr_reg)) {
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
-			dst_reg->range = 0;
+			dst_reg->raw = 0;
 		}
 		break;
 	case BPF_SUB:
@@ -3108,7 +3109,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 			dst_reg->var_off = ptr_reg->var_off;
 			dst_reg->id = ptr_reg->id;
 			dst_reg->off = ptr_reg->off - smin_val;
-			dst_reg->range = ptr_reg->range;
+			dst_reg->raw = ptr_reg->raw;
 			break;
 		}
 		/* A new variable offset is created.  If the subtrahend is known
@@ -3134,11 +3135,12 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		}
 		dst_reg->var_off = tnum_sub(ptr_reg->var_off, off_reg->var_off);
 		dst_reg->off = ptr_reg->off;
+		dst_reg->raw = ptr_reg->raw;
 		if (reg_is_pkt_pointer(ptr_reg)) {
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
 			if (smin_val < 0)
-				dst_reg->range = 0;
+				dst_reg->raw = 0;
 		}
 		break;
 	case BPF_AND:
-- 
2.9.5

* [PATCH bpf 0/4] BPF fixes and tests
From: Daniel Borkmann @ 2018-10-31 23:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann

The series contains two fixes in BPF core and test cases. For details
please see individual patches. Thanks!

Daniel Borkmann (4):
  bpf: fix partial copy of map_ptr when dst is scalar
  bpf: don't set id on after map lookup with ptr_to_map_val return
  bpf: add various test cases to test_verifier
  bpf: test make sure to run unpriv test cases in test_verifier

 include/linux/bpf_verifier.h                |   3 +
 kernel/bpf/verifier.c                       |  21 +-
 tools/testing/selftests/bpf/test_verifier.c | 321 +++++++++++++++++++++++++---
 3 files changed, 305 insertions(+), 40 deletions(-)

-- 
2.9.5

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-10-31 22:45 UTC (permalink / raw)
  To: Eric Dumazet, netdev
In-Reply-To: <8e10bf68-f3b3-98f2-91a5-25b151756dd6@itcare.pl>



On 31.10.2018 at 23:20, Paweł Staszewski wrote:
>
>
> On 31.10.2018 at 23:09, Eric Dumazet wrote:
>>
>> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>>> Hi
>>>
>>> So maybe someone will be interested in how the Linux kernel handles
>>> normal traffic (not pktgen :) )
>>>
>>>
>>> Server HW configuration:
>>>
>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>
>>> NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 PCIe 8GT)
>>>
>>>
>>> Server software:
>>>
>>> FRR - as routing daemon
>>>
>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS bound to local
>>> NUMA node)
>>>
>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS bound to local
>>> NUMA node)
>>>
>>>
>>> Maximum traffic that server can handle:
>>>
>>> Bandwidth
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>    input: /proc/net/dev type: rate
>>>    \         iface                   Rx                   Tx                Total
>>> ==============================================================================
>>>         enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>>>         enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
>>> ------------------------------------------------------------------------------
>>>              total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
>>>
>>>
>>> Packets per second:
>>>
>>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>    input: /proc/net/dev type: rate
>>>    -         iface                   Rx                   Tx                Total
>>> ==============================================================================
>>>         enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>>>         enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>>> ------------------------------------------------------------------------------
>>>              total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
>>>
>>>
>>> After reaching those limits, the NICs on the upstream side (more RX
>>> traffic) start to drop packets.
>>>
>>>
>>> I just don't understand why the server can't handle more bandwidth
>>> (~40 Gbit/s is the limit where all CPUs are at 100% utilization)
>>> while pps on the RX side is still increasing.
>>>
>>> I was thinking that maybe I had reached some PCIe x16 limit - but x16
>>> at 8 GT/s is 126 Gbit/s - and also, when testing with pktgen, I can
>>> reach more bandwidth and pps (about 4x more compared to normal
>>> internet traffic).
>>>
>>> I am wondering if there is something that can be improved here.
>>>
>>>
>>>
>>> Some more information / counters / stats and perf top below:
>>>
>>> Perf top flame graph:
>>>
>>> https://uploadfiles.io/7zo6u
>>>
>>>
>>>
>>> System configuration(long):
>>>
>>>
>>> cat /sys/devices/system/node/node1/cpulist
>>> 14-27,42-55
>>> cat /sys/class/net/enp175s0f0/device/numa_node
>>> 1
>>> cat /sys/class/net/enp175s0f1/device/numa_node
>>> 1
>>>
>>>
>>>
>>>
>>>
>>> ip -s -d link ls dev enp175s0f0
>>> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>      RX: bytes  packets  errors  dropped overrun mcast
>>>      184142375840858 141347715974 2       2806325 0       85050528
>>>      TX: bytes  packets  errors  dropped carrier collsns
>>>      99270697277430 172227994003 0       0       0       0
>>>
>>>   ip -s -d link ls dev enp175s0f1
>>> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>>      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>>      RX: bytes  packets  errors  dropped overrun mcast
>>>      99686284170801 173507590134 61      669685  0       100304421
>>>      TX: bytes  packets  errors  dropped carrier collsns
>>>      184435107970545 142383178304 0       0       0       0
>>>
>>>
>>> ./softnet.sh
>>> cpu      total    dropped   squeezed  collision        rps flow_limit
>>>
>>>
>>>
>>>
>>>     PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>
>>>      26.78%  [kernel]       [k] queued_spin_lock_slowpath
>> This is highly suspect.
>>
>> A call graph (perf record -a -g sleep 1; perf report --stdio) would 
>> tell what is going on.
> perf report:
> https://ufile.io/rqp0h
>
>
>
>>
>> With that many TX/RX queues, I would expect you to not use RPS/RFS, 
>> and have a 1/1 RX/TX mapping,
>> so I do not know what could request a spinlock contention.
>>
>>
>>
>
>
And yes, there is no RPS/RFS - just a 1/1 RX/TX mapping, with affinity
mapped to the CPUs local to the network controller, for 28 RX+TX queues
per NIC.

* Re: [PATCH 0/5] Allwinner H6 Ethernet support
From: Jagan Teki @ 2018-11-01  7:38 UTC (permalink / raw)
  To: Icenowy Zheng
  Cc: Rob Herring, Maxime Ripard, Chen-Yu Tsai,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, clabbe-rdvid1DuHRBWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree, linux-arm-kernel,
	linux-kernel, linux-sunxi-/JYPxA39Uh5TLH3MbocFFw
In-Reply-To: <20180722053955.25266-1-icenowy-h8G6r0blFSE@public.gmane.org>

On Sun, Jul 22, 2018 at 11:10 AM Icenowy Zheng <icenowy-h8G6r0blFSE@public.gmane.org> wrote:
>
> This patchset introduces Allwinner H6 Ethernet support with code already
> available for A64.
>
> As the system controller and EMAC on H6 are all similar to the A64 ones,
> support for them is directly reused, by using fallback compatible
> strings.
>
> Icenowy Zheng (5):
>   dt-binding: dwmac-sun8i: add H6 compatible string (w/ A64 fallback)
>   dt-bindings: sunxi-sram: add binding for Allwinner H6 SRAM C
>   arm64: allwinner: h6: add system controller device tree node
>   arm64: allwinner: h6: add EMAC device nodes
>   arm64: allwinner: h6: add support for the Ethernet on Pine H64

Tested EMAC on Orangepi 1+

Tested-by: Jagan Teki <jagan-dyjBcgdgk7Pe9wHmmfpqLFaTQe2KTcn/@public.gmane.org>

* Re: [PATCH V2] mlx5: Fix formats with line continuation whitespace
From: Leon Romanovsky @ 2018-11-01  7:34 UTC (permalink / raw)
  To: Joe Perches
  Cc: Saeed Mahameed, David S. Miller, netdev, linux-rdma, linux-kernel
In-Reply-To: <f14db3287b23ed8af9bdbf8001e2e2fe7ae9e43a.camel@perches.com>

On Thu, Nov 01, 2018 at 12:24:08AM -0700, Joe Perches wrote:
> The line continuations unintentionally add whitespace so
> instead use coalesced formats to remove the whitespace.
>
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>
> v2: Remove excess space after %u
>
>  drivers/net/ethernet/mellanox/mlx5/core/rl.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

* [PATCH V2] mlx5: Fix formats with line continuation whitespace
From: Joe Perches @ 2018-11-01  7:24 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky
  Cc: David S. Miller, netdev, linux-rdma, linux-kernel

The line continuations unintentionally add whitespace so
instead use coalesced formats to remove the whitespace.

Signed-off-by: Joe Perches <joe@perches.com>
---

v2: Remove excess space after %u

 drivers/net/ethernet/mellanox/mlx5/core/rl.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/rl.c b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
index bc86dffdc43c..377b7e65ecf1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/rl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
@@ -188,8 +188,7 @@ int mlx5_rl_add_rate(struct mlx5_core_dev *dev, u16 *index,
 		/* new rate limit */
 		err = mlx5_set_pp_rate_limit_cmd(dev, entry->index, rl);
 		if (err) {
-			mlx5_core_err(dev, "Failed configuring rate limit(err %d): \
-				      rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
+			mlx5_core_err(dev, "Failed configuring rate limit(err %d): rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
 				      err, rl->rate, rl->max_burst_sz,
 				      rl->typical_pkt_sz);
 			goto out;
@@ -218,8 +217,7 @@ void mlx5_rl_remove_rate(struct mlx5_core_dev *dev, struct mlx5_rate_limit *rl)
 	mutex_lock(&table->rl_lock);
 	entry = find_rl_entry(table, rl);
 	if (!entry || !entry->refcount) {
-		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u \
-			       are not configured\n",
+		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u are not configured\n",
 			       rl->rate, rl->max_burst_sz, rl->typical_pkt_sz);
 		goto out;
 	}

* Re: [PATCH] mlx5: Fix formats with line continuation whitespace
From: Leon Romanovsky @ 2018-11-01  7:20 UTC (permalink / raw)
  To: Joe Perches
  Cc: Saeed Mahameed, David S. Miller, netdev, linux-rdma, linux-kernel
In-Reply-To: <e076161b88d0c87f083f28450a130de8eead618f.camel@perches.com>

On Thu, Nov 01, 2018 at 12:09:36AM -0700, Joe Perches wrote:
> The line continuations unintentionally add whitespace so
> instead use coalesced formats to remove the whitespace.
>
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/rl.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/rl.c b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
> index bc86dffdc43c..377b7e65ecf1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/rl.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
> @@ -188,8 +188,7 @@ int mlx5_rl_add_rate(struct mlx5_core_dev *dev, u16 *index,
>  		/* new rate limit */
>  		err = mlx5_set_pp_rate_limit_cmd(dev, entry->index, rl);
>  		if (err) {
> -			mlx5_core_err(dev, "Failed configuring rate limit(err %d): \
> -				      rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
> +			mlx5_core_err(dev, "Failed configuring rate limit(err %d): rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
>  				      err, rl->rate, rl->max_burst_sz,
>  				      rl->typical_pkt_sz);
>  			goto out;
> @@ -218,8 +217,7 @@ void mlx5_rl_remove_rate(struct mlx5_core_dev *dev, struct mlx5_rate_limit *rl)
>  	mutex_lock(&table->rl_lock);
>  	entry = find_rl_entry(table, rl);
>  	if (!entry || !entry->refcount) {
> -		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u \
> -			       are not configured\n",
> +		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u  are not configured\n",

                                                                 double space ^^^^^

>  			       rl->rate, rl->max_burst_sz, rl->typical_pkt_sz);
>  		goto out;
>  	}
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-10-31 22:20 UTC (permalink / raw)
  To: Eric Dumazet, netdev
In-Reply-To: <61e30474-b5e9-4dc8-a8a6-90cdd17d2a66@gmail.com>



On 31.10.2018 at 23:09, Eric Dumazet wrote:
>
> On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
>> Hi
>>
>> So maybe someone will be interested in how the Linux kernel handles normal traffic (not pktgen :) )
>>
>>
>> Server HW configuration:
>>
>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>
>> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>
>>
>> Server software:
>>
>> FRR - as routing daemon
>>
>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local NUMA node)
>>
>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local NUMA node)
>>
>>
>> Maximum traffic that server can handle:
>>
>> Bandwidth
>>
>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>    input: /proc/net/dev type: rate
>>    \         iface                   Rx                   Tx                Total
>> ==============================================================================
>>         enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>>         enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
>> ------------------------------------------------------------------------------
>>              total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
>>
>>
>> Packets per second:
>>
>>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>    input: /proc/net/dev type: rate
>>    -         iface                   Rx                   Tx                Total
>> ==============================================================================
>>         enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>>         enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>> ------------------------------------------------------------------------------
>>              total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
>>
>>
>> After reaching those limits, the NICs on the upstream side (more RX traffic) start to drop packets.
>>
>>
>> I just don't understand why the server can't handle more bandwidth (~40 Gbit/s is the limit where all CPUs are at 100% utilization) while pps on the RX side is still increasing.
>>
>> I was thinking that maybe I had reached some PCIe x16 limit - but x16 at 8 GT/s is 126 Gbit/s - and also, when testing with pktgen, I can reach more bandwidth and pps (about 4x more compared to normal internet traffic).
>>
>> I am wondering if there is something that can be improved here.
>>
>>
>>
>> Some more information / counters / stats and perf top below:
>>
>> Perf top flame graph:
>>
>> https://uploadfiles.io/7zo6u
>>
>>
>>
>> System configuration(long):
>>
>>
>> cat /sys/devices/system/node/node1/cpulist
>> 14-27,42-55
>> cat /sys/class/net/enp175s0f0/device/numa_node
>> 1
>> cat /sys/class/net/enp175s0f1/device/numa_node
>> 1
>>
>>
>>
>>
>>
>> ip -s -d link ls dev enp175s0f0
>> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>      RX: bytes  packets  errors  dropped overrun mcast
>>      184142375840858 141347715974 2       2806325 0       85050528
>>      TX: bytes  packets  errors  dropped carrier collsns
>>      99270697277430 172227994003 0       0       0       0
>>
>>   ip -s -d link ls dev enp175s0f1
>> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>>      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>>      RX: bytes  packets  errors  dropped overrun mcast
>>      99686284170801 173507590134 61      669685  0       100304421
>>      TX: bytes  packets  errors  dropped carrier collsns
>>      184435107970545 142383178304 0       0       0       0
>>
>>
>> ./softnet.sh
>> cpu      total    dropped   squeezed  collision        rps flow_limit
>>
>>
>>
>>
>>     PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>      26.78%  [kernel]       [k] queued_spin_lock_slowpath
> This is highly suspect.
>
> A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on.
perf report:
https://ufile.io/rqp0h



>
> With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping,
> so I do not know what could request a spinlock contention.
>
>
>

* [PATCH] mlx5: Fix formats with line continuation whitespace
From: Joe Perches @ 2018-11-01  7:09 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky
  Cc: David S. Miller, netdev, linux-rdma, linux-kernel

The line continuations unintentionally add whitespace so
instead use coalesced formats to remove the whitespace.

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/rl.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/rl.c b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
index bc86dffdc43c..377b7e65ecf1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/rl.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/rl.c
@@ -188,8 +188,7 @@ int mlx5_rl_add_rate(struct mlx5_core_dev *dev, u16 *index,
 		/* new rate limit */
 		err = mlx5_set_pp_rate_limit_cmd(dev, entry->index, rl);
 		if (err) {
-			mlx5_core_err(dev, "Failed configuring rate limit(err %d): \
-				      rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
+			mlx5_core_err(dev, "Failed configuring rate limit(err %d): rate %u, max_burst_sz %u, typical_pkt_sz %u\n",
 				      err, rl->rate, rl->max_burst_sz,
 				      rl->typical_pkt_sz);
 			goto out;
@@ -218,8 +217,7 @@ void mlx5_rl_remove_rate(struct mlx5_core_dev *dev, struct mlx5_rate_limit *rl)
 	mutex_lock(&table->rl_lock);
 	entry = find_rl_entry(table, rl);
 	if (!entry || !entry->refcount) {
-		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u \
-			       are not configured\n",
+		mlx5_core_warn(dev, "Rate %u, max_burst_sz %u typical_pkt_sz %u  are not configured\n",
 			       rl->rate, rl->max_burst_sz, rl->typical_pkt_sz);
 		goto out;
 	}

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Eric Dumazet @ 2018-10-31 22:09 UTC (permalink / raw)
  To: Paweł Staszewski, netdev
In-Reply-To: <61697e49-e839-befc-8330-fc00187c48ee@itcare.pl>



On 10/31/2018 02:57 PM, Paweł Staszewski wrote:
> Hi
> 
> So maybe someone will be interested in how the Linux kernel handles normal traffic (not pktgen :) )
> 
> 
> Server HW configuration:
> 
> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> 
> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> 
> 
> Server software:
> 
> FRR - as routing daemon
> 
> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local NUMA node)
> 
> enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local NUMA node)
> 
> 
> Maximum traffic that server can handle:
> 
> Bandwidth
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   \         iface                   Rx                   Tx                Total
> ==============================================================================
>        enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>        enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
> ------------------------------------------------------------------------------
>             total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
> 
> 
> Packets per second:
> 
>  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>   input: /proc/net/dev type: rate
>   -         iface                   Rx                   Tx                Total
> ==============================================================================
>        enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>        enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
> ------------------------------------------------------------------------------
>             total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
> 
> 
> After reaching those limits, the NICs on the upstream side (more RX traffic) start to drop packets.
> 
> 
> I just don't understand why the server can't handle more bandwidth (~40 Gbit/s is the limit where all CPUs are at 100% utilization) while pps on the RX side is still increasing.
> 
> I was thinking that maybe I had reached some PCIe x16 limit - but x16 at 8 GT/s is 126 Gbit/s - and also, when testing with pktgen, I can reach more bandwidth and pps (about 4x more compared to normal internet traffic).
> 
> I am wondering if there is something that can be improved here.
> 
> 
> 
> Some more information / counters / stats and perf top below:
> 
> Perf top flame graph:
> 
> https://uploadfiles.io/7zo6u
> 
> 
> 
> System configuration(long):
> 
> 
> cat /sys/devices/system/node/node1/cpulist
> 14-27,42-55
> cat /sys/class/net/enp175s0f0/device/numa_node
> 1
> cat /sys/class/net/enp175s0f1/device/numa_node
> 1
> 
> 
> 
> 
> 
> ip -s -d link ls dev enp175s0f0
> 6: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>     link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>     RX: bytes  packets  errors  dropped overrun mcast
>     184142375840858 141347715974 2       2806325 0       85050528
>     TX: bytes  packets  errors  dropped carrier collsns
>     99270697277430 172227994003 0       0       0       0
> 
>  ip -s -d link ls dev enp175s0f1
> 7: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192
>     link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535
>     RX: bytes  packets  errors  dropped overrun mcast
>     99686284170801 173507590134 61      669685  0       100304421
>     TX: bytes  packets  errors  dropped carrier collsns
>     184435107970545 142383178304 0       0       0       0
> 
> 
> ./softnet.sh
> cpu      total    dropped   squeezed  collision        rps flow_limit
> 
> 
> 
> 
>    PerfTop:  108490 irqs/sec  kernel:99.6%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>     26.78%  [kernel]       [k] queued_spin_lock_slowpath

This is highly suspect.

A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on.

With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping,
so I do not know what could request a spinlock contention.
