All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters
@ 2018-10-22 15:23 Sowmini Varadhan
  2018-10-22 15:23 ` [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code Sowmini Varadhan
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo

Problem statement:
  We would like to monitor some subset of TCP sockets in user-space,
  (the monitoring application would define 4-tuples it wants to monitor)
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.

One idea, implemented in this patchset, is to extend the tcp_call_bpf()
infra so that the BPF kernel module (the sock_ops filter/callback)
can examine the values in the sock_ops, apply any thresholds it wants,
and return some new status ("BPF_TCP_INFO_NOTIFY"). Use this status in
the tcp stack to queue up a tcp_info notification (similar to
sock_diag_broadcast_destroy() today..)

Patch 1 in this set refactors the existing sock_diag code so that
the functions can be reused for notifications from other states than CLOSE.

Patch 2 provides a minor extension to tcp_call_bpf() so that it
will queue a tcp_info_notification if the BPF callout returns 
BPF_TCP_INFO_NOTIFY

Patch 3, provided strictly as a demonstration/PoC to aid in reviewing
this proposal, shows a simple sample/bpf example where we trigger the
tcp_info notification for an iperf connection if the number of 
retransmits exceeds 16.


Sowmini Varadhan (3):
  sock_diag: Refactor inet_sock_diag_destroy code
  tcp: BPF_TCP_INFO_NOTIFY support
  bpf: Added a sample for tcp_info_notify callback

 include/linux/sock_diag.h      |   18 +++++++---
 include/net/tcp.h              |   15 +++++++-
 include/uapi/linux/bpf.h       |    4 ++
 include/uapi/linux/sock_diag.h |    2 +
 net/core/sock.c                |    4 +-
 net/core/sock_diag.c           |   11 +++---
 samples/bpf/Makefile           |    1 +
 samples/bpf/tcp_notify_kern.c  |   73 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 114 insertions(+), 14 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code
  2018-10-22 15:23 [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters Sowmini Varadhan
@ 2018-10-22 15:23 ` Sowmini Varadhan
  2018-10-22 15:23 ` [PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support Sowmini Varadhan
  2018-10-22 15:24 ` [PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback Sowmini Varadhan
  2 siblings, 0 replies; 4+ messages in thread
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo

We want to use the inet_sock_diag_destroy code to send notifications
for more types of TCP events than just socket_close(), so refactor
the code to allow this.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/sock_diag.h      |   18 +++++++++++++-----
 include/uapi/linux/sock_diag.h |    2 ++
 net/core/sock.c                |    4 ++--
 net/core/sock_diag.c           |   11 ++++++-----
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index 15fe980..df85767 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -34,7 +34,7 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk,
 			     struct sk_buff *skb, int attrtype);
 
 static inline
-enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
+enum sknetlink_groups sock_diag_group(const struct sock *sk)
 {
 	switch (sk->sk_family) {
 	case AF_INET:
@@ -43,7 +43,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
 
 		switch (sk->sk_protocol) {
 		case IPPROTO_TCP:
-			return SKNLGRP_INET_TCP_DESTROY;
+			switch (sk->sk_state) {
+			case TCP_ESTABLISHED:
+				return SKNLGRP_INET_TCP_CONNECTED;
+			case TCP_SYN_SENT:
+			case TCP_SYN_RECV:
+				return SKNLGRP_INET_TCP_3WH;
+			default:
+				return SKNLGRP_INET_TCP_DESTROY;
+			}
 		case IPPROTO_UDP:
 			return SKNLGRP_INET_UDP_DESTROY;
 		default:
@@ -67,15 +75,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk)
 }
 
 static inline
-bool sock_diag_has_destroy_listeners(const struct sock *sk)
+bool sock_diag_has_listeners(const struct sock *sk)
 {
 	const struct net *n = sock_net(sk);
-	const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+	const enum sknetlink_groups group = sock_diag_group(sk);
 
 	return group != SKNLGRP_NONE && n->diag_nlsk &&
 		netlink_has_listeners(n->diag_nlsk, group);
 }
-void sock_diag_broadcast_destroy(struct sock *sk);
+void sock_diag_broadcast(struct sock *sk);
 
 int sock_diag_destroy(struct sock *sk, int err);
 #endif
diff --git a/include/uapi/linux/sock_diag.h b/include/uapi/linux/sock_diag.h
index e592500..4252674 100644
--- a/include/uapi/linux/sock_diag.h
+++ b/include/uapi/linux/sock_diag.h
@@ -32,6 +32,8 @@ enum sknetlink_groups {
 	SKNLGRP_INET_UDP_DESTROY,
 	SKNLGRP_INET6_TCP_DESTROY,
 	SKNLGRP_INET6_UDP_DESTROY,
+	SKNLGRP_INET_TCP_3WH,
+	SKNLGRP_INET_TCP_CONNECTED,
 	__SKNLGRP_MAX,
 };
 #define SKNLGRP_MAX	(__SKNLGRP_MAX - 1)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a..6684840 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1600,8 +1600,8 @@ static void __sk_free(struct sock *sk)
 	if (likely(sk->sk_net_refcnt))
 		sock_inuse_add(sock_net(sk), -1);
 
-	if (unlikely(sk->sk_net_refcnt && sock_diag_has_destroy_listeners(sk)))
-		sock_diag_broadcast_destroy(sk);
+	if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk)))
+		sock_diag_broadcast(sk);
 	else
 		sk_destruct(sk);
 }
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 3312a58..dbd9e65 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -116,14 +116,14 @@ static size_t sock_diag_nlmsg_size(void)
 	       + nla_total_size_64bit(sizeof(struct tcp_info))); /* INET_DIAG_INFO */
 }
 
-static void sock_diag_broadcast_destroy_work(struct work_struct *work)
+static void sock_diag_broadcast_work(struct work_struct *work)
 {
 	struct broadcast_sk *bsk =
 		container_of(work, struct broadcast_sk, work);
 	struct sock *sk = bsk->sk;
 	const struct sock_diag_handler *hndl;
 	struct sk_buff *skb;
-	const enum sknetlink_groups group = sock_diag_destroy_group(sk);
+	const enum sknetlink_groups group = sock_diag_group(sk);
 	int err = -1;
 
 	WARN_ON(group == SKNLGRP_NONE);
@@ -144,11 +144,12 @@ static void sock_diag_broadcast_destroy_work(struct work_struct *work)
 	else
 		kfree_skb(skb);
 out:
-	sk_destruct(sk);
+	if (group <= SKNLGRP_INET6_UDP_DESTROY)
+		sk_destruct(sk);
 	kfree(bsk);
 }
 
-void sock_diag_broadcast_destroy(struct sock *sk)
+void sock_diag_broadcast(struct sock *sk)
 {
 	/* Note, this function is often called from an interrupt context. */
 	struct broadcast_sk *bsk =
@@ -156,7 +157,7 @@ void sock_diag_broadcast_destroy(struct sock *sk)
 	if (!bsk)
 		return sk_destruct(sk);
 	bsk->sk = sk;
-	INIT_WORK(&bsk->work, sock_diag_broadcast_destroy_work);
+	INIT_WORK(&bsk->work, sock_diag_broadcast_work);
 	queue_work(broadcast_wq, &bsk->work);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support
  2018-10-22 15:23 [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters Sowmini Varadhan
  2018-10-22 15:23 ` [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code Sowmini Varadhan
@ 2018-10-22 15:23 ` Sowmini Varadhan
  2018-10-22 15:24 ` [PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback Sowmini Varadhan
  2 siblings, 0 replies; 4+ messages in thread
From: Sowmini Varadhan @ 2018-10-22 15:23 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo

We want to be able to set up the monitoring application so that it can
be aysnchronously notified when "interesting" events happen, e.g., when
application-determined thresholds on parameters like RTT estimate, number
of retransmissions, RTO are reached.

The bpf_sock_ops infrastructure provided as part of Commit 40304b2a1567
("bpf: BPF support for sock_ops") provides an elegant way to trigger
this asynchronous notification. The BPF program can examine the
current TCP state reported in the bpf_sock_ops and conditionally
return a (new) status BPF_TCP_INFO_NOTIFY. The return status is used
by the caller to queue up a tcp_info notification for the application.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/net/tcp.h        |   15 +++++++++++++--
 include/uapi/linux/bpf.h |    4 ++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0d29292..df06a9f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -47,6 +47,7 @@
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/sock_diag.h>
 
 extern struct inet_hashinfo tcp_hashinfo;
 
@@ -2065,6 +2066,12 @@ struct tcp_ulp_ops {
 	__MODULE_INFO(alias, alias_userspace, name);		\
 	__MODULE_INFO(alias, alias_tcp_ulp, "tcp-ulp-" name)
 
+#define	TCPDIAG_CB(sk)							\
+do {									\
+	if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk)))	\
+		sock_diag_broadcast(sk);				\
+} while (0)
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
@@ -2088,9 +2095,13 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 		memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
 	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
-	if (ret == 0)
+	if (ret == 0) {
 		ret = sock_ops.reply;
-	else
+
+		/* XXX would be nice if we could use replylong[1] here */
+		if (ret == BPF_TCP_INFO_NOTIFY)
+			TCPDIAG_CB(sk);
+	} else
 		ret = -1;
 	return ret;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa5ccd2..bc45e5e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2678,6 +2678,10 @@ enum {
 	BPF_TCP_MAX_STATES	/* Leave at the end! */
 };
 
+enum {
+	BPF_TCP_INFO_NOTIFY = 2
+};
+
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback
  2018-10-22 15:23 [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters Sowmini Varadhan
  2018-10-22 15:23 ` [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code Sowmini Varadhan
  2018-10-22 15:23 ` [PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support Sowmini Varadhan
@ 2018-10-22 15:24 ` Sowmini Varadhan
  2 siblings, 0 replies; 4+ messages in thread
From: Sowmini Varadhan @ 2018-10-22 15:24 UTC (permalink / raw)
  To: sowmini.varadhan, netdev; +Cc: edumazet, brakmo

Simple Proof-Of-Concept test program for BPF_TCP_INFO_NOTIFY
(will move this to testing/selftests/net later)

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |    1 +
 samples/bpf/tcp_notify_kern.c |   73 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 0 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..d937bd2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -152,6 +152,7 @@ always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
+always += tcp_notify_kern.o
 always += tcp_basertt_kern.o
 always += tcp_tos_reflect_kern.o
 always += xdp_redirect_kern.o
diff --git a/samples/bpf/tcp_notify_kern.c b/samples/bpf/tcp_notify_kern.c
new file mode 100644
index 0000000..bc4efd8
--- /dev/null
+++ b/samples/bpf/tcp_notify_kern.c
@@ -0,0 +1,73 @@
+/* Sample BPF program to demonstrate how to triger async tcp_info
+ * notification based on thresholds determeined in the filter.
+ * The example here will trigger notification if  skops->total_retrans > 16
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 0
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+SEC("sockops")
+int bpf_tcp_info_notify(struct bpf_sock_ops *skops)
+{
+	int bufsize = 150000;
+	int to_init = 10;
+	int clamp = 100;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if neither port numberis 5001 (default iperf port)
+	 */
+	if (bpf_ntohl(skops->remote_port) != 5001 &&
+	    skops->local_port != 5001) {
+		skops->reply = -1;
+		return 0;
+	}
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_printk("BPF command: %d\n", op);
+#endif
+
+	switch (op) {
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		bpf_sock_ops_cb_flags_set(skops,
+			(BPF_SOCK_OPS_RETRANS_CB_FLAG|
+			BPF_SOCK_OPS_RTO_CB_FLAG));
+		rv = 1;
+		break;
+	case BPF_SOCK_OPS_RETRANS_CB:
+	case BPF_SOCK_OPS_RTO_CB:
+		if (skops->total_retrans < 16)
+			rv = 1; /* skip */
+		else
+			rv = BPF_TCP_INFO_NOTIFY;
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_printk("Returning %d\n", rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-10-22 23:48 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-10-22 15:23 [PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters Sowmini Varadhan
2018-10-22 15:23 ` [PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code Sowmini Varadhan
2018-10-22 15:23 ` [PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support Sowmini Varadhan
2018-10-22 15:24 ` [PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback Sowmini Varadhan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.