Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] net: tls: add socket diag
From: Davide Caratti @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Boris Pismenny, Jakub Kicinski, John Fastabend, Dave Watson,
	Aviad Yehezkel, David S. Miller, netdev

The current kernel does not provide any diagnostic tool, except
getsockopt(TCP_ULP), to know more about TCP sockets that have an upper
layer protocol (ULP) on top of them. This series extends the set of
information exported by INET_DIAG_INFO, to include data that are specific
to the ULP (and that might be meaningful for debug/testing purposes).

patch 1/3 ensures that the control plane reads/updates ULP specific data
using RCU.

patch 2/3 extends INET_DIAG_INFO and allows knowing the ULP name for
each TCP socket that has done setsockopt(TCP_ULP) successfully.

patch 3/3 extends kTLS to let programs like 'ss' know the protocol
version and the cipher in use.

Changes since RFC:
- some coding style fixes, thanks to Jakub Kicinski
- add X_UNSPEC as lowest value of uAPI enums, thanks to Jakub Kicinski
- fix assignment of struct nlattr *start, thanks to Jakub Kicinski
- let tls dump RXCONF and TXCONF, suggested by Jakub Kicinski
- don't dump anything if TLS version or cipher are 0 (but still return a
  constant size in get_aux_size()), thanks to Boris Pismenny
- constify first argument of get_info() and get_size()
- use RCU to access access ulp_ops, like it's done for ca_ops
- add patch 1/3, from Jakub Kicinski

Davide Caratti (2):
  tcp: ulp: add functions to dump ulp-specific information
  net: tls: export protocol version, cipher, tx_conf/rx_conf to socket
    diag

Jakub Kicinski (1):
  net/tls: use RCU protection on icsk->icsk_ulp_data

 include/net/inet_connection_sock.h |  2 +-
 include/net/tcp.h                  |  3 ++
 include/net/tls.h                  | 28 +++++++++-
 include/uapi/linux/inet_diag.h     |  9 ++++
 include/uapi/linux/tls.h           | 15 ++++++
 net/core/sock_map.c                |  2 +-
 net/ipv4/tcp_diag.c                | 56 ++++++++++++++++++-
 net/tls/tls_device.c               |  2 +-
 net/tls/tls_main.c                 | 87 +++++++++++++++++++++++++++---
 9 files changed, 191 insertions(+), 13 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH net-next 1/3] net/tls: use RCU protection on icsk->icsk_ulp_data
From: Davide Caratti @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Boris Pismenny, Jakub Kicinski, John Fastabend, Dave Watson,
	Aviad Yehezkel, David S. Miller, netdev
In-Reply-To: <cover.1565882584.git.dcaratti@redhat.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

We need to make sure context does not get freed while diag
code is interrogating it. Free struct tls_context with
kfree_rcu().

We add the __rcu annotation directly in icsk, and cast it
away in the datapath accessor. Presumably all ULPs will
do a similar thing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/inet_connection_sock.h |  2 +-
 include/net/tls.h                  |  9 +++++++--
 net/core/sock_map.c                |  2 +-
 net/tls/tls_device.c               |  2 +-
 net/tls/tls_main.c                 | 31 +++++++++++++++++++++++-------
 5 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c57d53e7e02c..895546058a20 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -97,7 +97,7 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
-	void			  *icsk_ulp_data;
+	void __rcu		  *icsk_ulp_data;
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
diff --git a/include/net/tls.h b/include/net/tls.h
index 41b2d41bb1b8..4997742475cd 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -41,6 +41,7 @@
 #include <linux/tcp.h>
 #include <linux/skmsg.h>
 #include <linux/netdevice.h>
+#include <linux/rcupdate.h>
 
 #include <net/tcp.h>
 #include <net/strparser.h>
@@ -290,6 +291,7 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+	struct rcu_head rcu;
 };
 
 enum tls_offload_ctx_dir {
@@ -348,7 +350,7 @@ struct tls_offload_context_rx {
 #define TLS_OFFLOAD_CONTEXT_SIZE_RX					\
 	(sizeof(struct tls_offload_context_rx) + TLS_DRIVER_STATE_SIZE_RX)
 
-void tls_ctx_free(struct tls_context *ctx);
+void tls_ctx_free(struct sock *sk, struct tls_context *ctx);
 int wait_on_pending_writer(struct sock *sk, long *timeo);
 int tls_sk_query(struct sock *sk, int optname, char __user *optval,
 		int __user *optlen);
@@ -467,7 +469,10 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	return icsk->icsk_ulp_data;
+	/* Use RCU on icsk_ulp_data only for sock diag code,
+	 * TLS data path doesn't need rcu_dereference().
+	 */
+	return (__force void *)icsk->icsk_ulp_data;
 }
 
 static inline void tls_advance_record_sn(struct sock *sk,
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 1330a7442e5b..01998860afaa 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -345,7 +345,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
-	if (unlikely(icsk->icsk_ulp_data))
+	if (unlikely(rcu_access_pointer(icsk->icsk_ulp_data)))
 		return -EINVAL;
 
 	link = sk_psock_init_link();
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index d184230665eb..436df5b4bb60 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -61,7 +61,7 @@ static void tls_device_free_ctx(struct tls_context *ctx)
 	if (ctx->rx_conf == TLS_HW)
 		kfree(tls_offload_ctx_rx(ctx));
 
-	tls_ctx_free(ctx);
+	tls_ctx_free(NULL, ctx);
 }
 
 static void tls_device_gc_task(struct work_struct *work)
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 9cbbae606ced..04829bef514c 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,14 +251,31 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
-void tls_ctx_free(struct tls_context *ctx)
+/**
+ * tls_ctx_free() - free TLS ULP context
+ * @sk:  socket to with @ctx is attached
+ * @ctx: TLS context structure
+ *
+ * Free TLS context. If @sk is %NULL caller guarantees that the socket
+ * to which @ctx was attached has no outstanding references.
+ */
+void tls_ctx_free(struct sock *sk, struct tls_context *ctx)
 {
+	struct inet_connection_sock *icsk;
+
 	if (!ctx)
 		return;
 
 	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
 	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
-	kfree(ctx);
+
+	if (sk) {
+		icsk = inet_csk(sk);
+		rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
+		kfree_rcu(ctx, rcu);
+	} else {
+		kfree(ctx);
+	}
 }
 
 static void tls_sk_proto_cleanup(struct sock *sk,
@@ -306,7 +323,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 
 	write_lock_bh(&sk->sk_callback_lock);
 	if (free_ctx)
-		icsk->icsk_ulp_data = NULL;
+		rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
 	sk->sk_prot = ctx->sk_proto;
 	write_unlock_bh(&sk->sk_callback_lock);
 	release_sock(sk);
@@ -319,7 +336,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	ctx->sk_proto_close(sk, timeout);
 
 	if (free_ctx)
-		tls_ctx_free(ctx);
+		tls_ctx_free(sk, ctx);
 }
 
 static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval,
@@ -608,7 +625,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	if (!ctx)
 		return NULL;
 
-	icsk->icsk_ulp_data = ctx;
+	rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
@@ -649,8 +666,8 @@ static void tls_hw_sk_destruct(struct sock *sk)
 
 	ctx->sk_destruct(sk);
 	/* Free ctx */
-	tls_ctx_free(ctx);
-	icsk->icsk_ulp_data = NULL;
+	tls_ctx_free(sk, ctx);
+	rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
 }
 
 static int tls_hw_prot(struct sock *sk)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next 2/3] tcp: ulp: add functions to dump ulp-specific information
From: Davide Caratti @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Boris Pismenny, Jakub Kicinski, John Fastabend, Dave Watson,
	Aviad Yehezkel, David S. Miller, netdev
In-Reply-To: <cover.1565882584.git.dcaratti@redhat.com>

currently, only getsockopt(TCP_ULP) can be invoked to know if a ULP is on
top of a TCP socket. Extend idiag_get_aux() and idiag_get_aux_size(),
introduced by commit b37e88407c1d ("inet_diag: allow protocols to provide
additional data"), to report the ULP name and other information that can
be made available by the ULP through optional functions.

Users having CAP_NET_ADMIN privileges will then be able to retrieve this
information through inet_diag_handler, if they specify INET_DIAG_INFO in
the request.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/tcp.h              |  3 ++
 include/uapi/linux/inet_diag.h |  8 +++++
 net/ipv4/tcp_diag.c            | 56 +++++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 77fe87f7a992..c9a3f9688223 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2122,6 +2122,9 @@ struct tcp_ulp_ops {
 	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
+	/* diagnostic */
+	int (*get_info)(const struct sock *sk, struct sk_buff *skb);
+	size_t (*get_info_size)(const struct sock *sk);
 
 	char		name[TCP_ULP_NAME_MAX];
 	struct module	*owner;
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index e8baca85bac6..e2c6273274f3 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -153,11 +153,19 @@ enum {
 	INET_DIAG_BBRINFO,	/* request as INET_DIAG_VEGASINFO */
 	INET_DIAG_CLASS_ID,	/* request as INET_DIAG_TCLASS */
 	INET_DIAG_MD5SIG,
+	INET_DIAG_ULP_INFO,
 	__INET_DIAG_MAX,
 };
 
 #define INET_DIAG_MAX (__INET_DIAG_MAX - 1)
 
+enum {
+	INET_ULP_INFO_UNSPEC,
+	INET_ULP_INFO_NAME,
+	__INET_ULP_INFO_MAX,
+};
+#define INET_ULP_INFO_MAX (__INET_ULP_INFO_MAX - 1)
+
 /* INET_DIAG_MEM */
 
 struct inet_diag_meminfo {
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a3a386236d93..1cec262ac8eb 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -81,13 +81,42 @@ static int tcp_diag_put_md5sig(struct sk_buff *skb,
 }
 #endif
 
+static int tcp_diag_put_ulp(struct sk_buff *skb, struct sock *sk,
+			    const struct tcp_ulp_ops *ulp_ops)
+{
+	struct nlattr *nest;
+	int err;
+
+	nest = nla_nest_start_noflag(skb, INET_DIAG_ULP_INFO);
+	if (!nest)
+		return -EMSGSIZE;
+
+	err = nla_put_string(skb, INET_ULP_INFO_NAME, ulp_ops->name);
+	if (err)
+		goto nla_failure;
+
+	if (ulp_ops->get_info)
+		err = ulp_ops->get_info(sk, skb);
+	if (err)
+		goto nla_failure;
+
+	nla_nest_end(skb, nest);
+	return 0;
+
+nla_failure:
+	nla_nest_cancel(skb, nest);
+	return err;
+}
+
 static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
 			    struct sk_buff *skb)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	int err = 0;
+
 #ifdef CONFIG_TCP_MD5SIG
 	if (net_admin) {
 		struct tcp_md5sig_info *md5sig;
-		int err = 0;
 
 		rcu_read_lock();
 		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
@@ -99,11 +128,23 @@ static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
 	}
 #endif
 
+	if (net_admin) {
+		const struct tcp_ulp_ops *ulp_ops;
+
+		rcu_read_lock();
+		ulp_ops = icsk->icsk_ulp_ops;
+		if (ulp_ops)
+			err = tcp_diag_put_ulp(skb, sk, ulp_ops);
+		rcu_read_unlock();
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
 static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	size_t size = 0;
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -124,6 +165,19 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 	}
 #endif
 
+	if (net_admin) {
+		const struct tcp_ulp_ops *ulp_ops;
+
+		rcu_read_lock();
+		ulp_ops = icsk->icsk_ulp_ops;
+		if (ulp_ops) {
+			size += nla_total_size(0) +
+				nla_total_size(TCP_ULP_NAME_MAX);
+			if (ulp_ops->get_info_size)
+				size += ulp_ops->get_info_size(sk);
+		}
+		rcu_read_unlock();
+	}
 	return size;
 }
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next 3/3] net: tls: export protocol version, cipher, tx_conf/rx_conf to socket diag
From: Davide Caratti @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Boris Pismenny, Jakub Kicinski, John Fastabend, Dave Watson,
	Aviad Yehezkel, David S. Miller, netdev
In-Reply-To: <cover.1565882584.git.dcaratti@redhat.com>

When an application configures kernel TLS on top of a TCP socket, it's
now possible for inet_diag_handler() to collect information regarding the
protocol version, the cipher type and TX / RX configuration, in case
INET_DIAG_INFO is requested.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/tls.h              | 19 ++++++++++++
 include/uapi/linux/inet_diag.h |  1 +
 include/uapi/linux/tls.h       | 15 +++++++++
 net/tls/tls_main.c             | 56 ++++++++++++++++++++++++++++++++++
 4 files changed, 91 insertions(+)

diff --git a/include/net/tls.h b/include/net/tls.h
index 4997742475cd..990f1d9182a3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -431,6 +431,25 @@ static inline bool is_tx_ready(struct tls_sw_context_tx *ctx)
 	return READ_ONCE(rec->tx_ready);
 }
 
+static inline u16 tls_user_config(struct tls_context *ctx, bool tx)
+{
+	u16 config = tx ? ctx->tx_conf : ctx->rx_conf;
+
+	switch (config) {
+	case TLS_BASE:
+		return TLS_CONF_BASE;
+	case TLS_SW:
+		return TLS_CONF_SW;
+#ifdef CONFIG_TLS_DEVICE
+	case TLS_HW:
+		return TLS_CONF_HW;
+#endif
+	case TLS_HW_RECORD:
+		return TLS_CONF_HW_RECORD;
+	}
+	return 0;
+}
+
 struct sk_buff *
 tls_validate_xmit_skb(struct sock *sk, struct net_device *dev,
 		      struct sk_buff *skb);
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index e2c6273274f3..a1ff345b3f33 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -162,6 +162,7 @@ enum {
 enum {
 	INET_ULP_INFO_UNSPEC,
 	INET_ULP_INFO_NAME,
+	INET_ULP_INFO_TLS,
 	__INET_ULP_INFO_MAX,
 };
 #define INET_ULP_INFO_MAX (__INET_ULP_INFO_MAX - 1)
diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index 5b9c26753e46..bcd2869ed472 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -109,4 +109,19 @@ struct tls12_crypto_info_aes_ccm_128 {
 	unsigned char rec_seq[TLS_CIPHER_AES_CCM_128_REC_SEQ_SIZE];
 };
 
+enum {
+	TLS_INFO_UNSPEC,
+	TLS_INFO_VERSION,
+	TLS_INFO_CIPHER,
+	TLS_INFO_TXCONF,
+	TLS_INFO_RXCONF,
+	__TLS_INFO_MAX,
+};
+#define TLS_INFO_MAX (__TLS_INFO_MAX - 1)
+
+#define TLS_CONF_BASE 1
+#define TLS_CONF_SW 2
+#define TLS_CONF_HW 3
+#define TLS_CONF_HW_RECORD 4
+
 #endif /* _UAPI_LINUX_TLS_H */
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 04829bef514c..957d937c72d2 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -39,6 +39,7 @@
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/inetdevice.h>
+#include <linux/inet_diag.h>
 
 #include <net/tls.h>
 
@@ -838,6 +839,59 @@ static void tls_update(struct sock *sk, struct proto *p)
 	}
 }
 
+static int tls_get_info(const struct sock *sk, struct sk_buff *skb)
+{
+	struct tls_context *ctx = tls_get_ctx(sk);
+	u16 version, cipher_type;
+	struct nlattr *start;
+	int err;
+
+	start = nla_nest_start_noflag(skb, INET_ULP_INFO_TLS);
+	if (!start)
+		return -EMSGSIZE;
+
+	version = ctx->prot_info.version;
+	if (version) {
+		err = nla_put_u16(skb, TLS_INFO_VERSION, version);
+		if (err)
+			goto nla_failure;
+	}
+	cipher_type = ctx->prot_info.cipher_type;
+	if (cipher_type) {
+		err = nla_put_u16(skb, TLS_INFO_CIPHER, cipher_type);
+		if (err)
+			goto nla_failure;
+	}
+	err = nla_put_u16(skb, TLS_INFO_TXCONF, tls_user_config(ctx, true));
+	if (err)
+		goto nla_failure;
+
+	err = nla_put_u16(skb, TLS_INFO_RXCONF, tls_user_config(ctx, false));
+	if (err)
+		goto nla_failure;
+
+	nla_nest_end(skb, start);
+	return 0;
+
+nla_failure:
+	nla_nest_cancel(skb, start);
+	return err;
+}
+
+static size_t tls_get_info_size(const struct sock *sk)
+{
+	size_t size = 0;
+
+	size += nla_total_size(0) +		/* INET_ULP_INFO_TLS */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_VERSION */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_CIPHER */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_RXCONF */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_TXCONF */
+		0;
+
+	return size;
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -859,6 +913,8 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
 	.update			= tls_update,
+	.get_info		= tls_get_info,
+	.get_info_size		= tls_get_info_size,
 };
 
 static int __init tls_register(void)
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net 1/2] igb: Enable media autosense for the i350.
From: Jeff Kirsher @ 2019-08-15 16:01 UTC (permalink / raw)
  To: Manfred Rudigier, davem; +Cc: carolyn.wyborny, todd.fujinaka, netdev
In-Reply-To: <f50fd188-fe43-4bd7-aaa4-4c1c8cb022c3@EXC04-ATKLA.omicron.at>

[-- Attachment #1: Type: text/plain, Size: 770 bytes --]

On Wed, 2019-08-14 at 13:59 +0200, Manfred Rudigier wrote:
> This patch enables the hardware feature "Media Auto Sense" also on
> the
> i350. It works in the same way as on the 82850 devices. Hardware
> designs
> using dual PHYs (fiber/copper) can enable this feature by setting the
> MAS
> enable bits in the NVM_COMPAT register (0x03) in the EEPROM.
> 
> Signed-off-by: Manfred Rudigier <manfred.rudigier@omicronenergy.com>
> ---
>  drivers/net/ethernet/intel/igb/e1000_82575.c | 2 +-
>  drivers/net/ethernet/intel/igb/igb_main.c    | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

I will get this 2 patch series sent to intel-wired-lan@lists.osuosl.org
  list so that we can get these patches into review and test for
upstream inclusion.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v2 1/4] RDS: limit the number of times we loop in rds_send_xmit
From: santosh.shilimkar @ 2019-08-15 16:16 UTC (permalink / raw)
  To: Gerd Rausch, netdev, linux-rdma, rds-devel; +Cc: David Miller
In-Reply-To: <90b76f24-d799-5362-df53-19102d781e3e@oracle.com>

On 8/15/19 7:42 AM, Gerd Rausch wrote:
> From: Chris Mason <chris.mason@oracle.com>
> Date: Fri, 3 Feb 2012 11:07:54 -0500
> 
> This will kick the RDS worker thread if we have been looping
> too long.
> 
> Original commit from 2012 updated to include a change by
> Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
> that triggers "must_wake" if "rds_ib_recv_refill_one" fails.
> 
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> ---
Thought I acked V1 series.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] RDS: don't use GFP_ATOMIC for sk_alloc in rds_create
From: santosh.shilimkar @ 2019-08-15 16:17 UTC (permalink / raw)
  To: Gerd Rausch, netdev, linux-rdma, rds-devel; +Cc: David Miller
In-Reply-To: <31c65073-0a9a-28b5-eb73-4ec784b0393e@oracle.com>

On 8/15/19 7:42 AM, Gerd Rausch wrote:
> From: Chris Mason <chris.mason@oracle.com>
> Date: Fri, 3 Feb 2012 11:08:51 -0500
> 
> Signed-off-by: Chris Mason <chris.mason@oracle.com>
> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net/rds: Add a few missing rds_stat_names entries
From: santosh.shilimkar @ 2019-08-15 16:17 UTC (permalink / raw)
  To: Gerd Rausch, netdev, linux-rdma, rds-devel; +Cc: David Miller
In-Reply-To: <2d604055-a49e-637f-a1e6-afefa8482316@oracle.com>

On 8/15/19 7:42 AM, Gerd Rausch wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
> Date: Thu, 11 Jul 2019 12:15:50 -0700
> 
> In a previous commit, fields were added to "struct rds_statistics"
> but array "rds_stat_names" was not updated accordingly.
> 
> Please note the inconsistent naming of the string representations
> that is done in the name of compatibility
> with the Oracle internal code-base.
> 
> s_recv_bytes_added_to_socket     -> "recv_bytes_added_to_sock"
> s_recv_bytes_removed_from_socket -> "recv_bytes_freed_fromsock"
> 
> Fixes: 192a798f5299 ("RDS: add stat for socket recv memory usage")
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: [PATCH net-next v2 4/4] rds: check for excessive looping in rds_send_xmit
From: santosh.shilimkar @ 2019-08-15 16:18 UTC (permalink / raw)
  To: Gerd Rausch, netdev, linux-rdma, rds-devel; +Cc: David Miller
In-Reply-To: <d91e3273-48bb-13bf-af65-40472890f975@oracle.com>

On 8/15/19 7:43 AM, Gerd Rausch wrote:
> From: Andy Grover <andy.grover@oracle.com>
> Date: Thu, 13 Jan 2011 11:40:31 -0800
> 
> Original commit from 2011 updated to include a change by
> Yuval Shaia <yuval.shaia@oracle.com>
> that adds a new statistic counter "send_stuck_rm"
> to capture the messages looping exessively
> in the send path.
> 
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Samudrala, Sridhar @ 2019-08-15 16:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, magnus.karlsson, bjorn.topel,
	netdev, bpf, intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <87ftm2adi2.fsf@toke.dk>


On 8/15/2019 4:12 AM, Toke Høiland-Jørgensen wrote:
> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
> 
>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>> during the bind() call of an AF_XDP socket to skip calling the BPF
>> program in the receive path and pass the buffer directly to the socket.
>>
>> When a single AF_XDP socket is associated with a queue and a HW
>> filter is used to redirect the packets and the app is interested in
>> receiving all the packets on that queue, we don't need an additional
>> BPF program to do further filtering or lookup/redirect to a socket.
>>
>> Here are some performance numbers collected on
>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>    - Intel 40Gb Ethernet NIC (i40e)
>>
>> All tests use 2 cores and the results are in Mpps.
>>
>> turbo on (default)
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           21.9         38.5
>> l2fwd  zerocopy           17.0         20.5
>> rxdrop copy               11.1         13.3
>> l2fwd  copy                1.9          2.0
>>
>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>> ---------------------------------------------	
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------	
>> rxdrop zerocopy           15.4         29.0
>> l2fwd  zerocopy           11.8         18.2
>> rxdrop copy                8.2         10.5
>> l2fwd  copy                1.7          1.7
>> ---------------------------------------------
> 
> You're getting this performance boost by adding more code in the fast
> path for every XDP program; so what's the performance impact of that for
> cases where we do run an eBPF program?

The no-skip-bpf results are pretty close to what i see before the 
patches are applied. As umem is cached in rx_ring for zerocopy the 
overhead is much smaller compared to the copy scenario where i am 
currently calling xdp_get_umem_from_qid().

> 
> Also, this is basically a special-casing of a particular deployment
> scenario. Without a way to control RX queue assignment and traffic
> steering, you're basically hard-coding a particular app's takeover of
> the network interface; I'm not sure that is such a good idea...

Yes. This is mainly targeted for application that create 1 AF_XDP socket 
per RX queue and can use a HW filter (via ethtool or TC flower) to 
redirect the packets to a queue or a group of queues.

> 
> -Toke
> 

^ permalink raw reply

* Re: [PATCH net-next 1/1] Added BASE-T1 PHY support to PHY Subsystem
From: Heiner Kallweit @ 2019-08-15 16:34 UTC (permalink / raw)
  To: Andrew Lunn, Christian Herber
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Florian Fainelli
In-Reply-To: <20190815155613.GE15291@lunn.ch>

On 15.08.2019 17:56, Andrew Lunn wrote:
> On Thu, Aug 15, 2019 at 03:32:29PM +0000, Christian Herber wrote:
>> BASE-T1 is a category of Ethernet PHYs.
>> They use a single copper pair for transmission.
>> This patch add basic support for this category of PHYs.
>> It coveres the discovery of abilities and basic configuration.
>> It includes setting fixed speed and enabling auto-negotiation.
>> BASE-T1 devices should always Clause-45 managed.
>> Therefore, this patch extends phy-c45.c.
>> While for some functions like auto-neogtiation different registers are
>> used, the layout of these registers is the same for the used fields.
>> Thus, much of the logic of basic Clause-45 devices can be reused.
>>
>> Signed-off-by: Christian Herber <christian.herber@nxp.com>
>> ---
>>  drivers/net/phy/phy-c45.c    | 113 +++++++++++++++++++++++++++++++----
>>  drivers/net/phy/phy-core.c   |   4 +-
>>  include/uapi/linux/ethtool.h |   2 +
>>  include/uapi/linux/mdio.h    |  21 +++++++
>>  4 files changed, 129 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
>> index b9d4145781ca..9ff0b8c785de 100644
>> --- a/drivers/net/phy/phy-c45.c
>> +++ b/drivers/net/phy/phy-c45.c
>> @@ -8,13 +8,23 @@
>>  #include <linux/mii.h>
>>  #include <linux/phy.h>
>>  
>> +#define IS_100BASET1(phy) (linkmode_test_bit( \
>> +			   ETHTOOL_LINK_MODE_100baseT1_Full_BIT, \
>> +			   (phy)->supported))
>> +#define IS_1000BASET1(phy) (linkmode_test_bit( \
>> +			    ETHTOOL_LINK_MODE_1000baseT1_Full_BIT, \
>> +			    (phy)->supported))
> 
> Hi Christian
> 
> We already have the flag phydev->is_gigabit_capable. Maybe add a flag
> phydev->is_t1_capable
> 
>> +
>> +static u32 get_aneg_ctrl(struct phy_device *phydev);
>> +static u32 get_aneg_stat(struct phy_device *phydev);
> 
> No forward declarations please. Put the code in the right order so
> they are not needed.
> 
> Thanks
> 
>      Andrew
> 

For whatever reason I don't have the original mail in my netdev inbox (yet).

+	if (IS_100BASET1(phydev) || IS_1000BASET1(phydev))
+		ctrl = MDIO_AN_BT1_CTRL;

Code like this could be problematic once a PHY supports one of the T1 modes
AND normal modes. Then normal modes would be unusable.

I think this scenario isn't completely hypothetical. See the Aquantia
AQCS109 that supports normal modes and (proprietary) 1000Base-T2.

Maybe we need separate versions of the generic functions for T1.
Then it would be up to the PHY driver to decide when to use which
version.

Heiner

^ permalink raw reply

* Re: [PATCH net-next] net/rds: Add RDS6_INFO_SOCKETS and RDS6_INFO_RECV_MESSAGES options
From: santosh.shilimkar @ 2019-08-15 16:36 UTC (permalink / raw)
  To: Ka-Cheong Poon, netdev; +Cc: davem, rds-devel
In-Reply-To: <1565861803-31268-1-git-send-email-ka-cheong.poon@oracle.com>

On 8/15/19 2:36 AM, Ka-Cheong Poon wrote:
> Add support of the socket options RDS6_INFO_SOCKETS and
> RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and
> RDS_INFO_RECV_MESSAGES options respectively.  The old options work
> for IPv4 sockets only.
> 
> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
> ---
Thanks Ka-Cheong for getting this one out on list.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

^ permalink raw reply

* Re: [PATCH] arm64: do_csum: implement accelerated scalar version
From: Will Deacon @ 2019-08-15 16:46 UTC (permalink / raw)
  To: Zhangshaokun
  Cc: Robin Murphy, Ard Biesheuvel, linux-arm-kernel, netdev,
	ilias.apalodimas, huanglingyan (A), steve.capper
In-Reply-To: <440eb674-0e59-a97e-4a90-0026e2327069@hisilicon.com>

On Thu, May 16, 2019 at 11:14:35AM +0800, Zhangshaokun wrote:
> On 2019/5/15 17:47, Will Deacon wrote:
> > On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
> >> On 12/04/2019 10:52, Will Deacon wrote:
> >>> I'm waiting for Robin to come back with numbers for a C implementation.
> >>>
> >>> Robin -- did you get anywhere with that?
> >>
> >> Still not what I would call finished, but where I've got so far (besides an
> >> increasingly elaborate test rig) is as below - it still wants some unrolling
> >> in the middle to really fly (and actual testing on BE), but the worst-case
> >> performance already equals or just beats this asm version on Cortex-A53 with
> >> GCC 7 (by virtue of being alignment-insensitive and branchless except for
> >> the loop). Unfortunately, the advantage of C code being instrumentable does
> >> also come around to bite me...
> > 
> > Is there any interest from anybody in spinning a proper patch out of this?
> > Shaokun?
> 
> HiSilicon's Kunpeng920(Hi1620) benefits from do_csum optimization, if Ard and
> Robin are ok, Lingyan or I can try to do it.
> Of course, if any guy posts the patch, we are happy to test it.
> Any will be ok.

I don't mind who posts it, but Robin is super busy with SMMU stuff at the
moment so it probably makes more sense for you or Lingyan to do it.

Will

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Samudrala, Sridhar @ 2019-08-15 16:46 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <bebfb097-5357-91d8-ebc7-2f8ede392ad7@intel.com>

On 8/15/2019 5:51 AM, Björn Töpel wrote:
> On 2019-08-15 05:46, Sridhar Samudrala wrote:
>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>> during the bind() call of an AF_XDP socket to skip calling the BPF
>> program in the receive path and pass the buffer directly to the socket.
>>
>> When a single AF_XDP socket is associated with a queue and a HW
>> filter is used to redirect the packets and the app is interested in
>> receiving all the packets on that queue, we don't need an additional
>> BPF program to do further filtering or lookup/redirect to a socket.
>>
>> Here are some performance numbers collected on
>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>    - Intel 40Gb Ethernet NIC (i40e)
>>
>> All tests use 2 cores and the results are in Mpps.
>>
>> turbo on (default)
>> ---------------------------------------------
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------
>> rxdrop zerocopy           21.9         38.5
>> l2fwd  zerocopy           17.0         20.5
>> rxdrop copy               11.1         13.3
>> l2fwd  copy                1.9          2.0
>>
>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>> ---------------------------------------------
>>                        no-skip-bpf    skip-bpf
>> ---------------------------------------------
>> rxdrop zerocopy           15.4         29.0
>> l2fwd  zerocopy           11.8         18.2
>> rxdrop copy                8.2         10.5
>> l2fwd  copy                1.7          1.7
>> ---------------------------------------------
>>
> 
> This work is somewhat similar to the XDP_ATTACH work [1]. Avoiding the
> retpoline in the XDP program call is a nice performance boost! I like
> the numbers! :-) I also like the idea of adding a flag that just does
> what most AF_XDP Rx users want -- just getting all packets of a
> certain queue into the XDP sockets.
> 
> In addition to Toke's mail, I have some more concerns with the series:
> 
> * AFAIU the SKIP_BPF only works for zero-copy enabled sockets. IMO, it
>    should work for all modes (including XDP_SKB).

This patch enables SKIP_BPF for AF_XDP sockets where an XDP program is 
attached at driver level (both zerocopy and copy modes)
I tried a quick hack to see the perf benefit with generic XDP mode, but 
i didn't see any significant improvement in performance in that 
scenario. so i didn't include that mode.

> 
> * In order to work, a user still needs an XDP program running. That's
>    clunky. I'd like the behavior that if no XDP program is attached,
>    and the option is set, the packets for a that queue end up in the
>    socket. If there's an XDP program attached, the program has
>    precedence.

I think this would require more changes in the drivers to take XDP 
datapath even when there is no XDP program loaded.

> 
> * It requires changes in all drivers. Not nice, and scales badly. Try
>    making it generic (xdp_do_redirect/xdp_flush), so it Just Works for
>    all XDP capable drivers.

I tried to make this as generic as possible and make the changes to the 
driver very minimal, but could not find a way to avoid any changes at 
all to the driver. xdp_do_direct() gets called based after the call to 
bpf_prog_run_xdp() in the drivers.

> 
> Thanks for working on this!
> 
> 
> Björn
> 
> [1] 
> https://lore.kernel.org/netdev/20181207114431.18038-1-bjorn.topel@gmail.com/ 
> 
> 
> 
>> Sridhar Samudrala (5):
>>    xsk: Convert bool 'zc' field in struct xdp_umem to a u32 bitmap
>>    xsk: Introduce XDP_SKIP_BPF bind option
>>    i40e: Enable XDP_SKIP_BPF option for AF_XDP sockets
>>    ixgbe: Enable XDP_SKIP_BPF option for AF_XDP sockets
>>    xdpsock_user: Add skip_bpf option
>>
>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 22 +++++++++-
>>   drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  6 +++
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 20 ++++++++-
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 16 ++++++-
>>   include/net/xdp_sock.h                        | 21 ++++++++-
>>   include/uapi/linux/if_xdp.h                   |  1 +
>>   include/uapi/linux/xdp_diag.h                 |  1 +
>>   net/xdp/xdp_umem.c                            |  9 ++--
>>   net/xdp/xsk.c                                 | 43 ++++++++++++++++---
>>   net/xdp/xsk_diag.c                            |  5 ++-
>>   samples/bpf/xdpsock_user.c                    |  8 ++++
>>   11 files changed, 135 insertions(+), 17 deletions(-)
>>

^ permalink raw reply

* Re: [PATCH -next] btf: fix return value check in btf_vmlinux_init()
From: Andrii Nakryiko @ 2019-08-15 17:06 UTC (permalink / raw)
  To: Wei Yongjun
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, Networking, bpf, kernel-janitors
In-Reply-To: <20190815142432.101401-1-weiyongjun1@huawei.com>

On Thu, Aug 15, 2019 at 7:21 AM Wei Yongjun <weiyongjun1@huawei.com> wrote:
>
> In case of error, the function kobject_create_and_add() returns NULL
> pointer not ERR_PTR(). The IS_ERR() test in the return value check
> should be replaced with NULL test.
>
> Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
> ---


Argh... Thanks for the fix! Fix the comment below addressed:

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  kernel/bpf/sysfs_btf.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c
> index 4659349fc795..be5557deb958 100644
> --- a/kernel/bpf/sysfs_btf.c
> +++ b/kernel/bpf/sysfs_btf.c
> @@ -30,16 +30,13 @@ static struct kobject *btf_kobj;
>
>  static int __init btf_vmlinux_init(void)
>  {
> -       int err;
> -
>         if (!_binary__btf_vmlinux_bin_start)
>                 return 0;
>
>         btf_kobj = kobject_create_and_add("btf", kernel_kobj);
> -       if (IS_ERR(btf_kobj)) {
> -               err = PTR_ERR(btf_kobj);
> +       if (!btf_kobj) {
>                 btf_kobj = NULL;

This is now not necessary, please drop (and don't forget to remove {}
for this single-line if afterwards).

> -               return err;
> +               return -ENOMEM;
>         }
>
>         bin_attr_btf_vmlinux.size = _binary__btf_vmlinux_bin_end -
>
>
>

^ permalink raw reply

* Re: [PATCH bpf] tools: bpftool: close prog FD before exit on showing a single program
From: Andrii Nakryiko @ 2019-08-15 17:09 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Networking, oss-drivers
In-Reply-To: <20190815142223.2203-1-quentin.monnet@netronome.com>

On Thu, Aug 15, 2019 at 7:24 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> When showing metadata about a single program by invoking
> "bpftool prog show PROG", the file descriptor referring to the program
> is not closed before returning from the function. Let's close it.
>
> Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  tools/bpf/bpftool/prog.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 66f04a4846a5..43fdbbfe41bb 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -363,7 +363,9 @@ static int do_show(int argc, char **argv)
>                 if (fd < 0)
>                         return -1;
>
> -               return show_prog(fd);
> +               err = show_prog(fd);
> +               close(fd);
> +               return err;

There is a similar problem few lines above for special case of argc ==
2, which you didn't fix.
Would it be better to make show_prog(fd) close provided fd instead or
is it used in some other context where FD should live longer (I
haven't checked, sorry)?

>         }
>
>         if (argc)
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] Add support for SKIP_BPF flag for AF_XDP sockets
From: Toke Høiland-Jørgensen @ 2019-08-15 17:11 UTC (permalink / raw)
  To: Samudrala, Sridhar, magnus.karlsson, bjorn.topel, netdev, bpf,
	intel-wired-lan, maciej.fijalkowski, tom.herbert
In-Reply-To: <b9423054-247e-8b57-ea59-42368f60ea1e@intel.com>

"Samudrala, Sridhar" <sridhar.samudrala@intel.com> writes:

> On 8/15/2019 4:12 AM, Toke Høiland-Jørgensen wrote:
>> Sridhar Samudrala <sridhar.samudrala@intel.com> writes:
>> 
>>> This patch series introduces XDP_SKIP_BPF flag that can be specified
>>> during the bind() call of an AF_XDP socket to skip calling the BPF
>>> program in the receive path and pass the buffer directly to the socket.
>>>
>>> When a single AF_XDP socket is associated with a queue and a HW
>>> filter is used to redirect the packets and the app is interested in
>>> receiving all the packets on that queue, we don't need an additional
>>> BPF program to do further filtering or lookup/redirect to a socket.
>>>
>>> Here are some performance numbers collected on
>>>    - 2 socket 28 core Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
>>>    - Intel 40Gb Ethernet NIC (i40e)
>>>
>>> All tests use 2 cores and the results are in Mpps.
>>>
>>> turbo on (default)
>>> ---------------------------------------------	
>>>                        no-skip-bpf    skip-bpf
>>> ---------------------------------------------	
>>> rxdrop zerocopy           21.9         38.5
>>> l2fwd  zerocopy           17.0         20.5
>>> rxdrop copy               11.1         13.3
>>> l2fwd  copy                1.9          2.0
>>>
>>> no turbo :  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>>> ---------------------------------------------	
>>>                        no-skip-bpf    skip-bpf
>>> ---------------------------------------------	
>>> rxdrop zerocopy           15.4         29.0
>>> l2fwd  zerocopy           11.8         18.2
>>> rxdrop copy                8.2         10.5
>>> l2fwd  copy                1.7          1.7
>>> ---------------------------------------------
>> 
>> You're getting this performance boost by adding more code in the fast
>> path for every XDP program; so what's the performance impact of that for
>> cases where we do run an eBPF program?
>
> The no-skip-bpf results are pretty close to what i see before the 
> patches are applied. As umem is cached in rx_ring for zerocopy the 
> overhead is much smaller compared to the copy scenario where i am 
> currently calling xdp_get_umem_from_qid().

I meant more for other XDP programs; what is the performance impact of
XDP_DROP, for instance?

>> Also, this is basically a special-casing of a particular deployment
>> scenario. Without a way to control RX queue assignment and traffic
>> steering, you're basically hard-coding a particular app's takeover of
>> the network interface; I'm not sure that is such a good idea...
>
> Yes. This is mainly targeted for application that create 1 AF_XDP
> socket per RX queue and can use a HW filter (via ethtool or TC flower)
> to redirect the packets to a queue or a group of queues.

Yeah, and I'd prefer it if the handling of this to be unified somehow...

-Toke

^ permalink raw reply

* Re: [patch net-next v2 2/2] selftests: netdevsim: add devlink params tests
From: Jakub Kicinski @ 2019-08-15 17:12 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190815085214.GC2273@nanopsycho>

On Thu, 15 Aug 2019 10:52:14 +0200, Jiri Pirko wrote:
> Thu, Aug 15, 2019 at 10:45:45AM CEST, jiri@resnulli.us wrote:
> >Thu, Aug 15, 2019 at 03:09:00AM CEST, jakub.kicinski@netronome.com wrote:  
> >>On Wed, 14 Aug 2019 17:26:04 +0200, Jiri Pirko wrote:  
> >>> From: Jiri Pirko <jiri@mellanox.com>
> >>> 
> >>> Test recently added netdevsim devlink param implementation.
> >>> 
> >>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>> ---
> >>> v1->v2:
> >>> -using cmd_jq helper  
> >>
> >>Still failing here :(
> >>
> >># ./devlink.sh 
> >>TEST: fw flash test                                                 [ OK ]
> >>TEST: params test                                                   [FAIL]
> >>	Failed to get test1 param value
> >>TEST: regions test                                                  [ OK ]
> >>
> >># jq --version
> >>jq-1.5-1-a5b5cbe
> >># echo '{ "a" : false }' | jq -e -r '.[]'
> >>false
> >># echo $?
> >>1  
> >
> >Odd, could you please try:
> >$ jq --version
> >jq-1.5
> >$ echo '{"param":{"netdevsim/netdevsim11":[{"name":"test1","type":"driver-specific","values":[{"cmode":"driverinit","value":"false"}]}]}}' | jq -e -r '.[][][].values[] | select(.cmode == "driverinit").value'
> >false
> >$ echo $?
> >0  
> 
> Ah, it is not the jq version, it is the iproute2 version:
> 8257e6c49cca9847e01262f6e749c6e88e2ddb72
> 
> I'll think about how to fix this.

Ah, wow, you're right! Old iproute2 works fine here, too!

> >>
> >>On another machine:
> >>
> >>$ echo '{ "a" : false }' | jq -e -r '.[]'
> >>false
> >>$ echo $?
> >>1
> >>
> >>Did you mean to drop the -e ?  


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 17:28 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <HG0x24u69mnaMFKuxHVAzHpyjwsD5-U6RpqFRua87wGWQCHg00Q8ZqPeA_5kJ9l-d6oe0cXa4HyYXMnOO0Aofp_LcPcQdG0WFV21z1MbgcE=@protonmail.ch>

On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >
> > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > by the admin, then no kernel changes at all are needed. Programs that
> > > want to reduce their own privileges can easily fork() a privileged
> > > subprocess or run a little helper to which they delegate BPF
> > > operations. This is far more flexible than anything that will ever be
> > > in the kernel because it allows the helper to verify that the rest of
> > > the program is doing exactly what it's supposed to and restrict eBPF
> > > operations to exactly the subset that is needed. So a container
> > > manager or network manager that drops some provilege could have a
> > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > configuration. The two processes would talk over a socketpair.
> >
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> >
> > > The interesting cases you're talking about really do involved
> > > unprivileged or less privileged eBPF, though. Let's see:
> > > systemd --user: systemd --user is not privileged at all. There's no
> > > issue of reducing privilege, since systemd --user doesn't have any
> > > privilege to begin with. But systemd supports some eBPF features, and
> > > presumably it would like to support them in the systemd --user case.
> > > This is unprivileged eBPF.
> >
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> >
> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> >
> 
> systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> Granting them cap_bpf is the same as granting it to every other unprivileged user
> process. Also unprivileged user process can start systemd --user process with any
> command they like.

systemd itself is trusted. It's the same binary whether it runs as pid=1
or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
CAP_BPF is a natural step in the same direction.


^ permalink raw reply

* Re: [PATCH bpf] tools: bpftool: close prog FD before exit on showing a single program
From: Jakub Kicinski @ 2019-08-15 17:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Quentin Monnet, Alexei Starovoitov, Daniel Borkmann, bpf,
	Networking, oss-drivers
In-Reply-To: <CAEf4BzbL3K5XWSyY6BxrVeF3+3qomsYbXh67yzjyy7ApsosVBw@mail.gmail.com>

On Thu, 15 Aug 2019 10:09:38 -0700, Andrii Nakryiko wrote:
> On Thu, Aug 15, 2019 at 7:24 AM Quentin Monnet wrote:
> > When showing metadata about a single program by invoking
> > "bpftool prog show PROG", the file descriptor referring to the program
> > is not closed before returning from the function. Let's close it.
> >
> > Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
> > Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > ---
> >  tools/bpf/bpftool/prog.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index 66f04a4846a5..43fdbbfe41bb 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -363,7 +363,9 @@ static int do_show(int argc, char **argv)
> >                 if (fd < 0)
> >                         return -1;
> >
> > -               return show_prog(fd);
> > +               err = show_prog(fd);
> > +               close(fd);
> > +               return err;  
> 
> There is a similar problem few lines above for special case of argc ==
> 2, which you didn't fix.

This is the special argc == 2 case.

> Would it be better to make show_prog(fd) close provided fd instead or
> is it used in some other context where FD should live longer (I
> haven't checked, sorry)?

I think it used to close that's how the bug crept in. Other than the bug
it's fine the way it is.

^ permalink raw reply

* Re: [patch net-next v3 2/2] selftests: netdevsim: add devlink params tests
From: Jakub Kicinski @ 2019-08-15 17:33 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190815134229.8884-3-jiri@resnulli.us>

On Thu, 15 Aug 2019 15:42:29 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Test recently added netdevsim devlink param implementation.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thanks!

^ permalink raw reply

* Re: [patch net-next v4 1/2] netdevsim: implement support for devlink region and snapshots
From: Jakub Kicinski @ 2019-08-15 17:34 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190815134634.9858-2-jiri@resnulli.us>

On Thu, 15 Aug 2019 15:46:33 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Implement dummy region of size 32K and allow user to create snapshots
> or random data using debugfs file trigger.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [patch net-next v4 2/2] selftests: netdevsim: add devlink regions tests
From: Jakub Kicinski @ 2019-08-15 17:34 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190815134634.9858-3-jiri@resnulli.us>

On Thu, 15 Aug 2019 15:46:34 +0200, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> Test netdevsim devlink region implementation.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thank you!!

^ permalink raw reply

* [PATCH] wimax/i2400m: fix a memory leak bug
From: Wenwen Wang @ 2019-08-15 18:05 UTC (permalink / raw)
  To: Wenwen Wang
  Cc: Inaky Perez-Gonzalez,
	supporter:INTEL WIRELESS WIMAX CONNECTION 2400, David S. Miller,
	open list:NETWORKING DRIVERS, open list

In i2400m_barker_db_init(), 'options_orig' is allocated through kstrdup()
to hold the original command line options. Then, the options are parsed.
However, if an error occurs during the parsing process, 'options_orig' is
not deallocated, leading to a memory leak bug. To fix this issue, free
'options_orig' before returning the error.

Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
---
 drivers/net/wimax/i2400m/fw.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wimax/i2400m/fw.c b/drivers/net/wimax/i2400m/fw.c
index e9fc168..6b36f6d 100644
--- a/drivers/net/wimax/i2400m/fw.c
+++ b/drivers/net/wimax/i2400m/fw.c
@@ -342,6 +342,7 @@ int i2400m_barker_db_init(const char *_options)
 				       "a 32-bit number\n",
 				       __func__, token);
 				result = -EINVAL;
+				kfree(options_orig);
 				goto error_parse;
 			}
 			if (barker == 0) {
@@ -350,8 +351,10 @@ int i2400m_barker_db_init(const char *_options)
 				continue;
 			}
 			result = i2400m_barker_db_add(barker);
-			if (result < 0)
+			if (result < 0) {
+				kfree(options_orig);
 				goto error_add;
+			}
 		}
 		kfree(options_orig);
 	}
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH bpf] tools: bpftool: close prog FD before exit on showing a single program
From: Andrii Nakryiko @ 2019-08-15 18:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Quentin Monnet, Alexei Starovoitov, Daniel Borkmann, bpf,
	Networking, oss-drivers
In-Reply-To: <20190815103023.0bd2c210@cakuba.netronome.com>

On Thu, Aug 15, 2019 at 10:30 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Thu, 15 Aug 2019 10:09:38 -0700, Andrii Nakryiko wrote:
> > On Thu, Aug 15, 2019 at 7:24 AM Quentin Monnet wrote:
> > > When showing metadata about a single program by invoking
> > > "bpftool prog show PROG", the file descriptor referring to the program
> > > is not closed before returning from the function. Let's close it.
> > >
> > > Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
> > > Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> > > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > > ---
> > >  tools/bpf/bpftool/prog.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > > index 66f04a4846a5..43fdbbfe41bb 100644
> > > --- a/tools/bpf/bpftool/prog.c
> > > +++ b/tools/bpf/bpftool/prog.c
> > > @@ -363,7 +363,9 @@ static int do_show(int argc, char **argv)
> > >                 if (fd < 0)
> > >                         return -1;
> > >
> > > -               return show_prog(fd);
> > > +               err = show_prog(fd);
> > > +               close(fd);
> > > +               return err;
> >
> > There is a similar problem few lines above for special case of argc ==
> > 2, which you didn't fix.
>
> This is the special argc == 2 case.

Yep, you are right, the other one already does this.

>
> > Would it be better to make show_prog(fd) close provided fd instead or
> > is it used in some other context where FD should live longer (I
> > haven't checked, sorry)?
>
> I think it used to close that's how the bug crept in. Other than the bug
> it's fine the way it is.

So are you saying that show_prog() should or should not close FD?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox