Netdev List


* Re: [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Stephen Hemminger @ 2019-07-15 20:52 UTC
  To: Leon Romanovsky
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-2-leon@kernel.org>

On Wed, 10 Jul 2019 10:24:48 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Mark Zhang <markz@mellanox.com>
> 
> Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
> Allow get default counter statistics through RDMA netlink").
> 
> Signed-off-by: Mark Zhang <markz@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

I am waiting on this until it gets to Linus's tree.


* Re: [PATCH iproute2 master 0/3] devlink dumpit fixes
From: Stephen Hemminger @ 2019-07-15 20:51 UTC
  To: Tariq Toukan; +Cc: netdev, moshe, ayal
In-Reply-To: <1562756601-19171-1-git-send-email-tariqt@mellanox.com>

On Wed, 10 Jul 2019 14:03:18 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> Hi,
> 
> This series from Aya contains several fixes for devlink health
> dump show command with binary data.
> 
> In patch 1 we replace the usage of doit with a dumpit, which
> is non-blocking and allows transferring larger amounts of data.
> 
> Patches 2 and 3 fix the output for binary data prints, for both
> json and non-json.
> 
> Series generated against master commit:
> 2eb23f3e7aaf devlink: Show devlink port number
> 
> Regards,
> Tariq
> 
> Aya Levin (3):
>   devlink: Change devlink health dump show command to dumpit
>   devlink: Fix binary values print
>   devlink: Remove enclosing array brackets binary print with json format
> 
>  devlink/devlink.c | 41 +++++++++++++++++++++--------------------
>  1 file changed, 21 insertions(+), 20 deletions(-)
> 

Applied


* [bpf PATCH v3 8/8] bpf: sockmap/tls, close can race with map free
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

When a map is freed while a socket is being closed in parallel, two
paths can reset the socket prot ops: the bpf close() path and the
map free path. This makes it unclear which prot ops the socket close
side should use.

If the map_free side completes first, we want to call the original
lowest level ops. However, if the tls path runs first, we want to
call the sockmap ops. Additionally, there was no locking around prot
updates in the TLS code paths, so the prot ops could be changed
multiple times, once from the TLS path and again from the sockmap
side, potentially leaving the ops pointing at either TLS or sockmap
after the psock and/or tls context have already been destroyed.

To fix this race, first only update the ops while holding the callback
lock so that TLS, sockmap and the lowest level all agree on the prot
state. Second, add a ULP callback, update(), so that lower layers can
inform the upper layer when they are being removed, allowing the
upper layer to reset its prot ops.
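
The dispatch described above can be sketched in plain userspace C. This is
only an illustration, not the kernel code: the *_model types and functions
are hypothetical stand-ins for struct proto, struct tcp_ulp_ops and
struct sock, and the callback lock is not modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct proto. */
struct proto_model {
	const char *name;
};

struct ulp_sock_model;

/* Hypothetical stand-in for struct tcp_ulp_ops with the new callback. */
struct ulp_ops_model {
	/* Optional: lets the ULP record the new lower-layer ops itself. */
	void (*update)(struct ulp_sock_model *sk, struct proto_model *p);
};

/* Hypothetical stand-in for struct sock plus the ULP context. */
struct ulp_sock_model {
	struct proto_model *prot;            /* ops currently installed */
	struct proto_model *ulp_saved;       /* lower-layer ops kept by the ULP */
	const struct ulp_ops_model *ulp_ops; /* NULL when no ULP is attached */
};

/* ULP side: remember the new lower layer, leave the ULP ops installed. */
void ulp_update_model(struct ulp_sock_model *sk, struct proto_model *p)
{
	sk->ulp_saved = p;
}

/* Same shape as tcp_update_ulp() in the patch: with no ULP attached the
 * socket ops are replaced directly, otherwise the ULP is told about it.
 */
void update_ulp_model(struct ulp_sock_model *sk, struct proto_model *p)
{
	if (!sk->ulp_ops) {
		sk->prot = p;
		return;
	}

	if (sk->ulp_ops->update)
		sk->ulp_ops->update(sk, p);
}
```

With a ULP attached, the socket keeps pointing at the ULP ops and only the
saved lower-layer pointer changes, which is the behaviour the update()
callback is meant to provide.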

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order, but that patch will be saved for the -next trees.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    8 +++++++-
 include/net/tcp.h     |    3 +++
 net/core/skmsg.c      |    4 ++--
 net/ipv4/tcp_ulp.c    |   13 +++++++++++++
 net/tls/tls_main.c    |   48 ++++++++++++++++++++++++++++++++++++------------
 5 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 50ced8aba9db..e4b3fb4bb77c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -354,7 +354,13 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-		sk->sk_prot = psock->sk_proto;
+		struct inet_connection_sock *icsk = inet_csk(sk);
+		bool has_ulp = !!icsk->icsk_ulp_data;
+
+		if (has_ulp)
+			tcp_update_ulp(sk, psock->sk_proto);
+		else
+			sk->sk_prot = psock->sk_proto;
 		psock->sk_proto = NULL;
 	}
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cca3c59b98bf..f4702c8b9b8c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2102,6 +2102,8 @@ struct tcp_ulp_ops {
 
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
+	/* update ulp */
+	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 
@@ -2113,6 +2115,7 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
+void tcp_update_ulp(struct sock *sk, struct proto *p);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 93bffaad2135..6832eeb4b785 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -585,12 +585,12 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	rcu_assign_sk_user_data(sk, NULL);
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
-	sk_psock_restore_proto(sk, psock);
 
 	write_lock_bh(&sk->sk_callback_lock);
+	sk_psock_restore_proto(sk, psock);
+	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.skb_parser)
 		sk_psock_stop_strp(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 3d8a1d835471..4849edb62d52 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -96,6 +96,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
+void tcp_update_ulp(struct sock *sk, struct proto *proto)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (!icsk->icsk_ulp_ops) {
+		sk->sk_prot = proto;
+		return;
+	}
+
+	if (icsk->icsk_ulp_ops->update)
+		icsk->icsk_ulp_ops->update(sk, proto);
+}
+
 void tcp_cleanup_ulp(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index f4cb0522fa95..e67e687f79a2 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -323,15 +323,16 @@ static void tls_sk_proto_unhash(struct sock *sk)
 	long timeo = sock_sndtimeo(sk, 0);
 	struct tls_context *ctx;
 
-	if (unlikely(!icsk->icsk_ulp_data)) {
-		if (sk->sk_prot->unhash)
-			sk->sk_prot->unhash(sk);
-	}
-
 	ctx = tls_get_ctx(sk);
 	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
 		tls_sk_proto_cleanup(sk, ctx, timeo);
+
+	write_lock_bh(&sk->sk_callback_lock);
 	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->unhash == tls_sk_proto_unhash)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
+
 	tls_ctx_free_wq(ctx);
 
 	if (ctx->unhash)
@@ -340,15 +341,17 @@ static void tls_sk_proto_unhash(struct sock *sk)
 
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
-	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tls_context *ctx = tls_get_ctx(sk);
 	long timeo = sock_sndtimeo(sk, 0);
 
+	if (unlikely(!ctx))
+		return;
+
 	if (ctx->tx_conf == TLS_SW)
 		tls_sw_cancel_work_tx(ctx);
 
 	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
 
 	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
 		goto skip_tx_cleanup;
@@ -356,17 +359,20 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
-	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
+	write_lock_bh(&sk->sk_callback_lock);
+	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->close == tls_sk_proto_close)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
 	release_sock(sk);
 	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
 		tls_sw_strparser_done(ctx);
 	if (ctx->rx_conf == TLS_SW)
 		tls_sw_free_ctx_rx(ctx);
-	sk_proto_close(sk, timeout);
-
+	ctx->sk_proto_close(sk, timeout);
 	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
 	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
@@ -833,7 +839,7 @@ static int tls_init(struct sock *sk)
 	int rc = 0;
 
 	if (tls_hw_prot(sk))
-		goto out;
+		return 0;
 
 	/* The TLS ulp is currently supported only for TCP sockets
 	 * in ESTABLISHED state.
@@ -844,22 +850,39 @@ static int tls_init(struct sock *sk)
 	if (sk->sk_state != TCP_ESTABLISHED)
 		return -ENOTSUPP;
 
+	tls_build_proto(sk);
+
 	/* allocate tls context */
+	write_lock_bh(&sk->sk_callback_lock);
 	ctx = create_ctx(sk);
 	if (!ctx) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
 	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
+	write_unlock_bh(&sk->sk_callback_lock);
 	return rc;
 }
 
+static void tls_update(struct sock *sk, struct proto *p)
+{
+	struct tls_context *ctx;
+
+	ctx = tls_get_ctx(sk);
+	if (likely(ctx)) {
+		ctx->sk_proto_close = p->close;
+		ctx->unhash = p->unhash;
+		ctx->sk_proto = p;
+	} else {
+		sk->sk_prot = p;
+	}
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -880,6 +903,7 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.name			= "tls",
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
+	.update			= tls_update,
 };
 
 static int __init tls_register(void)



* Re: [PATCH iproute2 v2] utils: don't match empty strings as prefixes
From: Stephen Hemminger @ 2019-07-15 20:49 UTC
  To: Matteo Croce; +Cc: netdev, David Ahern
In-Reply-To: <20190715180430.19902-1-mcroce@redhat.com>

On Mon, 15 Jul 2019 20:04:30 +0200
Matteo Croce <mcroce@redhat.com> wrote:

> iproute has a utility function which checks if a string is a prefix of
> another one, to allow use of abbreviated commands, e.g. 'addr' or 'a'
> instead of 'address'.
> 
> This routine unfortunately considers an empty string as a prefix
> of any pattern, leading to unexpected behaviour when an empty
> argument is passed to ip:
> 
>     # ip ''
>     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>         inet 127.0.0.1/8 scope host lo
>            valid_lft forever preferred_lft forever
>         inet6 ::1/128 scope host
>            valid_lft forever preferred_lft forever
> 
>     # tc ''
>     qdisc noqueue 0: dev lo root refcnt 2
> 
>     # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0
>     # ip addr show dev dummy0
>     6: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
>         link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff
>         inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0
>            valid_lft forever preferred_lft forever
> 
> Rewrite matches() so it takes care of an empty input, and doesn't
> scan the input strings three times: the current implementation
> does 2 strlen and a memcpy to accomplish the same task.
> 
> Signed-off-by: Matteo Croce <mcroce@redhat.com>

Thanks for following up. Applied
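
For readers following the thread, the behaviour discussed above (accept
abbreviations, reject the empty string, scan each string once) can be
sketched as below. is_nonempty_prefix() is a hypothetical helper written
for illustration; it is not the iproute2 matches() function and does not
follow its return convention.

```c
#include <stdbool.h>

/* Hypothetical helper: true when `prefix` is a non-empty prefix of
 * `string`. Both strings are walked at most once, with no strlen().
 */
bool is_nonempty_prefix(const char *prefix, const char *string)
{
	/* An empty argument must not match every command. */
	if (*prefix == '\0')
		return false;

	while (*prefix != '\0' && *prefix == *string) {
		prefix++;
		string++;
	}

	/* Matched only if the whole prefix was consumed. */
	return *prefix == '\0';
}
```

Under this sketch 'a' and 'addr' are accepted as abbreviations of
'address', while '' and 'addresses' are rejected.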


* [bpf PATCH v3 7/8] bpf: sockmap, only create entry if ulp is not already enabled
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

Sockmap does not currently support adding sockets after TLS has been
enabled. There was never a real use case for this, so it was never
added. However, we lost the test for ULP at some point, so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case, but fix the bug here.
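
As a rough userspace illustration of the check (not the kernel code), the
insert path can be modelled like this. The *_model types, MAP_MAX_ENTRIES
and map_insert_model() are hypothetical names; only the order of the
checks and the error codes follow the patch.

```c
#include <errno.h>
#include <stddef.h>

#define MAP_MAX_ENTRIES 4

/* Hypothetical stand-in for a socket; ulp_data is non-NULL once a ULP
 * such as TLS has been attached.
 */
struct tls_sock_model {
	void *ulp_data;
};

/* Hypothetical stand-in for the sockmap array. */
struct map_model {
	struct tls_sock_model *sks[MAP_MAX_ENTRIES];
};

/* Refuse to add a socket that already has a ULP enabled. */
int map_insert_model(struct map_model *map, size_t idx,
		     struct tls_sock_model *sk)
{
	if (idx >= MAP_MAX_ENTRIES)
		return -E2BIG;
	if (sk->ulp_data)
		return -EINVAL;

	map->sks[idx] = sk;
	return 0;
}
```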

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 56bcabe7c2f2..1330a7442e5b 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -334,6 +334,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 				  struct sock *sk, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_psock_link *link;
 	struct sk_psock *psock;
 	struct sock *osk;
@@ -344,6 +345,8 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
+	if (unlikely(icsk->icsk_ulp_data))
+		return -EINVAL;
 
 	link = sk_psock_init_link();
 	if (!link)



* [bpf PATCH v3 6/8] bpf: sockmap, synchronize_rcu before free'ing map
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

We need a synchronize_rcu() before freeing the sockmap because any
outstanding psock references will hold a pointer to the map, and
using that pointer after the map is gone could trigger a
use-after-free.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 28702f2e9a4a..56bcabe7c2f2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -247,6 +247,8 @@ static void sock_map_free(struct bpf_map *map)
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
 
+	synchronize_rcu();
+
 	bpf_map_area_free(stab->sks);
 	kfree(stab);
 }



* [bpf PATCH v3 5/8] bpf: sockmap, sock_map_delete needs to use xchg
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call, however, does a read, test, and NULL
store on the map entry, which can result in both paths trying to free
the map entry.

To fix this, use xchg in the TCP paths as well so we avoid having two
references to the same map entry.
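
The idea can be sketched in userspace with C11 atomics, where
atomic_exchange() plays the role of the kernel's xchg(). This is an
illustration only: entry_model, claim_slot() and delete_slot() are
hypothetical names, and the stab lock held by the real code is not
modelled.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a map entry with a reference count. */
struct entry_model {
	int refs;
};

/* Atomically take whatever is in the slot and leave NULL behind.
 * Only one caller can ever get a given non-NULL pointer back.
 */
struct entry_model *claim_slot(_Atomic(struct entry_model *) *slot)
{
	return atomic_exchange(slot, (struct entry_model *)NULL);
}

/* Returns 1 if this caller dropped the reference, 0 if another path
 * had already emptied the slot.
 */
int delete_slot(_Atomic(struct entry_model *) *slot)
{
	struct entry_model *e = claim_slot(slot);

	if (!e)
		return 0;

	e->refs--;
	return 1;
}
```

Because the read and the clear happen in one atomic step, a second caller
sees NULL instead of a stale pointer, so the reference is dropped once.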

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 52d4faeee18b..28702f2e9a4a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -276,16 +276,20 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
 	struct sock *sk;
+	int err = 0;
 
 	raw_spin_lock_bh(&stab->lock);
 	sk = *psk;
 	if (!sk_test || sk_test == sk)
-		*psk = NULL;
+		sk = xchg(psk, NULL);
+
+	if (likely(sk))
+		sock_map_unref(sk, psk);
+	else
+		err = -EINVAL;
+
 	raw_spin_unlock_bh(&stab->lock);
-	if (unlikely(!sk))
-		return -EINVAL;
-	sock_map_unref(sk, psk);
-	return 0;
+	return err;
 }
 
 static void sock_map_delete_from_link(struct bpf_map *map, struct sock *sk,



* [bpf PATCH v3 4/8] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

It is possible (via shutdown()) for TCP socks to go through the
TCP_CLOSE state via tcp_disconnect() without actually calling
tcp_close(), which would then call the tls close callback. Because of
this, a user could disconnect a socket and then put it in a LISTEN
state, which would break our assumption that sockets are always in the
ESTABLISHED state.

More directly, because close() can call unhash() and unhash is
implemented by sockmap, if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed, we call the close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler, resulting in a circular call and the crash
reported by syzbot.

To fix both of the above issues, implement the unhash() routine for TLS.

Fixes: 3c4d7559159bf ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tls.h  |    5 ++++-
 net/tls/tls_main.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 72ddd16de056..79ef7049375d 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -251,6 +251,8 @@ struct tls_context {
 	u8 tx_conf:3;
 	u8 rx_conf:3;
 
+	struct proto *sk_proto;
+
 	int (*push_pending_record)(struct sock *sk, int flags);
 	void (*sk_write_space)(struct sock *sk);
 
@@ -288,6 +290,8 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+
+	struct work_struct gc;
 };
 
 enum tls_offload_ctx_dir {
@@ -359,7 +363,6 @@ void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
-void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 9f4a9da182ae..f4cb0522fa95 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,6 +251,35 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free_deferred(struct work_struct *gc)
+{
+	struct tls_context *ctx = container_of(gc, struct tls_context, gc);
+
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
+
+	/* Ensure any remaining work items are completed. The sk will
+	 * already have lost its tls_ctx reference by the time we get
+	 * here so no xmit operation will actually be performed.
+	 */
+	tls_sw_cancel_work_tx(ctx);
+	kfree(ctx);
+}
+
+static void tls_ctx_free_wq(struct tls_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
+	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
+	INIT_WORK(&ctx->gc, tls_ctx_free_deferred);
+	schedule_work(&ctx->gc);
+}
+
 void tls_ctx_free(struct tls_context *ctx)
 {
 	if (!ctx)
@@ -288,6 +317,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
 #endif
 }
 
+static void tls_sk_proto_unhash(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+	struct tls_context *ctx;
+
+	if (unlikely(!icsk->icsk_ulp_data)) {
+		if (sk->sk_prot->unhash)
+			sk->sk_prot->unhash(sk);
+	}
+
+	ctx = tls_get_ctx(sk);
+	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
+		tls_sk_proto_cleanup(sk, ctx, timeo);
+	icsk->icsk_ulp_data = NULL;
+	tls_ctx_free_wq(ctx);
+
+	if (ctx->unhash)
+		ctx->unhash(sk);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
 	void (*sk_proto_close)(struct sock *sk, long timeout);
@@ -306,6 +356,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
+	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
@@ -611,6 +662,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
+	ctx->unhash = sk->sk_prot->unhash;
 	return ctx;
 }
 
@@ -734,20 +786,24 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_BASE][TLS_BASE].setsockopt	= tls_setsockopt;
 	prot[TLS_BASE][TLS_BASE].getsockopt	= tls_getsockopt;
 	prot[TLS_BASE][TLS_BASE].close		= tls_sk_proto_close;
+	prot[TLS_BASE][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_SW][TLS_BASE].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW][TLS_BASE].sendpage		= tls_sw_sendpage;
+	prot[TLS_SW][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_BASE][TLS_SW].recvmsg		  = tls_sw_recvmsg;
 	prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read;
 	prot[TLS_BASE][TLS_SW].close		  = tls_sk_proto_close;
+	prot[TLS_BASE][TLS_SW].unhash		  = tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE];
 	prot[TLS_SW][TLS_SW].recvmsg		= tls_sw_recvmsg;
 	prot[TLS_SW][TLS_SW].stream_memory_read	= tls_sw_stream_read;
 	prot[TLS_SW][TLS_SW].close		= tls_sk_proto_close;
+	prot[TLS_SW][TLS_SW].unhash		= tls_sk_proto_unhash;
 
 #ifdef CONFIG_TLS_DEVICE
 	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
@@ -798,6 +854,7 @@ static int tls_init(struct sock *sk)
 	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
+	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
 	return rc;



* [bpf PATCH v3 3/8] tls: remove sock unlock/lock around strp_done()
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload, make sure we don't arm the strparser until
we are sure init will be successful.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    4 ++-
 net/tls/tls_device.c |    1 -
 net/tls/tls_main.c   |   65 +++++++++++++++++++++++++-------------------------
 net/tls/tls_sw.c     |   33 ++++++++++++++++++-------
 4 files changed, 58 insertions(+), 45 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d4276cb6de53..72ddd16de056 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -107,9 +107,7 @@ struct tls_device {
 enum {
 	TLS_BASE,
 	TLS_SW,
-#ifdef CONFIG_TLS_DEVICE
 	TLS_HW,
-#endif
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -357,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
 void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
+void tls_sw_strparser_done(struct tls_context *tls_ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
@@ -365,6 +364,7 @@ void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
 bool tls_sw_stream_read(const struct sock *sk);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4d67d72f007c..7c0b2b778703 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,7 +1045,6 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
-	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index ddda422498aa..9f4a9da182ae 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -261,27 +261,9 @@ void tls_ctx_free(struct tls_context *ctx)
 	kfree(ctx);
 }
 
-static void tls_sk_proto_close(struct sock *sk, long timeout)
+static void tls_sk_proto_cleanup(struct sock *sk,
+				 struct tls_context *ctx, long timeo)
 {
-	struct tls_context *ctx = tls_get_ctx(sk);
-	long timeo = sock_sndtimeo(sk, 0);
-	void (*sk_proto_close)(struct sock *sk, long timeout);
-	bool free_ctx = false;
-
-	if (ctx->tx_conf == TLS_SW)
-		tls_sw_cancel_work_tx(ctx);
-
-	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
-
-	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
-		goto skip_tx_cleanup;
-
-	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
-		free_ctx = true;
-		goto skip_tx_cleanup;
-	}
-
 	if (unlikely(sk->sk_write_pending) &&
 	    !wait_on_pending_writer(sk, &timeo))
 		tls_handle_open_record(sk, 0);
@@ -298,27 +280,44 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	}
 
 	if (ctx->rx_conf == TLS_SW)
-		tls_sw_free_resources_rx(sk);
+		tls_sw_release_resources_rx(sk);
 
 #ifdef CONFIG_TLS_DEVICE
 	if (ctx->rx_conf == TLS_HW)
 		tls_device_offload_cleanup_rx(sk);
-
-	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW) {
-#else
-	{
 #endif
-		tls_ctx_free(ctx);
-		ctx = NULL;
-	}
+}
+
+static void tls_sk_proto_close(struct sock *sk, long timeout)
+{
+	void (*sk_proto_close)(struct sock *sk, long timeout);
+	struct tls_context *ctx = tls_get_ctx(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
+	lock_sock(sk);
+	sk_proto_close = ctx->sk_proto_close;
+
+	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
+		goto skip_tx_cleanup;
+
+	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
+		goto skip_tx_cleanup;
+
+	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
 	release_sock(sk);
+	if (ctx->rx_conf == TLS_SW || ctx->rx_conf == TLS_HW)
+		tls_sw_strparser_done(ctx);
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_free_ctx_rx(ctx);
 	sk_proto_close(sk, timeout);
-	/* free ctx for TLS_HW_RECORD, used by tcp_set_state
-	 * for sk->sk_prot->unhash [tls_hw_unhash]
-	 */
-	if (free_ctx)
+
+	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
+	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
 }
 
@@ -544,9 +543,9 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 			rc = tls_set_sw_offload(sk, ctx, 0);
 			if (rc)
 				goto err_crypto_info;
-			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
+		tls_sw_strparser_arm(sk, ctx);
 	}
 
 	if (tx)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 38c0e53c727d..ee8fef312475 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2114,25 +2114,40 @@ void tls_sw_release_resources_rx(struct sock *sk)
 		skb_queue_purge(&ctx->rx_list);
 		crypto_free_aead(ctx->aead_recv);
 		strp_stop(&ctx->strp);
-		write_lock_bh(&sk->sk_callback_lock);
-		sk->sk_data_ready = ctx->saved_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-		release_sock(sk);
-		strp_done(&ctx->strp);
-		lock_sock(sk);
+		/* If tls_sw_strparser_arm() was not called (cleanup paths)
+		 * we still want to strp_stop(), but sk->sk_data_ready was
+		 * never swapped.
+		 */
+		if (ctx->saved_data_ready) {
+			write_lock_bh(&sk->sk_callback_lock);
+			sk->sk_data_ready = ctx->saved_data_ready;
+			write_unlock_bh(&sk->sk_callback_lock);
+		}
 	}
 }
 
-void tls_sw_free_resources_rx(struct sock *sk)
+void tls_sw_strparser_done(struct tls_context *tls_ctx)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
-	tls_sw_release_resources_rx(sk);
+	strp_done(&ctx->strp);
+}
+
+void tls_sw_free_ctx_rx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
 	kfree(ctx);
 }
 
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+	struct tls_context *tls_ctx = tls_get_ctx(sk);
+
+	tls_sw_release_resources_rx(sk);
+	tls_sw_free_ctx_rx(tls_ctx);
+}
+
 /* The work handler to transmitt the encrypted records in tx_list */
 static void tx_work_handler(struct work_struct *work)
 {



* [bpf PATCH v3 2/8] tls: remove close callback sock unlock/lock around TX work flush
From: John Fastabend @ 2019-07-15 20:49 UTC
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid dropping the lock and then
reacquiring it. To simplify this we do the following:

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delayed_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items, e.g. if one happens to be
scheduled and run between when we set the SCHEDULE bit and when we
cancel the work. Then, because the SCHEDULE bit is now set, no new
work will be scheduled.
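
The interaction of the two bits can be sketched in userspace with C11
atomics. This is an illustration under assumptions, not the kernel code:
tx_model and the tx_* functions are hypothetical, the work cancellation
itself is not modelled, and tx_try_schedule() assumes that senders only
queue work when they are the ones who set the SCHEDULE bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TX_SCHEDULED (1UL << 0)
#define TX_CLOSING   (1UL << 1)

/* Hypothetical stand-in for the TX context. */
struct tx_model {
	atomic_ulong bitmask;
	int records_sent;
};

/* Close path: set CLOSING first, then SCHEDULED. The real code then
 * calls cancel_delayed_work_sync(), which is not modelled here.
 */
void tx_begin_close(struct tx_model *tx)
{
	atomic_fetch_or(&tx->bitmask, TX_CLOSING);
	atomic_fetch_or(&tx->bitmask, TX_SCHEDULED);
}

/* Sender side (assumed): true if the caller set SCHEDULED and should
 * therefore queue the work item.
 */
bool tx_try_schedule(struct tx_model *tx)
{
	return !(atomic_fetch_or(&tx->bitmask, TX_SCHEDULED) & TX_SCHEDULED);
}

/* Work handler: bail out while closing so SCHEDULED is never cleared,
 * otherwise clear SCHEDULED and transmit.
 */
bool tx_work(struct tx_model *tx)
{
	if (atomic_load(&tx->bitmask) & TX_CLOSING)
		return false;

	if (!(atomic_fetch_and(&tx->bitmask, ~TX_SCHEDULED) & TX_SCHEDULED))
		return false;

	tx->records_sent++;
	return true;
}
```

Once tx_begin_close() has run, a late work item returns early without
touching SCHEDULED, and tx_try_schedule() keeps returning false, so
nothing new is queued.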

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h  |    2 ++
 net/tls/tls_main.c |    3 +++
 net/tls/tls_sw.c   |   24 +++++++++++++++++-------
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 43f551cd508b..d4276cb6de53 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -162,6 +162,7 @@ struct tls_sw_context_tx {
 	int async_capable;
 
 #define BIT_TX_SCHEDULED	0
+#define BIT_TX_CLOSING		1
 	unsigned long tx_bitmask;
 };
 
@@ -360,6 +361,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
 void tls_sw_close(struct sock *sk, long timeout);
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 85a9d7d57b32..ddda422498aa 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -268,6 +268,9 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	void (*sk_proto_close)(struct sock *sk, long timeout);
 	bool free_ctx = false;
 
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
 	lock_sock(sk);
 	sk_proto_close = ctx->sk_proto_close;
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f58a8ffc2a9c..38c0e53c727d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2054,6 +2054,15 @@ static void tls_data_ready(struct sock *sk)
 	}
 }
 
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+
+	set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);
+	set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
+	cancel_delayed_work_sync(&ctx->tx_work.work);
+}
+
 void tls_sw_free_resources_tx(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2065,11 +2074,6 @@ void tls_sw_free_resources_tx(struct sock *sk)
 	if (atomic_read(&ctx->encrypt_pending))
 		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 
-	release_sock(sk);
-	cancel_delayed_work_sync(&ctx->tx_work.work);
-	lock_sock(sk);
-
-	/* Tx whatever records we can transmit and abandon the rest */
 	tls_tx_records(sk, -1);
 
 	/* Free up un-sent records in tx_list. First, free
@@ -2137,11 +2141,17 @@ static void tx_work_handler(struct work_struct *work)
 					       struct tx_work, work);
 	struct sock *sk = tx_work->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct tls_sw_context_tx *ctx;
 
-	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+	if (unlikely(!tls_ctx))
 		return;
 
+	ctx = tls_sw_ctx_tx(tls_ctx);
+	if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))
+		return;
+
+	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+		return;
 	lock_sock(sk);
 	tls_tx_records(sk, -1);
 	release_sock(sk);



* [bpf PATCH v3 1/8] net/tls: don't arm strparser immediately in tls_set_sw_offload()
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156322373173.18678.6003379631139659856.stgit@john-XPS-13-9370>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.
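
The split between preparing the RX context and arming it can be
pictured with a small userspace sketch. All names here (sock_sketch,
rx_ctx_sketch, rx_setup, rx_arm) are hypothetical and only mirror the
shape of the change: setup leaves the socket callback untouched, so a
failure before arming has nothing to undo.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*data_ready_fn)(void *sk);

/* Hypothetical, simplified stand-ins for struct sock and the RX context. */
struct sock_sketch {
	data_ready_fn data_ready;
};

struct rx_ctx_sketch {
	data_ready_fn saved_data_ready;
	int initialized;
};

static void default_data_ready(void *sk) { (void)sk; }
static void parser_data_ready(void *sk) { (void)sk; }

/* Phase 1: prepare the context; the socket callback is not touched. */
static int rx_setup(struct rx_ctx_sketch *rx)
{
	rx->saved_data_ready = NULL;
	rx->initialized = 1;
	return 0;
}

/* Phase 2: save the old callback and hook the parser's callback in. */
static void rx_arm(struct sock_sketch *sk, struct rx_ctx_sketch *rx)
{
	rx->saved_data_ready = sk->data_ready;
	sk->data_ready = parser_data_ready;
}
```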

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 include/net/tls.h    |    1 +
 net/tls/tls_device.c |    1 +
 net/tls/tls_main.c   |    8 +++++---
 net/tls/tls_sw.c     |   19 ++++++++++++-------
 4 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 584609174fe0..43f551cd508b 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -355,6 +355,7 @@ int tls_sk_attach(struct sock *sk, int optname, char __user *optval,
 		  unsigned int optlen);
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 7c0b2b778703..4d67d72f007c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1045,6 +1045,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 	rc = tls_set_sw_offload(sk, ctx, 0);
 	if (rc)
 		goto release_ctx;
+	tls_sw_strparser_arm(sk, ctx);
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
 					     &ctx->crypto_recv.info,
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 4674e57e66b0..85a9d7d57b32 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -526,6 +526,8 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 1);
+			if (rc)
+				goto err_crypto_info;
 			conf = TLS_SW;
 		}
 	} else {
@@ -537,13 +539,13 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 		{
 #endif
 			rc = tls_set_sw_offload(sk, ctx, 0);
+			if (rc)
+				goto err_crypto_info;
+			tls_sw_strparser_arm(sk, ctx);
 			conf = TLS_SW;
 		}
 	}
 
-	if (rc)
-		goto err_crypto_info;
-
 	if (tx)
 		ctx->tx_conf = conf;
 	else
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 53b4ad94e74a..f58a8ffc2a9c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2160,6 +2160,18 @@ void tls_sw_write_space(struct sock *sk, struct tls_context *ctx)
 	}
 }
 
+void tls_sw_strparser_arm(struct sock *sk, struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_rx *rx_ctx = tls_sw_ctx_rx(tls_ctx);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	rx_ctx->saved_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = tls_data_ready;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	strp_check_rcv(&rx_ctx->strp);
+}
+
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2357,13 +2369,6 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		cb.parse_msg = tls_read_size;
 
 		strp_init(&sw_ctx_rx->strp, sk, &cb);
-
-		write_lock_bh(&sk->sk_callback_lock);
-		sw_ctx_rx->saved_data_ready = sk->sk_data_ready;
-		sk->sk_data_ready = tls_data_ready;
-		write_unlock_bh(&sk->sk_callback_lock);
-
-		strp_check_rcv(&sw_ctx_rx->strp);
 	}
 
 	goto out;



* [bpf PATCH v3 0/8] sockmap/tls fixes
From: John Fastabend @ 2019-07-15 20:49 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf

Resolve a series of splats discovered by syzbot and an unhash
TLS issue noted by Eric Dumazet.

The main issues revolved around interaction between TLS and
sockmap tear down. TLS and sockmap could both reset sk->prot
ops creating a condition where a close or unhash op could be
called forever. A rare race condition resulting from a missing
rcu sync operation was causing a use after free. Then on the
TLS side dropping the sock lock and re-acquiring it during the
close op could hang. Finally, sockmap must be deployed before
tls for current stack assumptions to be met. This is enforced
now. A feature series can enable it.

To fix this, first refactor TLS code so the lock is held for the
entire teardown operation. Then add an unhash callback to ensure
TLS cannot transition from ESTABLISHED to LISTEN state. This
transition is a similar bug to the one found and fixed previously
in sockmap. Then apply three fixes to sockmap to fix up races
on tear down around map free and close. Finally, if sockmap
is destroyed before TLS we add a new ULP op update to inform
the TLS stack it should not call sockmap ops. This last one
appears to be the most commonly found issue from syzbot.

---

Jakub Kicinski (1):
      net/tls: don't arm strparser immediately in tls_set_sw_offload()

John Fastabend (7):
      tls: remove close callback sock unlock/lock around TX work flush
      tls: remove sock unlock/lock around strp_done()
      bpf: tls fix transition through disconnect with close
      bpf: sockmap, sock_map_delete needs to use xchg
      bpf: sockmap, synchronize_rcu before free'ing map
      bpf: sockmap, only create entry if ulp is not already enabled
      bpf: sockmap/tls, close can race with map free

 include/linux/skmsg.h |    8 ++-
 include/net/tcp.h     |    3 +
 include/net/tls.h     |   12 +++-
 net/core/skmsg.c      |    4 +
 net/core/sock_map.c   |   19 ++++--
 net/ipv4/tcp_ulp.c    |   13 ++++
 net/tls/tls_main.c    |  155 ++++++++++++++++++++++++++++++++++++++-----------
 net/tls/tls_sw.c      |   76 +++++++++++++++++-------
 8 files changed, 221 insertions(+), 69 deletions(-)

--
Signature


* Re: [PATCH iproute2] tc: util: constrain percentage in 0-100 interval
From: Stephen Hemminger @ 2019-07-15 20:48 UTC (permalink / raw)
  To: Andrea Claudi; +Cc: netdev, dsahern
In-Reply-To: <c0a9b4ce15d5389ac59fbf572f5f1b3030ec4c90.1563011008.git.aclaudi@redhat.com>

On Sat, 13 Jul 2019 11:44:07 +0200
Andrea Claudi <aclaudi@redhat.com> wrote:

> parse_percent() currently allows negative percentages
> or values above 100%. However, this does not seem to make sense,
> as the function is used for probabilities or bandwidth rates.
> 
> Moreover, using negative values leads to erroneous results
> (using Bernoulli loss model as example):
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> $ tc qdisc show dev test
> qdisc netem 800c: root refcnt 2 limit 10 loss gemodel p 90% r 10% 1-h 100% 1-k 0%
> 
> Using values above 100% we have instead:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel 140% limit 10
> $ tc qdisc show dev test
> qdisc netem 800f: root refcnt 2 limit 10 loss gemodel p 40% r 60% 1-h 100% 1-k 0%
> 
> This commit adds a check to parse_percent() to ensure
> percentage values stay between 0.0 and 1.0.
> parse_percent_rate() function, which already employs a similar
> check, is adjusted accordingly.
> 
> With this check in place, we have:
> 
> $ ip link add test type dummy
> $ ip link set test up
> $ tc qdisc add dev test root netem loss gemodel -10% limit 10
> Illegal "loss gemodel p"
> 
> Fixes: 927e3cfb52b58 ("tc: B.W limits can now be specified in %.")
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>

Looks good. Applied
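
For readers following along, the range check described in the quoted
commit message can be sketched roughly as below. This is a standalone
illustration, not the iproute2 source: the name parse_percent_sketch
and its return convention (0 on success, -1 on any error) are
assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: parse "NN" or "NN%" into a fraction in [0.0, 1.0]. */
static int parse_percent_sketch(double *val, const char *str)
{
	char *end;

	*val = strtod(str, &end) / 100.0;
	if (end == str)
		return -1;		/* no number at all */
	if (*val < 0.0 || *val > 1.0)
		return -1;		/* negative or above 100% */
	if (*end != '\0' && strcmp(end, "%") != 0)
		return -1;		/* trailing junk other than "%" */
	return 0;
}
```

With a check of this shape, inputs such as "-10%" and "140%" are
rejected up front instead of wrapping around into a different
probability.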


* Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup
From: Julian Anastasov @ 2019-07-15 20:39 UTC (permalink / raw)
  To: Haishuang Yan
  Cc: David S. Miller, Pablo Neira Ayuso, Simon Horman, netdev,
	lvs-devel, linux-kernel, netfilter-devel
In-Reply-To: <1563031186-2101-2-git-send-email-yanhaishuang@cmss.chinamobile.com>


	Hello,

On Sat, 13 Jul 2019, Haishuang Yan wrote:

> It's better to batch __ip_vs_cleanup to speedup ipvs
> connections dismantle.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
> ---
>  include/net/ip_vs.h             |  2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 29 +++++++++++++++++------------
>  net/netfilter/ipvs/ip_vs_ctl.c  | 13 ++++++++++---
>  3 files changed, 28 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 3759167..93e7a25 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1324,7 +1324,7 @@ static inline void ip_vs_control_del(struct ip_vs_conn *cp)
>  void ip_vs_control_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_sync_net_cleanup(struct netns_ipvs *ipvs);
> -void ip_vs_service_net_cleanup(struct netns_ipvs *ipvs);
> +void ip_vs_service_nets_cleanup(struct list_head *net_list);
>  
>  /* IPVS application functions
>   * (from ip_vs_app.c)
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 46f06f9..b4d79b7 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2402,18 +2402,23 @@ static int __net_init __ip_vs_init(struct net *net)
>  	return -ENOMEM;
>  }
>  
> -static void __net_exit __ip_vs_cleanup(struct net *net)
> +static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list)
>  {
> -	struct netns_ipvs *ipvs = net_ipvs(net);
> -
> -	ip_vs_service_net_cleanup(ipvs);	/* ip_vs_flush() with locks */
> -	ip_vs_conn_net_cleanup(ipvs);
> -	ip_vs_app_net_cleanup(ipvs);
> -	ip_vs_protocol_net_cleanup(ipvs);
> -	ip_vs_control_net_cleanup(ipvs);
> -	ip_vs_estimator_net_cleanup(ipvs);
> -	IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> -	net->ipvs = NULL;
> +	struct netns_ipvs *ipvs;
> +	struct net *net;
> +	LIST_HEAD(list);
> +
> +	ip_vs_service_nets_cleanup(net_list);	/* ip_vs_flush() with locks */
> +	list_for_each_entry(net, net_list, exit_list) {

	How much faster is it to replace the list_for_each_entry in
ops_exit_list() with this one? IPVS can waste time in calls
such as kthread_stop() and del_timer_sync() but I'm not sure
we can solve it easily. What gain do you see in benchmarks?

> +		ipvs = net_ipvs(net);
> +		ip_vs_conn_net_cleanup(ipvs);
> +		ip_vs_app_net_cleanup(ipvs);
> +		ip_vs_protocol_net_cleanup(ipvs);
> +		ip_vs_control_net_cleanup(ipvs);
> +		ip_vs_estimator_net_cleanup(ipvs);
> +		IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> +		net->ipvs = NULL;
> +	}
>  }

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-15 20:38 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190708175105.7zb6mikjw2wmnwln@madcap2.tricolour.ca>

On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-29 11:29, Paul Moore wrote:

...

> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).  We did consider allowing for a chain of nested audit
> > container IDs, but the implications of doing so are significant
> > (implementation mess, runtime cost, etc.) so we are leaving that out
> > of this effort.
>
> We had previously discussed the idea of restricting
> orchestrators/engines to only being able to set the audit container
> identifier on their own descendants, but it was discarded.  I've added a
> check to ensure this is now enforced.

When we weren't allowing nested orchestrators it wasn't necessary, but
with the move to support nesting I believe this will be a requirement.
We might also need/want to restrict audit container ID changes if a
descendant is acting as a container orchestrator and managing one or
more audit container IDs; although I'm less certain of the need for
this.

> I've also added a check to ensure that a process can't set its own audit
> container identifier ...

What does this protect against, or what problem does this solve?
Considering how easy it is to fork/exec, it seems like this could be
trivially bypassed.

> ... and that if the identifier is already set, then the
> orchestrator/engine must be in a descendant user namespace from the
> orchestrator that set the previously inherited audit container
> identifier.

You lost me here ... although I don't like the idea of relying on X
namespace inheritance for a hard coded policy on setting the audit
container ID; we've worked hard to keep this independent of any
definition of a "container" and it would sadden me greatly if we had
to go back on that.

-- 
paul moore
www.paul-moore.com


* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: David Ahern @ 2019-07-15 20:16 UTC (permalink / raw)
  To: Stephen Hemminger, Patel, Vedang
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri
In-Reply-To: <20190715125059.70470f9e@hermes.lan>

On 7/15/19 1:50 PM, Stephen Hemminger wrote:
> On Mon, 15 Jul 2019 19:40:19 +0000
> "Patel, Vedang" <vedang.patel@intel.com> wrote:
> 
>> Hi Stephen, 
>>
>> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
>>
>> Let me know if you need any information from me on these patches.
>>
>> Thanks,
>> Vedang Patel
> 
> 
> David Ahern handles iproute2 next
> 
> https://patchwork.ozlabs.org/patch/1111466/
> 

given the long time delay between when the iproute2 patches were posted
and when the kernel side was accepted you will need to re-send the
iproute2 patches.


* [net 2/3] net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Currently, tunnel attributes are parsed and inner header matching is used
only when flow dissector specifies match on some of the supported
encapsulation fields. When user tries to offload tc filter that doesn't
match any encapsulation fields on tunnel device, mlx5 tc layer incorrectly
sets to match packet header keys on encap header (outer header) and
firmware rejects the rule with syndrome 0x7e1579 when creating new flow
group.

Change __parse_cls_flower() to determine whether tunnel is used based on
filter_dev tunnel info, instead of determining it indirectly by checking
flow dissector enc keys.

Fixes: bbd00f7e2349 ("net/mlx5e: Add TC tunnel release action for SRIOV offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 018709a4343f..b95e0ae4d7fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1522,11 +1522,7 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 		return -EOPNOTSUPP;
 	}

-	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_PORTS) ||
-	    flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ENC_OPTS)) {
+	if (mlx5e_get_tc_tun(filter_dev)) {
 		if (parse_tunnel_attr(priv, spec, f, filter_dev, tunnel_match_level))
 			return -EOPNOTSUPP;

-- 
2.21.0


* [net 3/3] net/mlx5e: Allow dissector meta key in tc flower
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Vlad Buslov, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>

Recently, fl_flow_key->indev_ifindex int field was refactored into
flow_dissector_key_meta field. With this, flower classifier also sets
FLOW_DISSECTOR_KEY_META flow dissector key. However, mlx5 flower dissector
validation code rejects filters that use flow dissector keys that are not
supported. Add FLOW_DISSECTOR_KEY_META to the list of allowed dissector
keys in __parse_cls_flower() to prevent following error when offloading
flower classifier to mlx5:

Error: mlx5_core: Unsupported key.
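
The validation being relaxed is a plain "any bit outside the allowed
mask" test. A minimal userspace sketch, with hypothetical key numbers
and a hypothetical helper keys_supported(), shows why a filter that
carries the meta key was rejected before and is accepted once the key
is added to the mask:

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Hypothetical key indices; the real FLOW_DISSECTOR_KEY_* values differ. */
enum key_sketch { KEY_META, KEY_CONTROL, KEY_BASIC, KEY_OTHER };

/* Returns nonzero when no key outside the allowed mask is in use. */
static int keys_supported(uint32_t used_keys, uint32_t allowed)
{
	return (used_keys & ~allowed) == 0;
}
```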

Fixes: 8212ed777f40 ("net: sched: cls_flower: use flow_dissector for ingress ifindex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index b95e0ae4d7fd..cc096f6011d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1499,7 +1499,8 @@ static int __parse_cls_flower(struct mlx5e_priv *priv,
 	*match_level = MLX5_MATCH_NONE;

 	if (dissector->used_keys &
-	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	    ~(BIT(FLOW_DISSECTOR_KEY_META) |
+	      BIT(FLOW_DISSECTOR_KEY_CONTROL) |
 	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
 	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
 	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
-- 
2.21.0


* [net 1/3] net/mlx5e: Verify encapsulation is supported
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev@vger.kernel.org, Eli Cohen, Roi Dayan, Saeed Mahameed
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Eli Cohen <eli@mellanox.com>

When mlx5e_attach_encap() calls mlx5e_get_tc_tun() to get the tunnel
info data struct, check that returned value is not NULL, as would be in
the case of unsupported encapsulation.

Fixes: d386939a327d2 ("net/mlx5e: Rearrange tc tunnel code in a modular way")
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 2d6436257f9d..018709a4343f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2647,6 +2647,10 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
 	family = ip_tunnel_info_af(tun_info);
 	key.ip_tun_key = &tun_info->key;
 	key.tc_tunnel = mlx5e_get_tc_tun(mirred_dev);
+	if (!key.tc_tunnel) {
+		NL_SET_ERR_MSG_MOD(extack, "Unsupported tunnel");
+		return -EOPNOTSUPP;
+	}

 	hash_key = hash_encap_info(&key);

-- 
2.21.0


* [pull request][net 0/3] Mellanox, mlx5 fixes 2019-07-15
From: Saeed Mahameed @ 2019-07-15 20:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev@vger.kernel.org, Saeed Mahameed

Hi Dave,

This pull request provides mlx5 TC flower and tunnel fixes for kernel 5.2
from Eli and Vlad.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit f384e62a82ba5d85408405fdd6aeff89354deaa9:

  ISDN: hfcsusb: checking idx of ep configuration (2019-07-15 11:10:31 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2019-07-15

for you to fetch changes up to 3d144578c91a2db417923ba905ce7a84ce0c274b:

  net/mlx5e: Allow dissector meta key in tc flower (2019-07-15 13:04:04 -0700)

----------------------------------------------------------------
mlx5-fixes-2019-07-15

----------------------------------------------------------------
Eli Cohen (1):
      net/mlx5e: Verify encapsulation is supported

Vlad Buslov (2):
      net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
      net/mlx5e: Allow dissector meta key in tc flower

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)


* Re: [PATCH 7/9] x86/pci: Pass lockdep condition to pcm_mmcfg_list iterator (v1)
From: Bjorn Helgaas @ 2019-07-15 20:02 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Alexey Kuznetsov, Borislav Petkov, c0d1n61at3,
	David S. Miller, edumazet, Greg Kroah-Hartman, Hideaki YOSHIFUJI,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Josh Triplett,
	keescook, kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Oleg Nesterov, Paul E. McKenney, Pavel Machek,
	peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190715143705.117908-8-joel@joelfernandes.org>

On Mon, Jul 15, 2019 at 10:37:03AM -0400, Joel Fernandes (Google) wrote:
> The pci_mmcfg_list is traversed with list_for_each_entry_rcu without a
> reader-lock held, because the pci_mmcfg_lock is already held. Make this
> known to the list macro so that it fixes new lockdep warnings that
> trigger due to lockdep checks added to list_for_each_entry_rcu().
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Ingo takes care of most patches to this file, but FWIW,

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I would personally prefer if you capitalized the subject to match the
"x86/PCI:" convention that's used fairly consistently in
arch/x86/pci/.

Also, I didn't apply this to be sure, but it looks like this might
make a line or two wider than 80 columns, which I would rewrap if I
were applying this.

> ---
>  arch/x86/pci/mmconfig-shared.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
> index 7389db538c30..6fa42e9c4e6f 100644
> --- a/arch/x86/pci/mmconfig-shared.c
> +++ b/arch/x86/pci/mmconfig-shared.c
> @@ -29,6 +29,7 @@
>  static bool pci_mmcfg_running_state;
>  static bool pci_mmcfg_arch_init_failed;
>  static DEFINE_MUTEX(pci_mmcfg_lock);
> +#define pci_mmcfg_lock_held() lock_is_held(&(pci_mmcfg_lock).dep_map)
>  
>  LIST_HEAD(pci_mmcfg_list);
>  
> @@ -54,7 +55,7 @@ static void list_add_sorted(struct pci_mmcfg_region *new)
>  	struct pci_mmcfg_region *cfg;
>  
>  	/* keep list sorted by segment and starting bus number */
> -	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list) {
> +	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held()) {
>  		if (cfg->segment > new->segment ||
>  		    (cfg->segment == new->segment &&
>  		     cfg->start_bus >= new->start_bus)) {
> @@ -118,7 +119,7 @@ struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus)
>  {
>  	struct pci_mmcfg_region *cfg;
>  
> -	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list)
> +	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list, pci_mmcfg_lock_held())
>  		if (cfg->segment == segment &&
>  		    cfg->start_bus <= bus && bus <= cfg->end_bus)
>  			return cfg;
> -- 
> 2.22.0.510.g264f2c817a-goog
> 


* [PATCH V35 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-07-15 19:59 UTC (permalink / raw)
  To: jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Alexei Starovoitov, Matthew Garrett, netdev, Chun-Yi Lee,
	Daniel Borkmann
In-Reply-To: <20190715195946.223443-1-matthewgarrett@google.com>

From: David Howells <dhowells@redhat.com>

bpf_probe_read() and bpf_probe_read_str() could potentially be abused to (e.g.) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.
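
The gating pattern can be sketched in userspace as follows. The names
with a _sketch suffix are hypothetical stand-ins, the lockdown state is
a plain global, and memcpy() replaces the fault-tolerant kernel read;
the point is only that a refused read takes the same error path as a
failed one, so the destination buffer is zeroed rather than left with
stale contents.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical lockdown switch; nonzero means confidentiality mode. */
static int locked_down_sketch;

/* Stand-in for the LSM hook: 0 when allowed, negative errno when not. */
static int security_locked_down_sketch(void)
{
	return locked_down_sketch ? -EPERM : 0;
}

static int probe_read_sketch(void *dst, size_t size, const void *src)
{
	int ret;

	ret = security_locked_down_sketch();
	if (ret)
		goto out;

	memcpy(dst, src, size);	/* kernel code uses a fault-safe copy */
out:
	if (ret < 0)
		memset(dst, 0, size);	/* never hand back stale data */
	return ret;
}
```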

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/security.h     |  1 +
 kernel/trace/bpf_trace.c     | 10 ++++++++++
 security/lockdown/lockdown.c |  1 +
 3 files changed, 12 insertions(+)

diff --git a/include/linux/security.h b/include/linux/security.h
index 987d8427f091..8dd1741a52cd 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -118,6 +118,7 @@ enum lockdown_reason {
 	LOCKDOWN_INTEGRITY_MAX,
 	LOCKDOWN_KCORE,
 	LOCKDOWN_KPROBES,
+	LOCKDOWN_BPF_READ,
 	LOCKDOWN_CONFIDENTIALITY_MAX,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca1255d14576..605908da61c5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -142,7 +142,12 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret)
+		goto out;
+
 	ret = probe_kernel_read(dst, unsafe_ptr, size);
+out:
 	if (unlikely(ret < 0))
 		memset(dst, 0, size);
 
@@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 {
 	int ret;
 
+	ret = security_locked_down(LOCKDOWN_BPF_READ);
+	if (ret)
+		goto out;
+
 	/*
 	 * The strncpy_from_unsafe() call will likely not fill the entire
 	 * buffer, but that's okay in this circumstance as we're probing
@@ -579,6 +588,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
 	 * is returned that can be used for bpf_perf_event_output() et al.
 	 */
 	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+out:
 	if (unlikely(ret < 0))
 		memset(dst, 0, size);
 
diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c
index ccb3e9a2a47c..d14b89784412 100644
--- a/security/lockdown/lockdown.c
+++ b/security/lockdown/lockdown.c
@@ -33,6 +33,7 @@ static char *lockdown_reasons[LOCKDOWN_CONFIDENTIALITY_MAX+1] = {
 	[LOCKDOWN_INTEGRITY_MAX] = "integrity",
 	[LOCKDOWN_KCORE] = "/proc/kcore access",
 	[LOCKDOWN_KPROBES] = "use of kprobes",
+	[LOCKDOWN_BPF_READ] = "use of bpf to read kernel RAM",
 	[LOCKDOWN_CONFIDENTIALITY_MAX] = "confidentiality",
 };
 
-- 
2.22.0.510.g264f2c817a-goog



* Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr
From: Yonghong Song @ 2019-07-15 19:57 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org, bpf@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190715163956.204061-1-sdf@google.com>



On 7/15/19 9:39 AM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Yonghong and the
> maintainers.)
> 
> Cc: Yonghong Song <yhs@fb.com>
> 
> Stanislav Fomichev (5):
>    bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>    bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
>      msg_src_ip6
>    selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>    selftests/bpf: add selftests for wide loads
>    bpf: sync bpf.h to tools/

Thanks for fixing. Maybe getting this into bpf is better, as this is
indeed a potential issue? I do not have a strong feeling either way,
as the issue can easily be worked around with "volatile" tricks.
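
As a rough illustration of such a trick (a sketch only; struct
addr_sketch and read_first_two_words() are hypothetical and not taken
from the selftests), reading each u32 through a volatile-qualified
pointer asks the compiler to perform two separate 32-bit loads, which
in practice keeps it from fusing them into one u64 load:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context with four adjacent 32-bit words. */
struct addr_sketch {
	uint32_t user_ip6[4];
};

static uint64_t read_first_two_words(const struct addr_sketch *ctx)
{
	/* Each volatile access is emitted as its own 32-bit load. */
	uint32_t lo = *(const volatile uint32_t *)&ctx->user_ip6[0];
	uint32_t hi = *(const volatile uint32_t *)&ctx->user_ip6[1];

	return ((uint64_t)hi << 32) | lo;
}
```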

Acked-by: Yonghong Song <yhs@fb.com>

> 
>   include/linux/filter.h                        |  2 +-
>   include/uapi/linux/bpf.h                      |  4 +-
>   net/core/filter.c                             | 24 ++++--
>   tools/include/uapi/linux/bpf.h                |  4 +-
>   .../selftests/bpf/verifier/wide_access.c      | 73 +++++++++++++++++++
>   .../selftests/bpf/verifier/wide_store.c       | 36 ---------
>   6 files changed, 95 insertions(+), 48 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>   delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 


* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Stephen Hemminger @ 2019-07-15 19:50 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri, David Ahern
In-Reply-To: <0AFDC65C-2A16-47B7-96F6-F6844AF75095@intel.com>

On Mon, 15 Jul 2019 19:40:19 +0000
"Patel, Vedang" <vedang.patel@intel.com> wrote:

> Hi Stephen, 
> 
> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
> 
> Let me know if you need any information from me on these patches.
> 
> Thanks,
> Vedang Patel


David Ahern handles iproute2-next.

https://patchwork.ozlabs.org/patch/1111466/

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Patel, Vedang @ 2019-07-15 19:40 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Gomes, Vinicius,
	netdev@vger.kernel.org, Dorileo, Leandro, Jakub Kicinski,
	Murali Karicheri
In-Reply-To: <1559859735-17237-1-git-send-email-vedang.patel@intel.com>

Hi Stephen, 

The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2-related patches are on your TODO list.

Let me know if you need any information from me on these patches.

Thanks,
Vedang Patel

> On Jun 6, 2019, at 3:22 PM, Patel, Vedang <vedang.patel@intel.com> wrote:
> 
> This should only be updated after the kernel patches related to
> txtime-offload have been merged into the kernel.
> 
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
> ---
> include/uapi/linux/pkt_sched.h | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 8b2f993cbb77..c085860ff637 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -990,6 +990,7 @@ struct tc_etf_qopt {
> 	__u32 flags;
> #define TC_ETF_DEADLINE_MODE_ON	BIT(0)
> #define TC_ETF_OFFLOAD_ON	BIT(1)
> +#define TC_ETF_SKIP_SOCK_CHECK  BIT(2)
> };
> 
> enum {
> @@ -1158,6 +1159,8 @@ enum {
>  *       [TCA_TAPRIO_ATTR_SCHED_ENTRY_INTERVAL]
>  */
> 
> +#define TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST 0x1
> +
> enum {
> 	TCA_TAPRIO_ATTR_UNSPEC,
> 	TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
> @@ -1169,6 +1172,8 @@ enum {
> 	TCA_TAPRIO_ATTR_ADMIN_SCHED, /* The admin sched, only used in dump */
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
> +	TCA_TAPRIO_ATTR_FLAGS, /* u32 */
> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
> 	__TCA_TAPRIO_ATTR_MAX,
> };
> 
> -- 
> 2.7.3
> 
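
As a rough illustration of how the `tc_etf_qopt` flag bits in the quoted hunk combine, here is a minimal user-space sketch. The three `TC_ETF_*` names and bit positions come from the diff above; the local `BIT()` definition and the `etf_flag_set()` helper are assumptions made for this example and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT() macro (assumption for this sketch). */
#define BIT(n) (1U << (n))

/* Flag values as they appear in the quoted pkt_sched.h hunk. */
#define TC_ETF_DEADLINE_MODE_ON	BIT(0)
#define TC_ETF_OFFLOAD_ON	BIT(1)
#define TC_ETF_SKIP_SOCK_CHECK	BIT(2)

/* Hypothetical helper: report whether a given flag bit is set. */
static bool etf_flag_set(uint32_t flags, uint32_t flag)
{
	return (flags & flag) != 0;
}
```

Because each flag occupies its own bit, they can be ORed together into the `__u32 flags` field and tested independently, for example `etf_flag_set(TC_ETF_OFFLOAD_ON | TC_ETF_SKIP_SOCK_CHECK, TC_ETF_SKIP_SOCK_CHECK)`.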


^ permalink raw reply
