Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v3 3/7] dt-bindings: net: realtek: Add property to enable SSC
From: Matthias Kaehlcke @ 2019-07-08 19:24 UTC (permalink / raw)
  To: David S . Miller, Rob Herring, Mark Rutland, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit
  Cc: netdev, devicetree, linux-kernel, Douglas Anderson,
	Matthias Kaehlcke
In-Reply-To: <20190708192459.187984-1-mka@chromium.org>

Add the 'realtek,enable-ssc' property to enable Spread Spectrum
Clocking (SSC) on Realtek PHYs that support it.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
---
Changes in v3:
- changed wording for supported PHY models

Changes in v2:
- patch added to the series (kind of, it already existed, but now
  the binding is created by another patch)
---
 Documentation/devicetree/bindings/net/realtek.txt | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/realtek.txt b/Documentation/devicetree/bindings/net/realtek.txt
index db0333f23fec..af2824664f08 100644
--- a/Documentation/devicetree/bindings/net/realtek.txt
+++ b/Documentation/devicetree/bindings/net/realtek.txt
@@ -15,6 +15,10 @@ Optional properties:
 
   Only supported for "realtek,rtl8211e".
 
+- realtek,enable-ssc : Enable Spread Spectrum Clocking (SSC) on this port.
+
+  Only supported for "realtek,rtl8211e".
+
 
 Example:
 
@@ -27,5 +31,6 @@ mdio0 {
 		compatible = "realtek,rtl8211e";
 		reg = <1>;
 		realtek,eee-led-mode-disable;
+		realtek,enable-ssc;
 	};
 };
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related

* [PATCH v3 0/7] net: phy: realtek: Enable configuration of RTL8211E LEDs and SSC
From: Matthias Kaehlcke @ 2019-07-08 19:24 UTC (permalink / raw)
  To: David S . Miller, Rob Herring, Mark Rutland, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit
  Cc: netdev, devicetree, linux-kernel, Douglas Anderson,
	Matthias Kaehlcke

The Realtek RTL8211E allows customization of the PHY LED behavior,
like which LEDs are on for certain link speeds and which LEDs blink
when there is traffic. By default EEE LED mode is enabled, in which
a blinking LED is on for 400ms and off for 2s. This series adds
support for configuring the LED behavior through device tree
properties.

The RTL8211E supports Spread Spectrum Clocking (SSC), which reduces
clock noise that may affect other board functions. By default SSC
is disabled, this series adds support for enabling it through a
device tree property.

Certain registers on the RTL8211E can only be accessed through
a vendor specific extended page mechanism. Extended pages need
to be accessed for the LED configuration and enabling SSC. This
series adds helpers to facilitate accessing extended pages.

Matthias Kaehlcke (7):
  dt-bindings: net: Add bindings for Realtek PHYs
  net: phy: realtek: Allow disabling RTL8211E EEE LED mode
  dt-bindings: net: realtek: Add property to enable SSC
  net: phy: realtek: Add helpers for accessing RTL8211E extension pages
  net: phy: realtek: Support SSC for the RTL8211E
  dt-bindings: net: realtek: Add property to configure LED mode
  net: phy: realtek: configure RTL8211E LEDs

 .../devicetree/bindings/net/realtek.txt       |  47 +++++
 drivers/net/phy/realtek.c                     | 171 ++++++++++++++++--
 include/dt-bindings/net/realtek.h             |  18 ++
 3 files changed, 221 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/realtek.txt
 create mode 100644 include/dt-bindings/net/realtek.h

-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply

* Re: [PATCH net-next] sfc: Remove 'PCIE error reporting unavailable'
From: David Miller @ 2019-07-08 19:17 UTC (permalink / raw)
  To: mhabets; +Cc: linux-net-drivers, netdev
In-Reply-To: <156258403191.17195.13184667600147687856.stgit@mh-desktop.uk.solarflarecom.com>

From: Martin Habets <mhabets@solarflare.com>
Date: Mon, 8 Jul 2019 12:07:11 +0100

> This is only at notice level but it was pointed out that no other driver
> does this.
> Also there is no action the user can take as it is really a property of
> the server.
> 
> Signed-off-by: Martin Habets <mhabets@solarflare.com>

Applied, thanks.

^ permalink raw reply

* [bpf PATCH v2 6/6] bpf: sockmap/tls, close can race with map free
From: John Fastabend @ 2019-07-08 19:15 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    8 +++++++-
 include/net/tcp.h     |    3 +++
 net/core/skmsg.c      |    4 ++--
 net/ipv4/tcp_ulp.c    |   13 +++++++++++++
 net/tls/tls_main.c    |   35 +++++++++++++++++++++++++++++------
 5 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 50ced8aba9db..e4b3fb4bb77c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -354,7 +354,13 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-		sk->sk_prot = psock->sk_proto;
+		struct inet_connection_sock *icsk = inet_csk(sk);
+		bool has_ulp = !!icsk->icsk_ulp_data;
+
+		if (has_ulp)
+			tcp_update_ulp(sk, psock->sk_proto);
+		else
+			sk->sk_prot = psock->sk_proto;
 		psock->sk_proto = NULL;
 	}
 }
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9d36cc88d043..123cac4c96f2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2102,6 +2102,8 @@ struct tcp_ulp_ops {
 
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
+	/* update ulp */
+	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 
@@ -2113,6 +2115,7 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
+void tcp_update_ulp(struct sock *sk, struct proto *p);
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 93bffaad2135..6832eeb4b785 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -585,12 +585,12 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	rcu_assign_sk_user_data(sk, NULL);
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
-	sk_psock_restore_proto(sk, psock);
 
 	write_lock_bh(&sk->sk_callback_lock);
+	sk_psock_restore_proto(sk, psock);
+	rcu_assign_sk_user_data(sk, NULL);
 	if (psock->progs.skb_parser)
 		sk_psock_stop_strp(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 3d8a1d835471..4849edb62d52 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -96,6 +96,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
+void tcp_update_ulp(struct sock *sk, struct proto *proto)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (!icsk->icsk_ulp_ops) {
+		sk->sk_prot = proto;
+		return;
+	}
+
+	if (icsk->icsk_ulp_ops->update)
+		icsk->icsk_ulp_ops->update(sk, proto);
+}
+
 void tcp_cleanup_ulp(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index e8418456ee24..4ba5476fbc5f 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -336,15 +336,17 @@ static void tls_sk_proto_unhash(struct sock *sk)
 
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tls_context *ctx = tls_get_ctx(sk);
 	long timeo = sock_sndtimeo(sk, 0);
-	void (*sk_proto_close)(struct sock *sk, long timeout);
+
+	if (unlikely(!ctx))
+		return;
 
 	if (ctx->tx_conf == TLS_SW)
 		tls_sw_cancel_work_tx(ctx);
 
 	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
 
 	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
 		goto skip_tx_cleanup;
@@ -352,15 +354,18 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
-	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
+	write_lock_bh(&sk->sk_callback_lock);
+	icsk->icsk_ulp_data = NULL;
+	if (sk->sk_prot->close == tls_sk_proto_close)
+		sk->sk_prot = ctx->sk_proto;
+	write_unlock_bh(&sk->sk_callback_lock);
 	release_sock(sk);
 	if (ctx->rx_conf == TLS_SW)
 		tls_sw_release_strp_rx(ctx);
-	sk_proto_close(sk, timeout);
-
+	ctx->sk_proto_close(sk, timeout);
 	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
 	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
@@ -836,22 +841,39 @@ static int tls_init(struct sock *sk)
 	if (sk->sk_state != TCP_ESTABLISHED)
 		return -ENOTSUPP;
 
+	tls_build_proto(sk);
+
 	/* allocate tls context */
+	write_lock_bh(&sk->sk_callback_lock);
 	ctx = create_ctx(sk);
 	if (!ctx) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
 	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
+	write_unlock_bh(&sk->sk_callback_lock);
 	return rc;
 }
 
+static void tls_update(struct sock *sk, struct proto *p)
+{
+	struct tls_context *ctx;
+
+	ctx = tls_get_ctx(sk);
+	if (likely(ctx)) {
+		ctx->sk_proto_close = p->close;
+		ctx->unhash = p->unhash;
+		ctx->sk_proto = p;
+	} else {
+		sk->sk_prot = p;
+	}
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -872,6 +894,7 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.name			= "tls",
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
+	.update			= tls_update,
 };
 
 static int __init tls_register(void)


^ permalink raw reply related

* [bpf PATCH v2 5/6] bpf: sockmap, only create entry if ulp is not already enabled
From: John Fastabend @ 2019-07-08 19:15 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

Sockmap does not currently support adding sockets after TLS has been
enabled. There never was a real use case for this so it was never
added. But, we lost the test for ULP at some point so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case but fixup the bug here.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 56bcabe7c2f2..1330a7442e5b 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -334,6 +334,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 				  struct sock *sk, u64 flags)
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sk_psock_link *link;
 	struct sk_psock *psock;
 	struct sock *osk;
@@ -344,6 +345,8 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
+	if (unlikely(icsk->icsk_ulp_data))
+		return -EINVAL;
 
 	link = sk_psock_init_link();
 	if (!link)


^ permalink raw reply related

* [bpf PATCH v2 4/6] bpf: sockmap, synchronize_rcu before free'ing map
From: John Fastabend @ 2019-07-08 19:14 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

We need to have a synchronize_rcu before free'ing the sockmap because
any outstanding psock references will have a pointer to the map and
when they use this could trigger a use after free.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 28702f2e9a4a..56bcabe7c2f2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -247,6 +247,8 @@ static void sock_map_free(struct bpf_map *map)
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
 
+	synchronize_rcu();
+
 	bpf_map_area_free(stab->sks);
 	kfree(stab);
 }


^ permalink raw reply related

* [bpf PATCH v2 3/6] bpf: sockmap, sock_map_delete needs to use xchg
From: John Fastabend @ 2019-07-08 19:14 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call however uses a read, test, null over the
map entry which can result in both paths trying to free the map
entry.

To fix use xchg in TCP paths as well so we avoid having two references
to the same map entry.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/sock_map.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 52d4faeee18b..28702f2e9a4a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -276,16 +276,20 @@ static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
 	struct sock *sk;
+	int err = 0;
 
 	raw_spin_lock_bh(&stab->lock);
 	sk = *psk;
 	if (!sk_test || sk_test == sk)
-		*psk = NULL;
+		sk = xchg(psk, NULL);
+
+	if (likely(sk))
+		sock_map_unref(sk, psk);
+	else
+		err = -EINVAL;
+
 	raw_spin_unlock_bh(&stab->lock);
-	if (unlikely(!sk))
-		return -EINVAL;
-	sock_map_unref(sk, psk);
-	return 0;
+	return err;
 }
 
 static void sock_map_delete_from_link(struct bpf_map *map, struct sock *sk,


^ permalink raw reply related

* [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-08 19:14 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_dosconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

Fixes: 3c4d7559159bf ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tls.h  |    6 +++++-
 net/tls/tls_main.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 0a3540a6304d..2a98d0584999 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -251,6 +251,8 @@ struct tls_context {
 	u8 tx_conf:3;
 	u8 rx_conf:3;
 
+	struct proto *sk_proto;
+
 	int (*push_pending_record)(struct sock *sk, int flags);
 	void (*sk_write_space)(struct sock *sk);
 
@@ -288,6 +290,8 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+
+	struct work_struct gc;
 };
 
 enum tls_offload_ctx_dir {
@@ -356,7 +360,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx);
 int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 int tls_sw_sendpage(struct sock *sk, struct page *page,
 		    int offset, size_t size, int flags);
-void tls_sw_close(struct sock *sk, long timeout);
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index d63c3583d2f7..e8418456ee24 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,6 +251,32 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free_deferred(struct work_struct *gc)
+{
+	struct tls_context *ctx = container_of(gc, struct tls_context, gc);
+
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_release_strp_rx(ctx);
+
+	/* Ensure any remaining work items are completed. The sk will
+	 * already have lost its tls_ctx reference by the time we get
+	 * here so no xmit operation will actually be performed.
+	 */
+	tls_sw_cancel_work_tx(ctx);
+	kfree(ctx);
+}
+
+static void tls_ctx_free_wq(struct tls_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
+	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
+	INIT_WORK(&ctx->gc, tls_ctx_free_deferred);
+	schedule_work(&ctx->gc);
+}
+
 static void tls_ctx_free(struct tls_context *ctx)
 {
 	if (!ctx)
@@ -287,6 +313,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
 #endif
 }
 
+static void tls_sk_proto_unhash(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+	struct tls_context *ctx;
+
+	if (unlikely(!icsk->icsk_ulp_data)) {
+		if (sk->sk_prot->unhash)
+			sk->sk_prot->unhash(sk);
+	}
+
+	ctx = tls_get_ctx(sk);
+	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
+		tls_sk_proto_cleanup(sk, ctx, timeo);
+	icsk->icsk_ulp_data = NULL;
+	tls_ctx_free_wq(ctx);
+
+	if (ctx->unhash)
+		ctx->unhash(sk);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
@@ -305,6 +352,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
 		goto skip_tx_cleanup;
 
+	sk->sk_prot = ctx->sk_proto;
 	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
@@ -606,6 +654,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
+	ctx->unhash = sk->sk_prot->unhash;
 	return ctx;
 }
 
@@ -729,20 +778,24 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 	prot[TLS_BASE][TLS_BASE].setsockopt	= tls_setsockopt;
 	prot[TLS_BASE][TLS_BASE].getsockopt	= tls_getsockopt;
 	prot[TLS_BASE][TLS_BASE].close		= tls_sk_proto_close;
+	prot[TLS_BASE][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_SW][TLS_BASE].sendmsg		= tls_sw_sendmsg;
 	prot[TLS_SW][TLS_BASE].sendpage		= tls_sw_sendpage;
+	prot[TLS_SW][TLS_BASE].unhash		= tls_sk_proto_unhash;
 
 	prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
 	prot[TLS_BASE][TLS_SW].recvmsg		  = tls_sw_recvmsg;
 	prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read;
 	prot[TLS_BASE][TLS_SW].close		  = tls_sk_proto_close;
+	prot[TLS_BASE][TLS_SW].unhash		  = tls_sk_proto_unhash;
 
 	prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE];
 	prot[TLS_SW][TLS_SW].recvmsg		= tls_sw_recvmsg;
 	prot[TLS_SW][TLS_SW].stream_memory_read	= tls_sw_stream_read;
 	prot[TLS_SW][TLS_SW].close		= tls_sk_proto_close;
+	prot[TLS_SW][TLS_SW].unhash		= tls_sk_proto_unhash;
 
 #ifdef CONFIG_TLS_DEVICE
 	prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
@@ -793,6 +846,7 @@ static int tls_init(struct sock *sk)
 	tls_build_proto(sk);
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
+	ctx->sk_proto = sk->sk_prot;
 	update_sk_prot(sk, ctx);
 out:
 	return rc;


^ permalink raw reply related

* Re: [PATCH 00/15] Netfilter/IPVS updates for net-next
From: David Miller @ 2019-07-08 19:14 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20190708103237.28061-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon,  8 Jul 2019 12:32:22 +0200

> The following patchset contains Netfilter/IPVS updates for net-next:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled, thanks.

^ permalink raw reply

* [bpf PATCH v2 1/6] tls: remove close callback sock unlock/lock and flush_sync
From: John Fastabend @ 2019-07-08 19:13 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf
In-Reply-To: <156261310104.31108.4569969631798277807.stgit@ubuntu3-kvm1>

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock. The
same pattern is used again to call strp_done().

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
  set_bit(CLOSING)
  set_bit(SCHEDULE)
  cancel_delay_work_sync() <- cancel workqueue
  lock_sock(sk)
  ...
  release_sock(sk)
  strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Then strp_done() is called after the sk work is completed.
Any outstanding work is sync'd and finally ctx is free'd.

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/tls.h  |    4 ++--
 net/tls/tls_main.c |   56 ++++++++++++++++++++++++++--------------------------
 net/tls/tls_sw.c   |   38 ++++++++++++++++++++++-------------
 3 files changed, 54 insertions(+), 44 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 63e473420b00..0a3540a6304d 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -107,9 +107,7 @@ struct tls_device {
 enum {
 	TLS_BASE,
 	TLS_SW,
-#ifdef CONFIG_TLS_DEVICE
 	TLS_HW,
-#endif
 	TLS_HW_RECORD,
 	TLS_NUM_CONFIG,
 };
@@ -162,6 +160,7 @@ struct tls_sw_context_tx {
 	int async_capable;
 
 #define BIT_TX_SCHEDULED	0
+#define BIT_TX_CLOSING		1
 	unsigned long tx_bitmask;
 };
 
@@ -361,6 +360,7 @@ void tls_sw_close(struct sock *sk, long timeout);
 void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 void tls_sw_release_resources_rx(struct sock *sk);
+void tls_sw_release_strp_rx(struct tls_context *tls_ctx);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		   int nonblock, int flags, int *addr_len);
 bool tls_sw_stream_read(const struct sock *sk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index fc81ae18cc44..d63c3583d2f7 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -261,24 +261,9 @@ static void tls_ctx_free(struct tls_context *ctx)
 	kfree(ctx);
 }
 
-static void tls_sk_proto_close(struct sock *sk, long timeout)
+static void tls_sk_proto_cleanup(struct sock *sk,
+				 struct tls_context *ctx, long timeo)
 {
-	struct tls_context *ctx = tls_get_ctx(sk);
-	long timeo = sock_sndtimeo(sk, 0);
-	void (*sk_proto_close)(struct sock *sk, long timeout);
-	bool free_ctx = false;
-
-	lock_sock(sk);
-	sk_proto_close = ctx->sk_proto_close;
-
-	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
-		goto skip_tx_cleanup;
-
-	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE) {
-		free_ctx = true;
-		goto skip_tx_cleanup;
-	}
-
 	if (!tls_complete_pending_work(sk, ctx, 0, &timeo))
 		tls_handle_open_record(sk, 0);
 
@@ -299,22 +284,37 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 #ifdef CONFIG_TLS_DEVICE
 	if (ctx->rx_conf == TLS_HW)
 		tls_device_offload_cleanup_rx(sk);
-
-	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW) {
-#else
-	{
 #endif
-		tls_ctx_free(ctx);
-		ctx = NULL;
-	}
+}
+
+static void tls_sk_proto_close(struct sock *sk, long timeout)
+{
+	struct tls_context *ctx = tls_get_ctx(sk);
+	long timeo = sock_sndtimeo(sk, 0);
+	void (*sk_proto_close)(struct sock *sk, long timeout);
+
+	if (ctx->tx_conf == TLS_SW)
+		tls_sw_cancel_work_tx(ctx);
+
+	lock_sock(sk);
+	sk_proto_close = ctx->sk_proto_close;
+
+	if (ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD)
+		goto skip_tx_cleanup;
+
+	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
+		goto skip_tx_cleanup;
+
+	tls_sk_proto_cleanup(sk, ctx, timeo);
 
 skip_tx_cleanup:
 	release_sock(sk);
+	if (ctx->rx_conf == TLS_SW)
+		tls_sw_release_strp_rx(ctx);
 	sk_proto_close(sk, timeout);
-	/* free ctx for TLS_HW_RECORD, used by tcp_set_state
-	 * for sk->sk_prot->unhash [tls_hw_unhash]
-	 */
-	if (free_ctx)
+
+	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
+	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
 		tls_ctx_free(ctx);
 }
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index db585964b52b..3b01cd72ca6c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2053,6 +2053,15 @@ static void tls_data_ready(struct sock *sk)
 	}
 }
 
+void tls_sw_cancel_work_tx(struct tls_context *tls_ctx)
+{
+	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+
+	set_bit(BIT_TX_CLOSING, &ctx->tx_bitmask);
+	set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask);
+	cancel_delayed_work_sync(&ctx->tx_work.work);
+}
+
 void tls_sw_free_resources_tx(struct sock *sk)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -2064,11 +2073,6 @@ void tls_sw_free_resources_tx(struct sock *sk)
 	if (atomic_read(&ctx->encrypt_pending))
 		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
 
-	release_sock(sk);
-	cancel_delayed_work_sync(&ctx->tx_work.work);
-	lock_sock(sk);
-
-	/* Tx whatever records we can transmit and abandon the rest */
 	tls_tx_records(sk, -1);
 
 	/* Free up un-sent records in tx_list. First, free
@@ -2112,22 +2116,22 @@ void tls_sw_release_resources_rx(struct sock *sk)
 		write_lock_bh(&sk->sk_callback_lock);
 		sk->sk_data_ready = ctx->saved_data_ready;
 		write_unlock_bh(&sk->sk_callback_lock);
-		release_sock(sk);
-		strp_done(&ctx->strp);
-		lock_sock(sk);
 	}
 }
 
-void tls_sw_free_resources_rx(struct sock *sk)
+void tls_sw_release_strp_rx(struct tls_context *tls_ctx)
 {
-	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 
-	tls_sw_release_resources_rx(sk);
-
+	strp_done(&ctx->strp);
 	kfree(ctx);
 }
 
+void tls_sw_free_resources_rx(struct sock *sk)
+{
+	tls_sw_release_resources_rx(sk);
+}
+
 /* The work handler to transmitt the encrypted records in tx_list */
 static void tx_work_handler(struct work_struct *work)
 {
@@ -2136,11 +2140,17 @@ static void tx_work_handler(struct work_struct *work)
 					       struct tx_work, work);
 	struct sock *sk = tx_work->sk;
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
-	struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
+	struct tls_sw_context_tx *ctx;
 
-	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+	if (unlikely(!tls_ctx))
+		return;
+
+	ctx = tls_sw_ctx_tx(tls_ctx);
+	if (test_bit(BIT_TX_CLOSING, &ctx->tx_bitmask))
 		return;
 
+	if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask))
+		return;
 	lock_sock(sk);
 	tls_tx_records(sk, -1);
 	release_sock(sk);


^ permalink raw reply related

* [bpf PATCH v2 0/6] bpf: sockmap/tls fixes
From: John Fastabend @ 2019-07-08 19:13 UTC (permalink / raw)
  To: jakub.kicinski, ast, daniel; +Cc: netdev, edumazet, john.fastabend, bpf

Resolve a series of splats discovered by syzbot and an unhash
TLS issue noted by Eric Dumazet.

The main issues revolved around interaction between TLS and
sockmap tear down. TLS and sockmap could both reset sk->prot
ops creating a condition where a close or unhash op could be
called forever. A rare race condition resulting from a missing
rcu sync operation was causing a use after free. Then on the
TLS side dropping the sock lock and re-acquiring it during the
close op could hang. Finally, sockmap must be deployed before
tls for current stack assumptions to be met. This is enforced
now. A feature series can enable it.

To fix this first refactor TLS code so the lock is held for the
entire teardown operation. Then add an unhash callback to ensure
TLS can not transition from ESTABLISHED to LISTEN state. This
transition is a similar bug to the one found and fixed previously
in sockmap. Then apply three fixes to sockmap to fix up races
on tear down around map free and close. Finally, if sockmap
is destroyed before TLS we add a new ULP op update to inform
the TLS stack it should not call sockmap ops. This last one
appears to be the most commonly found issue from syzbot.

---

John Fastabend (6):
      tls: remove close callback sock unlock/lock and flush_sync
      bpf: tls fix transition through disconnect with close
      bpf: sockmap, sock_map_delete needs to use xchg
      bpf: sockmap, synchronize_rcu before free'ing map
      bpf: sockmap, only create entry if ulp is not already enabled
      bpf: sockmap/tls, close can race with map free

 include/linux/skmsg.h |    8 +++
 include/net/tcp.h     |    3 +
 include/net/tls.h     |   10 +++-
 net/core/skmsg.c      |    4 +
 net/core/sock_map.c   |   19 +++++--
 net/ipv4/tcp_ulp.c    |   13 +++++
 net/tls/tls_main.c    |  135 ++++++++++++++++++++++++++++++++++++++-----------
 net/tls/tls_sw.c      |   38 +++++++++-----
 8 files changed, 176 insertions(+), 54 deletions(-)

--
Signature

^ permalink raw reply

* Re: [PATCH net-next v2 8/8] net: mscc: PTP Hardware Clock (PHC) support
From: Jakub Kicinski @ 2019-07-08 19:06 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, richardcochran, alexandre.belloni, UNGLinuxDriver, ralf,
	paul.burton, jhogan, netdev, linux-mips, thomas.petazzoni,
	allan.nielsen
In-Reply-To: <20190708084809.GB2932@kwain>

On Mon, 8 Jul 2019 10:48:09 +0200, Antoine Tenart wrote:
> > > +	/* Commit back the result & save it */
> > > +	memcpy(&ocelot->hwtstamp_config, &cfg, sizeof(cfg));
> > > +	mutex_unlock(&ocelot->ptp_lock);
> > > +
> > > +	return copy_to_user(ifr->ifr_data, &cfg, sizeof(cfg)) ? -EFAULT : 0;
> > > +}
> > >  
> > > +static int ocelot_get_ts_info(struct net_device *dev,
> > > +			      struct ethtool_ts_info *info)
> > > +{
> > > +	struct ocelot_port *ocelot_port = netdev_priv(dev);
> > > +	struct ocelot *ocelot = ocelot_port->ocelot;
> > > +	int ret;
> > > +
> > > +	if (!ocelot->ptp)
> > > +		return -EOPNOTSUPP;  
> > 
> > Hmm.. why does software timestamping depend on PTP?  
> 
> Because it depends on the "PTP" register bank (and the "PTP" interrupt)
> being described and available. This is why I named the flag 'ptp', but
> it could be named 'timestamp' or 'ts' as well.

Right, but software timestamps are done by calling skb_tx_timestamp(skb)
in the driver, no need for HW support there (software RX timestamp is
handled by the stack).

^ permalink raw reply

* Re: [PATCH] net: netsec: Sync dma for device on buffer allocation
From: David Miller @ 2019-07-08 19:00 UTC (permalink / raw)
  To: ilias.apalodimas; +Cc: netdev, jaswinder.singh
In-Reply-To: <1562570741-25108-1-git-send-email-ilias.apalodimas@linaro.org>

From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Date: Mon,  8 Jul 2019 10:25:41 +0300

> cd1973a9215a ("net: netsec: Sync dma for device on buffer allocation")
> was merged on it's v1 instead of the v3.
> Merge the proper patch version
> 
> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnxt_en: add page_pool support
From: Andy Gospodarek @ 2019-07-08 18:49 UTC (permalink / raw)
  To: Michael Chan, Ilias Apalodimas; +Cc: davem, netdev, hawk, ast
In-Reply-To: <1562398578-26020-5-git-send-email-michael.chan@broadcom.com>

On Sat, Jul 06, 2019 at 03:36:18AM -0400, Michael Chan wrote:
> From: Andy Gospodarek <gospo@broadcom.com>
> 
> This removes contention over page allocation for XDP_REDIRECT actions by
> adding page_pool support per queue for the driver.  The performance for
> XDP_REDIRECT actions scales linearly with the number of cores performing
> redirect actions when using the page pools instead of the standard page
> allocator.
> 
> Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
> Signed-off-by: Michael Chan <michael.chan@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/Kconfig         |  1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 40 +++++++++++++++++++++++----
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  3 ++
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  3 +-
>  4 files changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index d8f0846..b6777e5 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
[...]
> @@ -2530,12 +2555,17 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
>  
>  		ring = &rxr->rx_ring_struct;
>  
> +		rc = bnxt_alloc_rx_page_pool(bp, rxr);
> +		if (rc)
> +			return rc;
> +
>  		rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
>  		if (rc < 0)
>  			return rc;
>  
>  		rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
> -						MEM_TYPE_PAGE_SHARED, NULL);
> +						MEM_TYPE_PAGE_POOL,
> +						rxr->page_pool);
>  		if (rc) {
>  			xdp_rxq_info_unreg(&rxr->xdp_rxq);
>  			return rc;

I think we want to amend and the chunk above to be:

@@ -2530,14 +2557,24 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 
                ring = &rxr->rx_ring_struct;
 
+               rc = bnxt_alloc_rx_page_pool(bp, rxr);
+               if (rc)
+                       return rc;
+
                rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
-               if (rc < 0)
+               if (rc < 0) {
+                       page_pool_free(rxr->page_pool);
+                       rxr->page_pool = NULL;
                        return rc;
+               }
 
                rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
-                                               MEM_TYPE_PAGE_SHARED, NULL);
+                                               MEM_TYPE_PAGE_POOL,
+                                               rxr->page_pool);
                if (rc) {
                        xdp_rxq_info_unreg(&rxr->xdp_rxq);
+                       page_pool_free(rxr->page_pool);
+                       rxr->page_pool = NULL;
                        return rc;
                }
 

That should take care of the freeing of the page_pool that is allocated
but there is a failure during xdp_rxq_info_reg() or
xdp_rxq_info_reg_mem_model().

I agree that we do not need to call page_pool_free in the normal
shutdown case.

^ permalink raw reply

* Re: use exact allocation for dma coherent memory
From: Christoph Hellwig @ 2019-07-08 18:43 UTC (permalink / raw)
  To: Arend Van Spriel
  Cc: Christoph Hellwig, Maarten Lankhorst, Maxime Ripard, Sean Paul,
	David Airlie, Daniel Vetter, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Ian Abbott, H Hartley Sweeten, devel, linux-s390,
	Intel Linux Wireless, linux-rdma, netdev, intel-gfx,
	linux-wireless, linux-kernel, dri-devel, linux-mm, iommu,
	moderated list:ARM PORT, linux-media
In-Reply-To: <74eb9d99-6aa6-d1ad-e66d-6cc9c496b2f3@broadcom.com>

On Tue, Jul 02, 2019 at 11:48:44AM +0200, Arend Van Spriel wrote:
> You made me look ;-) Actually not touching my drivers so I'm off the hook. 
> However, I was wondering if drivers could know so I decided to look into 
> the DMA-API.txt documentation which currently states:
>
> """
> The flag parameter (dma_alloc_coherent() only) allows the caller to
> specify the ``GFP_`` flags (see kmalloc()) for the allocation (the
> implementation may choose to ignore flags that affect the location of
> the returned memory, like GFP_DMA).
> """
>
> I do expect you are going to change that description as well now that you 
> are going to issue a warning on __GFP_COMP. Maybe include that in patch 
> 15/16 where you introduce that warning.

Yes, that description needs an updated, even without this series.
I'll make sure it is more clear.

^ permalink raw reply

* Re: [PATCH] r8169: add enable_aspm parameter
From: Heiner Kallweit @ 2019-07-08 18:27 UTC (permalink / raw)
  To: AceLan Kao, Realtek linux nic maintainers, David S. Miller,
	netdev, linux-kernel
In-Reply-To: <20190708063751.16234-1-acelan.kao@canonical.com>

On 08.07.2019 08:37, AceLan Kao wrote:
> We have many commits in the driver which enable and then disable ASPM
> function over and over again.
>    commit b75bb8a5b755 ("r8169: disable ASPM again")
>    commit 0866cd15029b ("r8169: enable ASPM on RTL8106E")
>    commit 94235460f9ea ("r8169: Align ASPM/CLKREQ setting function with vendor driver")
>    commit aa1e7d2c31ef ("r8169: enable ASPM on RTL8168E-VL")
>    commit f37658da21aa ("r8169: align ASPM entry latency setting with vendor driver")
>    commit a99790bf5c7f ("r8169: Reinstate ASPM Support")
>    commit 671646c151d4 ("r8169: Don't disable ASPM in the driver")
>    commit 4521e1a94279 ("Revert "r8169: enable internal ASPM and clock request settings".")
>    commit d64ec841517a ("r8169: enable internal ASPM and clock request settings")
> 
> This function is very important for production, and if we can't come out
> a solution to make both happy, I'd suggest we add a parameter in the
> driver to toggle it.
> 
The usage of a module parameter to control ASPM is discouraged.
There have been more such attempts in the past that have been declined.

Pending with the PCI maintainers is a series adding ASPM control
via sysfs, see here: https://www.spinics.net/lists/linux-pci/msg83228.html

Also more details than just stating "it's important for production"
would have been appreciated in the commit message, e.g. which
power-savings you can achieve with ASPM on which systems.

^ permalink raw reply

* Re: [PATCH] net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
From: David Miller @ 2019-07-08 18:18 UTC (permalink / raw)
  To: xiaojiangfeng
  Cc: yisen.zhuang, salil.mehta, dingtianhong, robh+dt, mark.rutland,
	netdev, devicetree, linux-kernel, leeyou.li, xiekunxun,
	jianping.liu, nixiaoming
In-Reply-To: <20190707.221805.2104668553072088371.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sun, 07 Jul 2019 22:18:05 -0700 (PDT)

> From: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
> Date: Fri, 5 Jul 2019 14:10:03 +0800
> 
>> HI13X1 changed the offsets and bitmaps for tx_desc
>> registers in the same peripheral device on different
>> models of the hip04_eth.
>> 
>> Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
> 
> Applied.

Actually I didn't apply this because I can't see that HI13X1_GMAC
kconfig knob anywhere in the tree at all.

^ permalink raw reply

* Re: [PATCH net-next v6 04/15] ethtool: introduce ethtool netlink interface
From: Johannes Berg @ 2019-07-08 18:12 UTC (permalink / raw)
  To: Michal Kubecek, netdev
  Cc: Jiri Pirko, David Miller, Jakub Kicinski, Andrew Lunn,
	Florian Fainelli, John Linville, Stephen Hemminger, linux-kernel
In-Reply-To: <20190708172729.GC24474@unicorn.suse.cz>

On Mon, 2019-07-08 at 19:27 +0200, Michal Kubecek wrote:
> 
> Second reason is that with 8-bit genetlink command/message id, the space
> is not as infinite as it might seem.

FWIW, there isn't really any good reason for this, we have like 16
reserved bits in the genl header.

OTOH, having a LOT of ops will certainly cost space in the kernel
image...

johannes


^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-07-08 18:12 UTC (permalink / raw)
  To: Paul Moore
  Cc: Tycho Andersen, Serge E. Hallyn, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <CAHC9VhT5HPt9rCJoDutdvA3r1Y1GOHfpXe2eJ54atNC1=Vd8LA@mail.gmail.com>

On 2019-05-30 19:26, Paul Moore wrote:
> On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > >
> > > [REMINDER: It is an "*audit* container ID" and not a general
> > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > This sort of seems like a distinction without a difference; presumably
> > audit is going to want to differentiate between everything that people
> > in userspace call a container. So you'll have to support all this
> > insanity anyway, even if it's "not a container ID".
> 
> That's not quite right.  Audit doesn't care about what a container is,
> or is not, it also doesn't care if the "audit container ID" actually
> matches the ID used by the container engine in userspace and I think
> that is a very important line to draw.  Audit is simply given a value
> which it calls the "audit container ID", it ensures that the value is
> inherited appropriately (e.g. children inherit their parent's audit
> container ID), and it uses the value in audit records to provide some
> additional context for log analysis.  The distinction isn't limited to
> the value itself, but also to how it is used; it is an "audit
> container ID" and not a "container ID" because this value is
> exclusively for use by the audit subsystem.  We are very intentionally
> not adding a generic container ID to the kernel.  If the kernel does
> ever grow a general purpose container ID we will be one of the first
> ones in line to make use of it, but we are not going to be the ones to
> generically add containers to the kernel.  Enough people already hate
> audit ;)
> 
> > > I'm not interested in supporting/merging something that isn't useful;
> > > if this doesn't work for your use case then we need to figure out what
> > > would work.  It sounds like nested containers are much more common in
> > > the lxc world, can you elaborate a bit more on this?
> > >
> > > As far as the possible solutions you mention above, I'm not sure I
> > > like the per-userns audit container IDs, I'd much rather just emit the
> > > necessary tracking information via the audit record stream and let the
> > > log analysis tools figure it out.  However, the bigger question is how
> > > to limit (re)setting the audit container ID when you are in a non-init
> > > userns.  For reasons already mentioned, using capable() is a non
> > > starter for everything but the initial userns, and using ns_capable()
> > > is equally poor as it essentially allows any userns the ability to
> > > munge it's audit container ID (obviously not good).  It appears we
> > > need a different method for controlling access to the audit container
> > > ID.
> >
> > One option would be to make it a string, and have it be append only.
> > That should be safe with no checks.
> >
> > I know there was a long thread about what type to make this thing. I
> > think you could accomplish the append-only-ness with a u64 if you had
> > some rule about only allowing setting lower order bits than those that
> > are already set. With 4 bits for simplicity:
> >
> > 1100         # initial container id
> > 1100 -> 1011 # not allowed
> > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> >              # no lower order bits left
> >
> > There are probably fancier ways to do it if you actually understand
> > math :)
> 
>  ;)
> 
> > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > you have 64 bits, this might be reasonable. You could just teach
> > container engines to use the first say N bits for themselves, with a 1
> > bit for the barrier at the end.
> 
> I like the creativity, but I worry that at some point these
> limitations are going to be raised (limits have a funny way of doing
> that over time) and we will be in trouble.  I say "trouble" because I
> want to be able to quickly do an audit container ID comparison and
> we're going to pay a penalty for these larger values (we'll need this
> when we add multiple auditd support and the requisite record routing).
> 
> Thinking about this makes me also realize we probably need to think a
> bit longer about audit container ID conflicts between orchestrators.
> Right now we just take the value that is given to us by the
> orchestrator, but if we want to allow multiple container orchestrators
> to work without some form of cooperation in userspace (I think we have
> to assume the orchestrators will not talk to each other) we likely
> need to have some way to block reuse of an audit container ID.  We
> would either need to prevent the orchestrator from explicitly setting
> an audit container ID to a currently in use value, or instead generate
> the audit container ID in the kernel upon an event triggered by the
> orchestrator (e.g. a write to a /proc file).  I suspect we should
> start looking at the idr code, I think we will need to make use of it.

To address this, I'd suggest that it is enforced to only allow the
setting of descendants and to maintain a master list of audit container
identifiers (with a hash table if necessary later) that includes the
container owner.

This also allows the orchestrator/engine to inject processes into
existing containers by checking that the audit container identifier is
only used again by the same owner.

I have working code for both.

> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-07-08 18:05 UTC (permalink / raw)
  To: Paul Moore
  Cc: Serge E. Hallyn, Tycho Andersen, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	ebiederm, nhorman
In-Reply-To: <CAHC9VhThLiQzGYRUWmSuVfOC6QCDmA75BDB7Eg7V8HX4x7ymQg@mail.gmail.com>

On 2019-05-30 15:29, Paul Moore wrote:
> On Thu, May 30, 2019 at 1:09 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> 
> ...
> 
> > > > > > > The current thinking
> > > > > > > is that you would only change the audit container ID from one
> > > > > > > set/inherited value to another if you were nesting containers, in
> > > > > > > which case the nested container orchestrator would need to be granted
> > > > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > > > compromise).
> > > >
> > > > won't work in user namespaced containers, because they will never be
> > > > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > > > nesting as is. But maybe nobody cares :)
> > >
> > > That's fun :)
> > >
> > > To be honest, I've never been a big fan of supporting nested
> > > containers from an audit perspective, so I'm not really too upset
> > > about this.  The k8s/cri-o folks seem okay with this, or at least I
> > > haven't heard any objections; lxc folks, what do you have to say?
> >
> > I actually thought the answer to this (when last I looked, "some time" ago)
> > was that userspace should track an audit message saying "task X in
> > container Y is changing its auditid to Z", and then decide to also track Z.
> > This should be doable, but a lot of extra work in userspace.
> >
> > Per-userns containerids would also work.  So task X1 is in containerid
> > 1 on the host and creates a new task Y in new userns;  it continues to
> > be reported in init_user_ns as containerid 1 forever;  but in its own
> > userns it can request to be known as some other containerid.  Audit
> > socks would be per-userns, allowing root in a container to watch for
> > audit events in its own (and descendent) namespaces.
> >
> > But again I'm sure we've gone over all this in the last few years.
> >
> > I suppose we can look at this as a "first step", and talk about
> > making it user-ns-nestable later.  But agreed it's not useful in a
> > lot of situations as is.
> 
> [REMINDER: It is an "*audit* container ID" and not a general
> "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> 
> I'm not interested in supporting/merging something that isn't useful;
> if this doesn't work for your use case then we need to figure out what
> would work.  It sounds like nested containers are much more common in
> the lxc world, can you elaborate a bit more on this?
> 
> As far as the possible solutions you mention above, I'm not sure I
> like the per-userns audit container IDs, I'd much rather just emit the
> necessary tracking information via the audit record stream and let the
> log analysis tools figure it out.  However, the bigger question is how
> to limit (re)setting the audit container ID when you are in a non-init
> userns.  For reasons already mentioned, using capable() is a non
> starter for everything but the initial userns, and using ns_capable()
> is equally poor as it essentially allows any userns the ability to
> munge it's audit container ID (obviously not good).  It appears we
> need a different method for controlling access to the audit container
> ID.

We're not quite ready yet for multiple audit daemons and possibly not
yet for audit namespaces, but this is starting to look a lot like the
latter.

If we can't trust ns_capable() then why are we passing on
CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
by the orchestrator/engine.  If ns_capable() isn't inherited how is it
gained otherwise?  Can it be inserted by cotainer image?  I think the
answer is "no".  Either we trust ns_capable() or we have audit
namespaces (recommend based on user namespace) (or both).

At this point I would say we are at an impasse unless we trust
ns_capable() or we implement audit namespaces.

I don't think another mechanism to trust nested orchestrators/engines
will buy us anything.

Am I missing something?

> Punting this to a LSM hook is an obvious thing to do, and something we
> might want to do anyway, but currently audit doesn't rely on the LSM
> for proper/safe operation and I'm not sure I want to change that now.
> 
> The next obvious thing is to create some sort of access control knob
> in audit itself.  Perhaps an auditctl operation that would allow the
> administrator to specify which containers, via their corresponding
> audit container IDs, are allowed to change their audit container ID?
> The permission granting would need to be done in the init userns, but
> it would allow containers with a non-init userns the ability to change
> their audit container ID.  We would probably still want a
> ns_capable(CAP_AUDIT_CONTROL) restriction in this case.

This auditctl knob of which you speak is an additional API, not changing
the existing proposed one.

> Does anyone else have any other ideas?
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH] rtw88/pci: Rearrange the memory usage for skb in RX ISR
From: Larry Finger @ 2019-07-08 18:01 UTC (permalink / raw)
  To: Jian-Hong Pan, Yan-Hsuan Chuang, Kalle Valo, David S . Miller
  Cc: linux-wireless, netdev, linux-kernel, linux, Daniel Drake, stable
In-Reply-To: <20190708063252.4756-1-jian-hong@endlessm.com>

On 7/8/19 1:32 AM, Jian-Hong Pan wrote:
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c b/drivers/net/wireless/realtek/rtw88/pci.c
> index cfe05ba7280d..1bfc99ae6b84 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -786,6 +786,15 @@ static void rtw_pci_rx_isr(struct rtw_dev *rtwdev, struct rtw_pci *rtwpci,
>   		rx_desc = skb->data;
>   		chip->ops->query_rx_desc(rtwdev, rx_desc, &pkt_stat, &rx_status);
>   
> +		/* discard current skb if the new skb cannot be allocated as a
> +		 * new one in rx ring later
> +		 * */
> +		new = dev_alloc_skb(RTK_PCI_RX_BUF_SIZE);
> +		if (WARN(!new, "rx routine starvation\n")) {
> +			new = skb;
> +			goto next_rp;

This should probably be a WARN_ONCE() rather than WARN(), otherwise the logs 
will be flooded once this condition triggers.

Larry

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/9] libbpf: add kprobe/uprobe attach API
From: Andrii Nakryiko @ 2019-07-08 17:58 UTC (permalink / raw)
  To: Matt Hart
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Daniel Borkmann,
	Kernel Team, Networking, Stanislav Fomichev
In-Reply-To: <CAH+k93FQkiwRXwgRGrUJEpmAGZBL03URKDmx8uVA9MnLrDKn0Q@mail.gmail.com>

On Mon, Jul 8, 2019 at 8:11 AM Matt Hart <matthew.hart@linaro.org> wrote:
>
> Hi all,
>
> I bisected a perf build error on ARMv7 to this patch:
> libbpf.c: In function ‘perf_event_open_probe’:
> libbpf.c:4112:17: error: cast from pointer to integer of different
> size [-Werror=pointer-to-int-cast]
>   attr.config1 = (uint64_t)(void *)name; /* kprobe_func or uprobe_path */
>                  ^
>
> Is this a known issue?

No, thanks for reporting!

It should be

attr.config1 = (uint64_t)(uintptr_t)(void *)name;

to avoid warning on 32-bit architectures.

I'll post a fix later today, but if you could verify this fixes
warning for you, I'd really appreciate that! Thanks!

^ permalink raw reply

* Re: [PATCH net-next iproute2 2/3] tc: Introduce tc ct action
From: Marcelo Ricardo Leitner @ 2019-07-08 17:54 UTC (permalink / raw)
  To: Paul Blakey
  Cc: Jiri Pirko, Roi Dayan, Yossi Kuperman, Oz Shlomo, netdev,
	David Miller, Aaron Conole, Zhike Wang, Justin Pettit,
	John Hurley, Rony Efraim, nst-kernel, Simon Horman
In-Reply-To: <1562489628-5925-3-git-send-email-paulb@mellanox.com>

On Sun, Jul 07, 2019 at 11:53:47AM +0300, Paul Blakey wrote:
> New tc action to send packets to conntrack module, commit
> them, and set a zone, labels, mark, and nat on the connection.
> 
> It can also clear the packet's conntrack state by using clear.
> 
> Usage:
>    ct clear
>    ct commit [force] [zone] [mark] [label] [nat]

Isn't the 'commit' also optional? More like
    ct [commit [force]] [zone] [mark] [label] [nat]

>    ct [nat] [zone]
> 
> Signed-off-by: Paul Blakey <paulb@mellanox.com>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>
> Acked-by: Roi Dayan <roid@mellanox.com>
> ---
...
> +static void
> +usage(void)
> +{
> +	fprintf(stderr,
> +		"Usage: ct clear\n"
> +		"	ct commit [force] [zone ZONE] [mark MASKED_MARK] [label MASKED_LABEL] [nat NAT_SPEC]\n"

Ditto here then.

> +		"	ct [nat] [zone ZONE]\n"
> +		"Where: ZONE is the conntrack zone table number\n"
> +		"	NAT_SPEC is {src|dst} addr addr1[-addr2] [port port1[-port2]]\n"
> +		"\n");
> +	exit(-1);
> +}
...

The validation below doesn't enforce that commit must be there for
such case.

> +static int
> +parse_ct(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
> +		struct nlmsghdr *n)
> +{
> +	struct tc_ct sel = {};
> +	char **argv = *argv_p;
> +	struct rtattr *tail;
> +	int argc = *argc_p;
> +	int ct_action = 0;
> +	int ret;
> +
> +	tail = addattr_nest(n, MAX_MSG, tca_id);
> +
> +	if (argc && matches(*argv, "ct") == 0)
> +		NEXT_ARG_FWD();
> +
> +	while (argc > 0) {
> +		if (matches(*argv, "zone") == 0) {
> +			NEXT_ARG();
> +
> +			if (ct_parse_u16(*argv,
> +					 TCA_CT_ZONE, TCA_CT_UNSPEC, n)) {
> +				fprintf(stderr, "ct: Illegal \"zone\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "nat") == 0) {
> +			ct_action |= TCA_CT_ACT_NAT;
> +
> +			NEXT_ARG();
> +			if (matches(*argv, "src") == 0)
> +				ct_action |= TCA_CT_ACT_NAT_SRC;
> +			else if (matches(*argv, "dst") == 0)
> +				ct_action |= TCA_CT_ACT_NAT_DST;
> +			else
> +				continue;
> +
> +			NEXT_ARG();
> +			if (matches(*argv, "addr") != 0)
> +				usage();
> +
> +			NEXT_ARG();
> +			ret = ct_parse_nat_addr_range(*argv, n);
> +			if (ret) {
> +				fprintf(stderr, "ct: Illegal nat address range\n");
> +				return -1;
> +			}
> +
> +			NEXT_ARG_FWD();
> +			if (matches(*argv, "port") != 0)
> +				continue;
> +
> +			NEXT_ARG();
> +			ret = ct_parse_nat_port_range(*argv, n);
> +			if (ret) {
> +				fprintf(stderr, "ct: Illegal nat port range\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "clear") == 0) {
> +			ct_action |= TCA_CT_ACT_CLEAR;
> +		} else if (matches(*argv, "commit") == 0) {
> +			ct_action |= TCA_CT_ACT_COMMIT;
> +		} else if (matches(*argv, "force") == 0) {
> +			ct_action |= TCA_CT_ACT_FORCE;
> +		} else if (matches(*argv, "index") == 0) {
> +			NEXT_ARG();
> +			if (get_u32(&sel.index, *argv, 10)) {
> +				fprintf(stderr, "ct: Illegal \"index\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "mark") == 0) {
> +			NEXT_ARG();
> +
> +			ret = ct_parse_mark(*argv, n);
> +			if (ret) {
> +				fprintf(stderr, "ct: Illegal \"mark\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "label") == 0) {
> +			NEXT_ARG();
> +
> +			ret = ct_parse_labels(*argv, n);
> +			if (ret) {
> +				fprintf(stderr, "ct: Illegal \"label\"\n");
> +				return -1;
> +			}
> +		} else if (matches(*argv, "help") == 0) {
> +			usage();
> +		} else {
> +			break;
> +		}
> +		NEXT_ARG_FWD();
> +	}
> +
> +	if (ct_action & TCA_CT_ACT_CLEAR &&
> +	    ct_action & ~TCA_CT_ACT_CLEAR) {
> +		fprintf(stderr, "ct: clear can only be used alone\n");
> +		return -1;
> +	}
> +
> +	if (ct_action & TCA_CT_ACT_NAT_SRC &&
> +	    ct_action & TCA_CT_ACT_NAT_DST) {
> +		fprintf(stderr, "ct: src and dst nat can't be used together\n");
> +		return -1;
> +	}
> +
> +	if ((ct_action & TCA_CT_ACT_COMMIT) &&
> +	    (ct_action & TCA_CT_ACT_NAT) &&
> +	    !(ct_action & (TCA_CT_ACT_NAT_SRC | TCA_CT_ACT_NAT_DST))) {
> +		fprintf(stderr, "ct: commit and nat must set src or dst\n");
> +		return -1;
> +	}
> +
> +	if (!(ct_action & TCA_CT_ACT_COMMIT) &&
> +	    (ct_action & (TCA_CT_ACT_NAT_SRC | TCA_CT_ACT_NAT_DST))) {
> +		fprintf(stderr, "ct: src or dst is only valid if commit is set\n");
> +		return -1;
> +	}
> +
> +	parse_action_control_dflt(&argc, &argv, &sel.action, false,
> +				  TC_ACT_PIPE);
> +	NEXT_ARG_FWD();
> +
> +	addattr16(n, MAX_MSG, TCA_CT_ACTION, ct_action);
> +	addattr_l(n, MAX_MSG, TCA_CT_PARMS, &sel, sizeof(sel));
> +	addattr_nest_end(n, tail);
> +
> +	*argc_p = argc;
> +	*argv_p = argv;
> +	return 0;
> +}
...

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf, libbpf: add a new API bpf_object__reuse_maps()
From: Andrii Nakryiko @ 2019-07-08 17:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Anton Protopopov, Alexei Starovoitov, Martin KaFai Lau, Song Liu,
	Yonghong Song, Networking, bpf, open list, Andrii Nakryiko
In-Reply-To: <734dd45a-95b0-a7fd-9e1d-0535ef4d3e12@iogearbox.net>

On Fri, Jul 5, 2019 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 07/05/2019 10:44 PM, Anton Protopopov wrote:
> > Add a new API bpf_object__reuse_maps() which can be used to replace all maps in
> > an object by maps pinned to a directory provided in the path argument.  Namely,
> > each map M in the object will be replaced by a map pinned to path/M.name.
> >
> > Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
> > ---
> >  tools/lib/bpf/libbpf.c   | 34 ++++++++++++++++++++++++++++++++++
> >  tools/lib/bpf/libbpf.h   |  2 ++
> >  tools/lib/bpf/libbpf.map |  1 +
> >  3 files changed, 37 insertions(+)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 4907997289e9..84c9e8f7bfd3 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -3144,6 +3144,40 @@ int bpf_object__unpin_maps(struct bpf_object *obj, const char *path)
> >       return 0;
> >  }
> >
> > +int bpf_object__reuse_maps(struct bpf_object *obj, const char *path)

As is, bpf_object__reuse_maps() can be easily implemented by user
applications, as it's only using public libbpf APIs, so I'm not 100%
sure we need to add method like that to libbpf.

> > +{
> > +     struct bpf_map *map;
> > +
> > +     if (!obj)
> > +             return -ENOENT;
> > +
> > +     if (!path)
> > +             return -EINVAL;
> > +
> > +     bpf_object__for_each_map(map, obj) {
> > +             int len, err;
> > +             int pinned_map_fd;
> > +             char buf[PATH_MAX];
>
> We'd need to skip the case of bpf_map__is_internal(map) since they are always
> recreated for the given object.
>
> > +             len = snprintf(buf, PATH_MAX, "%s/%s", path, bpf_map__name(map));
> > +             if (len < 0) {
> > +                     return -EINVAL;
> > +             } else if (len >= PATH_MAX) {
> > +                     return -ENAMETOOLONG;
> > +             }
> > +
> > +             pinned_map_fd = bpf_obj_get(buf);
> > +             if (pinned_map_fd < 0)
> > +                     return pinned_map_fd;
>
> Should we rather have a new map definition attribute that tells to reuse
> the map if it's pinned in bpf fs, and if not, we create it and later on
> pin it? This is what iproute2 is doing and which we're making use of heavily.

I'd like something like that as well. This would play nicely with
recently added BTF-defined maps as well.

I think it should be not just pin/don't pin flag, but rather pinning
strategy, to accommodate various typical strategies of handling maps
that are already pinned. So something like this:

1. BPF_PIN_NOTHING - default, don't pin;
2. BPF_PIN_EXCLUSIVE - pin, but if map is already pinned - fail;
3. BPF_PIN_SET - pin; if existing map exists, reset its state to be
exact state of object's map;
4. BPF_PIN_MERGE - pin, if map exists, fill in NULL entries only (this
is how Cilium is pinning PROG_ARRAY maps, if I understand correctly);
5. BPF_PIN_MERGE_OVERWRITE - pin, if map exists, overwrite non-NULL values.

This list is only for illustrative purposes, ideally people that have
a lot of experience using pinning for real-world use cases would chime
in on what strategies are useful and make sense.

> In bpf_object__reuse_maps() bailing out if bpf_obj_get() fails is perhaps
> too limiting for a generic API as new version of an object file may contain
> new maps which are not yet present in bpf fs at that point.
>
> > +             err = bpf_map__reuse_fd(map, pinned_map_fd);
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> >  int bpf_object__pin_programs(struct bpf_object *obj, const char *path)
> >  {
> >       struct bpf_program *prog;
> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > index d639f47e3110..7fe465a1be76 100644
> > --- a/tools/lib/bpf/libbpf.h
> > +++ b/tools/lib/bpf/libbpf.h
> > @@ -82,6 +82,8 @@ int bpf_object__variable_offset(const struct bpf_object *obj, const char *name,
> >  LIBBPF_API int bpf_object__pin_maps(struct bpf_object *obj, const char *path);
> >  LIBBPF_API int bpf_object__unpin_maps(struct bpf_object *obj,
> >                                     const char *path);
> > +LIBBPF_API int bpf_object__reuse_maps(struct bpf_object *obj,
> > +                                   const char *path);
> >  LIBBPF_API int bpf_object__pin_programs(struct bpf_object *obj,
> >                                       const char *path);
> >  LIBBPF_API int bpf_object__unpin_programs(struct bpf_object *obj,
> > diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> > index 2c6d835620d2..66a30be6696c 100644
> > --- a/tools/lib/bpf/libbpf.map
> > +++ b/tools/lib/bpf/libbpf.map
> > @@ -172,5 +172,6 @@ LIBBPF_0.0.4 {
> >               btf_dump__new;
> >               btf__parse_elf;
> >               bpf_object__load_xattr;
> > +             bpf_object__reuse_maps;
> >               libbpf_num_possible_cpus;
> >  } LIBBPF_0.0.3;
> >
>

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-07-08 17:51 UTC (permalink / raw)
  To: Paul Moore
  Cc: Tycho Andersen, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhR4fudQanvZGYWMvCf7k2CU3q7e7n1Pi7hzC3v_zpVEdw@mail.gmail.com>

On 2019-05-29 11:29, Paul Moore wrote:
> On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > It is not permitted to unset the audit container identifier.
> > > A child inherits its parent's audit container identifier.
> >
> > ...
> >
> > >  /**
> > > + * audit_set_contid - set current task's audit contid
> > > + * @contid: contid value
> > > + *
> > > + * Returns 0 on success, -EPERM on permission failure.
> > > + *
> > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > + */
> > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > +{
> > > +     u64 oldcontid;
> > > +     int rc = 0;
> > > +     struct audit_buffer *ab;
> > > +     uid_t uid;
> > > +     struct tty_struct *tty;
> > > +     char comm[sizeof(current->comm)];
> > > +
> > > +     task_lock(task);
> > > +     /* Can't set if audit disabled */
> > > +     if (!task->audit) {
> > > +             task_unlock(task);
> > > +             return -ENOPROTOOPT;
> > > +     }
> > > +     oldcontid = audit_get_contid(task);
> > > +     read_lock(&tasklist_lock);
> > > +     /* Don't allow the audit containerid to be unset */
> > > +     if (!audit_contid_valid(contid))
> > > +             rc = -EINVAL;
> > > +     /* if we don't have caps, reject */
> > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > +             rc = -EPERM;
> > > +     /* if task has children or is not single-threaded, deny */
> > > +     else if (!list_empty(&task->children))
> > > +             rc = -EBUSY;
> > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > +             rc = -EALREADY;
> > > +     read_unlock(&tasklist_lock);
> > > +     if (!rc)
> > > +             task->audit->contid = contid;
> > > +     task_unlock(task);
> > > +
> > > +     if (!audit_enabled)
> > > +             return rc;
> >
> > ...but it is allowed to change it (assuming
> > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > immediately useful since we still live in the world of majority
> > privileged containers if we didn't allow changing it, in addition to
> > un-setting it.
> 
> The idea is that only container orchestrators should be able to
> set/modify the audit container ID, and since setting the audit
> container ID can have a significant effect on the records captured
> (and their routing to multiple daemons when we get there) modifying
> the audit container ID is akin to modifying the audit configuration
> which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> is that you would only change the audit container ID from one
> set/inherited value to another if you were nesting containers, in
> which case the nested container orchestrator would need to be granted
> CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> compromise).  We did consider allowing for a chain of nested audit
> container IDs, but the implications of doing so are significant
> (implementation mess, runtime cost, etc.) so we are leaving that out
> of this effort.

We had previously discussed the idea of restricting
orchestrators/engines from only being able to set the audit container
identifier on their own descendants, but it was discarded.  I've added a
check to ensure this is now enforced.

I've also added a check to ensure that a process can't set its own audit
container identifier and that if the identifier is already set, then the
orchestrator/engine must be in a descendant user namespace from the
orchestrator that set the previously inherited audit container
identifier.

> From a practical perspective, un-setting the audit container ID is
> pretty much the same as changing it from one set value to another so
> most of the above applies to that case as well.
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox