Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-30 21:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, edumazet, borisp, aviadye, davejwatson,
	john.fastabend, daniel, willemb, xiyou.wangcong, fw,
	alexei.starovoitov, Jakub Kicinski

sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
I'm sending this for net-next because of lack of confidence
in my own abilities. It should apply cleanly to net... :)

 Documentation/networking/tls-offload.rst |  9 --------
 include/net/sock.h                       | 28 +++++++++++++++++++++++-
 net/core/skbuff.c                        |  3 +++
 net/core/sock.c                          | 22 ++++++++++++++-----
 net/ipv4/tcp.c                           |  4 +++-
 net/ipv4/tcp_output.c                    |  3 +++
 net/tls/tls_device.c                     |  2 ++
 7 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 048e5ca44824..2bc3ab5515d8 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
 Known bugs
 ==========
 
-skb_orphan() leaks clear text
------------------------------
-
-Currently drivers depend on the :c:member:`sk` member of
-:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
-encryption. Any operation which removes or does not preserve the socket
-association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
-will cause the driver to miss the packets and lead to clear text leaks.
-
 Redirects leak clear text
 -------------------------
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 228db3998e46..90f3f552c789 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -814,6 +814,7 @@ enum sock_flags {
 	SOCK_TXTIME,
 	SOCK_XDP, /* XDP is attached */
 	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
+	SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
 	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
 }
 
+static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+#endif
+}
+
+static inline bool sk_tx_crypto_match(const struct sock *sk,
+				      const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	return sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT) == !!skb->decrypted;
+#else
+	return true;
+#endif
+}
+
 /* Checks if this SKB belongs to an HW offloaded socket
  * and whether any SW fallbacks are required based on dev.
+ * Check decrypted mark in case skb_orphan() cleared socket.
  */
 static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 						   struct net_device *dev)
@@ -2489,8 +2508,15 @@ static inline struct sk_buff *sk_validate_xmit_skb(struct sk_buff *skb,
 #ifdef CONFIG_SOCK_VALIDATE_XMIT
 	struct sock *sk = skb->sk;
 
-	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb)
+	if (sk && sk_fullsock(sk) && sk->sk_validate_xmit_skb) {
 		skb = sk->sk_validate_xmit_skb(sk, dev, skb);
+#ifdef CONFIG_TLS_DEVICE
+	} else if (unlikely(skb->decrypted)) {
+		pr_warn_ratelimited("unencrypted skb with no associated socket - dropping\n");
+		kfree_skb(skb);
+		skb = NULL;
+#endif
+	}
 #endif
 
 	return skb;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0b788df5a75b..9e92684479b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3794,6 +3794,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			skb_reserve(nskb, headroom);
 			__skb_put(nskb, doffset);
+#ifdef CONFIG_TLS_DEVICE
+			nskb->decrypted = head_skb->decrypted;
+#endif
 		}
 
 		if (segs)
diff --git a/net/core/sock.c b/net/core/sock.c
index d57b0cc995a0..b0c10b518e65 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL(skb_set_owner_w);
 
+static bool can_skb_orphan_partial(const struct sk_buff *skb)
+{
+#ifdef CONFIG_TLS_DEVICE
+	/* Drivers depend on in-order delivery for crypto offload,
+	 * partial orphan breaks out-of-order-OK logic.
+	 */
+	if (skb->decrypted)
+		return false;
+#endif
+#ifdef CONFIG_INET
+	if (skb->destructor == tcp_wfree)
+		return true;
+#endif
+	return skb->destructor == sock_wfree;
+}
+
 /* This helper is used by netem, as it can hold packets in its
  * delay queue. We want to allow the owner socket to send more
  * packets, as if they were already TX completed by a typical driver.
@@ -2003,11 +2019,7 @@ void skb_orphan_partial(struct sk_buff *skb)
 	if (skb_is_tcp_pure_ack(skb))
 		return;
 
-	if (skb->destructor == sock_wfree
-#ifdef CONFIG_INET
-	    || skb->destructor == tcp_wfree
-#endif
-		) {
+	if (can_skb_orphan_partial(skb)) {
 		struct sock *sk = skb->sk;
 
 		if (refcount_inc_not_zero(&sk->sk_refcnt)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f62f0e7e3cdd..dee93eab02fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			if (!skb)
 				goto wait_for_memory;
 
+			sk_set_tx_crypto(sk, skb);
 			skb_entail(sk, skb);
 			copy = size_goal;
 		}
@@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
-		if (!can_coalesce && i >= sysctl_max_skb_frags) {
+		if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
+		    !sk_tx_crypto_match(sk, skb)) {
 			tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4afc48d7bb..9efd0ca44d49 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
 	if (!buff)
 		return -ENOMEM; /* We'll just try again later. */
+	sk_set_tx_crypto(sk, buff);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -1874,6 +1875,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 	buff = sk_stream_alloc_skb(sk, 0, gfp, true);
 	if (unlikely(!buff))
 		return -ENOMEM;
+	sk_set_tx_crypto(sk, buff);
 
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
@@ -2139,6 +2141,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
 	if (!nskb)
 		return -1;
+	sk_set_tx_crypto(sk, nskb);
 	sk->sk_wmem_queued += nskb->truesize;
 	sk_mem_charge(sk, nskb->truesize);
 
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 4ec8a06fa5d1..3d78742b3b1b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -970,6 +970,8 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 
 	tls_device_attach(ctx, sk, netdev);
 
+	sock_set_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
+
 	/* following this assignment tls_is_sk_tx_device_offloaded
 	 * will return true and the context might be accessed
 	 * by the netdev's xmit function.
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] net: usb: pegasus: fix improper read if get_registers() fail
From: David Miller @ 2019-07-30 21:10 UTC (permalink / raw)
  To: kirjanov; +Cc: kda, petkan, netdev
In-Reply-To: <CAHj3AVm2EZB7n9UBxiBmA+6XN+EgAC_FRoHjh6kO3WMT8KVd6g@mail.gmail.com>

From: Denis Kirjanov <kirjanov@gmail.com>
Date: Tue, 30 Jul 2019 21:19:46 +0300

> On Tuesday, July 30, 2019, David Miller <davem@davemloft.net> wrote:
> 
>> From: Denis Kirjanov <kda@linux-powerpc.org>
>> Date: Tue, 30 Jul 2019 15:13:57 +0200
>>
>> > get_registers() may fail with -ENOMEM and in this
>> > case we can read a garbage from the status variable tmp.
>> >
>> > Reported-by: syzbot+3499a83b2d062ae409d4@syzkaller.appspotmail.com
>> > Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
>>
>> Why did you post this patch twice?  What is different between the two
>> versions?
> 
> 
> Looks like it’s the issue with git send-email :/
> Do you want me to figure out the reason and resend?

No need, I was just curious.

^ permalink raw reply

* [PATCH bpf-next v2] tools: bpftool: add support for reporting the effective cgroup progs
From: Jakub Kicinski @ 2019-07-30 21:03 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, bpf, oss-drivers, ctakshak, kernel-team, Jakub Kicinski,
	Quentin Monnet

Takshak said in the original submission:

With different bpf attach_flags available to attach bpf programs specially
with BPF_F_ALLOW_OVERRIDE and BPF_F_ALLOW_MULTI, the list of effective
bpf-programs available to any sub-cgroups really needs to be available for
easy debugging.

Using BPF_F_QUERY_EFFECTIVE flag, one can get the list of not only attached
bpf-programs to a cgroup but also the inherited ones from parent cgroup.

So a new option is introduced to use BPF_F_QUERY_EFFECTIVE query flag here
to list all the effective bpf-programs available for execution at a specified
cgroup.

Reused modified test program test_cgroup_attach from tools/testing/selftests/bpf:
  # ./test_cgroup_attach

With old bpftool:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/
  ID       AttachType      AttachFlags     Name
  271      egress          multi           pkt_cntr_1
  272      egress          multi           pkt_cntr_2

Attached new program pkt_cntr_4 in cg2 gives following:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4

And with new "effective" option it shows all effective programs for cg2:

 # bpftool cgroup show /sys/fs/cgroup/cgroup-test-work-dir/cg1/cg2 effective
  ID       AttachType      AttachFlags     Name
  273      egress          override        pkt_cntr_4
  271      egress          override        pkt_cntr_1
  272      egress          override        pkt_cntr_2

Compared to original submission use a local flag instead of global
option.

We need to clear query_flags on every command, in case batch mode
wants to use varying settings.

v2: (Takshak)
 - forbid duplicated flags;
 - fix cgroup path freeing.

Signed-off-by: Takshak Chahande <ctakshak@fb.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 .../bpftool/Documentation/bpftool-cgroup.rst  | 16 +++-
 tools/bpf/bpftool/bash-completion/bpftool     | 15 ++--
 tools/bpf/bpftool/cgroup.c                    | 83 ++++++++++++-------
 3 files changed, 76 insertions(+), 38 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index 585f270c2d25..06a28b07787d 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -20,8 +20,8 @@ SYNOPSIS
 CGROUP COMMANDS
 ===============
 
-|	**bpftool** **cgroup { show | list }** *CGROUP*
-|	**bpftool** **cgroup tree** [*CGROUP_ROOT*]
+|	**bpftool** **cgroup { show | list }** *CGROUP* [**effective**]
+|	**bpftool** **cgroup tree** [*CGROUP_ROOT*] [**effective**]
 |	**bpftool** **cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
 |	**bpftool** **cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
 |	**bpftool** **cgroup help**
@@ -35,13 +35,17 @@ CGROUP COMMANDS
 
 DESCRIPTION
 ===========
-	**bpftool cgroup { show | list }** *CGROUP*
+	**bpftool cgroup { show | list }** *CGROUP* [**effective**]
 		  List all programs attached to the cgroup *CGROUP*.
 
 		  Output will start with program ID followed by attach type,
 		  attach flags and program name.
 
-	**bpftool cgroup tree** [*CGROUP_ROOT*]
+		  If **effective** is specified retrieve effective programs that
+		  will execute for events within a cgroup. This includes
+		  inherited along with attached ones.
+
+	**bpftool cgroup tree** [*CGROUP_ROOT*] [**effective**]
 		  Iterate over all cgroups in *CGROUP_ROOT* and list all
 		  attached programs. If *CGROUP_ROOT* is not specified,
 		  bpftool uses cgroup v2 mountpoint.
@@ -50,6 +54,10 @@ DESCRIPTION
 		  commands: it starts with absolute cgroup path, followed by
 		  program ID, attach type, attach flags and program name.
 
+		  If **effective** is specified retrieve effective programs that
+		  will execute for events within a cgroup. This includes
+		  inherited along with attached ones.
+
 	**bpftool cgroup attach** *CGROUP* *ATTACH_TYPE* *PROG* [*ATTACH_FLAGS*]
 		  Attach program *PROG* to the cgroup *CGROUP* with attach type
 		  *ATTACH_TYPE* and optional *ATTACH_FLAGS*.
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index 6b961a5ed100..df16c5415444 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -710,12 +710,15 @@ _bpftool()
             ;;
         cgroup)
             case $command in
-                show|list)
-                    _filedir
-                    return 0
-                    ;;
-                tree)
-                    _filedir
+                show|list|tree)
+                    case $cword in
+                        3)
+                            _filedir
+                            ;;
+                        4)
+                            COMPREPLY=( $( compgen -W 'effective' -- "$cur" ) )
+                            ;;
+                    esac
                     return 0
                     ;;
                 attach|detach)
diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index f3c05b08c68c..44352b5aca85 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -29,6 +29,8 @@
 	"                        recvmsg4 | recvmsg6 | sysctl |\n"	       \
 	"                        getsockopt | setsockopt }"
 
+static unsigned int query_flags;
+
 static const char * const attach_type_strings[] = {
 	[BPF_CGROUP_INET_INGRESS] = "ingress",
 	[BPF_CGROUP_INET_EGRESS] = "egress",
@@ -107,7 +109,8 @@ static int count_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type)
 	__u32 prog_cnt = 0;
 	int ret;
 
-	ret = bpf_prog_query(cgroup_fd, type, 0, NULL, NULL, &prog_cnt);
+	ret = bpf_prog_query(cgroup_fd, type, query_flags, NULL,
+			     NULL, &prog_cnt);
 	if (ret)
 		return -1;
 
@@ -125,8 +128,8 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
 	int ret;
 
 	prog_cnt = ARRAY_SIZE(prog_ids);
-	ret = bpf_prog_query(cgroup_fd, type, 0, &attach_flags, prog_ids,
-			     &prog_cnt);
+	ret = bpf_prog_query(cgroup_fd, type, query_flags, &attach_flags,
+			     prog_ids, &prog_cnt);
 	if (ret)
 		return ret;
 
@@ -158,20 +161,34 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
 static int do_show(int argc, char **argv)
 {
 	enum bpf_attach_type type;
+	const char *path;
 	int cgroup_fd;
 	int ret = -1;
 
-	if (argc < 1) {
-		p_err("too few parameters for cgroup show");
-		goto exit;
-	} else if (argc > 1) {
-		p_err("too many parameters for cgroup show");
-		goto exit;
+	query_flags = 0;
+
+	if (!REQ_ARGS(1))
+		return -1;
+	path = GET_ARG();
+
+	while (argc) {
+		if (is_prefix(*argv, "effective")) {
+			if (query_flags & BPF_F_QUERY_EFFECTIVE) {
+				p_err("duplicated argument: %s", *argv);
+				return -1;
+			}
+			query_flags |= BPF_F_QUERY_EFFECTIVE;
+			NEXT_ARG();
+		} else {
+			p_err("expected no more arguments, 'effective', got: '%s'?",
+			      *argv);
+			return -1;
+		}
 	}
 
-	cgroup_fd = open(argv[0], O_RDONLY);
+	cgroup_fd = open(path, O_RDONLY);
 	if (cgroup_fd < 0) {
-		p_err("can't open cgroup %s", argv[0]);
+		p_err("can't open cgroup %s", path);
 		goto exit;
 	}
 
@@ -294,26 +311,37 @@ static char *find_cgroup_root(void)
 
 static int do_show_tree(int argc, char **argv)
 {
-	char *cgroup_root;
+	char *cgroup_root, *cgroup_alloced = NULL;
 	int ret;
 
-	switch (argc) {
-	case 0:
-		cgroup_root = find_cgroup_root();
-		if (!cgroup_root) {
+	query_flags = 0;
+
+	if (!argc) {
+		cgroup_alloced = find_cgroup_root();
+		if (!cgroup_alloced) {
 			p_err("cgroup v2 isn't mounted");
 			return -1;
 		}
-		break;
-	case 1:
-		cgroup_root = argv[0];
-		break;
-	default:
-		p_err("too many parameters for cgroup tree");
-		return -1;
+		cgroup_root = cgroup_alloced;
+	} else {
+		cgroup_root = GET_ARG();
+
+		while (argc) {
+			if (is_prefix(*argv, "effective")) {
+				if (query_flags & BPF_F_QUERY_EFFECTIVE) {
+					p_err("duplicated argument: %s", *argv);
+					return -1;
+				}
+				query_flags |= BPF_F_QUERY_EFFECTIVE;
+				NEXT_ARG();
+			} else {
+				p_err("expected no more arguments, 'effective', got: '%s'?",
+				      *argv);
+				return -1;
+			}
+		}
 	}
 
-
 	if (json_output)
 		jsonw_start_array(json_wtr);
 	else
@@ -338,8 +366,7 @@ static int do_show_tree(int argc, char **argv)
 	if (json_output)
 		jsonw_end_array(json_wtr);
 
-	if (argc == 0)
-		free(cgroup_root);
+	free(cgroup_alloced);
 
 	return ret;
 }
@@ -459,8 +486,8 @@ static int do_help(int argc, char **argv)
 	}
 
 	fprintf(stderr,
-		"Usage: %s %s { show | list } CGROUP\n"
-		"       %s %s tree [CGROUP_ROOT]\n"
+		"Usage: %s %s { show | list } CGROUP [**effective**]\n"
+		"       %s %s tree [CGROUP_ROOT] [**effective**]\n"
 		"       %s %s attach CGROUP ATTACH_TYPE PROG [ATTACH_FLAGS]\n"
 		"       %s %s detach CGROUP ATTACH_TYPE PROG\n"
 		"       %s %s help\n"
-- 
2.21.0


^ permalink raw reply related

* Re: [patch net-next 3/3] netdevsim: create devlink and netdev instances in namespace
From: Jiri Pirko @ 2019-07-30 21:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730101411.7dc1e83d@cakuba.netronome.com>

Tue, Jul 30, 2019 at 07:14:11PM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 30 Jul 2019 08:06:55 +0200, Jiri Pirko wrote:
>> >> diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
>> >> index 79c05af2a7c0..cdf53d0e0c49 100644
>> >> --- a/drivers/net/netdevsim/netdevsim.h
>> >> +++ b/drivers/net/netdevsim/netdevsim.h
>> >> @@ -19,6 +19,7 @@
>> >>  #include <linux/netdevice.h>
>> >>  #include <linux/u64_stats_sync.h>
>> >>  #include <net/devlink.h>
>> >> +#include <net/net_namespace.h>  
>> >
>> >You can just do a forward declaration, no need to pull in the header.  
>> 
>> Sure, but why?
>
>Less time to compile the kernel after net_namespace.h was touched.
>Don't we all spend more time that we would like to recompiling the
>kernel? :(  Not a huge deal if you have a strong preference.

I removed it in v2. I don't care that much either :)

^ permalink raw reply

* Re: [PATCH bpf-next] tools: bpftool: add support for reporting the effective cgroup progs
From: Jakub Kicinski @ 2019-07-30 21:00 UTC (permalink / raw)
  To: Takshak Chahande
  Cc: alexei.starovoitov@gmail.com, daniel@iogearbox.net,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	oss-drivers@netronome.com, Kernel Team, Quentin Monnet
In-Reply-To: <20190730180443.GA48276@ctakshak-mbp.dhcp.thefacebook.com>

On Tue, 30 Jul 2019 18:04:53 +0000, Takshak Chahande wrote:
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote on Mon [2019-Jul-29 14:35:38 -0700]:
> > @@ -158,20 +161,30 @@ static int show_attached_bpf_progs(int cgroup_fd, enum bpf_attach_type type,
> >  static int do_show(int argc, char **argv)
> >  {
> >  	enum bpf_attach_type type;
> > +	const char *path;
> >  	int cgroup_fd;
> >  	int ret = -1;
> >  
> > -	if (argc < 1) {
> > -		p_err("too few parameters for cgroup show");
> > -		goto exit;
> > -	} else if (argc > 1) {
> > -		p_err("too many parameters for cgroup show");
> > -		goto exit;
> > +	query_flags = 0;
> > +
> > +	if (!REQ_ARGS(1))
> > +		return -1;
> > +	path = GET_ARG();
> > +
> > +	while (argc) {
> > +		if (is_prefix(*argv, "effective")) {
> > +			query_flags |= BPF_F_QUERY_EFFECTIVE;
> > +			NEXT_ARG();
> > +		} else {
> > +			p_err("expected no more arguments, 'effective', got: '%s'?",
> > +			      *argv);
> > +			return -1;
> > +		}
> >  	}  
> This while loop will allow multiple 'effective' keywords in the argument
> unnecessarily. IMO, we should strictly restrict only for single
> occurance of 'effective' word.

It's kind of the way all bpftool works to date :(

But perhaps not checking is worse than inconsistency? Okay, let's fix
this up.

> > -	cgroup_fd = open(argv[0], O_RDONLY);
> > +	cgroup_fd = open(path, O_RDONLY);
> >  	if (cgroup_fd < 0) {
> > -		p_err("can't open cgroup %s", argv[0]);
> > +		p_err("can't open cgroup %s", path);
> >  		goto exit;
> >  	}
> >  
> > @@ -297,23 +310,29 @@ static int do_show_tree(int argc, char **argv)
> >  	char *cgroup_root;
> >  	int ret;
> >  
> > -	switch (argc) {
> > -	case 0:
> > +	query_flags = 0;
> > +
> > +	if (!argc) {
> >  		cgroup_root = find_cgroup_root();
> >  		if (!cgroup_root) {
> >  			p_err("cgroup v2 isn't mounted");
> >  			return -1;
> >  		}
> > -		break;
> > -	case 1:
> > -		cgroup_root = argv[0];
> > -		break;
> > -	default:
> > -		p_err("too many parameters for cgroup tree");
> > -		return -1;
> > +	} else {
> > +		cgroup_root = GET_ARG();
> > +
> > +		while (argc) {
> > +			if (is_prefix(*argv, "effective")) {
> > +				query_flags |= BPF_F_QUERY_EFFECTIVE;
> > +				NEXT_ARG();  
> 
> NEXT_ARG() does update argc value; that means after this outer if/else we need 
> to know how argc has become 0 (through which path) before freeing up `cgroup_root` allocated
> memory later at the end of this function.

Good catch!

> > +			} else {
> > +				p_err("expected no more arguments, 'effective', got: '%s'?",
> > +				      *argv);
> > +				return -1;
> > +			}
> > +		}
> >  	}  
 
> Thanks for the patch. Apart from above two issues, patch looks good.

Thanks for the review.

^ permalink raw reply

* [PATCH v4 3/3] net/xdp: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-07-30 20:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams,
	Darrick J . Wong, Dave Chinner, Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jerome Glisse, Kirill A . Shutemov,
	Matthew Wilcox, Michal Hocko, Mike Marciniszyn, Mike Rapoport,
	linux-block, linux-fsdevel, linux-mm, linux-xfs, LKML,
	John Hubbard, Björn Töpel, Magnus Karlsson,
	David S . Miller, netdev
In-Reply-To: <20190730205705.9018-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page() or
release_pages().

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/xdp/xdp_umem.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 83de74ca729a..17c4b3d3dc34 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -166,14 +166,7 @@ void xdp_umem_clear_dev(struct xdp_umem *umem)
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
-	unsigned int i;
-
-	for (i = 0; i < umem->npgs; i++) {
-		struct page *page = umem->pgs[i];
-
-		set_page_dirty_lock(page);
-		put_page(page);
-	}
+	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
 
 	kfree(umem->pgs);
 	umem->pgs = NULL;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jens Axboe @ 2019-07-30 20:49 UTC (permalink / raw)
  To: Jonathan Lemon, willy@infradead.org, davem@davemloft.net,
	jakub.kicinski@netronome.com
  Cc: Kernel Team, netdev@vger.kernel.org
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>

On 7/30/19 8:40 AM, Jonathan Lemon wrote:
> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset.  Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.

You can add:

Reviewed-by: Jens Axboe <axboe@kernel.dk>

Pretty appalled to see this abomination:

net: Convert skb_frag_t to bio_vec

There are a lot of users of frag->page_offset, so use a union
to avoid converting those users today.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

show up in the net tree without even having been posted on a
block list...

At least this kills this ugly thing.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 4/4] net: dsa: mv88e6xxx: add PTP support for MV88E6250 family
From: Vladimir Oltean @ 2019-07-30 20:46 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Hubert Feurstein, netdev, lkml, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, David S. Miller, Rasmus Villemoes
In-Reply-To: <20190730171246.GB1251@localhost>

Hi Hubert, Richard,

On Tue, 30 Jul 2019 at 19:44, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Tue, Jul 30, 2019 at 12:04:29PM +0200, Hubert Feurstein wrote:
> > diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
> > index 768d256f7c9f..51cdf4712517 100644
> > --- a/drivers/net/dsa/mv88e6xxx/ptp.c
> > +++ b/drivers/net/dsa/mv88e6xxx/ptp.c
> > @@ -15,11 +15,38 @@
> >  #include "hwtstamp.h"
> >  #include "ptp.h"
> >
> > -/* Raw timestamps are in units of 8-ns clock periods. */
> > -#define CC_SHIFT     28
> > -#define CC_MULT              (8 << CC_SHIFT)
> > -#define CC_MULT_NUM  (1 << 9)
> > -#define CC_MULT_DEM  15625ULL
> > +/* The adjfine API clamps ppb between [-32,768,000, 32,768,000], and
>
> That is not true.
>

I was referring to this:
https://github.com/richardcochran/linuxptp/blob/master/phc.c#L38

/*
* On 32 bit platforms, the PHC driver's maximum adjustment (type
* 'int' in units of ppb) can overflow the timex.freq field (type
* 'long'). So in this case we clamp the maximum to the largest
* possible adjustment that fits into a 32 bit long.
*/
#define BITS_PER_LONG (sizeof(long)*8)
#define MAX_PPB_32 32767999 /* 2^31 - 1 / 65.536 */

Technically it is not "not true".

[snip]

On Tue, 30 Jul 2019 at 21:09, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Tue, Jul 30, 2019 at 06:20:00PM +0200, Hubert Feurstein wrote:
> > > Please don't re-write this logic.  It is written like that for a reason.
> > I used the sja1105_ptp.c as a reference. So it is also wrong there.
>
> I'll let that driver's author worry about that.
>
> Thanks,
> Richard
>

And what is the reason for the neg_adj thing? Can you give an example
of when does the "normal way" of doing signed arithmetics not work?

Thanks,
-Vladimir

^ permalink raw reply

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Michael S. Tsirkin @ 2019-07-30 20:42 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190730093539.dcksure3vrykir3g@steredhat>

On Tue, Jul 30, 2019 at 11:35:39AM +0200, Stefano Garzarella wrote:
> On Mon, Jul 29, 2019 at 03:10:15PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 29, 2019 at 06:50:56PM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 06:19:03PM +0200, Stefano Garzarella wrote:
> > > > On Mon, Jul 29, 2019 at 11:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 29, 2019 at 05:36:56PM +0200, Stefano Garzarella wrote:
> > > > > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > > > > and pushed to the guest using the vring, are directly queued in
> > > > > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > > > > with a fixed size (4 KB).
> > > > > > > > 
> > > > > > > > The maximum amount of memory used by each socket should be
> > > > > > > > controlled by the credit mechanism.
> > > > > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > > > > to avoid starvation of other sockets.
> > > > > > > > 
> > > > > > > > This patch mitigates this issue copying the payload of small
> > > > > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > > > > order to avoid wasting memory.
> > > > > > > > 
> > > > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > > > > > 
> > > > > > > This is good enough for net-next, but for net I think we
> > > > > > > should figure out how to address the issue completely.
> > > > > > > Can we make the accounting precise? What happens to
> > > > > > > performance if we do?
> > > > > > > 
> > > > > > 
> > > > > > In order to do more precise accounting maybe we can use the buffer size,
> > > > > > instead of payload size when we update the credit available.
> > > > > > In this way, the credit available for each socket will reflect the memory
> > > > > > actually used.
> > > > > > 
> > > > > > I should check better, because I'm not sure what happen if the peer sees
> > > > > > 1KB of space available, then it sends 1KB of payload (using a 4KB
> > > > > > buffer).
> > > > > > 
> > > > > > The other option is to copy each packet in a new buffer like I did in
> > > > > > the v2 [2], but this forces us to make a copy for each packet that does
> > > > > > not fill the entire buffer, perhaps too expensive.
> > > > > > 
> > > > > > [2] https://patchwork.kernel.org/patch/10938741/
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > Stefano
> > > > > 
> > > > > Interesting. You are right, and at some level the protocol forces copies.
> > > > > 
> > > > > We could try to detect that the actual memory is getting close to
> > > > > admin limits and force copies on queued packets after the fact.
> > > > > Is that practical?
> > > > 
> > > > Yes, I think it is doable!
> > > > We can decrease the credit available with the buffer size queued, and
> > > > when the buffer size of packet to queue is bigger than the credit
> > > > available, we can copy it.
> > > > 
> > > > > 
> > > > > And yes we can extend the credit accounting to include buffer size.
> > > > > That's a protocol change but maybe it makes sense.
> > > > 
> > > > Since we send to the other peer the credit available, maybe this
> > > > change can be backwards compatible (I'll check better this).
> > > 
> > > What I said was wrong.
> > > 
> > > We send a counter (increased when the user consumes the packets) and the
> > > "buf_alloc" (the max memory allowed) to the other peer.
> > > It makes a difference between a local counter (increased when the
> > > packets are sent) and the remote counter to calculate the credit available:
> > > 
> > >     u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
> > >     {
> > >     	u32 ret;
> > > 
> > >     	spin_lock_bh(&vvs->tx_lock);
> > >     	ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->peer_fwd_cnt);
> > >     	if (ret > credit)
> > >     		ret = credit;
> > >     	vvs->tx_cnt += ret;
> > >     	spin_unlock_bh(&vvs->tx_lock);
> > > 
> > >     	return ret;
> > >     }
> > > 
> > > Maybe I can play with "buf_alloc" to take care of bytes queued but not
> > > used.
> > > 
> > > Thanks,
> > > Stefano
> > 
> > Right. And the idea behind it all was that if we send a credit
> > to remote then we have space for it.
> 
> Yes.
> 
> > I think the basic idea was that if we have actual allocated
> > memory and can copy data there, then we send the credit to
> > remote.
> > 
> > Of course that means an extra copy every packet.
> > So as an optimization, it seems that we just assume
> > that we will be able to allocate a new buffer.
> 
> Yes, we refill the virtqueue when half of the buffers were used.
> 
> > 
> > First this is not the best we can do. We can actually do
> > allocate memory in the socket before sending credit.
> 
> In this case, IIUC we should allocate an entire buffer (4KB),
> so we can reuse it if the packet is big.
> 
> > If packet is small then we copy it there.
> > If packet is big then we queue the packet,
> > take the buffer out of socket and add it to the virtqueue.
> > 
> > Second question is what to do about medium sized packets.
> > Packet is 1K but buffer is 4K, what do we do?
> > And here I wonder - why don't we add the 3K buffer
> > to the vq?
> 
> This would allow us to have an accurate credit account.
> 
> The problem here is the compatibility. Before this series virtio-vsock
> and vhost-vsock modules had the RX buffer size hard-coded
> (VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE = 4K). So, if we send a buffer smaller
> of 4K, there might be issues.

Shouldn't be if they are following the spec. If not let's fix
the broken parts.

> 
> Maybe it is the time to add add 'features' to virtio-vsock device.
> 
> Thanks,
> Stefano

Why would a remote care about buffer sizes?

Let's first see what the issues are. If they exist
we can either fix the bugs, or code the bug as a feature in spec.

-- 
MST

^ permalink raw reply

* Re: [PATCH v2 0/3 net-next] Finish conversion of skb_frag_t to bio_vec
From: Jakub Kicinski @ 2019-07-30 20:39 UTC (permalink / raw)
  To: Jonathan Lemon; +Cc: willy, davem, kernel-team, netdev
In-Reply-To: <20190730144034.444022-1-jonathan.lemon@gmail.com>

On Tue, 30 Jul 2019 07:40:31 -0700, Jonathan Lemon wrote:
> The recent conversion of skb_frag_t to bio_vec did not include
> skb_frag's page_offset.  Add accessor functions for this field,
> utilize them, and remove the union, restoring the original structure.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thanks!

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <369476A8-4CE1-43DA-9239-06437C0384C7@fb.com>

On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> > On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
> >>>>>
> >>>>
> >>>> Well, yes. sys_bpf() is pretty powerful.
> >>>>
> >>>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>>> the meanwhile, such users should not take down the whole system easily
> >>>> by accident, e.g., with rm -rf /.
> >>>
> >>> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >>
> >> This is a great idea! fscaps + /etc/bpfusers should do the trick.
> >
> > After some discussions and more thinking on this, I have some concerns
> > with the user space only approach.
> >
> > IIUC, your proposal for user space only approach is like:
> >
> > 1. bpftool (and other tools) check /etc/bpfusers and only do
> >   setuid for allowed users:
> >
> >       int main()
> >       {
> >               if (/* uid in /etc/bpfusers */)
> >                       setuid(0);
> >               sys_bpf(...);
> >       }
> >
> > 2. bpftool (and other tools) is installed with CAP_SETUID:
> >
> >       setcap cap_setuid=e+p /bin/bpftool
> >
> > 3. sys admin maintains proper /etc/bpfusers.
> >
> > This approach is not ideal, because we need to trust the tool to give
> > it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> > or use other root only sys calls after setuid(0).
> >
>
> I would like more comments on this.
>
> Currently, bpf permission is more or less "root or nothing", which we
> would like to change.
>
> The short term goal is to separate bpf from root, in other words, it is
> "all or nothing". Special user space utilities, such as systemd, would
> benefit from this. Once this is implemented, systemd can call sys_bpf()
> when it is not running as root.

As generally nasty as Linux capabilities are, this sounds like a good
use for CAP_BPF_ADMIN.

But what do you have in mind?  Isn't non-root systemd mostly just the
user systemd session?  That should *not* have bpf() privileges until
bpf() is improved such that you can't use it to compromise the system.

>
> In longer term, it may be useful to provide finer grain permission of
> sys_bpf(). For example, sys_bpf() should be aware of containers; and
> user may only have access to certain bpf maps. Let's call this
> "fine grain" capability.
>
>
> Since we are seeing new use cases every year, we will need many
> iterations to implement the fine grain permission. I think we need an
> API that is flexible enough to cover different types of permission
> control.
>
> For example, bpf_with_cap() can be flexible:
>
>         bpf_with_cap(cmd, attr, size, perm_fd);
>
> We can get different types of permission via different combinations of
> arguments:
>
>     A perm_fd to /dev/bpf gives access to all sys_bpf() commands, so
>     this is "all or nothing" permission.
>
>     A perm_fd to /sys/fs/cgroup/.../bpf.xxx would only allow some
>     commands to this specific cgroup.
>

I don't see why you need to invent a whole new mechanism for this.
The entire cgroup ecosystem outside bpf() does just fine using the
write permission on files in cgroupfs to control access.  Why can't
bpf() do the same thing?

>
> Alexei raised another idea in offline discussions: instead of adding
> bpf_with_cap(), we add a command LOAD_PERM_FD, which enables special
> permission for the _next_ sys_bpf() from current task:
>
>     bpf(LOAD_PERM_FD, perm_fd);
>     /* the next sys_bpf() uses permission from perm_fd */
>     bpf(cmd, attr, size);
>
> This is equivalent to bpf_with_cap(cmd, attr, size, perm_fd), but
> doesn't require the new sys call.

That sounds almost every bit as problematic as the approach where you
ask for permission once and it sticks.

>
> 1. User space only approach doesn't work, even for "all or nothing"
>    permission control. I expanded the discussion in the previous
>    email. Please let me know if I missed anything there.

As in my previous email, I disagree.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-30 20:20 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, linux-security@vger.kernel.org,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API
In-Reply-To: <77354A95-4107-41A7-8936-D144F01C3CA4@fb.com>

On Sat, Jul 27, 2019 at 11:20 AM Song Liu <songliubraving@fb.com> wrote:
>
> Hi Andy,
>
> >>>>
> >>>
> >>> Well, yes. sys_bpf() is pretty powerful.
> >>>
> >>> The goal of /dev/bpf is to enable special users to call sys_bpf(). In
> >>> the meanwhile, such users should not take down the whole system easily
> >>> by accident, e.g., with rm -rf /.
> >>
> >> That’s easy, though — bpftool could learn to read /etc/bpfusers before allowing ruid != 0.
> >
> > This is a great idea! fscaps + /etc/bpfusers should do the trick.
>
> After some discussions and more thinking on this, I have some concerns
> with the user space only approach.
>
> IIUC, your proposal for user space only approach is like:
>
> 1. bpftool (and other tools) check /etc/bpfusers and only do
>    setuid for allowed users:
>
>         int main()
>         {
>                 if (/* uid in /etc/bpfusers */)
>                         setuid(0);
>                 sys_bpf(...);
>         }
>
> 2. bpftool (and other tools) is installed with CAP_SETUID:
>
>         setcap cap_setuid=e+p /bin/bpftool
>

You have this a bit backwards.  You wouldn't use CAP_SETUID.  You
would use the setuid *mode* bit, i.e. chmod 4111 (or 4100 and use ACLs
to further lock it down).  Or you could use setcap cap_sys_admin=p,
although the details vary.  It woks a bit like this:

First, if you are running with elevated privilege due to SUID or
fscaps, the kernel and glibc offer you a degree of protection: you are
protected from ptrace(), LD_PRELOAD, etc.  You are *not* protected
from yourself.  For example, you may be running in a context in which
an attacker has malicious values in your environment variables, CWD,
etc.  Do you need to carefully decide whether you are willing to run
with elevated privilege on behalf of the user, which you learn like
this:

uid_t real_uid = getuid();

Your decision may may depend on command-line arguments as well (i.e.
you might want to allow tracing but not filtering, say).  Once you've
made this decision, the details vary:

For SUID, you either continue to run with euid == 0, or you drop
privilege using something like:

if (setresuid(real_uid, real_uid, real_uid) != 0) {
 /* optionally print an error to stderr */
 exit(1);
}

For fscaps, if you want to be privileged, you use something like
capng_update(); capng_apply() to set CAP_SYS_ADMIN to be effective
when you want privilege.  If you want to be unprivileged (because
bpfusers says so, for example), you could use capng_update() to drop
CAP_SYS_ADMIN entirely and see if the calls still work without
privilege.  But this is a little bit awkward, since you don't directly
know whether the user that invoked you in the first place had
CAP_SYS_ADMIN to begin with.

In general, SUID is a bit easier to work with.

> This approach is not ideal, because we need to trust the tool to give
> it CAP_SETUID. A hacked tool could easily bypass /etc/bpfusers check
> or use other root only sys calls after setuid(0).

How?  The whole SUID mechanism is designed fairly carefully to prevent
this.  /bin/sudo is likely to be SUID on your system, but you can't
just "hack" it to become root.

^ permalink raw reply

* Re: [PATCH stable 4.9] tcp: reset sk_send_head in tcp_write_queue_purge
From: Sasha Levin @ 2019-07-30 20:17 UTC (permalink / raw)
  To: maowenan; +Cc: gregkh, stable, netdev, linux-kernel
In-Reply-To: <29c1ee9c-4a5d-4f61-f526-85980185f0bd@huawei.com>

On Tue, Jul 30, 2019 at 09:31:19AM +0800, maowenan wrote:
>
>
>On 2019/7/29 23:32, Sasha Levin wrote:
>> On Mon, Jul 29, 2019 at 09:21:08PM +0800, Mao Wenan wrote:
>>> From: Soheil Hassas Yeganeh <soheil@google.com>
>>>
>>> tcp_write_queue_purge clears all the SKBs in the write queue
>>> but does not reset the sk_send_head. As a result, we can have
>>> a NULL pointer dereference anywhere that we use tcp_send_head
>>> instead of the tcp_write_queue_tail.
>>>
>>> For example, after a27fd7a8ed38 (tcp: purge write queue upon RST),
>>> we can purge the write queue on RST. Prior to
>>> 75c119afe14f (tcp: implement rb-tree based retransmit queue),
>>> tcp_push will only check tcp_send_head and then accesses
>>> tcp_write_queue_tail to send the actual SKB. As a result, it will
>>> dereference a NULL pointer.
>>>
>>> This has been reported twice for 4.14 where we don't have
>>> 75c119afe14f:
>>>
>>> By Timofey Titovets:
>>>
>>> [  422.081094] BUG: unable to handle kernel NULL pointer dereference
>>> at 0000000000000038
>>> [  422.081254] IP: tcp_push+0x42/0x110
>>> [  422.081314] PGD 0 P4D 0
>>> [  422.081364] Oops: 0002 [#1] SMP PTI
>>>
>>> By Yongjian Xu:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
>>> IP: tcp_push+0x48/0x120
>>> PGD 80000007ff77b067 P4D 80000007ff77b067 PUD 7fd989067 PMD 0
>>> Oops: 0002 [#18] SMP PTI
>>> Modules linked in: tcp_diag inet_diag tcp_bbr sch_fq iTCO_wdt
>>> iTCO_vendor_support pcspkr ixgbe mdio i2c_i801 lpc_ich joydev input_leds shpchp
>>> e1000e igb dca ptp pps_core hwmon mei_me mei ipmi_si ipmi_msghandler sg ses
>>> scsi_transport_sas enclosure ext4 jbd2 mbcache sd_mod ahci libahci megaraid_sas
>>> wmi ast ttm dm_mirror dm_region_hash dm_log dm_mod dax
>>> CPU: 6 PID: 14156 Comm: [ET_NET 6] Tainted: G D 4.14.26-1.el6.x86_64 #1
>>> Hardware name: LENOVO ThinkServer RD440 /ThinkServer RD440, BIOS A0TS80A
>>> 09/22/2014
>>> task: ffff8807d78d8140 task.stack: ffffc9000e944000
>>> RIP: 0010:tcp_push+0x48/0x120
>>> RSP: 0018:ffffc9000e947a88 EFLAGS: 00010246
>>> RAX: 00000000000005b4 RBX: ffff880f7cce9c00 RCX: 0000000000000000
>>> RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff8807d00f5000
>>> RBP: ffffc9000e947aa8 R08: 0000000000001c84 R09: 0000000000000000
>>> R10: ffff8807d00f5158 R11: 0000000000000000 R12: ffff8807d00f5000
>>> R13: 0000000000000020 R14: 00000000000256d4 R15: 0000000000000000
>>> FS: 00007f5916de9700(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 0000000000000038 CR3: 00000007f8226004 CR4: 00000000001606e0
>>> Call Trace:
>>> tcp_sendmsg_locked+0x33d/0xe50
>>> tcp_sendmsg+0x37/0x60
>>> inet_sendmsg+0x39/0xc0
>>> sock_sendmsg+0x49/0x60
>>> sock_write_iter+0xb6/0x100
>>> do_iter_readv_writev+0xec/0x130
>>> ? rw_verify_area+0x49/0xb0
>>> do_iter_write+0x97/0xd0
>>> vfs_writev+0x7e/0xe0
>>> ? __wake_up_common_lock+0x80/0xa0
>>> ? __fget_light+0x2c/0x70
>>> ? __do_page_fault+0x1e7/0x530
>>> do_writev+0x60/0xf0
>>> ? inet_shutdown+0xac/0x110
>>> SyS_writev+0x10/0x20
>>> do_syscall_64+0x6f/0x140
>>> ? prepare_exit_to_usermode+0x8b/0xa0
>>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> RIP: 0033:0x3135ce0c57
>>> RSP: 002b:00007f5916de4b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
>>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000003135ce0c57
>>> RDX: 0000000000000002 RSI: 00007f5916de4b90 RDI: 000000000000606f
>>> RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f5916de8c38
>>> R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000464cc
>>> R13: 00007f5916de8c30 R14: 00007f58d8bef080 R15: 0000000000000002
>>> Code: 48 8b 97 60 01 00 00 4c 8d 97 58 01 00 00 41 b9 00 00 00 00 41 89 f3 4c 39
>>> d2 49 0f 44 d1 41 81 e3 00 80 00 00 0f 85 b0 00 00 00 <80> 4a 38 08 44 8b 8f 74
>>> 06 00 00 44 89 8f 7c 06 00 00 83 e6 01
>>> RIP: tcp_push+0x48/0x120 RSP: ffffc9000e947a88
>>> CR2: 0000000000000038
>>> ---[ end trace 8d545c2e93515549 ]---
>>>
>>> There is other scenario which found in stable 4.4:
>>> Allocated:
>>> [<ffffffff82f380a6>] __alloc_skb+0xe6/0x600 net/core/skbuff.c:218
>>> [<ffffffff832466c3>] alloc_skb_fclone include/linux/skbuff.h:856 [inline]
>>> [<ffffffff832466c3>] sk_stream_alloc_skb+0xa3/0x5d0 net/ipv4/tcp.c:833
>>> [<ffffffff83249164>] tcp_sendmsg+0xd34/0x2b00 net/ipv4/tcp.c:1178
>>> [<ffffffff83300ef3>] inet_sendmsg+0x203/0x4d0 net/ipv4/af_inet.c:755
>>> Freed:
>>> [<ffffffff82f372fd>] __kfree_skb+0x1d/0x20 net/core/skbuff.c:676
>>> [<ffffffff83288834>] sk_wmem_free_skb include/net/sock.h:1447 [inline]
>>> [<ffffffff83288834>] tcp_write_queue_purge include/net/tcp.h:1460 [inline]
>>> [<ffffffff83288834>] tcp_connect_init net/ipv4/tcp_output.c:3122 [inline]
>>> [<ffffffff83288834>] tcp_connect+0xb24/0x30c0 net/ipv4/tcp_output.c:3261
>>> [<ffffffff8329b991>] tcp_v4_connect+0xf31/0x1890 net/ipv4/tcp_ipv4.c:246
>>>
>>> BUG: KASAN: use-after-free in tcp_skb_pcount include/net/tcp.h:796 [inline]
>>> BUG: KASAN: use-after-free in tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>>> BUG: KASAN: use-after-free in tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
>>> [<ffffffff81515cd5>] kasan_report.cold.7+0x175/0x2f7 mm/kasan/report.c:408
>>> [<ffffffff814f9784>] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:427
>>> [<ffffffff83286582>] tcp_skb_pcount include/net/tcp.h:796 [inline]
>>> [<ffffffff83286582>] tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
>>> [<ffffffff83286582>] tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
>>> [<ffffffff83287a40>] __tcp_push_pending_frames+0xa0/0x290 net/ipv4/tcp_output.c:2307
>>>
>>> stable 4.4 and stable 4.9 don't have the commit abb4a8b870b5 ("tcp: purge write queue upon RST")
>>> which is referred in dbbf2d1e4077,
>>> in tcp_connect_init, it calls tcp_write_queue_purge, and does not reset sk_send_head, then UAF.
>>>
>>> stable 4.14 have the commit abb4a8b870b5 ("tcp: purge write queue upon RST"),
>>> in tcp_reset, it calls tcp_write_queue_purge(sk), and does not reset sk_send_head, then UAF.
>>>
>>> So this patch can be used to fix stable 4.4 and 4.9.
>>>
>>> Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
>>> Reported-by: Timofey Titovets <nefelim4ag@gmail.com>
>>> Reported-by: Yongjian Xu <yongjianchn@gmail.com>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
>>> Tested-by: Yongjian Xu <yongjianchn@gmail.com>
>>>
>>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>>
>> So the "Fixes:" commit in the commit message is wrong? What's the actual
>> commit that this fixes?
>
>Upstream commit is 7f582b248d0a ("tcp: purge write queue in tcp_connect_init()")
>linux-4.4.y
>Fixes: 5bbe138a250e ("tcp: purge write queue in tcp_connect_init()")
>linux-4.9.y
>Fixes: 74a4c09d4b05 ("tcp: purge write queue in tcp_connect_init()")
>linux-4.14.y
>Fixes: a27fd7a8ed38 ("tcp: purge write queue upon RST")

Okay, I've queued this for 4.9 and 4.4, thank you.

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH v5 13/29] compat_ioctl: move more drivers to compat_ptr_ioctl
From: Dan Williams @ 2019-07-30 20:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Viro, linux-iio, Daniel Vetter, linux-pci, linux-nvme,
	Bjorn Andersson, sparclinux, Mauro Carvalho Chehab, linux-scsi,
	linux-nvdimm, linux-rdma, qat-linux, amd-gfx list,
	Jason Gunthorpe, linux-input, Darren Hart,
	Linux-media@vger.kernel.org, linux-remoteproc,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Maling list - DRI developers, Jonathan Cameron, David Sterba,
	platform-driver-x86, Greg Kroah-Hartman, USB list,
	Linux Wireless List, Linux Kernel Mailing List, tee-dev,
	linux-crypto, Netdev, linux-fsdevel, linux-btrfs
In-Reply-To: <20190730195819.901457-1-arnd@arndb.de>

On Tue, Jul 30, 2019 at 12:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
>
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will never run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
>
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
>
> Acked-by: Jason Gunthorpe <jgg@mellanox.com>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Acked-by: David Sterba <dsterba@suse.com>
> Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/nvdimm/bus.c                        | 4 ++--
[..]
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 798c5c4aea9c..6ca142d833ab 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -1229,7 +1229,7 @@ static const struct file_operations nvdimm_bus_fops = {
>         .owner = THIS_MODULE,
>         .open = nd_open,
>         .unlocked_ioctl = bus_ioctl,
> -       .compat_ioctl = bus_ioctl,
> +       .compat_ioctl = compat_ptr_ioctl,
>         .llseek = noop_llseek,
>  };
>
> @@ -1237,7 +1237,7 @@ static const struct file_operations nvdimm_fops = {
>         .owner = THIS_MODULE,
>         .open = nd_open,
>         .unlocked_ioctl = dimm_ioctl,
> -       .compat_ioctl = dimm_ioctl,
> +       .compat_ioctl = compat_ptr_ioctl,
>         .llseek = noop_llseek,
>  };

Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply

* Re: [net-next 08/13] net/mlx5e: Protect tc flows hashtable with rcu
From: Saeed Mahameed @ 2019-07-30 20:09 UTC (permalink / raw)
  To: willemdebruijn.kernel@gmail.com
  Cc: Roi Dayan, davem@davemloft.net, netdev@vger.kernel.org,
	Vlad Buslov, Jianbo Liu
In-Reply-To: <CAF=yD-LgfHTJrfyaVfokKkZWwPpz4uxYDKA11+jgO5rAq1LamA@mail.gmail.com>

On Tue, 2019-07-30 at 12:37 -0400, Willem de Bruijn wrote:
> On Tue, Jul 30, 2019 at 12:16 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> > On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com
> > > wrote:
> > > From: Vlad Buslov <vladbu@mellanox.com>
> > > 
> > > In order to remove dependency on rtnl lock, access to tc flows
> > > hashtable
> > > must be explicitly protected from concurrent flows removal.
> > > 
> > > Extend tc flow structure with rcu to allow concurrent parallel
> > > access. Use
> > > rcu read lock to safely lookup flow in tc flows hash table, and
> > > take
> > > reference to it. Use rcu free for flow deletion to accommodate
> > > concurrent
> > > stats requests.
> > > 
> > > Add new DELETED flow flag. Imlement new flow_flag_test_and_set()
> > > helper
> > > that is used to set a flag and return its previous value. Use it
> > > to
> > > atomically set the flag in mlx5e_delete_flower() to guarantee
> > > that flow can
> > > only be deleted once, even when same flow is deleted concurrently
> > > by
> > > multiple tasks.
> > > 
> > > Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> > > Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
> > > Reviewed-by: Roi Dayan <roid@mellanox.com>
> > > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > > @@ -3492,16 +3507,32 @@ int mlx5e_delete_flower(struct net_device
> > > *dev, struct mlx5e_priv *priv,
> > >  {
> > >         struct rhashtable *tc_ht = get_tc_ht(priv, flags);
> > >         struct mlx5e_tc_flow *flow;
> > > +       int err;
> > > 
> > > +       rcu_read_lock();
> > >         flow = rhashtable_lookup_fast(tc_ht, &f->cookie,
> > > tc_ht_params);
> > > -       if (!flow || !same_flow_direction(flow, flags))
> > > -               return -EINVAL;
> > > +       if (!flow || !same_flow_direction(flow, flags)) {
> > > +               err = -EINVAL;
> > > +               goto errout;
> > > +       }
> > > 
> > > +       /* Only delete the flow if it doesn't have
> > > MLX5E_TC_FLOW_DELETED flag
> > > +        * set.
> > > +        */
> > > +       if (flow_flag_test_and_set(flow, DELETED)) {
> > > +               err = -EINVAL;
> > > +               goto errout;
> > > +       }
> > >         rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
> > > +       rcu_read_unlock();
> > > 
> > >         mlx5e_flow_put(priv, flow);
> > 
> > Dereferencing flow outside rcu readside critical section? Does a
> > build
> > with lockdep not complain?
> 
> Eh no, it won't. The surprising part to me was to use a readside
> critical section when performing a write action on an RCU ptr. The
> DELETED flag ensures that multiple writers will not compete to call
> rhashtable_remove_fast. rcu_read_lock is a common pattern to do
> rhashtable lookup + delete.
> 

correct.


^ permalink raw reply

* Re: [net-next 01/13] net/mlx5e: Print a warning when LRO feature is dropped or not allowed
From: Saeed Mahameed @ 2019-07-30 20:07 UTC (permalink / raw)
  To: willemdebruijn.kernel@gmail.com
  Cc: Huy Nguyen, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <CA+FuTSdoCfj=vcQd3TcA9BRhCNPAJamKHX+14H-6_ecpDEVS_Q@mail.gmail.com>

On Tue, 2019-07-30 at 11:52 -0400, Willem de Bruijn wrote:
> On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com>
> wrote:
> > From: Huy Nguyen <huyn@mellanox.com>
> > 
> > When user enables LRO via ethtool and if the RQ mode is legacy,
> > mlx5e_fix_features drops the request without any explanation.
> > Add netdev_warn to cover this case.
> > 
> > Fixes: 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> > legacy RQ")
> > Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 47eea6b3a1c3..776eb46d263d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -3788,9 +3788,10 @@ static netdev_features_t
> > mlx5e_fix_features(struct net_device *netdev,
> >                         netdev_warn(netdev, "Dropping C-tag vlan
> > stripping offload due to S-tag vlan\n");
> >         }
> >         if (!MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
> > -               features &= ~NETIF_F_LRO;
> > -               if (params->lro_en)
> > +               if (features & NETIF_F_LRO) {
> >                         netdev_warn(netdev, "Disabling LRO, not
> > supported in legacy RQ\n");
> 
> This warns about "Disabling LRO" on an enable request?
> 

no, this warning appears only when lro is already enabled and might
conflict with any other feature requested by user (hence
mlx5e_fix_features), e.g when moving away from striding rq in this
example, we will force lro to off.


> More fundamentally, it appears that the device does not advertise
> the feature as configurable in netdev_hw_features as of commit
> 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
> legacy RQ"), so shouldn't this be caught by the device driver
> independent ethtool code?

when hw doesn't support MLX5E_PFLAG_RX_STRIDING_RQ then yes, you will
never hit this code path, but when hw does support
MLX5E_PFLAG_RX_STRIDING_RQ and you want to turn striding rq off, then
lro will be forced to off (if it was enabled in first space) and a
warning msg will be shown.




^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Andrii Nakryiko @ 2019-07-30 20:04 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, Networking
In-Reply-To: <20190730184821.10833-1-danieltimlee@gmail.com>

On Tue, Jul 30, 2019 at 11:48 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> Currently, bpftool net only supports dumping progs loaded on the
> interface. To load XDP prog on interface, user must use other tool
> (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> (un)load XDP prog on interface.
>
>     $ ./bpftool prog
>     ...
>     208: xdp  name xdp_prog1  tag ad822e38b629553f  gpl
>       loaded_at 2019-07-28T18:03:11+0900  uid 0
>     ...
>     $ ./bpftool net load id 208 xdpdrv enp6s0np1
>     $ ./bpftool net
>     xdp:
>     enp6s0np1(5) driver id 208
>     ...
>     $ ./bpftool net unload xdpdrv enp6s0np1
>     $ ./bpftool net
>     xdp:
>     ...
>
> The word 'load' is used instead of 'attach', since XDP program is not

Let's be consistent and "attach" everything. Tracing BPF programs
(kprobe, tracepoints) are also not attached with BPF_PROG_ATTACH, but
it's still a process of attaching BPF program to a BPF hook (a point
in the system which can execute BPF program).

Even better, if you can check what we recently did for tracing APIs
with struct bpf_link and can add similar APIs for XDP programs, it
would be a nice step towards consistency of APIs (and terminology as
well). Please consider doing that. Thanks!

> considered as 'bpf_attach_type' and can't be attached with
> 'BPF_PROG_ATTACH'. In this context, the meaning of 'load' is, prog will
> be loaded on interface.
>
> While this patch only contains support for XDP, through `net (un)load`,
> bpftool can further support other prog attach types.
>
> XDP (un)load tested on Netronome Agilio.
>
> Daniel T. Lee (2):
>   tools: bpftool: add net load command to load XDP on interface
>   tools: bpftool: add net unload command to unload XDP on interface
>
>  tools/bpf/bpftool/net.c | 160 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 159 insertions(+), 1 deletion(-)
>
> --
> 2.20.1
>

^ permalink raw reply

* Re: [RFC PATCH net-next 0/3] net: batched receive in GRO path
From: Edward Cree @ 2019-07-30 20:02 UTC (permalink / raw)
  To: Eric Dumazet, Paolo Abeni, David Miller; +Cc: netdev
In-Reply-To: <a30137b4-1b01-df6e-c771-c5ddd1cfc490@solarflare.com>

On 24/07/2019 22:49, Edward Cree wrote:
> One thing that's causing me some uncertainty: busy_poll_stop() does a
>  napi->poll(), which can potentially gro_normal_one() something.  But
>  when I tried to put a gro_normal_list() just after that, I ran into
>  list corruption because it could race against the one in
>  napi_complete_done().
Turns out that the bh_disables are a red herring, we're racing against a
 napi poll on a different CPU.
I *think* that the sequence of events is
- enter busy_poll_stop
- enter napi->poll
- napi->poll calls napi_complete_done(), which schedules a new napi
- that new napi runs, on another CPU
- meanwhile, busy_poll_stop returns from napi->poll; if it then does a
  gro_normal_list it can race with the one on the other CPU from the new
  napi-poll.
Which means that...
> Questions that arise from that:
> 1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
>    which in theory can end up calling gro_normal_list() as well) within
>    busy_poll_stop()?  I haven't ever seen a splat from that, but it seems
>    every bit as possible as what I have been seeing.
... this isn't a problem, because napi->poll() will do all of its
 gro_normal_one()s before it calls napi_complete_done().

Just gathering some final performance numbers, then I'll post the updated
 patches (hopefully) later tonight.

-Ed

^ permalink raw reply

* [PATCH v5 19/29] compat_ioctl: move hci_sock handlers into driver
From: Arnd Bergmann @ 2019-07-30 19:55 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-kernel, Arnd Bergmann, Marcel Holtmann,
	Johan Hedberg, David S. Miller, Deepa Dinamani, linux-bluetooth,
	netdev
In-Reply-To: <20190730195819.901457-1-arnd@arndb.de>

All these ioctl commands are compatible, so we can handle
them with a trivial wrapper in hci_sock.c and remove
the listing in fs/compat_ioctl.c.

A few of the commands pass integer arguments instead of
pointers, so for correctness skip the compat_ptr() conversion
here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 fs/compat_ioctl.c        | 24 ------------------------
 net/bluetooth/hci_sock.c | 21 ++++++++++++++++++++-
 2 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 8dbef92b10fd..9302157d1471 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -40,9 +40,6 @@
 
 #include "internal.h"
 
-#include <net/bluetooth/bluetooth.h>
-#include <net/bluetooth/hci_sock.h>
-
 #ifdef CONFIG_BLOCK
 #include <linux/cdrom.h>
 #include <linux/fd.h>
@@ -646,27 +643,6 @@ COMPATIBLE_IOCTL(RNDADDENTROPY)
 COMPATIBLE_IOCTL(RNDZAPENTCNT)
 COMPATIBLE_IOCTL(RNDCLEARPOOL)
 /* Bluetooth */
-COMPATIBLE_IOCTL(HCIDEVUP)
-COMPATIBLE_IOCTL(HCIDEVDOWN)
-COMPATIBLE_IOCTL(HCIDEVRESET)
-COMPATIBLE_IOCTL(HCIDEVRESTAT)
-COMPATIBLE_IOCTL(HCIGETDEVLIST)
-COMPATIBLE_IOCTL(HCIGETDEVINFO)
-COMPATIBLE_IOCTL(HCIGETCONNLIST)
-COMPATIBLE_IOCTL(HCIGETCONNINFO)
-COMPATIBLE_IOCTL(HCIGETAUTHINFO)
-COMPATIBLE_IOCTL(HCISETRAW)
-COMPATIBLE_IOCTL(HCISETSCAN)
-COMPATIBLE_IOCTL(HCISETAUTH)
-COMPATIBLE_IOCTL(HCISETENCRYPT)
-COMPATIBLE_IOCTL(HCISETPTYPE)
-COMPATIBLE_IOCTL(HCISETLINKPOL)
-COMPATIBLE_IOCTL(HCISETLINKMODE)
-COMPATIBLE_IOCTL(HCISETACLMTU)
-COMPATIBLE_IOCTL(HCISETSCOMTU)
-COMPATIBLE_IOCTL(HCIBLOCKADDR)
-COMPATIBLE_IOCTL(HCIUNBLOCKADDR)
-COMPATIBLE_IOCTL(HCIINQUIRY)
 COMPATIBLE_IOCTL(HCIUARTSETPROTO)
 COMPATIBLE_IOCTL(HCIUARTGETPROTO)
 COMPATIBLE_IOCTL(HCIUARTGETDEVICE)
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index d32077b28433..5d0ed28c0d3a 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -23,7 +23,7 @@
 */
 
 /* Bluetooth HCI sockets. */
-
+#include <linux/compat.h>
 #include <linux/export.h>
 #include <linux/utsname.h>
 #include <linux/sched.h>
@@ -1054,6 +1054,22 @@ static int hci_sock_ioctl(struct socket *sock, unsigned int cmd,
 	return err;
 }
 
+#ifdef CONFIG_COMPAT
+static int hci_sock_compat_ioctl(struct socket *sock, unsigned int cmd,
+				 unsigned long arg)
+{
+	switch (cmd) {
+	case HCIDEVUP:
+	case HCIDEVDOWN:
+	case HCIDEVRESET:
+	case HCIDEVRESTAT:
+		return hci_sock_ioctl(sock, cmd, arg);
+	}
+
+	return hci_sock_ioctl(sock, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
 static int hci_sock_bind(struct socket *sock, struct sockaddr *addr,
 			 int addr_len)
 {
@@ -1974,6 +1990,9 @@ static const struct proto_ops hci_sock_ops = {
 	.sendmsg	= hci_sock_sendmsg,
 	.recvmsg	= hci_sock_recvmsg,
 	.ioctl		= hci_sock_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= hci_sock_compat_ioctl,
+#endif
 	.poll		= datagram_poll,
 	.listen		= sock_no_listen,
 	.shutdown	= sock_no_shutdown,
-- 
2.20.0


^ permalink raw reply related

* [PATCH v5 18/29] compat_ioctl: move rfcomm handlers into driver
From: Arnd Bergmann @ 2019-07-30 19:55 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-kernel, Arnd Bergmann, Marcel Holtmann,
	Johan Hedberg, David S. Miller, Mauro Carvalho Chehab,
	linux-bluetooth, netdev
In-Reply-To: <20190730195819.901457-1-arnd@arndb.de>

All these ioctl commands are compatible, so we can handle
them with a trivial wrapper in rfcomm/sock.c and remove
the listing in fs/compat_ioctl.c.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 fs/compat_ioctl.c           |  6 ------
 net/bluetooth/rfcomm/sock.c | 14 ++++++++++++--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index f3b4179d6dff..8dbef92b10fd 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -42,7 +42,6 @@
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci_sock.h>
-#include <net/bluetooth/rfcomm.h>
 
 #ifdef CONFIG_BLOCK
 #include <linux/cdrom.h>
@@ -673,11 +672,6 @@ COMPATIBLE_IOCTL(HCIUARTGETPROTO)
 COMPATIBLE_IOCTL(HCIUARTGETDEVICE)
 COMPATIBLE_IOCTL(HCIUARTSETFLAGS)
 COMPATIBLE_IOCTL(HCIUARTGETFLAGS)
-COMPATIBLE_IOCTL(RFCOMMCREATEDEV)
-COMPATIBLE_IOCTL(RFCOMMRELEASEDEV)
-COMPATIBLE_IOCTL(RFCOMMGETDEVLIST)
-COMPATIBLE_IOCTL(RFCOMMGETDEVINFO)
-COMPATIBLE_IOCTL(RFCOMMSTEALDLC)
 /* Misc. */
 COMPATIBLE_IOCTL(PCIIOC_CONTROLLER)
 COMPATIBLE_IOCTL(PCIIOC_MMAP_IS_IO)
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 90bb53aa4bee..b4eaf21360ef 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -24,7 +24,7 @@
 /*
  * RFCOMM sockets.
  */
-
+#include <linux/compat.h>
 #include <linux/export.h>
 #include <linux/debugfs.h>
 #include <linux/sched/signal.h>
@@ -909,6 +909,13 @@ static int rfcomm_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
 	return err;
 }
 
+#ifdef CONFIG_COMPAT
+static int rfcomm_sock_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+	return rfcomm_sock_ioctl(sock, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
 static int rfcomm_sock_shutdown(struct socket *sock, int how)
 {
 	struct sock *sk = sock->sk;
@@ -1042,7 +1049,10 @@ static const struct proto_ops rfcomm_sock_ops = {
 	.gettstamp	= sock_gettstamp,
 	.poll		= bt_sock_poll,
 	.socketpair	= sock_no_socketpair,
-	.mmap		= sock_no_mmap
+	.mmap		= sock_no_mmap,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= rfcomm_sock_compat_ioctl,
+#endif
 };
 
 static const struct net_proto_family rfcomm_sock_family_ops = {
-- 
2.20.0


^ permalink raw reply related

* [PATCH v5 17/29] compat_ioctl: move isdn/capi ioctl translation into driver
From: Arnd Bergmann @ 2019-07-30 19:55 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-kernel, Arnd Bergmann, Karsten Keil, netdev
In-Reply-To: <20190730195819.901457-1-arnd@arndb.de>

Neither the old isdn4linux interface nor the newer mISDN stack
ever had working 32-bit compat mode as far as I can tell.

However, the CAPI stack has some ioctl commands that are
correctly listed in fs/compat_ioctl.c.

We can trivially move all of those into the corresponding
file that implement the native handlers by adding a compat_ioctl
redirect to that.

I did notice that treating CAPI_MANUFACTURER_CMD() as compatible
is broken, so I'm also adding a handler for that, realizing that
in all likelyhood, nobody is ever going to call it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/isdn/capi/capi.c | 31 +++++++++++++++++++++++++++++++
 fs/compat_ioctl.c        | 17 -----------------
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/isdn/capi/capi.c b/drivers/isdn/capi/capi.c
index 3c3ad42f22bf..3b72fd8104db 100644
--- a/drivers/isdn/capi/capi.c
+++ b/drivers/isdn/capi/capi.c
@@ -942,6 +942,34 @@ capi_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return ret;
 }
 
+#ifdef CONFIG_COMPAT
+static long
+capi_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	int ret;
+
+	if (cmd == CAPI_MANUFACTURER_CMD) {
+		struct {
+			unsigned long cmd;
+			compat_uptr_t data;
+		} mcmd32;
+
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		if (copy_from_user(&mcmd32, compat_ptr(arg), sizeof(mcmd32)))
+			return -EFAULT;
+
+		mutex_lock(&capi_mutex);
+		ret = capi20_manufacturer(mcmd32.cmd, compat_ptr(mcmd32.data));
+		mutex_unlock(&capi_mutex);
+
+		return ret;
+	}
+
+	return capi_unlocked_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
 static int capi_open(struct inode *inode, struct file *file)
 {
 	struct capidev *cdev;
@@ -988,6 +1016,9 @@ static const struct file_operations capi_fops =
 	.write		= capi_write,
 	.poll		= capi_poll,
 	.unlocked_ioctl	= capi_unlocked_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= capi_compat_ioctl,
+#endif
 	.open		= capi_open,
 	.release	= capi_release,
 };
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index a4e8fb7da968..f3b4179d6dff 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -44,9 +44,6 @@
 #include <net/bluetooth/hci_sock.h>
 #include <net/bluetooth/rfcomm.h>
 
-#include <linux/capi.h>
-#include <linux/gigaset_dev.h>
-
 #ifdef CONFIG_BLOCK
 #include <linux/cdrom.h>
 #include <linux/fd.h>
@@ -681,20 +678,6 @@ COMPATIBLE_IOCTL(RFCOMMRELEASEDEV)
 COMPATIBLE_IOCTL(RFCOMMGETDEVLIST)
 COMPATIBLE_IOCTL(RFCOMMGETDEVINFO)
 COMPATIBLE_IOCTL(RFCOMMSTEALDLC)
-/* CAPI */
-COMPATIBLE_IOCTL(CAPI_REGISTER)
-COMPATIBLE_IOCTL(CAPI_GET_MANUFACTURER)
-COMPATIBLE_IOCTL(CAPI_GET_VERSION)
-COMPATIBLE_IOCTL(CAPI_GET_SERIAL)
-COMPATIBLE_IOCTL(CAPI_GET_PROFILE)
-COMPATIBLE_IOCTL(CAPI_MANUFACTURER_CMD)
-COMPATIBLE_IOCTL(CAPI_GET_ERRCODE)
-COMPATIBLE_IOCTL(CAPI_INSTALLED)
-COMPATIBLE_IOCTL(CAPI_GET_FLAGS)
-COMPATIBLE_IOCTL(CAPI_SET_FLAGS)
-COMPATIBLE_IOCTL(CAPI_CLR_FLAGS)
-COMPATIBLE_IOCTL(CAPI_NCCI_OPENCOUNT)
-COMPATIBLE_IOCTL(CAPI_NCCI_GETUNIT)
 /* Misc. */
 COMPATIBLE_IOCTL(PCIIOC_CONTROLLER)
 COMPATIBLE_IOCTL(PCIIOC_MMAP_IS_IO)
-- 
2.20.0


^ permalink raw reply related

* [PATCH v5 13/29] compat_ioctl: move more drivers to compat_ptr_ioctl
From: Arnd Bergmann @ 2019-07-30 19:55 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-kernel, Arnd Bergmann, Jason Gunthorpe,
	Daniel Vetter, Mauro Carvalho Chehab, Greg Kroah-Hartman,
	David Sterba, Darren Hart, Jonathan Cameron, Bjorn Andersson,
	qat-linux, linux-crypto, linux-media, dri-devel, linaro-mm-sig,
	amd-gfx, linux-input, linux-iio, linux-rdma, linux-nvdimm,
	linux-nvme, linux-pci, platform-driver-x86, linux-remoteproc,
	sparclinux, linux-scsi, tee-dev, linux-usb, linux-btrfs,
	linux-wireless, netdev
In-Reply-To: <20190730192552.4014288-1-arnd@arndb.de>

The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.

One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.

I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.

Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/android/binder.c                    | 2 +-
 drivers/crypto/qat/qat_common/adf_ctl_drv.c | 2 +-
 drivers/dma-buf/dma-buf.c                   | 4 +---
 drivers/dma-buf/sw_sync.c                   | 2 +-
 drivers/dma-buf/sync_file.c                 | 2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c    | 2 +-
 drivers/hid/hidraw.c                        | 4 +---
 drivers/iio/industrialio-core.c             | 2 +-
 drivers/infiniband/core/uverbs_main.c       | 4 ++--
 drivers/media/rc/lirc_dev.c                 | 4 +---
 drivers/mfd/cros_ec_dev.c                   | 4 +---
 drivers/misc/vmw_vmci/vmci_host.c           | 2 +-
 drivers/nvdimm/bus.c                        | 4 ++--
 drivers/nvme/host/core.c                    | 2 +-
 drivers/pci/switch/switchtec.c              | 2 +-
 drivers/platform/x86/wmi.c                  | 2 +-
 drivers/rpmsg/rpmsg_char.c                  | 4 ++--
 drivers/sbus/char/display7seg.c             | 2 +-
 drivers/sbus/char/envctrl.c                 | 4 +---
 drivers/scsi/3w-xxxx.c                      | 4 +---
 drivers/scsi/cxlflash/main.c                | 2 +-
 drivers/scsi/esas2r/esas2r_main.c           | 2 +-
 drivers/scsi/pmcraid.c                      | 4 +---
 drivers/staging/android/ion/ion.c           | 4 +---
 drivers/staging/vme/devices/vme_user.c      | 2 +-
 drivers/tee/tee_core.c                      | 2 +-
 drivers/usb/class/cdc-wdm.c                 | 2 +-
 drivers/usb/class/usbtmc.c                  | 4 +---
 drivers/virt/fsl_hypervisor.c               | 2 +-
 fs/btrfs/super.c                            | 2 +-
 fs/fuse/dev.c                               | 2 +-
 fs/notify/fanotify/fanotify_user.c          | 2 +-
 fs/userfaultfd.c                            | 2 +-
 net/rfkill/core.c                           | 2 +-
 34 files changed, 37 insertions(+), 55 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index dc1c83eafc22..79955e82544a 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -6043,7 +6043,7 @@ const struct file_operations binder_fops = {
 	.owner = THIS_MODULE,
 	.poll = binder_poll,
 	.unlocked_ioctl = binder_ioctl,
-	.compat_ioctl = binder_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.mmap = binder_mmap,
 	.open = binder_open,
 	.flush = binder_flush,
diff --git a/drivers/crypto/qat/qat_common/adf_ctl_drv.c b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
index abc7a7f64d64..ef0e482ee04f 100644
--- a/drivers/crypto/qat/qat_common/adf_ctl_drv.c
+++ b/drivers/crypto/qat/qat_common/adf_ctl_drv.c
@@ -68,7 +68,7 @@ static long adf_ctl_ioctl(struct file *fp, unsigned int cmd, unsigned long arg);
 static const struct file_operations adf_ctl_ops = {
 	.owner = THIS_MODULE,
 	.unlocked_ioctl = adf_ctl_ioctl,
-	.compat_ioctl = adf_ctl_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 struct adf_ctl_drv_info {
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f45bfb29ef96..f6d9047b7a69 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -415,9 +415,7 @@ static const struct file_operations dma_buf_fops = {
 	.llseek		= dma_buf_llseek,
 	.poll		= dma_buf_poll,
 	.unlocked_ioctl	= dma_buf_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= dma_buf_ioctl,
-#endif
+	.compat_ioctl	= compat_ptr_ioctl,
 	.show_fdinfo	= dma_buf_show_fdinfo,
 };
 
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 051f6c2873c7..51026cb08801 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -410,5 +410,5 @@ const struct file_operations sw_sync_debugfs_fops = {
 	.open           = sw_sync_debugfs_open,
 	.release        = sw_sync_debugfs_release,
 	.unlocked_ioctl = sw_sync_ioctl,
-	.compat_ioctl	= sw_sync_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 };
diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
index ee4d1a96d779..85b96757fc76 100644
--- a/drivers/dma-buf/sync_file.c
+++ b/drivers/dma-buf/sync_file.c
@@ -480,5 +480,5 @@ static const struct file_operations sync_file_fops = {
 	.release = sync_file_release,
 	.poll = sync_file_poll,
 	.unlocked_ioctl = sync_file_ioctl,
-	.compat_ioctl = sync_file_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 26b15cc56c31..ea933d2444bb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -49,7 +49,7 @@ static const char kfd_dev_name[] = "kfd";
 static const struct file_operations kfd_fops = {
 	.owner = THIS_MODULE,
 	.unlocked_ioctl = kfd_ioctl,
-	.compat_ioctl = kfd_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.open = kfd_open,
 	.mmap = kfd_mmap,
 };
diff --git a/drivers/hid/hidraw.c b/drivers/hid/hidraw.c
index 006bd6f4f653..923edc650f46 100644
--- a/drivers/hid/hidraw.c
+++ b/drivers/hid/hidraw.c
@@ -468,9 +468,7 @@ static const struct file_operations hidraw_ops = {
 	.release =      hidraw_release,
 	.unlocked_ioctl = hidraw_ioctl,
 	.fasync =	hidraw_fasync,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl   = hidraw_ioctl,
-#endif
+	.compat_ioctl   = compat_ptr_ioctl,
 	.llseek =	noop_llseek,
 };
 
diff --git a/drivers/iio/industrialio-core.c b/drivers/iio/industrialio-core.c
index 524a686077ca..9dd687534035 100644
--- a/drivers/iio/industrialio-core.c
+++ b/drivers/iio/industrialio-core.c
@@ -1610,7 +1610,7 @@ static const struct file_operations iio_buffer_fileops = {
 	.owner = THIS_MODULE,
 	.llseek = noop_llseek,
 	.unlocked_ioctl = iio_ioctl,
-	.compat_ioctl = iio_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static int iio_check_unique_scan_index(struct iio_dev *indio_dev)
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 11c13c1381cf..d6d2f6c0cd01 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -1135,7 +1135,7 @@ static const struct file_operations uverbs_fops = {
 	.release = ib_uverbs_close,
 	.llseek	 = no_llseek,
 	.unlocked_ioctl = ib_uverbs_ioctl,
-	.compat_ioctl = ib_uverbs_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static const struct file_operations uverbs_mmap_fops = {
@@ -1146,7 +1146,7 @@ static const struct file_operations uverbs_mmap_fops = {
 	.release = ib_uverbs_close,
 	.llseek	 = no_llseek,
 	.unlocked_ioctl = ib_uverbs_ioctl,
-	.compat_ioctl = ib_uverbs_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static int ib_uverbs_get_nl_info(struct ib_device *ibdev, void *client_data,
diff --git a/drivers/media/rc/lirc_dev.c b/drivers/media/rc/lirc_dev.c
index f078f8a3aec8..9a8c1cf54ac4 100644
--- a/drivers/media/rc/lirc_dev.c
+++ b/drivers/media/rc/lirc_dev.c
@@ -720,9 +720,7 @@ static const struct file_operations lirc_fops = {
 	.owner		= THIS_MODULE,
 	.write		= ir_lirc_transmit_ir,
 	.unlocked_ioctl	= ir_lirc_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ir_lirc_ioctl,
-#endif
+	.compat_ioctl	= compat_ptr_ioctl,
 	.read		= ir_lirc_read,
 	.poll		= ir_lirc_poll,
 	.open		= ir_lirc_open,
diff --git a/drivers/mfd/cros_ec_dev.c b/drivers/mfd/cros_ec_dev.c
index 41dccced5026..db1eefcd770b 100644
--- a/drivers/mfd/cros_ec_dev.c
+++ b/drivers/mfd/cros_ec_dev.c
@@ -239,9 +239,7 @@ static const struct file_operations fops = {
 	.release = ec_device_release,
 	.read = ec_device_read,
 	.unlocked_ioctl = ec_device_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl = ec_device_ioctl,
-#endif
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static void cros_ec_class_release(struct device *dev)
diff --git a/drivers/misc/vmw_vmci/vmci_host.c b/drivers/misc/vmw_vmci/vmci_host.c
index 833e2bd248a5..903e321e8e87 100644
--- a/drivers/misc/vmw_vmci/vmci_host.c
+++ b/drivers/misc/vmw_vmci/vmci_host.c
@@ -961,7 +961,7 @@ static const struct file_operations vmuser_fops = {
 	.release	= vmci_host_close,
 	.poll		= vmci_host_poll,
 	.unlocked_ioctl	= vmci_host_unlocked_ioctl,
-	.compat_ioctl	= vmci_host_unlocked_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 };
 
 static struct miscdevice vmci_host_miscdev = {
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 798c5c4aea9c..6ca142d833ab 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -1229,7 +1229,7 @@ static const struct file_operations nvdimm_bus_fops = {
 	.owner = THIS_MODULE,
 	.open = nd_open,
 	.unlocked_ioctl = bus_ioctl,
-	.compat_ioctl = bus_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.llseek = noop_llseek,
 };
 
@@ -1237,7 +1237,7 @@ static const struct file_operations nvdimm_fops = {
 	.owner = THIS_MODULE,
 	.open = nd_open,
 	.unlocked_ioctl = dimm_ioctl,
-	.compat_ioctl = dimm_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.llseek = noop_llseek,
 };
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8f3fbe5ca937..be07bd1f6654 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2813,7 +2813,7 @@ static const struct file_operations nvme_dev_fops = {
 	.owner		= THIS_MODULE,
 	.open		= nvme_dev_open,
 	.unlocked_ioctl	= nvme_dev_ioctl,
-	.compat_ioctl	= nvme_dev_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index 8c94cd3fd1f2..66610f04d76d 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -1025,7 +1025,7 @@ static const struct file_operations switchtec_fops = {
 	.read = switchtec_dev_read,
 	.poll = switchtec_dev_poll,
 	.unlocked_ioctl = switchtec_dev_ioctl,
-	.compat_ioctl = switchtec_dev_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static void link_event_work(struct work_struct *work)
diff --git a/drivers/platform/x86/wmi.c b/drivers/platform/x86/wmi.c
index 784cea8572c2..d9a0dd94ee62 100644
--- a/drivers/platform/x86/wmi.c
+++ b/drivers/platform/x86/wmi.c
@@ -913,7 +913,7 @@ static const struct file_operations wmi_fops = {
 	.read		= wmi_char_read,
 	.open		= wmi_char_open,
 	.unlocked_ioctl	= wmi_ioctl,
-	.compat_ioctl	= wmi_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 };
 
 static int wmi_dev_probe(struct device *dev)
diff --git a/drivers/rpmsg/rpmsg_char.c b/drivers/rpmsg/rpmsg_char.c
index eea5ebbb5119..507bfe163883 100644
--- a/drivers/rpmsg/rpmsg_char.c
+++ b/drivers/rpmsg/rpmsg_char.c
@@ -290,7 +290,7 @@ static const struct file_operations rpmsg_eptdev_fops = {
 	.write_iter = rpmsg_eptdev_write_iter,
 	.poll = rpmsg_eptdev_poll,
 	.unlocked_ioctl = rpmsg_eptdev_ioctl,
-	.compat_ioctl = rpmsg_eptdev_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static ssize_t name_show(struct device *dev, struct device_attribute *attr,
@@ -451,7 +451,7 @@ static const struct file_operations rpmsg_ctrldev_fops = {
 	.open = rpmsg_ctrldev_open,
 	.release = rpmsg_ctrldev_release,
 	.unlocked_ioctl = rpmsg_ctrldev_ioctl,
-	.compat_ioctl = rpmsg_ctrldev_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static void rpmsg_ctrldev_release_device(struct device *dev)
diff --git a/drivers/sbus/char/display7seg.c b/drivers/sbus/char/display7seg.c
index 971fe074d7c9..fad936eb845f 100644
--- a/drivers/sbus/char/display7seg.c
+++ b/drivers/sbus/char/display7seg.c
@@ -156,7 +156,7 @@ static long d7s_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 static const struct file_operations d7s_fops = {
 	.owner =		THIS_MODULE,
 	.unlocked_ioctl =	d7s_ioctl,
-	.compat_ioctl =		d7s_ioctl,
+	.compat_ioctl =		compat_ptr_ioctl,
 	.open =			d7s_open,
 	.release =		d7s_release,
 	.llseek = noop_llseek,
diff --git a/drivers/sbus/char/envctrl.c b/drivers/sbus/char/envctrl.c
index a63d5e402ff2..12d66aa61ede 100644
--- a/drivers/sbus/char/envctrl.c
+++ b/drivers/sbus/char/envctrl.c
@@ -715,9 +715,7 @@ static const struct file_operations envctrl_fops = {
 	.owner =		THIS_MODULE,
 	.read =			envctrl_read,
 	.unlocked_ioctl =	envctrl_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl =		envctrl_ioctl,
-#endif
+	.compat_ioctl =		compat_ptr_ioctl,
 	.open =			envctrl_open,
 	.release =		envctrl_release,
 	.llseek =		noop_llseek,
diff --git a/drivers/scsi/3w-xxxx.c b/drivers/scsi/3w-xxxx.c
index 2b1e0d503020..fb6444d0409c 100644
--- a/drivers/scsi/3w-xxxx.c
+++ b/drivers/scsi/3w-xxxx.c
@@ -1049,9 +1049,7 @@ static int tw_chrdev_open(struct inode *inode, struct file *file)
 static const struct file_operations tw_fops = {
 	.owner		= THIS_MODULE,
 	.unlocked_ioctl	= tw_chrdev_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl   = tw_chrdev_ioctl,
-#endif
+	.compat_ioctl   = compat_ptr_ioctl,
 	.open		= tw_chrdev_open,
 	.release	= NULL,
 	.llseek		= noop_llseek,
diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index b1f4724efde2..6927654792b0 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -3585,7 +3585,7 @@ static const struct file_operations cxlflash_chr_fops = {
 	.owner          = THIS_MODULE,
 	.open           = cxlflash_chr_open,
 	.unlocked_ioctl	= cxlflash_chr_ioctl,
-	.compat_ioctl	= cxlflash_chr_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 };
 
 /**
diff --git a/drivers/scsi/esas2r/esas2r_main.c b/drivers/scsi/esas2r/esas2r_main.c
index fdbda5c05aa0..80c5a235d193 100644
--- a/drivers/scsi/esas2r/esas2r_main.c
+++ b/drivers/scsi/esas2r/esas2r_main.c
@@ -613,7 +613,7 @@ static int __init esas2r_init(void)
 
 /* Handle ioctl calls to "/proc/scsi/esas2r/ATTOnode" */
 static const struct file_operations esas2r_proc_fops = {
-	.compat_ioctl	= esas2r_proc_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 	.unlocked_ioctl = esas2r_proc_ioctl,
 };
 
diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index 71ff3936da4f..12c4487cb9f6 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -3973,9 +3973,7 @@ static const struct file_operations pmcraid_fops = {
 	.open = pmcraid_chr_open,
 	.fasync = pmcraid_chr_fasync,
 	.unlocked_ioctl = pmcraid_chr_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl = pmcraid_chr_ioctl,
-#endif
+	.compat_ioctl = compat_ptr_ioctl,
 	.llseek = noop_llseek,
 };
 
diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
index 92c2914239e3..1663c163edca 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -567,9 +567,7 @@ static long ion_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 static const struct file_operations ion_fops = {
 	.owner          = THIS_MODULE,
 	.unlocked_ioctl = ion_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= ion_ioctl,
-#endif
+	.compat_ioctl	= compat_ptr_ioctl,
 };
 
 static int debug_shrink_set(void *data, u64 val)
diff --git a/drivers/staging/vme/devices/vme_user.c b/drivers/staging/vme/devices/vme_user.c
index 6a33aaa1a49f..fd0ea4dbcb91 100644
--- a/drivers/staging/vme/devices/vme_user.c
+++ b/drivers/staging/vme/devices/vme_user.c
@@ -494,7 +494,7 @@ static const struct file_operations vme_user_fops = {
 	.write = vme_user_write,
 	.llseek = vme_user_llseek,
 	.unlocked_ioctl = vme_user_unlocked_ioctl,
-	.compat_ioctl = vme_user_unlocked_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.mmap = vme_user_mmap,
 };
 
diff --git a/drivers/tee/tee_core.c b/drivers/tee/tee_core.c
index 0f16d9ffd8d1..37d22e39fd8d 100644
--- a/drivers/tee/tee_core.c
+++ b/drivers/tee/tee_core.c
@@ -675,7 +675,7 @@ static const struct file_operations tee_fops = {
 	.open = tee_open,
 	.release = tee_release,
 	.unlocked_ioctl = tee_ioctl,
-	.compat_ioctl = tee_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static void tee_release_device(struct device *dev)
diff --git a/drivers/usb/class/cdc-wdm.c b/drivers/usb/class/cdc-wdm.c
index a7824a51f86d..3234dc539873 100644
--- a/drivers/usb/class/cdc-wdm.c
+++ b/drivers/usb/class/cdc-wdm.c
@@ -724,7 +724,7 @@ static const struct file_operations wdm_fops = {
 	.release =	wdm_release,
 	.poll =		wdm_poll,
 	.unlocked_ioctl = wdm_ioctl,
-	.compat_ioctl = wdm_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.llseek =	noop_llseek,
 };
 
diff --git a/drivers/usb/class/usbtmc.c b/drivers/usb/class/usbtmc.c
index 4942122b2346..bbd0308b13f5 100644
--- a/drivers/usb/class/usbtmc.c
+++ b/drivers/usb/class/usbtmc.c
@@ -2220,9 +2220,7 @@ static const struct file_operations fops = {
 	.release	= usbtmc_release,
 	.flush		= usbtmc_flush,
 	.unlocked_ioctl	= usbtmc_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= usbtmc_ioctl,
-#endif
+	.compat_ioctl	= compat_ptr_ioctl,
 	.fasync         = usbtmc_fasync,
 	.poll           = usbtmc_poll,
 	.llseek		= default_llseek,
diff --git a/drivers/virt/fsl_hypervisor.c b/drivers/virt/fsl_hypervisor.c
index 93d5bebf9572..1b0b11b55d2a 100644
--- a/drivers/virt/fsl_hypervisor.c
+++ b/drivers/virt/fsl_hypervisor.c
@@ -706,7 +706,7 @@ static const struct file_operations fsl_hv_fops = {
 	.poll = fsl_hv_poll,
 	.read = fsl_hv_read,
 	.unlocked_ioctl = fsl_hv_ioctl,
-	.compat_ioctl = fsl_hv_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 };
 
 static struct miscdevice fsl_hv_misc_dev = {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 78de9d5d80c6..f4f792b7379d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2305,7 +2305,7 @@ static const struct super_operations btrfs_super_ops = {
 static const struct file_operations btrfs_ctl_fops = {
 	.open = btrfs_control_open,
 	.unlocked_ioctl	 = btrfs_control_ioctl,
-	.compat_ioctl = btrfs_control_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
 	.owner	 = THIS_MODULE,
 	.llseek = noop_llseek,
 };
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ea8237513dfa..5bb93a3c397e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2354,7 +2354,7 @@ const struct file_operations fuse_dev_operations = {
 	.release	= fuse_dev_release,
 	.fasync		= fuse_dev_fasync,
 	.unlocked_ioctl = fuse_dev_ioctl,
-	.compat_ioctl   = fuse_dev_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 EXPORT_SYMBOL_GPL(fuse_dev_operations);
 
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 91006f47e420..3f494c8eaf2b 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -523,7 +523,7 @@ static const struct file_operations fanotify_fops = {
 	.fasync		= NULL,
 	.release	= fanotify_release,
 	.unlocked_ioctl	= fanotify_ioctl,
-	.compat_ioctl	= fanotify_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 	.llseek		= noop_llseek,
 };
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ccbdbd62f0d8..6ec18e0492e6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1920,7 +1920,7 @@ static const struct file_operations userfaultfd_fops = {
 	.poll		= userfaultfd_poll,
 	.read		= userfaultfd_read,
 	.unlocked_ioctl = userfaultfd_ioctl,
-	.compat_ioctl	= userfaultfd_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 	.llseek		= noop_llseek,
 };
 
diff --git a/net/rfkill/core.c b/net/rfkill/core.c
index f9b08a6d8dbe..c4be6a94ba97 100644
--- a/net/rfkill/core.c
+++ b/net/rfkill/core.c
@@ -1311,7 +1311,7 @@ static const struct file_operations rfkill_fops = {
 	.release	= rfkill_fop_release,
 #ifdef CONFIG_RFKILL_INPUT
 	.unlocked_ioctl	= rfkill_fop_ioctl,
-	.compat_ioctl	= rfkill_fop_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
 #endif
 	.llseek		= no_llseek,
 };
-- 
2.20.0


^ permalink raw reply related

* [PATCH v2 bpf-next 09/12] selftests/bpf: add CO-RE relocs modifiers/typedef tests
From: Andrii Nakryiko @ 2019-07-30 19:54 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko
In-Reply-To: <20190730195408.670063-1-andriin@fb.com>

Add tests validating correct handling of various combinations of
typedefs and const/volatile/restrict modifiers.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 .../selftests/bpf/prog_tests/core_reloc.c     | 27 +++++++
 .../bpf/progs/btf__core_reloc_mods.c          |  3 +
 .../progs/btf__core_reloc_mods___mod_swap.c   |  3 +
 .../progs/btf__core_reloc_mods___typedefs.c   |  3 +
 .../selftests/bpf/progs/core_reloc_types.h    | 72 +++++++++++++++++++
 .../bpf/progs/test_core_reloc_mods.c          | 62 ++++++++++++++++
 6 files changed, 170 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c
 create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c
 create mode 100644 tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_mods.c

diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
index dd2bab1b2e7d..adb8480d14dc 100644
--- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c
+++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
@@ -107,6 +107,28 @@
 	.fails = true,							\
 }
 
+#define MODS_CASE(name) {						\
+	.case_name = #name,						\
+	.bpf_obj_file = "test_core_reloc_mods.o",			\
+	.btf_src_file = "btf__core_reloc_" #name ".o",			\
+	.input = STRUCT_TO_CHAR_PTR(core_reloc_##name) {		\
+		.a = 1,							\
+		.b = 2,							\
+		.c = (void *)3,						\
+		.d = (void *)4,						\
+		.e = { [2] = 5 },					\
+		.f = { [1] = 6 },					\
+		.g = { .x = 7 },					\
+		.h = { .y = 8 },					\
+	},								\
+	.input_len = sizeof(struct core_reloc_##name),			\
+	.output = STRUCT_TO_CHAR_PTR(core_reloc_mods_output) {		\
+		.a = 1, .b = 2, .c = 3, .d = 4,				\
+		.e = 5, .f = 6, .g = 7, .h = 8,				\
+	},								\
+	.output_len = sizeof(struct core_reloc_mods_output),		\
+}
+
 struct core_reloc_test_case {
 	const char *case_name;
 	const char *bpf_obj_file;
@@ -173,6 +195,11 @@ static struct core_reloc_test_case test_cases[] = {
 	PRIMITIVES_ERR_CASE(primitives___err_non_enum),
 	PRIMITIVES_ERR_CASE(primitives___err_non_int),
 	PRIMITIVES_ERR_CASE(primitives___err_non_ptr),
+
+	/* const/volatile/restrict and typedefs scenarios */
+	MODS_CASE(mods),
+	MODS_CASE(mods___mod_swap),
+	MODS_CASE(mods___typedefs),
 };
 
 struct data {
diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c
new file mode 100644
index 000000000000..124197a2e813
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods.c
@@ -0,0 +1,3 @@
+#include "core_reloc_types.h"
+
+void f(struct core_reloc_mods x) {}
diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c
new file mode 100644
index 000000000000..f8a6592ca75f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___mod_swap.c
@@ -0,0 +1,3 @@
+#include "core_reloc_types.h"
+
+void f(struct core_reloc_mods___mod_swap x) {}
diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c
new file mode 100644
index 000000000000..5c0d73687247
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_mods___typedefs.c
@@ -0,0 +1,3 @@
+#include "core_reloc_types.h"
+
+void f(struct core_reloc_mods___typedefs x) {}
diff --git a/tools/testing/selftests/bpf/progs/core_reloc_types.h b/tools/testing/selftests/bpf/progs/core_reloc_types.h
index 7526a5f5755b..3401e8342e57 100644
--- a/tools/testing/selftests/bpf/progs/core_reloc_types.h
+++ b/tools/testing/selftests/bpf/progs/core_reloc_types.h
@@ -454,3 +454,75 @@ struct core_reloc_primitives___err_non_ptr {
 	int d; /* int instead of ptr */
 	int (*f)(const char *);
 };
+
+/*
+ * MODS
+ */
+struct core_reloc_mods_output {
+	int a, b, c, d, e, f, g, h;
+};
+
+typedef const int int_t;
+typedef const char *char_ptr_t;
+typedef const int arr_t[7];
+
+struct core_reloc_mods_substruct {
+	int x;
+	int y;
+};
+
+typedef struct {
+	int x;
+	int y;
+} core_reloc_mods_substruct_t;
+
+struct core_reloc_mods {
+	int a;
+	int_t b;
+	char *c;
+	char_ptr_t d;
+	int e[3];
+	arr_t f;
+	struct core_reloc_mods_substruct g;
+	core_reloc_mods_substruct_t h;
+};
+
+/* a/b, c/d, e/f, and g/h pairs are swapped */
+struct core_reloc_mods___mod_swap {
+	int b;
+	int_t a;
+	char *d;
+	char_ptr_t c;
+	int f[3];
+	arr_t e;
+	struct {
+		int y;
+		int x;
+	} h;
+	core_reloc_mods_substruct_t g;
+};
+
+typedef int int1_t;
+typedef int1_t int2_t;
+typedef int2_t int3_t;
+
+typedef int arr1_t[5];
+typedef arr1_t arr2_t;
+typedef arr2_t arr3_t;
+typedef arr3_t arr4_t;
+
+typedef const char * const volatile restrict fancy_char_ptr_t;
+
+typedef core_reloc_mods_substruct_t core_reloc_mods_substruct_tt;
+
+/* we need more typedefs */
+struct core_reloc_mods___typedefs {
+	core_reloc_mods_substruct_tt g;
+	core_reloc_mods_substruct_tt h;
+	arr4_t f;
+	arr4_t e;
+	fancy_char_ptr_t d;
+	fancy_char_ptr_t c;
+	int3_t b;
+	int3_t a;
+};
diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_mods.c b/tools/testing/selftests/bpf/progs/test_core_reloc_mods.c
new file mode 100644
index 000000000000..f98b942c062b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_core_reloc_mods.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+static volatile struct data {
+	char in[256];
+	char out[256];
+} data;
+
+struct core_reloc_mods_output {
+	int a, b, c, d, e, f, g, h;
+};
+
+typedef const int int_t;
+typedef const char *char_ptr_t;
+typedef const int arr_t[7];
+
+struct core_reloc_mods_substruct {
+	int x;
+	int y;
+};
+
+typedef struct {
+	int x;
+	int y;
+} core_reloc_mods_substruct_t;
+
+struct core_reloc_mods {
+	int a;
+	int_t b;
+	char *c;
+	char_ptr_t d;
+	int e[3];
+	arr_t f;
+	struct core_reloc_mods_substruct g;
+	core_reloc_mods_substruct_t h;
+};
+
+SEC("raw_tracepoint/sys_enter")
+int test_core_mods(void *ctx)
+{
+	struct core_reloc_mods *in = (void *)&data.in;
+	struct core_reloc_mods_output *out = (void *)&data.out;
+
+	if (BPF_CORE_READ(&out->a, &in->a) ||
+	    BPF_CORE_READ(&out->b, &in->b) ||
+	    BPF_CORE_READ(&out->c, &in->c) ||
+	    BPF_CORE_READ(&out->d, &in->d) ||
+	    BPF_CORE_READ(&out->e, &in->e[2]) ||
+	    BPF_CORE_READ(&out->f, &in->f[1]) ||
+	    BPF_CORE_READ(&out->g, &in->g.x) ||
+	    BPF_CORE_READ(&out->h, &in->h.y))
+		return 1;
+
+	return 0;
+}
+
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 04/12] selftests/bpf: add CO-RE relocs testing setup
From: Andrii Nakryiko @ 2019-07-30 19:54 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko
In-Reply-To: <20190730195408.670063-1-andriin@fb.com>

Add CO-RE relocation test runner. Add one simple test validating that
libbpf's logic for searching for kernel image and loading BTF out of it
works.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
 .../selftests/bpf/prog_tests/core_reloc.c     | 125 ++++++++++++++++++
 .../bpf/progs/test_core_reloc_kernel.c        |  36 +++++
 2 files changed, 161 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/core_reloc.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c

diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
new file mode 100644
index 000000000000..fab7492a8714
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+struct core_reloc_test_case {
+	const char *case_name;
+	const char *bpf_obj_file;
+	const char *btf_src_file;
+	const char *input;
+	int input_len;
+	const char *output;
+	int output_len;
+	bool fails;
+};
+
+static struct core_reloc_test_case test_cases[] = {
+	/* validate we can find kernel image and use its BTF for relocs */
+	{
+		.case_name = "kernel",
+		.bpf_obj_file = "test_core_reloc_kernel.o",
+		.btf_src_file = NULL, /* load from /lib/modules/$(uname -r) */
+		.input = "",
+		.input_len = 0,
+		.output = "\1", /* true */
+		.output_len = 1,
+	},
+};
+
+struct data {
+	char in[256];
+	char out[256];
+};
+
+void test_core_reloc(void)
+{
+	const char *probe_name = "raw_tracepoint/sys_enter";
+	struct bpf_object_load_attr load_attr = {};
+	struct core_reloc_test_case *test_case;
+	int err, duration = 0, i, equal;
+	struct bpf_link *link = NULL;
+	struct bpf_map *data_map;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	const int zero = 0;
+	struct data data;
+
+	for (i = 0; i < ARRAY_SIZE(test_cases); i++) {
+		test_case = &test_cases[i];
+
+		if (!test__start_subtest(test_case->case_name))
+			continue;
+
+		obj = bpf_object__open(test_case->bpf_obj_file);
+		if (CHECK(IS_ERR_OR_NULL(obj), "obj_open",
+			  "failed to open '%s': %ld\n",
+			  test_case->bpf_obj_file, PTR_ERR(obj)))
+			continue;
+
+		prog = bpf_object__find_program_by_title(obj, probe_name);
+		if (CHECK(!prog, "find_probe",
+			  "prog '%s' not found\n", probe_name))
+			goto cleanup;
+		bpf_program__set_type(prog, BPF_PROG_TYPE_RAW_TRACEPOINT);
+
+		load_attr.obj = obj;
+		load_attr.log_level = 0;
+		load_attr.target_btf_path = test_case->btf_src_file;
+		err = bpf_object__load_xattr(&load_attr);
+		if (test_case->fails) {
+			CHECK(!err, "obj_load_fail",
+			      "should fail to load prog '%s'\n", probe_name);
+			goto cleanup;
+		} else {
+			if (CHECK(err, "obj_load",
+				  "failed to load prog '%s': %d\n",
+				  probe_name, err))
+				goto cleanup;
+		}
+
+		link = bpf_program__attach_raw_tracepoint(prog, "sys_enter");
+		if (CHECK(IS_ERR(link), "attach_raw_tp", "err %ld\n",
+			  PTR_ERR(link)))
+			goto cleanup;
+
+		data_map = bpf_object__find_map_by_name(obj, "test_cor.bss");
+		if (CHECK(!data_map, "find_data_map", "data map not found\n"))
+			goto cleanup;
+
+		memset(&data, 0, sizeof(data));
+		memcpy(data.in, test_case->input, test_case->input_len);
+
+		err = bpf_map_update_elem(bpf_map__fd(data_map),
+					  &zero, &data, 0);
+		if (CHECK(err, "update_data_map",
+			  "failed to update .data map: %d\n", err))
+			goto cleanup;
+
+		/* trigger test run */
+		usleep(1);
+
+		err = bpf_map_lookup_elem(bpf_map__fd(data_map), &zero, &data);
+		if (CHECK(err, "get_result",
+			  "failed to get output data: %d\n", err))
+			goto cleanup;
+
+		equal = memcmp(data.out, test_case->output,
+			       test_case->output_len) == 0;
+		if (CHECK(!equal, "check_result",
+			  "input/output data don't match\n")) {
+			int j;
+
+			for (j = 0; j < test_case->output_len; j++) {
+				test__printf("byte #%d, EXP 0x%02hhx GOT 0x%02hhx\n",
+					     j, test_case->output[j], data.out[j]);
+			}
+			goto cleanup;
+		}
+
+cleanup:
+		if (!IS_ERR_OR_NULL(link)) {
+			bpf_link__destroy(link);
+			link = NULL;
+		}
+		bpf_object__close(obj);
+	}
+}
diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
new file mode 100644
index 000000000000..37e02aa3f0c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+
+#include <linux/bpf.h>
+#include <stdint.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+static volatile struct data {
+	char in[256];
+	char out[256];
+} data;
+
+struct task_struct {
+	int pid;
+	int tgid;
+};
+
+SEC("raw_tracepoint/sys_enter")
+int test_core_kernel(void *ctx)
+{
+	struct task_struct *task = (void *)bpf_get_current_task();
+	uint64_t pid_tgid = bpf_get_current_pid_tgid();
+	int pid, tgid;
+
+	if (BPF_CORE_READ(&pid, &task->pid) ||
+	    BPF_CORE_READ(&tgid, &task->tgid))
+		return 1;
+
+	/* validate pid + tgid matches */
+	data.out[0] = (((uint64_t)pid << 32) | tgid) == pid_tgid;
+
+	return 0;
+}
+
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 03/12] selftests/bpf: add BPF_CORE_READ relocatable read macro
From: Andrii Nakryiko @ 2019-07-30 19:53 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko
In-Reply-To: <20190730195408.670063-1-andriin@fb.com>

Add BPF_CORE_READ macro used in tests to do bpf_core_read(), which
automatically captures offset relocation.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/bpf_helpers.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index f804f210244e..81bc51293d11 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -501,4 +501,23 @@ struct pt_regs;
 				(void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
 #endif
 
+/*
+ * BPF_CORE_READ abstracts away bpf_probe_read() call and captures offset
+ * relocation for source address using __builtin_preserve_access_index()
+ * built-in, provided by Clang.
+ *
+ * __builtin_preserve_access_index() takes as an argument an expression of
+ * taking an address of a field within struct/union. It makes compiler emit
+ * a relocation, which records BTF type ID describing root struct/union and an
+ * accessor string which describes exact embedded field that was used to take
+ * an address. See detailed description of this relocation format and
+ * semantics in comments to struct bpf_offset_reloc in libbpf_internal.h.
+ *
+ * This relocation allows libbpf to adjust BPF instruction to use correct
+ * actual field offset, based on target kernel BTF type that matches original
+ * (local) BTF, used to record relocation.
+ */
+#define BPF_CORE_READ(dst, src) \
+	bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
+
 #endif
-- 
2.17.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox