Netdev List

* Re: Reminder: 13 open syzbot bugs in "net/netrom" subsystem
From: Cong Wang @ 2019-07-24 17:02 UTC (permalink / raw)
  To: linux-hams, Linux Kernel Network Developers, Ralf Baechle,
	David S. Miller, Cong Wang, LKML, syzkaller-bugs
In-Reply-To: <20190724014723.GJ643@sol.localdomain>

On Tue, Jul 23, 2019 at 6:47 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> [This email was generated by a script.  Let me know if you have any suggestions
> to make it better, or if you want it re-generated with the latest status.]
>
> Of the currently open syzbot reports against the upstream kernel, I've manually
> marked 13 of them as possibly being bugs in the "net/netrom" subsystem.  I've
> listed these reports below, sorted by an algorithm that tries to list first the
> reports most likely to be still valid, important, and actionable.
>
> Of these 13 bugs, 8 were seen in mainline in the last week.
>
> Of these 13 bugs, 4 were bisected to commits from the following person:
>
>         Cong Wang <xiyou.wangcong@gmail.com>

These 4 should be fixed by this pending patch:
http://patchwork.ozlabs.org/patch/1135398/

Thanks.


* [PATCH bpf-next 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

The C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible. Pass
those flags to the BPF flow dissector so it can make the same
decisions. In the next commits I'll add support for those flags to
our reference bpf_flow.c.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h       | 2 +-
 include/net/flow_dissector.h | 4 ----
 include/uapi/linux/bpf.h     | 5 +++++
 net/bpf/test_run.c           | 2 +-
 net/core/flow_dissector.c    | 5 +++--
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 718742b1c505..9b7a8038beec 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 
 struct bpf_flow_dissector;
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen);
+		      __be16 proto, int nhoff, int hlen, unsigned int flags);
 
 bool __skb_flow_dissect(const struct net *net,
 			const struct sk_buff *skb,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 90bd210be060..3e2642587b76 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -253,10 +253,6 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_MAX,
 };
 
-#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		BIT(0)
-#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	BIT(1)
-#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		BIT(2)
-
 struct flow_dissector_key {
 	enum flow_dissector_key_id key_id;
 	size_t offset; /* offset of struct flow_dissector_key_*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc..b4ad19bd6aa8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	(1U << 1)
+#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3528,6 +3532,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 80e6f3a6864d..4e41d15a1098 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -419,7 +419,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size);
+					  size, 0);
 
 		if (signal_pending(current)) {
 			preempt_enable();
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 3e6fedb57bc1..a74c4ed1b30d 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -784,7 +784,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen)
+		      __be16 proto, int nhoff, int hlen, unsigned int flags)
 {
 	struct bpf_flow_keys *flow_keys = ctx->flow_keys;
 	u32 result;
@@ -794,6 +794,7 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 	flow_keys->n_proto = proto;
 	flow_keys->nhoff = nhoff;
 	flow_keys->thoff = flow_keys->nhoff;
+	flow_keys->flags = flags;
 
 	preempt_disable();
 	result = BPF_PROG_RUN(prog, ctx);
@@ -914,7 +915,7 @@ bool __skb_flow_dissect(const struct net *net,
 			}
 
 			ret = bpf_flow_dissect(attached, &ctx, n_proto, nhoff,
-					       hlen);
+					       hlen, flags);
 			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
 						 target_container);
 			rcu_read_unlock();
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 7/7] selftests/bpf: support FLOW_DISSECTOR_F_STOP_AT_ENCAP
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

Exit as soon as we find that the packet is encapsulated when
FLOW_DISSECTOR_F_STOP_AT_ENCAP is passed.
Add appropriate selftest cases.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 60 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  8 +++
 2 files changed, 68 insertions(+)
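
Reader's note (not part of the patch): the expected thoff values in the two
ipip test cases follow from header-length arithmetic. The sketch below
reproduces that arithmetic in plain user-space C. DEMO_ETH_HLEN,
demo_ipv4_hlen and demo_ipip_thoff are hypothetical names used only for
illustration, and both IPv4 headers are assumed to carry no options
(ihl = 5), as in the test packets.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ETH_HLEN 14u	/* assumed: untagged Ethernet header */
/* Same value as the uapi definition introduced by this series. */
#define FLOW_DISSECTOR_F_STOP_AT_ENCAP (1U << 2)

/* IPv4 header length in bytes, derived from the 4-bit IHL field. */
static uint16_t demo_ipv4_hlen(uint8_t ihl)
{
	assert(ihl >= 5 && ihl <= 15);
	return (uint16_t)(ihl << 2);
}

/* Hypothetical helper: transport offset a dissector would report for an
 * Ethernet + IPv4-in-IPv4 packet, with or without the stop-at-encap flag.
 */
static uint16_t demo_ipip_thoff(uint8_t outer_ihl, uint8_t inner_ihl,
				uint32_t flags)
{
	uint16_t thoff = (uint16_t)(DEMO_ETH_HLEN + demo_ipv4_hlen(outer_ihl));

	/* Stopping at the encapsulation leaves thoff at the inner header. */
	if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
		return thoff;

	/* Otherwise the inner IPv4 header is stepped over as well. */
	return (uint16_t)(thoff + demo_ipv4_hlen(inner_ihl));
}
```

With ihl = 5 for both headers, the helper yields 34 when the flag is set
(ETH_HLEN + one iphdr) and 54 when it is not (ETH_HLEN + two iphdrs), which
is what the "ipip-no-encap" and "ipip-encap" cases expect.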

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index 1ea921c4cdc0..e382264fbc40 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -41,6 +41,13 @@ struct ipv4_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipip_pkt {
+	struct ethhdr eth;
+	struct iphdr iph;
+	struct iphdr iph_inner;
+	struct tcphdr tcp;
+} __packed;
+
 struct svlan_ipv4_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -82,6 +89,7 @@ struct test {
 	union {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
+		struct ipip_pkt ipip;
 		struct ipv6_pkt ipv6;
 		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
@@ -303,6 +311,58 @@ struct test tests[] = {
 		},
 		.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
 	},
+	{
+		.name = "ipip-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = 0,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr) +
+				sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+	},
+	{
+		.name = "ipip-no-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_IPIP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+		},
+		.flags = FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 7d73b7bfe609..b6236cdf8564 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -167,9 +167,15 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 		return export_flow_keys(keys, BPF_OK);
 	case IPPROTO_IPIP:
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IP));
 	case IPPROTO_IPV6:
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IPV6));
 	case IPPROTO_GRE:
 		gre = bpf_flow_dissect_get_header(skb, sizeof(*gre), &_gre);
@@ -189,6 +195,8 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 			keys->thoff += 4; /* Step over sequence number */
 
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
 
 		if (gre->proto == bpf_htons(ETH_P_TEB)) {
 			eth = bpf_flow_dissect_get_header(skb, sizeof(*eth),
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 6/7] bpf/flow_dissector: support ipv6 flow_label and FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

Add support for exporting ipv6 flow label via bpf_flow_keys.
Export flow label from bpf_flow.c and also return early when
FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/uapi/linux/bpf.h                      |  1 +
 net/core/flow_dissector.c                     |  9 ++++
 tools/include/uapi/linux/bpf.h                |  1 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 46 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  | 10 ++++
 5 files changed, 67 insertions(+)
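
Reader's note (not part of the patch): the flow label is the low 20 bits of
the first 32-bit word of the IPv6 header, after the 4-bit version and 8-bit
traffic class. The sketch below shows the extraction in plain user-space C.
Unlike ip6_flowlabel() in the patch, which masks in network byte order and
returns a __be32, this hypothetical demo_ip6_flowlabel() converts to host
order first; the name and the byte-array interface are assumptions made for
illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_IPV6_FLOWLABEL_MASK 0x000FFFFFu	/* low 20 bits, host order */

/* first4 points at the first four bytes of an IPv6 header in wire order. */
static uint32_t demo_ip6_flowlabel(const uint8_t *first4)
{
	uint32_t word;

	assert(first4 != NULL);
	/* memcpy avoids unaligned or type-punned access to the buffer. */
	memcpy(&word, first4, sizeof(word));
	return ntohl(word) & DEMO_IPV6_FLOWLABEL_MASK;
}
```

For the header bytes 0x60, 0x0b, 0xee, 0xef (version 6, flow_lbl as in the
selftest packets) this returns 0xbeeef in host order, the value the tests
compare against after __bpf_constant_htonl().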

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b4ad19bd6aa8..83b4150466af 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3533,6 +3533,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a74c4ed1b30d..bcdb863cad28 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -737,6 +737,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_tags *key_tags;
 
 	key_control = skb_flow_dissector_target(flow_dissector,
 						FLOW_DISSECTOR_KEY_CONTROL,
@@ -781,6 +782,14 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 		key_ports->src = flow_keys->sport;
 		key_ports->dst = flow_keys->dport;
 	}
+
+	if (dissector_uses_key(flow_dissector,
+			       FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
+		key_tags = skb_flow_dissector_target(flow_dissector,
+						     FLOW_DISSECTOR_KEY_FLOW_LABEL,
+						     target_container);
+		key_tags->flow_label = ntohl(flow_keys->flow_label);
+	}
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a0e1c891b56f..c26ca432b1b3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3530,6 +3530,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index 966cb3b06870..1ea921c4cdc0 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -20,6 +20,7 @@
 	      "is_encap=%u/%u "						\
 	      "ip_proto=0x%x/0x%x "					\
 	      "n_proto=0x%x/0x%x "					\
+	      "flow_label=0x%x/0x%x "					\
 	      "sport=%u/%u "						\
 	      "dport=%u/%u\n",						\
 	      got.nhoff, expected.nhoff,				\
@@ -30,6 +31,7 @@
 	      got.is_encap, expected.is_encap,				\
 	      got.ip_proto, expected.ip_proto,				\
 	      got.n_proto, expected.n_proto,				\
+	      got.flow_label, expected.flow_label,			\
 	      got.sport, expected.sport,				\
 	      got.dport, expected.dport)
 
@@ -257,6 +259,50 @@ struct test tests[] = {
 			.is_first_frag = true,
 		},
 	},
+	{
+		.name = "ipv6-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.sport = 80,
+			.dport = 8080,
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+	},
+	{
+		.name = "ipv6-no-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+		.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 0eabe5e57944..7d73b7bfe609 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -83,6 +83,12 @@ static __always_inline int export_flow_keys(struct bpf_flow_keys *keys,
 	return ret;
 }
 
+#define IPV6_FLOWLABEL_MASK		__bpf_constant_htonl(0x000FFFFF)
+static inline __be32 ip6_flowlabel(const struct ipv6hdr *hdr)
+{
+	return *(__be32 *)hdr & IPV6_FLOWLABEL_MASK;
+}
+
 static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
 							 __u16 hdr_size,
 							 void *buffer)
@@ -307,6 +313,10 @@ PROG(IPV6)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(struct ipv6hdr);
 	keys->ip_proto = ip6h->nexthdr;
+	keys->flow_label = ip6_flowlabel(ip6h);
+
+	if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL)
+		return export_flow_keys(keys, BPF_OK);
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 5/7] selftests/bpf: support FLOW_DISSECTOR_F_PARSE_1ST_FRAG
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

bpf_flow.c: exit early on the first fragment unless
FLOW_DISSECTOR_F_PARSE_1ST_FRAG is passed in flags. Also, set ip_proto
earlier; this makes sure we have the correct value with fragmented packets.

Add selftest cases to test ipv4/ipv6 fragments and skip eth_get_headlen
tests that don't have the FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag.

eth_get_headlen calls the flow dissector with the
FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag, so we can't run tests that
have a different set of input flags against it.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 129 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  28 +++-
 2 files changed, 151 insertions(+), 6 deletions(-)
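
Reader's note (not part of the patch): the IPv4 fragment handling described
in the commit message can be modelled as a small decision function. This is
a user-space sketch, not the BPF program itself; demo_classify_ipv4 and
struct demo_frag_result are hypothetical names, and frag_off is assumed to
be already converted to host byte order. DEMO_IP_OFFSET uses the
conventional 13-bit offset mask, 0x1FFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IP_MF	0x2000u	/* "more fragments" bit */
#define DEMO_IP_OFFSET	0x1FFFu	/* fragment offset mask */
/* Same value as the uapi definition introduced by this series. */
#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG (1U << 0)

static_assert((DEMO_IP_MF & DEMO_IP_OFFSET) == 0,
	      "MF bit must not overlap the offset mask");

struct demo_frag_result {
	bool is_frag;		/* packet is a fragment */
	bool is_first_frag;	/* fragment with offset zero */
	bool done;		/* dissector should stop parsing here */
};

/* frag_off is the IPv4 frag_off field in host byte order (assumption). */
static struct demo_frag_result demo_classify_ipv4(uint16_t frag_off,
						  uint32_t flags)
{
	struct demo_frag_result r = { false, false, false };

	if (!(frag_off & (DEMO_IP_MF | DEMO_IP_OFFSET)))
		return r;	/* not fragmented: keep parsing */

	r.is_frag = true;
	if (frag_off & DEMO_IP_OFFSET) {
		/* Later fragments carry no transport header to parse. */
		r.done = true;
	} else {
		r.is_first_frag = true;
		/* First fragment: parse further only when asked to. */
		if (!(flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
			r.done = true;
	}
	return r;
}
```

Under this model a first fragment (MF set, offset 0) stops early by default
and continues only with FLOW_DISSECTOR_F_PARSE_1ST_FRAG, which matches the
difference between the "ipv4-no-frag" and "ipv4-frag" cases: only the latter
expects sport/dport to be filled in.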

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index c938283ac232..966cb3b06870 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -5,6 +5,10 @@
 #include <linux/if_tun.h>
 #include <sys/uio.h>
 
+#ifndef IP_MF
+#define IP_MF 0x2000
+#endif
+
 #define CHECK_FLOW_KEYS(desc, got, expected)				\
 	CHECK_ATTR(memcmp(&got, &expected, sizeof(got)) != 0,		\
 	      desc,							\
@@ -49,6 +53,18 @@ struct ipv6_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipv6_frag_pkt {
+	struct ethhdr eth;
+	struct ipv6hdr iph;
+	struct frag_hdr {
+		__u8 nexthdr;
+		__u8 reserved;
+		__be16 frag_off;
+		__be32 identification;
+	} ipf;
+	struct tcphdr tcp;
+} __packed;
+
 struct dvlan_ipv6_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -65,9 +81,11 @@ struct test {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
 		struct ipv6_pkt ipv6;
+		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
 	} pkt;
 	struct bpf_flow_keys keys;
+	__u32 flags;
 };
 
 #define VLAN_HLEN	4
@@ -143,6 +161,102 @@ struct test tests[] = {
 			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
 		},
 	},
+	{
+		.name = "ipv4-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv4-no-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
+	{
+		.name = "ipv6-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv6-no-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
 };
 
 static int create_tap(const char *ifname)
@@ -225,6 +339,13 @@ void test_flow_dissector(void)
 			.data_size_in = sizeof(tests[i].pkt),
 			.data_out = &flow_keys,
 		};
+		static struct bpf_flow_keys ctx = {};
+
+		if (tests[i].flags) {
+			tattr.ctx_in = &ctx;
+			tattr.ctx_size_in = sizeof(ctx);
+			ctx.flags = tests[i].flags;
+		}
 
 		err = bpf_prog_test_run_xattr(&tattr);
 		CHECK_ATTR(tattr.data_size_out != sizeof(flow_keys) ||
@@ -255,6 +376,14 @@ void test_flow_dissector(void)
 		struct bpf_prog_test_run_attr tattr = {};
 		__u32 key = 0;
 
+		/* Don't run tests that are not marked as
+		 * FLOW_DISSECTOR_F_PARSE_1ST_FRAG; eth_get_headlen
+		 * sets this flag.
+		 */
+
+		if (tests[i].flags != FLOW_DISSECTOR_F_PARSE_1ST_FRAG)
+			continue;
+
 		err = tx_tap(tap_fd, &tests[i].pkt, sizeof(tests[i].pkt));
 		CHECK(err < 0, "tx_tap", "err %d errno %d\n", err, errno);
 
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 5ae485a6af3f..0eabe5e57944 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -153,7 +153,6 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 	struct tcphdr *tcp, _tcp;
 	struct udphdr *udp, _udp;
 
-	keys->ip_proto = proto;
 	switch (proto) {
 	case IPPROTO_ICMP:
 		icmp = bpf_flow_dissect_get_header(skb, sizeof(*icmp), &_icmp);
@@ -231,7 +230,6 @@ static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
 {
 	struct bpf_flow_keys *keys = skb->flow_keys;
 
-	keys->ip_proto = nexthdr;
 	switch (nexthdr) {
 	case IPPROTO_HOPOPTS:
 	case IPPROTO_DSTOPTS:
@@ -266,6 +264,7 @@ PROG(IP)(struct __sk_buff *skb)
 	keys->addr_proto = ETH_P_IP;
 	keys->ipv4_src = iph->saddr;
 	keys->ipv4_dst = iph->daddr;
+	keys->ip_proto = iph->protocol;
 
 	keys->thoff += iph->ihl << 2;
 	if (data + keys->thoff > data_end)
@@ -273,13 +272,19 @@ PROG(IP)(struct __sk_buff *skb)
 
 	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
 		keys->is_frag = true;
-		if (iph->frag_off & bpf_htons(IP_OFFSET))
+		if (iph->frag_off & bpf_htons(IP_OFFSET)) {
 			/* From second fragment on, packets do not have headers
 			 * we can parse.
 			 */
 			done = true;
-		else
+		} else {
 			keys->is_first_frag = true;
+			/* No need to parse fragmented packet unless
+			 * explicitly asked for.
+			 */
+			if (!(keys->flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+				done = true;
+		}
 	}
 
 	if (done)
@@ -301,6 +306,7 @@ PROG(IPV6)(struct __sk_buff *skb)
 	memcpy(&keys->ipv6_src, &ip6h->saddr, 2*sizeof(ip6h->saddr));
 
 	keys->thoff += sizeof(struct ipv6hdr);
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -317,7 +323,8 @@ PROG(IPV6OP)(struct __sk_buff *skb)
 	/* hlen is in 8-octets and does not include the first 8 bytes
 	 * of the header
 	 */
-	skb->flow_keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -333,9 +340,18 @@ PROG(IPV6FR)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(*fragh);
 	keys->is_frag = true;
-	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET)))
+	keys->ip_proto = fragh->nexthdr;
+
+	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET))) {
 		keys->is_first_frag = true;
 
+		/* No need to parse fragmented packet unless
+		 * explicitly asked for.
+		 */
+		if (!(keys->flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+			return export_flow_keys(keys, BPF_OK);
+	}
+
 	return parse_ipv6_proto(skb, fragh->nexthdr);
 }
 
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 4/7] tools/bpf: sync bpf_flow_keys flags
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

Export bpf_flow_keys flags to tools/libbpf/selftests.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/include/uapi/linux/bpf.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65..a0e1c891b56f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3504,6 +3504,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	(1U << 1)
+#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3525,6 +3529,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 3/7] bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

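Accept an optional struct bpf_flow_keys via ctx_in and pass its flags
field to the BPF flow dissector program; all other fields must be zero.
This will allow us to write tests for the input flags.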

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/bpf/test_run.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)
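
Reader's note (not part of the patch): verify_user_bpf_flow_keys() below
accepts a user-supplied context only if every byte outside the flags field
is zero. A user-space model of that check might look as follows. The
struct demo_flow_keys layout is a simplified stand-in for the real
struct bpf_flow_keys, and demo_range_is_zero is a plain loop standing in
for the kernel's range_is_zero(); both are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct bpf_flow_keys (hypothetical layout). */
struct demo_flow_keys {
	uint16_t nhoff;
	uint16_t thoff;
	uint32_t addrs[8];
	uint32_t flags;		/* the only field user space may set */
	uint32_t trailing;	/* stands for any field added after flags */
};

/* True if bytes [from, to) of buf are all zero. */
static bool demo_range_is_zero(const void *buf, size_t from, size_t to)
{
	const unsigned char *p = buf;

	for (size_t i = from; i < to; i++)
		if (p[i] != 0)
			return false;
	return true;
}

/* Returns 0 if only flags may be non-zero, -EINVAL otherwise. */
static int demo_verify_flow_keys(const struct demo_flow_keys *ctx)
{
	const size_t flags_off = offsetof(struct demo_flow_keys, flags);
	const size_t flags_end = flags_off + sizeof(ctx->flags);

	assert(ctx != NULL);

	/* Everything before flags must be zero. */
	if (!demo_range_is_zero(ctx, 0, flags_off))
		return -EINVAL;

	/* Everything after flags must be zero as well. */
	if (!demo_range_is_zero(ctx, flags_end, sizeof(*ctx)))
		return -EINVAL;

	return 0;
}
```

Rejecting non-zero bytes in the other fields keeps them available for
future input use without changing the meaning of existing test programs.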

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 4e41d15a1098..444a7baed791 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -377,6 +377,22 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	return ret;
 }
 
+static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
+{
+	/* make sure the fields we don't use are zeroed */
+	if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags)))
+		return -EINVAL;
+
+	/* flags is allowed */
+
+	if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) +
+			   FIELD_SIZEOF(struct bpf_flow_keys, flags),
+			   sizeof(struct bpf_flow_keys)))
+		return -EINVAL;
+
+	return 0;
+}
+
 int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 				     const union bpf_attr *kattr,
 				     union bpf_attr __user *uattr)
@@ -384,9 +400,11 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	u32 size = kattr->test.data_size_in;
 	struct bpf_flow_dissector ctx = {};
 	u32 repeat = kattr->test.repeat;
+	struct bpf_flow_keys *user_ctx;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
 	const struct ethhdr *eth;
+	unsigned int flags = 0;
 	u32 retval, duration;
 	void *data;
 	int ret;
@@ -395,9 +413,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
 		return -EINVAL;
 
-	if (kattr->test.ctx_in || kattr->test.ctx_out)
-		return -EINVAL;
-
 	if (size < ETH_HLEN)
 		return -EINVAL;
 
@@ -410,6 +425,18 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (!repeat)
 		repeat = 1;
 
+	user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys));
+	if (IS_ERR(user_ctx)) {
+		kfree(data);
+		return PTR_ERR(user_ctx);
+	}
+	if (user_ctx) {
+		ret = verify_user_bpf_flow_keys(user_ctx);
+		if (ret)
+			goto out;
+		flags = user_ctx->flags;
+	}
+
 	ctx.flow_keys = &flow_keys;
 	ctx.data = data;
 	ctx.data_end = (__u8 *)data + size;
@@ -419,7 +446,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size, 0);
+					  size, flags);
 
 		if (signal_pending(current)) {
 			preempt_enable();
@@ -450,8 +477,12 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 
 	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
 			      retval, duration);
+	if (!ret)
+		ret = bpf_ctx_finish(kattr, uattr, user_ctx,
+				     sizeof(struct bpf_flow_keys));
 
 out:
 	kfree(data);
+	kfree(user_ctx);
 	return ret;
 }
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 2/7] bpf/flow_dissector: document flags
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

Describe what each input flag does and who uses it.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/prog_flow_dissector.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
index ed343abe541e..0f3f380b2ce4 100644
--- a/Documentation/bpf/prog_flow_dissector.rst
+++ b/Documentation/bpf/prog_flow_dissector.rst
@@ -26,6 +26,7 @@ and output arguments.
   * ``nhoff`` - initial offset of the networking header
   * ``thoff`` - initial offset of the transport header, initialized to nhoff
   * ``n_proto`` - L3 protocol type, parsed out of L2 header
+  * ``flags`` - optional flags
 
 Flow dissector BPF program should fill out the rest of the ``struct
 bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
@@ -101,6 +102,23 @@ can be called for both cases and would have to be written carefully to
 handle both cases.
 
 
+Flags
+=====
+
+``flow_keys->flags`` might contain optional input flags that work as follows:
+
+* ``FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells BPF flow dissector to continue
+  parsing first fragment; the default expected behavior is that flow dissector
+  returns as soon as it finds out that the packet is fragmented;
+  used by ``eth_get_headlen`` to estimate length of all headers for GRO.
+* ``FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells BPF flow dissector to stop
+  parsing as soon as it reaches IPv6 flow label; used by ``___skb_get_hash``
+  and ``__skb_get_hash_symmetric`` to get flow hash.
+* ``FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells BPF flow dissector to stop
+  parsing as soon as it reaches encapsulated headers; used by routing
+  infrastructure.
+
+
 Reference Implementation
 ========================
 
-- 
2.22.0.657.g960e92d24f-goog



* [PATCH bpf-next 0/7] bpf/flow_dissector: support input flags
From: Stanislav Fomichev @ 2019-07-24 17:00 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov

The C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible.
The BPF flow dissector always parses as deep as possible, which is
sub-optimal. Pass input flags to the BPF flow dissector as well so it can
make the same decisions.
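
As a rough illustration of the decisions a flag-aware dissector makes, here
is a minimal user-space sketch. The flag values are the ones exported in
this series; enum demo_parse_event and demo_stop_parsing() are hypothetical
names that only model the behaviour and are not part of any kernel or BPF
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as exported to uapi by this series. */
#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	(1U << 1)
#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)

static_assert((FLOW_DISSECTOR_F_PARSE_1ST_FRAG &
	       FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL) == 0 &&
	      (FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL &
	       FLOW_DISSECTOR_F_STOP_AT_ENCAP) == 0,
	      "flags must be distinct bits");

/* Hypothetical points at which a dissector may decide to stop. */
enum demo_parse_event {
	DEMO_EV_FIRST_FRAGMENT,	/* packet is the first fragment */
	DEMO_EV_FLOW_LABEL,	/* IPv6 flow label has been read */
	DEMO_EV_ENCAP,		/* an encapsulated header follows */
};

/* Returns true if parsing should stop at this event for these flags. */
static bool demo_stop_parsing(enum demo_parse_event ev, uint32_t flags)
{
	switch (ev) {
	case DEMO_EV_FIRST_FRAGMENT:
		/* Default is to stop; the flag asks to keep going. */
		return !(flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG);
	case DEMO_EV_FLOW_LABEL:
		return (flags & FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL) != 0;
	case DEMO_EV_ENCAP:
		return (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) != 0;
	}
	return false;
}
```

Note the asymmetry: FLOW_DISSECTOR_F_PARSE_1ST_FRAG asks for deeper parsing,
while the two STOP_AT flags ask for an earlier exit.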

Series outline:
* remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
* export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
  flow dissector
* add documentation for the exported flags
* support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
* sync uapi to tools
* support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
* support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
* support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest

Pros:
* makes BPF flow dissector faster by avoiding burning extra cycles
* existing BPF progs continue to work by ignoring the flags and always
  parsing as deep as possible

Cons:
* new UAPI which we need to support (OTOH, if we need to deprecate some
  flags, we can just stop setting them upon calling BPF programs)

Some numbers (with .repeat = 4000000 in test_flow_dissector):
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec

        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec

        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec

        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec

The improvement is around 10%, but it's in a tight cache-hot
BPF_PROG_TEST_RUN loop.

Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>

Stanislav Fomichev (7):
  bpf/flow_dissector: pass input flags to BPF flow dissector program
  bpf/flow_dissector: document flags
  bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
  tools/bpf: sync bpf_flow_keys flags
  selftests/bpf: support FLOW_DISSECTOR_F_PARSE_1ST_FRAG
  bpf/flow_dissector: support ipv6 flow_label and
    FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
  selftests/bpf: support FLOW_DISSECTOR_F_STOP_AT_ENCAP

 Documentation/bpf/prog_flow_dissector.rst     |  18 ++
 include/linux/skbuff.h                        |   2 +-
 include/net/flow_dissector.h                  |   4 -
 include/uapi/linux/bpf.h                      |   6 +
 net/bpf/test_run.c                            |  39 ++-
 net/core/flow_dissector.c                     |  14 +-
 tools/include/uapi/linux/bpf.h                |   6 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 235 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  46 +++-
 9 files changed, 353 insertions(+), 17 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply

* [PATCH bpf-next 1/6] bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

Move reusable code from map_lookup_elem to helper functions to avoid code
duplication in kernel/bpf/syscall.c
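
For readers sizing a user-space buffer, the per-CPU branch of
bpf_map_value_size can be mirrored roughly as below. This is only a
sketch: demo_round_up8() and demo_percpu_value_size() are hypothetical
helpers, and the CPU count is passed in rather than taken from
num_possible_cpus().

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of 8, like round_up(n, 8). */
static uint32_t demo_round_up8(uint32_t n)
{
	return (n + 7u) & ~7u;
}

/* Bytes needed to hold one per-CPU map value for ncpus possible CPUs. */
static uint32_t demo_percpu_value_size(uint32_t value_size, uint32_t ncpus)
{
	assert(ncpus > 0);
	return demo_round_up8(value_size) * ncpus;
}
```

For example, a 12-byte value on a machine with 4 possible CPUs needs
16 * 4 = 64 bytes.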

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 kernel/bpf/syscall.c | 134 +++++++++++++++++++++++--------------------
 1 file changed, 73 insertions(+), 61 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f16f6fa9..86cdc2f7bb56e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -126,6 +126,76 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 	return map;
 }
 
+static u32 bpf_map_value_size(struct bpf_map *map)
+{
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
+	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
+		return round_up(map->value_size, 8) * num_possible_cpus();
+	else if (IS_FD_MAP(map))
+		return sizeof(u32);
+	else
+		return  map->value_size;
+}
+
+static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
+			      __u64 flags)
+{
+	void *ptr;
+	int err;
+
+	if (bpf_map_is_dev_bound(map))
+		return  bpf_map_offload_lookup_elem(map, key, value);
+
+	preempt_disable();
+	this_cpu_inc(bpf_prog_active);
+	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
+	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
+		err = bpf_percpu_hash_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+		err = bpf_percpu_array_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
+		err = bpf_percpu_cgroup_storage_copy(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
+		err = bpf_stackmap_copy(map, key, value);
+	} else if (IS_FD_ARRAY(map)) {
+		err = bpf_fd_array_map_lookup_elem(map, key, value);
+	} else if (IS_FD_HASH(map)) {
+		err = bpf_fd_htab_map_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
+		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
+	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+		   map->map_type == BPF_MAP_TYPE_STACK) {
+		err = map->ops->map_peek_elem(map, value);
+	} else {
+		rcu_read_lock();
+		if (map->ops->map_lookup_elem_sys_only)
+			ptr = map->ops->map_lookup_elem_sys_only(map, key);
+		else
+			ptr = map->ops->map_lookup_elem(map, key);
+		if (IS_ERR(ptr)) {
+			err = PTR_ERR(ptr);
+		} else if (!ptr) {
+			err = -ENOENT;
+		} else {
+			err = 0;
+			if (flags & BPF_F_LOCK)
+				/* lock 'ptr' and copy everything but lock */
+				copy_map_value_locked(map, value, ptr, true);
+			else
+				copy_map_value(map, value, ptr);
+			/* mask lock, since value wasn't zero inited */
+			check_and_init_map_lock(map, value);
+		}
+		rcu_read_unlock();
+	}
+	this_cpu_dec(bpf_prog_active);
+	preempt_enable();
+
+	return err;
+}
+
 void *bpf_map_area_alloc(size_t size, int numa_node)
 {
 	/* We really just want to fail instead of triggering OOM killer
@@ -729,7 +799,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	void __user *uvalue = u64_to_user_ptr(attr->value);
 	int ufd = attr->map_fd;
 	struct bpf_map *map;
-	void *key, *value, *ptr;
+	void *key, *value;
 	u32 value_size;
 	struct fd f;
 	int err;
@@ -761,72 +831,14 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY ||
-	    map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE)
-		value_size = round_up(map->value_size, 8) * num_possible_cpus();
-	else if (IS_FD_MAP(map))
-		value_size = sizeof(u32);
-	else
-		value_size = map->value_size;
+	value_size = bpf_map_value_size(map);
 
 	err = -ENOMEM;
 	value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
 	if (!value)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
-		err = bpf_map_offload_lookup_elem(map, key, value);
-		goto done;
-	}
-
-	preempt_disable();
-	this_cpu_inc(bpf_prog_active);
-	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
-	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		err = bpf_percpu_hash_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
-		err = bpf_percpu_array_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) {
-		err = bpf_percpu_cgroup_storage_copy(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
-		err = bpf_stackmap_copy(map, key, value);
-	} else if (IS_FD_ARRAY(map)) {
-		err = bpf_fd_array_map_lookup_elem(map, key, value);
-	} else if (IS_FD_HASH(map)) {
-		err = bpf_fd_htab_map_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
-		err = bpf_fd_reuseport_array_lookup_elem(map, key, value);
-	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
-		   map->map_type == BPF_MAP_TYPE_STACK) {
-		err = map->ops->map_peek_elem(map, value);
-	} else {
-		rcu_read_lock();
-		if (map->ops->map_lookup_elem_sys_only)
-			ptr = map->ops->map_lookup_elem_sys_only(map, key);
-		else
-			ptr = map->ops->map_lookup_elem(map, key);
-		if (IS_ERR(ptr)) {
-			err = PTR_ERR(ptr);
-		} else if (!ptr) {
-			err = -ENOENT;
-		} else {
-			err = 0;
-			if (attr->flags & BPF_F_LOCK)
-				/* lock 'ptr' and copy everything but lock */
-				copy_map_value_locked(map, value, ptr, true);
-			else
-				copy_map_value(map, value, ptr);
-			/* mask lock, since value wasn't zero inited */
-			check_and_init_map_lock(map, value);
-		}
-		rcu_read_unlock();
-	}
-	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
-
-done:
+	err = bpf_map_copy_value(map, key, value, attr->flags);
 	if (err)
 		goto free_value;
 
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 3/6] bpf: keep bpf.h in sync with tools/
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

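Add the bpf_attr.dump structure and the BPF_MAP_DUMP command to the
tools/ copy of bpf.h, and export bpf_map_dump and bpf_map_dump_flags in
libbpf.map.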

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/include/uapi/linux/bpf.h | 9 +++++++++
 tools/lib/bpf/libbpf.map       | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65f..e127f16e4e932 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f9d316e873d8d..cac3723d5c45c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -183,4 +183,6 @@ LIBBPF_0.0.4 {
 		perf_buffer__new;
 		perf_buffer__new_raw;
 		perf_buffer__poll;
+		bpf_map_dump;
+		bpf_map_dump_flags;
 } LIBBPF_0.0.3;
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 5/6] selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

This test exercises the new command on a bpf hashmap and makes sure it
works as expected.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 83 ++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 5443b9bd75ed7..f7ab401399d40 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -309,6 +309,86 @@ static void test_hashmap_walk(unsigned int task, void *data)
 	close(fd);
 }
 
+static void test_hashmap_dump(void)
+{
+	int fd, i, max_entries = 5;
+	uint64_t keys[max_entries], values[max_entries];
+	uint64_t key, value, next_key, prev_key;
+	bool next_key_valid = true;
+	void *buf, *elem;
+	u32 buf_len;
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+
+	// Get the elements in the hashmap, and store them in that order
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	i = 0;
+	keys[i] = key;
+	for (i = 1; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &values[i - 1]) == 0);
+		keys[i-1] = key;
+		key = next_key;
+	}
+
+	// Alloc memory for the whole table
+	buf = malloc(elem_size * max_entries);
+	assert(buf != NULL);
+
+	// Check that buf_len < elem_size returns EINVAL
+	buf_len = elem_size-1;
+	errno = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == -1 && errno == EINVAL);
+
+	// Check that it returns the first two elements
+	errno = 0;
+	buf_len = elem_size * 2;
+	i = 0;
+	assert(bpf_map_dump(fd, NULL, buf, &buf_len) == 0 &&
+	       buf_len == 2*elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+
+	/* Check that prev_key contains key from last_elem retrieved in previous
+	 * call
+	 */
+	prev_key = *((uint64_t *)elem);
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size*2);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	elem = buf + elem_size;
+	i++;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+	i++;
+	assert(prev_key == (*(uint64_t *)elem));
+
+	/* Continue reading from map and verify buf_len only contains 1 element
+	 * even though buf_len is 2 elem_size and it returns err = 0.
+	 */
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == 0 &&
+	       buf_len == elem_size);
+	elem = buf;
+	assert((*(uint64_t *)elem) == keys[i] &&
+	       (*(uint64_t *)(elem + sizeof(key))) == values[i]);
+
+	// Verify there's no more entries and err = ENOENT
+	assert(bpf_map_dump(fd, &prev_key, buf, &buf_len) == -1 &&
+	       errno == ENOENT);
+
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1677,6 +1757,7 @@ static void run_all_tests(void)
 	test_hashmap_percpu(0, NULL);
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
+	test_hashmap_dump();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
@@ -1714,11 +1795,9 @@ int main(void)
 
 	map_flags = BPF_F_NO_PREALLOC;
 	run_all_tests();
-
 #define CALL
 #include <map_tests/tests.h>
 #undef CALL
-
 	printf("test_maps: OK, %d SKIPPED\n", skips);
 	return 0;
 }
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 4/6] libbpf: support BPF_MAP_DUMP command
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

Make libbpf aware of the new BPF_MAP_DUMP command and add bpf_map_dump
and bpf_map_dump_flags to use it from the library.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/lib/bpf/bpf.c | 28 ++++++++++++++++++++++++++++
 tools/lib/bpf/bpf.h |  4 ++++
 2 files changed, 32 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index c7d7993c44bb0..c1139b7db756a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -368,6 +368,34 @@ int bpf_map_update_elem(int fd, const void *key, const void *value,
 	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_dump(int fd, const void *prev_key, void *buf, void *buf_len)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
+int bpf_map_dump_flags(int fd, const void *prev_key, void *buf, void *buf_len,
+		       __u64 flags)
+{
+	union bpf_attr attr;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.dump.map_fd = fd;
+	attr.dump.prev_key = ptr_to_u64(prev_key);
+	attr.dump.buf = ptr_to_u64(buf);
+	attr.dump.buf_len = ptr_to_u64(buf_len);
+	attr.dump.flags = flags;
+
+	return sys_bpf(BPF_MAP_DUMP, &attr, sizeof(attr));
+}
+
 int bpf_map_lookup_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ff42ca043dc8f..86496443440e9 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -112,6 +112,10 @@ LIBBPF_API int bpf_verify_program(enum bpf_prog_type type,
 LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
 				   __u64 flags);
 
+LIBBPF_API int bpf_map_dump(int fd, const void *prev_key, void *buf,
+				void *buf_len);
+LIBBPF_API int bpf_map_dump_flags(int fd, const void *prev_key, void *buf,
+				void *buf_len, __u64 flags);
 LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
 LIBBPF_API int bpf_map_lookup_elem_flags(int fd, const void *key, void *value,
 					 __u64 flags);
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 6/6] selftests/bpf: add test to measure performance of BPF_MAP_DUMP
From: Brian Vazquez @ 2019-07-24 16:58 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

This test compares the amount of time it takes to read an entire
table of 100K elements on a bpf hashmap using both BPF_MAP_DUMP and
BPF_MAP_GET_NEXT_KEY + BPF_MAP_LOOKUP_ELEM.

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 tools/testing/selftests/bpf/test_maps.c | 65 +++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index f7ab401399d40..c4593a8904ca6 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -18,6 +18,7 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <linux/bpf.h>
+#include <linux/time64.h>
 
 #include <bpf/bpf.h>
 #include <bpf/libbpf.h>
@@ -389,6 +390,69 @@ static void test_hashmap_dump(void)
 	close(fd);
 }
 
+static void test_hashmap_dump_perf(void)
+{
+	int fd, i, max_entries = 100000;
+	uint64_t key, value, next_key;
+	bool next_key_valid = true;
+	void *buf;
+	u32 buf_len, entries;
+	int j = 0;
+	int clk_id = CLOCK_MONOTONIC;
+	struct timespec begin, end;
+	long long time_spent, dump_time_spent;
+	double res;
+	int tests[] = {1, 2, 230, 5000, 73000, 100000, 234567};
+	int test_len = ARRAY_SIZE(tests);
+	const int elem_size = sizeof(key) + sizeof(value);
+
+	fd = helper_fill_hashmap(max_entries);
+	// Alloc memory considering the largest buffer
+	buf = malloc(elem_size * tests[test_len-1]);
+	assert(buf != NULL);
+
+test:
+	entries = tests[j];
+	buf_len = elem_size*tests[j];
+	j++;
+	clock_gettime(clk_id, &begin);
+	errno = 0;
+	i = 0;
+	while (errno == 0) {
+		bpf_map_dump(fd, !i ? NULL : &key,
+				  buf, &buf_len);
+		if (errno)
+			break;
+		if (!i)
+			key = *((uint64_t *)(buf + buf_len - elem_size));
+		i += buf_len / elem_size;
+	}
+	clock_gettime(clk_id, &end);
+	assert(i  == max_entries);
+	dump_time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+			  end.tv_nsec - begin.tv_nsec;
+	next_key_valid = true;
+	clock_gettime(clk_id, &begin);
+	assert(bpf_map_get_next_key(fd, NULL, &key) == 0);
+	for (i = 0; next_key_valid; i++) {
+		next_key_valid = bpf_map_get_next_key(fd, &key, &next_key) == 0;
+		assert(bpf_map_lookup_elem(fd, &key, &value) == 0);
+		key = next_key;
+	}
+	clock_gettime(clk_id, &end);
+	time_spent = NSEC_PER_SEC * (end.tv_sec - begin.tv_sec) +
+		     end.tv_nsec - begin.tv_nsec;
+	res = (1-((double)dump_time_spent/time_spent))*100;
+	printf("buf_len_%u:\t %llu entry-by-entry: %llu improvement %lf\n",
+	       entries, dump_time_spent, time_spent, res);
+	assert(i  == max_entries);
+
+	if (j < test_len)
+		goto test;
+	free(buf);
+	close(fd);
+}
+
 static void test_hashmap_zero_seed(void)
 {
 	int i, first, second, old_flags;
@@ -1758,6 +1822,7 @@ static void run_all_tests(void)
 	test_hashmap_walk(0, NULL);
 	test_hashmap_zero_seed();
 	test_hashmap_dump();
+	test_hashmap_dump_perf();
 
 	test_arraymap(0, NULL);
 	test_arraymap_percpu(0, NULL);
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez
In-Reply-To: <20190724165803.87470-1-brianvv@google.com>

This introduces a new command to retrieve multiple entries from a bpf
map, wrapping the existing bpf methods map_get_next_key and
map_lookup_elem.

To start dumping the map from the beginning you must specify NULL as
the prev_key.

The new API returns 0 when it successfully copied all the elements
requested, or when it copied fewer because there were no more elements
to retrieve (i.e. err == -ENOENT). In the latter scenario err is masked
to 0.

On a successful call, buf and buf_len contain valid data. If prev_key
was provided (that is, on any call but the first, where prev_key is
NULL), the last key copied is written back into prev_key, which
simplifies the next call.

Only when it can't find a single element will it return -ENOENT, meaning
that the map has been entirely walked. When an error is returned, buf,
buf_len and prev_key must not be read or used.

Because maps can be accessed from both userspace and kernel code, the
next_key may be found but the element may be gone by the time we try to
retrieve its value. In this case the function continues and tries to
get a new next_key, skipping the deleted key. If at some point the
function finds itself trapped in a loop, it will return -EINTR.

The function will try to fit as much as possible in the buf provided and
will return -EINVAL if buf_len is smaller than elem_size.
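
To make the buffer layout concrete: the returned buf holds
buf_len / elem_size elements, each a key immediately followed by its
value. A minimal sketch of unpacking it in user space, assuming 8-byte
keys and 8-byte values as in the selftest (struct demo_elem and
demo_unpack() are hypothetical names, not part of this series):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded element: 8-byte key followed by 8-byte value. */
struct demo_elem {
	uint64_t key;
	uint64_t value;
};

/* Decode up to max_out packed key/value pairs from buf.
 * Returns the number of elements decoded; a buf_len smaller than one
 * element yields 0.
 */
static size_t demo_unpack(const void *buf, uint32_t buf_len,
			  struct demo_elem *out, size_t max_out)
{
	const size_t elem_size = sizeof(uint64_t) + sizeof(uint64_t);
	const unsigned char *p = buf;
	size_t n = buf_len / elem_size;
	size_t i;

	assert(buf && out);
	if (n > max_out)
		n = max_out;
	for (i = 0; i < n; i++) {
		/* memcpy avoids assuming that buf is suitably aligned. */
		memcpy(&out[i].key, p + i * elem_size, sizeof(out[i].key));
		memcpy(&out[i].value, p + i * elem_size + sizeof(out[i].key),
		       sizeof(out[i].value));
	}
	return n;
}
```

The key of the last decoded element is what a caller would pass as
prev_key on the next call.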

QUEUE and STACK maps are not supported.

Note that map_dump doesn't guarantee a consistent view of the entire
table, since this function is always racing with kernel and user code.
The same behaviour is found when the entire table is walked using the
current interfaces: map_get_next_key + map_lookup_elem.
It is also important to note that with a locked map, the lock is grabbed
for one entry at a time, meaning that the returned buf might or might
not be consistent.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 include/uapi/linux/bpf.h |   9 +++
 kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc7..66dab5385170d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -106,6 +106,7 @@ enum bpf_cmd {
 	BPF_TASK_FD_QUERY,
 	BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 	BPF_MAP_FREEZE,
+	BPF_MAP_DUMP,
 };
 
 enum bpf_map_type {
@@ -388,6 +389,14 @@ union bpf_attr {
 		__u64		flags;
 	};
 
+	struct { /* struct used by BPF_MAP_DUMP command */
+		__aligned_u64	prev_key;
+		__aligned_u64	buf;
+		__aligned_u64	buf_len; /* input/output: len of buf */
+		__u64		flags;
+		__u32		map_fd;
+	} dump;
+
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
 		__u32		prog_type;	/* one of enum bpf_prog_type */
 		__u32		insn_cnt;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 86cdc2f7bb56e..0c35505aa219f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
 	return err;
 }
 
+/* last field in 'union bpf_attr' used by this command */
+#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
+
+static int map_dump(union bpf_attr *attr)
+{
+	void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
+	void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
+	u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
+	int ufd = attr->dump.map_fd;
+	struct bpf_map *map;
+	void *buf, *prev_key, *key, *value;
+	u32 value_size, elem_size, buf_len, cp_len;
+	struct fd f;
+	int err;
+	bool first_key = false;
+
+	if (CHECK_ATTR(BPF_MAP_DUMP))
+		return -EINVAL;
+
+	if (attr->dump.flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
+	f = fdget(ufd);
+	map = __bpf_map_get(f);
+	if (IS_ERR(map))
+		return PTR_ERR(map);
+	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
+		err = -EPERM;
+		goto err_put;
+	}
+
+	if ((attr->dump.flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (map->map_type == BPF_MAP_TYPE_QUEUE ||
+	    map->map_type == BPF_MAP_TYPE_STACK) {
+		err = -ENOTSUPP;
+		goto err_put;
+	}
+
+	value_size = bpf_map_value_size(map);
+
+	err = get_user(buf_len, ubuf_len);
+	if (err)
+		goto err_put;
+
+	elem_size = map->key_size + value_size;
+	if (buf_len < elem_size) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
+	if (ukey) {
+		prev_key = __bpf_copy_key(ukey, map->key_size);
+		if (IS_ERR(prev_key)) {
+			err = PTR_ERR(prev_key);
+			goto err_put;
+		}
+	} else {
+		prev_key = NULL;
+		first_key = true;
+	}
+
+	err = -ENOMEM;
+	buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
+	if (!buf)
+		goto err_put;
+
+	key = buf;
+	value = key + map->key_size;
+	for (cp_len = 0; cp_len + elem_size <= buf_len;) {
+		if (signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		rcu_read_lock();
+		err = map->ops->map_get_next_key(map, prev_key, key);
+		rcu_read_unlock();
+
+		if (err)
+			break;
+
+		err = bpf_map_copy_value(map, key, value, attr->dump.flags);
+
+		if (err == -ENOENT)
+			continue;
+		if (err)
+			goto free_buf;
+
+		if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
+			err = -EFAULT;
+			goto free_buf;
+		}
+
+		prev_key = key;
+		cp_len += elem_size;
+	}
+
+	if (err == -ENOENT && cp_len)
+		err = 0;
+	if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
+		    (!first_key && copy_to_user(ukey, key, map->key_size))))
+		err = -EFAULT;
+free_buf:
+	kfree(buf);
+err_put:
+	fdput(f);
+	return err;
+}
+
 #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
 
 static int map_lookup_and_delete_elem(union bpf_attr *attr)
@@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
 		err = map_lookup_and_delete_elem(&attr);
 		break;
+	case BPF_MAP_DUMP:
+		err = map_dump(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller
  Cc: Stanislav Fomichev, Willem de Bruijn, Petar Penkov, linux-kernel,
	netdev, bpf, Brian Vazquez

This introduces a new command to retrieve multiple entries from a bpf
map.

This new command can be executed from the existing BPF syscall as
follows:

err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
attr->dump.buf_len
returns zero or negative error, and populates buf and buf_len on
success
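
A rough user-space sketch of how the attributes for this call could be
filled in. struct demo_dump_attr only mirrors the layout of the dump
struct proposed in this series and demo_fill_dump_attr() is a
hypothetical helper; real code would use union bpf_attr from the
patched linux/bpf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the proposed bpf_attr.dump layout. */
struct demo_dump_attr {
	_Alignas(8) uint64_t prev_key;	/* pointer to previous key, or 0 */
	_Alignas(8) uint64_t buf;	/* pointer to output buffer */
	_Alignas(8) uint64_t buf_len;	/* pointer to in/out length of buf */
	uint64_t flags;
	uint32_t map_fd;		/* last member, so no hole follows it */
};

/* Zero the attr and fill it in; prev_key == NULL starts from the
 * beginning of the map.
 */
static void demo_fill_dump_attr(struct demo_dump_attr *attr, int map_fd,
				const void *prev_key, void *buf,
				uint32_t *buf_len)
{
	assert(attr && buf && buf_len);
	memset(attr, 0, sizeof(*attr));
	attr->map_fd = (uint32_t)map_fd;
	attr->prev_key = (uint64_t)(uintptr_t)prev_key;
	attr->buf = (uint64_t)(uintptr_t)buf;
	attr->buf_len = (uint64_t)(uintptr_t)buf_len;
}
```

Note that buf_len is passed as a pointer, since the kernel both reads
the buffer size from it and writes back the number of bytes copied.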

This implementation is wrapping the existing bpf methods:
map_get_next_key and map_lookup_elem

Note that this implementation can be extended later to do dump and
delete by extending map_lookup_and_delete_elem (currently it only works
for bpf queue/stack maps), using either a new flag in map_dump or a new
command map_dump_and_delete.

Results show that even with a buffer of a single element (1 *
elem_size), it runs ~40% faster than the current implementation.
Improvements of ~85% are reported when the buffer size is increased,
although once the buffer size is around 5% of the total number of
entries, increasing it further makes no significant difference.

Tested:
Tried different buffer sizes to cover the cases where the buffer is
bigger than the map and where fewer elements are requested than the map
holds. All runs read a map of 100K entries. Below are the results (in
ns) from the different runs:

buf_len_1:       69038725 entry-by-entry: 112384424 improvement 38.569134
buf_len_2:       40897447 entry-by-entry: 111030546 improvement 63.165590
buf_len_230:     13652714 entry-by-entry: 111694058 improvement 87.776687
buf_len_5000:    13576271 entry-by-entry: 111101169 improvement 87.780263
buf_len_73000:   14694343 entry-by-entry: 111740162 improvement 86.849542
buf_len_100000:  13745969 entry-by-entry: 114151991 improvement 87.958187
buf_len_234567:  14329834 entry-by-entry: 114427589 improvement 87.476941
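
For reference, the improvement column is the percentage of time saved
relative to the entry-by-entry walk, as computed in the perf selftest.
A minimal sketch of that arithmetic (demo_improvement_pct() is a
hypothetical name used only here):

```c
#include <assert.h>

/* Percentage of time saved by the dump relative to the entry-by-entry
 * walk: (1 - dump / walk) * 100.
 */
static double demo_improvement_pct(long long dump_ns, long long walk_ns)
{
	assert(walk_ns > 0);
	return (1.0 - (double)dump_ns / (double)walk_ns) * 100.0;
}
```

With the buf_len_1 row, (1 - 69038725 / 112384424) * 100 gives roughly
38.57, matching the reported value.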

The series of patches is split as follows:

- First patch moves some map_lookup_elem logic into 2 functions to
deduplicate code: bpf_map_value_size and bpf_map_copy_value
- Second patch introduces the map_dump function
- Third patch syncs tools linux headers
- Fourth patch adds libbpf support
- Last two patches add tests

RFC Changelog:

- remove wrong usage of attr.flags
- move map_fd to remove hole after it

v3:
- add explanation of the API in the commit message
- fix masked errors and return them to user
- copy last_key from return buf into prev_key if it was provided
- run perf test with kpti and retpoline mitigations

v2:
- use proper bpf-next tag

Brian Vazquez (6):
  bpf: add bpf_map_value_size and bpf_map_copy_value helper functions
  bpf: add BPF_MAP_DUMP command to dump more than one entry per call
  bpf: keep bpf.h in sync with tools/
  libbpf: support BPF_MAP_DUMP command
  selftests/bpf: test BPF_MAP_DUMP command on a bpf hashmap
  selftests/bpf: add test to measure performance of BPF_MAP_DUMP

 include/uapi/linux/bpf.h                |   9 +
 kernel/bpf/syscall.c                    | 251 ++++++++++++++++++------
 tools/include/uapi/linux/bpf.h          |   9 +
 tools/lib/bpf/bpf.c                     |  28 +++
 tools/lib/bpf/bpf.h                     |   4 +
 tools/lib/bpf/libbpf.map                |   2 +
 tools/testing/selftests/bpf/test_maps.c | 148 +++++++++++++-
 7 files changed, 388 insertions(+), 63 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply

* Re: [RFC PATCH net-next 10/12] drop_monitor: Add packet alert mode
From: Ido Schimmel @ 2019-07-24 16:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, dsahern, roopa, nikolay, jakub.kicinski,
	toke, andy, f.fainelli, andrew, vivien.didelot, mlxsw,
	Ido Schimmel
In-Reply-To: <20190724125341.GB2225@nanopsycho>

On Wed, Jul 24, 2019 at 02:53:41PM +0200, Jiri Pirko wrote:
> Mon, Jul 22, 2019 at 08:31:32PM CEST, idosch@idosch.org wrote:
> >+static const struct net_dm_alert_ops *net_dm_alert_ops_arr[] = {
> >+	[NET_DM_ALERT_MODE_SUMMARY]	= &net_dm_alert_summary_ops,
> >+	[NET_DM_ALERT_MODE_PACKET]	= &net_dm_alert_packet_ops,
> >+};
> 
> Please split this patch into 2:
> 1) introducing the ops and modes (only summary)
> 2) introducing the packet mode

Ack

...

> >+static int net_dm_alert_mode_set(struct genl_info *info)
> >+{
> >+	struct netlink_ext_ack *extack = info->extack;
> >+	enum net_dm_alert_mode alert_mode;
> >+	int rc;
> >+
> >+	if (!info->attrs[NET_DM_ATTR_ALERT_MODE])
> >+		return 0;
> >+
> >+	rc = net_dm_alert_mode_get_from_info(info, &alert_mode);
> >+	if (rc) {
> >+		NL_SET_ERR_MSG_MOD(extack, "Invalid alert mode");
> >+		return -EINVAL;
> >+	}
> >+
> >+	net_dm_alert_mode = alert_mode;
> 
> 2 things:
> 1) Shouldn't you check if the tracing is on and return -EBUSY in case it is?

I'm doing it below in net_dm_cmd_config() :)
But I'm returning '-EOPNOTSUPP'. I guess '-EBUSY' is more appropriate.
Will change.

> 2) You setup the mode globally. I guess it is fine and it does not make
>    sense to do it otherwise, right? Like per-net or something.

Yes, it's global. I didn't change that aspect of drop monitor and I
don't really see a use case for that.

> 
> 
> >+
> >+	return 0;
> >+}
> >+
> > static int net_dm_cmd_config(struct sk_buff *skb,
> > 			struct genl_info *info)
> > {
> >-	NL_SET_ERR_MSG_MOD(info->extack, "Command not supported");
> >+	struct netlink_ext_ack *extack = info->extack;
> >+	int rc;
> > 
> >-	return -EOPNOTSUPP;
> >+	if (trace_state == TRACE_ON) {
> >+		NL_SET_ERR_MSG_MOD(extack, "Cannot configure drop monitor while tracing is on");
> >+		return -EOPNOTSUPP;
> >+	}
> >+
> >+	rc = net_dm_alert_mode_set(info);
> >+	if (rc)
> >+		return rc;
> >+
> >+	return 0;
> > }
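
The error-code change agreed above can be sketched in userspace as follows.
This is a minimal illustration, not the drop monitor code: the enums, the
globals and dm_config() are hypothetical stand-ins, and only the choice of
-EBUSY over -EOPNOTSUPP while tracing is on comes from the discussion.

```c
#include <errno.h>

/* Userspace stand-ins for the drop monitor state (hypothetical names). */
enum trace_state { TRACE_OFF, TRACE_ON };
enum alert_mode { ALERT_MODE_SUMMARY, ALERT_MODE_PACKET };

static enum trace_state trace_state = TRACE_OFF;
static enum alert_mode alert_mode = ALERT_MODE_SUMMARY;

/* Apply a new alert mode unless tracing is currently active. */
static int dm_config(enum alert_mode new_mode)
{
	/* The request is valid but cannot be honoured right now: -EBUSY. */
	if (trace_state == TRACE_ON)
		return -EBUSY;

	/* An unknown mode is a malformed request: -EINVAL. */
	if (new_mode != ALERT_MODE_SUMMARY && new_mode != ALERT_MODE_PACKET)
		return -EINVAL;

	alert_mode = new_mode;
	return 0;
}
```

With this split, a caller can tell "stop tracing and retry" (-EBUSY) apart
from "this request will never work" (-EINVAL).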

^ permalink raw reply

* Re: [RFC PATCH net-next 11/12] drop_monitor: Allow truncation of dropped packets
From: Ido Schimmel @ 2019-07-24 16:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, dsahern, roopa, nikolay, jakub.kicinski,
	toke, andy, f.fainelli, andrew, vivien.didelot, mlxsw,
	Ido Schimmel
In-Reply-To: <20190724125537.GC2225@nanopsycho>

On Wed, Jul 24, 2019 at 02:55:37PM +0200, Jiri Pirko wrote:
> Mon, Jul 22, 2019 at 08:31:33PM CEST, idosch@idosch.org wrote:
> >+static int net_dm_trunc_len_set(struct genl_info *info)
> 
> void.

Ack, will change.

> 
> 
> >+{
> >+	if (!info->attrs[NET_DM_ATTR_TRUNC_LEN])
> >+		return 0;
> >+
> >+	net_dm_trunc_len = nla_get_u32(info->attrs[NET_DM_ATTR_TRUNC_LEN]);
> >+
> >+	return 0;
> >+}
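
A setter that cannot fail has no use for an int return value, which is the
point of the "void" comment. A minimal userspace sketch, assuming a
hypothetical 'present' flag in place of the netlink attribute lookup:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the global truncation length (hypothetical). */
static uint32_t trunc_len;

/*
 * 'present' models "the attribute was supplied in the request".
 * Nothing here can fail, so the function returns void.
 */
static void dm_trunc_len_set(bool present, uint32_t value)
{
	if (!present)
		return;

	trunc_len = value;
}
```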

^ permalink raw reply

* Re: [RFC PATCH net-next 00/12] drop_monitor: Capture dropped packets and metadata
From: Ido Schimmel @ 2019-07-24 16:48 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Toke Høiland-Jørgensen, netdev, davem, nhorman, dsahern,
	roopa, nikolay, jakub.kicinski, andy, f.fainelli, andrew,
	vivien.didelot, mlxsw, Ido Schimmel
In-Reply-To: <20190724125851.GD2225@nanopsycho>

On Wed, Jul 24, 2019 at 02:58:51PM +0200, Jiri Pirko wrote:
> Shouldn't the queue len be configurable?

Yes, it will be configurable in v1. I will use a sane limit as default.

^ permalink raw reply

* Re: [PATCH net-next 03/10] sfc: Use dev_get_drvdata where possible
From: Edward Cree @ 2019-07-24 16:44 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Solarflare linux maintainers, Martin Habets, David S . Miller,
	netdev, linux-kernel
In-Reply-To: <20190724112658.13241-1-hslester96@gmail.com>

On 24/07/2019 12:26, Chuhong Yuan wrote:
> Instead of using to_pci_dev + pci_get_drvdata,
> use dev_get_drvdata to make code simpler.
>
> Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Edward Cree <ecree@solarflare.com>

> ---
>  drivers/net/ethernet/sfc/ef10.c |  4 ++--
>  drivers/net/ethernet/sfc/efx.c  | 10 +++++-----
>  2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
> index 16d6952c312a..0ec13f520e90 100644
> --- a/drivers/net/ethernet/sfc/ef10.c
> +++ b/drivers/net/ethernet/sfc/ef10.c
> @@ -508,7 +508,7 @@ static ssize_t efx_ef10_show_link_control_flag(struct device *dev,
>  					       struct device_attribute *attr,
>  					       char *buf)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  
>  	return sprintf(buf, "%d\n",
>  		       ((efx->mcdi->fn_flags) &
> @@ -520,7 +520,7 @@ static ssize_t efx_ef10_show_primary_flag(struct device *dev,
>  					  struct device_attribute *attr,
>  					  char *buf)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  
>  	return sprintf(buf, "%d\n",
>  		       ((efx->mcdi->fn_flags) &
> diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
> index ab58b837df47..2fef7402233e 100644
> --- a/drivers/net/ethernet/sfc/efx.c
> +++ b/drivers/net/ethernet/sfc/efx.c
> @@ -2517,7 +2517,7 @@ static struct notifier_block efx_netdev_notifier = {
>  static ssize_t
>  show_phy_type(struct device *dev, struct device_attribute *attr, char *buf)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  	return sprintf(buf, "%d\n", efx->phy_type);
>  }
>  static DEVICE_ATTR(phy_type, 0444, show_phy_type, NULL);
> @@ -2526,7 +2526,7 @@ static DEVICE_ATTR(phy_type, 0444, show_phy_type, NULL);
>  static ssize_t show_mcdi_log(struct device *dev, struct device_attribute *attr,
>  			     char *buf)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  	struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
>  
>  	return scnprintf(buf, PAGE_SIZE, "%d\n", mcdi->logging_enabled);
> @@ -2534,7 +2534,7 @@ static ssize_t show_mcdi_log(struct device *dev, struct device_attribute *attr,
>  static ssize_t set_mcdi_log(struct device *dev, struct device_attribute *attr,
>  			    const char *buf, size_t count)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  	struct efx_mcdi_iface *mcdi = efx_mcdi(efx);
>  	bool enable = count > 0 && *buf != '0';
>  
> @@ -3654,7 +3654,7 @@ static int efx_pci_sriov_configure(struct pci_dev *dev, int num_vfs)
>  
>  static int efx_pm_freeze(struct device *dev)
>  {
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  
>  	rtnl_lock();
>  
> @@ -3675,7 +3675,7 @@ static int efx_pm_freeze(struct device *dev)
>  static int efx_pm_thaw(struct device *dev)
>  {
>  	int rc;
> -	struct efx_nic *efx = pci_get_drvdata(to_pci_dev(dev));
> +	struct efx_nic *efx = dev_get_drvdata(dev);
>  
>  	rtnl_lock();
>  
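
Why the two forms are interchangeable can be shown with a small userspace
model. The structures below are simplified stand-ins, not the kernel
definitions; the sketch relies only on pci_get_drvdata() being a wrapper
around dev_get_drvdata(&pdev->dev) and on to_pci_dev() being a container_of()
lookup.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel structures. */
struct device {
	void *driver_data;
};

struct pci_dev {
	int devfn;
	struct device dev;	/* embedded generic device */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define to_pci_dev(d) container_of(d, struct pci_dev, dev)

static void *dev_get_drvdata(const struct device *dev)
{
	return dev->driver_data;
}

static void dev_set_drvdata(struct device *dev, void *data)
{
	dev->driver_data = data;
}

/* Wrapper: goes back to the embedded struct device. */
static void *pci_get_drvdata(struct pci_dev *pdev)
{
	return dev_get_drvdata(&pdev->dev);
}
```

Going from dev to the enclosing pci_dev and back to &pdev->dev is a round
trip, so calling dev_get_drvdata(dev) directly yields the same pointer.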


^ permalink raw reply

* Re: [PATCH -next v2] net/ixgbevf: fix a compilation error of skb_frag_t
From: Jeff Kirsher @ 2019-07-24 16:39 UTC (permalink / raw)
  To: Qian Cai, davem; +Cc: netdev, linux-kernel
In-Reply-To: <1563985079-12888-1-git-send-email-cai@lca.pw>

On Wed, 2019-07-24 at 12:17 -0400, Qian Cai wrote:
> The linux-next commit "net: Rename skb_frag_t size to bv_len" [1]
> introduced a compilation error on powerpc as it forgot to deal with
> the
> renaming from "size" to "bv_len" for ixgbevf.
> 
> [1] 
> https://lore.kernel.org/netdev/20190723030831.11879-1-willy@infradead.org/T/#md052f1c7de965ccd1bdcb6f92e1990a52298eac5
> 
> In file included from ./include/linux/cache.h:5,
>                  from ./include/linux/printk.h:9,
>                  from ./include/linux/kernel.h:15,
>                  from ./include/linux/list.h:9,
>                  from ./include/linux/module.h:9,
>                  from
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:12:
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function
> 'ixgbevf_xmit_frame_ring':
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:51: error:
> 'skb_frag_t' {aka 'struct bio_vec'} has no member named 'size'
>    count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
>                                                    ^
> ./include/uapi/linux/kernel.h:13:40: note: in definition of macro
> '__KERNEL_DIV_ROUND_UP'
>  #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
>                                         ^
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:12: note: in
> expansion of macro 'TXD_USE_COUNT'
>    count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> 
> v2: Use the fine accessor per Matthew.
> 
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Dave, I will pick this up and add it to my queue.

^ permalink raw reply

* Re: [PATCH net-next v1 1/4] enetc: Clean up local mdio bus allocation
From: Andrew Lunn @ 2019-07-24 16:39 UTC (permalink / raw)
  To: Claudiu Manoil
  Cc: David S . Miller, Rob Herring, Leo Li, Alexandru Marginean,
	netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <VI1PR04MB4880CD977A5D58DA0A7EE56696C60@VI1PR04MB4880.eurprd04.prod.outlook.com>

> >All the horrible casts go away, the driver is structured like every
> >other driver, sparse is probably happy, etc.
> >
> 
> This looks more like a matter of cosmetic preference.  I mean, I didn't
> notice anything "horrible" in the code so far.

#define bus_to_enetc_regs(bus)  (struct enetc_mdio_regs __iomem *)((bus)->priv)

You should not need a cast here, bus->priv is a void *. But bus->priv
is being abused to hold a __iomem pointer.

enetc_wr_reg(&regs->mdio_cfg, mdio_cfg);

This is also rather odd, passing the address of something to an IO
operator? I also don't know the C standard well enough to know if it
is guaranteed that:

struct enetc_mdio_regs {
        u32     mdio_cfg;       /* MDIO configuration and status */
        u32     mdio_ctl;       /* MDIO control */
        u32     mdio_data;      /* MDIO data */
        u32     mdio_addr;      /* MDIO address */
};

actually works. On a 64bit system is the compiler allowed to put in
padding to keep the u32 64 bit aligned?
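
The layout question can at least be pinned down at compile time. A minimal
sketch, assuming C11 and uint32_t in place of the kernel's u32: the C
standard does allow padding between members in general, so the assertions
below record the expected register offsets and make the build fail on any
ABI where they do not hold, rather than proving the layout everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct enetc_mdio_regs {
	uint32_t mdio_cfg;	/* MDIO configuration and status */
	uint32_t mdio_ctl;	/* MDIO control */
	uint32_t mdio_data;	/* MDIO data */
	uint32_t mdio_addr;	/* MDIO address */
};

/* Each register is expected at a 4-byte stride with no padding. */
static_assert(offsetof(struct enetc_mdio_regs, mdio_cfg) == 0, "mdio_cfg offset");
static_assert(offsetof(struct enetc_mdio_regs, mdio_ctl) == 4, "mdio_ctl offset");
static_assert(offsetof(struct enetc_mdio_regs, mdio_data) == 8, "mdio_data offset");
static_assert(offsetof(struct enetc_mdio_regs, mdio_addr) == 12, "mdio_addr offset");
static_assert(sizeof(struct enetc_mdio_regs) == 16, "unexpected padding");
```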

> I actually find it more
> ugly to define a new structure with only one element inside, like:
> struct enetc_mdio_priv {
>        struct enetc_hw *hw;
> }

One advantage of this is that struct enetc_hw correctly has all the
__iomem attributes. All the casts to __iomem go away, and sparse is
happy.

> Anyway, if others already did this in the kernel, what can I do?

Clean it up. Make the code more readable and easy to maintain.

      Andrew

^ permalink raw reply

* Re: Reminder: 99 open syzbot bugs in net subsystem
From: Eric Biggers @ 2019-07-24 16:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dmitry Vyukov, netdev, David S. Miller, Florian Westphal,
	Ilya Maximets, Eric Dumazet, David Ahern, linux-kernel,
	syzkaller-bugs
In-Reply-To: <63f12327-dd4b-5210-4de2-705af6bc4ba4@gmail.com>

On Wed, Jul 24, 2019 at 08:39:05AM +0200, Eric Dumazet wrote:
> 
> 
> On 7/24/19 3:38 AM, Eric Biggers wrote:
> > [This email was generated by a script.  Let me know if you have any suggestions
> > to make it better, or if you want it re-generated with the latest status.]
> > 
> > Of the currently open syzbot reports against the upstream kernel, I've manually
> > marked 99 of them as possibly being bugs in the net subsystem.  This category
> > only includes the networking bugs that I couldn't assign to a more specific
> > component (bpf, xfrm, bluetooth, tls, tipc, sctp, wireless, etc.).  I've listed
> > these reports below, sorted by an algorithm that tries to list first the reports
> > most likely to be still valid, important, and actionable.
> > 
> > Of these 99 bugs, 17 were seen in mainline in the last week.
> > 
> > Of these 99 bugs, 4 were bisected to commits from the following people:
> > 
> > 	Florian Westphal <fw@strlen.de>
> > 	Ilya Maximets <i.maximets@samsung.com>
> > 	Eric Dumazet <edumazet@google.com>
> > 	David Ahern <dsahern@gmail.com>
> > 
> > If you believe a bug is no longer valid, please close the syzbot report by
> > sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the
> > original thread, as explained at https://goo.gl/tpsmEJ#status
> > 
> > If you believe I misattributed a bug to the net subsystem, please let me know,
> > and if possible forward the report to the correct people or mailing list.
> >
> 
> Some of the bugs have been fixed already, before syzbot found them.
> 
> Why force humans to be gentle to bots and actually reply to them?
> 
> I usually simply wait until syzbot finds that the bug does not repro anymore,
> but now if you send these emails, we will have even more pressure on us.
> 

First, based on experience, I'd guess about 30-45 of these are still valid.  17
were seen in mainline in the last week, but some others are valid too.  The ones
most likely to still be valid are at the beginning of the list.  So let's try
not to use the presence of outdated bugs as an excuse not to fix current bugs.

Second, all these bug reports are still open, regardless of whether reminders
are sent or not.  I think you're really suggesting that possibly outdated bug
reports should be automatically invalidated by syzbot.

syzbot already does that for bugs with no reproducer.  However, that still
leaves a lot of outdated bugs with reproducers.

Since the kernel community is basically in continuous bug bankruptcy and lots of
syzbot reports are being ignored anyway, I'm in favor of making the invalidation
criteria more aggressive, so we can best focus people's efforts.  I understand
that Dmitry has been against this though, since a significant fraction of bugs
that syzbot stopped hitting for some reason actually turn out to be still valid.

But we probably have no choice.  So I suggest we agree on new criteria for
invalidating bugs.  I'd suggest assigning a timeout to each bug, based on
attributes like "seen in mainline?", "reproducer type", "bisected?", "does it
look like a 'bad' crash (e.g. use-after-free)"; similar to the algorithm I'm
using to sort the bugs in these reminders.  I.e., bugs most likely to
still be valid, important, and actionable get the longest timeouts.

Then if no crash or activity was seen within the timeout, the bug is closed.
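
As a rough illustration of what such criteria could look like (a sketch
only: the structure, the field names and every number below are hypothetical
and are not taken from syzbot or from the sorting script):

```c
#include <stdbool.h>

/* Hypothetical summary of one open report. */
struct bug_report {
	bool seen_in_mainline;
	bool has_c_reproducer;	/* stands in for "reproducer type" */
	bool bisected;
	bool bad_crash;		/* e.g. use-after-free */
	int days_since_last_activity;
};

/* Longer timeouts for bugs more likely to be valid and actionable. */
static int bug_timeout_days(const struct bug_report *bug)
{
	int days = 30;	/* hypothetical baseline */

	if (bug->seen_in_mainline)
		days += 30;
	if (bug->has_c_reproducer)
		days += 30;
	if (bug->bisected)
		days += 30;
	if (bug->bad_crash)
		days += 60;

	return days;
}

/* Close the report once it has been quiet for longer than its timeout. */
static bool bug_should_close(const struct bug_report *bug)
{
	return bug->days_since_last_activity > bug_timeout_days(bug);
}
```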

Any thoughts from anyone?

- Eric

^ permalink raw reply

* Re: kernel panic: stack is corrupted in pointer
From: John Fastabend @ 2019-07-24 16:22 UTC (permalink / raw)
  To: Dmitry Vyukov, John Fastabend
  Cc: syzbot, bpf, David Airlie, alexander.deucher, amd-gfx,
	Alexei Starovoitov, christian.koenig, Daniel Borkmann,
	david1.zhou, DRI, leo.liu, LKML, netdev, syzkaller-bugs,
	Marco Elver
In-Reply-To: <CACT4Y+ZbPmRB9T9ZzhE79VnKKD3+ieHeLpaDGRkcQ72nADKH_g@mail.gmail.com>

Dmitry Vyukov wrote:
> On Tue, Jul 23, 2019 at 7:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Dmitry Vyukov wrote:
> > > On Wed, Jul 17, 2019 at 10:58 AM syzbot
> > > <syzbot+79f5f028005a77ecb6bb@syzkaller.appspotmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > syzbot found the following crash on:
> > > >
> > > > HEAD commit:    1438cde7 Add linux-next specific files for 20190716
> > > > git tree:       linux-next
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13988058600000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=3430a151e1452331
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=79f5f028005a77ecb6bb
> > > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=111fc8afa00000
> > >
> > > From the repro it looks like the same bpf stack overflow bug. +John
> > > We need to dup them onto some canonical report for this bug, or this
> > > becomes unmanageable.
> >
> > Fixes in bpf tree should fix this. Hopefully, we will squash this once fixes
> > percolate up.
> >
> > #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
> 
> Cool! What is the fix?

It took a series of patches here,

https://www.spinics.net/lists/netdev/msg586986.html

The fix commits from bpf tree are,

(git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git) 

318892ac068397f40ff81d9155898da01493b1d2
ac78fc148d8249dbf382c2127456dd08ec5b161c
f87e62d45e51b12d48d2cb46b5cde8f83b866bc4
313ab004805cf52a42673b15852b3842474ccd87
32857cf57f920cdc03b5095f08febec94cf9c36b
45a4521dcbd92e71c9e53031b40e34211d3b4feb
2bb90e5cc90e1d09f631aeab041a9cf913a5bbe5
0e858739c2d2eedeeac1d35bfa0ec3cc2a7190d8
95fa145479fbc0a0c1fd3274ceb42ec03c042a4a

The last commit fixes this particular syzbot issue,

commit 95fa145479fbc0a0c1fd3274ceb42ec03c042a4a
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Fri Jul 19 10:29:22 2019 -0700

    bpf: sockmap/tls, close can race with map free

The other commits address some other issues found while testing.

> We don't need to wait for the fix to percolate up (and then down
> too!). syzbot gracefully handles when a patch is not yet present
> everywhere (it happens all the time).

Great. By the way the above should fix many of the outstanding
reports against bpf sockmap and tls side. I'll have to walk through
each one individually to double check though. I guess we can mark
them as dup reports and syzbot should sort it out?

> 
> Btw, this was due to a stack overflow, right? Or something else?

Right, a stack overflow due to a race in updating sock ops, where we build a
circular call chain.

> We are trying to make KASAN configuration detect stack overflows too,
> so that it does not cause havoc next time. But it turns out to be
> non-trivial and our current attempt seems to fail:
> https://groups.google.com/forum/#!topic/kasan-dev/IhYv7QYhLfY
> 
> 

^ permalink raw reply

* [PATCH -next v2] net/ixgbevf: fix a compilation error of skb_frag_t
From: Qian Cai @ 2019-07-24 16:17 UTC (permalink / raw)
  To: davem; +Cc: jeffrey.t.kirsher, netdev, linux-kernel, Qian Cai

The linux-next commit "net: Rename skb_frag_t size to bv_len" [1]
introduced a compilation error on powerpc as it forgot to deal with the
renaming from "size" to "bv_len" for ixgbevf.

[1] https://lore.kernel.org/netdev/20190723030831.11879-1-willy@infradead.org/T/#md052f1c7de965ccd1bdcb6f92e1990a52298eac5

In file included from ./include/linux/cache.h:5,
                 from ./include/linux/printk.h:9,
                 from ./include/linux/kernel.h:15,
                 from ./include/linux/list.h:9,
                 from ./include/linux/module.h:9,
                 from
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:12:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function
'ixgbevf_xmit_frame_ring':
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:51: error:
'skb_frag_t' {aka 'struct bio_vec'} has no member named 'size'
   count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
                                                   ^
./include/uapi/linux/kernel.h:13:40: note: in definition of macro
'__KERNEL_DIV_ROUND_UP'
 #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
                                        ^
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:12: note: in
expansion of macro 'TXD_USE_COUNT'
   count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);

Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: Use the fine accessor per Matthew.

 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index bdfccaf38edd..8c011d4ce7a9 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -4134,8 +4134,11 @@ static int ixgbevf_xmit_frame_ring(struct sk_buff *skb,
 	 * otherwise try next time
 	 */
 #if PAGE_SIZE > IXGBE_MAX_DATA_PER_TXD
-	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
-		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
+	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[f];
+
+		count += TXD_USE_COUNT(skb_frag_size(frag));
+	}
 #else
 	count += skb_shinfo(skb)->nr_frags;
 #endif
-- 
1.8.3.1
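
The reason an accessor survives the rename can be modelled in userspace.
This is a sketch under stated assumptions: skb_frag_t is reduced to a single
bv_len field, MAX_DATA_PER_TXD is a hypothetical value rather than the
driver's IXGBE_MAX_DATA_PER_TXD, and count_descriptors() is an illustrative
helper, not a driver function.

```c
#include <stddef.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define MAX_DATA_PER_TXD	4096u	/* hypothetical per-descriptor limit */
#define TXD_USE_COUNT(s)	DIV_ROUND_UP((s), MAX_DATA_PER_TXD)

/* Mock fragment: only the length field, already renamed to bv_len. */
typedef struct {
	unsigned int bv_len;
} skb_frag_t;

/* Callers use the accessor, so a field rename touches one place only. */
static unsigned int skb_frag_size(const skb_frag_t *frag)
{
	return frag->bv_len;
}

/* Sum the descriptors needed for all fragments, rounding each one up. */
static unsigned int count_descriptors(const skb_frag_t *frags, size_t nr_frags)
{
	unsigned int count = 0;
	size_t f;

	for (f = 0; f < nr_frags; f++)
		count += TXD_USE_COUNT(skb_frag_size(&frags[f]));

	return count;
}
```

Code that reads frags[f].size directly breaks as soon as the field is
renamed, which is the compile error quoted in the commit message.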


^ permalink raw reply related