* [net PATCH v2 1/2] GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
From: Alexander Duyck @ 2016-04-04 16:28 UTC
To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160404162545.14332.653.stgit@localhost.localdomain>
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE. Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.
The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header. As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.
Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
include/linux/netdevice.h | 5 ++++-
net/core/dev.c | 1 +
net/ipv4/fou.c | 6 ++++++
net/ipv4/gre_offload.c | 8 ++++++++
net/ipv4/ip_gre.c | 13 ++++++++++---
5 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d09c2e4..8395308a2445 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2120,7 +2120,10 @@ struct napi_gro_cb {
/* Used in foo-over-udp, set in udp[46]_gro_receive */
u8 is_ipv6:1;
- /* 7 bit hole */
+ /* Used in GRE, set in fou/gue_gro_receive */
+ u8 is_fou:1;
+
+ /* 6 bit hole */
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe77d913..77a71cd68535 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4439,6 +4439,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
NAPI_GRO_CB(skb)->flush = 0;
NAPI_GRO_CB(skb)->free = 0;
NAPI_GRO_CB(skb)->encap_mark = 0;
+ NAPI_GRO_CB(skb)->is_fou = 0;
NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
/* Setup for GRO checksum validation */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 5a94aea280d3..a39068b4a4d9 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -203,6 +203,9 @@ static struct sk_buff **fou_gro_receive(struct sk_buff **head,
*/
NAPI_GRO_CB(skb)->encap_mark = 0;
+ /* Flag this frame as already having an outer encap header */
+ NAPI_GRO_CB(skb)->is_fou = 1;
+
rcu_read_lock();
offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
ops = rcu_dereference(offloads[proto]);
@@ -368,6 +371,9 @@ static struct sk_buff **gue_gro_receive(struct sk_buff **head,
*/
NAPI_GRO_CB(skb)->encap_mark = 0;
+ /* Flag this frame as already having an outer encap header */
+ NAPI_GRO_CB(skb)->is_fou = 1;
+
rcu_read_lock();
offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
ops = rcu_dereference(offloads[guehdr->proto_ctype]);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index c47539d04b88..6a5bd4317866 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -150,6 +150,14 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head,
if ((greh->flags & ~(GRE_KEY|GRE_CSUM)) != 0)
goto out;
+ /* We can only support GRE_CSUM if we can track the location of
+ * the GRE header. In the case of FOU/GUE we cannot because the
+ * outer UDP header displaces the GRE header leaving us in a state
+ * of limbo.
+ */
+ if ((greh->flags & GRE_CSUM) && NAPI_GRO_CB(skb)->is_fou)
+ goto out;
+
type = greh->protocol;
rcu_read_lock();
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 31936d387cfd..af5d1f38217f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -862,9 +862,16 @@ static void __gre_tunnel_init(struct net_device *dev)
dev->hw_features |= GRE_FEATURES;
if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
- /* TCP offload with GRE SEQ is not supported. */
- dev->features |= NETIF_F_GSO_SOFTWARE;
- dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+ /* TCP offload with GRE SEQ is not supported, nor
+ * can we support 2 levels of outer headers requiring
+ * an update.
+ */
+ if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
+ (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
+ dev->features |= NETIF_F_GSO_SOFTWARE;
+ dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+ }
+
/* Can use a lockless transmit, unless we generate
* output sequences
*/
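The feature-selection rule from the ip_gre.c hunk above can be restated as a small userspace model. This is only a sketch: the EX_* constants and the helper name gre_gso_allowed() are made up for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative stand-ins; the real TUNNEL_* flag values live in kernel headers. */
#define EX_TUNNEL_CSUM 0x01u
#define EX_TUNNEL_SEQ  0x40u

/* Illustrative stand-in for the tunnel encapsulation type. */
enum ex_encap_type { EX_ENCAP_NONE, EX_ENCAP_FOU, EX_ENCAP_GUE };

/*
 * Mirrors the nested conditions in __gre_tunnel_init() after the patch:
 * software GSO is advertised only if GRE sequence numbers are off and
 * either the GRE checksum is off or there is no outer FOU/GUE header.
 */
static bool gre_gso_allowed(unsigned int o_flags, enum ex_encap_type encap)
{
	if (o_flags & EX_TUNNEL_SEQ)
		return false;

	return !(o_flags & EX_TUNNEL_CSUM) || encap == EX_ENCAP_NONE;
}
```

With this model, a GRE tunnel with checksums enabled keeps GSO only when it is not carried over FOU or GUE, which is the combination the commit message describes as broken.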
* [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864
From: Alexander Duyck @ 2016-04-04 16:31 UTC
To: herbert, tom, jesse, alexander.duyck, edumazet, netdev, davem
In-Reply-To: <20160404162545.14332.653.stgit@localhost.localdomain>
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly. Currently we are looking at this field
as a way of identifying what frames can be aggregated and which cannot for
GRO. While this is valid for frames that do not have DF set, it is invalid
to do so if the bit is set.
In addition we were generating IPv4 ID collisions when 2 or more flows were
interleaved over the same tunnel. To prevent that we store the result of
all IP ID checks via a "|=" instead of overwriting previous values.
With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set. In such a case we
simply won't write to the flush_id field in the GRO context block. The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.
In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed. However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set so this should cause
minimal harm.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
v2: Updated patch so that we now only support one of two options. Either
the IP ID is fixed with DF bit set, or the IP ID is incrementing. That
allows us to support the fixed ID case as occurs with IPv6 to IPv4
header translation and what is likely already out there for some
devices with tunnel headers.
net/core/dev.c | 1 +
net/ipv4/af_inet.c | 25 ++++++++++++++++++-------
net/ipv6/ip6_offload.c | 3 ---
3 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
unsigned long diffs;
NAPI_GRO_CB(p)->flush = 0;
+ NAPI_GRO_CB(p)->flush_id = 0;
if (hash != skb_get_hash_raw(p)) {
NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
for (p = *head; p; p = p->next) {
struct iphdr *iph2;
+ u16 flush_id;
if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
- /* Save the IP ID check to be included later when we get to
- * the transport layer so only the inner most IP ID is checked.
- * This is because some GSO/TSO implementations do not
- * correctly increment the IP ID for the outer hdrs.
- */
- NAPI_GRO_CB(p)->flush_id =
- ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
NAPI_GRO_CB(p)->flush |= flush;
+
+ /* We must save the offset as it is possible to have multiple
+ * flows using the same protocol and address pairs so we
+ * need to wait until we can validate this is part of the
+ * same flow with a 5-tuple or better to avoid unnecessary
+ * collisions between flows. We can support one of two
+ * possible scenarios, either a fixed value with DF bit set
+ * or an incrementing value with DF either set or unset.
+ * In the case of a fixed value we will end up losing the
+ * data that the IP ID was a fixed value, however per RFC
+ * 6864 in such a case the actual value of the IP ID is
+ * meant to be ignored anyway.
+ */
+ flush_id = (u16)(id - ntohs(iph2->id));
+ if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+ NAPI_GRO_CB(p)->flush_id |= flush_id ^
+ NAPI_GRO_CB(p)->count;
}
NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
/* flush if Traffic Class fields are different */
NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF00000));
NAPI_GRO_CB(p)->flush |= flush;
-
- /* Clear flush_id, there's really no concept of ID in IPv6. */
- NAPI_GRO_CB(p)->flush_id = 0;
}
NAPI_GRO_CB(skb)->flush |= flush;
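The IP ID check in the af_inet.c hunk above can be modeled in userspace to see which ID sequences still merge. This is a sketch under assumptions: the function name ip_id_flush_bits() and its parameters are hypothetical, and all IDs are taken to be in host byte order already.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the per-packet IP ID check from the patch.
 *
 * held_id: IP ID of the first packet already held for this flow
 * count:   number of packets aggregated so far
 * new_id:  IP ID of the incoming packet
 * df:      whether the held packet has the DF bit set
 *
 * Returns the bits that would be OR-ed into flush_id; a non-zero
 * result means the incoming packet must not be merged.
 */
static uint16_t ip_id_flush_bits(uint16_t held_id, uint16_t count,
				 uint16_t new_id, bool df)
{
	/* Offset of the new ID from the held one, modulo 2^16. */
	uint16_t offset = (uint16_t)(new_id - held_id);

	/* Fixed ID with DF set: nothing is written to flush_id. */
	if (offset == 0 && df)
		return 0;

	/* Otherwise the ID must have advanced by exactly 'count'. */
	return (uint16_t)(offset ^ count);
}
```

An incrementing ID passes with or without DF, a fixed ID passes only with DF set, and anything else produces a non-zero value. Because the kernel code accumulates the result with "|=", a mismatch found at one header level is not lost when another level is checked later.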
* [PATCHv2 net 1/3] samples/bpf: Fix build breakage with map_perf_test_user.c
From: Naveen N. Rao @ 2016-04-04 17:01 UTC
To: linux-kernel, linuxppc-dev, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller,
Ananth N Mavinakayanahalli, Michael Ellerman
Building BPF samples is failing with the below error:
samples/bpf/map_perf_test_user.c: In function ‘main’:
samples/bpf/map_perf_test_user.c:134:9: error: variable ‘r’ has
initializer but incomplete type
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
^
samples/bpf/map_perf_test_user.c:134:21: error: ‘RLIM_INFINITY’
undeclared (first use in this function)
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
^
samples/bpf/map_perf_test_user.c:134:21: note: each undeclared
identifier is reported only once for each function it appears in
samples/bpf/map_perf_test_user.c:134:9: warning: excess elements in
struct initializer [enabled by default]
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
^
samples/bpf/map_perf_test_user.c:134:9: warning: (near initialization
for ‘r’) [enabled by default]
samples/bpf/map_perf_test_user.c:134:9: warning: excess elements in
struct initializer [enabled by default]
samples/bpf/map_perf_test_user.c:134:9: warning: (near initialization
for ‘r’) [enabled by default]
samples/bpf/map_perf_test_user.c:134:16: error: storage size of ‘r’
isn’t known
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
^
samples/bpf/map_perf_test_user.c:139:2: warning: implicit declaration of
function ‘setrlimit’ [-Wimplicit-function-declaration]
setrlimit(RLIMIT_MEMLOCK, &r);
^
samples/bpf/map_perf_test_user.c:139:12: error: ‘RLIMIT_MEMLOCK’
undeclared (first use in this function)
setrlimit(RLIMIT_MEMLOCK, &r);
^
samples/bpf/map_perf_test_user.c:134:16: warning: unused variable ‘r’
[-Wunused-variable]
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
^
make[2]: *** [samples/bpf/map_perf_test_user.o] Error 1
Fix this by including the necessary header file.
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
v2: no changes
samples/bpf/map_perf_test_user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 95af56e..3147377 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -17,6 +17,7 @@
#include <linux/bpf.h>
#include <string.h>
#include <time.h>
+#include <sys/resource.h>
#include "libbpf.h"
#include "bpf_load.h"
--
2.7.4
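For reference, the declarations the failing build was missing all come from <sys/resource.h>. A minimal sketch of the same call pattern follows; the helper name set_memlock_limit() is hypothetical and is not part of the sample.

```c
#include <sys/resource.h>	/* struct rlimit, rlim_t, setrlimit(), RLIMIT_MEMLOCK */

/*
 * Set the locked-memory limit for the current process.
 * Returns 0 on success and -1 on failure (errno is set by setrlimit()).
 */
static int set_memlock_limit(rlim_t soft, rlim_t hard)
{
	struct rlimit r = {
		.rlim_cur = soft,
		.rlim_max = hard,
	};

	return setrlimit(RLIMIT_MEMLOCK, &r);
}
```

The sample itself passes RLIM_INFINITY for both values, which typically needs privileges; passing the values returned by getrlimit() back in should succeed for any user.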
* [PATCHv2 net 2/3] samples/bpf: Use llc in PATH, rather than a hardcoded value
From: Naveen N. Rao @ 2016-04-04 17:01 UTC
To: linux-kernel, linuxppc-dev, netdev
Cc: Alexei Starovoitov, David S . Miller, Daniel Borkmann,
Ananth N Mavinakayanahalli, Michael Ellerman
In-Reply-To: <5ab5eec91d2d0bc7f221d714ef84afac83b2604b.1459789086.git.naveen.n.rao@linux.vnet.ibm.com>
While at it, remove the generation of .s files and fix some typos in the
related comment.
Cc: Alexei Starovoitov <ast@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
v2: removed generation of .s files
samples/bpf/Makefile | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc..b820cc9 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -76,16 +76,10 @@ HOSTLOADLIBES_offwaketime += -lelf
HOSTLOADLIBES_spintest += -lelf
HOSTLOADLIBES_map_perf_test += -lelf -lrt
-# point this to your LLVM backend with bpf support
-LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
-
-# asm/sysreg.h inline assmbly used by it is incompatible with llvm.
-# But, ehere is not easy way to fix it, so just exclude it since it is
+# asm/sysreg.h - inline assembly used by it is incompatible with llvm.
+# But, there is no easy way to fix it, so just exclude it since it is
# useless for BPF samples.
$(obj)/%.o: $(src)/%.c
clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
- -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
- clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
- -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
- -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s
+ -O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@
--
2.7.4
* [PATCHv2 net 3/3] samples/bpf: Enable powerpc support
From: Naveen N. Rao @ 2016-04-04 17:01 UTC
To: linux-kernel, linuxppc-dev, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller,
Ananth N Mavinakayanahalli, Michael Ellerman
In-Reply-To: <5ab5eec91d2d0bc7f221d714ef84afac83b2604b.1459789086.git.naveen.n.rao@linux.vnet.ibm.com>
Add the necessary definitions for building bpf samples on ppc.
Since ppc doesn't store function return address on the stack, modify how
PT_REGS_RET() and PT_REGS_FP() work.
Also, introduce PT_REGS_IP() to access the instruction pointer.
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
v2: updated macros using ({ }) gcc extension as per Alexei
samples/bpf/bpf_helpers.h | 26 ++++++++++++++++++++++++++
samples/bpf/spintest_kern.c | 2 +-
samples/bpf/tracex2_kern.c | 4 ++--
samples/bpf/tracex4_kern.c | 2 +-
4 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 9363500..7904a2a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -82,6 +82,7 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
#define PT_REGS_FP(x) ((x)->bp)
#define PT_REGS_RC(x) ((x)->ax)
#define PT_REGS_SP(x) ((x)->sp)
+#define PT_REGS_IP(x) ((x)->ip)
#elif defined(__s390x__)
@@ -94,6 +95,7 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
#define PT_REGS_FP(x) ((x)->gprs[11]) /* Works only with CONFIG_FRAME_POINTER */
#define PT_REGS_RC(x) ((x)->gprs[2])
#define PT_REGS_SP(x) ((x)->gprs[15])
+#define PT_REGS_IP(x) ((x)->ip)
#elif defined(__aarch64__)
@@ -106,6 +108,30 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
#define PT_REGS_FP(x) ((x)->regs[29]) /* Works only with CONFIG_FRAME_POINTER */
#define PT_REGS_RC(x) ((x)->regs[0])
#define PT_REGS_SP(x) ((x)->sp)
+#define PT_REGS_IP(x) ((x)->pc)
+
+#elif defined(__powerpc__)
+
+#define PT_REGS_PARM1(x) ((x)->gpr[3])
+#define PT_REGS_PARM2(x) ((x)->gpr[4])
+#define PT_REGS_PARM3(x) ((x)->gpr[5])
+#define PT_REGS_PARM4(x) ((x)->gpr[6])
+#define PT_REGS_PARM5(x) ((x)->gpr[7])
+#define PT_REGS_RC(x) ((x)->gpr[3])
+#define PT_REGS_SP(x) ((x)->sp)
+#define PT_REGS_IP(x) ((x)->nip)
#endif
+
+#ifdef __powerpc__
+#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ (ip) = (ctx)->link; })
+#define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP
+#else
+#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ \
+ bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); })
+#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \
+ bpf_probe_read(&(ip), sizeof(ip), \
+ (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
+#endif
+
#endif
diff --git a/samples/bpf/spintest_kern.c b/samples/bpf/spintest_kern.c
index 4b27619..ce0167d 100644
--- a/samples/bpf/spintest_kern.c
+++ b/samples/bpf/spintest_kern.c
@@ -34,7 +34,7 @@ struct bpf_map_def SEC("maps") stackmap = {
#define PROG(foo) \
int foo(struct pt_regs *ctx) \
{ \
- long v = ctx->ip, *val; \
+ long v = PT_REGS_IP(ctx), *val; \
\
val = bpf_map_lookup_elem(&my_map, &v); \
bpf_map_update_elem(&my_map, &v, &v, BPF_ANY); \
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
index 09c1adc..6d6eefd 100644
--- a/samples/bpf/tracex2_kern.c
+++ b/samples/bpf/tracex2_kern.c
@@ -27,10 +27,10 @@ int bpf_prog2(struct pt_regs *ctx)
long init_val = 1;
long *value;
- /* x64/s390x specific: read ip of kfree_skb caller.
+ /* read ip of kfree_skb caller.
* non-portable version of __builtin_return_address(0)
*/
- bpf_probe_read(&loc, sizeof(loc), (void *)PT_REGS_RET(ctx));
+ BPF_KPROBE_READ_RET_IP(loc, ctx);
value = bpf_map_lookup_elem(&my_map, &loc);
if (value)
diff --git a/samples/bpf/tracex4_kern.c b/samples/bpf/tracex4_kern.c
index ac46714..6dd8e38 100644
--- a/samples/bpf/tracex4_kern.c
+++ b/samples/bpf/tracex4_kern.c
@@ -40,7 +40,7 @@ int bpf_prog2(struct pt_regs *ctx)
long ip = 0;
/* get ip address of kmem_cache_alloc_node() caller */
- bpf_probe_read(&ip, sizeof(ip), (void *)(PT_REGS_FP(ctx) + sizeof(ip)));
+ BPF_KRETPROBE_READ_RET_IP(ip, ctx);
struct pair v = {
.val = bpf_ktime_get_ns(),
--
2.7.4
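The difference between the two BPF_KPROBE_READ_RET_IP() variants above can be illustrated outside the kernel with mock register structures. Everything prefixed mock_ or MOCK_ below is an assumption made for illustration: the real macros operate on struct pt_regs and use bpf_probe_read() rather than memcpy(). Like the patch, the sketch relies on the GCC/clang ({ }) statement-expression extension.

```c
#include <string.h>

/* Mock register sets; only the fields used here are modeled. */
struct mock_x86_regs { unsigned long ip; unsigned long sp; };
struct mock_ppc_regs { unsigned long nip; unsigned long link; };

/*
 * x86-style: at function entry the return address sits in memory at
 * the stack pointer, so it has to be read through a pointer.
 */
#define MOCK_X86_READ_RET_IP(ip, ctx) ({ \
	memcpy(&(ip), (const void *)(ctx)->sp, sizeof(ip)); })

/*
 * powerpc-style: the return address is held in the link register,
 * so a plain register read is enough.
 */
#define MOCK_PPC_READ_RET_IP(ip, ctx) ({ (ip) = (ctx)->link; })
```

Hiding the difference behind one macro name is what lets tracex2_kern.c and tracex4_kern.c stay free of per-architecture code.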
* Re: [RFC PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF
From: Naveen N. Rao @ 2016-04-04 17:09 UTC
To: Daniel Borkmann
Cc: Alexei Starovoitov, linux-kernel, linuxppc-dev, Matt Evans, oss,
Paul Mackerras, netdev, David S. Miller
In-Reply-To: <56FEBF51.7090608@iogearbox.net>
On 2016/04/01 08:34PM, Daniel Borkmann wrote:
> On 04/01/2016 08:10 PM, Alexei Starovoitov wrote:
> >On 4/1/16 2:58 AM, Naveen N. Rao wrote:
> >>PPC64 eBPF JIT compiler. Works for both ABIv1 and ABIv2.
> >>
> >>Enable with:
> >>echo 1 > /proc/sys/net/core/bpf_jit_enable
> >>or
> >>echo 2 > /proc/sys/net/core/bpf_jit_enable
> >>
> >>... to see the generated JIT code. This can further be processed with
> >>tools/net/bpf_jit_disasm.
> >>
> >>With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
> >>test_bpf: Summary: 291 PASSED, 0 FAILED, [234/283 JIT'ed]
> >>
> >>... on both ppc64 BE and LE.
> >>
> >>The details of the approach are documented through various comments in
> >>the code, as are the TODOs. Some of the prominent TODOs include
> >>implementing BPF tail calls and skb loads.
> >>
> >>Cc: Matt Evans <matt@ozlabs.org>
> >>Cc: Michael Ellerman <mpe@ellerman.id.au>
> >>Cc: Paul Mackerras <paulus@samba.org>
> >>Cc: Alexei Starovoitov <ast@fb.com>
> >>Cc: "David S. Miller" <davem@davemloft.net>
> >>Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> >>Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
> >>---
> >> arch/powerpc/include/asm/ppc-opcode.h | 19 +-
> >> arch/powerpc/net/Makefile | 4 +
> >> arch/powerpc/net/bpf_jit.h | 66 ++-
> >> arch/powerpc/net/bpf_jit64.h | 58 +++
> >> arch/powerpc/net/bpf_jit_comp64.c | 828 ++++++++++++++++++++++++++++++++++
> >> 5 files changed, 973 insertions(+), 2 deletions(-)
> >> create mode 100644 arch/powerpc/net/bpf_jit64.h
> >> create mode 100644 arch/powerpc/net/bpf_jit_comp64.c
> >...
> >>-#ifdef CONFIG_PPC64
> >>+#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF != 2)
> >
> >impressive stuff!
>
> +1, awesome to see another one!
Thanks!
>
> >Everything nicely documented. Could you add a few words for the above
> >condition as well?
> >Or maybe a new macro, since it occurs many times?
> >What do these _CALL_ELF == 2 and != 2 conditions mean? ppc ABIs?
Yes, there are 2 ABIs: ppc64 (ABIv1) -- big endian and the recently
introduced ppc64le (ABIv2) which is currently only little endian. There
is also ppc32...
Good suggestion about using a macro. I will put out a patch for that.
> >Will there ever be v3 ?
Hope not! ;)
>
> Minor TODO would also be to convert to use bpf_jit_binary_alloc() and
> bpf_jit_binary_free() API for the image, which is done by other eBPF
> jits, too.
Sure. I will make that change.
>
> >So far most of the bpf jits have gone via the net-next tree, but if
> >in this case no changes to the core are necessary then I guess it's fine
> >to do it via the powerpc tree. What's your plan?
I initially thought this had to go through the powerpc tree. I don't
really have a preference and I'll allow the maintainers to take a call
on that. I do however need a review of the JIT code from Michael
Ellerman/Paul Mackerras.
- Naveen
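One way the repeated _CALL_ELF test discussed above could be wrapped in a macro is sketched below. The EX_ELF_ABI_* names and ex_elf_abi() are hypothetical and are not taken from the patch; the sketch also assumes that a ppc64 compiler which does not define _CALL_ELF is targeting ABIv1, which is what the "!defined(_CALL_ELF)" half of the quoted condition implies.

```c
/*
 * Hypothetical wrappers around the compiler-provided _CALL_ELF macro.
 * Exactly one of the two is 1 on ppc64; both are 0 elsewhere.
 */
#if defined(__powerpc64__) && defined(_CALL_ELF) && _CALL_ELF == 2
#define EX_ELF_ABI_V1 0
#define EX_ELF_ABI_V2 1
#elif defined(__powerpc64__)
#define EX_ELF_ABI_V1 1
#define EX_ELF_ABI_V2 0
#else
#define EX_ELF_ABI_V1 0
#define EX_ELF_ABI_V2 0
#endif

/* Returns 1 or 2 for the ppc64 ELF ABI version, or 0 when not on ppc64. */
static int ex_elf_abi(void)
{
	if (EX_ELF_ABI_V2)
		return 2;
	if (EX_ELF_ABI_V1)
		return 1;
	return 0;
}
```

Call sites could then test a single name instead of repeating the two-part preprocessor condition.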
* Re: [PATCHv2 net 2/3] samples/bpf: Use llc in PATH, rather than a hardcoded value
From: Alexei Starovoitov @ 2016-04-04 17:24 UTC
To: Naveen N. Rao
Cc: linux-kernel, linuxppc-dev, netdev, Alexei Starovoitov,
David S . Miller, Daniel Borkmann, Ananth N Mavinakayanahalli,
Michael Ellerman
In-Reply-To: <000e23c21f475e7106c74a83eb6226c4cb8d4a14.1459789086.git.naveen.n.rao@linux.vnet.ibm.com>
On Mon, Apr 04, 2016 at 10:31:33PM +0530, Naveen N. Rao wrote:
> While at it, remove the generation of .s files and fix some typos in the
> related comment.
>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
* Re: [PATCHv2 net 3/3] samples/bpf: Enable powerpc support
From: Alexei Starovoitov @ 2016-04-04 17:24 UTC
To: Naveen N. Rao
Cc: linux-kernel, linuxppc-dev, netdev, Alexei Starovoitov,
Daniel Borkmann, David S . Miller, Ananth N Mavinakayanahalli,
Michael Ellerman
In-Reply-To: <28d2e09a7db94a0a71b3222b3e5ffacb0d4a8dd7.1459789086.git.naveen.n.rao@linux.vnet.ibm.com>
On Mon, Apr 04, 2016 at 10:31:34PM +0530, Naveen N. Rao wrote:
> Add the necessary definitions for building bpf samples on ppc.
>
> Since ppc doesn't store function return address on the stack, modify how
> PT_REGS_RET() and PT_REGS_FP() work.
>
> Also, introduce PT_REGS_IP() to access the instruction pointer.
>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
* Re: Best way to reduce system call overhead for tun device I/O?
From: Stephen Hemminger @ 2016-04-04 17:28 UTC
To: ValdikSS
Cc: Guus Sliepen, Willem de Bruijn, David Miller, Tom Herbert, netdev
In-Reply-To: <57026D91.4090207@valdikss.org.ru>
On Mon, 4 Apr 2016 16:35:13 +0300
ValdikSS <iam@valdikss.org.ru> wrote:
> I'm trying to increase OpenVPN throughput by optimizing tun manipulations, too.
> Right now I have more questions than answers.
>
> I get about 800 Mbit/s via OpenVPN with authentication and encryption disabled on a local machine, with the OpenVPN server and client running in different
> network namespaces that use veth for networking, and an MTU of 1500 on the TUN interface. This is rather limiting. Low-end devices like SOHO routers with a
> 560 MHz CPU can only achieve 15-20 Mbit/s via OpenVPN with encryption.
> Increasing the MTU reduces overhead. You can get > 5 Gbit/s if you set an MTU of 16000 on the TUN interface.
> This is not only OpenVPN related. None of the tunneling software I tried can reach gigabit speeds without encryption on my machine with MTU 1500. I didn't test
> tinc though.
>
> TUN supports various offloading techniques: GSO, TSO, UFO, just as hardware NICs do. From what I understand, if we use GSO/GRO for TUN, we would be able to receive
> and send small packets combined into one huge packet with a single send/recv call while keeping MTU 1500 on the TUN interface, and performance should increase to
> match what we get now with an increased MTU. But there is very little information on how to use offloading with TUN.
> I've found some old example code which creates a TUN interface with GSO support (TUN_VNET_HDR), does NAT and echoes TUN data to stdout, and a script to run two
> instances of this software connected with a pipe. But it doesn't work for me: I never see any combined frames (gso_type is always 0 in the virtio_net_hdr header).
> Probably I did something wrong, but I'm not sure what exactly.
>
> Here's said application: http://ovrload.ru/f/68996_tun.tar.gz
>
> The questions are as follows:
>
> 1. Do I understand correctly that GSO/GRO would have the same effect as increasing the MTU on a TUN interface?
> 2. How is GRO/GSO different from TSO and UFO?
> 3. Can we get and send combined frames directly from/to a NIC with offloading support?
> 4. How should GRO/GSO, TSO and UFO be implemented? What should the logic behind it be?
>
>
> Any reply is greatly appreciated.
>
> P.S. this could be helpful: https://ldpreload.com/p/tuntap-notes.txt
>
> > I'm trying to reduce system call overhead when reading/writing to/from a
> > tun device in userspace. For sockets, one can use sendmmsg()/recvmmsg(),
> > but a tun fd is not a socket fd, so this doesn't work. I see several
> > options to allow userspace to read/write multiple packets with one
> > syscall:
> >
> > - Implement a TX/RX ring buffer that is mmap()ed, like with AF_PACKET
> > sockets.
> >
> > - Implement an ioctl() to emulate sendmmsg()/recvmmsg().
> >
> > - Add a flag that can be set using TUNSETIFF that makes regular
> > read()/write() calls handle multiple packets in one go.
> >
> > - Expose a socket fd to userspace, so regular sendmmsg()/recvmmsg() can
> > be used. There is tun_get_socket() which is used internally in the
> > kernel, but this is not exposed to userspace, and doesn't look trivial
> > to do either.
> >
> > What would be the right way to do this?
> >
> > --
> > Met vriendelijke groet / with kind regards,
> > Guus Sliepen <guus@tinc-vpn.org>
The first step to getting better performance through GRO would be modifying
the TUN device to use NAPI when receiving. I tried this once, and it got more complex
than I had patience for, because a TUN device write obviously happens in userspace context.
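On the gso_type observation quoted above: when a TUN device is opened with the vnet header enabled, each frame read from the fd is preceded by a struct virtio_net_hdr, and its gso_type field says whether the frame is an aggregated one. The following is a minimal sketch of that check using the definitions from <linux/virtio_net.h>; the helper names are hypothetical, and the sketch does not show opening or configuring the device.

```c
#include <stdbool.h>
#include <linux/virtio_net.h>	/* struct virtio_net_hdr, VIRTIO_NET_HDR_GSO_* */

/* True if the header describes a frame that still needs segmentation. */
static bool vnet_hdr_is_gso(const struct virtio_net_hdr *hdr)
{
	/* The ECN bit is a modifier, not a GSO type of its own. */
	return (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) !=
	       VIRTIO_NET_HDR_GSO_NONE;
}

/* Human-readable name for the GSO type, for logging. */
static const char *vnet_hdr_gso_name(const struct virtio_net_hdr *hdr)
{
	switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
	case VIRTIO_NET_HDR_GSO_NONE:
		return "none";
	case VIRTIO_NET_HDR_GSO_TCPV4:
		return "tcpv4";
	case VIRTIO_NET_HDR_GSO_UDP:
		return "udp";
	case VIRTIO_NET_HDR_GSO_TCPV6:
		return "tcpv6";
	default:
		return "unknown";
	}
}
```

A reader loop would apply vnet_hdr_is_gso() to the first sizeof(struct virtio_net_hdr) bytes of every read() before treating the rest as packet data. Seeing only "none" would match the behaviour reported in the quoted message.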
* [Patch net] net_sched: fix a memory leak in tc action
From: Cong Wang @ 2016-04-04 17:32 UTC
To: netdev; +Cc: dvyukov, Cong Wang, Jamal Hadi Salim
Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
include/net/act_api.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index 2a19fe1..03e322b 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
static inline void tc_action_net_exit(struct tc_action_net *tn)
{
tcf_hashinfo_destroy(tn->ops, tn->hinfo);
+ kfree(tn->hinfo);
}
int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
--
2.1.0
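The shape of the leak fixed above can be shown with a userspace model. This is a sketch built on an assumption: that the per-namespace hinfo object is allocated separately and that the destroy helper releases only what hinfo points to, not hinfo itself. All ex_ names are hypothetical.

```c
#include <stdlib.h>

struct ex_hashinfo { int *table; };
struct ex_action_net { struct ex_hashinfo *hinfo; };

static int ex_live_allocs;	/* crude counter of outstanding allocations */

static void *ex_alloc(size_t n)
{
	void *p = calloc(1, n);

	if (p)
		ex_live_allocs++;
	return p;
}

static void ex_free(void *p)
{
	if (p) {
		ex_live_allocs--;
		free(p);
	}
}

/* Allocates the container and its table; returns 0 on success. */
static int ex_net_init(struct ex_action_net *tn)
{
	tn->hinfo = ex_alloc(sizeof(*tn->hinfo));
	if (!tn->hinfo)
		return -1;

	tn->hinfo->table = ex_alloc(16 * sizeof(int));
	if (!tn->hinfo->table) {
		ex_free(tn->hinfo);
		tn->hinfo = NULL;
		return -1;
	}
	return 0;
}

/* Stand-in for the destroy helper: frees the table, not the container. */
static void ex_hashinfo_destroy(struct ex_hashinfo *h)
{
	ex_free(h->table);
	h->table = NULL;
}

static void ex_net_exit(struct ex_action_net *tn)
{
	ex_hashinfo_destroy(tn->hinfo);
	ex_free(tn->hinfo);	/* the step corresponding to the added kfree() */
	tn->hinfo = NULL;
}
```

Without the second free in ex_net_exit(), the counter would stay at 1 after every init/exit cycle, which is the per-namespace leak the patch title refers to.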
* Re: System hangs (unable to handle kernel paging request)
From: Bastien Philbert @ 2016-04-04 17:50 UTC
To: Oleksii Berezhniak, netdev
In-Reply-To: <CAJHPw-O+rjqahFb1nS=S+efwgEqOaCH7LeEXF38nqSJFYDSWSQ@mail.gmail.com>
On 2016-04-04 11:01 AM, Oleksii Berezhniak wrote:
> Can you please point me to a more detailed description of the similar issues
> that you mentioned?
>
Mostly it's in reworks of the Intel drivers aimed at improving performance, in order
to avoid excessive CPU usage leading to a soft lockup during kernel polling
at high loads, with millions of packets being sent per second. In addition, these changes
are spread across various parts of these drivers, so it's hard to point to one exact commit.
However, I thought this commit might help you based on the release history of the
longterm kernel you're using, as that commit was released well after your kernel
was. You may want to check whether the commit with the id I sent you has
been backported to your kernel; if so, and this is *still* being triggered, then this
is probably a bug somewhere else.
Cheers,
Bastien
> I can only find this:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=32b3e08fff60494cd1d281a39b51583edfd2b18f
>
> But it doesn't mention any hangs. Only performance issues.
>
> BTW, GRO (Generic Receive Offloading) is disabled on our network adapter.
>
> 2016-04-04 17:30 GMT+03:00 Bastien Philbert <bastienphilbert@gmail.com>:
>>
>>
>> On 2016-04-04 03:59 AM, Oleksii Berezhniak wrote:
>>> Good day.
>>>
>>> We have PPPoE server with CentOS 7 (kernel 3.10.0-327.10.1.el7.dsip.x86_64)
>>>
>>> We applied some PPPoE related patches to this kernel:
>>>
>>> ppp: don't override sk->sk_state in pppoe_flush_dev()
>>> ppp: fix pppoe_dev deletion condition in pppoe_release()
>>> pppoe: fix memory corruption in padt work structure
>>> pppoe: fix reference counting in PPPoE proxy
>>>
>>> Also we built latest version of ixgbe driver from Intel.
>>>
>>> Now we have crashes after approx. one week of uptime:
>>>
>>> [545444.673270] BUG: unable to handle kernel paging request at ffff88a005040200
>>> [545444.673306] IP: [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
>>> [545444.673335] PGD 0
>>> [545444.673348] Oops: 0000 [#1] SMP
>>> [545444.673367] Modules linked in: arc4 ppp_mppe act_police cls_u32 sch_ingress sch_tbf pptp gre pppoe pppox ppp_generic slhc 8021q garp stp mrp llc iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_filter xt_TCPMSS iptable_mangle xt_CT nf_conntrack iptable_raw w83793 hwmon_vid snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec coretemp snd_hda_core iTCO_wdt kvm iTCO_vendor_support snd_hwdep snd_seq snd_seq_device ipmi_ssif ppdev lpc_ich snd_pcm pcspkr mfd_core sg ipmi_si snd_timer snd i2c_i801 ipmi_msghandler ioatdma parport_pc parport shpchp soundcore i7core_edac tpm_infineon edac_core ip_tables ext4 mbcache jbd2 sd_mod crct10dif_generic crc_t10dif crct10dif_common syscopyarea sysfillrect firewire_ohci sysimgblt i2c_algo_bit drm_kms_helper ata_generic pata_acpi
>>> [545444.674383] ttm firewire_core crc_itu_t serio_raw drm ata_piix libata crc32c_intel i2c_core ixgbe(OE) vxlan e1000e ip6_udp_tunnel udp_tunnel aacraid dca ptp pps_core
>>> [545444.674783] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G OE
>>> ------------ 3.10.0-327.10.1.el7.dsip.x86_64 #1
>>> [545444.675032] Hardware name: empty empty/S7010, BIOS 'V2.06 ' 03/31/2010
>>> [545444.675162] task: ffff880139c55c00 ti: ffff880139c84000 task.ti:
>>> ffff880139c84000
>>> [545444.675400] RIP: 0010:[<ffffffff811c0e95>] [<ffffffff811c0e95>]
>>> kmem_cache_alloc+0x75/0x1d0
>>> [545444.675641] RSP: 0018:ffff88023fc23ce8 EFLAGS: 00010286
>>> [545444.675766] RAX: 0000000000000000 RBX: ffff8802302eab00 RCX:
>>> 000000010eb8edbe
>>> [545444.676002] RDX: 000000010eb8edbd RSI: 0000000000000020 RDI:
>>> ffff88013b803700
>>> [545444.676237] RBP: ffff88023fc23d18 R08: 00000000000175a0 R09:
>>> ffffffff81517e70
>>> [545444.676472] R10: 000000000000006b R11: 0000000000000000 R12:
>>> ffff88a005040200
>>> [545444.676706] R13: 0000000000000020 R14: ffff88013b803700 R15:
>>> ffff88013b803700
>>> [545444.676942] FS: 0000000000000000(0000) GS:ffff88023fc20000(0000)
>>> knlGS:0000000000000000
>>> [545444.677180] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [545444.677307] CR2: ffff88a005040200 CR3: 0000000237e63000 CR4:
>>> 00000000000007e0
>>> [545444.677543] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [545444.677779] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>> 0000000000000400
>>> [545444.678014] Stack:
>>> [545444.678127] ffff880237ea2040 ffff8802302eab00 0000000000000280
>>> 0000000000000280
>>> [545444.678370] 0000000000000006 ffff880236bb1b60 ffff88023fc23d40
>>> ffffffff81517e70
>>> [545444.678614] 0000000000000280 ffff8802302eab00 0000000000000000
>>> ffff88023fc23d60
>>> [545444.678857] Call Trace:
>>> [545444.678973] <IRQ>
>>>
>>> [545444.678982]
>>> [545444.679100] [<ffffffff81517e70>] build_skb+0x30/0x1d0
>>> [545444.679222] [<ffffffff8151a973>] __alloc_rx_skb+0x63/0xb0
>>> [545444.679349] [<ffffffff8151a9db>] __netdev_alloc_skb+0x1b/0x40
>>> [545444.679492] [<ffffffffa0104d8e>] ixgbe_clean_rx_irq+0xee/0xa50 [ixgbe]
>>> [545444.679624] [<ffffffff8152862f>] ? __napi_complete+0x1f/0x30
>>> [545444.679756] [<ffffffffa0106738>] ixgbe_poll+0x2d8/0x6d0 [ixgbe]
>>> [545444.679886] [<ffffffff8152b092>] net_rx_action+0x152/0x240
>>> [545444.680015] [<ffffffff81084aef>] __do_softirq+0xef/0x280
>>> [545444.680144] [<ffffffff8164735c>] call_softirq+0x1c/0x30
>>> [545444.680277] [<ffffffff81016fc5>] do_softirq+0x65/0xa0
>>> [545444.680402] [<ffffffff81084e85>] irq_exit+0x115/0x120
>>> [545444.680529] [<ffffffff81647ef8>] do_IRQ+0x58/0xf0
>>> [545444.680660] [<ffffffff8163d1ad>] common_interrupt+0x6d/0x6d
>>> [545444.680786] <EOI>
>>> [545444.680794]
>>> [545444.680914] [<ffffffff81058e96>] ? native_safe_halt+0x6/0x10
>>> [545444.681041] [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
>>> [545444.681168] [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
>>> [545444.681297] [<ffffffff810d62c5>] cpu_startup_entry+0x245/0x290
>>> [545444.681427] [<ffffffff810475fa>] start_secondary+0x1ba/0x230
>>> [545444.681554] Code: ce 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85
>>> e4 0f 84 1f 01 00 00 48 85 c0 0f 84 16 01 00 00 49 63 46 20 48 8d 4a
>>> 01 4d 8b 06 <49> 8b 1c 04 4c
>>> 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
>>> [545444.682056] RIP [<ffffffff811c0e95>] kmem_cache_alloc+0x75/0x1d0
>>> [545444.682186] RSP <ffff88023fc23ce8>
>>> [545444.682305] CR2: ffff88a005040200
>>>
>>>
>>> The description and call stack are the same every time.
>>>
>>> What could be the cause of these crashes?
>>>
>>> Thanks.
>>>
>> I am wondering if your kernel has commit 32b3e08fff60494cd1d281a39b51583edfd2b18f,
>> as it seems to have been added to fix issues that look very similar to the trace you are seeing.
>> Nick
>
>
>
^ permalink raw reply
* Re: [PATCH v3 00/16] add Intel X722 iWARP driver
From: Faisal Latif @ 2016-04-04 17:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
e1000-rdma-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <20160404073929.GA23218-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
On Mon, Apr 04, 2016 at 12:39:29AM -0700, Christoph Hellwig wrote:
> On Wed, Jan 20, 2016 at 01:40:00PM -0600, Faisal Latif wrote:
> > This driver provides iWARP RDMA functionality for the Intel(R) X722 Ethernet
> > controller for PCI Physical Functions. It is in early product cycle
> > and having the driver in the kernel will allow users to have hardware support
> > when available for purchase.
>
> Just curious: how is this driver supposed to work? It doesn't seem to
> support FRWRs despite the iWarp spec requiring support for it. It also
> sets IB_DEVICE_MEM_MGT_EXTENSIONS despite the lack of these methods,
> which will lead to instant crashes when using any of the usual drivers.
>
Thanks Christoph,
We are addressing it in the next patch series.
Faisal
^ permalink raw reply
* Re: [Patch net] net_sched: fix a memory leak in tc action
From: Eric Dumazet @ 2016-04-04 18:07 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev, dvyukov, Jamal Hadi Salim
In-Reply-To: <1459791168-16675-1-git-send-email-xiyou.wangcong@gmail.com>
On Mon, 2016-04-04 at 10:32 -0700, Cong Wang wrote:
> Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Tested-by: Dmitry Vyukov <dvyukov@google.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
> include/net/act_api.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index 2a19fe1..03e322b 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
> static inline void tc_action_net_exit(struct tc_action_net *tn)
> {
> tcf_hashinfo_destroy(tn->ops, tn->hinfo);
> + kfree(tn->hinfo);
> }
>
> int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
Looks good to me, although the kfree() might be put in
tcf_hashinfo_destroy() (in one place instead of being inlined at all call
sites)
^ permalink raw reply
* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
From: Alexei Starovoitov @ 2016-04-04 18:10 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Brenden Blanco, Tom Herbert, David S. Miller,
Linux Kernel Network Developers, ogerlitz, Daniel Borkmann,
john fastabend, Alexander Duyck
In-Reply-To: <20160404094846.4df8defc@redhat.com>
On Mon, Apr 04, 2016 at 09:48:46AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 2 Apr 2016 22:41:04 -0700
> Brenden Blanco <bblanco@plumgrid.com> wrote:
>
> > On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> >
> > > Very nice! Do you think this hook will be sufficient to implement a
> > > fast forward patch also?
>
> (DMA experts please verify and correct me!)
>
> One of the gotchas is how DMA sync/unmap works. For forwarding you
> need to modify the headers. The DMA sync API (DMA_FROM_DEVICE) specifies
> that the data is to be _considered_ read-only. AFAIK you can write into
> the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite
> data... note on most archs the DMA_unmap does not overwrite.
>
> This DMA issue should not block the work on a hook for early packet drop.
> Maybe we should add a flag option that can tell the hook whether the
> packet is read-only? (e.g. if the driver uses page-fragments and DMA_sync)
>
>
> We should have another track/thread on how to solve the DMA issue:
> I see two solutions.
>
> Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
> This results in a slowdown on archs with expensive DMA-map/unmap. And
> we stress the page allocator more (can be solved with a page-pool-cache).
> Eric will not like this due to memory usage, but we can just add a
> "copy-break" step for normal stack hand-off.
>
> Solution 2: (Due credit to Alex Duyck, this idea came up while
> discussing issue with him). Remember DMA_sync'ed data is only
> considered read-only, because the DMA_unmap can be destructive. In many
> cases DMA_unmap is not. Thus, we could take advantage of this, and
> allow modifying DMA sync'ed data on those DMA setups.
I bet on those devices dma_sync is a noop as well.
In ndo_bpf_set we can check
if (sync_single_for_cpu != swiotlb_sync_single_for_cpu)
return -ENOTSUPP;
to avoid all these problems altogether. We're doing this to have
as high as possible performance, so we have to sacrifice generality.
This BPF_PROG_TYPE_PHYS_DEV program type is only applicable to physical
ethernet networking device and the name clearly indicates that.
Devices like taps or veth will not have such ndo.
These are early architectural decisions that we have to make to
actually hit our performance targets.
This is not 'yet another hook in the stack'. We already have tc+cls_bpf
that is pretty fast, but it's generic and works with veth, taps, phys dev
and by design operates on skb.
The BPF_PROG_TYPE_PHYS_DEV is operating on dma buffer. Virtual devices
don't have dma buffers, so no ndo.
Probably the confusion is due to 'pseudo skb' name in the patches.
I guess we have to pick some other name.
^ permalink raw reply
* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Alexei Starovoitov @ 2016-04-04 18:27 UTC (permalink / raw)
To: Brenden Blanco
Cc: Jesper Dangaard Brouer, davem, netdev, tom, ogerlitz, daniel,
john.fastabend
In-Reply-To: <20160403061151.GC21980@gmail.com>
On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> >
> > I think you need to DMA sync RX-page before you can safely access
> > packet data in page (on all arch's).
> >
> Thanks, I will give that a try in the next spin.
> > > + ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > + frags[0].page_offset);
> > > + if (mlx4_call_bpf(prog, ethh, length)) {
> >
> > AFAIK length here covers all the frags[n].page, thus potentially
> > causing the BPF program to access memory out of bound (crash).
> >
> > Having several page fragments is AFAIK an optimization for jumbo-frames
> > on PowerPC (which is a bit annoying for you use-case ;-)).
> >
> Yeah, this needs some more work. I can think of some options:
> 1. limit pseudo skb.len to first frag's length only, and signal to
> program that the packet is incomplete
> 2. for nfrags>1 skip bpf processing, but this could be functionally
> incorrect for some use cases
> 3. run the program for each frag
> 4. reject ndo_bpf_set when frags are possible (large mtu?)
>
> My preference is to go with 1, thoughts?
Hmm, and what will the program do with an 'incomplete' packet?
imo option 4 is the only way here. If a phys_dev bpf program is already
attached to the netdev, then mlx4_en_change_mtu() can reject jumbo mtus.
My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
will have num_frags==1. That's the common case and the one we
want to optimize for.
If later we can find a way to change mlx4 driver to support
phys_dev bpf programs with jumbo mtus, great.
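The "option 4" behaviour described above can be sketched in user space as follows. This is only a model under stated assumptions: struct netdev_model, model_change_mtu(), model_bpf_set() and the SINGLE_FRAG_MTU_LIMIT constant are hypothetical names, the 1514 threshold is taken from the mail rather than from the mlx4 source, and -EINVAL is an arbitrary choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption from the mail: an mtu below this value needs only one rx frag. */
#define SINGLE_FRAG_MTU_LIMIT 1514

/* Hypothetical stand-in for the per-device state a driver would keep. */
struct netdev_model {
	int mtu;
	bool prog_attached;
};

/* Refuse an mtu that would need multiple frags while a program is attached. */
static int model_change_mtu(struct netdev_model *dev, int new_mtu)
{
	if (dev->prog_attached && new_mtu >= SINGLE_FRAG_MTU_LIMIT)
		return -EINVAL;
	dev->mtu = new_mtu;
	return 0;
}

/* Refuse to attach a program when the current mtu already implies frags. */
static int model_bpf_set(struct netdev_model *dev, bool attach)
{
	if (attach && dev->mtu >= SINGLE_FRAG_MTU_LIMIT)
		return -EINVAL;
	dev->prog_attached = attach;
	return 0;
}
```

Both directions are guarded, so the "program attached" and "multiple frags possible" states can never hold at the same time, whichever is requested first.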
^ permalink raw reply
* Re: [Patch net] net_sched: fix a memory leak in tc action
From: Cong Wang @ 2016-04-04 18:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: Linux Kernel Network Developers, Dmitry Vyukov, Jamal Hadi Salim
In-Reply-To: <1459793263.6473.343.camel@edumazet-glaptop3.roam.corp.google.com>
On Mon, Apr 4, 2016 at 11:07 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-04-04 at 10:32 -0700, Cong Wang wrote:
>> Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions")
>> Reported-by: Dmitry Vyukov <dvyukov@google.com>
>> Tested-by: Dmitry Vyukov <dvyukov@google.com>
>> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>> include/net/act_api.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/net/act_api.h b/include/net/act_api.h
>> index 2a19fe1..03e322b 100644
>> --- a/include/net/act_api.h
>> +++ b/include/net/act_api.h
>> @@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
>> static inline void tc_action_net_exit(struct tc_action_net *tn)
>> {
>> tcf_hashinfo_destroy(tn->ops, tn->hinfo);
>> + kfree(tn->hinfo);
>> }
>>
>> int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
>
> Looks good to me, although the kfree() might be put in
> tcf_hashinfo_destroy() (in one place instead of being inlined at all call
> sites)
Putting it in tc_action_net_exit() makes it symmetric with tc_action_net_init().
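The symmetry argument can be illustrated with a small user-space model. It assumes, as the reply implies, that the init helper is the one that allocates hinfo; every name below (hashinfo_model, action_net_model and the four functions) is a hypothetical stand-in, with malloc()/free() in place of the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct tcf_hashinfo and struct tc_action_net. */
struct hashinfo_model {
	unsigned int mask;
	void **buckets;
};

struct action_net_model {
	struct hashinfo_model *hinfo;
};

/* Allocates only the table contents, not the hashinfo object itself. */
static int hashinfo_init(struct hashinfo_model *h, unsigned int mask)
{
	h->buckets = calloc((size_t)mask + 1, sizeof(*h->buckets));
	if (!h->buckets)
		return -ENOMEM;
	h->mask = mask;
	return 0;
}

/* Frees only what hashinfo_init() allocated. */
static void hashinfo_destroy(struct hashinfo_model *h)
{
	free(h->buckets);
	h->buckets = NULL;
}

/* The net-level init owns the hashinfo allocation ... */
static int action_net_init(struct action_net_model *tn, unsigned int mask)
{
	int err;

	tn->hinfo = malloc(sizeof(*tn->hinfo));
	if (!tn->hinfo)
		return -ENOMEM;
	err = hashinfo_init(tn->hinfo, mask);
	if (err) {
		free(tn->hinfo);
		tn->hinfo = NULL;
	}
	return err;
}

/* ... so the net-level exit is the matching place to release it. */
static void action_net_exit(struct action_net_model *tn)
{
	hashinfo_destroy(tn->hinfo);
	free(tn->hinfo);
	tn->hinfo = NULL;
}
```

Each level frees exactly what it allocated, which is the layout the patch ends up with; moving the free() into the destroy helper would also plug the leak but would make destroy release memory that init did not allocate.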
^ permalink raw reply
* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Alexei Starovoitov @ 2016-04-04 18:46 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Johannes Berg, Brenden Blanco, davem, netdev, tom, ogerlitz,
john.fastabend, brouer
In-Reply-To: <57023AA0.5080600@iogearbox.net>
On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>
> >>Having a common check makes sense. The tricky thing is that the type can
> >>only be checked after taking the reference, and I wanted to keep the
> >>scope of the prog brief in the case of errors. I would have to move the
> >>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>ndo instead. Would that API look fine to you?
> >
> >I can't really comment, I wasn't planning on using the API right now :)
> >
> >However, what else is there that the driver could possibly do with the
> >FD, other than getting the bpf_prog?
> >
> >>A possible extension of this is just to keep the bpf_prog * in the
> >>netdev itself and expose a feature flag from the driver rather than
> >>an ndo. But that would mean another 8 bytes in the netdev.
> >
> >That also misses the signal to the driver when the program is
> >set/removed, so I don't think that works. I'd argue it's not really
> >desirable anyway though since I wouldn't expect a majority of drivers
> >to start supporting this.
>
> I think ndo is probably fine for this purpose, see also my other mail. I
> think currently, the only really driver specific code would be to store
> the prog pointer somewhere and to pass needed meta data to populate the
> fake skb.
yes. I think ndo is better, and having bpf_prog in the driver priv
part is likely better as well, since a driver may decide to put it into
its ring struct for faster fetch, or lay out the prog pointer next to other
priv fields for better cache locality.
Having prog in 'struct net_device' may look very sensible right now,
since there is not much code around it, but later it may cause
some performance headaches. I think it's better to have complete
freedom in the drivers and later move code to the generic part.
The same applies to your other comment about moving mlx4_bpf_set() and
mlx4_call_bpf() into generic code. It's better for them to be driver
specific at the moment. Right now we have only mlx4 anyway.
> Maybe mid-term drivers might want to reuse this hook/signal for offloading
> as well, not yet sure ... how would that relate to offloading of cls_bpf?
> Should these be considered two different things (although from an offloading
> perspective they are not really). _Conceptually_, XDP could also be seen
> as a software offload for the facilities we support with cls_bpf et al.
agree. 'conceptually' phys_dev bpf program is similar to sched_cls bpf program.
If we can reuse some of the helpers that would be great...
but only if we can maintain highest performance.
hw offloading may be more convenient from the cls_bpf side for some drivers,
but nothing obviously stops them from hw offloading of bpf_type_phys_dev programs.
^ permalink raw reply
* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
From: Alexei Starovoitov @ 2016-04-04 18:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Brenden Blanco, davem, netdev, tom,
Or Gerlitz, daniel, john.fastabend
In-Reply-To: <1459783323.6473.341.camel@edumazet-glaptop3.roam.corp.google.com>
On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > My guess we're hitting 14.5Mpps limit for empty bpf program
> > > and for program that actually looks into the packet because we're
> > > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > > consists of four 10G phys. There will be the same problem
> > > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > > We need to vary packets somehow. Hopefully Or can explain that
> > > bit of hw design.
> > > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > > when sender blasting the same packet over and over again.
> >
> > That is an interesting observation Alexei, and could explain the pps limit
> > I hit on 40G, with single flow testing. AFAIK 40G is 4x 10G PHYs, and
> > 100G is 4x 25G PHYs.
> >
> > I have a pktgen script that tried to avoid this pitfall. By creating a
> > new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> >
> > [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> >
>
> A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
> a single 10GB trunk used for a given flow.
>
> This 14Mpps thing seems to be a queue limitation on mlx4.
yeah, could be queueing related.
Multiple cpus can send ~30Mpps of the same 64 byte packet,
but mlx4 can only receive 14.5Mpps. Odd.
Or (and other mellanox guys),
what is really going on inside 40G nic ?
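For reference, the theoretical Ethernet packet rate for a given link speed and frame size is easy to compute, which helps when comparing the 14.5Mpps figure against a single 10G lane. A minimal sketch, assuming the "64 byte packet" is a minimum-size frame including FCS and that each frame costs another 20 bytes on the wire (preamble, start-of-frame delimiter, inter-frame gap); max_pps() is a name made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 7B preamble + 1B SFD + 12B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Upper bound on frames per second for a link speed and frame size. */
static uint64_t max_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame =
		(uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;

	return link_bps / bits_per_frame;
}
```

max_pps(10000000000ULL, 64) gives 14880952, so the observed 14.5Mpps is numerically close to one 10G lane at line rate, while max_pps(40000000000ULL, 64) gives 59523809. The arithmetic alone does not tell a per-lane limit apart from a queue limitation.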
^ permalink raw reply
* af_packet: tone down the Tx-ring unsupported spew.
From: Dave Jones @ 2016-04-04 19:11 UTC (permalink / raw)
To: netdev
Trinity and other fuzzers can hit this WARN far too easily,
resulting in a tainted kernel that hinders automated fuzzing.
Replace it with a rate-limited printk.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa710ca98..f12c17f355d9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4151,7 +4151,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
- WARN(1, "Tx-ring is not supported.\n");
+ net_warn_ratelimited("Tx-ring is not supported.\n");
goto out;
}
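To show what "rate-limited" buys here, the sketch below models a window-and-burst limiter in user space. It is not the kernel's implementation: struct ratelimit_model and ratelimit_ok() are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter state: at most `burst` messages per `interval`. */
struct ratelimit_model {
	long interval;	/* window length, in caller-defined time units */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed so far */
};

/* Returns true when the caller may print, false when it should stay quiet. */
static bool ratelimit_ok(struct ratelimit_model *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		/* A new window starts: reset the budget. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}
```

A fuzzer that triggers the condition thousands of times per second therefore produces a bounded number of log lines per window, and a plain printk, unlike WARN(), does not taint the kernel.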
^ permalink raw reply related
* Re: af_packet: tone down the Tx-ring unsupported spew.
From: Daniel Borkmann @ 2016-04-04 19:24 UTC (permalink / raw)
To: Dave Jones; +Cc: netdev
In-Reply-To: <20160404191150.GA7224@codemonkey.org.uk>
On 04/04/2016 09:11 PM, Dave Jones wrote:
> Trinity and other fuzzers can hit this WARN far too easily,
> resulting in a tainted kernel that hinders automated fuzzing.
>
> Replace it with a rate-limited printk.
>
> Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply
* Re: [PATCH v3 net-next 0/8] add TX timestamping via cmsg
From: David Miller @ 2016-04-04 19:52 UTC (permalink / raw)
To: soheil.kdev; +Cc: netdev, willemb, edumazet, ycheng, ncardwell, kafai, soheil
In-Reply-To: <1459652893-14207-1-git-send-email-soheil.kdev@gmail.com>
From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Sat, 2 Apr 2016 23:08:05 -0400
> This patch series aims at enabling TX timestamping via cmsg.
>
> Currently, to occasionally sample TX timestamping on a socket,
> applications need to call setsockopt twice: first for enabling
> timestamps and then for disabling them. This is an unnecessary
> overhead. With cmsg, in contrast, applications can sample TX
> timestamps per sendmsg().
>
> This patch series adds the code for processing SO_TIMESTAMPING
> for cmsg's of the SOL_SOCKET level, and adds the glue code for
> TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
> supports overriding timestamp generation flags (i.e.,
> SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
> Applications must still enable timestamp reporting via
> setsockopt to receive timestamps.
>
> This series does not change existing timestamping behavior for
> applications that are using socket options.
>
> I will follow up with another patch to enable timestamping for
> active TFO (client-side TCP Fast Open) and also setting packet
> mark via cmsgs.
...
> Changes in v2:
> - Replace u32 with __u32 in the documentation.
>
> Changes in v3:
> - Fix the broken build for L2TP (due to changes
> in IPv6).
Series applied, thanks.
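From the application side, the per-sendmsg() request described in the cover letter amounts to one control message at the SOL_SOCKET level with type SO_TIMESTAMPING, carrying a __u32 of SOF_TIMESTAMPING_TX_* flags. The helper below is a sketch for Linux only: attach_tx_tstamp_cmsg() is a made-up name, and it only builds the control message without sending anything.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>
#include <linux/types.h>

/*
 * Attach one SOL_SOCKET/SO_TIMESTAMPING control message to `msg`.
 * `control` must be suitably aligned for struct cmsghdr and at least
 * CMSG_SPACE(sizeof(__u32)) bytes long. Returns 0 on success, -1 if
 * the buffer is too small.
 */
static int attach_tx_tstamp_cmsg(struct msghdr *msg, void *control,
				 size_t control_len, __u32 tx_flags)
{
	struct cmsghdr *cmsg;

	if (control_len < CMSG_SPACE(sizeof(tx_flags)))
		return -1;

	memset(control, 0, control_len);
	msg->msg_control = control;
	msg->msg_controllen = CMSG_SPACE(sizeof(tx_flags));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SO_TIMESTAMPING;
	cmsg->cmsg_len = CMSG_LEN(sizeof(tx_flags));
	/* Only generation flags go here, e.g. SOF_TIMESTAMPING_TX_SOFTWARE. */
	memcpy(CMSG_DATA(cmsg), &tx_flags, sizeof(tx_flags));
	return 0;
}
```

The resulting msghdr would then be passed to sendmsg(). As the cover letter notes, reporting flags are not overridden this way, so the application still has to enable timestamp reporting once via setsockopt() to receive the timestamps.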
^ permalink raw reply
* Re: [PATCH v3 1/2] RDS: fix "Kernel unaligned access" on sparc64
From: David Miller @ 2016-04-04 19:57 UTC (permalink / raw)
To: shamir.rabinovitch; +Cc: rds-devel, netdev
In-Reply-To: <1459686244-14939-1-git-send-email-shamir.rabinovitch@oracle.com>
From: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Date: Sun, 3 Apr 2016 08:24:03 -0400
> @@ -135,8 +135,9 @@ int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes,
> if (rem->r_offset != 0)
> rds_stats_inc(s_page_remainder_hit);
>
> - rem->r_offset += bytes;
> - if (rem->r_offset == PAGE_SIZE) {
> + /* fix 'Kernel unaligned access' on sparc64 */
> + rem->r_offset += ALIGN(bytes, 8);
> + if (rem->r_offset >= PAGE_SIZE) {
It is inappropriate to mark things with a comment like this in code
that has nothing at all to do with a specific architecture.
64-bit alignment, and this requirement, is also not sparc64 specific.
Other architectures have the same issue.
Next, comments should aid in the understanding of what the code is
trying to accomplish, when necessary. So, something more appropriate
would be:
/* Objects in this memory can contain 64-bit integers, align
* in order to accommodate that.
*/
But it's very close to obvious here what the code is doing, and why
it might be doing so.
So I'd say no comment at all works best here.
I'm sorry, but it's a real pet peeve of mine when people mention
totally irrelevant crap in code comments. What the heck does sparc64
have to do with aligning memory properly for the data types you will
be storing in that memory?!?!
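What the quoted hunk does to the running offset can be modelled in a few lines. This is a user-space sketch under assumptions: ALIGN_UP() mirrors the usual round-up-to-power-of-two idiom, MODEL_PAGE_SIZE is fixed at 4096, and struct remainder_model and remainder_take() are hypothetical names rather than RDS code.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for the per-cpu page remainder. */
struct remainder_model {
	size_t offset;	/* next free byte within the current page */
};

/*
 * Hand out `bytes` from the current page. Returns the offset at which the
 * caller's data starts and sets *page_done once the page is used up.
 * Advancing by the aligned size keeps every returned offset 8-byte aligned.
 */
static size_t remainder_take(struct remainder_model *rem, size_t bytes,
			     int *page_done)
{
	size_t start = rem->offset;

	rem->offset += ALIGN_UP(bytes, 8);
	*page_done = rem->offset >= MODEL_PAGE_SIZE;
	if (*page_done)
		rem->offset = 0;
	return start;
}
```

For example, taking 13 bytes from an empty page advances the offset to 16 rather than 13, so a 64-bit integer at the start of the next object lands on an 8-byte boundary on any architecture.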
^ permalink raw reply
* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Alexei Starovoitov @ 2016-04-04 20:00 UTC (permalink / raw)
To: Brenden Blanco
Cc: John Fastabend, Jesper Dangaard Brouer, Tom Herbert,
Daniel Borkmann, David S. Miller, Linux Kernel Network Developers,
ogerlitz
In-Reply-To: <20160404161720.GB495@gmail.com>
On Mon, Apr 04, 2016 at 09:17:22AM -0700, Brenden Blanco wrote:
> > >>
> > >> As Tom also points out, making the BPF interface independent of the SKB
> > >> meta-data structure, would also make the eBPF program more generally
> > >> applicable.
> > > The initial approach that I tried went down this path. Alexei advised
> > > that I use the pseudo skb, and in the future the API between drivers and
> > > bpf can change to adopt non-skb context. The only user facing ABIs in
> > > this patchset are the IFLA, the xdp_metadata struct, and the name of the
> > > new enum.
Exactly. That's the most important part of this rfc.
Right now redirect to a different queue, batching, prefetch and tons of
other code are missing. We have to plan the whole project, so we can
incrementally add features without breaking abi.
So new IFLA, xdp_metadata struct and enum for bpf return codes are
the main things to agree on.
> > > The reason to use a pseudo skb for now is that there will be a fair
> > > amount of churn to get bpf jit and interpreter to understand non-skb
> > > context in the bpf_load_pointer() code. I don't see the need for
> > > requiring that for this patchset, as it will be internal-only change
> > > if/when we use something else.
> >
> > Another option would be to have per driver JIT code to patch up the
> > skb read/loads with descriptor reads and metadata. From a strictly
> > performance stand point it should be better than pseudo skbs.
Per-driver pre/post JIT-like phase is actually on the table. It may
still be needed. If we can avoid it while keeping high performance, great.
> I considered (and implemented) this as well, but there my problem was
> that I needed to inform the bpf() syscall at BPF_PROG_LOAD time which
> ifindex to look at for fixups, so I had to add a new ifindex field to
> bpf_attr. Then during verification I had to use a new ndo to get the
> driver-specific offsets for its particular descriptor format. It seemed
> kludgy.
Another reason for going with the 'pseudo skb' structure was to reuse
load_byte/half/word instructions in the bpf interpreter as-is.
Right now these instructions have to see the in-kernel
'struct sk_buff' as context (note we have the mirror __sk_buff
for user space), so to use load_byte for bpf_prog_type_phys_dev
we have to give a real 'struct sk_buff' to the interpreter with
data, head, len, data_len fields initialized, so that the
interpreter 'just works'.
The potential fix would be to add phys_dev specific load_byte/word
instructions. Then we can drop all the legacy negative offset
stuff that <1% uses, but that slows down everyone.
We can also drop the byteswap that load_word does (which turned out
to be confusing, and programs often do a 2nd byteswap to go
back to cpu endianness).
And if we do it smart, we can drop the length check as well;
then new_load_byte will actually be a single load-byte cpu instruction.
We can drop the length check when the offset is a constant in the verifier
and that constant is less than 64, since all packets are larger.
As seen in 'perf report' from patch 5:
3.32% ksoftirqd/1 [kernel.vmlinux] [k] sk_load_byte_positive_offset
this is 14Mpps and 4 assembler instructions in the above function
are consuming 3% of the cpu. Making new_load_byte to be single
x86 insn would be really cool.
Of course, there are other pieces to accelerate:
12.71% ksoftirqd/1 [mlx4_en] [k] mlx4_en_alloc_frags
6.87% ksoftirqd/1 [mlx4_en] [k] mlx4_en_free_frag
4.20% ksoftirqd/1 [kernel.vmlinux] [k] get_page_from_freelist
4.09% swapper [mlx4_en] [k] mlx4_en_process_rx_cq
and I think Jesper's work on batch allocation is going to help that a lot.
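The difference between today's checked loads and the proposed unchecked one can be sketched in plain C. None of these functions exist in the kernel under these names: load_byte_checked(), load_byte_const(), load_word_be() and MIN_FRAME_LEN are hypothetical, and the assert() merely stands in for a proof the verifier would do once at program load time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption from the mail: every received packet is at least this long. */
#define MIN_FRAME_LEN 64

/* Checked load: the bounds test runs for every packet. Returns -1 if out of range. */
static int load_byte_checked(const uint8_t *data, size_t len, size_t off)
{
	if (off >= len)
		return -1;
	return data[off];
}

/*
 * Unchecked load for a constant offset below MIN_FRAME_LEN. With the
 * offset validated ahead of time (modelled by the assert, which compiles
 * out under NDEBUG), the body is a single byte load.
 */
static inline uint8_t load_byte_const(const uint8_t *data, size_t off)
{
	assert(off < MIN_FRAME_LEN);
	return data[off];
}

/*
 * Checked word load that converts from network to cpu byte order, which
 * is the implicit byteswap the mail suggests dropping.
 */
static int load_word_be(const uint8_t *data, size_t len, size_t off,
			uint32_t *out)
{
	if (len < 4 || off > len - 4)
		return -1;
	*out = ((uint32_t)data[off] << 24) | ((uint32_t)data[off + 1] << 16) |
	       ((uint32_t)data[off + 2] << 8) | (uint32_t)data[off + 3];
	return 0;
}
```

Whether the bounds check can really be skipped depends on the minimum-length guarantee holding for every buffer the driver hands to the program, which is an assumption here rather than something the thread establishes.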
^ permalink raw reply
* Re: [PATCH 1/2] ipv4: l2tp: fix a potential issue in l2tp_ip_recv
From: David Miller @ 2016-04-04 20:01 UTC (permalink / raw)
To: yanhaishuang; +Cc: netdev, linux-kernel
In-Reply-To: <1459692564-3998-1-git-send-email-yanhaishuang@cmss.chinamobile.com>
From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Sun, 3 Apr 2016 22:09:23 +0800
> pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
> right place.
>
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Applied and queued up for -stable.
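The hazard named in the commit message is the usual "cached pointer into a buffer that may move" problem. A user-space analogy, with hypothetical names throughout (buf_model, model_may_pull(), read_byte_at()): the pull helper deliberately always reallocates so that a stale pointer would be visibly wrong.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the linear part of an skb. */
struct buf_model {
	unsigned char *data;
	size_t len;
};

/*
 * Make sure at least `need` bytes are available at b->data. Like
 * pskb_may_pull(), this may move the data, so any pointer derived from
 * b->data before the call must be considered stale afterwards.
 * Returns 1 on success, 0 on allocation failure.
 */
static int model_may_pull(struct buf_model *b, size_t need)
{
	unsigned char *n;

	if (need <= b->len)
		return 1;
	n = calloc(need, 1);	/* always a fresh block in this model */
	if (!n)
		return 0;
	memcpy(n, b->data, b->len);
	free(b->data);
	b->data = n;
	b->len = need;
	return 1;
}

/* Correct ordering: pull first, then derive the pointer from b->data. */
static int read_byte_at(struct buf_model *b, size_t off)
{
	unsigned char *ptr;

	if (!model_may_pull(b, off + 1))
		return -1;
	ptr = b->data;	/* loaded only after the pull */
	return ptr[off];
}
```

Taking ptr before the call to model_may_pull() would leave it pointing at freed memory whenever the buffer had to grow, which is the class of bug the two l2tp patches address by loading ptr/optr at the right place.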
^ permalink raw reply
* Re: [PATCH 2/2] ipv6: l2tp: fix a potential issue in l2tp_ip6_recv
From: David Miller @ 2016-04-04 20:01 UTC (permalink / raw)
To: yanhaishuang; +Cc: netdev, linux-kernel
In-Reply-To: <1459692564-3998-2-git-send-email-yanhaishuang@cmss.chinamobile.com>
From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Sun, 3 Apr 2016 22:09:24 +0800
> pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
> right place.
>
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Applied and queued up for -stable.
^ permalink raw reply