* [PATCH v2 net-next 1/2] bpf: make programs see skb->data == L2 for ingress and egress
From: Alexei Starovoitov @ 2015-06-04 17:11 UTC
To: David S. Miller; +Cc: Eric Dumazet, Jamal Hadi Salim, Daniel Borkmann, netdev
eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data:
for ingress the L2 header is already pulled, whereas for egress it is still
present. This is known to program writers, who are currently forced to use the
BPF_LL_OFF workaround.
Since programs don't change the skb's internal pointers, it is safe to do the
push/pull right around the invocation of the program; earlier taps and later
pt->func() will not be affected.
Multiple taps via packet_rcv() and tpacket_rcv() already do the same trick
around run_filter()/BPF_PROG_RUN, even when the skb is shared.

This fix finally allows programs to use the optimized LD_ABS/IND instructions
without BPF_LL_OFF for higher performance.
tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o:

          w/o JIT   w/ JIT
  before  20.5      23.6 Mpps
  after   21.8      26.6 Mpps
Old programs with BPF_LL_OFF will still work as-is.
We can now undo most of the earlier workaround commit:
a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
New in v1->v2: fixed u32 -> bool and added a check for CONFIG_NET_CLS_ACT.
This patch is on top of 'fix build due to tc_verd':
http://patchwork.ozlabs.org/patch/480783/
Earlier versions tried to do too much to make ingress and egress qdiscs
consistent for all classifiers and actions, or had too large a push/pull scope:
v1: http://thread.gmane.org/gmane.linux.network/358168/focus=358168
v2: http://thread.gmane.org/gmane.linux.network/358524/focus=358532
v3: http://thread.gmane.org/gmane.linux.network/358733/focus=358734
v4: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
skb->data will still be different for all non-bpf classifiers/actions.
This fix will still allow us to explore further optimizations like
moving skb_pull() from eth_type_trans() into netif_receive_skb() in the future.
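For illustration (not part of this patch), a minimal classifier relying on the
new skb->data == L2 semantics could look like the sketch below; it assumes the
load_half() helper and SEC() macro from samples/bpf/bpf_helpers.h, and the
matching logic is hypothetical:

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    SEC("classifier")
    int cls_ipv4(struct __sk_buff *skb)
    {
        /* offset 12 is ethhdr::h_proto now that skb->data points at L2
         * on both ingress and egress -- no BPF_LL_OFF needed.
         * LD_ABS|H returns the half-word in host byte order. */
        __u16 proto = load_half(skb, 12);

        return proto == 0x0800 /* ETH_P_IP */ ? 1 : 0; /* 0 == no match */
    }
    char _license[] SEC("license") = "GPL";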
Here is how the ingress callchain looks:

netif_receive_skb            // likely skb->users == 1
  deliver_skb
    packet_rcv               // skb->users == 2
      orig_skb_data = skb->data
      push l2
      res = BPF_PROG_RUN
      if (!res) {
        skb->data = orig_skb_data
        consume_skb(skb)     // skb->users == 1
        goto out
      }
      skb2 = skb_clone(skb)
      skb->data = orig_skb_data
      consume_skb(skb)       // skb->users == 1
      __skb_queue_tail(skb2)
  deliver_skb
    tpacket_rcv              // skb->users == 2
      orig_skb_data = skb->data
      push l2
      res = BPF_PROG_RUN
      if (!res) {
        skb->data = orig_skb_data
        kfree_skb(skb)       // skb->users == 1
        goto out
      }
      if (...) {
        skb2 = skb_clone(skb)
        __skb_queue_tail(skb2)
      }
      skb_copy_bits(skb)
      skb->data = orig_skb_data
      kfree_skb(skb)         // skb->users == 1
  tc_classify
    cls_u32 and other classifiers don't touch skb
      actions like mirred do clone before redirect, etc.
    cls_bpf                  // skb->users == 1
      push l2
      res = BPF_PROG_RUN
      pull l2
      actions                // still see skb->data at L3
    cls_xxx                  // still see skb->data at L3
      actions
  netfilter
  vlan_do_receive
  bridge
  deliver_skb
    mpls_forward and other ptype specific taps
  ip_rcv                     // skb->users == 1
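The trick the callchain relies on can be distilled into one helper. This is a
sketch of the idiom as adopted by cls_bpf/act_bpf in this patch, not a verbatim
kernel excerpt:

    static inline u32 run_bpf_at_l2(const struct bpf_prog *prog,
                                    struct sk_buff *skb)
    {
        u32 res;

        /* Temporarily expose the L2 header. This is safe even for shared
         * skbs because skb->data and skb->len are restored before any
         * other tap or pt->func() looks at the skb. */
        __skb_push(skb, skb->mac_len);
        res = BPF_PROG_RUN(prog, skb);
        __skb_pull(skb, skb->mac_len);
        return res;
    }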
 net/core/filter.c         | 26 +++-----------------------
 net/sched/act_bpf.c       |  9 ++++++++-
 net/sched/cls_bpf.c       | 16 +++++++++++++++-
 samples/bpf/tcbpf1_kern.c |  8 ++++----
 4 files changed, 30 insertions(+), 29 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 09b2062eb5b8..36a69e33d76b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1238,21 +1238,6 @@ int sk_attach_bpf(u32 ufd, struct sock *sk)
     return 0;
 }

-/**
- * bpf_skb_clone_not_writable - is the header of a clone not writable
- * @skb: buffer to check
- * @len: length up to which to write, can be negative
- *
- * Returns true if modifying the header part of the cloned buffer
- * does require the data to be copied. I.e. this version works with
- * negative lengths needed for eBPF case!
- */
-static bool bpf_skb_clone_unwritable(const struct sk_buff *skb, int len)
-{
-    return skb_header_cloned(skb) ||
-           (int) skb_headroom(skb) + len > skb->hdr_len;
-}
-
 #define BPF_RECOMPUTE_CSUM(flags) ((flags) & 1)

 static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags)
@@ -1275,9 +1260,8 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags)
     if (unlikely((u32) offset > 0xffff || len > sizeof(buf)))
         return -EFAULT;

-    offset -= skb->data - skb_mac_header(skb);
     if (unlikely(skb_cloned(skb) &&
-                 bpf_skb_clone_unwritable(skb, offset + len)))
+                 !skb_clone_writable(skb, offset + len)))
         return -EFAULT;

     ptr = skb_header_pointer(skb, offset, len, buf);
@@ -1321,9 +1305,8 @@ static u64 bpf_l3_csum_replace(u64 r1, u64 r2, u64 from, u64 to, u64 flags)
     if (unlikely((u32) offset > 0xffff))
         return -EFAULT;

-    offset -= skb->data - skb_mac_header(skb);
     if (unlikely(skb_cloned(skb) &&
-                 bpf_skb_clone_unwritable(skb, offset + sizeof(sum))))
+                 !skb_clone_writable(skb, offset + sizeof(sum))))
         return -EFAULT;

     ptr = skb_header_pointer(skb, offset, sizeof(sum), &sum);
@@ -1369,9 +1352,8 @@ static u64 bpf_l4_csum_replace(u64 r1, u64 r2, u64 from, u64 to, u64 flags)
     if (unlikely((u32) offset > 0xffff))
         return -EFAULT;

-    offset -= skb->data - skb_mac_header(skb);
     if (unlikely(skb_cloned(skb) &&
-                 bpf_skb_clone_unwritable(skb, offset + sizeof(sum))))
+                 !skb_clone_writable(skb, offset + sizeof(sum))))
         return -EFAULT;

     ptr = skb_header_pointer(skb, offset, sizeof(sum), &sum);
@@ -1425,8 +1407,6 @@ static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
     if (unlikely(!skb2))
         return -ENOMEM;

-    skb_push(skb2, skb2->data - skb_mac_header(skb2));
-
     if (BPF_IS_REDIRECT_INGRESS(flags))
         return dev_forward_skb(dev, skb2);
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index dc6a2d324bd8..1d56903fd4c7 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -37,6 +37,7 @@ static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act,
 {
     struct tcf_bpf *prog = act->priv;
     int action, filter_res;
+    bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;

     if (unlikely(!skb_mac_header_was_set(skb)))
         return TC_ACT_UNSPEC;
@@ -48,7 +49,13 @@ static int tcf_bpf(struct sk_buff *skb, const struct tc_action *act,

     /* Needed here for accessing maps. */
     rcu_read_lock();
-    filter_res = BPF_PROG_RUN(prog->filter, skb);
+    if (at_ingress) {
+        __skb_push(skb, skb->mac_len);
+        filter_res = BPF_PROG_RUN(prog->filter, skb);
+        __skb_pull(skb, skb->mac_len);
+    } else {
+        filter_res = BPF_PROG_RUN(prog->filter, skb);
+    }
     rcu_read_unlock();

     /* A BPF program may overwrite the default action opcode.
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 91bd9c19471d..c79ecfd36e0f 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -64,6 +64,11 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 {
     struct cls_bpf_head *head = rcu_dereference_bh(tp->root);
     struct cls_bpf_prog *prog;
+#ifdef CONFIG_NET_CLS_ACT
+    bool at_ingress = G_TC_AT(skb->tc_verd) & AT_INGRESS;
+#else
+    bool at_ingress = false;
+#endif
     int ret = -1;

     if (unlikely(!skb_mac_header_was_set(skb)))
@@ -72,7 +77,16 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
     /* Needed here for accessing maps. */
     rcu_read_lock();
     list_for_each_entry_rcu(prog, &head->plist, link) {
-        int filter_res = BPF_PROG_RUN(prog->filter, skb);
+        int filter_res;
+
+        if (at_ingress) {
+            /* It is safe to push/pull even if skb_shared() */
+            __skb_push(skb, skb->mac_len);
+            filter_res = BPF_PROG_RUN(prog->filter, skb);
+            __skb_pull(skb, skb->mac_len);
+        } else {
+            filter_res = BPF_PROG_RUN(prog->filter, skb);
+        }

         if (filter_res == 0)
             continue;
diff --git a/samples/bpf/tcbpf1_kern.c b/samples/bpf/tcbpf1_kern.c
index 7c27710f8296..9bfb2eb34563 100644
--- a/samples/bpf/tcbpf1_kern.c
+++ b/samples/bpf/tcbpf1_kern.c
@@ -21,7 +21,7 @@ static inline void set_dst_mac(struct __sk_buff *skb, char *mac)

 static inline void set_ip_tos(struct __sk_buff *skb, __u8 new_tos)
 {
-    __u8 old_tos = load_byte(skb, BPF_LL_OFF + TOS_OFF);
+    __u8 old_tos = load_byte(skb, TOS_OFF);

     bpf_l3_csum_replace(skb, IP_CSUM_OFF, htons(old_tos), htons(new_tos), 2);
     bpf_skb_store_bytes(skb, TOS_OFF, &new_tos, sizeof(new_tos), 0);
@@ -34,7 +34,7 @@ static inline void set_ip_tos(struct __sk_buff *skb, __u8 new_tos)

 static inline void set_tcp_ip_src(struct __sk_buff *skb, __u32 new_ip)
 {
-    __u32 old_ip = _htonl(load_word(skb, BPF_LL_OFF + IP_SRC_OFF));
+    __u32 old_ip = _htonl(load_word(skb, IP_SRC_OFF));

     bpf_l4_csum_replace(skb, TCP_CSUM_OFF, old_ip, new_ip, IS_PSEUDO | sizeof(new_ip));
     bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
@@ -44,7 +44,7 @@ static inline void set_tcp_ip_src(struct __sk_buff *skb, __u32 new_ip)
 #define TCP_DPORT_OFF (ETH_HLEN + sizeof(struct iphdr) + offsetof(struct tcphdr, dest))
 static inline void set_tcp_dest_port(struct __sk_buff *skb, __u16 new_port)
 {
-    __u16 old_port = htons(load_half(skb, BPF_LL_OFF + TCP_DPORT_OFF));
+    __u16 old_port = htons(load_half(skb, TCP_DPORT_OFF));

     bpf_l4_csum_replace(skb, TCP_CSUM_OFF, old_port, new_port, sizeof(new_port));
     bpf_skb_store_bytes(skb, TCP_DPORT_OFF, &new_port, sizeof(new_port), 0);
@@ -53,7 +53,7 @@ static inline void set_tcp_dest_port(struct __sk_buff *skb, __u16 new_port)
 SEC("classifier")
 int bpf_prog1(struct __sk_buff *skb)
 {
-    __u8 proto = load_byte(skb, BPF_LL_OFF + ETH_HLEN + offsetof(struct iphdr, protocol));
+    __u8 proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
     long *value;

     if (proto == IPPROTO_TCP) {
--
1.7.9.5
* [PATCH v2 net-next 2/2] bpf: allow programs to write to certain skb fields

From: Alexei Starovoitov @ 2015-06-04 17:11 UTC
To: David S. Miller; +Cc: Eric Dumazet, Jamal Hadi Salim, Daniel Borkmann, netdev

Allow programs to read/write the skb->mark and tc_index fields and
((struct qdisc_skb_cb *)cb)->data.

mark and tc_index are generically useful in TC.
cb[0]-cb[4] are primarily used to pass arguments from one program to another
called via bpf_tail_call(), which can be seen in the sockex3_kern.c example.

All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
mark and tc_index are writeable from tc_cls_act only.
cb[0]-cb[4] are writeable by both sockets and tc_cls_act.

Add verifier tests and improve sample code.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
v1->v2: no changes

This patch is independent from the previous push/pull L2 patch, but they both
modify net/core/filter.c, so they are sent together.

 include/linux/bpf.h         |  3 +-
 include/uapi/linux/bpf.h    |  2 +
 kernel/bpf/verifier.c       | 37 ++++++++++++-----
 net/core/filter.c           | 94 +++++++++++++++++++++++++++++++++++++------
 samples/bpf/sockex3_kern.c  | 35 +++++-----------
 samples/bpf/test_verifier.c | 84 +++++++++++++++++++++++++++++++++++++-
 6 files changed, 207 insertions(+), 48 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ca854e5bb2f7..2235aee8096a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -105,7 +105,8 @@ struct bpf_verifier_ops {
      */
     bool (*is_valid_access)(int off, int size, enum bpf_access_type type);

-    u32 (*convert_ctx_access)(int dst_reg, int src_reg, int ctx_off,
+    u32 (*convert_ctx_access)(enum bpf_access_type type, int dst_reg,
+                              int src_reg, int ctx_off,
                               struct bpf_insn *insn);
 };

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 42aa19abab86..602f05b7a275 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -248,6 +248,8 @@ struct __sk_buff {
     __u32 priority;
     __u32 ingress_ifindex;
     __u32 ifindex;
+    __u32 tc_index;
+    __u32 cb[5];
 };

 #endif /* _UAPI__LINUX_BPF_H__ */

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cfd9a40b9a5a..039d866fd36a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1692,6 +1692,8 @@ static int do_check(struct verifier_env *env)
             }

         } else if (class == BPF_STX) {
+            enum bpf_reg_type dst_reg_type;
+
             if (BPF_MODE(insn->code) == BPF_XADD) {
                 err = check_xadd(env, insn);
                 if (err)
@@ -1700,11 +1702,6 @@ static int do_check(struct verifier_env *env)
                 continue;
             }

-            if (BPF_MODE(insn->code) != BPF_MEM ||
-                insn->imm != 0) {
-                verbose("BPF_STX uses reserved fields\n");
-                return -EINVAL;
-            }
             /* check src1 operand */
             err = check_reg_arg(regs, insn->src_reg, SRC_OP);
             if (err)
@@ -1714,6 +1711,8 @@ static int do_check(struct verifier_env *env)
             if (err)
                 return err;

+            dst_reg_type = regs[insn->dst_reg].type;
+
             /* check that memory (dst_reg + off) is writeable */
             err = check_mem_access(env, insn->dst_reg, insn->off,
                                    BPF_SIZE(insn->code), BPF_WRITE,
@@ -1721,6 +1720,15 @@ static int do_check(struct verifier_env *env)
             if (err)
                 return err;

+            if (insn->imm == 0) {
+                insn->imm = dst_reg_type;
+            } else if (dst_reg_type != insn->imm &&
+                       (dst_reg_type == PTR_TO_CTX ||
+                        insn->imm == PTR_TO_CTX)) {
+                verbose("same insn cannot be used with different pointers\n");
+                return -EINVAL;
+            }
+
         } else if (class == BPF_ST) {
             if (BPF_MODE(insn->code) != BPF_MEM ||
                 insn->src_reg != BPF_REG_0) {
@@ -1839,12 +1847,18 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
     for (i = 0; i < insn_cnt; i++, insn++) {
         if (BPF_CLASS(insn->code) == BPF_LDX &&
-            (BPF_MODE(insn->code) != BPF_MEM ||
-             insn->imm != 0)) {
+            (BPF_MODE(insn->code) != BPF_MEM || insn->imm != 0)) {
             verbose("BPF_LDX uses reserved fields\n");
             return -EINVAL;
         }

+        if (BPF_CLASS(insn->code) == BPF_STX &&
+            ((BPF_MODE(insn->code) != BPF_MEM &&
+              BPF_MODE(insn->code) != BPF_XADD) || insn->imm != 0)) {
+            verbose("BPF_STX uses reserved fields\n");
+            return -EINVAL;
+        }
+
         if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
             struct bpf_map *map;
             struct fd f;
@@ -1967,12 +1981,17 @@ static int convert_ctx_accesses(struct verifier_env *env)
     struct bpf_prog *new_prog;
     u32 cnt;
     int i;
+    enum bpf_access_type type;

     if (!env->prog->aux->ops->convert_ctx_access)
         return 0;

     for (i = 0; i < insn_cnt; i++, insn++) {
-        if (insn->code != (BPF_LDX | BPF_MEM | BPF_W))
+        if (insn->code == (BPF_LDX | BPF_MEM | BPF_W))
+            type = BPF_READ;
+        else if (insn->code == (BPF_STX | BPF_MEM | BPF_W))
+            type = BPF_WRITE;
+        else
             continue;

         if (insn->imm != PTR_TO_CTX) {
@@ -1982,7 +2001,7 @@ static int convert_ctx_accesses(struct verifier_env *env)
         }

         cnt = env->prog->aux->ops->
-            convert_ctx_access(insn->dst_reg, insn->src_reg,
+            convert_ctx_access(type, insn->dst_reg, insn->src_reg,
                                insn->off, insn_buf);
         if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
             verbose("bpf verifier is misconfigured\n");

diff --git a/net/core/filter.c b/net/core/filter.c
index 36a69e33d76b..d271c06bf01f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -46,6 +46,7 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
+#include <net/sch_generic.h>

 /**
  * sk_filter - run a packet through a socket filter
@@ -1463,13 +1464,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
     }
 }

-static bool sk_filter_is_valid_access(int off, int size,
-                                      enum bpf_access_type type)
+static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
-    /* only read is allowed */
-    if (type != BPF_READ)
-        return false;
-
     /* check bounds */
     if (off < 0 || off >= sizeof(struct __sk_buff))
         return false;
@@ -1485,8 +1481,42 @@ static bool __is_valid_access(int off, int size, enum bpf_access_type type)
     return true;
 }

-static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
-                                        struct bpf_insn *insn_buf)
+static bool sk_filter_is_valid_access(int off, int size,
+                                      enum bpf_access_type type)
+{
+    if (type == BPF_WRITE) {
+        switch (off) {
+        case offsetof(struct __sk_buff, cb[0]) ...
+             offsetof(struct __sk_buff, cb[4]):
+            break;
+        default:
+            return false;
+        }
+    }
+
+    return __is_valid_access(off, size, type);
+}
+
+static bool tc_cls_act_is_valid_access(int off, int size,
+                                       enum bpf_access_type type)
+{
+    if (type == BPF_WRITE) {
+        switch (off) {
+        case offsetof(struct __sk_buff, mark):
+        case offsetof(struct __sk_buff, tc_index):
+        case offsetof(struct __sk_buff, cb[0]) ...
+             offsetof(struct __sk_buff, cb[4]):
+            break;
+        default:
+            return false;
+        }
+    }
+
+    return __is_valid_access(off, size, type);
+}
+
+static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+                                      int src_reg, int ctx_off,
+                                      struct bpf_insn *insn_buf)
 {
     struct bpf_insn *insn = insn_buf;

@@ -1538,7 +1568,15 @@ static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
         break;

     case offsetof(struct __sk_buff, mark):
-        return convert_skb_access(SKF_AD_MARK, dst_reg, src_reg, insn);
+        BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+
+        if (type == BPF_WRITE)
+            *insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg,
+                                  offsetof(struct sk_buff, mark));
+        else
+            *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+                                  offsetof(struct sk_buff, mark));
+        break;

     case offsetof(struct __sk_buff, pkt_type):
         return convert_skb_access(SKF_AD_PKTTYPE, dst_reg, src_reg, insn);
@@ -1553,6 +1591,38 @@ static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
     case offsetof(struct __sk_buff, vlan_tci):
         return convert_skb_access(SKF_AD_VLAN_TAG, dst_reg, src_reg, insn);
+
+    case offsetof(struct __sk_buff, cb[0]) ...
+         offsetof(struct __sk_buff, cb[4]):
+        BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, data) < 20);
+
+        ctx_off -= offsetof(struct __sk_buff, cb[0]);
+        ctx_off += offsetof(struct sk_buff, cb);
+        ctx_off += offsetof(struct qdisc_skb_cb, data);
+        if (type == BPF_WRITE)
+            *insn++ = BPF_STX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
+        else
+            *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
+        break;
+
+    case offsetof(struct __sk_buff, tc_index):
+#ifdef CONFIG_NET_SCHED
+        BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2);
+
+        if (type == BPF_WRITE)
+            *insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg,
+                                  offsetof(struct sk_buff, tc_index));
+        else
+            *insn++ = BPF_LDX_MEM(BPF_H, dst_reg, src_reg,
+                                  offsetof(struct sk_buff, tc_index));
+        break;
+#else
+        if (type == BPF_WRITE)
+            *insn++ = BPF_MOV64_REG(dst_reg, dst_reg);
+        else
+            *insn++ = BPF_MOV64_IMM(dst_reg, 0);
+        break;
+#endif
     }

     return insn - insn_buf;
@@ -1561,13 +1631,13 @@ static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
 static const struct bpf_verifier_ops sk_filter_ops = {
     .get_func_proto = sk_filter_func_proto,
     .is_valid_access = sk_filter_is_valid_access,
-    .convert_ctx_access = sk_filter_convert_ctx_access,
+    .convert_ctx_access = bpf_net_convert_ctx_access,
 };

 static const struct bpf_verifier_ops tc_cls_act_ops = {
     .get_func_proto = tc_cls_act_func_proto,
-    .is_valid_access = sk_filter_is_valid_access,
-    .convert_ctx_access = sk_filter_convert_ctx_access,
+    .is_valid_access = tc_cls_act_is_valid_access,
+    .convert_ctx_access = bpf_net_convert_ctx_access,
 };

 static struct bpf_prog_type_list sk_filter_type __read_mostly = {

diff --git a/samples/bpf/sockex3_kern.c b/samples/bpf/sockex3_kern.c
index 2625b987944f..41ae2fd21b13 100644
--- a/samples/bpf/sockex3_kern.c
+++ b/samples/bpf/sockex3_kern.c
@@ -89,7 +89,6 @@ static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, __u64 off)
 struct globals {
     struct flow_keys flow;
-    __u32 nhoff;
 };

 struct bpf_map_def SEC("maps") percpu_map = {
@@ -139,7 +138,7 @@ static void update_stats(struct __sk_buff *skb, struct globals *g)
 static __always_inline void parse_ip_proto(struct __sk_buff *skb,
                                            struct globals *g, __u32 ip_proto)
 {
-    __u32 nhoff = g->nhoff;
+    __u32 nhoff = skb->cb[0];
     int poff;

     switch (ip_proto) {
@@ -165,7 +164,7 @@ static __always_inline void parse_ip_proto(struct __sk_buff *skb,
             if (gre_flags & GRE_SEQ)
                 nhoff += 4;

-            g->nhoff = nhoff;
+            skb->cb[0] = nhoff;
             parse_eth_proto(skb, gre_proto);
             break;
         }
@@ -195,7 +194,7 @@ PROG(PARSE_IP)(struct __sk_buff *skb)
     if (!g)
         return 0;

-    nhoff = g->nhoff;
+    nhoff = skb->cb[0];

     if (unlikely(ip_is_fragment(skb, nhoff)))
         return 0;
@@ -210,7 +209,7 @@ PROG(PARSE_IP)(struct __sk_buff *skb)
     verlen = load_byte(skb, nhoff + 0/*offsetof(struct iphdr, ihl)*/);
     nhoff += (verlen & 0xF) << 2;

-    g->nhoff = nhoff;
+    skb->cb[0] = nhoff;
     parse_ip_proto(skb, g, ip_proto);
     return 0;
 }
@@ -223,7 +222,7 @@ PROG(PARSE_IPV6)(struct __sk_buff *skb)
     if (!g)
         return 0;

-    nhoff = g->nhoff;
+    nhoff = skb->cb[0];

     ip_proto = load_byte(skb,
                          nhoff + offsetof(struct ipv6hdr, nexthdr));
@@ -233,25 +232,21 @@ PROG(PARSE_IPV6)(struct __sk_buff *skb)
                        nhoff + offsetof(struct ipv6hdr, daddr));
     nhoff += sizeof(struct ipv6hdr);

-    g->nhoff = nhoff;
+    skb->cb[0] = nhoff;
     parse_ip_proto(skb, g, ip_proto);
     return 0;
 }

 PROG(PARSE_VLAN)(struct __sk_buff *skb)
 {
-    struct globals *g = this_cpu_globals();
     __u32 nhoff, proto;

-    if (!g)
-        return 0;
-
-    nhoff = g->nhoff;
+    nhoff = skb->cb[0];

     proto = load_half(skb, nhoff + offsetof(struct vlan_hdr,
                                             h_vlan_encapsulated_proto));
     nhoff += sizeof(struct vlan_hdr);
-    g->nhoff = nhoff;
+    skb->cb[0] = nhoff;

     parse_eth_proto(skb, proto);

@@ -260,17 +255,13 @@ PROG(PARSE_VLAN)(struct __sk_buff *skb)
 PROG(PARSE_MPLS)(struct __sk_buff *skb)
 {
-    struct globals *g = this_cpu_globals();
     __u32 nhoff, label;

-    if (!g)
-        return 0;
-
-    nhoff = g->nhoff;
+    nhoff = skb->cb[0];

     label = load_word(skb, nhoff);
     nhoff += sizeof(struct mpls_label);
-    g->nhoff = nhoff;
+    skb->cb[0] = nhoff;

     if (label & MPLS_LS_S_MASK) {
         __u8 verlen = load_byte(skb, nhoff);
@@ -288,14 +279,10 @@ PROG(PARSE_MPLS)(struct __sk_buff *skb)
 SEC("socket/0")
 int main_prog(struct __sk_buff *skb)
 {
-    struct globals *g = this_cpu_globals();
     __u32 nhoff = ETH_HLEN;
     __u32 proto = load_half(skb, 12);

-    if (!g)
-        return 0;
-
-    g->nhoff = nhoff;
+    skb->cb[0] = nhoff;
     parse_eth_proto(skb, proto);
     return 0;
 }

diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 12f3780af73f..693605997abc 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -29,6 +29,7 @@ struct bpf_test {
         ACCEPT,
         REJECT
     } result;
+    enum bpf_prog_type prog_type;
 };

 static struct bpf_test tests[] = {
@@ -743,6 +744,84 @@ static struct bpf_test tests[] = {
         .errstr = "different pointers",
         .result = REJECT,
     },
+    {
+        "check skb->mark is not writeable by sockets",
+        .insns = {
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+                        offsetof(struct __sk_buff, mark)),
+            BPF_EXIT_INSN(),
+        },
+        .errstr = "invalid bpf_context access",
+        .result = REJECT,
+    },
+    {
+        "check skb->tc_index is not writeable by sockets",
+        .insns = {
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+                        offsetof(struct __sk_buff, tc_index)),
+            BPF_EXIT_INSN(),
+        },
+        .errstr = "invalid bpf_context access",
+        .result = REJECT,
+    },
+    {
+        "check non-u32 access to cb",
+        .insns = {
+            BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[0])),
+            BPF_EXIT_INSN(),
+        },
+        .errstr = "invalid bpf_context access",
+        .result = REJECT,
+    },
+    {
+        "check out of range skb->cb access",
+        .insns = {
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[60])),
+            BPF_EXIT_INSN(),
+        },
+        .errstr = "invalid bpf_context access",
+        .result = REJECT,
+        .prog_type = BPF_PROG_TYPE_SCHED_ACT,
+    },
+    {
+        "write skb fields from socket prog",
+        .insns = {
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[4])),
+            BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, mark)),
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, tc_index)),
+            BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[0])),
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[2])),
+            BPF_EXIT_INSN(),
+        },
+        .result = ACCEPT,
+    },
+    {
+        "write skb fields from tc_cls_act prog",
+        .insns = {
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, cb[0])),
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+                        offsetof(struct __sk_buff, mark)),
+            BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+                        offsetof(struct __sk_buff, tc_index)),
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+                        offsetof(struct __sk_buff, tc_index)),
+            BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+                        offsetof(struct __sk_buff, cb[3])),
+            BPF_EXIT_INSN(),
+        },
+        .result = ACCEPT,
+        .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+    },
 };

 static int probe_filter_length(struct bpf_insn *fp)
@@ -775,6 +854,7 @@ static int test(void)
     for (i = 0; i < ARRAY_SIZE(tests); i++) {
         struct bpf_insn *prog = tests[i].insns;
+        int prog_type = tests[i].prog_type;
         int prog_len = probe_filter_length(prog);
         int *fixup = tests[i].fixup;
         int map_fd = -1;
@@ -789,8 +869,8 @@ static int test(void)
         }

         printf("#%d %s ", i, tests[i].descr);

-        prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
-                                prog_len * sizeof(struct bpf_insn),
+        prog_fd = bpf_prog_load(prog_type ?: BPF_PROG_TYPE_SOCKET_FILTER,
+                                prog, prog_len * sizeof(struct bpf_insn),
                                 "GPL", 0);

         if (tests[i].result == ACCEPT) {
--
1.7.9.5
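For illustration only (not part of the patchset), a tc_cls_act program
exercising the newly writable fields could look like this hypothetical sketch;
the values written are arbitrary:

    SEC("classifier")
    int set_mark_and_chain(struct __sk_buff *skb)
    {
        skb->mark = 0xbeef;   /* writable from tc_cls_act progs only */
        skb->tc_index = 1;    /* likewise tc_cls_act only */
        skb->cb[0] = 14;      /* cb[0]-cb[4] act as scratch space to pass
                               * state to a program entered via
                               * bpf_tail_call() */
        return 0;
    }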
* Re: [PATCH v2 net-next 2/2] bpf: allow programs to write to certain skb fields

From: David Miller @ 2015-06-07 9:01 UTC
To: ast; +Cc: edumazet, jhs, daniel, netdev

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Thu, 4 Jun 2015 10:11:54 -0700

> allow programs read/write skb->mark, tc_index fields and
> ((struct qdisc_skb_cb *)cb)->data.
> [...]
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Applied.
* Re: [PATCH v2 net-next 1/2] bpf: make programs see skb->data == L2 for ingress and egress

From: Jamal Hadi Salim @ 2015-06-05 10:03 UTC
To: Alexei Starovoitov, David S. Miller; +Cc: Eric Dumazet, Daniel Borkmann, netdev

On 06/04/15 13:11, Alexei Starovoitov wrote:
> eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
> [...]
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
* Re: [PATCH v2 net-next 1/2] bpf: make programs see skb->data == L2 for ingress and egress

From: David Miller @ 2015-06-07 9:01 UTC
To: ast; +Cc: edumazet, jhs, daniel, netdev

From: Alexei Starovoitov <ast@plumgrid.com>
Date: Thu, 4 Jun 2015 10:11:53 -0700

> eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
> [...]
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

Applied.