* [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper
@ 2015-04-08  1:03 Alexei Starovoitov
  2015-04-08  1:03 ` [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent Alexei Starovoitov
  2015-04-08  7:25 ` [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
  0 siblings, 2 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08  1:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: Daniel Borkmann, Jiri Pirko, Jamal Hadi Salim, netdev
similar to skb_postpull_rcsum() introduce skb_postpush_rcsum() helper
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
skb_postpush_rcsum() is also used in the next patch.
 include/linux/skbuff.h |    7 +++++++
 net/core/filter.c      |    4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 36f3f43c0117..3c5b191869bb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2586,6 +2586,13 @@ static inline void skb_postpull_rcsum(struct sk_buff *skb,
 		skb->csum = csum_sub(skb->csum, csum_partial(start, len, 0));
 }
 
+static inline void skb_postpush_rcsum(struct sk_buff *skb,
+				      const void *start, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE)
+		skb->csum = csum_add(skb->csum, csum_partial(start, len, 0));
+}
+
 unsigned char *skb_pull_rcsum(struct sk_buff *skb, unsigned int len);
 
 /**
diff --git a/net/core/filter.c b/net/core/filter.c
index b669e75d2b36..e5872704c28b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1213,8 +1213,8 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags)
 		/* skb_store_bits cannot return -EFAULT here */
 		skb_store_bits(skb, offset, ptr, len);
 
-	if (BPF_RECOMPUTE_CSUM(flags) && skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->csum = csum_add(skb->csum, csum_partial(ptr, len, 0));
+	if (BPF_RECOMPUTE_CSUM(flags))
+		skb_postpush_rcsum(skb, ptr, len);
 	return 0;
 }
 
-- 
1.7.9.5
^ permalink raw reply related	[flat|nested] 25+ messages in thread
* [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  1:03 [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
@ 2015-04-08  1:03 ` Alexei Starovoitov
  2015-04-08  2:35   ` David Miller
  2015-04-08 11:43   ` Jamal Hadi Salim
  2015-04-08  7:25 ` [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
  1 sibling, 2 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08  1:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: Daniel Borkmann, Jiri Pirko, Jamal Hadi Salim, netdev
TC classifers and actions attached to ingress and egress qdiscs see
inconsistent skb->data. For ingress L2 header is already pulled, whereas
for egress it's present. Make them consistent by pushing L2 before calling
into classifiers/actions and pulling L2 back afterwards.
The following cls/act assume L2 and were broken on ingress before this commit:
cls_bpf, act_bpf, cls_rsvp, cls_cgroup, act_csum, act_nat, all of ematch.
All other in-tree cls/act use skb_network_offset() accessors for skb data
and work regardless whether L2 was pulled or not.
Since L2 is now present on ingress update act_mirred (the only action that
was aware of L2 differences) and fix two bugs in it:
- it was forwarding packets with L2 present to tunnel devices if used
  with egress qdisc
- it wasn't updating skb->csum with ingress qdisc
Also rename 'ok_push' to 'needs_l2' to make it more readable.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
V1->V2:
. major refactoring: move push/pull into ingress qdisc
. use dev->hard_header_len instead of hard coding ETH_HLEN
. remove now obsolete hack in samples/bpf/ example
cls_rsvp, act_csum, act_nat may appear to work on ingress, since they're using
skb_*_offset(), but they do pskb_may_pull() assuming L2 is present.
ematches are hard coding skb->data to mean L2 in few places.
The fix may break out-of-tree ingress classifiers that rely on lack of L2 and
which use skb->data directly.
Alternative approach would be to do full push/pull with csum recompute
in ingress_enqueue() which will simplify act_mirred. Not sure which is better.
 include/net/tc_act/tc_mirred.h |    2 +-
 net/sched/act_mirred.c         |   30 ++++++++++++++++++++++--------
 net/sched/sch_ingress.c        |    8 ++++++++
 samples/bpf/tcbpf1_kern.c      |    6 ------
 4 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/include/net/tc_act/tc_mirred.h b/include/net/tc_act/tc_mirred.h
index 4dd77a1c106b..f0cddd81bd6c 100644
--- a/include/net/tc_act/tc_mirred.h
+++ b/include/net/tc_act/tc_mirred.h
@@ -7,7 +7,7 @@ struct tcf_mirred {
 	struct tcf_common	common;
 	int			tcfm_eaction;
 	int			tcfm_ifindex;
-	int			tcfm_ok_push;
+	int			tcfm_needs_l2;
 	struct net_device	*tcfm_dev;
 	struct list_head	tcfm_list;
 };
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5953517ec059..c015d1c43db7 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -52,7 +52,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	struct tc_mirred *parm;
 	struct tcf_mirred *m;
 	struct net_device *dev;
-	int ret, ok_push = 0;
+	int ret, needs_l2 = 0;
 
 	if (nla == NULL)
 		return -EINVAL;
@@ -80,10 +80,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		case ARPHRD_IPGRE:
 		case ARPHRD_VOID:
 		case ARPHRD_NONE:
-			ok_push = 0;
+			needs_l2 = 0;
 			break;
 		default:
-			ok_push = 1;
+			needs_l2 = 1;
 			break;
 		}
 	} else {
@@ -114,7 +114,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 			dev_put(m->tcfm_dev);
 		dev_hold(dev);
 		m->tcfm_dev = dev;
-		m->tcfm_ok_push = ok_push;
+		m->tcfm_needs_l2 = needs_l2;
 	}
 	spin_unlock_bh(&m->tcf_lock);
 	if (ret == ACT_P_CREATED) {
@@ -131,7 +131,7 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_mirred *m = a->priv;
 	struct net_device *dev;
 	struct sk_buff *skb2;
-	u32 at;
+	u32 at, hard_header_len;
 	int retval, err = 1;
 
 	spin_lock(&m->tcf_lock);
@@ -155,9 +155,23 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
 	if (skb2 == NULL)
 		goto out;
 
-	if (!(at & AT_EGRESS)) {
-		if (m->tcfm_ok_push)
-			skb_push(skb2, skb2->dev->hard_header_len);
+	hard_header_len = skb2->dev->hard_header_len;
+
+	if (!m->tcfm_needs_l2) {
+		/* in packets arriving on egress qdiscs skb->csum (complete)
+		 * includes L2, whereas ingress_enqueue() only pushes L2 without
+		 * updating skb->csum.
+		 * So pull L2 here to undo push done by ingress_enqueue()
+		 * and do pull_rcsum for fully checksummed packets
+		 */
+		if (at & AT_INGRESS)
+			skb_pull(skb2, hard_header_len);
+		else
+			skb_pull_rcsum(skb2, hard_header_len);
+	} else {
+		/* ingress_enqueue() already pushed L2, recompute csum here */
+		if (at & AT_INGRESS)
+			skb_postpush_rcsum(skb2, skb2->data, hard_header_len);
 	}
 
 	/* mirror is always swallowed */
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index eb5b8445fef9..658b46082269 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -61,9 +61,17 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	struct ingress_qdisc_data *p = qdisc_priv(sch);
 	struct tcf_result res;
 	struct tcf_proto *fl = rcu_dereference_bh(p->filter_list);
+	unsigned int hard_header_len = skb->dev->hard_header_len;
 	int result;
 
+	if (unlikely(skb_headroom(skb) < hard_header_len))
+		/* should never happen, since L2 was just pulled on ingress */
+		return TC_ACT_OK;
+
+	/* don't recompute skb->csum back and forth while pushing/pulling L2 */
+	__skb_push(skb, hard_header_len);
 	result = tc_classify(skb, fl, &res);
+	__skb_pull(skb, hard_header_len);
 
 	qdisc_bstats_update(sch, skb);
 	switch (result) {
diff --git a/samples/bpf/tcbpf1_kern.c b/samples/bpf/tcbpf1_kern.c
index 7cf3f42a6e39..76cdaab63058 100644
--- a/samples/bpf/tcbpf1_kern.c
+++ b/samples/bpf/tcbpf1_kern.c
@@ -14,12 +14,6 @@ static inline void set_dst_mac(struct __sk_buff *skb, char *mac)
 	bpf_skb_store_bytes(skb, 0, mac, ETH_ALEN, 1);
 }
 
-/* use 1 below for ingress qdisc and 0 for egress */
-#if 0
-#undef ETH_HLEN
-#define ETH_HLEN 0
-#endif
-
 #define IP_CSUM_OFF (ETH_HLEN + offsetof(struct iphdr, check))
 #define TOS_OFF (ETH_HLEN + offsetof(struct iphdr, tos))
 
-- 
1.7.9.5
^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  1:03 ` [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent Alexei Starovoitov
@ 2015-04-08  2:35   ` David Miller
  2015-04-08  3:22     ` Alexei Starovoitov
  2015-04-08 11:43   ` Jamal Hadi Salim
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-08  2:35 UTC (permalink / raw)
  To: ast; +Cc: daniel, jiri, jhs, netdev
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Tue,  7 Apr 2015 18:03:45 -0700
> +
> +	/* don't recompute skb->csum back and forth while pushing/pulling L2 */
> +	__skb_push(skb, hard_header_len);
>  	result = tc_classify(skb, fl, &res);
> +	__skb_pull(skb, hard_header_len);
This is not legal.
This SKB can be referenced by other entities, such as AF_PACKET
sockets and other network taps on input.  You absolutely do not
have private access to this SKB object, and any modification you
make to it will be seen by others.
Therefore you cannot push and pull the headers, because that
modification will be seen by the other entities referencing this SKB.
This is why every network protocol's input path must do something
like:
	skb = skb_share_check(skb, GFP_ATOMIC);
before pulling headers.
And you do not want to add this expensive operation here.
That would be really stupid overhead just to accomodate BPF things.
Instead just propagate some offset into the locations that care about
the skb data location.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  2:35   ` David Miller
@ 2015-04-08  3:22     ` Alexei Starovoitov
  2015-04-08  4:48       ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08  3:22 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, jiri, jhs, netdev
On 4/7/15 7:35 PM, David Miller wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Tue,  7 Apr 2015 18:03:45 -0700
>
>> +
>> +	/* don't recompute skb->csum back and forth while pushing/pulling L2 */
>> +	__skb_push(skb, hard_header_len);
>>   	result = tc_classify(skb, fl, &res);
>> +	__skb_pull(skb, hard_header_len);
>
> This is not legal.
>
> This SKB can be referenced by other entities, such as AF_PACKET
> sockets and other network taps on input.  You absolutely do not
> have private access to this SKB object, and any modification you
> make to it will be seen by others.
>
> Therefore you cannot push and pull the headers, because that
> modification will be seen by the other entities referencing this SKB.
ohh, yes. Spent too much time looking at TC that forgot the obvious.
Was modeling this one by skb_defer_rx_timestamp() which is called before
taps. My v1 patch was obviously broken too :(
> And you do not want to add this expensive operation here.
>
> That would be really stupid overhead just to accomodate BPF things.
skb_share_check is only expensive if taps are active and
not only cls_bpf, but ematch and bunch of other cls/acts assume L2,
but it seems no one cares about using them with ingress, so I'll go back
to cls_bpf specific skb_share_check and push.
Other classifiers that care about attaching to ingress would need
to do the same thing or play nicely with offsets. Fair enough.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  3:22     ` Alexei Starovoitov
@ 2015-04-08  4:48       ` Alexei Starovoitov
  2015-04-08  8:36         ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08  4:48 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, jiri, jhs, netdev
On 4/7/15 8:22 PM, Alexei Starovoitov wrote:
> but it seems no one cares about using them with ingress, so I'll go back
> to cls_bpf specific skb_share_check and push.
that didn't work either :(
we cannot replace skb via skb_share_check() inside cls/act. We cannot do
it inside ingress_enqueue() either. It can only be done at handle_ing()
level. And it's quite ugly to change the signatures of the whole
qdisc->enqueue() call chain just for cls_bpf. May be introducing
bpf-only ingress qdisc to decouple the logic is not such a bad idea?
Then ing_filter() can special case it just like != noop_qdisc.
Or a flag for existing ingress qdisc? Will see how it looks.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper
  2015-04-08  1:03 [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
  2015-04-08  1:03 ` [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent Alexei Starovoitov
@ 2015-04-08  7:25 ` Daniel Borkmann
  1 sibling, 0 replies; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08  7:25 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller; +Cc: Jiri Pirko, Jamal Hadi Salim, netdev
On 04/08/2015 03:03 AM, Alexei Starovoitov wrote:
> similar to skb_postpull_rcsum() introduce skb_postpush_rcsum() helper
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
This one you could resend nevertheless as a stand-alone as it
also improves the existing parts.
Would be good if you could also add a proper kernel doc part
as in skb_postpull_rcsum().
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  4:48       ` Alexei Starovoitov
@ 2015-04-08  8:36         ` Daniel Borkmann
  2015-04-08  9:05           ` Jiri Pirko
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08  8:36 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, jiri, jhs, netdev, tgraf
On 04/08/2015 06:48 AM, Alexei Starovoitov wrote:
> On 4/7/15 8:22 PM, Alexei Starovoitov wrote:
>> but it seems no one cares about using them with ingress, so I'll go back
>> to cls_bpf specific skb_share_check and push.
>
> that didn't work either :(
> we cannot replace skb via skb_share_check() inside cls/act. We cannot do
> it inside ingress_enqueue() either. It can only be done at handle_ing()
> level. And it's quite ugly to change the signatures of the whole
> qdisc->enqueue() call chain just for cls_bpf. May be introducing
> bpf-only ingress qdisc to decouple the logic is not such a bad idea?
So it seems ingress qdisc is quite broken for various classifier
and actions. :/ I wouldn't go that far to have a bpf-only ingress
qdisc, but what about introducing l2/l3 ingress qdisc (or, name
it "early ingress" and "ingress" qdisc), so at an early point in
netif_receive_skb_internal(), we would have an l2_ingress hook,
wrapped via static keys to have minimal impact if unused, and could
do the push/pull similarly as in the PTP classifier w/o worry that
it is referenced by other entities. There, we could at least still
benefit from hw flow steering.
The current ingress qdisc, we'd rename l3_ingress to make it clear
what to expect (can also be aliased in iproute2). Maybe classifiers,
actions could be flagged as l2/l3 capable and checked at config
time where to apply, at least in the case of {cls,act}_bpf?
The other thing I had in mind is that we could expose skb_iif to
detect that we're actually coming from ingress qdisc from inside
the ebpf prog, but that is very limited and you nevertheless miss
out on l2 context.
Thanks,
Daniel
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  8:36         ` Daniel Borkmann
@ 2015-04-08  9:05           ` Jiri Pirko
  2015-04-08 10:54             ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Jiri Pirko @ 2015-04-08  9:05 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, David Miller, jhs, netdev, tgraf
Wed, Apr 08, 2015 at 10:36:08AM CEST, daniel@iogearbox.net wrote:
>On 04/08/2015 06:48 AM, Alexei Starovoitov wrote:
>>On 4/7/15 8:22 PM, Alexei Starovoitov wrote:
>>>but it seems no one cares about using them with ingress, so I'll go back
>>>to cls_bpf specific skb_share_check and push.
>>
>>that didn't work either :(
>>we cannot replace skb via skb_share_check() inside cls/act. We cannot do
>>it inside ingress_enqueue() either. It can only be done at handle_ing()
>>level. And it's quite ugly to change the signatures of the whole
>>qdisc->enqueue() call chain just for cls_bpf. May be introducing
>>bpf-only ingress qdisc to decouple the logic is not such a bad idea?
>
>So it seems ingress qdisc is quite broken for various classifier
>and actions. :/ I wouldn't go that far to have a bpf-only ingress
>qdisc, but what about introducing l2/l3 ingress qdisc (or, name
>it "early ingress" and "ingress" qdisc), so at an early point in
>netif_receive_skb_internal(), we would have an l2_ingress hook,
>wrapped via static keys to have minimal impact if unused, and could
>do the push/pull similarly as in the PTP classifier w/o worry that
>it is referenced by other entities. There, we could at least still
>benefit from hw flow steering.
How about to just adjust ingress qdisc to do the right thing (of adjust
egress qdisc so they both behave the same). I don't like the idea of
having more ingres queue disk. Would be just confusing.
>
>The current ingress qdisc, we'd rename l3_ingress to make it clear
>what to expect (can also be aliased in iproute2). Maybe classifiers,
>actions could be flagged as l2/l3 capable and checked at config
>time where to apply, at least in the case of {cls,act}_bpf?
>
>The other thing I had in mind is that we could expose skb_iif to
>detect that we're actually coming from ingress qdisc from inside
>the ebpf prog, but that is very limited and you nevertheless miss
>out on l2 context.
As you said, this needs to be resolved for others as well.
>
>Thanks,
>Daniel
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  9:05           ` Jiri Pirko
@ 2015-04-08 10:54             ` Daniel Borkmann
  2015-04-08 11:14               ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 10:54 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Alexei Starovoitov, David Miller, jhs, netdev, tgraf
On 04/08/2015 11:05 AM, Jiri Pirko wrote:
> Wed, Apr 08, 2015 at 10:36:08AM CEST, daniel@iogearbox.net wrote:
>> On 04/08/2015 06:48 AM, Alexei Starovoitov wrote:
>>> On 4/7/15 8:22 PM, Alexei Starovoitov wrote:
>>>> but it seems no one cares about using them with ingress, so I'll go back
>>>> to cls_bpf specific skb_share_check and push.
>>>
>>> that didn't work either :(
>>> we cannot replace skb via skb_share_check() inside cls/act. We cannot do
>>> it inside ingress_enqueue() either. It can only be done at handle_ing()
>>> level. And it's quite ugly to change the signatures of the whole
>>> qdisc->enqueue() call chain just for cls_bpf. May be introducing
>>> bpf-only ingress qdisc to decouple the logic is not such a bad idea?
>>
>> So it seems ingress qdisc is quite broken for various classifier
>> and actions. :/ I wouldn't go that far to have a bpf-only ingress
>> qdisc, but what about introducing l2/l3 ingress qdisc (or, name
>> it "early ingress" and "ingress" qdisc), so at an early point in
>> netif_receive_skb_internal(), we would have an l2_ingress hook,
>> wrapped via static keys to have minimal impact if unused, and could
>> do the push/pull similarly as in the PTP classifier w/o worry that
>> it is referenced by other entities. There, we could at least still
>> benefit from hw flow steering.
>
> How about to just adjust ingress qdisc to do the right thing (of adjust
> egress qdisc so they both behave the same). I don't like the idea of
> having more ingres queue disk. Would be just confusing.
I'm all for it, that's what I've mentioned earlier in this thread
already. ;) The above would be one possibility, but of course I'm
open for other, better suggestions?
I totally agree with Dave that skb_share_check() should be avoided
at all costs. At least on my laptop (maybe not a perfect example),
I've got these as packet socket users present in the background,
so there are packet users running all the time where we would hit
skb_share_check() then:
# ss -0lnp
Netid  State      Recv-Q Send-Q      Local Address:Port    Peer Address:Port
p_raw  UNCONN     0      0                *:wlp2s0b1       *      users:(("dhclient",1290,5))
p_dgr  UNCONN     0      0          [34958]:wlp2s0b1       *      users:(("wpa_supplicant",805,13))
p_dgr  UNCONN     0      0              [0]:*              *      users:(("wpa_supplicant",805,12))
I do not yet see a generic way to push an offset down into various
classifiers and actions that otherwise don't really work with ingress,
it's not just limited to BPF only as Alexei already mentioned. Hm.
Cheers,
Daniel
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 10:54             ` Daniel Borkmann
@ 2015-04-08 11:14               ` Daniel Borkmann
  2015-04-08 11:47                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 11:14 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Alexei Starovoitov, David Miller, jhs, netdev, tgraf
On 04/08/2015 12:54 PM, Daniel Borkmann wrote:
> On 04/08/2015 11:05 AM, Jiri Pirko wrote:
>> Wed, Apr 08, 2015 at 10:36:08AM CEST, daniel@iogearbox.net wrote:
>>> On 04/08/2015 06:48 AM, Alexei Starovoitov wrote:
>>>> On 4/7/15 8:22 PM, Alexei Starovoitov wrote:
>>>>> but it seems no one cares about using them with ingress, so I'll go back
>>>>> to cls_bpf specific skb_share_check and push.
>>>>
>>>> that didn't work either :(
>>>> we cannot replace skb via skb_share_check() inside cls/act. We cannot do
>>>> it inside ingress_enqueue() either. It can only be done at handle_ing()
>>>> level. And it's quite ugly to change the signatures of the whole
>>>> qdisc->enqueue() call chain just for cls_bpf. May be introducing
>>>> bpf-only ingress qdisc to decouple the logic is not such a bad idea?
>>>
>>> So it seems ingress qdisc is quite broken for various classifier
>>> and actions. :/ I wouldn't go that far to have a bpf-only ingress
>>> qdisc, but what about introducing l2/l3 ingress qdisc (or, name
>>> it "early ingress" and "ingress" qdisc), so at an early point in
>>> netif_receive_skb_internal(), we would have an l2_ingress hook,
>>> wrapped via static keys to have minimal impact if unused, and could
>>> do the push/pull similarly as in the PTP classifier w/o worry that
>>> it is referenced by other entities. There, we could at least still
>>> benefit from hw flow steering.
>>
>> How about to just adjust ingress qdisc to do the right thing (of adjust
>> egress qdisc so they both behave the same). I don't like the idea of
Generically adjusting egress towards ingress would not work. I
think it's reasonable to assume that the majority of people use
classifier and actions only from egress side, and they rely on
having l2 context present. Stripping that away would also be an
artificial limitation we'd impose.
You could use the ingress qdisc to redirect traffic to an ifb
device and attach the same egress classifier and action there
as skb_pull(skb, skb->dev->hard_header_len) is being done, but
I'd presume that extra detour is pretty slow. To make this useful,
we'd need a very lightweight solution.
>> having more ingres queue disk. Would be just confusing.
>
> I'm all for it, that's what I've mentioned earlier in this thread
> already. ;) The above would be one possibility, but of course I'm
> open for other, better suggestions?
>
> I totally agree with Dave that skb_share_check() should be avoided
> at all costs. At least on my laptop (maybe not a perfect example),
> I've got these as packet socket users present in the background,
> so there are packet users running all the time where we would hit
> skb_share_check() then:
>
> # ss -0lnp
> Netid  State      Recv-Q Send-Q      Local Address:Port    Peer Address:Port
> p_raw  UNCONN     0      0                *:wlp2s0b1       *      users:(("dhclient",1290,5))
> p_dgr  UNCONN     0      0          [34958]:wlp2s0b1       *      users:(("wpa_supplicant",805,13))
> p_dgr  UNCONN     0      0              [0]:*              *      users:(("wpa_supplicant",805,12))
>
> I do not yet see a generic way to push an offset down into various
> classifiers and actions that otherwise don't really work with ingress,
> it's not just limited to BPF only as Alexei already mentioned. Hm.
>
> Cheers,
> Daniel
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08  1:03 ` [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent Alexei Starovoitov
  2015-04-08  2:35   ` David Miller
@ 2015-04-08 11:43   ` Jamal Hadi Salim
  1 sibling, 0 replies; 25+ messages in thread
From: Jamal Hadi Salim @ 2015-04-08 11:43 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller; +Cc: Daniel Borkmann, Jiri Pirko, netdev
Hi Alexei,
Sorry havent had the opportunity to rejoin the discussion on the other
emails (I plan to still). Let me get my head out of the sand
and respond to this one because it is a new thread.
This change is a little intrusive.
Summary: my view is a NACK for this change. It will break existing
assumptions and adds unnecessary costs.
Fix bpf_xxx. If cls_rsvp doesnt work, then it needs fixing too[1].
Longer version (to help move forward):
1) Some netdevs show up at the ingress with Link layer headers and
others dont. The assumption of doing blind pull/push is not going
to work. mirred and a few other users care where their packets
are being seen and in my opinion it is their job to handle those
cases. Besides that i am uncomfortable adding this new per-packet cost.
Why are you not able to achieve your goal using the indicators
of where the packet showed up at? Refer to the AT bit definitions
and how they are used elsewhere. Context: There are three paths the
packet sources from:
ingress, egress and packets coming from the stack (when it is
neither from ingress or egress). One of your changes assumed "if
we are not at egress" to mean "we are at ingress". It missed
the case where packets came from the stack.
The good news is that we have consistent behavior depending
where we are seeing packets, so it is easy to special case things
as mirred does.
2) Actions should be self contained if they are to be used in
conjunction with other actions in a policy. They do some small
thing and do it well. Adding semantics of checksum knowledge
in an action whose role is merely to send link level packets out
a netdev seems a little of an overkill. To be honest i am conflicted;
on one side i see the value being not much different than a netdev
that does csum recomputation - otoh i am seeing it as unnecessary
computation that only few preceding actions need. If it can be made
optional and policy defined with reasonable datapath cost (eg a single
if check when there is no interest) then i think it will be more
palatable.
3) The role of mirred is to send link layer packets. It cares about
making sure that the packets are in the proper link layer format
before sending them on the "wire". So the push is there merely to work
around cases where some netdevs have their link layers stripped off.
Anything else that cares about offsets should take of them as well.
IOW, all things you mentioned as not working need to be fixed.
cheers,
jamal
[1] I looked quickly at tests i ran against rsvp and i dont see any that
test at ingress - so i think you are right, it needs fixing (dont
have time right now, be my guest and i will test).
On 04/07/15 21:03, Alexei Starovoitov wrote:
> TC classifers and actions attached to ingress and egress qdiscs see
> inconsistent skb->data. For ingress L2 header is already pulled, whereas
> for egress it's present. Make them consistent by pushing L2 before calling
> into classifiers/actions and pulling L2 back afterwards.
>
> The following cls/act assume L2 and were broken on ingress before this commit:
> cls_bpf, act_bpf, cls_rsvp, cls_cgroup, act_csum, act_nat, all of ematch.
>
> All other in-tree cls/act use skb_network_offset() accessors for skb data
> and work regardless whether L2 was pulled or not.
>
> Since L2 is now present on ingress update act_mirred (the only action that
> was aware of L2 differences) and fix two bugs in it:
> - it was forwarding packets with L2 present to tunnel devices if used
>    with egress qdisc
> - it wasn't updating skb->csum with ingress qdisc
> Also rename 'ok_push' to 'needs_l2' to make it more readable.
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 11:14               ` Daniel Borkmann
@ 2015-04-08 11:47                 ` Jamal Hadi Salim
  2015-04-08 12:31                   ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2015-04-08 11:47 UTC (permalink / raw)
  To: Daniel Borkmann, Jiri Pirko
  Cc: Alexei Starovoitov, David Miller, netdev, tgraf
Should have read my emails backward. Refer to my other email
So why are you not able to use the indicators of where in the stack
you are in both actions and classifiers?
bpf needs to adjust.
cheers,
jamal
On 04/08/15 07:14, Daniel Borkmann wrote:
> On 04/08/2015 12:54 PM, Daniel Borkmann wrote:
>> On 04/08/2015 11:05 AM, Jiri Pirko wrote:
>
> Generically adjusting egress towards ingress would not work. I
> think it's reasonable to assume that the majority of people use
> classifier and actions only from egress side, and they rely on
> having l2 context present. Stripping that away would also be an
> artificial limitation we'd impose.
>
> You could use the ingress qdisc to redirect traffic to an ifb
> device and attach the same egress classifier and action there
> as skb_pull(skb, skb->dev->hard_header_len) is being done, but
> I'd presume that extra detour is pretty slow. To make this useful,
> we'd need a very lightweight solution.
>
>>> having more ingres queue disk. Would be just confusing.
>>
>> I'm all for it, that's what I've mentioned earlier in this thread
>> already. ;) The above would be one possibility, but of course I'm
>> open for other, better suggestions?
>>
>> I totally agree with Dave that skb_share_check() should be avoided
>> at all costs. At least on my laptop (maybe not a perfect example),
>> I've got these as packet socket users present in the background,
>> so there are packet users running all the time where we would hit
>> skb_share_check() then:
>>
>> # ss -0lnp
>> Netid  State      Recv-Q Send-Q      Local Address:Port    Peer
>> Address:Port
>> p_raw  UNCONN     0      0                *:wlp2s0b1       *
>> users:(("dhclient",1290,5))
>> p_dgr  UNCONN     0      0          [34958]:wlp2s0b1       *
>> users:(("wpa_supplicant",805,13))
>> p_dgr  UNCONN     0      0              [0]:*              *
>> users:(("wpa_supplicant",805,12))
>>
>> I do not yet see a generic way to push an offset down into various
>> classifiers and actions that otherwise don't really work with ingress,
>> it's not just limited to BPF only as Alexei already mentioned. Hm.
>>
>> Cheers,
>> Daniel
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 11:47                 ` Jamal Hadi Salim
@ 2015-04-08 12:31                   ` Daniel Borkmann
  2015-04-08 12:58                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 12:31 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko
  Cc: Alexei Starovoitov, David Miller, netdev, tgraf
On 04/08/2015 01:47 PM, Jamal Hadi Salim wrote:
> Should have read my emails backward. Refer to my other email
Ok, sure.
> So why are you not able to use the indicators of where in the stack
> you are in both actions and classifiers?
> bpf needs to adjust.
So aside from the ifb dev, the only guys caring about AT_INGRESS
and AT_EGRESS is netem and act_mirred. That means the tc's cls_u32
sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
either, so in u32 speak you would need to do that by hand, but that
doesn't work as you don't have the Ethernet type context available.
Am I missing something? :)
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 12:31                   ` Daniel Borkmann
@ 2015-04-08 12:58                     ` Jamal Hadi Salim
  2015-04-08 13:14                       ` Thomas Graf
  0 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2015-04-08 12:58 UTC (permalink / raw)
  To: Daniel Borkmann, Jiri Pirko
  Cc: Alexei Starovoitov, David Miller, netdev, tgraf
On 04/08/15 08:31, Daniel Borkmann wrote:
> So aside from the ifb dev, the only guys caring about AT_INGRESS
> and AT_EGRESS is netem and act_mirred.
And the IFE action (which i havent submitted because Dave wanted me to
see if i can get an IEEE ethertype allocated first).
Essentially - this is an important optimization. If some action is
interested to find where it is connected the information is there.
But we dont want to impose on every action/classifier to be part of
this activity.
>That means the tc's cls_u32
> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
> either,so in u32 speak you would need to do that by hand, but that
> doesn't work as you don't have the Ethernet type context available.
> Am I missing something? :)
u32 works fine. I am sure i have tests which run these on both
in/egress.
cheers,
jamal
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 12:58                     ` Jamal Hadi Salim
@ 2015-04-08 13:14                       ` Thomas Graf
  2015-04-08 13:27                         ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Graf @ 2015-04-08 13:14 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Daniel Borkmann, Jiri Pirko, Alexei Starovoitov, David Miller,
	netdev
On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
> On 04/08/15 08:31, Daniel Borkmann wrote:
> >That means the tc's cls_u32
> >sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
> >either,so in u32 speak you would need to do that by hand, but that
> >doesn't work as you don't have the Ethernet type context available.
> >Am I missing something? :)
> 
> u32 works fine. I am sure i have tests which run these on both
> in/egress.
His point is that an u32 filter written for egress won't work at
ingress because the offsets are different. This has always been the
case and we can't break this behaviour either. I'm sure you have
these weird negative offset u32 egress filters in your repertoire
as well ;-)
Even if we made ingress offsets start at the network header we
would still have to make this depend on a flag which new users have
to set.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:14                       ` Thomas Graf
@ 2015-04-08 13:27                         ` Daniel Borkmann
  2015-04-08 13:34                           ` Jiri Pirko
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 13:27 UTC (permalink / raw)
  To: Thomas Graf, Jamal Hadi Salim
  Cc: Jiri Pirko, Alexei Starovoitov, David Miller, netdev
On 04/08/2015 03:14 PM, Thomas Graf wrote:
> On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>> On 04/08/15 08:31, Daniel Borkmann wrote:
>>> That means the tc's cls_u32
>>> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>> either,so in u32 speak you would need to do that by hand, but that
>>> doesn't work as you don't have the Ethernet type context available.
>>> Am I missing something? :)
>>
>> u32 works fine. I am sure i have tests which run these on both
>> in/egress.
>
> His point is that an u32 filter written for egress won't work at
> ingress because the offsets are different. This has always been the
> case and we can't break this behaviour either. I'm sure you have
> these weird negative offset u32 egress filters in your repertoire
> as well ;-)
Okay, you can use negative offsets in cls_u32 to accomodate for
that; so yeah, you'd need to implement your filter differently
on ingress. That should also work on cls_bpf et al.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:27                         ` Daniel Borkmann
@ 2015-04-08 13:34                           ` Jiri Pirko
  2015-04-08 13:41                             ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Jiri Pirko @ 2015-04-08 13:34 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Thomas Graf, Jamal Hadi Salim, Alexei Starovoitov, David Miller,
	netdev
Wed, Apr 08, 2015 at 03:27:57PM CEST, daniel@iogearbox.net wrote:
>On 04/08/2015 03:14 PM, Thomas Graf wrote:
>>On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>>>On 04/08/15 08:31, Daniel Borkmann wrote:
>>>>That means the tc's cls_u32
>>>>sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>>>either,so in u32 speak you would need to do that by hand, but that
>>>>doesn't work as you don't have the Ethernet type context available.
>>>>Am I missing something? :)
>>>
>>>u32 works fine. I am sure i have tests which run these on both
>>>in/egress.
>>
>>His point is that an u32 filter written for egress won't work at
>>ingress because the offsets are different. This has always been the
>>case and we can't break this behaviour either. I'm sure you have
>>these weird negative offset u32 egress filters in your repertoire
>>as well ;-)
>
>Okay, you can use negative offsets in cls_u32 to accomodate for
>that; so yeah, you'd need to implement your filter differently
>on ingress. That should also work on cls_bpf et al.
That is certainly doable. But is that what we want? I don't think so. I
would like to have the same for in/eg.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:34                           ` Jiri Pirko
@ 2015-04-08 13:41                             ` Daniel Borkmann
  2015-04-08 13:47                               ` Thomas Graf
  2015-04-08 13:52                               ` Jamal Hadi Salim
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 13:41 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, Jamal Hadi Salim, Alexei Starovoitov, David Miller,
	netdev
On 04/08/2015 03:34 PM, Jiri Pirko wrote:
> Wed, Apr 08, 2015 at 03:27:57PM CEST, daniel@iogearbox.net wrote:
>> On 04/08/2015 03:14 PM, Thomas Graf wrote:
>>> On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>>>> On 04/08/15 08:31, Daniel Borkmann wrote:
>>>>> That means the tc's cls_u32
>>>>> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>>>> either,so in u32 speak you would need to do that by hand, but that
>>>>> doesn't work as you don't have the Ethernet type context available.
>>>>> Am I missing something? :)
>>>>
>>>> u32 works fine. I am sure i have tests which run these on both
>>>> in/egress.
>>>
>>> His point is that an u32 filter written for egress won't work at
>>> ingress because the offsets are different. This has always been the
>>> case and we can't break this behaviour either. I'm sure you have
>>> these weird negative offset u32 egress filters in your repertoire
>>> as well ;-)
>>
>> Okay, you can use negative offsets in cls_u32 to accomodate for
>> that; so yeah, you'd need to implement your filter differently
>> on ingress. That should also work on cls_bpf et al.
>
> That is certainly doable. But is that what we want? I don't think so. I
> would like to have the same for in/eg.
I mean it's certainly a non-obvious hack, where user space has to
fix up something that the kernel should have gotten right in the
first place. :/
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:41                             ` Daniel Borkmann
@ 2015-04-08 13:47                               ` Thomas Graf
  2015-04-08 13:52                               ` Jamal Hadi Salim
  1 sibling, 0 replies; 25+ messages in thread
From: Thomas Graf @ 2015-04-08 13:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jiri Pirko, Jamal Hadi Salim, Alexei Starovoitov, David Miller,
	netdev
On 04/08/15 at 03:41pm, Daniel Borkmann wrote:
> On 04/08/2015 03:34 PM, Jiri Pirko wrote:
> >That is certainly doable. But is that what we want? I don't think so. I
> >would like to have the same for in/eg.
> 
> I mean it's certainly a non-obvious hack, where user space has to
> fix up something that the kernel should have gotten right in the
> first place. :/
We should be careful to not break existing scripts. I realize
it's horribly broken but this stuff is being used.
We need the user to signal that he is aware that offsets changed
and subtract the ll header from offsets of unaware users.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:41                             ` Daniel Borkmann
  2015-04-08 13:47                               ` Thomas Graf
@ 2015-04-08 13:52                               ` Jamal Hadi Salim
  2015-04-08 14:53                                 ` Daniel Borkmann
  2015-04-08 16:26                                 ` Alexei Starovoitov
  1 sibling, 2 replies; 25+ messages in thread
From: Jamal Hadi Salim @ 2015-04-08 13:52 UTC (permalink / raw)
  To: Daniel Borkmann, Jiri Pirko
  Cc: Thomas Graf, Alexei Starovoitov, David Miller, netdev
On 04/08/15 09:41, Daniel Borkmann wrote:
> On 04/08/2015 03:34 PM, Jiri Pirko wrote:
>> Wed, Apr 08, 2015 at 03:27:57PM CEST, daniel@iogearbox.net wrote:
>>> On 04/08/2015 03:14 PM, Thomas Graf wrote:
>>>> On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>>>>> On 04/08/15 08:31, Daniel Borkmann wrote:
>>>>>> That means the tc's cls_u32
>>>>>> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>>>>> either,so in u32 speak you would need to do that by hand, but that
>>>>>> doesn't work as you don't have the Ethernet type context available.
>>>>>> Am I missing something? :)
>>>>>
>>>>> u32 works fine. I am sure i have tests which run these on both
>>>>> in/egress.
>>>>
>>>> His point is that an u32 filter written for egress won't work at
>>>> ingress because the offsets are different. This has always been the
>>>> case and we can't break this behaviour either. I'm sure you have
>>>> these weird negative offset u32 egress filters in your repertoire
>>>> as well ;-)
>>>
>>> Okay, you can use negative offsets in cls_u32 to accomodate for
>>> that; so yeah, you'd need to implement your filter differently
>>> on ingress. That should also work on cls_bpf et al.
>>
>> That is certainly doable. But is that what we want? I don't think so. I
>> would like to have the same for in/eg.
>
> I mean it's certainly a non-obvious hack, where user space has to
> fix up something that the kernel should have gotten right in the
> first place. :/
Try something like:
on ingress
filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
10.0.0.21/32 flowid 1:16 action ok
and egress
filter add dev eth0 parent 1: protocol ip prio 6 u32 match ip src \
10.0.0.21/32 flowid 1:16 action ok
and see if you need negative offsets for anything.
If you really must adjust in the kernel then use the indicators which
tell you where you are at.
Negative offsets are used if you must go beyond the network header. This
has been the expectation so far (whether it is actions or classifiers),
example, here is something that mucks around with ethernet
headers:
tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match ip src 192.168.1.10/32 flowid 1:2 \
action pedit munge offset -14 u16 set 0x0000 \
munge offset -12 u32 set 0x00010100 \
munge offset -8 u32 set 0x0aaf0100 \
munge offset -4 u32 set 0x0008ec06 pipe \
action mirred egress redirect dev eth1
cheers,
jamal
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:52                               ` Jamal Hadi Salim
@ 2015-04-08 14:53                                 ` Daniel Borkmann
  2015-04-08 16:26                                 ` Alexei Starovoitov
  1 sibling, 0 replies; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 14:53 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko
  Cc: Thomas Graf, Alexei Starovoitov, David Miller, netdev
On 04/08/2015 03:52 PM, Jamal Hadi Salim wrote:
> On 04/08/15 09:41, Daniel Borkmann wrote:
>> On 04/08/2015 03:34 PM, Jiri Pirko wrote:
>>> Wed, Apr 08, 2015 at 03:27:57PM CEST, daniel@iogearbox.net wrote:
>>>> On 04/08/2015 03:14 PM, Thomas Graf wrote:
>>>>> On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>>>>>> On 04/08/15 08:31, Daniel Borkmann wrote:
>>>>>>> That means the tc's cls_u32
>>>>>>> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>>>>>> either,so in u32 speak you would need to do that by hand, but that
>>>>>>> doesn't work as you don't have the Ethernet type context available.
>>>>>>> Am I missing something? :)
>>>>>>
>>>>>> u32 works fine. I am sure i have tests which run these on both
>>>>>> in/egress.
>>>>>
>>>>> His point is that an u32 filter written for egress won't work at
>>>>> ingress because the offsets are different. This has always been the
>>>>> case and we can't break this behaviour either. I'm sure you have
>>>>> these weird negative offset u32 egress filters in your repertoire
>>>>> as well ;-)
>>>>
>>>> Okay, you can use negative offsets in cls_u32 to accomodate for
>>>> that; so yeah, you'd need to implement your filter differently
>>>> on ingress. That should also work on cls_bpf et al.
>>>
>>> That is certainly doable. But is that what we want? I don't think so. I
>>> would like to have the same for in/eg.
>>
>> I mean it's certainly a non-obvious hack, where user space has to
>> fix up something that the kernel should have gotten right in the
>> first place. :/
>
> Try something like:
> on ingress
> filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
> 10.0.0.21/32 flowid 1:16 action ok
> and egress
> filter add dev eth0 parent 1: protocol ip prio 6 u32 match ip src \
> 10.0.0.21/32 flowid 1:16 action ok
>
> and see if you need negative offsets for anything.
Yep, that works because from tc side, you push down offsets where
it's assumed that ip, etc start at offset 0, and in u32_classify()
offsets are fixed up for ingress/egress via skb_network_offset()
stored in off.
> If you really must adjust in the kernel then use the indicators which
> tell you where you are at.
>
> Negative offsets are used if you must go beyond the network header. This
> has been the expectation so far (whether it is actions or classifiers),
Yes, exactly, meaning negative offsets need to be used no matter
whether from ingress or egress, but with the advantage to be able
to use the same u32 filter on both sides. Ok, I see. Need to think
about if we could easily do the same for BPF while keeping compat.
> example, here is something that mucks around with ethernet
> headers:
>
> tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
> match ip src 192.168.1.10/32 flowid 1:2 \
> action pedit munge offset -14 u16 set 0x0000 \
> munge offset -12 u32 set 0x00010100 \
> munge offset -8 u32 set 0x0aaf0100 \
> munge offset -4 u32 set 0x0008ec06 pipe \
> action mirred egress redirect dev eth1
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 13:52                               ` Jamal Hadi Salim
  2015-04-08 14:53                                 ` Daniel Borkmann
@ 2015-04-08 16:26                                 ` Alexei Starovoitov
  2015-04-08 16:32                                   ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08 16:26 UTC (permalink / raw)
  To: Jamal Hadi Salim, Daniel Borkmann, Jiri Pirko
  Cc: Thomas Graf, David Miller, netdev
On 4/8/15 6:52 AM, Jamal Hadi Salim wrote:
> On 04/08/15 09:41, Daniel Borkmann wrote:
>> On 04/08/2015 03:34 PM, Jiri Pirko wrote:
>>> Wed, Apr 08, 2015 at 03:27:57PM CEST, daniel@iogearbox.net wrote:
>>>> On 04/08/2015 03:14 PM, Thomas Graf wrote:
>>>>> On 04/08/15 at 08:58am, Jamal Hadi Salim wrote:
>>>>>> On 04/08/15 08:31, Daniel Borkmann wrote:
>>>>>>> That means the tc's cls_u32
>>>>>>> sample selectors a la ip, ip6, udp, tcp, icmp don't work on ingress
>>>>>>> either,so in u32 speak you would need to do that by hand, but that
>>>>>>> doesn't work as you don't have the Ethernet type context available.
>>>>>>> Am I missing something? :)
>>>>>>
>>>>>> u32 works fine. I am sure i have tests which run these on both
>>>>>> in/egress.
>>>>>
>>>>> His point is that an u32 filter written for egress won't work at
>>>>> ingress because the offsets are different. This has always been the
>>>>> case and we can't break this behaviour either. I'm sure you have
>>>>> these weird negative offset u32 egress filters in your repertoire
>>>>> as well ;-)
>>>>
>>>> Okay, you can use negative offsets in cls_u32 to accomodate for
>>>> that; so yeah, you'd need to implement your filter differently
>>>> on ingress. That should also work on cls_bpf et al.
>>>
>>> That is certainly doable. But is that what we want? I don't think so. I
>>> would like to have the same for in/eg.
>>
>> I mean it's certainly a non-obvious hack, where user space has to
>> fix up something that the kernel should have gotten right in the
>> first place. :/
>
> Try something like:
> on ingress
> filter add dev eth0 parent ffff: protocol ip prio 6 u32 match ip src \
> 10.0.0.21/32 flowid 1:16 action ok
> and egress
> filter add dev eth0 parent 1: protocol ip prio 6 u32 match ip src \
> 10.0.0.21/32 flowid 1:16 action ok
>
> and see if you need negative offsets for anything.
>
> If you really must adjust in the kernel then use the indicators which
> tell you where you are at.
>
> Negative offsets are used if you must go beyond the network header. This
> has been the expectation so far (whether it is actions or classifiers),
> example, here is something that mucks around with ethernet
> headers:
>
> tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
> match ip src 192.168.1.10/32 flowid 1:2 \
> action pedit munge offset -14 u16 set 0x0000 \
> munge offset -12 u32 set 0x00010100 \
> munge offset -8 u32 set 0x0aaf0100 \
> munge offset -4 u32 set 0x0008ec06 pipe \
> action mirred egress redirect dev eth1
here u32 assumes ethernet L2 and can hard code -14.
bpf programs cannot. Their key feature is parsing the packet.
Look at libpcap it can parse weird protocols that kernel doesn't
even know about.
Theoretically we can pass hard_header_len/network_offset/starting_offset
into bpf program and add this offset everywhere. It would mean that
LD_ABS instruction becomes useless and performance immediately drops.
We cannot use libpcap and pretty much all existing classic and extended
bpf programs, since they assume starting at L2 and parsing its way.
That is awful already, but more importantly this l2 vs l3 or
ingress vs egress qdisc exposes this kernel specific difference to user
space. bpf programs should know nothing about the kernel.
They should only see packet + metadata. So network_offset+-off style of
access for bpf programs is non starter. It's perfectly fine for other
classifiers/actions because their are part of the kernel.
Reading through the thread looks like the only option is to add
'needs_l2' flag to ingress qdisc and if it's there do skb_share_check
and push. Also add similar flags to classifier/actions and only
allow them to be added to such ingress qdisc if flags match.
It's a bit intrusive, but won't cost any performance for
existing users.
Another alternative is in cls_bpf do:
if(skb_shared) return -1;
else push L2 and run the program.
This way if tap is added to the interface with ingress+bpf
the programs will stop working as soon as tap is added.
That's not great, but very small isolated change.
My preference is to add 'needs_l2' flag to ingress qdisc.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 16:26                                 ` Alexei Starovoitov
@ 2015-04-08 16:32                                   ` David Miller
  2015-04-08 16:44                                     ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-08 16:32 UTC (permalink / raw)
  To: ast; +Cc: jhs, daniel, jiri, tgraf, netdev
From: Alexei Starovoitov <ast@plumgrid.com>
Date: Wed, 08 Apr 2015 09:26:36 -0700
> My preference is to add 'needs_l2' flag to ingress qdisc.
The problem is that needs_l2 is not property of individual qdisc,
but conditionally 1 or more things sitting behind it.
You can mix u32 and bpf classifiers.  One wants need_L2 another
does not, and you therefore cannot handle this problem in this
manner.
Face it, we're stuck with what we have.  And I think you will
have to adjust generated bpf program based upon whether it is
being attacked to ingress or egress qdisc.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 16:32                                   ` David Miller
@ 2015-04-08 16:44                                     ` Alexei Starovoitov
  2015-04-08 16:54                                       ` Daniel Borkmann
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2015-04-08 16:44 UTC (permalink / raw)
  To: David Miller; +Cc: jhs, daniel, jiri, tgraf, netdev
On 4/8/15 9:32 AM, David Miller wrote:
> From: Alexei Starovoitov <ast@plumgrid.com>
> Date: Wed, 08 Apr 2015 09:26:36 -0700
>
>> My preference is to add 'needs_l2' flag to ingress qdisc.
>
> The problem is that needs_l2 is not property of individual qdisc,
> but conditionally 1 or more things sitting behind it.
>
> You can mix u32 and bpf classifiers.  One wants need_L2 another
> does not, and you therefore cannot handle this problem in this
> manner.
that is still ok.
I'm proposing multiple flags. One for ingress qdisc and another
flag for all cls/acts whether they care about l2 or not.
Then when cls is attached to ingress_with_l2 we will check whether
this cls is ready or not.
so cls_bpf will have flag L2_ONLY
whereas cls_u32 will be L2 | L3
and 50% of other cls/acts will be L2 | L3
some cls/acts will be L3 only until they're fixed.
The users will create ingress qdisc with 'needs_l2' flag only
when they need to attach cls_bpf to it. All existing users
won't notice the change.
Looks pretty clean to me.
> Face it, we're stuck with what we have.  And I think you will
> have to adjust generated bpf program based upon whether it is
> being attacked to ingress or egress qdisc.
Presence of L2 is fundamental in bpf architecture. I'd rather
disable them on ingress than face this mess, since it will be
horrendous in the long run.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent
  2015-04-08 16:44                                     ` Alexei Starovoitov
@ 2015-04-08 16:54                                       ` Daniel Borkmann
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Borkmann @ 2015-04-08 16:54 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, jhs, jiri, tgraf, netdev
On 04/08/2015 06:44 PM, Alexei Starovoitov wrote:
> On 4/8/15 9:32 AM, David Miller wrote:
>> From: Alexei Starovoitov <ast@plumgrid.com>
>> Date: Wed, 08 Apr 2015 09:26:36 -0700
>>
>>> My preference is to add 'needs_l2' flag to ingress qdisc.
>>
>> The problem is that needs_l2 is not property of individual qdisc,
>> but conditionally 1 or more things sitting behind it.
>>
>> You can mix u32 and bpf classifiers.  One wants need_L2 another
>> does not, and you therefore cannot handle this problem in this
>> manner.
>
> that is still ok.
> I'm proposing multiple flags. One for ingress qdisc and another
> flag for all cls/acts whether they care about l2 or not.
> Then when cls is attached to ingress_with_l2 we will check whether
> this cls is ready or not.
> so cls_bpf will have flag L2_ONLY
> whereas cls_u32 will be L2 | L3
> and 50% of other cls/acts will be L2 | L3
> some cls/acts will be L3 only until they're fixed.
>
> The users will create ingress qdisc with 'needs_l2' flag only
> when they need to attach cls_bpf to it. All existing users
> won't notice the change.
I think that could work, it would also allow for keeping compat
with the existing users.
^ permalink raw reply	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2015-04-08 16:54 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-08  1:03 [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Alexei Starovoitov
2015-04-08  1:03 ` [PATCH v2 net-next 2/2] tc: make ingress and egress qdiscs consistent Alexei Starovoitov
2015-04-08  2:35   ` David Miller
2015-04-08  3:22     ` Alexei Starovoitov
2015-04-08  4:48       ` Alexei Starovoitov
2015-04-08  8:36         ` Daniel Borkmann
2015-04-08  9:05           ` Jiri Pirko
2015-04-08 10:54             ` Daniel Borkmann
2015-04-08 11:14               ` Daniel Borkmann
2015-04-08 11:47                 ` Jamal Hadi Salim
2015-04-08 12:31                   ` Daniel Borkmann
2015-04-08 12:58                     ` Jamal Hadi Salim
2015-04-08 13:14                       ` Thomas Graf
2015-04-08 13:27                         ` Daniel Borkmann
2015-04-08 13:34                           ` Jiri Pirko
2015-04-08 13:41                             ` Daniel Borkmann
2015-04-08 13:47                               ` Thomas Graf
2015-04-08 13:52                               ` Jamal Hadi Salim
2015-04-08 14:53                                 ` Daniel Borkmann
2015-04-08 16:26                                 ` Alexei Starovoitov
2015-04-08 16:32                                   ` David Miller
2015-04-08 16:44                                     ` Alexei Starovoitov
2015-04-08 16:54                                       ` Daniel Borkmann
2015-04-08 11:43   ` Jamal Hadi Salim
2015-04-08  7:25 ` [PATCH v2 net-next 1/2] net: introduce skb_postpush_rcsum() helper Daniel Borkmann
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).