netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next 0/3] tc state machinery cleanups
@ 2015-05-15  8:50 Florian Westphal
  2015-05-15  8:50 ` [PATCH -next 1/3] net: sched: remove FROM INGRESS/EGRESS Florian Westphal
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-15  8:50 UTC (permalink / raw)
  To: netdev; +Cc: jhs, alexei.starovoitov, daniel

This series prepares removal of tc_verd member from sk_buff.

It simplifies tc state machinery to what is required to keep current
mirred/ifb combinations working.

I tested a few scenarios, namely:

1 - htb based shaping on egress
2 - netem attached to ifb with mirred redirect from ingress qdisc
3 - mirred to different egress device
4 - mirred to ifb egress device with qdiscs set up on ifb
    to provide illusion of 'single' transmit interface for traffic shaping

After this series tc_verd is only used by ifb to skip actions on egress.

Part #2 of this series will remove tc_verd completely.

motivation is two-fold:
1) make states and state transitions more obvious
2) provide a way to later reduce skb size by 8 bytes
   (s/u16 mac_len/u8 mac_len/ would result in
    two 2 byte and one 4 byte hole, i.e. 8 byte reduction with
    minor reshuffling).

 drivers/net/ifb.c                    |   18 +++++++++---------
 drivers/staging/octeon/ethernet-tx.c |    1 +
 include/linux/skbuff.h               |    6 ++++--
 include/net/pkt_sched.h              |   18 ++++++++++++++++++
 include/uapi/linux/pkt_cls.h         |    6 +++---
 net/core/dev.c                       |   14 ++++++--------
 net/sched/act_mirred.c               |   17 +++++++++--------
 net/sched/sch_netem.c                |    2 +-
 8 files changed, 51 insertions(+), 31 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH -next 1/3] net: sched: remove FROM INGRESS/EGRESS
  2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
@ 2015-05-15  8:50 ` Florian Westphal
  2015-05-15  8:50 ` [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS Florian Westphal
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-15  8:50 UTC (permalink / raw)
  To: netdev; +Cc: jhs, alexei.starovoitov, daniel, Florian Westphal

Jamal explains:
| Since tc can be applied only per netdev, redirecting to ifb from
| many netdevs allows us to provide illusion we can have groupings
| of netdevs.
| The role of ifb is, after completing processing, to return the packet
| the spot it found it in the code path before the redirect
| (i.e if it is on ingress, then it will show up back on ingress;
| likewise if it was on egress).

This ingress/egress information (FROM IN/EGRESS; not to be confused with
AT_INGRESS/EGRESS values returned by G_TC_AT) is set up by the 'mirred'
action to tell IFB at which spot we need to return the packet to.

This change introduces skb->skb_tc_state enum to track which traffic
control processing state this skb is in.

If the mirred action is called via classifiers on ingress (indicated
by G_TC_AT() returning AT_INGRESS) skb_tc_state enters TC_FROM_INGRESS.

If mirred is called from egress path (G_TC_AT returns AT_EGRESS), then
it will be in TC_FROM_EGRESS state.

INGRESS/EGRESS are mutually exclusive.

ifb uses this to decide if it needs to call netif_rx (TC_FROM_INGRESS),
dev_queue_xmit (TC_FROM_EGRESS) or if skb must be dropped (tc_state is 0).

tested via:

ip link set dev ifb0 up
ip link set dev eth1 up
ip addr add  192.168.42.1/24 dev eth1
tc qdisc add dev eth1 root handle 1: htb default 1
tc filter add dev eth1 parent 1: protocol all u32 match u32 0 0 action \
  mirred egress redirect dev ifb0

With help from Jamal Hadi Salim.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/net/ifb.c                    | 18 +++++++++---------
 drivers/staging/octeon/ethernet-tx.c |  1 +
 include/linux/skbuff.h               |  4 +++-
 include/net/pkt_sched.h              |  6 ++++++
 include/uapi/linux/pkt_cls.h         |  2 +-
 net/sched/act_mirred.c               |  9 ++++++---
 net/sched/sch_netem.c                |  2 +-
 7 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 94570aa..bbce359 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -79,8 +79,6 @@ static void ri_tasklet(unsigned long dev)
 	}
 
 	while ((skb = __skb_dequeue(&dp->tq)) != NULL) {
-		u32 from = G_TC_FROM(skb->tc_verd);
-
 		skb->tc_verd = 0;
 		skb->tc_verd = SET_TC_NCLS(skb->tc_verd);
 
@@ -102,13 +100,16 @@ static void ri_tasklet(unsigned long dev)
 		rcu_read_unlock();
 		skb->skb_iif = _dev->ifindex;
 
-		if (from & AT_EGRESS) {
-			dev_queue_xmit(skb);
-		} else if (from & AT_INGRESS) {
+		switch (skb->skb_tc_state) {
+		case TC_FROM_INGRESS:
 			skb_pull(skb, skb->mac_len);
 			netif_receive_skb(skb);
-		} else
-			BUG();
+			break;
+		case TC_FROM_EGRESS:
+			skb->skb_tc_state = 0;
+			dev_queue_xmit(skb);
+			break;
+		}
 	}
 
 	if (__netif_tx_trylock(txq)) {
@@ -193,14 +194,13 @@ static void ifb_setup(struct net_device *dev)
 static netdev_tx_t ifb_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ifb_private *dp = netdev_priv(dev);
-	u32 from = G_TC_FROM(skb->tc_verd);
 
 	u64_stats_update_begin(&dp->rsync);
 	dp->rx_packets++;
 	dp->rx_bytes += skb->len;
 	u64_stats_update_end(&dp->rsync);
 
-	if (!(from & (AT_INGRESS|AT_EGRESS)) || !skb->skb_iif) {
+	if (!skb->skb_tc_state || !skb->skb_iif) {
 		dev_kfree_skb(skb);
 		dev->stats.rx_dropped++;
 		return NETDEV_TX_OK;
diff --git a/drivers/staging/octeon/ethernet-tx.c b/drivers/staging/octeon/ethernet-tx.c
index 5b9ac1f..4656af7 100644
--- a/drivers/staging/octeon/ethernet-tx.c
+++ b/drivers/staging/octeon/ethernet-tx.c
@@ -403,6 +403,7 @@ int cvm_oct_xmit(struct sk_buff *skb, struct net_device *dev)
 #ifdef CONFIG_NET_SCHED
 	skb->tc_index = 0;
 #ifdef CONFIG_NET_CLS_ACT
+	skb->skb_tc_state = 0;
 	skb->tc_verd = 0;
 #endif /* CONFIG_NET_CLS_ACT */
 #endif /* CONFIG_NET_SCHED */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f83aa65..4a1367e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -487,6 +487,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@hash: the packet hash
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@xmit_more: More SKBs are pending for this queue
+ *	@skb_tc_state: was mirrored (act_mirred)
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
  *	@l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -614,7 +615,8 @@ struct sk_buff {
 	__u8			ipvs_property:1;
 	__u8			inner_protocol_type:1;
 	__u8			remcsum_offload:1;
-	/* 3 or 5 bit hole */
+	__u8			skb_tc_state:2;	/* traffic control state enum */
+	/* 1 or 3 bit hole */
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2342bf1..a4c509d 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -134,4 +134,10 @@ static inline unsigned int psched_mtu(const struct net_device *dev)
 	return dev->mtu + dev->hard_header_len;
 }
 
+enum skb_tc_state {
+	/* set by act_mirred to tell IFB that skb needs to be ... */
+	TC_FROM_INGRESS = 1, /* ... re-injected to local stack */
+	TC_FROM_EGRESS = 2,  /* ... transmitted to device */
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 39fb53d..3308e89 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -50,13 +50,13 @@ bits 9,10,11: redirect counter -  redirect TTL. Loop avoidance
 #define G_TC_VERD(x)       _TC_GETVALUE(x,S_TC_VERD,M_TC_VERD)
 #define V_TC_VERD(x)       _TC_MAKEVALUE(x,S_TC_VERD)
 #define SET_TC_VERD(v,n)   ((V_TC_VERD(n)) | (v & ~M_TC_VERD))
-#endif
 
 #define S_TC_FROM          _TC_MAKE32(6)
 #define M_TC_FROM          _TC_MAKEMASK(2,S_TC_FROM)
 #define G_TC_FROM(x)       _TC_GETVALUE(x,S_TC_FROM,M_TC_FROM)
 #define V_TC_FROM(x)       _TC_MAKEVALUE(x,S_TC_FROM)
 #define SET_TC_FROM(v,n)   ((V_TC_FROM(n)) | (v & ~M_TC_FROM))
+#endif
 #define AT_STACK	0x0
 #define AT_INGRESS	0x1
 #define AT_EGRESS	0x2
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index a42a3b2..34d4320 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -161,9 +161,12 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
 	}
 
 	/* mirror is always swallowed */
-	if (m->tcfm_eaction != TCA_EGRESS_MIRROR)
-		skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
-
+	if (m->tcfm_eaction != TCA_EGRESS_MIRROR) {
+		if (at & AT_INGRESS)
+			skb2->skb_tc_state = TC_FROM_INGRESS;
+		else if (at & AT_EGRESS)
+			skb2->skb_tc_state = TC_FROM_EGRESS;
+	}
 	skb2->skb_iif = skb->dev->ifindex;
 	skb2->dev = dev;
 	err = dev_queue_xmit(skb2);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 5abd1d9..760cf43 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -588,7 +588,7 @@ deliver:
 			 * If it's at ingress let's pretend the delay is
 			 * from the network (tstamp will be updated).
 			 */
-			if (G_TC_FROM(skb->tc_verd) & AT_INGRESS)
+			if (skb->skb_tc_state == TC_FROM_INGRESS)
 				skb->tstamp.tv64 = 0;
 #endif
 
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
  2015-05-15  8:50 ` [PATCH -next 1/3] net: sched: remove FROM INGRESS/EGRESS Florian Westphal
@ 2015-05-15  8:50 ` Florian Westphal
  2015-05-15 16:23   ` Alexei Starovoitov
  2015-05-15  8:50 ` [PATCH -next 3/3] net: core: use skb_tc_state to skip ingress classifiers Florian Westphal
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-15  8:50 UTC (permalink / raw)
  To: netdev; +Cc: jhs, alexei.starovoitov, daniel, Florian Westphal

act_mirred needs to know when it is invoked from ingress sched so it
knows that it has to push the l2 header back (which is already in place
if its called via the 'normal' egress path).

This is currently done via SET_TC_AT(), and explicit AT_INGRESS/EGRESS.
But some of this information is redundant.

The skb can only be in one of four states:

- FROM_INGRESS: skb was redirected via act_mirred and should be
                reinjected into ingress path
- FROM_EGRESS:  skb was redirected via act_mirred and needs to
                be transmitted via the device that the mirred action is
		attached to.

These (existing) two states are set by act_mirred, IFB driver is consumer.

The third state is 'zero', which means the skb has not been handled by
any part of the tc machinery (or has no "special properties" tc
needs to be aware of).

This adds the 4th skb_tc_state: TC_AT_INGRESS.

This is set when calling tc_classify in the ingress path.  The mirred
action uses this to decide the state of the cloned skb that it operates on:

original      clone
TC_AT_INGRESS TC_FROM_INGRESS
0             TC_FROM_EGRESS

We also remove the need to re-set tc_verd AT state in dev_queue_xmit.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h       |  2 +-
 include/net/pkt_sched.h      | 12 ++++++++++++
 include/uapi/linux/pkt_cls.h |  4 ++--
 net/core/dev.c               |  6 ++----
 net/sched/act_mirred.c       |  8 +++-----
 5 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4a1367e..906dc35 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -487,7 +487,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  *	@hash: the packet hash
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@xmit_more: More SKBs are pending for this queue
- *	@skb_tc_state: was mirrored (act_mirred)
+ *	@skb_tc_state: was mirrored (act_mirred) or is handled via sch_ingress
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
  *	@l4_hash: indicate hash is a canonical 4-tuple hash over transport
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index a4c509d..63328cf 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -134,10 +134,22 @@ static inline unsigned int psched_mtu(const struct net_device *dev)
 	return dev->mtu + dev->hard_header_len;
 }
 
+/* traffic control processing state of the skb.
+ *
+ * This is mainly used by IFB driver, the 'mirred' action, and
+ * the ingress scheduler (sch_ingress).
+ */
 enum skb_tc_state {
+	TC_NO_STATE = 0, /* must be 0 */
+
 	/* set by act_mirred to tell IFB that skb needs to be ... */
 	TC_FROM_INGRESS = 1, /* ... re-injected to local stack */
 	TC_FROM_EGRESS = 2,  /* ... transmitted to device */
+
+	/* used by act_mirred to learn its called during skb rx processing
+	 * and has to push back the (already pulled) l2 header.
+	 */
+	TC_AT_INGRESS = 3,
 };
 
 #endif
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 3308e89..271c788 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -56,10 +56,10 @@ bits 9,10,11: redirect counter -  redirect TTL. Loop avoidance
 #define G_TC_FROM(x)       _TC_GETVALUE(x,S_TC_FROM,M_TC_FROM)
 #define V_TC_FROM(x)       _TC_MAKEVALUE(x,S_TC_FROM)
 #define SET_TC_FROM(v,n)   ((V_TC_FROM(n)) | (v & ~M_TC_FROM))
-#endif
 #define AT_STACK	0x0
 #define AT_INGRESS	0x1
 #define AT_EGRESS	0x2
+#endif
 
 #define TC_NCLS          _TC_MAKEMASK1(8)
 #define SET_TC_NCLS(v)   ( TC_NCLS | (v & ~TC_NCLS))
@@ -71,13 +71,13 @@ bits 9,10,11: redirect counter -  redirect TTL. Loop avoidance
 #define G_TC_RTTL(x)       _TC_GETVALUE(x,S_TC_RTTL,M_TC_RTTL)
 #define V_TC_RTTL(x)       _TC_MAKEVALUE(x,S_TC_RTTL)
 #define SET_TC_RTTL(v,n)   ((V_TC_RTTL(n)) | (v & ~M_TC_RTTL))
-#endif
 
 #define S_TC_AT          _TC_MAKE32(12)
 #define M_TC_AT          _TC_MAKEMASK(2,S_TC_AT)
 #define G_TC_AT(x)       _TC_GETVALUE(x,S_TC_AT,M_TC_AT)
 #define V_TC_AT(x)       _TC_MAKEVALUE(x,S_TC_AT)
 #define SET_TC_AT(v,n)   ((V_TC_AT(n)) | (v & ~M_TC_AT))
+#endif
 
 /* Action attributes */
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index 0e7afef..802b9b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3071,9 +3071,6 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	txq = netdev_pick_tx(dev, skb, accel_priv);
 	q = rcu_dereference_bh(txq->qdisc);
 
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
-#endif
 	trace_net_dev_queue(skb);
 	if (q->enqueue) {
 		rc = __dev_xmit_skb(skb, q, dev, txq);
@@ -3648,7 +3645,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 	}
 
 	qdisc_skb_cb(skb)->pkt_len = skb->len;
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
+	skb->skb_tc_state = TC_AT_INGRESS;
 	qdisc_bstats_update_cpu(cl->q, skb);
 
 	switch (tc_classify(skb, cl, &cl_res)) {
@@ -3665,6 +3662,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 	default:
 		break;
 	}
+	skb->skb_tc_state = 0;
 
 	return skb;
 }
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 34d4320..b8d70de 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -131,7 +131,6 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_mirred *m = a->priv;
 	struct net_device *dev;
 	struct sk_buff *skb2;
-	u32 at;
 	int retval, err = 1;
 
 	spin_lock(&m->tcf_lock);
@@ -150,21 +149,20 @@ static int tcf_mirred(struct sk_buff *skb, const struct tc_action *a,
 		goto out;
 	}
 
-	at = G_TC_AT(skb->tc_verd);
 	skb2 = skb_clone(skb, GFP_ATOMIC);
 	if (skb2 == NULL)
 		goto out;
 
-	if (!(at & AT_EGRESS)) {
+	if (skb->skb_tc_state == TC_AT_INGRESS) {
 		if (m->tcfm_ok_push)
 			skb_push(skb2, skb->mac_len);
 	}
 
 	/* mirror is always swallowed */
 	if (m->tcfm_eaction != TCA_EGRESS_MIRROR) {
-		if (at & AT_INGRESS)
+		if (skb->skb_tc_state == TC_AT_INGRESS)
 			skb2->skb_tc_state = TC_FROM_INGRESS;
-		else if (at & AT_EGRESS)
+		else
 			skb2->skb_tc_state = TC_FROM_EGRESS;
 	}
 	skb2->skb_iif = skb->dev->ifindex;
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH -next 3/3] net: core: use skb_tc_state to skip ingress classifiers
  2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
  2015-05-15  8:50 ` [PATCH -next 1/3] net: sched: remove FROM INGRESS/EGRESS Florian Westphal
  2015-05-15  8:50 ` [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS Florian Westphal
@ 2015-05-15  8:50 ` Florian Westphal
  2015-05-15 11:36 ` [PATCH -next 0/3] tc state machinery cleanups Jamal Hadi Salim
  2015-05-18  3:34 ` David Miller
  4 siblings, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-15  8:50 UTC (permalink / raw)
  To: netdev; +Cc: jhs, alexei.starovoitov, daniel, Florian Westphal

We can use either the tc_verd NCLS bit or tc_skb_state for this purpose.

If skb was received from network, both are zero.

If the skb was re-injected via sch_ingress + ifb driver, then
NCLS is set, but skb->tc_skb_state retains last value set up by
mirred action that brought us to ifb in first place (TC_FROM_INGRESS).

We can thus also use tc_skb_state instead to detect ifb
reinject-to-ingress.

After this patch NCLS bit is only used by ifb to skip the classsifiers
attached to its egress device.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/core/dev.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 802b9b9..e817fa9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3805,9 +3805,9 @@ another_round:
 	}
 
 #ifdef CONFIG_NET_CLS_ACT
-	if (skb->tc_verd & TC_NCLS) {
-		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
-		goto ncls;
+	if (skb->skb_tc_state) {
+		skb->skb_tc_state = 0;
+		goto skip_tc_ingress;
 	}
 #endif
 
@@ -3839,7 +3839,7 @@ skip_taps:
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_verd = 0;
-ncls:
+skip_tc_ingress:
 #endif
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
 		goto drop;
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 0/3] tc state machinery cleanups
  2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
                   ` (2 preceding siblings ...)
  2015-05-15  8:50 ` [PATCH -next 3/3] net: core: use skb_tc_state to skip ingress classifiers Florian Westphal
@ 2015-05-15 11:36 ` Jamal Hadi Salim
  2015-05-15 11:59   ` Florian Westphal
  2015-05-15 13:11   ` Daniel Borkmann
  2015-05-18  3:34 ` David Miller
  4 siblings, 2 replies; 13+ messages in thread
From: Jamal Hadi Salim @ 2015-05-15 11:36 UTC (permalink / raw)
  To: Florian Westphal, netdev; +Cc: alexei.starovoitov, daniel

Hi Florian,

On 05/15/15 04:50, Florian Westphal wrote:
> This series prepares removal of tc_verd member from sk_buff.
>
> It simplifies tc state machinery to what is required to keep current
> mirred/ifb combinations working.
>
> I tested a few scenarios, namely:
>
> 1 - htb based shaping on egress
> 2 - netem attached to ifb with mirred redirect from ingress qdisc
> 3 - mirred to different egress device
> 4 - mirred to ifb egress device with qdiscs set up on ifb
>      to provide illusion of 'single' transmit interface for traffic shaping
>
> After this series tc_verd is only used by ifb to skip actions on egress.
>
> Part #2 of this series will remove tc_verd completely.
>

The major goal here is to release some of the bits that are
no longer in use. I am for that but struggling with the unintended
consequence of changing readability or subtle meanings.
I brought this up last time when you removed the 1-bit macro.
Think of someone who read the code 6  months ago.
It would be useful to keep readability for the rest of the code the
way it was - change the definition of the macros.
This sounds doable because 8 contigous bits are going to be available.
Note: The macros are fugly - but there are better tools these days to
make them easier to read so de-uglifying could be part of the goal.
Therefore, I suggest the initial effort should be to find ways to
co-locate the bits and save the 8  bits and unless necessary keep
the code readability.

More on readability and subtle meanings:
A noun like "skb_tc_state" is not a good choice. Those two bits
carry location indicators and are intended to define source
and destination endpoints.

Also note: ifb and mirred are using the location metadata
bits - they are _not_ the owners of those bits or the only
potential users. Any of the blocks can consume or produce the metadatum
in order to aid policy traversal. You can use those two as examples
of existing blocks that do so - but annotating the code as such is
inaccurate.

cheers,
jamal

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 0/3] tc state machinery cleanups
  2015-05-15 11:36 ` [PATCH -next 0/3] tc state machinery cleanups Jamal Hadi Salim
@ 2015-05-15 11:59   ` Florian Westphal
  2015-05-15 13:11   ` Daniel Borkmann
  1 sibling, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-15 11:59 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: Florian Westphal, netdev, alexei.starovoitov, daniel

Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 05/15/15 04:50, Florian Westphal wrote:
> >This series prepares removal of tc_verd member from sk_buff.
> >
> >It simplifies tc state machinery to what is required to keep current
> >mirred/ifb combinations working.
> >
> >I tested a few scenarios, namely:
> >
> >1 - htb based shaping on egress
> >2 - netem attached to ifb with mirred redirect from ingress qdisc
> >3 - mirred to different egress device
> >4 - mirred to ifb egress device with qdiscs set up on ifb
> >     to provide illusion of 'single' transmit interface for traffic shaping
> >
> >After this series tc_verd is only used by ifb to skip actions on egress.
> >
> >Part #2 of this series will remove tc_verd completely.
> 
> The major goal here is to release some of the bits that are
> no longer in use. I am for that but struggling with the unintended
> consequence of changing readability or subtle meanings.
> I brought this up last time when you removed the 1-bit macro.
> Think of someone who read the code 6  months ago.
> It would be useful to keep readability for the rest of the code the
> way it was - change the definition of the macros.
> This sounds doable because 8 contigous bits are going to be available.

Yes, its doable.  But I specifically wanted to have one "skb state", i.e.
no combinations of bit indicators.

> Therefore, I suggest the initial effort should be to find ways to
> co-locate the bits and save the 8  bits and unless necessary keep
> the code readability.

I could add that as an intermediate step, but see no real gain.

> More on readability and subtle meanings:
> A noun like "skb_tc_state" is not a good choice. Those two bits
> carry location indicators and are intended to define source
> and destination endpoints.

Not anymore.  Its really a processing state, overlaps are not possible.
If its FROM_INGRESS then it means mirred wants ifb to return it to
ingress.

If its FROM_EGRESS then it means mirred wants ifb to transmit it to
device the mirred action was attached to.

If its AT_INGRESS, then it means skb is being processed during rx
processing (sch_ingress), so mirred needs to push back l2 header.

If its 0, its normal egress path.
No other states are possible.

I named it tc_state initially but that is "git grep" unfriendly, hence
skb_ prefix.

I'm open to a better name instead of skb_tc_state.

> Also note: ifb and mirred are using the location metadata
> bits - they are _not_ the owners of those bits or the only
> potential users.

Hmm, afaics they are the only users.

> Any of the blocks can consume or produce the metadatum
> in order to aid policy traversal. You can use those two as examples
> of existing blocks that do so - but annotating the code as such is
> inaccurate.

It documents what they're being used for.
And there haven't been any other users so far.

I'll have a stab at keeping all of the macros intact, but I don't
see (yet?) what the gain is.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 0/3] tc state machinery cleanups
  2015-05-15 11:36 ` [PATCH -next 0/3] tc state machinery cleanups Jamal Hadi Salim
  2015-05-15 11:59   ` Florian Westphal
@ 2015-05-15 13:11   ` Daniel Borkmann
  1 sibling, 0 replies; 13+ messages in thread
From: Daniel Borkmann @ 2015-05-15 13:11 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: Florian Westphal, netdev, alexei.starovoitov

On 05/15/2015 01:36 PM, Jamal Hadi Salim wrote:
...
> It would be useful to keep readability for the rest of the code the
> way it was - change the definition of the macros.
...
> Note: The macros are fugly - but there are better tools these days to
> make them easier to read so de-uglifying could be part of the goal.

Yep, they are fugly. :) Only adapting the definitions of the macros then
pointing to the skb bits of skb_tc_state (or however it's called eventually)
could certainly be an intermin step, where a follow-up could get rid of
the macros for the kernel eventually as part of the de-uglifying. Is that
what you mean?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15  8:50 ` [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS Florian Westphal
@ 2015-05-15 16:23   ` Alexei Starovoitov
  2015-05-15 17:21     ` Florian Westphal
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2015-05-15 16:23 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, jhs, daniel

On Fri, May 15, 2015 at 10:50:49AM +0200, Florian Westphal wrote:
>  
>  enum skb_tc_state {
> +	TC_NO_STATE = 0, /* must be 0 */
> +
>  	/* set by act_mirred to tell IFB that skb needs to be ... */
>  	TC_FROM_INGRESS = 1, /* ... re-injected to local stack */
>  	TC_FROM_EGRESS = 2,  /* ... transmitted to device */
> +
> +	/* used by act_mirred to learn its called during skb rx processing
> +	 * and has to push back the (already pulled) l2 header.
> +	 */
> +	TC_AT_INGRESS = 3,
>  };
>  
...
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0e7afef..802b9b9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3071,9 +3071,6 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
>  	txq = netdev_pick_tx(dev, skb, accel_priv);
>  	q = rcu_dereference_bh(txq->qdisc);
>  
> -#ifdef CONFIG_NET_CLS_ACT
> -	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
> -#endif

I don't think it's a good change.
Squeezing 4 bits into 2 by losing AT_STACK condition, imo, is wrong.
Before ifb could differentiate AT_STACK and AT_EGRESS, but when
they're aliased we lose this information.
I think we should keep AT_STACK, AT_INGRESS, AT_EGRESS.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15 16:23   ` Alexei Starovoitov
@ 2015-05-15 17:21     ` Florian Westphal
  2015-05-15 20:09       ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-15 17:21 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Florian Westphal, netdev, jhs, daniel

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0e7afef..802b9b9 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3071,9 +3071,6 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >  	txq = netdev_pick_tx(dev, skb, accel_priv);
> >  	q = rcu_dereference_bh(txq->qdisc);
> >  
> > -#ifdef CONFIG_NET_CLS_ACT
> > -	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
> > -#endif
> 
> I don't think it's a good change.
> Squeezing 4 bits into 2 by losing AT_STACK condition, imo, is wrong.

Its right, IMO.

> Before ifb could differentiate AT_STACK and AT_EGRESS, but when
> they're aliased we lose this information.

IFB refuses to work with skbs that did not come in via mirred.
So, from ifb point of view it makes no difference, G_TC_FROM+AT_STACK
causes skb to be dropped and IFB doesn't care about G_TC_AT() at all.

So the change is compatible: zero state == "mirred did not see this skb"
-> drop

> I think we shmuld keep AT_STACK, AT_INGRESS, AT_EGRESS.

We have dozens of combinations that either cannot happen
from concept point of view (skb is AT_EGRESS or its coming in,
it can't be both logically) or because no code path uses/sets it.

For G_TC_AT(), we only have two cases: egress (since
everything is attached to qdisc / coming in via dev_queue_xmit thats
the "normal case", and the special ingress one, which is what the new
TC_AT_INGRESS state is for.  The ingress case is also "special" because
it can only turn back to 0 (no mirred action attached to ingress) or
change to TC_FROM_INGRESS (when mirred grabs it).

The existing G_TC_AT is is irrelevant for IFB since ifb is reachable only
called via ndo_start_xmit so its always at egress.

skb_tc_state enum just tells us what do to with the skb --
back to ingress or transmit via device.

AT_STACK cannot even happen for the G_TC_AT case from looking at the
code since dev_queue_xmit forces AT_EGRESS & rx sets AT_INGRESS.

And for FROM its only relevant in ifb to detect skb that wasn't seen by
mirred.  And we keep this functionality via 0 skb_tc_state.

I did a test patch with Jamals suggestion (continue using macros, just
close holes), but I see no gain: we get 5 bits that are used for
automaton that only has/cares about 4 possible (distinct) skb states.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15 17:21     ` Florian Westphal
@ 2015-05-15 20:09       ` Alexei Starovoitov
  2015-05-15 22:22         ` Florian Westphal
  2015-05-15 22:43         ` Jamal Hadi Salim
  0 siblings, 2 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2015-05-15 20:09 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, jhs, daniel

On Fri, May 15, 2015 at 07:21:15PM +0200, Florian Westphal wrote:
> So, from ifb point of view it makes no difference, G_TC_FROM+AT_STACK
> causes skb to be dropped and IFB doesn't care about G_TC_AT() at all.

yes. your change is technically correct. It's not causing ifb regression,
but it removes information in a way that will be very hard to add it later.

> AT_STACK cannot even happen for the G_TC_AT case from looking at the
> code since dev_queue_xmit forces AT_EGRESS & rx sets AT_INGRESS.

yes, if we only consider ingress and egress hooks.
I want to use this stack/ingress/egress indication with socket filters.
If we make stack==egress, I would need to refactor this code all over again.
It's not broken today. You're doing this aliasing only two squeeze a bit.
That's why I'm saying keep the stack/ingress/egress flag as-is. It's useful.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15 20:09       ` Alexei Starovoitov
@ 2015-05-15 22:22         ` Florian Westphal
  2015-05-15 22:43         ` Jamal Hadi Salim
  1 sibling, 0 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-15 22:22 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Florian Westphal, netdev, jhs, daniel

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Fri, May 15, 2015 at 07:21:15PM +0200, Florian Westphal wrote:
> > So, from ifb point of view it makes no difference, G_TC_FROM+AT_STACK
> > causes skb to be dropped and IFB doesn't care about G_TC_AT() at all.
> 
> yes. your change is technically correct. It's not causing ifb regression,

Thanks.

> but it removes information in a way that will be very hard to add it later.

Are you sure?  Would you mind elaborating a bit?

> > AT_STACK cannot even happen for the G_TC_AT case from looking at the
> > code since dev_queue_xmit forces AT_EGRESS & rx sets AT_INGRESS.
> 
> yes, if we only consider ingress and egress hooks.
> I want to use this stack/ingress/egress indication with socket filters.

Hmm... I'm sorry, I fail to understand where problem is.

> If we make stack==egress, I would need to refactor this code all over again.

If you mean "skb was not forwarded", you could just check for
skb->skb_iif = 0?

If not, what info do you need, and why could we not extend proposed enum
if absolutely required?

> It's not broken today. You're doing this aliasing only two squeeze a bit.

Yep, but its was also to minimize the state machinery down to whats
required.

Sorry Alexei, I'm just trying to find out what exactly is needed,
perhaps if you can clarify/explain I might be able to re-spin this in a
way that will meet your requirements.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS
  2015-05-15 20:09       ` Alexei Starovoitov
  2015-05-15 22:22         ` Florian Westphal
@ 2015-05-15 22:43         ` Jamal Hadi Salim
  1 sibling, 0 replies; 13+ messages in thread
From: Jamal Hadi Salim @ 2015-05-15 22:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Florian Westphal; +Cc: netdev, daniel

On 05/15/15 16:09, Alexei Starovoitov wrote:
> On Fri, May 15, 2015 at 07:21:15PM +0200, Florian Westphal wrote:
>> So, from ifb point of view it makes no difference, G_TC_FROM+AT_STACK
>> causes skb to be dropped and IFB doesn't care about G_TC_AT() at all.
>
> yes. your change is technically correct. It's not causing ifb regression,
> but it removes information in a way that will be very hard to add it later.
>
>> AT_STACK cannot even happen for the G_TC_AT case from looking at the
>> code since dev_queue_xmit forces AT_EGRESS & rx sets AT_INGRESS.
>
> yes, if we only consider ingress and egress hooks.
> I want to use this stack/ingress/egress indication with socket filters.
> If we make stack==egress, I would need to refactor this code all over again.
> It's not broken today. You're doing this aliasing only two squeeze a bit.
> That's why I'm saying keep the stack/ingress/egress flag as-is. It's useful.
>

My point as well. Using ifb or mirred as examples is fine
but they are not the only potential consumers/producers. Using examples
as such is out of place when it is an architectural issue.
So i would rather this be left alone.

cheers,
jamal

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next 0/3] tc state machinery cleanups
  2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
                   ` (3 preceding siblings ...)
  2015-05-15 11:36 ` [PATCH -next 0/3] tc state machinery cleanups Jamal Hadi Salim
@ 2015-05-18  3:34 ` David Miller
  4 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2015-05-18  3:34 UTC (permalink / raw)
  To: fw; +Cc: netdev, jhs, alexei.starovoitov, daniel

From: Florian Westphal <fw@strlen.de>
Date: Fri, 15 May 2015 10:50:47 +0200

> This series prepares removal of tc_verd member from sk_buff.
> 
> It simplifies tc state machinery to what is required to keep current
> mirred/ifb combinations working.
> 
> I tested a few scenarios, namely:
> 
> 1 - htb based shaping on egress
> 2 - netem attached to ifb with mirred redirect from ingress qdisc
> 3 - mirred to different egress device
> 4 - mirred to ifb egress device with qdiscs set up on ifb
>     to provide illusion of 'single' transmit interface for traffic shaping
> 
> After this series tc_verd is only used by ifb to skip actions on egress.
> 
> Part #2 of this series will remove tc_verd completely.
> 
> motivation is two-fold:
> 1) make states and state transitions more obvious
> 2) provide a way to later reduce skb size by 8 bytes
>    (s/u16 mac_len/u8 mac_len/ would result in
>     two 2 byte and one 4 byte hole, i.e. 8 byte reduction with
>     minor reshuffling).

It looks like there will be changes to this series.

My only comment from my perspective is that, in patch #1, you
should probably add a "TC_FROM_UNKNOWN = 0" to the enumeration
and test/set using that instead of a magic constant.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-05-18  3:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-15  8:50 [PATCH -next 0/3] tc state machinery cleanups Florian Westphal
2015-05-15  8:50 ` [PATCH -next 1/3] net: sched: remove FROM INGRESS/EGRESS Florian Westphal
2015-05-15  8:50 ` [PATCH -next 2/3] net: sched: remove AT INGRESS/EGRESS Florian Westphal
2015-05-15 16:23   ` Alexei Starovoitov
2015-05-15 17:21     ` Florian Westphal
2015-05-15 20:09       ` Alexei Starovoitov
2015-05-15 22:22         ` Florian Westphal
2015-05-15 22:43         ` Jamal Hadi Salim
2015-05-15  8:50 ` [PATCH -next 3/3] net: core: use skb_tc_state to skip ingress classifiers Florian Westphal
2015-05-15 11:36 ` [PATCH -next 0/3] tc state machinery cleanups Jamal Hadi Salim
2015-05-15 11:59   ` Florian Westphal
2015-05-15 13:11   ` Daniel Borkmann
2015-05-18  3:34 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).