netdev.vger.kernel.org archive mirror
* [PATCH net] net: sched: refine software bypass handling in tc_run
@ 2025-01-06 15:08 Xin Long
  2025-01-08 17:58 ` Jakub Kicinski
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Xin Long @ 2025-01-06 15:08 UTC (permalink / raw)
  To: network dev
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner, ast, Shuang Li

This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.

Previously, filtercnt and skipswcnt were introduced by:

  Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
  Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")

  filtercnt tracked all tp (tcf_proto) objects added to a block, and
  skipswcnt counted tp objects with the skipsw attribute set.

The problem is that a single tp can contain multiple filters, some with
skipsw and others without. The current implementation fails in the
following case:

  When the first filter in a tp has skipsw, both skipswcnt and filtercnt
  are incremented, then adding a second filter without skipsw to the same
  tp does not modify these counters because tp->counted is already set.

  This results in software bypass behavior being decided solely by
  skipswcnt equaling filtercnt, even when the block includes filters
  without skipsw. Consequently, filters without skipsw are inadvertently
  bypassed.

To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:

  Whenever a filter without skipsw is added, its tp is marked with usesw
  and counted in useswcnt. tc_run() now uses useswcnt to determine software
  bypass, eliminating reliance on filtercnt and skipswcnt.

  This refined approach prevents software bypass for blocks containing
  mixed filters, ensuring correct behavior in tc_run().

Additionally, since atomic operations on useswcnt ensure thread safety
and tp->lock guards access to tp->usesw and tp->counted, the broader
down_write(&block->cb_lock) lock is no longer required in
tc_new_tfilter(). This also resolves a performance regression caused by
the filter counting mechanism during parallel filter insertions.

  The improvement can be demonstrated using the following script:

  # cat insert_tc_rules.sh

    tc qdisc add dev ens1f0np0 ingress
    for i in $(seq 16); do
        taskset -c $i tc -b rules_$i.txt &
    done
    wait

  Each of the rules_$i.txt files above contains 100000 tc filter rules
  targeting ens1f0np0, an mlx5-driven NIC.

  Without this patch:

  # time sh insert_tc_rules.sh

    real    0m50.780s
    user    0m23.556s
    sys     4m13.032s

  With this patch:

  # time sh insert_tc_rules.sh

    real    0m17.718s
    user    0m7.807s
    sys     3m45.050s

Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/pkt_cls.h     | 18 +++++++-------
 include/net/sch_generic.h |  5 ++--
 net/core/dev.c            | 11 ++-------
 net/sched/cls_api.c       | 52 +++++++++------------------------------
 net/sched/cls_bpf.c       |  2 ++
 net/sched/cls_flower.c    |  2 ++
 net/sched/cls_matchall.c  |  2 ++
 net/sched/cls_u32.c       |  2 ++
 8 files changed, 32 insertions(+), 62 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index cf199af85c52..d66cb315a6b5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -74,15 +74,6 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
 	return block && block->index;
 }
 
-#ifdef CONFIG_NET_CLS_ACT
-DECLARE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
-
-static inline bool tcf_block_bypass_sw(struct tcf_block *block)
-{
-	return block && block->bypass_wanted;
-}
-#endif
-
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 {
 	WARN_ON(tcf_block_shared(block));
@@ -760,6 +751,15 @@ tc_cls_common_offload_init(struct flow_cls_common_offload *cls_common,
 		cls_common->extack = extack;
 }
 
+static inline void tcf_proto_update_usesw(struct tcf_proto *tp, u32 flags)
+{
+	if (tp->usesw)
+		return;
+	if (tc_skip_sw(flags) && tc_in_hw(flags))
+		return;
+	tp->usesw = true;
+}
+
 #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
 static inline struct tc_skb_ext *tc_skb_ext_alloc(struct sk_buff *skb)
 {
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 5d74fa7e694c..1e6324f0d4ef 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -425,6 +425,7 @@ struct tcf_proto {
 	spinlock_t		lock;
 	bool			deleting;
 	bool			counted;
+	bool			usesw;
 	refcount_t		refcnt;
 	struct rcu_head		rcu;
 	struct hlist_node	destroy_ht_node;
@@ -474,9 +475,7 @@ struct tcf_block {
 	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
-	bool bypass_wanted;
-	atomic_t filtercnt; /* Number of filters */
-	atomic_t skipswcnt; /* Number of skip_sw filters */
+	atomic_t useswcnt;
 	atomic_t offloadcnt; /* Number of oddloaded filters */
 	unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
 	unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
diff --git a/net/core/dev.c b/net/core/dev.c
index 45a8c3dd4a64..c1fff978f3ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2103,11 +2103,6 @@ void net_dec_egress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 #endif
 
-#ifdef CONFIG_NET_CLS_ACT
-DEFINE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
-EXPORT_SYMBOL(tcf_bypass_check_needed_key);
-#endif
-
 DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
 EXPORT_SYMBOL(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
@@ -3998,10 +3993,8 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
 	if (!miniq)
 		return ret;
 
-	if (static_branch_unlikely(&tcf_bypass_check_needed_key)) {
-		if (tcf_block_bypass_sw(miniq->block))
-			return ret;
-	}
+	if (miniq->block && !atomic_read(&miniq->block->useswcnt))
+		return ret;
 
 	tc_skb_cb(skb)->mru = 0;
 	tc_skb_cb(skb)->post_ct = false;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7578e27260c9..4b971184cfbb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -390,6 +390,7 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	tp->protocol = protocol;
 	tp->prio = prio;
 	tp->chain = chain;
+	tp->usesw = !tp->ops->reoffload;
 	spin_lock_init(&tp->lock);
 	refcount_set(&tp->refcnt, 1);
 
@@ -410,48 +411,16 @@ static void tcf_proto_get(struct tcf_proto *tp)
 	refcount_inc(&tp->refcnt);
 }
 
-static void tcf_maintain_bypass(struct tcf_block *block)
-{
-	int filtercnt = atomic_read(&block->filtercnt);
-	int skipswcnt = atomic_read(&block->skipswcnt);
-	bool bypass_wanted = filtercnt > 0 && filtercnt == skipswcnt;
-
-	if (bypass_wanted != block->bypass_wanted) {
-#ifdef CONFIG_NET_CLS_ACT
-		if (bypass_wanted)
-			static_branch_inc(&tcf_bypass_check_needed_key);
-		else
-			static_branch_dec(&tcf_bypass_check_needed_key);
-#endif
-		block->bypass_wanted = bypass_wanted;
-	}
-}
-
-static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
-{
-	lockdep_assert_not_held(&block->cb_lock);
-
-	down_write(&block->cb_lock);
-	if (*counted != add) {
-		if (add) {
-			atomic_inc(&block->filtercnt);
-			*counted = true;
-		} else {
-			atomic_dec(&block->filtercnt);
-			*counted = false;
-		}
-	}
-	tcf_maintain_bypass(block);
-	up_write(&block->cb_lock);
-}
-
 static void tcf_chain_put(struct tcf_chain *chain);
 
 static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
 			      bool sig_destroy, struct netlink_ext_ack *extack)
 {
 	tp->ops->destroy(tp, rtnl_held, extack);
-	tcf_block_filter_cnt_update(tp->chain->block, &tp->counted, false);
+	if (tp->usesw && tp->counted) {
+		tp->counted = false;
+		atomic_dec(&tp->chain->block->useswcnt);
+	}
 	if (sig_destroy)
 		tcf_proto_signal_destroyed(tp->chain, tp);
 	tcf_chain_put(tp->chain);
@@ -2409,7 +2378,12 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false, rtnl_held, extack);
 		tfilter_put(tp, fh);
-		tcf_block_filter_cnt_update(block, &tp->counted, true);
+		spin_lock(&tp->lock);
+		if (tp->usesw && !tp->counted) {
+			tp->counted = true;
+			atomic_inc(&block->useswcnt);
+		}
+		spin_unlock(&tp->lock);
 		/* q pointer is NULL for shared blocks */
 		if (q)
 			q->flags &= ~TCQ_F_CAN_BYPASS;
@@ -3532,8 +3506,6 @@ static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
 	if (*flags & TCA_CLS_FLAGS_IN_HW)
 		return;
 	*flags |= TCA_CLS_FLAGS_IN_HW;
-	if (tc_skip_sw(*flags))
-		atomic_inc(&block->skipswcnt);
 	atomic_inc(&block->offloadcnt);
 }
 
@@ -3542,8 +3514,6 @@ static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
 	if (!(*flags & TCA_CLS_FLAGS_IN_HW))
 		return;
 	*flags &= ~TCA_CLS_FLAGS_IN_HW;
-	if (tc_skip_sw(*flags))
-		atomic_dec(&block->skipswcnt);
 	atomic_dec(&block->offloadcnt);
 }
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 1941ebec23ff..7fbe42f0e5c2 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -509,6 +509,8 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (!tc_in_hw(prog->gen_flags))
 		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
+	tcf_proto_update_usesw(tp, prog->gen_flags);
+
 	if (oldprog) {
 		idr_replace(&head->handle_idr, prog, handle);
 		list_replace_rcu(&oldprog->link, &prog->link);
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1008ec8a464c..03505673d523 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -2503,6 +2503,8 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	if (!tc_in_hw(fnew->flags))
 		fnew->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
+	tcf_proto_update_usesw(tp, fnew->flags);
+
 	spin_lock(&tp->lock);
 
 	/* tp was deleted concurrently. -EAGAIN will cause caller to lookup
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 9f1e62ca508d..f03bf5da39ee 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -228,6 +228,8 @@ static int mall_change(struct net *net, struct sk_buff *in_skb,
 	if (!tc_in_hw(new->flags))
 		new->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
+	tcf_proto_update_usesw(tp, new->flags);
+
 	*arg = head;
 	rcu_assign_pointer(tp->root, new);
 	return 0;
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index d3a03c57545b..5e8f191fd820 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -1164,6 +1164,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 		if (!tc_in_hw(n->flags))
 			n->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
 
+		tcf_proto_update_usesw(tp, n->flags);
+
 		ins = &ht->ht[TC_U32_HASH(handle)];
 		for (pins = rtnl_dereference(*ins); pins;
 		     ins = &pins->next, pins = rtnl_dereference(*ins))
-- 
2.43.0



* Re: [PATCH net] net: sched: refine software bypass handling in tc_run
  2025-01-06 15:08 [PATCH net] net: sched: refine software bypass handling in tc_run Xin Long
@ 2025-01-08 17:58 ` Jakub Kicinski
  2025-01-09 10:46 ` Paolo Abeni
  2025-01-10 13:25 ` Asbjørn Sloth Tønnesen
  2 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2025-01-08 17:58 UTC (permalink / raw)
  To: ast
  Cc: Xin Long, network dev, davem, Eric Dumazet, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner,
	Shuang Li

On Mon,  6 Jan 2025 10:08:32 -0500 Xin Long wrote:
> This patch addresses issues with filter counting in block (tcf_block),
> particularly for software bypass scenarios, by introducing a more
> accurate mechanism using useswcnt.

@ast, please take a look?


* Re: [PATCH net] net: sched: refine software bypass handling in tc_run
  2025-01-06 15:08 [PATCH net] net: sched: refine software bypass handling in tc_run Xin Long
  2025-01-08 17:58 ` Jakub Kicinski
@ 2025-01-09 10:46 ` Paolo Abeni
  2025-01-09 15:47   ` Xin Long
  2025-01-10 13:25 ` Asbjørn Sloth Tønnesen
  2 siblings, 1 reply; 6+ messages in thread
From: Paolo Abeni @ 2025-01-09 10:46 UTC (permalink / raw)
  To: Xin Long, network dev
  Cc: davem, kuba, Eric Dumazet, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Marcelo Ricardo Leitner, ast, Shuang Li

On 1/6/25 4:08 PM, Xin Long wrote:
> [...]

Given the quite large scope of this change and the functional and
performance implications, I think it's more suited for net-next.

> [...]
> -#ifdef CONFIG_NET_CLS_ACT
> -DECLARE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);

I think it would be better, if possible, to preserve this static key;
will reduce the delta and avoid additional tests in fast-path for S/W
only setup.

> [...]
> +static inline void tcf_proto_update_usesw(struct tcf_proto *tp, u32 flags)
> +{
> +	if (tp->usesw)
> +		return;
> +	if (tc_skip_sw(flags) && tc_in_hw(flags))
> +		return;
> +	tp->usesw = true;
> +}

It looks like 'usesw' is never cleared. Can't user-space change the
skipsw flag for an existing tp?

[...]
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index d3a03c57545b..5e8f191fd820 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -1164,6 +1164,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
>  		if (!tc_in_hw(n->flags))
>  			n->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>  
> +		tcf_proto_update_usesw(tp, n->flags);

Why don't you need to hook also in the 'key existing' branch on line 909?

Thanks!

Paolo



* Re: [PATCH net] net: sched: refine software bypass handling in tc_run
  2025-01-09 10:46 ` Paolo Abeni
@ 2025-01-09 15:47   ` Xin Long
  0 siblings, 0 replies; 6+ messages in thread
From: Xin Long @ 2025-01-09 15:47 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: network dev, davem, kuba, Eric Dumazet, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner, ast, Shuang Li

On Thu, Jan 9, 2025 at 5:46 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/6/25 4:08 PM, Xin Long wrote:
> > [...]
>
> Given the quite large scope of this change and the functional and
> performance implications, I think it's more suited for net-next.
>
Sounds fine to me.

> > [...]
> > -#ifdef CONFIG_NET_CLS_ACT
> > -DECLARE_STATIC_KEY_FALSE(tcf_bypass_check_needed_key);
>
> I think it would be better, if possible, to preserve this static key;
> will reduce the delta and avoid additional tests in fast-path for S/W
> only setup.
That's difficult. The static key would have to be incremented and
decremented according to block->useswcnt, and we would have to hold
down_write(&block->cb_lock) for its update when adding a filter, so
the performance issue would come back again.

>
> > [...]
> > +static inline void tcf_proto_update_usesw(struct tcf_proto *tp, u32 flags)
> > +{
> > +     if (tp->usesw)
> > +             return;
> > +     if (tc_skip_sw(flags) && tc_in_hw(flags))
> > +             return;
> > +     tp->usesw = true;
> > +}
>
> It looks like 'usesw' is never cleared. Can't user-space change the
> skipsw flag for an existing tp?
The skipsw flag belongs to a tc rule/filter, and a tp may include
multiple rules/filters; a tp doesn't have a skipsw flag directly.

Now we are adding tp->usesw to reflect whether any rule without the
skipsw flag has ever been added to this tp. And yes, tp->usesw will
NOT be cleared even if that rule is later deleted. I guess that's all
we can do now.

If we want to use a tp->skipswcnt to dynamically track the skipsw flag
for a tp, some common code/functions for rule adding and deleting will
be needed. However, there's no such code in tc for rule deletion, so
it's not doable unless we touch all the cls_*.c files.

>
> [...]
> > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> > index d3a03c57545b..5e8f191fd820 100644
> > --- a/net/sched/cls_u32.c
> > +++ b/net/sched/cls_u32.c
> > @@ -1164,6 +1164,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
> >               if (!tc_in_hw(n->flags))
> >                       n->flags |= TCA_CLS_FLAGS_NOT_IN_HW;
> >
> > +             tcf_proto_update_usesw(tp, n->flags);
>
> Why don't you need to hook also in the 'key existing' branch on line 909?
>
Good catch, I didn't notice that cls_u32 calls to replace the hw filter
in two places.

Thanks.


* Re: [PATCH net] net: sched: refine software bypass handling in tc_run
  2025-01-06 15:08 [PATCH net] net: sched: refine software bypass handling in tc_run Xin Long
  2025-01-08 17:58 ` Jakub Kicinski
  2025-01-09 10:46 ` Paolo Abeni
@ 2025-01-10 13:25 ` Asbjørn Sloth Tønnesen
  2025-01-13 15:19   ` Xin Long
  2 siblings, 1 reply; 6+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2025-01-10 13:25 UTC (permalink / raw)
  To: Xin Long
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner, Shuang Li,
	network dev

Sorry for the late response; it has been a busy first week back, with
too many operational issues with devices in our network running crappy
vendor images.

On 1/6/25 3:08 PM, Xin Long wrote:
> [...]

Thank you for tracking it down. I wasn't aware that a tp could be used
by multiple filters, and I didn't encounter it during my testing.

> [...]

You are trying to do two things:
A) Fix the functional defect when filters share a single tp
B) Improve filter update performance

If you do part A in a minimalistic way, then IMHO it might be suitable
for net (+ stable), but for part B I agree with Paolo that it would
probably be better suited for net-next.

I focused my testing on routing performance, not filter update
performance; I also didn't test it in any multi-CPU setups (as I don't
have any).

The static key was added to mitigate concerns about the impact that the
bypass check would have for non-offloaded workloads on multi-CPU
systems.

https://lore.kernel.org/netdev/28bf1467-b7ce-4e36-a4ef-5445f65edd97@fiberby.net/
https://lore.kernel.org/netdev/CAM0EoMngVoBcbX7cqTdbW8dG1v_ysc1SZK+4y-9j-5Tbq6gaYw@mail.gmail.com/


* Re: [PATCH net] net: sched: refine software bypass handling in tc_run
  2025-01-10 13:25 ` Asbjørn Sloth Tønnesen
@ 2025-01-13 15:19   ` Xin Long
  0 siblings, 0 replies; 6+ messages in thread
From: Xin Long @ 2025-01-13 15:19 UTC (permalink / raw)
  To: Asbjørn Sloth Tønnesen
  Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner, Shuang Li,
	network dev

On Fri, Jan 10, 2025 at 8:25 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>
> Sorry for the late response, it has been a busy first week back, too many
> operational issues with devices in our network running crappy vendor images.
>
> On 1/6/25 3:08 PM, Xin Long wrote:
> > [...]
>
> You are trying to do two things:
> A) Fix functional defect when filters share a single tp
> B) Improve filter updates performance
>
> If you do part A in a minimalistic way, then IMHO it might be suitable
> for net (+ stable), but for part B I agree with Paolo, that it would
> properly be better suited for net-next.
>
> I focused my testing on routing performance, not filter update performance,
> I also didn't test it in any multi-CPU setups (as I don't have any).
>
> The static key was added to mitigate concerns, about the impact that the
> bypass check would have for non-offloaded workloads in multi-CPU systems.
>
> https://lore.kernel.org/netdev/28bf1467-b7ce-4e36-a4ef-5445f65edd97@fiberby.net/
> https://lore.kernel.org/netdev/CAM0EoMngVoBcbX7cqTdbW8dG1v_ysc1SZK+4y-9j-5Tbq6gaYw@mail.gmail.com/
Hi, Asbjørn, thanks for the comment.

I will keep the static key and leave the tc_run() code in
net/core/dev.c untouched; we can update the key without holding
block->cb_lock by using atomic_inc/dec_return():

        if (atomic_inc_return(&block->useswcnt) == 1)
                static_branch_inc(&tcf_bypass_check_needed_key);

        if (!atomic_dec_return(&tp->chain->block->useswcnt))
                static_branch_dec(&tcf_bypass_check_needed_key);

It doesn't look good to split this patch into two, so I will post v2 on
net.git with these changes.

Let me know if there are any other concerns.

