* [PATCH net-next 0/3] make skip_sw actually skip software
@ 2024-02-15 16:04 Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 16:04 UTC (permalink / raw)
To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Asbjørn Sloth Tønnesen, Daniel Borkmann, netdev,
linux-kernel, llu
Hi,
During development of flower-route[1], which I
recently presented at FOSDEM[2], I noticed that
CPU usage would increase with the number of rules
installed into the hardware for IP forwarding offload.
Since we use TC flower offload for the hottest
prefixes, and leave the long tail to Linux / the CPU,
we therefore need both the hardware and the software
datapath to perform well.
I found that skip_sw rules are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon, before the kernel checks the
skip_sw flag.
This patchset optimizes the case where all rules
are skip_sw.
[1] flower-route
https://github.com/fiberby-dk/flower-route
[2] FOSDEM talk
https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/
Asbjørn Sloth Tønnesen (3):
net: sched: cls_api: add skip_sw counter
net: sched: cls_api: add filter counter
net: sched: make skip_sw actually skip software
include/net/pkt_cls.h | 5 +++++
include/net/sch_generic.h | 3 +++
net/core/dev.c | 3 +++
net/sched/cls_api.c | 24 ++++++++++++++++++++++++
4 files changed, 35 insertions(+)
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby ApS
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter
2024-02-15 16:04 [PATCH net-next 0/3] make skip_sw actually skip software Asbjørn Sloth Tønnesen
@ 2024-02-15 16:04 ` Asbjørn Sloth Tønnesen
2024-02-15 17:39 ` Jamal Hadi Salim
2024-02-16 12:52 ` Jiri Pirko
2024-02-15 16:04 ` [PATCH net-next 2/3] net: sched: cls_api: add filter counter Asbjørn Sloth Tønnesen
` (2 subsequent siblings)
3 siblings, 2 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 16:04 UTC (permalink / raw)
To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Asbjørn Sloth Tønnesen, Daniel Borkmann, netdev,
linux-kernel, llu
Maintain a count of skip_sw filters.
This counter is protected by the cb_lock, and is updated
at the same time as offloadcnt.
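As a rough sketch of the counting scheme (a userspace model, not kernel code: plain ints stand in for the atomics, cb_lock is elided, and the FL_* bits below are illustrative stand-ins for the real TCA_CLS_FLAGS_* values):

```c
#include <stdbool.h>

/* Illustrative flag bits; the real ones live in uapi/linux/pkt_cls.h. */
#define FL_SKIP_HW (1u << 0)
#define FL_SKIP_SW (1u << 1)
#define FL_IN_HW   (1u << 2)

struct block_cnt {
	int skipswcnt;   /* filters marked skip_sw */
	int offloadcnt;  /* filters currently offloaded to hardware */
};

static bool skip_sw(unsigned int flags)
{
	return flags & FL_SKIP_SW;
}

/* Models tcf_block_offload_inc(): count a filter once, when it first
 * enters hardware. */
static void offload_inc(struct block_cnt *b, unsigned int *flags)
{
	if (*flags & FL_IN_HW)
		return;          /* already counted */
	*flags |= FL_IN_HW;
	if (skip_sw(*flags))
		b->skipswcnt++;
	b->offloadcnt++;
}

/* Models tcf_block_offload_dec(). */
static void offload_dec(struct block_cnt *b, unsigned int *flags)
{
	if (!(*flags & FL_IN_HW))
		return;
	*flags &= ~FL_IN_HW;
	if (skip_sw(*flags))
		b->skipswcnt--;
	b->offloadcnt--;
}
```

Note how a skip_sw filter bumps both counters, while a no-flag filter that happens to get offloaded bumps only offloadcnt; that asymmetry is what later lets the datapath tell the two populations apart.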
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
---
include/net/sch_generic.h | 1 +
net/sched/cls_api.c | 4 ++++
2 files changed, 5 insertions(+)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 934fdb977551..46a63d1818a0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -476,6 +476,7 @@ struct tcf_block {
struct flow_block flow_block;
struct list_head owner_list;
bool keep_dst;
+ atomic_t skipswcnt; /* Number of skip_sw filters */
atomic_t offloadcnt; /* Number of offloaded filters */
unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ca5676b2668e..397c3d29659c 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3483,6 +3483,8 @@ static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
if (*flags & TCA_CLS_FLAGS_IN_HW)
return;
*flags |= TCA_CLS_FLAGS_IN_HW;
+ if (tc_skip_sw(*flags))
+ atomic_inc(&block->skipswcnt);
atomic_inc(&block->offloadcnt);
}
@@ -3491,6 +3493,8 @@ static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
if (!(*flags & TCA_CLS_FLAGS_IN_HW))
return;
*flags &= ~TCA_CLS_FLAGS_IN_HW;
+ if (tc_skip_sw(*flags))
+ atomic_dec(&block->skipswcnt);
atomic_dec(&block->offloadcnt);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH net-next 2/3] net: sched: cls_api: add filter counter
2024-02-15 16:04 [PATCH net-next 0/3] make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
@ 2024-02-15 16:04 ` Asbjørn Sloth Tønnesen
2024-02-15 17:25 ` Jiri Pirko
2024-02-15 16:04 ` [PATCH net-next 3/3] net: sched: make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 18:00 ` [PATCH net-next 0/3] " Marcelo Ricardo Leitner
3 siblings, 1 reply; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 16:04 UTC (permalink / raw)
To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Asbjørn Sloth Tønnesen, Daniel Borkmann, netdev,
linux-kernel, llu
Maintain a count of filters per block.
Counter updates are protected by cb_lock, which is
also used to protect the offload counters.
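The helper added here is essentially an idempotent toggle: the per-tp "counted" bool prevents double counting when ops->change() replaces an existing filter instead of adding one. A userspace sketch of that logic (a model only; the cb_lock write lock and the atomic are elided):

```c
#include <stdbool.h>

struct blk {
	int filtercnt;   /* number of filters on the block */
};

/* Models tcf_block_filter_cnt_update(): adjust the counter only when
 * the desired state differs from what has already been accounted. */
static void filter_cnt_update(struct blk *b, bool *counted, bool add)
{
	if (*counted == add)
		return;          /* already in sync: avoids double counting */
	if (add)
		b->filtercnt++;
	else
		b->filtercnt--;
	*counted = add;
}
```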
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
---
include/net/sch_generic.h | 2 ++
net/sched/cls_api.c | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 46a63d1818a0..7af0621db226 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -427,6 +427,7 @@ struct tcf_proto {
*/
spinlock_t lock;
bool deleting;
+ bool counted;
refcount_t refcnt;
struct rcu_head rcu;
struct hlist_node destroy_ht_node;
@@ -476,6 +477,7 @@ struct tcf_block {
struct flow_block flow_block;
struct list_head owner_list;
bool keep_dst;
+ atomic_t filtercnt; /* Number of filters */
atomic_t skipswcnt; /* Number of skip_sw filters */
atomic_t offloadcnt; /* Number of offloaded filters */
unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 397c3d29659c..c750cb662142 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -411,11 +411,13 @@ static void tcf_proto_get(struct tcf_proto *tp)
}
static void tcf_chain_put(struct tcf_chain *chain);
+static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add);
static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
bool sig_destroy, struct netlink_ext_ack *extack)
{
tp->ops->destroy(tp, rtnl_held, extack);
+ tcf_block_filter_cnt_update(tp->chain->block, &tp->counted, false);
if (sig_destroy)
tcf_proto_signal_destroyed(tp->chain, tp);
tcf_chain_put(tp->chain);
@@ -2364,6 +2366,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
flags, extack);
if (err == 0) {
+ tcf_block_filter_cnt_update(block, &tp->counted, true);
tfilter_notify(net, skb, n, tp, block, q, parent, fh,
RTM_NEWTFILTER, false, rtnl_held, extack);
tfilter_put(tp, fh);
@@ -3478,6 +3481,23 @@ int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
}
EXPORT_SYMBOL(tcf_exts_dump_stats);
+static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
+{
+ lockdep_assert_not_held(&block->cb_lock);
+
+ down_write(&block->cb_lock);
+ if (*counted != add) {
+ if (add) {
+ atomic_inc(&block->filtercnt);
+ *counted = true;
+ } else {
+ atomic_dec(&block->filtercnt);
+ *counted = false;
+ }
+ }
+ up_write(&block->cb_lock);
+}
+
static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
{
if (*flags & TCA_CLS_FLAGS_IN_HW)
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-15 16:04 [PATCH net-next 0/3] make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 2/3] net: sched: cls_api: add filter counter Asbjørn Sloth Tønnesen
@ 2024-02-15 16:04 ` Asbjørn Sloth Tønnesen
2024-02-15 17:49 ` Jamal Hadi Salim
2024-02-16 8:47 ` Vlad Buslov
2024-02-15 18:00 ` [PATCH net-next 0/3] " Marcelo Ricardo Leitner
3 siblings, 2 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 16:04 UTC (permalink / raw)
To: Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Asbjørn Sloth Tønnesen, Daniel Borkmann, netdev,
linux-kernel, llu
TC filters come in 3 variants:
- no flag (no opinion, process wherever possible)
- skip_hw (do not process filter by hardware)
- skip_sw (do not process filter by software)
However, skip_sw is implemented such that the
skip_sw flag can only be checked after the filter
has been matched.
IMHO it's common, when using skip_sw, to use it
on all rules.
So if all filters in a block are skip_sw filters,
then we can bail out early, and thus avoid having
to match the filters just to check for the
skip_sw flag.
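With the counters from the two previous patches, the bail-out condition reduces to a single comparison. A minimal model (plain ints in place of the atomics):

```c
#include <stdbool.h>
#include <stddef.h>

struct blk {
	int filtercnt;   /* total filters on the block */
	int skipswcnt;   /* of those, how many are skip_sw */
};

/* Models tcf_block_has_skip_sw_only(): true when the block exists and
 * every installed filter is skip_sw, so tc_run() can return early. */
static bool has_skip_sw_only(const struct blk *b)
{
	return b && b->filtercnt == b->skipswcnt;
}
```

One edge worth noting: a block with zero filters also reports skip_sw-only (0 == 0), which is harmless, since there is nothing for the software classifier to match anyway.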
+----------------------------+--------+--------+--------+
| Test description | Pre | Post | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
| switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
| add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
| 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
| 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
| 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
| 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
+----------------------------+--------+--------+--------+
perf top (100 n-m skip_sw rules - pre patch):
25.57% [kernel] [k] __skb_flow_dissect
20.77% [kernel] [k] rhashtable_jhash2
14.26% [kernel] [k] fl_classify
13.28% [kernel] [k] fl_mask_lookup
6.38% [kernel] [k] memset_orig
3.22% [kernel] [k] tcf_classify
perf top (100 n-m skip_sw rules - post patch):
4.28% [kernel] [k] __dev_queue_xmit
3.80% [kernel] [k] check_preemption_disabled
3.68% [kernel] [k] nft_do_chain
3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
2.59% [kernel] [k] mlx5e_xmit
2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen was in range 12.79 - 12.95 Mpps across all tests.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
---
include/net/pkt_cls.h | 5 +++++
net/core/dev.c | 3 +++
2 files changed, 8 insertions(+)
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a4ee43f493bb..a065da4df7ff 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
return block && block->index;
}
+static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
+{
+ return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
+}
+
static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
{
WARN_ON(tcf_block_shared(block));
diff --git a/net/core/dev.c b/net/core/dev.c
index d8dd293a7a27..7cd014e5066e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
if (!miniq)
return ret;
+ if (tcf_block_has_skip_sw_only(miniq->block))
+ return ret;
+
tc_skb_cb(skb)->mru = 0;
tc_skb_cb(skb)->post_ct = false;
tcf_set_drop_reason(skb, *drop_reason);
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 2/3] net: sched: cls_api: add filter counter
2024-02-15 16:04 ` [PATCH net-next 2/3] net: sched: cls_api: add filter counter Asbjørn Sloth Tønnesen
@ 2024-02-15 17:25 ` Jiri Pirko
2024-02-15 23:19 ` Asbjørn Sloth Tønnesen
0 siblings, 1 reply; 20+ messages in thread
From: Jiri Pirko @ 2024-02-15 17:25 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Daniel Borkmann, netdev,
linux-kernel, llu
Thu, Feb 15, 2024 at 05:04:43PM CET, ast@fiberby.net wrote:
>Maintain a count of filters per block.
>
>Counter updates are protected by cb_lock, which is
>also used to protect the offload counters.
>
>Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>---
> include/net/sch_generic.h | 2 ++
> net/sched/cls_api.c | 20 ++++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 46a63d1818a0..7af0621db226 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -427,6 +427,7 @@ struct tcf_proto {
> */
> spinlock_t lock;
> bool deleting;
>+ bool counted;
> refcount_t refcnt;
> struct rcu_head rcu;
> struct hlist_node destroy_ht_node;
>@@ -476,6 +477,7 @@ struct tcf_block {
> struct flow_block flow_block;
> struct list_head owner_list;
> bool keep_dst;
>+ atomic_t filtercnt; /* Number of filters */
> atomic_t skipswcnt; /* Number of skip_sw filters */
> atomic_t offloadcnt; /* Number of offloaded filters */
> unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
>diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>index 397c3d29659c..c750cb662142 100644
>--- a/net/sched/cls_api.c
>+++ b/net/sched/cls_api.c
>@@ -411,11 +411,13 @@ static void tcf_proto_get(struct tcf_proto *tp)
> }
>
> static void tcf_chain_put(struct tcf_chain *chain);
>+static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add);
>
> static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
> bool sig_destroy, struct netlink_ext_ack *extack)
> {
> tp->ops->destroy(tp, rtnl_held, extack);
>+ tcf_block_filter_cnt_update(tp->chain->block, &tp->counted, false);
> if (sig_destroy)
> tcf_proto_signal_destroyed(tp->chain, tp);
> tcf_chain_put(tp->chain);
>@@ -2364,6 +2366,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
> err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> flags, extack);
> if (err == 0) {
>+ tcf_block_filter_cnt_update(block, &tp->counted, true);
> tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> RTM_NEWTFILTER, false, rtnl_held, extack);
> tfilter_put(tp, fh);
>@@ -3478,6 +3481,23 @@ int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
> }
> EXPORT_SYMBOL(tcf_exts_dump_stats);
>
>+static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
Can't you move this up to avoid forward declaration?
>+{
>+ lockdep_assert_not_held(&block->cb_lock);
>+
>+ down_write(&block->cb_lock);
>+ if (*counted != add) {
>+ if (add) {
>+ atomic_inc(&block->filtercnt);
>+ *counted = true;
>+ } else {
>+ atomic_dec(&block->filtercnt);
>+ *counted = false;
>+ }
>+ }
>+ up_write(&block->cb_lock);
>+}
>+
> static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
> {
> if (*flags & TCA_CLS_FLAGS_IN_HW)
>--
>2.43.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
@ 2024-02-15 17:39 ` Jamal Hadi Salim
2024-02-15 23:34 ` Asbjørn Sloth Tønnesen
2024-02-16 12:52 ` Jiri Pirko
1 sibling, 1 reply; 20+ messages in thread
From: Jamal Hadi Salim @ 2024-02-15 17:39 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Cong Wang, Jiri Pirko, Daniel Borkmann, netdev, linux-kernel, llu,
Vlad Buslov, Marcelo Ricardo Leitner
+Cc Vlad and Marcelo..
On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>
> Maintain a count of skip_sw filters.
>
> This counter is protected by the cb_lock, and is updated
> at the same time as offloadcnt.
>
> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> ---
> include/net/sch_generic.h | 1 +
> net/sched/cls_api.c | 4 ++++
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 934fdb977551..46a63d1818a0 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -476,6 +476,7 @@ struct tcf_block {
> struct flow_block flow_block;
> struct list_head owner_list;
> bool keep_dst;
> + atomic_t skipswcnt; /* Number of skip_sw filters */
> atomic_t offloadcnt; /* Number of offloaded filters */
For your use case is skipswcnt ever going to be any different than offloadcnt?
cheers,
jamal
> unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
> unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index ca5676b2668e..397c3d29659c 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -3483,6 +3483,8 @@ static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
> if (*flags & TCA_CLS_FLAGS_IN_HW)
> return;
> *flags |= TCA_CLS_FLAGS_IN_HW;
> + if (tc_skip_sw(*flags))
> + atomic_inc(&block->skipswcnt);
> atomic_inc(&block->offloadcnt);
> }
>
> @@ -3491,6 +3493,8 @@ static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
> if (!(*flags & TCA_CLS_FLAGS_IN_HW))
> return;
> *flags &= ~TCA_CLS_FLAGS_IN_HW;
> + if (tc_skip_sw(*flags))
> + atomic_dec(&block->skipswcnt);
> atomic_dec(&block->offloadcnt);
> }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-15 16:04 ` [PATCH net-next 3/3] net: sched: make skip_sw actually skip software Asbjørn Sloth Tønnesen
@ 2024-02-15 17:49 ` Jamal Hadi Salim
2024-02-16 12:57 ` Jiri Pirko
2024-02-16 13:38 ` Asbjørn Sloth Tønnesen
2024-02-16 8:47 ` Vlad Buslov
1 sibling, 2 replies; 20+ messages in thread
From: Jamal Hadi Salim @ 2024-02-15 17:49 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Cong Wang, Jiri Pirko, Daniel Borkmann, netdev, linux-kernel, llu,
Vlad Buslov, Marcelo Ricardo Leitner
On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>
> TC filters come in 3 variants:
> - no flag (no opinion, process wherever possible)
> - skip_hw (do not process filter by hardware)
> - skip_sw (do not process filter by software)
>
> However, skip_sw is implemented such that the
> skip_sw flag can only be checked after the filter
> has been matched.
>
> IMHO it's common, when using skip_sw, to use it
> on all rules.
>
> So if all filters in a block are skip_sw filters,
> then we can bail out early, and thus avoid having
> to match the filters just to check for the
> skip_sw flag.
>
> +----------------------------+--------+--------+--------+
> | Test description | Pre | Post | Rel. |
> | | kpps | kpps | chg. |
> +----------------------------+--------+--------+--------+
> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
> +----------------------------+--------+--------+--------+
> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
> +----------------------------+--------+--------+--------+
>
> perf top (100 n-m skip_sw rules - pre patch):
> 25.57% [kernel] [k] __skb_flow_dissect
> 20.77% [kernel] [k] rhashtable_jhash2
> 14.26% [kernel] [k] fl_classify
> 13.28% [kernel] [k] fl_mask_lookup
> 6.38% [kernel] [k] memset_orig
> 3.22% [kernel] [k] tcf_classify
>
> perf top (100 n-m skip_sw rules - post patch):
> 4.28% [kernel] [k] __dev_queue_xmit
> 3.80% [kernel] [k] check_preemption_disabled
> 3.68% [kernel] [k] nft_do_chain
> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
> 2.59% [kernel] [k] mlx5e_xmit
> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>
The concept makes sense - but I am wondering, when you have a mix of
skip_sw and skip_hw, if it makes more sense to just avoid looking up
skip_sw at all in the s/w datapath? Potentially by separating the
hashes for skip_sw/hw. I know it's a deeper surgery - but it would be
more general purpose... unless I am missing something.
> Test setup:
> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
> Data rate measured on switch (Extreme X690), and DUT connected as
> a router on a stick, with pktgen and pktsink as VLANs.
> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>
Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
the packet sizes?
Perhaps the traffic generator is a limitation here?
Also feels like you are doing exact matches? A sample flower rule
would have helped.
cheers,
jamal
> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> ---
> include/net/pkt_cls.h | 5 +++++
> net/core/dev.c | 3 +++
> 2 files changed, 8 insertions(+)
>
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index a4ee43f493bb..a065da4df7ff 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
> return block && block->index;
> }
>
> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> +{
> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> +}
> +
> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> {
> WARN_ON(tcf_block_shared(block));
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d8dd293a7a27..7cd014e5066e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
> if (!miniq)
> return ret;
>
> + if (tcf_block_has_skip_sw_only(miniq->block))
> + return ret;
> +
> tc_skb_cb(skb)->mru = 0;
> tc_skb_cb(skb)->post_ct = false;
> tcf_set_drop_reason(skb, *drop_reason);
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 0/3] make skip_sw actually skip software
2024-02-15 16:04 [PATCH net-next 0/3] make skip_sw actually skip software Asbjørn Sloth Tønnesen
` (2 preceding siblings ...)
2024-02-15 16:04 ` [PATCH net-next 3/3] net: sched: make skip_sw actually skip software Asbjørn Sloth Tønnesen
@ 2024-02-15 18:00 ` Marcelo Ricardo Leitner
2024-02-16 8:44 ` Vlad Buslov
2024-02-16 12:17 ` Asbjørn Sloth Tønnesen
3 siblings, 2 replies; 20+ messages in thread
From: Marcelo Ricardo Leitner @ 2024-02-15 18:00 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu
Hi,
On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
...
> Since we use TC flower offload for the hottest
> prefixes, and leave the long tail to Linux / the CPU,
> we therefore need both the hardware and the software
> datapath to perform well.
>
> I found that skip_sw rules are quite expensive
> in the kernel datapath, since they must be evaluated
> and matched upon, before the kernel checks the
> skip_sw flag.
>
> This patchset optimizes the case where all rules
> are skip_sw.
The talk is interesting. Yet, I don't get how it is set up.
How do you use a dedicated block for skip_sw, and then have a
catch-all on sw again please?
I'm missing which traffic is being matched against the sw datapath. In
theory, you have all the heavy duty filters offloaded, so the sw
datapath should be seeing only a few packets, right?
Marcelo
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 2/3] net: sched: cls_api: add filter counter
2024-02-15 17:25 ` Jiri Pirko
@ 2024-02-15 23:19 ` Asbjørn Sloth Tønnesen
0 siblings, 0 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 23:19 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jamal Hadi Salim, Cong Wang, Daniel Borkmann, netdev,
linux-kernel, llu
Hi Jiri,
Thanks for the review.
On 2/15/24 17:25, Jiri Pirko wrote:
> Thu, Feb 15, 2024 at 05:04:43PM CET, ast@fiberby.net wrote:
>> Maintain a count of filters per block.
>>
>> Counter updates are protected by cb_lock, which is
>> also used to protect the offload counters.
>>
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>> include/net/sch_generic.h | 2 ++
>> net/sched/cls_api.c | 20 ++++++++++++++++++++
>> 2 files changed, 22 insertions(+)
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 46a63d1818a0..7af0621db226 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -427,6 +427,7 @@ struct tcf_proto {
>> */
>> spinlock_t lock;
>> bool deleting;
>> + bool counted;
>> refcount_t refcnt;
>> struct rcu_head rcu;
>> struct hlist_node destroy_ht_node;
>> @@ -476,6 +477,7 @@ struct tcf_block {
>> struct flow_block flow_block;
>> struct list_head owner_list;
>> bool keep_dst;
>> + atomic_t filtercnt; /* Number of filters */
>> atomic_t skipswcnt; /* Number of skip_sw filters */
>> atomic_t offloadcnt; /* Number of offloaded filters */
>> unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 397c3d29659c..c750cb662142 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -411,11 +411,13 @@ static void tcf_proto_get(struct tcf_proto *tp)
>> }
>>
>> static void tcf_chain_put(struct tcf_chain *chain);
>> +static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add);
>>
>> static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
>> bool sig_destroy, struct netlink_ext_ack *extack)
>> {
>> tp->ops->destroy(tp, rtnl_held, extack);
>> + tcf_block_filter_cnt_update(tp->chain->block, &tp->counted, false);
>> if (sig_destroy)
>> tcf_proto_signal_destroyed(tp->chain, tp);
>> tcf_chain_put(tp->chain);
>> @@ -2364,6 +2366,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
>> err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>> flags, extack);
>> if (err == 0) {
>> + tcf_block_filter_cnt_update(block, &tp->counted, true);
>> tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>> RTM_NEWTFILTER, false, rtnl_held, extack);
>> tfilter_put(tp, fh);
>> @@ -3478,6 +3481,23 @@ int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
>> }
>> EXPORT_SYMBOL(tcf_exts_dump_stats);
>>
>> +static void tcf_block_filter_cnt_update(struct tcf_block *block, bool *counted, bool add)
>
> Can't you move this up to avoid forward declaration?
Sure, will do that in v2.
I had considered it, but in the end decided to keep it next to the related offloadcnt logic.
>> +{
>> + lockdep_assert_not_held(&block->cb_lock);
>> +
>> + down_write(&block->cb_lock);
>> + if (*counted != add) {
>> + if (add) {
>> + atomic_inc(&block->filtercnt);
>> + *counted = true;
>> + } else {
>> + atomic_dec(&block->filtercnt);
>> + *counted = false;
>> + }
>> + }
>> + up_write(&block->cb_lock);
>> +}
>> +
>> static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
>> {
>> if (*flags & TCA_CLS_FLAGS_IN_HW)
>> --
>> 2.43.0
>>
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby - AS42541
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter
2024-02-15 17:39 ` Jamal Hadi Salim
@ 2024-02-15 23:34 ` Asbjørn Sloth Tønnesen
2024-02-16 8:35 ` Vlad Buslov
0 siblings, 1 reply; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-15 23:34 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Cong Wang, Jiri Pirko, Daniel Borkmann, netdev, linux-kernel, llu,
Vlad Buslov, Marcelo Ricardo Leitner
Hi Jamal,
Thank you for the review.
On 2/15/24 17:39, Jamal Hadi Salim wrote:
> +Cc Vlad and Marcelo..
>
> On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>>
>> Maintain a count of skip_sw filters.
>>
>> This counter is protected by the cb_lock, and is updated
>> at the same time as offloadcnt.
>>
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>> include/net/sch_generic.h | 1 +
>> net/sched/cls_api.c | 4 ++++
>> 2 files changed, 5 insertions(+)
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 934fdb977551..46a63d1818a0 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -476,6 +476,7 @@ struct tcf_block {
>> struct flow_block flow_block;
>> struct list_head owner_list;
>> bool keep_dst;
>> + atomic_t skipswcnt; /* Number of skip_sw filters */
>> atomic_t offloadcnt; /* Number of offloaded filters */
>
> For your use case is skipswcnt ever going to be any different than offloadcnt?
No, we only use skip_sw filters, since we only use TC as a control path to
install skip_sw rules into hardware.
AFAICT offloadcnt is the sum of skip_sw filters and filters with no flags that
have been implicitly offloaded.
The reason I didn't just use offloadcnt is that I'm not sure whether it is
acceptable to treat implicitly offloaded rules without skip_sw as if they were
explicitly skip_sw. It sounds reasonable, given that filters without skip_* flags
shouldn't really care.
As a first step, I tried to only trigger the TC bypass in the cases that I was
absolutely sure would be safe.
>
> cheers,
> jamal
>
>> unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
>> unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index ca5676b2668e..397c3d29659c 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -3483,6 +3483,8 @@ static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
>> if (*flags & TCA_CLS_FLAGS_IN_HW)
>> return;
>> *flags |= TCA_CLS_FLAGS_IN_HW;
>> + if (tc_skip_sw(*flags))
>> + atomic_inc(&block->skipswcnt);
>> atomic_inc(&block->offloadcnt);
>> }
>>
>> @@ -3491,6 +3493,8 @@ static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
>> if (!(*flags & TCA_CLS_FLAGS_IN_HW))
>> return;
>> *flags &= ~TCA_CLS_FLAGS_IN_HW;
>> + if (tc_skip_sw(*flags))
>> + atomic_dec(&block->skipswcnt);
>> atomic_dec(&block->offloadcnt);
>> }
>>
>> --
>> 2.43.0
>>
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby - AS42541
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter
2024-02-15 23:34 ` Asbjørn Sloth Tønnesen
@ 2024-02-16 8:35 ` Vlad Buslov
0 siblings, 0 replies; 20+ messages in thread
From: Vlad Buslov @ 2024-02-16 8:35 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu, Marcelo Ricardo Leitner
On Thu 15 Feb 2024 at 23:34, Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
> Hi Jamal,
>
> Thank you for the review.
>
> On 2/15/24 17:39, Jamal Hadi Salim wrote:
>> +Cc Vlad and Marcelo..
>> On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> wrote:
>>>
>>> Maintain a count of skip_sw filters.
>>>
>>> This counter is protected by the cb_lock, and is updated
>>> at the same time as offloadcnt.
>>>
>>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>>> ---
>>> include/net/sch_generic.h | 1 +
>>> net/sched/cls_api.c | 4 ++++
>>> 2 files changed, 5 insertions(+)
>>>
>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>> index 934fdb977551..46a63d1818a0 100644
>>> --- a/include/net/sch_generic.h
>>> +++ b/include/net/sch_generic.h
>>> @@ -476,6 +476,7 @@ struct tcf_block {
>>> struct flow_block flow_block;
>>> struct list_head owner_list;
>>> bool keep_dst;
>>> + atomic_t skipswcnt; /* Number of skip_sw filters */
>>> atomic_t offloadcnt; /* Number of offloaded filters */
>> For your use case is skipswcnt ever going to be any different than offloadcnt?
>
> No, we only use skip_sw filters, since we only use TC as a control path to
> install skip_sw rules into hardware.
>
> AFAICT offloadcnt is the sum of skip_sw filters, and filters with no flags which
> have implicitly been offloaded.
>
> The reason that I didn't just use offloadcnt, is that I'm not sure if it is
> acceptable to treat implicitly offloaded rules without skip_sw, as if they were
> explicitly skip_sw. It sounds reasonable, given that the filters without skip_* flags
> shouldn't really care.
It is not acceptable, since there are valid use-cases where packets need
to match sw filters that are supposedly also in hw. For example, filters
with a tunnel_key set action during a neighbor update event.
>
> I tried to only trigger the TC bypass, in the cases that I was absolutely sure would
> be safe as a first step.
>
>
>> cheers,
>> jamal
>>
>>> unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
>>> unsigned int lockeddevcnt; /* Number of devs that require rtnl lock. */
>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>> index ca5676b2668e..397c3d29659c 100644
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -3483,6 +3483,8 @@ static void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
>>> if (*flags & TCA_CLS_FLAGS_IN_HW)
>>> return;
>>> *flags |= TCA_CLS_FLAGS_IN_HW;
>>> + if (tc_skip_sw(*flags))
>>> + atomic_inc(&block->skipswcnt);
>>> atomic_inc(&block->offloadcnt);
>>> }
>>>
>>> @@ -3491,6 +3493,8 @@ static void tcf_block_offload_dec(struct tcf_block *block, u32 *flags)
>>> if (!(*flags & TCA_CLS_FLAGS_IN_HW))
>>> return;
>>> *flags &= ~TCA_CLS_FLAGS_IN_HW;
>>> + if (tc_skip_sw(*flags))
>>> + atomic_dec(&block->skipswcnt);
>>> atomic_dec(&block->offloadcnt);
>>> }
>>>
>>> --
>>> 2.43.0
>>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 0/3] make skip_sw actually skip software
2024-02-15 18:00 ` [PATCH net-next 0/3] " Marcelo Ricardo Leitner
@ 2024-02-16 8:44 ` Vlad Buslov
2024-02-16 12:17 ` Asbjørn Sloth Tønnesen
1 sibling, 0 replies; 20+ messages in thread
From: Vlad Buslov @ 2024-02-16 8:44 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Asbjørn Sloth Tønnesen, Jamal Hadi Salim, Cong Wang,
Jiri Pirko, Daniel Borkmann, netdev, linux-kernel, llu
On Thu 15 Feb 2024 at 10:00, Marcelo Ricardo Leitner <mleitner@redhat.com> wrote:
> Hi,
>
> On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> ...
>> Since we use TC flower offload for the hottest
>> prefixes, and leave the long tail to Linux / the CPU,
>> we therefore need both the hardware and software
>> datapath to perform well.
>>
>> I found that skip_sw rules are quite expensive
>> in the kernel datapath, since they must be evaluated
>> and matched upon, before the kernel checks the
>> skip_sw flag.
>>
>> This patchset optimizes the case where all rules
>> are skip_sw.
>
> The talk is interesting. Yet, I don't get how it is set up.
> How do you use a dedicated block for skip_sw, and then have a
> catch-all on sw again please?
>
> I'm missing which traffic is being matched against the sw datapath. In
> theory, you have all the heavy duty filters offloaded, so the sw
> datapath should be seeing only a few packets, right?
Yeah, I also didn't get the idea here. The cited paragraphs seem to
contradict each other.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-15 16:04 ` [PATCH net-next 3/3] net: sched: make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 17:49 ` Jamal Hadi Salim
@ 2024-02-16 8:47 ` Vlad Buslov
2024-02-16 14:01 ` Asbjørn Sloth Tønnesen
1 sibling, 1 reply; 20+ messages in thread
From: Vlad Buslov @ 2024-02-16 8:47 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu
On Thu 15 Feb 2024 at 16:04, Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
> TC filters come in 3 variants:
> - no flag (no opinion, process wherever possible)
> - skip_hw (do not process filter by hardware)
> - skip_sw (do not process filter by software)
>
> However, skip_sw is implemented such that the skip_sw
> flag is only checked after the filter has been matched.
>
> IMHO it's common, when using skip_sw, to use it on all rules.
>
> So if all filters in a block are skip_sw filters, then
> we can bail early, and thus avoid having to match
> the filters just to check for the skip_sw flag.
>
> +----------------------------+--------+--------+--------+
> | Test description | Pre | Post | Rel. |
> | | kpps | kpps | chg. |
> +----------------------------+--------+--------+--------+
> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
> +----------------------------+--------+--------+--------+
> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
> +----------------------------+--------+--------+--------+
>
> perf top (100 n-m skip_sw rules - pre patch):
> 25.57% [kernel] [k] __skb_flow_dissect
> 20.77% [kernel] [k] rhashtable_jhash2
> 14.26% [kernel] [k] fl_classify
> 13.28% [kernel] [k] fl_mask_lookup
> 6.38% [kernel] [k] memset_orig
> 3.22% [kernel] [k] tcf_classify
>
> perf top (100 n-m skip_sw rules - post patch):
> 4.28% [kernel] [k] __dev_queue_xmit
> 3.80% [kernel] [k] check_preemption_disabled
> 3.68% [kernel] [k] nft_do_chain
> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
> 2.59% [kernel] [k] mlx5e_xmit
> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>
> Test setup:
> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
> Data rate measured on switch (Extreme X690), and DUT connected as
> a router on a stick, with pktgen and pktsink as VLANs.
> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>
> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> ---
> include/net/pkt_cls.h | 5 +++++
> net/core/dev.c | 3 +++
> 2 files changed, 8 insertions(+)
>
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index a4ee43f493bb..a065da4df7ff 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
> return block && block->index;
> }
>
> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> +{
> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> +}
Note that this introduces a read from a heavily contended cache line on
the data path for all classifiers, including the ones that don't support
offloads. Wonder if this is a concern for users running purely software tc.
> +
> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> {
> WARN_ON(tcf_block_shared(block));
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d8dd293a7a27..7cd014e5066e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
> if (!miniq)
> return ret;
>
> + if (tcf_block_has_skip_sw_only(miniq->block))
> + return ret;
> +
> tc_skb_cb(skb)->mru = 0;
> tc_skb_cb(skb)->post_ct = false;
> tcf_set_drop_reason(skb, *drop_reason);
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 0/3] make skip_sw actually skip software
2024-02-15 18:00 ` [PATCH net-next 0/3] " Marcelo Ricardo Leitner
2024-02-16 8:44 ` Vlad Buslov
@ 2024-02-16 12:17 ` Asbjørn Sloth Tønnesen
2024-02-16 14:46 ` Marcelo Ricardo Leitner
1 sibling, 1 reply; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-16 12:17 UTC (permalink / raw)
To: Marcelo Ricardo Leitner
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu
Hi Marcelo,
On 2/15/24 18:00, Marcelo Ricardo Leitner wrote:
> On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> ...
>> Since we use TC flower offload for the hottest
>> prefixes, and leave the long tail to Linux / the CPU,
>> we therefore need both the hardware and software
>> datapath to perform well.
>>
>> I found that skip_sw rules are quite expensive
>> in the kernel datapath, since they must be evaluated
>> and matched upon, before the kernel checks the
>> skip_sw flag.
>>
>> This patchset optimizes the case where all rules
>> are skip_sw.
>
> The talk is interesting. Yet, I don't get how it is set up.
> How do you use a dedicated block for skip_sw, and then have a
> catch-all on sw again please?
Bird installs the DFZ Internet routing table into the main kernel table
for the software datapath.
Bird also installs a subset of routing table into an aux. kernel table.
flower-route then picks up the routes from the aux. kernel table, and
installs them as TC skip_sw filters.
On these machines we don't have any non-skip_sw TC filters.
Since 2021, we have statically offloaded all inbound traffic, since the
nexthop for our IP space is always the adjacent switch, which does the
interior L3 routing. That way we could offload ~50% of the packets.
I have put an example of the static script here:
https://files.fiberby.net/ast/2024/tc_skip_sw/mlx5_static_offload.sh
And `tc filter show dev enp5s0f0np0 ingress` after running the script:
https://files.fiberby.net/ast/2024/tc_skip_sw/mlx_offload_demo_tc_dump.txt
> I'm missing which traffic is being matched against the sw datapath. In
> theory, you have all the heavy duty filters offloaded, so the sw
> datapath should be seeing only a few packets, right?
We are a residential ISP, so our traffic is residential Internet
traffic. We run the BGP routers as routers on a stick, so the filters
see both inbound and outbound traffic.
~50% of packets are inbound traffic, our own prefixes are therefore the
hottest prefixes. Most streaming traffic is handled internally, and is
therefore not seen on our core routers. We regularly have 5%-10% of all
outbound traffic going towards the same prefix, and have 50% of outbound
traffic distributed across just a few prefixes.
We currently only offload our own prefixes, and a select few other known
high-traffic prefixes.
The goal is to offload the majority of the traffic, but it is still early
days for flower-route, and I first need to implement a smarter chain layout
and dynamic filter placement based on hardware counters.
Even when I get flower-route to offload almost all traffic, there will still
be a long tail of prefixes not in hardware, so the kernel still needs
to not be pulled down by the offloaded filters.
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby - AS42541
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
2024-02-15 17:39 ` Jamal Hadi Salim
@ 2024-02-16 12:52 ` Jiri Pirko
1 sibling, 0 replies; 20+ messages in thread
From: Jiri Pirko @ 2024-02-16 12:52 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Daniel Borkmann, netdev,
linux-kernel, llu
Thu, Feb 15, 2024 at 05:04:42PM CET, ast@fiberby.net wrote:
>Maintain a count of skip_sw filters.
>
>This counter is protected by the cb_lock, and is updated
>at the same time as offloadcnt.
>
>Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-15 17:49 ` Jamal Hadi Salim
@ 2024-02-16 12:57 ` Jiri Pirko
2024-02-16 15:07 ` Jamal Hadi Salim
2024-02-16 13:38 ` Asbjørn Sloth Tønnesen
1 sibling, 1 reply; 20+ messages in thread
From: Jiri Pirko @ 2024-02-16 12:57 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Asbjørn Sloth Tønnesen, Cong Wang, Daniel Borkmann,
netdev, linux-kernel, llu, Vlad Buslov, Marcelo Ricardo Leitner
Thu, Feb 15, 2024 at 06:49:05PM CET, jhs@mojatatu.com wrote:
>On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>>
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag is only checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>> +----------------------------+--------+--------+--------+
>> | Test description | Pre | Post | Rel. |
>> | | kpps | kpps | chg. |
>> +----------------------------+--------+--------+--------+
>> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
>> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
>> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
>> +----------------------------+--------+--------+--------+
>> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
>> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
>> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
>> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
>> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
>> +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>> 25.57% [kernel] [k] __skb_flow_dissect
>> 20.77% [kernel] [k] rhashtable_jhash2
>> 14.26% [kernel] [k] fl_classify
>> 13.28% [kernel] [k] fl_mask_lookup
>> 6.38% [kernel] [k] memset_orig
>> 3.22% [kernel] [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>> 4.28% [kernel] [k] __dev_queue_xmit
>> 3.80% [kernel] [k] check_preemption_disabled
>> 3.68% [kernel] [k] nft_do_chain
>> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
>> 2.59% [kernel] [k] mlx5e_xmit
>> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
>
>The concept makes sense - but i am wondering when you have a mix of
>skip_sw and skip_hw if it makes more sense to just avoid looking up
>skip_sw at all in the s/w datapath? Potentially by separating the
>hashes for skip_sw/hw. I know it's a deeper surgery - but would be
Yeah, there could be 2 hashes: skip_sw/rest.
The rest hash is the only one that needs to be looked up in the kernel datapath.
skip_sw is just for the control path.
But is it worth the effort? I mean, until now, nobody seemed to care. If
this patchset solves the problem for this use case, I think it is enough.
In that case, I'm fine with this patch:
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>more general purpose....unless i am missing something
>
>> Test setup:
>> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>> Data rate measured on switch (Extreme X690), and DUT connected as
>> a router on a stick, with pktgen and pktsink as VLANs.
>> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
>
>Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
>the packet sizes?
>Perhaps the traffic generator is a limitation here?
>Also feels like you are doing exact matches? A sample flower rule
>would have helped.
>
>cheers,
>jamal
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>> include/net/pkt_cls.h | 5 +++++
>> net/core/dev.c | 3 +++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>> return block && block->index;
>> }
>>
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
>> +
>> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>> {
>> WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>> if (!miniq)
>> return ret;
>>
>> + if (tcf_block_has_skip_sw_only(miniq->block))
>> + return ret;
>> +
>> tc_skb_cb(skb)->mru = 0;
>> tc_skb_cb(skb)->post_ct = false;
>> tcf_set_drop_reason(skb, *drop_reason);
>> --
>> 2.43.0
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-15 17:49 ` Jamal Hadi Salim
2024-02-16 12:57 ` Jiri Pirko
@ 2024-02-16 13:38 ` Asbjørn Sloth Tønnesen
1 sibling, 0 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-16 13:38 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Cong Wang, Jiri Pirko, Daniel Borkmann, netdev, linux-kernel, llu,
Vlad Buslov, Marcelo Ricardo Leitner
Hi Jamal,
On 2/15/24 17:49, Jamal Hadi Salim wrote:
> On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>>
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag is only checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>> +----------------------------+--------+--------+--------+
>> | Test description | Pre | Post | Rel. |
>> | | kpps | kpps | chg. |
>> +----------------------------+--------+--------+--------+
>> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
>> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
>> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
>> +----------------------------+--------+--------+--------+
>> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
>> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
>> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
>> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
>> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
>> +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>> 25.57% [kernel] [k] __skb_flow_dissect
>> 20.77% [kernel] [k] rhashtable_jhash2
>> 14.26% [kernel] [k] fl_classify
>> 13.28% [kernel] [k] fl_mask_lookup
>> 6.38% [kernel] [k] memset_orig
>> 3.22% [kernel] [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>> 4.28% [kernel] [k] __dev_queue_xmit
>> 3.80% [kernel] [k] check_preemption_disabled
>> 3.68% [kernel] [k] nft_do_chain
>> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
>> 2.59% [kernel] [k] mlx5e_xmit
>> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
>
> The concept makes sense - but i am wondering when you have a mix of
> skip_sw and skip_hw if it makes more sense to just avoid looking up
> skip_sw at all in the s/w datapath? Potentially by separating the
> hashes for skip_sw/hw. I know it's a deeper surgery - but would be
> more general purpose....unless i am missing something
>
>> Test setup:
>> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>> Data rate measured on switch (Extreme X690), and DUT connected as
>> a router on a stick, with pktgen and pktsink as VLANs.
>> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
>
> Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
> the packet sizes?
> Perhaps the traffic generator is a limitation here?
> Also feels like you are doing exact matches? A sample flower rule
> would have helped.
Yeah, I would also have liked those numbers to be higher. Sorry, I forgot to mention that these are 64B packets.
Sadly, I used two machines to compensate for my lack of pktgen skills.
I know that there are faster packet generators, but I just used the in-kernel one
and haven't invested much time into it. The usual ethtool params mentioned in
the docs didn't change much, and since it was still more than enough packets for the DUT
to hit 100% CPU, I didn't pursue it further.
pktgen A: Xeon E5-1620 v2 @ 3.70GHz w/ ConnectX-6 Dx and 100G link.
pktgen B: Xeon(R) D-2123IT @ 2.20GHz w/ Intel X722 and 10G link.
Both pktgen boxes are running stock Debian bullseye kernel (v6.1)
I found this to work best:
./pktgen_sample05_flow_per_thread.sh -i enp8s0f0np0 -s 64 -d 10.53.22.3 -m 2a:11:22:33:21:11 -p 1024-65000 -t 8 -n 0
Datarates monitored through SNMP using:
snmpdelta -Cp 10 -c public -v 2c -Cs $hostname IF-MIB::ifInUcastPkts.1057 IF-MIB::ifOutUcastPkts.1057
VLAN config, MAC and IP addressing:
v21:
2a:11:22:33:21:11 - 10.53.21.1/27 - dut v21 (tagged VLAN)
2a:11:22:33:21:12 - 10.53.21.2/27 - pktgen-a (untagged)
2a:11:22:33:21:13 - 10.53.21.3/27 - pktgen-b (untagged)
v22:
2a:11:22:33:22:21 - 10.53.22.2/31 - dut v22 (tagged VLAN)
2a:11:22:33:22:22 - 10.53.22.3/31 - packet drop (untagged)
Switch MAC address table config:
# static entry
create fdb 2a:11:22:33:21:11 vlan "v21" port 57
# blackhole pktsink
create fdb 2a:11:22:33:22:22 vlan "v22" blackhole
flow control are disabled on all links:
ethtool -A $dev rx off tx off
I have uploaded the script used for the flower rules in the tests here:
https://files.fiberby.net/ast/2024/tc_skip_sw/flower_placement_tests.sh
> cheers,
> jamal
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>> include/net/pkt_cls.h | 5 +++++
>> net/core/dev.c | 3 +++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>> return block && block->index;
>> }
>>
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
>> +
>> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>> {
>> WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>> if (!miniq)
>> return ret;
>>
>> + if (tcf_block_has_skip_sw_only(miniq->block))
>> + return ret;
>> +
>> tc_skb_cb(skb)->mru = 0;
>> tc_skb_cb(skb)->post_ct = false;
>> tcf_set_drop_reason(skb, *drop_reason);
>> --
>> 2.43.0
>>
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby - AS42541
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-16 8:47 ` Vlad Buslov
@ 2024-02-16 14:01 ` Asbjørn Sloth Tønnesen
0 siblings, 0 replies; 20+ messages in thread
From: Asbjørn Sloth Tønnesen @ 2024-02-16 14:01 UTC (permalink / raw)
To: Vlad Buslov
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu
Hi Vlad,
On 2/16/24 08:47, Vlad Buslov wrote:
> On Thu 15 Feb 2024 at 16:04, Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
>> TC filters come in 3 variants:
>> - no flag (no opinion, process wherever possible)
>> - skip_hw (do not process filter by hardware)
>> - skip_sw (do not process filter by software)
>>
>> However, skip_sw is implemented such that the skip_sw
>> flag is only checked after the filter has been matched.
>>
>> IMHO it's common, when using skip_sw, to use it on all rules.
>>
>> So if all filters in a block are skip_sw filters, then
>> we can bail early, and thus avoid having to match
>> the filters just to check for the skip_sw flag.
>>
>> +----------------------------+--------+--------+--------+
>> | Test description | Pre | Post | Rel. |
>> | | kpps | kpps | chg. |
>> +----------------------------+--------+--------+--------+
>> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
>> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
>> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
>> +----------------------------+--------+--------+--------+
>> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
>> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
>> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
>> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
>> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
>> +----------------------------+--------+--------+--------+
>>
>> perf top (100 n-m skip_sw rules - pre patch):
>> 25.57% [kernel] [k] __skb_flow_dissect
>> 20.77% [kernel] [k] rhashtable_jhash2
>> 14.26% [kernel] [k] fl_classify
>> 13.28% [kernel] [k] fl_mask_lookup
>> 6.38% [kernel] [k] memset_orig
>> 3.22% [kernel] [k] tcf_classify
>>
>> perf top (100 n-m skip_sw rules - post patch):
>> 4.28% [kernel] [k] __dev_queue_xmit
>> 3.80% [kernel] [k] check_preemption_disabled
>> 3.68% [kernel] [k] nft_do_chain
>> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
>> 2.59% [kernel] [k] mlx5e_xmit
>> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
>>
>> Test setup:
>> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
>> Data rate measured on switch (Extreme X690), and DUT connected as
>> a router on a stick, with pktgen and pktsink as VLANs.
>> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
>>
>> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
>> ---
>> include/net/pkt_cls.h | 5 +++++
>> net/core/dev.c | 3 +++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>> index a4ee43f493bb..a065da4df7ff 100644
>> --- a/include/net/pkt_cls.h
>> +++ b/include/net/pkt_cls.h
>> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
>> return block && block->index;
>> }
>>
>> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
>> +{
>> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
>> +}
>
> Note that this introduces a read from a heavily contended cache line on
> the data path for all classifiers, including the ones that don't support
> offloads. Wonder if this is a concern for users running purely software tc.
Unfortunately, I don't have access to any multi-CPU machines, so I haven't been
able to test the impact of that.
Alternatively, I guess I could also maintain a static key in the counter update logic.
>> +
>> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
>> {
>> WARN_ON(tcf_block_shared(block));
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index d8dd293a7a27..7cd014e5066e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
>> if (!miniq)
>> return ret;
>>
>> + if (tcf_block_has_skip_sw_only(miniq->block))
>> + return ret;
>> +
>> tc_skb_cb(skb)->mru = 0;
>> tc_skb_cb(skb)->post_ct = false;
>> tcf_set_drop_reason(skb, *drop_reason);
>
--
Best regards
Asbjørn Sloth Tønnesen
Network Engineer
Fiberby - AS42541
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Re: [PATCH net-next 0/3] make skip_sw actually skip software
2024-02-16 12:17 ` Asbjørn Sloth Tønnesen
@ 2024-02-16 14:46 ` Marcelo Ricardo Leitner
0 siblings, 0 replies; 20+ messages in thread
From: Marcelo Ricardo Leitner @ 2024-02-16 14:46 UTC (permalink / raw)
To: Asbjørn Sloth Tønnesen
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Daniel Borkmann, netdev,
linux-kernel, llu
On Fri, Feb 16, 2024 at 12:17:28PM +0000, Asbjørn Sloth Tønnesen wrote:
> Hi Marcelo,
>
> On 2/15/24 18:00, Marcelo Ricardo Leitner wrote:
> > On Thu, Feb 15, 2024 at 04:04:41PM +0000, Asbjørn Sloth Tønnesen wrote:
> > ...
> > > Since we use TC flower offload for the hottest
> > > prefixes, and leave the long tail to Linux / the CPU,
> > > we therefore need both the hardware and software
> > > datapath to perform well.
> > >
> > > I found that skip_sw rules are quite expensive
> > > in the kernel datapath, since they must be evaluated
> > > and matched upon, before the kernel checks the
> > > skip_sw flag.
> > >
> > > This patchset optimizes the case where all rules
> > > are skip_sw.
> >
> > The talk is interesting. Yet, I don't get how it is set up.
> > How do you use a dedicated block for skip_sw, and then have a
> > catch-all on sw again please?
>
> Bird installs the DFZ Internet routing table into the main kernel table
> for the software datapath.
>
> Bird also installs a subset of routing table into an aux. kernel table.
>
> flower-route then picks up the routes from the aux. kernel table, and
> installs them as TC skip_sw filters.
>
> On these machines we don't have any non-skip_sw TC filters.
>
> Since 2021, we have statically offloaded all inbound traffic, since the
> nexthop for our IP space is always the adjacent switch, which does the
> interior L3 routing. That way we could offload ~50% of the packets.
>
> I have put an example of the static script here:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx5_static_offload.sh
>
> And `tc filter show dev enp5s0f0np0 ingress` after running the script:
> https://files.fiberby.net/ast/2024/tc_skip_sw/mlx_offload_demo_tc_dump.txt
Ahh ok. So from tc/flower perspective, you actually offload
everything. :-)
The part that was confusing to me is that what you need done in sw,
you don't do in tc sw, but rather in the IP stack itself. So
you actually offload flower filters with these, let's say, exceptions.
It seems to me a better fix for this is to use the trap action to "resume
to sw". Then even if you have traffic that triggers a miss
in hw, you could add a catch-all filter to trigger the trap.
With the catch-all idea, you may also, instead of using trap directly,
use a goto chain X. I just don't remember if you need to have a flow
in chain X that is not offloaded, or whether a nonexistent chain is enough.
These ideas are rooted in the fact that the offloading can now resume
processing at a given chain, or even at the action that triggered
the miss. With this, it should skip all the filtering that is
unnecessary in your case. IOW, instead of trying to make the filtering
smarter (the current proposal is pretty much limited to this use case,
short of using a dedicated list for skip_sw), it
resumes the processing at a better spot, with what we already
have.
One caveat with this approach is that it will cause an skb extension
to be allocated for all this traffic that is handled in sw. There's a
small performance penalty to it.
WDYT? Or maybe I missed something?
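For illustration, the catch-all variant might be configured like this (device name, chain number, and prefix are purely hypothetical, and whether the goto-chain filter offloads cleanly on a given NIC would need testing):

```shell
DEV=enp5s0f0np0   # illustrative device name

tc qdisc add dev $DEV ingress
# hot prefix handled entirely in hardware
tc filter add dev $DEV ingress prio 1 chain 0 protocol ip flower skip_sw \
    dst_ip 192.0.2.0/24 action mirred egress redirect dev $DEV
# lowest-priority catch-all: hardware misses continue at chain 100, which
# holds no offloaded filters, so processing resumes in software there
tc filter add dev $DEV ingress prio 99 chain 0 protocol ip flower skip_sw \
    action goto chain 100
```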
>
>
> > I'm missing which traffic is being matched against the sw datapath. In
> > theory, you have all the heavy duty filters offloaded, so the sw
> > datapath should be seeing only a few packets, right?
>
> We are a residential ISP, so our traffic is residential Internet
> traffic. We run the BGP routers as routers on a stick, so the filters
> see both inbound and outbound traffic.
>
> ~50% of packets are inbound traffic, our own prefixes are therefore the
> hottest prefixes. Most streaming traffic is handled internally, and is
> therefore not seen on our core routers. We regularly have 5%-10% of all
> outbound traffic going towards the same prefix, and have 50% of outbound
> traffic distributed across just a few prefixes.
>
> We currently only offload our own prefixes, and a select few other known
> high-traffic prefixes.
>
> The goal is to offload the majority of the traffic, but it is still early
> days for flower-route, and I first need to implement a smarter chain layout
> and dynamic filter placement based on hardware counters.
Cool. Btw, be aware that after a few chain jumps, performance may drop
considerably even if offloaded.
>
> Even when I get flower-route to offload almost all traffic, there will still
> be a long tail of prefixes not in hardware, so the kernel still needs
> to not be pulled down by the offloaded filters.
>
> --
> Best regards
> Asbjørn Sloth Tønnesen
> Network Engineer
> Fiberby - AS42541
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next 3/3] net: sched: make skip_sw actually skip software
2024-02-16 12:57 ` Jiri Pirko
@ 2024-02-16 15:07 ` Jamal Hadi Salim
0 siblings, 0 replies; 20+ messages in thread
From: Jamal Hadi Salim @ 2024-02-16 15:07 UTC (permalink / raw)
To: Jiri Pirko
Cc: Asbjørn Sloth Tønnesen, Cong Wang, Daniel Borkmann,
netdev, linux-kernel, llu, Vlad Buslov, Marcelo Ricardo Leitner
On Fri, Feb 16, 2024 at 7:57 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Feb 15, 2024 at 06:49:05PM CET, jhs@mojatatu.com wrote:
> >On Thu, Feb 15, 2024 at 11:06 AM Asbjørn Sloth Tønnesen <ast@fiberby.net> wrote:
> >>
> >> TC filters come in 3 variants:
> >> - no flag (no opinion, process wherever possible)
> >> - skip_hw (do not process filter by hardware)
> >> - skip_sw (do not process filter by software)
> >>
> >> However, skip_sw is implemented such that the skip_sw
> >> flag can only be checked after the filter has been matched.
> >>
> >> IMHO it's common, when using skip_sw, to use it on all rules.
> >>
> >> So if all filters in a block are skip_sw filters, then
> >> we can bail early, and thus avoid having to match
> >> the filters just to check for the skip_sw flag.
> >>
> >> +----------------------------+--------+--------+--------+
> >> | Test description | Pre | Post | Rel. |
> >> | | kpps | kpps | chg. |
> >> +----------------------------+--------+--------+--------+
> >> | basic forwarding + notrack | 1264.9 | 1277.7 | 1.01x |
> >> | switch to eswitch mode | 1067.1 | 1071.0 | 1.00x |
> >> | add ingress qdisc | 1056.0 | 1059.1 | 1.00x |
> >> +----------------------------+--------+--------+--------+
> >> | 1 non-matching rule | 927.9 | 1057.1 | 1.14x |
> >> | 10 non-matching rules | 495.8 | 1055.6 | 2.13x |
> >> | 25 non-matching rules | 280.6 | 1053.5 | 3.75x |
> >> | 50 non-matching rules | 162.0 | 1055.7 | 6.52x |
> >> | 100 non-matching rules | 87.7 | 1019.0 | 11.62x |
> >> +----------------------------+--------+--------+--------+
> >>
> >> perf top (100 n-m skip_sw rules - pre patch):
> >> 25.57% [kernel] [k] __skb_flow_dissect
> >> 20.77% [kernel] [k] rhashtable_jhash2
> >> 14.26% [kernel] [k] fl_classify
> >> 13.28% [kernel] [k] fl_mask_lookup
> >> 6.38% [kernel] [k] memset_orig
> >> 3.22% [kernel] [k] tcf_classify
> >>
> >> perf top (100 n-m skip_sw rules - post patch):
> >> 4.28% [kernel] [k] __dev_queue_xmit
> >> 3.80% [kernel] [k] check_preemption_disabled
> >> 3.68% [kernel] [k] nft_do_chain
> >> 3.08% [kernel] [k] __netif_receive_skb_core.constprop.0
> >> 2.59% [kernel] [k] mlx5e_xmit
> >> 2.48% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
> >>
> >
> >The concept makes sense - but i am wondering when you have a mix of
> >skip_sw and skip_hw if it makes more sense to just avoid looking up
> >skip_sw at all in the s/w datapath? Potentially by separating the
> >hashes for skip_sw/hw. I know it's a deeper surgery - but would be
>
> Yeah, there could be 2 hashes: skip_sw/rest
> rest is the only one that needs to be looked-up in kernel datapath.
> skip_sw is just for control path.
>
> But is it worth the effort? I mean, until now, nobody seemed to care. If
> this patchset solves the problem for this use case, I think it is enough.
>
May not be worth the effort - and this is a reasonable use case. The
approach is a hack nonetheless and kills at least some insects. To
address the issues Vlad brought up, perhaps we should wrap it under
some kconfig.
cheers,
jamal
> In that case, I'm fine with this patch:
>
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>
>
>
> >more general purpose....unless i am missing something
> >
> >> Test setup:
> >> DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
> >> Data rate measured on switch (Extreme X690), and DUT connected as
> >> a router on a stick, with pktgen and pktsink as VLANs.
> >> Pktgen was in range 12.79 - 12.95 Mpps across all tests.
> >>
> >
> >Hrm. Those are "tiny" numbers (25G @64B is about 3x that). What are
> >the packet sizes?
> >Perhaps the traffic generator is a limitation here?
> >Also feels like you are doing exact matches? A sample flower rule
> >would have helped.
> >
> >cheers,
> >jamal
> >> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
> >> ---
> >> include/net/pkt_cls.h | 5 +++++
> >> net/core/dev.c | 3 +++
> >> 2 files changed, 8 insertions(+)
> >>
> >> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> >> index a4ee43f493bb..a065da4df7ff 100644
> >> --- a/include/net/pkt_cls.h
> >> +++ b/include/net/pkt_cls.h
> >> @@ -74,6 +74,11 @@ static inline bool tcf_block_non_null_shared(struct tcf_block *block)
> >> return block && block->index;
> >> }
> >>
> >> +static inline bool tcf_block_has_skip_sw_only(struct tcf_block *block)
> >> +{
> >> + return block && atomic_read(&block->filtercnt) == atomic_read(&block->skipswcnt);
> >> +}
> >> +
> >> static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
> >> {
> >> WARN_ON(tcf_block_shared(block));
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index d8dd293a7a27..7cd014e5066e 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3910,6 +3910,9 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb,
> >> if (!miniq)
> >> return ret;
> >>
> >> + if (tcf_block_has_skip_sw_only(miniq->block))
> >> + return ret;
> >> +
> >> tc_skb_cb(skb)->mru = 0;
> >> tc_skb_cb(skb)->post_ct = false;
> >> tcf_set_drop_reason(skb, *drop_reason);
> >> --
> >> 2.43.0
> >>
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2024-02-16 15:07 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-15 16:04 [PATCH net-next 0/3] make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 1/3] net: sched: cls_api: add skip_sw counter Asbjørn Sloth Tønnesen
2024-02-15 17:39 ` Jamal Hadi Salim
2024-02-15 23:34 ` Asbjørn Sloth Tønnesen
2024-02-16 8:35 ` Vlad Buslov
2024-02-16 12:52 ` Jiri Pirko
2024-02-15 16:04 ` [PATCH net-next 2/3] net: sched: cls_api: add filter counter Asbjørn Sloth Tønnesen
2024-02-15 17:25 ` Jiri Pirko
2024-02-15 23:19 ` Asbjørn Sloth Tønnesen
2024-02-15 16:04 ` [PATCH net-next 3/3] net: sched: make skip_sw actually skip software Asbjørn Sloth Tønnesen
2024-02-15 17:49 ` Jamal Hadi Salim
2024-02-16 12:57 ` Jiri Pirko
2024-02-16 15:07 ` Jamal Hadi Salim
2024-02-16 13:38 ` Asbjørn Sloth Tønnesen
2024-02-16 8:47 ` Vlad Buslov
2024-02-16 14:01 ` Asbjørn Sloth Tønnesen
2024-02-15 18:00 ` [PATCH net-next 0/3] " Marcelo Ricardo Leitner
2024-02-16 8:44 ` Vlad Buslov
2024-02-16 12:17 ` Asbjørn Sloth Tønnesen
2024-02-16 14:46 ` Marcelo Ricardo Leitner