* [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
@ 2023-02-05 15:49 Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 1/7] " Paul Blakey
` (7 more replies)
0 siblings, 8 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
Hi,
This series adds support for hardware miss to instruct tc to continue execution
in a specific tc action instance on a filter's action list. The mlx5 driver patch
(besides the refactors) shows its usage, instead of relying on chain restore alone.
Currently a filter's action list must be executed all together or
not at all, as drivers are only able to tell tc to continue executing from a
specific tc chain, and not from a specific filter/action.
This is troublesome with regard to action CT, where new connections should
be sent to software (via tc chain restore), while established connections can
be handled in hardware.
Checking for new connections is done when executing the ct action in hardware
(by checking the packet's tuple against the known established tuples).
But if there is a packet modification (pedit) action before action CT and the
checked tuple belongs to a new connection, hardware needs to revert the previous
packet modifications before sending the packet back to software, so that it can
re-match the same tc filter in software and re-execute its CT action.
The following is an example configuration of stateless NAT
on the mlx5 driver that isn't supported before this patchset:
#Setup corresponding mlx5 VFs in namespaces
$ ip netns add ns0
$ ip netns add ns1
$ ip link set dev enp8s0f0v0 netns ns0
$ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
$ ip link set dev enp8s0f0v1 netns ns1
$ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
#Setup tc ARP and CT rules on mlx5 VF representors
$ tc qdisc add dev enp8s0f0_0 ingress
$ tc qdisc add dev enp8s0f0_1 ingress
$ ifconfig enp8s0f0_0 up
$ ifconfig enp8s0f0_1 up
#Original side
$ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
ct_state -trk ip_proto tcp dst_port 8888 \
action pedit ex munge tcp dport set 5001 pipe \
action csum ip tcp pipe \
action ct pipe \
action goto chain 1
$ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
ct_state +trk+est \
action mirred egress redirect dev enp8s0f0_1
$ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
ct_state +trk+new \
action ct commit pipe \
action mirred egress redirect dev enp8s0f0_1
$ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
action mirred egress redirect dev enp8s0f0_1
#Reply side
$ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
action mirred egress redirect dev enp8s0f0_0
$ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
ct_state -trk ip_proto tcp \
action ct pipe \
action pedit ex munge tcp sport set 8888 pipe \
action csum ip tcp pipe \
action mirred egress redirect dev enp8s0f0_0
#Run traffic
$ ip netns exec ns1 iperf -s -p 5001&
$ sleep 2 #wait for iperf to fully open
$ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
#dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
$ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
Sent hardware 9310116832 bytes 6149672 pkt
Sent hardware 9310116832 bytes 6149672 pkt
Sent hardware 9310116832 bytes 6149672 pkt
For a new connection hitting the first filter in hardware, the pedit action
first rewrites the dst port, and then the ct action is executed. Because this
is a new connection, hardware needs to send the packet back to software, on
chain 0, so the first filter can be executed again in software. The dst port
rewrite must be reverted first, otherwise the packet won't re-match the old
dst port in the first filter. Because of that, the mlx5 driver currently
rejects offloading the above ct action rule.
This series adds support for partial offload of a filter's action list,
letting tc software continue processing from the specific action instance
where hardware left off (in the above case, after the "action pedit ex munge tcp
dport ..." of the first rule), allowing support for scenarios such as the above.
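For illustration only (not code from this series), a driver that saved an
offloaded action's miss_cookie could report a miss at that exact action back
to tc roughly as below, using the tc skb extension fields added in patch 1.
The helper name and the saved_miss_cookie variable are hypothetical:

	/* Hypothetical driver RX path: packet missed in hardware right after
	 * an offloaded action (e.g. the pedit of the first filter above).
	 */
	static bool hw_report_act_miss(struct sk_buff *skb, u64 saved_miss_cookie)
	{
		struct tc_skb_ext *ext;

		ext = tc_skb_ext_alloc(skb);
		if (!ext)
			return false;

		/* saved_miss_cookie was copied from the flow_action entry at
		 * offload time; tcf_classify() will resume from the action
		 * instance it identifies.
		 */
		ext->act_miss_cookie = saved_miss_cookie;
		ext->act_miss = 1;
		return true;
	}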
Changelog:
v1->v2:
Fixed compilation without CONFIG_NET_CLS
Cover letter re-write
v2->v3:
Unlock spin_lock on error in cls flower filter handle refactor
Cover letter
v3->v4:
Silence warning by clang
v4->v5:
Cover letter example
Removed ifdef as much as possible by using inline stubs
v5->v6:
Removed new inlines in cls_api.c (bot complained in patchwork)
Added reviewed-by/ack - Thanks!
v6->v7:
Removed WARN_ON from pkt path (leon)
Removed unnecessary return in void func
v7->v8:
Removed #if IS_ENABLED() around skb ext usage by adding Kconfig changes
Moved complex variable init to separate lines
Converted if/else if chains to switch/case
Paul Blakey (7):
net/sched: cls_api: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: flower: Support hardware miss to tc action
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/mlx5: Refactor tc miss handling to a single function
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5e: TC, Set CT miss to the specific ct action instance
.../net/ethernet/mellanox/mlx5/core/Kconfig | 2 +-
.../ethernet/mellanox/mlx5/core/en/rep/tc.c | 225 ++------------
.../mellanox/mlx5/core/en/tc/sample.c | 2 +-
.../ethernet/mellanox/mlx5/core/en/tc_ct.c | 32 +-
.../ethernet/mellanox/mlx5/core/en/tc_ct.h | 2 +
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 4 +-
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 280 ++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en_tc.h | 21 +-
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 2 +
.../mellanox/mlx5/core/lib/fs_chains.c | 14 +-
include/linux/skbuff.h | 6 +-
include/net/flow_offload.h | 1 +
include/net/pkt_cls.h | 34 ++-
include/net/sch_generic.h | 2 +
net/openvswitch/flow.c | 3 +-
net/sched/act_api.c | 2 +-
net/sched/cls_api.c | 213 ++++++++++++-
net/sched/cls_flower.c | 73 +++--
18 files changed, 601 insertions(+), 317 deletions(-)
--
2.30.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH net-next v8 1/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 2/7] net/sched: flower: Move filter handle initialization earlier Paul Blakey
` (6 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Simon Horman
For drivers to support partial offload of a filter's action list,
add support for an action miss to specify the action instance from
which software should continue processing.
CT action in particular can't be fully offloaded, as new connections
need to be handled in software. This imposes other limitations on
the actions that can be offloaded together with the CT action, such
as packet modifications.
Assign each action on a filter's action list a unique miss_cookie,
which drivers can then use to fill the action_miss part of the tc skb
extension. On getting back this miss_cookie, find the action
instance with the relevant cookie and continue classifying from there.
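For reference, a driver parsing an offloaded rule would now see a per-entry
cookie it can keep next to its hardware state; a rough sketch of the consumer
side is below, where 'rule' is the struct flow_rule from the classifier
offload request, and save_miss_cookie_for_hw_slot() and 'priv' are
hypothetical driver helpers (the rest is the existing flow_offload API):

	const struct flow_action_entry *act;
	int i;

	flow_action_for_each(i, act, &rule->action) {
		/* act->miss_cookie uniquely identifies this action instance.
		 * Save it so a hardware miss at this exact point can later
		 * be reported back through the tc skb extension
		 * (act_miss_cookie/act_miss).
		 */
		save_miss_cookie_for_hw_slot(priv, i, act->miss_cookie);
	}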
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/linux/skbuff.h | 6 +-
include/net/flow_offload.h | 1 +
include/net/pkt_cls.h | 34 +++---
include/net/sch_generic.h | 2 +
net/openvswitch/flow.c | 3 +-
net/sched/act_api.c | 2 +-
net/sched/cls_api.c | 213 +++++++++++++++++++++++++++++++++++--
7 files changed, 234 insertions(+), 27 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ba12185f43e..5804ab031c96 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -311,12 +311,16 @@ struct nf_bridge_info {
* and read by ovs to recirc_id.
*/
struct tc_skb_ext {
- __u32 chain;
+ union {
+ u64 act_miss_cookie;
+ __u32 chain;
+ };
__u16 mru;
__u16 zone;
u8 post_ct:1;
u8 post_ct_snat:1;
u8 post_ct_dnat:1;
+ u8 act_miss:1; /* Set if act_miss_cookie is used */
};
#endif
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 0400a0ac8a29..88db7346eb7a 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -228,6 +228,7 @@ void flow_action_cookie_destroy(struct flow_action_cookie *cookie);
struct flow_action_entry {
enum flow_action_id id;
u32 hw_index;
+ u64 miss_cookie;
enum flow_action_hw_stats hw_stats;
action_destr destructor;
void *destructor_priv;
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 4cabb32a2ad9..7346a90e0842 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -59,6 +59,8 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
void tcf_block_put(struct tcf_block *block);
void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
struct tcf_block_ext_info *ei);
+int tcf_exts_init_ex(struct tcf_exts *exts, struct net *net, int action,
+ int police, struct tcf_proto *tp, u32 handle, bool used_action_miss);
static inline bool tcf_block_shared(struct tcf_block *block)
{
@@ -229,6 +231,7 @@ struct tcf_exts {
struct tc_action **actions;
struct net *net;
netns_tracker ns_tracker;
+ struct tcf_exts_miss_cookie_node *miss_cookie_node;
#endif
/* Map to export classifier specific extension TLV types to the
* generic extensions API. Unsupported extensions must be set to 0.
@@ -240,21 +243,11 @@ struct tcf_exts {
static inline int tcf_exts_init(struct tcf_exts *exts, struct net *net,
int action, int police)
{
-#ifdef CONFIG_NET_CLS_ACT
- exts->type = 0;
- exts->nr_actions = 0;
- /* Note: we do not own yet a reference on net.
- * This reference might be taken later from tcf_exts_get_net().
- */
- exts->net = net;
- exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
- GFP_KERNEL);
- if (!exts->actions)
- return -ENOMEM;
+#ifdef CONFIG_NET_CLS
+ return tcf_exts_init_ex(exts, net, action, police, NULL, 0, false);
+#else
+ return -EOPNOTSUPP;
#endif
- exts->action = action;
- exts->police = police;
- return 0;
}
/* Return false if the netns is being destroyed in cleanup_net(). Callers
@@ -353,6 +346,18 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
return TC_ACT_OK;
}
+static inline int
+tcf_exts_exec_ex(struct sk_buff *skb, struct tcf_exts *exts, int act_index,
+ struct tcf_result *res)
+{
+#ifdef CONFIG_NET_CLS_ACT
+ return tcf_action_exec(skb, exts->actions + act_index,
+ exts->nr_actions - act_index, res);
+#else
+ return TC_ACT_OK;
+#endif
+}
+
int tcf_exts_validate(struct net *net, struct tcf_proto *tp,
struct nlattr **tb, struct nlattr *rate_tlv,
struct tcf_exts *exts, u32 flags,
@@ -577,6 +582,7 @@ int tc_setup_offload_action(struct flow_action *flow_action,
void tc_cleanup_offload_action(struct flow_action *flow_action);
int tc_setup_action(struct flow_action *flow_action,
struct tc_action *actions[],
+ u32 miss_cookie_base,
struct netlink_ext_ack *extack);
int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index af4aa66aaa4e..fab5ba3e61b7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -369,6 +369,8 @@ struct tcf_proto_ops {
struct nlattr **tca,
struct netlink_ext_ack *extack);
void (*tmplt_destroy)(void *tmplt_priv);
+ struct tcf_exts * (*get_exts)(const struct tcf_proto *tp,
+ u32 handle);
/* rtnetlink specific */
int (*dump)(struct net*, struct tcf_proto*, void *,
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index e20d1a973417..69f91460a55c 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -1038,7 +1038,8 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
if (tc_skb_ext_tc_enabled()) {
tc_ext = skb_ext_find(skb, TC_SKB_EXT);
- key->recirc_id = tc_ext ? tc_ext->chain : 0;
+ key->recirc_id = tc_ext && !tc_ext->act_miss ?
+ tc_ext->chain : 0;
OVS_CB(skb)->mru = tc_ext ? tc_ext->mru : 0;
post_ct = tc_ext ? tc_ext->post_ct : false;
post_ct_snat = post_ct ? tc_ext->post_ct_snat : false;
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index cd09ef49df22..16fd3d30eb12 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -272,7 +272,7 @@ static int tcf_action_offload_add_ex(struct tc_action *action,
if (err)
goto fl_err;
- err = tc_setup_action(&fl_action->action, actions, extack);
+ err = tc_setup_action(&fl_action->action, actions, 0, extack);
if (err) {
NL_SET_ERR_MSG_MOD(extack,
"Failed to setup tc actions for offload");
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 5b4a95e8a1ee..8ff9530fef68 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -22,6 +22,7 @@
#include <linux/idr.h>
#include <linux/jhash.h>
#include <linux/rculist.h>
+#include <linux/rhashtable.h>
#include <net/net_namespace.h>
#include <net/sock.h>
#include <net/netlink.h>
@@ -50,6 +51,109 @@ static LIST_HEAD(tcf_proto_base);
/* Protects list of registered TC modules. It is pure SMP lock. */
static DEFINE_RWLOCK(cls_mod_lock);
+static struct xarray tcf_exts_miss_cookies_xa;
+struct tcf_exts_miss_cookie_node {
+ const struct tcf_chain *chain;
+ const struct tcf_proto *tp;
+ const struct tcf_exts *exts;
+ u32 chain_index;
+ u32 tp_prio;
+ u32 handle;
+ u32 miss_cookie_base;
+ struct rcu_head rcu;
+};
+
+/* Each tc action entry cookie will be comprised of 32bit miss_cookie_base +
+ * action index in the exts tc actions array.
+ */
+union tcf_exts_miss_cookie {
+ struct {
+ u32 miss_cookie_base;
+ u32 act_index;
+ };
+ u64 miss_cookie;
+};
+
+#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
+static int
+tcf_exts_miss_cookie_base_alloc(struct tcf_exts *exts, struct tcf_proto *tp,
+ u32 handle)
+{
+ struct tcf_exts_miss_cookie_node *n;
+ static u32 next;
+ int err;
+
+ if (WARN_ON(!handle || !tp->ops->get_exts))
+ return -EINVAL;
+
+ n = kzalloc(sizeof(*n), GFP_KERNEL);
+ if (!n)
+ return -ENOMEM;
+
+ n->chain_index = tp->chain->index;
+ n->chain = tp->chain;
+ n->tp_prio = tp->prio;
+ n->tp = tp;
+ n->exts = exts;
+ n->handle = handle;
+
+ err = xa_alloc_cyclic(&tcf_exts_miss_cookies_xa, &n->miss_cookie_base,
+ n, xa_limit_32b, &next, GFP_KERNEL);
+ if (err)
+ goto err_xa_alloc;
+
+ exts->miss_cookie_node = n;
+ return 0;
+
+err_xa_alloc:
+ kfree(n);
+ return err;
+}
+
+static void tcf_exts_miss_cookie_base_destroy(struct tcf_exts *exts)
+{
+ struct tcf_exts_miss_cookie_node *n;
+
+ if (!exts->miss_cookie_node)
+ return;
+
+ n = exts->miss_cookie_node;
+ xa_erase(&tcf_exts_miss_cookies_xa, n->miss_cookie_base);
+ kfree_rcu(n, rcu);
+}
+
+static struct tcf_exts_miss_cookie_node *
+tcf_exts_miss_cookie_lookup(u64 miss_cookie, int *act_index)
+{
+ union tcf_exts_miss_cookie mc = { .miss_cookie = miss_cookie, };
+
+ *act_index = mc.act_index;
+ return xa_load(&tcf_exts_miss_cookies_xa, mc.miss_cookie_base);
+}
+#else /* IS_ENABLED(CONFIG_NET_TC_SKB_EXT) */
+static int
+tcf_exts_miss_cookie_base_alloc(struct tcf_exts *exts, struct tcf_proto *tp,
+ u32 handle)
+{
+ return 0;
+}
+
+static void tcf_exts_miss_cookie_base_destroy(struct tcf_exts *exts)
+{
+}
+#endif /* IS_ENABLED(CONFIG_NET_TC_SKB_EXT) */
+
+static u64 tcf_exts_miss_cookie_get(u32 miss_cookie_base, int act_index)
+{
+ union tcf_exts_miss_cookie mc = { .act_index = act_index, };
+
+ if (!miss_cookie_base)
+ return 0;
+
+ mc.miss_cookie_base = miss_cookie_base;
+ return mc.miss_cookie;
+}
+
#ifdef CONFIG_NET_CLS_ACT
DEFINE_STATIC_KEY_FALSE(tc_skb_ext_tc);
EXPORT_SYMBOL(tc_skb_ext_tc);
@@ -1549,6 +1653,8 @@ static inline int __tcf_classify(struct sk_buff *skb,
const struct tcf_proto *orig_tp,
struct tcf_result *res,
bool compat_mode,
+ struct tcf_exts_miss_cookie_node *n,
+ int act_index,
u32 *last_executed_chain)
{
#ifdef CONFIG_NET_CLS_ACT
@@ -1560,13 +1666,36 @@ static inline int __tcf_classify(struct sk_buff *skb,
#endif
for (; tp; tp = rcu_dereference_bh(tp->next)) {
__be16 protocol = skb_protocol(skb, false);
- int err;
+ int err = 0;
- if (tp->protocol != protocol &&
- tp->protocol != htons(ETH_P_ALL))
- continue;
+ if (n) {
+ struct tcf_exts *exts;
+
+ if (n->tp_prio != tp->prio)
+ continue;
+
+ /* We re-lookup the tp and chain based on index instead
+ * of having hard refs and locks to them, so do a sanity
+ * check if any of tp,chain,exts was replaced by the
+ * time we got here with a cookie from hardware.
+ */
+ if (unlikely(n->tp != tp || n->tp->chain != n->chain ||
+ !tp->ops->get_exts))
+ return TC_ACT_SHOT;
+
+ exts = tp->ops->get_exts(tp, n->handle);
+ if (unlikely(!exts || n->exts != exts))
+ return TC_ACT_SHOT;
- err = tc_classify(skb, tp, res);
+ n = NULL;
+ err = tcf_exts_exec_ex(skb, exts, act_index, res);
+ } else {
+ if (tp->protocol != protocol &&
+ tp->protocol != htons(ETH_P_ALL))
+ continue;
+
+ err = tc_classify(skb, tp, res);
+ }
#ifdef CONFIG_NET_CLS_ACT
if (unlikely(err == TC_ACT_RECLASSIFY && !compat_mode)) {
first_tp = orig_tp;
@@ -1582,6 +1711,9 @@ static inline int __tcf_classify(struct sk_buff *skb,
return err;
}
+ if (unlikely(n))
+ return TC_ACT_SHOT;
+
return TC_ACT_UNSPEC; /* signal: continue lookup */
#ifdef CONFIG_NET_CLS_ACT
reset:
@@ -1606,21 +1738,33 @@ int tcf_classify(struct sk_buff *skb,
#if !IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
u32 last_executed_chain = 0;
- return __tcf_classify(skb, tp, tp, res, compat_mode,
+ return __tcf_classify(skb, tp, tp, res, compat_mode, NULL, 0,
&last_executed_chain);
#else
u32 last_executed_chain = tp ? tp->chain->index : 0;
+ struct tcf_exts_miss_cookie_node *n = NULL;
const struct tcf_proto *orig_tp = tp;
struct tc_skb_ext *ext;
+ int act_index = 0;
int ret;
if (block) {
ext = skb_ext_find(skb, TC_SKB_EXT);
- if (ext && ext->chain) {
+ if (ext && (ext->chain || ext->act_miss)) {
struct tcf_chain *fchain;
+ u32 chain = ext->chain;
- fchain = tcf_chain_lookup_rcu(block, ext->chain);
+ if (ext->act_miss) {
+ n = tcf_exts_miss_cookie_lookup(ext->act_miss_cookie,
+ &act_index);
+ if (!n)
+ return TC_ACT_SHOT;
+
+ chain = n->chain_index;
+ }
+
+ fchain = tcf_chain_lookup_rcu(block, chain);
if (!fchain)
return TC_ACT_SHOT;
@@ -1632,7 +1776,7 @@ int tcf_classify(struct sk_buff *skb,
}
}
- ret = __tcf_classify(skb, tp, orig_tp, res, compat_mode,
+ ret = __tcf_classify(skb, tp, orig_tp, res, compat_mode, n, act_index,
&last_executed_chain);
if (tc_skb_ext_tc_enabled()) {
@@ -3056,9 +3200,48 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
return skb->len;
}
+int tcf_exts_init_ex(struct tcf_exts *exts, struct net *net, int action,
+ int police, struct tcf_proto *tp, u32 handle,
+ bool use_action_miss)
+{
+ int err = 0;
+
+#ifdef CONFIG_NET_CLS_ACT
+ exts->type = 0;
+ exts->nr_actions = 0;
+ /* Note: we do not own yet a reference on net.
+ * This reference might be taken later from tcf_exts_get_net().
+ */
+ exts->net = net;
+ exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
+ GFP_KERNEL);
+ if (!exts->actions)
+ return -ENOMEM;
+#endif
+
+ exts->action = action;
+ exts->police = police;
+
+ if (!use_action_miss)
+ return 0;
+
+ err = tcf_exts_miss_cookie_base_alloc(exts, tp, handle);
+ if (err)
+ goto err_miss_alloc;
+
+ return 0;
+
+err_miss_alloc:
+ tcf_exts_destroy(exts);
+ return err;
+}
+EXPORT_SYMBOL(tcf_exts_init_ex);
+
void tcf_exts_destroy(struct tcf_exts *exts)
{
#ifdef CONFIG_NET_CLS_ACT
+ tcf_exts_miss_cookie_base_destroy(exts);
+
if (exts->actions) {
tcf_action_destroy(exts->actions, TCA_ACT_UNBIND);
kfree(exts->actions);
@@ -3547,6 +3730,7 @@ static int tc_setup_offload_act(struct tc_action *act,
int tc_setup_action(struct flow_action *flow_action,
struct tc_action *actions[],
+ u32 miss_cookie_base,
struct netlink_ext_ack *extack)
{
int i, j, k, index, err = 0;
@@ -3577,6 +3761,8 @@ int tc_setup_action(struct flow_action *flow_action,
for (k = 0; k < index ; k++) {
entry[k].hw_stats = tc_act_hw_stats(act->hw_stats);
entry[k].hw_index = act->tcfa_index;
+ entry[k].miss_cookie =
+ tcf_exts_miss_cookie_get(miss_cookie_base, i);
}
j += index;
@@ -3599,10 +3785,15 @@ int tc_setup_offload_action(struct flow_action *flow_action,
struct netlink_ext_ack *extack)
{
#ifdef CONFIG_NET_CLS_ACT
+ u32 miss_cookie_base;
+
if (!exts)
return 0;
- return tc_setup_action(flow_action, exts->actions, extack);
+ miss_cookie_base = exts->miss_cookie_node ?
+ exts->miss_cookie_node->miss_cookie_base : 0;
+ return tc_setup_action(flow_action, exts->actions, miss_cookie_base,
+ extack);
#else
return 0;
#endif
@@ -3770,6 +3961,8 @@ static int __init tc_filter_init(void)
if (err)
goto err_register_pernet_subsys;
+ xa_init_flags(&tcf_exts_miss_cookies_xa, XA_FLAGS_ALLOC1);
+
rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_new_tfilter, NULL,
RTNL_FLAG_DOIT_UNLOCKED);
rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_del_tfilter, NULL,
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 2/7] net/sched: flower: Move filter handle initialization earlier
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 1/7] " Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 3/7] net/sched: flower: Support hardware miss to tc action Paul Blakey
` (5 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Simon Horman
To support miss to action during hardware offload, the filter's
handle is needed when setting up the actions (tcf_exts_init()) and
before offloading.
Move filter handle initialization earlier.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
---
net/sched/cls_flower.c | 62 ++++++++++++++++++++++++------------------
1 file changed, 35 insertions(+), 27 deletions(-)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 0b15698b3531..564b862870c7 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -2192,10 +2192,6 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
INIT_LIST_HEAD(&fnew->hw_list);
refcount_set(&fnew->refcnt, 1);
- err = tcf_exts_init(&fnew->exts, net, TCA_FLOWER_ACT, 0);
- if (err < 0)
- goto errout;
-
if (tb[TCA_FLOWER_FLAGS]) {
fnew->flags = nla_get_u32(tb[TCA_FLOWER_FLAGS]);
@@ -2205,15 +2201,45 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
}
}
+ if (!fold) {
+ spin_lock(&tp->lock);
+ if (!handle) {
+ handle = 1;
+ err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+ INT_MAX, GFP_ATOMIC);
+ } else {
+ err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
+ handle, GFP_ATOMIC);
+
+ /* Filter with specified handle was concurrently
+ * inserted after initial check in cls_api. This is not
+ * necessarily an error if NLM_F_EXCL is not set in
+ * message flags. Returning EAGAIN will cause cls_api to
+ * try to update concurrently inserted rule.
+ */
+ if (err == -ENOSPC)
+ err = -EAGAIN;
+ }
+ spin_unlock(&tp->lock);
+
+ if (err)
+ goto errout;
+ }
+ fnew->handle = handle;
+
+ err = tcf_exts_init(&fnew->exts, net, TCA_FLOWER_ACT, 0);
+ if (err < 0)
+ goto errout_idr;
+
err = fl_set_parms(net, tp, fnew, mask, base, tb, tca[TCA_RATE],
tp->chain->tmplt_priv, flags, fnew->flags,
extack);
if (err)
- goto errout;
+ goto errout_idr;
err = fl_check_assign_mask(head, fnew, fold, mask);
if (err)
- goto errout;
+ goto errout_idr;
err = fl_ht_insert_unique(fnew, fold, &in_ht);
if (err)
@@ -2279,29 +2305,9 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
refcount_dec(&fold->refcnt);
__fl_put(fold);
} else {
- if (handle) {
- /* user specifies a handle and it doesn't exist */
- err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
- handle, GFP_ATOMIC);
-
- /* Filter with specified handle was concurrently
- * inserted after initial check in cls_api. This is not
- * necessarily an error if NLM_F_EXCL is not set in
- * message flags. Returning EAGAIN will cause cls_api to
- * try to update concurrently inserted rule.
- */
- if (err == -ENOSPC)
- err = -EAGAIN;
- } else {
- handle = 1;
- err = idr_alloc_u32(&head->handle_idr, fnew, &handle,
- INT_MAX, GFP_ATOMIC);
- }
- if (err)
- goto errout_hw;
+ idr_replace(&head->handle_idr, fnew, fnew->handle);
refcount_inc(&fnew->refcnt);
- fnew->handle = handle;
list_add_tail_rcu(&fnew->list, &fnew->mask->filters);
spin_unlock(&tp->lock);
}
@@ -2324,6 +2330,8 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
fnew->mask->filter_ht_params);
errout_mask:
fl_mask_put(head, fnew->mask);
+errout_idr:
+ idr_remove(&head->handle_idr, fnew->handle);
errout:
__fl_put(fnew);
errout_tb:
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 3/7] net/sched: flower: Support hardware miss to tc action
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 1/7] " Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 2/7] net/sched: flower: Move filter handle initialization earlier Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension Paul Blakey
` (4 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Simon Horman
To support hardware miss to tc action on the flower classifier,
implement the required get_exts operation for retrieving a filter's
actions, and set up the filter's exts (actions) for action miss by
giving them the filter's handle at init time.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
---
net/sched/cls_flower.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 564b862870c7..5da7f6d02e5d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -534,6 +534,15 @@ static struct cls_fl_filter *__fl_get(struct cls_fl_head *head, u32 handle)
return f;
}
+static struct tcf_exts *fl_get_exts(const struct tcf_proto *tp, u32 handle)
+{
+ struct cls_fl_head *head = rcu_dereference_bh(tp->root);
+ struct cls_fl_filter *f;
+
+ f = idr_find(&head->handle_idr, handle);
+ return f ? &f->exts : NULL;
+}
+
static int __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,
bool *last, bool rtnl_held,
struct netlink_ext_ack *extack)
@@ -2227,7 +2236,8 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
}
fnew->handle = handle;
- err = tcf_exts_init(&fnew->exts, net, TCA_FLOWER_ACT, 0);
+ err = tcf_exts_init_ex(&fnew->exts, net, TCA_FLOWER_ACT, 0, tp, handle,
+ !tc_skip_hw(fnew->flags));
if (err < 0)
goto errout_idr;
@@ -3449,6 +3459,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
.tmplt_create = fl_tmplt_create,
.tmplt_destroy = fl_tmplt_destroy,
.tmplt_dump = fl_tmplt_dump,
+ .get_exts = fl_get_exts,
.owner = THIS_MODULE,
.flags = TCF_PROTO_OPS_DOIT_UNLOCKED,
};
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
` (2 preceding siblings ...)
2023-02-05 15:49 ` [PATCH net-next v8 3/7] net/sched: flower: Support hardware miss to tc action Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-06 15:40 ` Alexander H Duyck
2023-02-05 15:49 ` [PATCH net-next v8 5/7] net/mlx5: Refactor tc miss handling to a single function Paul Blakey
` (3 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
The tc skb extension is a basic requirement for tc offload to
support correct restoration on action miss.
Depend on it.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c | 2 --
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 --
3 files changed, 1 insertion(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 26685fd0fdaa..20447b13c6bc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -85,7 +85,7 @@ config MLX5_BRIDGE
config MLX5_CLS_ACT
bool "MLX5 TC classifier action support"
- depends on MLX5_ESWITCH && NET_CLS_ACT
+ depends on MLX5_ESWITCH && NET_CLS_ACT && NET_TC_SKB_EXT
default y
help
mlx5 ConnectX offloads support for TC classifier action (NET_CLS_ACT),
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
index b08339d986d5..fcb4cf526727 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
@@ -762,7 +762,6 @@ static bool mlx5e_restore_skb_chain(struct sk_buff *skb, u32 chain, u32 reg_c1,
struct mlx5e_priv *priv = netdev_priv(skb->dev);
u32 tunnel_id = (reg_c1 >> ESW_TUN_OFFSET) & TUNNEL_ID_MASK;
-#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
if (chain) {
struct mlx5_rep_uplink_priv *uplink_priv;
struct mlx5e_rep_priv *uplink_rpriv;
@@ -784,7 +783,6 @@ static bool mlx5e_restore_skb_chain(struct sk_buff *skb, u32 chain, u32 reg_c1,
zone_restore_id))
return false;
}
-#endif /* CONFIG_NET_TC_SKB_EXT */
return mlx5e_restore_tunnel(priv, skb, tc_priv, tunnel_id);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4e6f5caf8ab6..b173c7e9e553 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -5565,7 +5565,6 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe,
struct sk_buff *skb)
{
-#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
u32 chain = 0, chain_tag, reg_b, zone_restore_id;
struct mlx5e_priv *priv = netdev_priv(skb->dev);
struct mlx5_mapped_obj mapped_obj;
@@ -5603,7 +5602,6 @@ bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe,
netdev_dbg(priv->netdev, "Invalid mapped object type: %d\n", mapped_obj.type);
return false;
}
-#endif /* CONFIG_NET_TC_SKB_EXT */
return true;
}
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 5/7] net/mlx5: Refactor tc miss handling to a single function
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
` (3 preceding siblings ...)
2023-02-05 15:49 ` [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 6/7] net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG Paul Blakey
` (2 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
Move tc miss handling code to en_tc.c, and remove
duplicate code.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
---
.../ethernet/mellanox/mlx5/core/en/rep/tc.c | 223 ++----------------
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 4 +-
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 220 +++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en_tc.h | 11 +-
4 files changed, 231 insertions(+), 227 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
index fcb4cf526727..0b84665989fb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
@@ -1,7 +1,6 @@
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2020 Mellanox Technologies. */
-#include <net/dst_metadata.h>
#include <linux/netdevice.h>
#include <linux/if_macvlan.h>
#include <linux/list.h>
@@ -665,230 +664,54 @@ void mlx5e_rep_tc_netdevice_event_unregister(struct mlx5e_rep_priv *rpriv)
mlx5e_rep_indr_block_unbind);
}
-static bool mlx5e_restore_tunnel(struct mlx5e_priv *priv, struct sk_buff *skb,
- struct mlx5e_tc_update_priv *tc_priv,
- u32 tunnel_id)
-{
- struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
- struct tunnel_match_enc_opts enc_opts = {};
- struct mlx5_rep_uplink_priv *uplink_priv;
- struct mlx5e_rep_priv *uplink_rpriv;
- struct metadata_dst *tun_dst;
- struct tunnel_match_key key;
- u32 tun_id, enc_opts_id;
- struct net_device *dev;
- int err;
-
- enc_opts_id = tunnel_id & ENC_OPTS_BITS_MASK;
- tun_id = tunnel_id >> ENC_OPTS_BITS;
-
- if (!tun_id)
- return true;
-
- uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
- uplink_priv = &uplink_rpriv->uplink_priv;
-
- err = mapping_find(uplink_priv->tunnel_mapping, tun_id, &key);
- if (err) {
- netdev_dbg(priv->netdev,
- "Couldn't find tunnel for tun_id: %d, err: %d\n",
- tun_id, err);
- return false;
- }
-
- if (enc_opts_id) {
- err = mapping_find(uplink_priv->tunnel_enc_opts_mapping,
- enc_opts_id, &enc_opts);
- if (err) {
- netdev_dbg(priv->netdev,
- "Couldn't find tunnel (opts) for tun_id: %d, err: %d\n",
- enc_opts_id, err);
- return false;
- }
- }
-
- if (key.enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
- tun_dst = __ip_tun_set_dst(key.enc_ipv4.src, key.enc_ipv4.dst,
- key.enc_ip.tos, key.enc_ip.ttl,
- key.enc_tp.dst, TUNNEL_KEY,
- key32_to_tunnel_id(key.enc_key_id.keyid),
- enc_opts.key.len);
- } else if (key.enc_control.addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
- tun_dst = __ipv6_tun_set_dst(&key.enc_ipv6.src, &key.enc_ipv6.dst,
- key.enc_ip.tos, key.enc_ip.ttl,
- key.enc_tp.dst, 0, TUNNEL_KEY,
- key32_to_tunnel_id(key.enc_key_id.keyid),
- enc_opts.key.len);
- } else {
- netdev_dbg(priv->netdev,
- "Couldn't restore tunnel, unsupported addr_type: %d\n",
- key.enc_control.addr_type);
- return false;
- }
-
- if (!tun_dst) {
- netdev_dbg(priv->netdev, "Couldn't restore tunnel, no tun_dst\n");
- return false;
- }
-
- tun_dst->u.tun_info.key.tp_src = key.enc_tp.src;
-
- if (enc_opts.key.len)
- ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
- enc_opts.key.data,
- enc_opts.key.len,
- enc_opts.key.dst_opt_type);
-
- skb_dst_set(skb, (struct dst_entry *)tun_dst);
- dev = dev_get_by_index(&init_net, key.filter_ifindex);
- if (!dev) {
- netdev_dbg(priv->netdev,
- "Couldn't find tunnel device with ifindex: %d\n",
- key.filter_ifindex);
- return false;
- }
-
- /* Set fwd_dev so we do dev_put() after datapath */
- tc_priv->fwd_dev = dev;
-
- skb->dev = dev;
-
- return true;
-}
-
-static bool mlx5e_restore_skb_chain(struct sk_buff *skb, u32 chain, u32 reg_c1,
- struct mlx5e_tc_update_priv *tc_priv)
-{
- struct mlx5e_priv *priv = netdev_priv(skb->dev);
- u32 tunnel_id = (reg_c1 >> ESW_TUN_OFFSET) & TUNNEL_ID_MASK;
-
- if (chain) {
- struct mlx5_rep_uplink_priv *uplink_priv;
- struct mlx5e_rep_priv *uplink_rpriv;
- struct tc_skb_ext *tc_skb_ext;
- struct mlx5_eswitch *esw;
- u32 zone_restore_id;
-
- tc_skb_ext = tc_skb_ext_alloc(skb);
- if (!tc_skb_ext) {
- WARN_ON(1);
- return false;
- }
- tc_skb_ext->chain = chain;
- zone_restore_id = reg_c1 & ESW_ZONE_ID_MASK;
- esw = priv->mdev->priv.eswitch;
- uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
- uplink_priv = &uplink_rpriv->uplink_priv;
- if (!mlx5e_tc_ct_restore_flow(uplink_priv->ct_priv, skb,
- zone_restore_id))
- return false;
- }
-
- return mlx5e_restore_tunnel(priv, skb, tc_priv, tunnel_id);
-}
-
-static void mlx5_rep_tc_post_napi_receive(struct mlx5e_tc_update_priv *tc_priv)
-{
- if (tc_priv->fwd_dev)
- dev_put(tc_priv->fwd_dev);
-}
-
-static void mlx5e_restore_skb_sample(struct mlx5e_priv *priv, struct sk_buff *skb,
- struct mlx5_mapped_obj *mapped_obj,
- struct mlx5e_tc_update_priv *tc_priv)
-{
- if (!mlx5e_restore_tunnel(priv, skb, tc_priv, mapped_obj->sample.tunnel_id)) {
- netdev_dbg(priv->netdev,
- "Failed to restore tunnel info for sampled packet\n");
- return;
- }
- mlx5e_tc_sample_skb(skb, mapped_obj);
- mlx5_rep_tc_post_napi_receive(tc_priv);
-}
-
-static bool mlx5e_restore_skb_int_port(struct mlx5e_priv *priv, struct sk_buff *skb,
- struct mlx5_mapped_obj *mapped_obj,
- struct mlx5e_tc_update_priv *tc_priv,
- bool *forward_tx,
- u32 reg_c1)
-{
- u32 tunnel_id = (reg_c1 >> ESW_TUN_OFFSET) & TUNNEL_ID_MASK;
- struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
- struct mlx5_rep_uplink_priv *uplink_priv;
- struct mlx5e_rep_priv *uplink_rpriv;
-
- /* Tunnel restore takes precedence over int port restore */
- if (tunnel_id)
- return mlx5e_restore_tunnel(priv, skb, tc_priv, tunnel_id);
-
- uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
- uplink_priv = &uplink_rpriv->uplink_priv;
-
- if (mlx5e_tc_int_port_dev_fwd(uplink_priv->int_port_priv, skb,
- mapped_obj->int_port_metadata, forward_tx)) {
- /* Set fwd_dev for future dev_put */
- tc_priv->fwd_dev = skb->dev;
-
- return true;
- }
-
- return false;
-}
-
void mlx5e_rep_tc_receive(struct mlx5_cqe64 *cqe, struct mlx5e_rq *rq,
struct sk_buff *skb)
{
- u32 reg_c1 = be32_to_cpu(cqe->ft_metadata);
+ u32 reg_c0, reg_c1, zone_restore_id, tunnel_id;
struct mlx5e_tc_update_priv tc_priv = {};
- struct mlx5_mapped_obj mapped_obj;
+ struct mlx5_rep_uplink_priv *uplink_priv;
+ struct mlx5e_rep_priv *uplink_rpriv;
+ struct mlx5_tc_ct_priv *ct_priv;
+ struct mapping_ctx *mapping_ctx;
struct mlx5_eswitch *esw;
- bool forward_tx = false;
struct mlx5e_priv *priv;
- u32 reg_c0;
- int err;
reg_c0 = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);
if (!reg_c0 || reg_c0 == MLX5_FS_DEFAULT_FLOW_TAG)
goto forward;
- /* If reg_c0 is not equal to the default flow tag then skb->mark
+ /* If mapped_obj_id is not equal to the default flow tag then skb->mark
* is not supported and must be reset back to 0.
*/
skb->mark = 0;
priv = netdev_priv(skb->dev);
esw = priv->mdev->priv.eswitch;
- err = mapping_find(esw->offloads.reg_c0_obj_pool, reg_c0, &mapped_obj);
- if (err) {
- netdev_dbg(priv->netdev,
- "Couldn't find mapped object for reg_c0: %d, err: %d\n",
- reg_c0, err);
- goto free_skb;
- }
+ mapping_ctx = esw->offloads.reg_c0_obj_pool;
+ reg_c1 = be32_to_cpu(cqe->ft_metadata);
+ zone_restore_id = reg_c1 & ESW_ZONE_ID_MASK;
+ tunnel_id = (reg_c1 >> ESW_TUN_OFFSET) & TUNNEL_ID_MASK;
- if (mapped_obj.type == MLX5_MAPPED_OBJ_CHAIN) {
- if (!mlx5e_restore_skb_chain(skb, mapped_obj.chain, reg_c1, &tc_priv) &&
- !mlx5_ipsec_is_rx_flow(cqe))
- goto free_skb;
- } else if (mapped_obj.type == MLX5_MAPPED_OBJ_SAMPLE) {
- mlx5e_restore_skb_sample(priv, skb, &mapped_obj, &tc_priv);
- goto free_skb;
- } else if (mapped_obj.type == MLX5_MAPPED_OBJ_INT_PORT_METADATA) {
- if (!mlx5e_restore_skb_int_port(priv, skb, &mapped_obj, &tc_priv,
- &forward_tx, reg_c1))
- goto free_skb;
- } else {
- netdev_dbg(priv->netdev, "Invalid mapped object type: %d\n", mapped_obj.type);
+ uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
+ uplink_priv = &uplink_rpriv->uplink_priv;
+ ct_priv = uplink_priv->ct_priv;
+
+ if (!mlx5_ipsec_is_rx_flow(cqe) &&
+ !mlx5e_tc_update_skb(cqe, skb, mapping_ctx, reg_c0, ct_priv, zone_restore_id, tunnel_id,
+ &tc_priv))
goto free_skb;
- }
forward:
- if (forward_tx)
+ if (tc_priv.skb_done)
+ goto free_skb;
+
+ if (tc_priv.forward_tx)
dev_queue_xmit(skb);
else
napi_gro_receive(rq->cq.napi, skb);
- mlx5_rep_tc_post_napi_receive(&tc_priv);
+ if (tc_priv.fwd_dev)
+ dev_put(tc_priv.fwd_dev);
return;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index a9473a51edc1..fea0c2aa95e2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1792,7 +1792,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
if (mlx5e_cqe_regb_chain(cqe))
- if (!mlx5e_tc_update_skb(cqe, skb)) {
+ if (!mlx5e_tc_update_skb_nic(cqe, skb)) {
dev_kfree_skb_any(skb);
goto free_wqe;
}
@@ -2259,7 +2259,7 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
if (mlx5e_cqe_regb_chain(cqe))
- if (!mlx5e_tc_update_skb(cqe, skb)) {
+ if (!mlx5e_tc_update_skb_nic(cqe, skb)) {
dev_kfree_skb_any(skb);
goto mpwrq_cqe_out;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index b173c7e9e553..a6399dc870c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -43,6 +43,7 @@
#include <net/ipv6_stubs.h>
#include <net/bareudp.h>
#include <net/bonding.h>
+#include <net/dst_metadata.h>
#include "en.h"
#include "en/tc/post_act.h"
#include "en_rep.h"
@@ -5562,46 +5563,219 @@ int mlx5e_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
}
}
-bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe,
- struct sk_buff *skb)
+static bool mlx5e_tc_restore_tunnel(struct mlx5e_priv *priv, struct sk_buff *skb,
+ struct mlx5e_tc_update_priv *tc_priv,
+ u32 tunnel_id)
{
- u32 chain = 0, chain_tag, reg_b, zone_restore_id;
- struct mlx5e_priv *priv = netdev_priv(skb->dev);
- struct mlx5_mapped_obj mapped_obj;
- struct tc_skb_ext *tc_skb_ext;
- struct mlx5e_tc_table *tc;
+ struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+ struct tunnel_match_enc_opts enc_opts = {};
+ struct mlx5_rep_uplink_priv *uplink_priv;
+ struct mlx5e_rep_priv *uplink_rpriv;
+ struct metadata_dst *tun_dst;
+ struct tunnel_match_key key;
+ u32 tun_id, enc_opts_id;
+ struct net_device *dev;
int err;
- reg_b = be32_to_cpu(cqe->ft_metadata);
- tc = mlx5e_fs_get_tc(priv->fs);
- chain_tag = reg_b & MLX5E_TC_TABLE_CHAIN_TAG_MASK;
+ enc_opts_id = tunnel_id & ENC_OPTS_BITS_MASK;
+ tun_id = tunnel_id >> ENC_OPTS_BITS;
+
+ if (!tun_id)
+ return true;
+
+ uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
+ uplink_priv = &uplink_rpriv->uplink_priv;
- err = mapping_find(tc->mapping, chain_tag, &mapped_obj);
+ err = mapping_find(uplink_priv->tunnel_mapping, tun_id, &key);
if (err) {
netdev_dbg(priv->netdev,
- "Couldn't find chain for chain tag: %d, err: %d\n",
- chain_tag, err);
+ "Couldn't find tunnel for tun_id: %d, err: %d\n",
+ tun_id, err);
+ return false;
+ }
+
+ if (enc_opts_id) {
+ err = mapping_find(uplink_priv->tunnel_enc_opts_mapping,
+ enc_opts_id, &enc_opts);
+ if (err) {
+ netdev_dbg(priv->netdev,
+ "Couldn't find tunnel (opts) for tun_id: %d, err: %d\n",
+ enc_opts_id, err);
+ return false;
+ }
+ }
+
+ switch (key.enc_control.addr_type) {
+ case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+ tun_dst = __ip_tun_set_dst(key.enc_ipv4.src, key.enc_ipv4.dst,
+ key.enc_ip.tos, key.enc_ip.ttl,
+ key.enc_tp.dst, TUNNEL_KEY,
+ key32_to_tunnel_id(key.enc_key_id.keyid),
+ enc_opts.key.len);
+ break;
+ case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+ tun_dst = __ipv6_tun_set_dst(&key.enc_ipv6.src, &key.enc_ipv6.dst,
+ key.enc_ip.tos, key.enc_ip.ttl,
+ key.enc_tp.dst, 0, TUNNEL_KEY,
+ key32_to_tunnel_id(key.enc_key_id.keyid),
+ enc_opts.key.len);
+ break;
+ default:
+ netdev_dbg(priv->netdev,
+ "Couldn't restore tunnel, unsupported addr_type: %d\n",
+ key.enc_control.addr_type);
+ return false;
+ }
+
+ if (!tun_dst) {
+ netdev_dbg(priv->netdev, "Couldn't restore tunnel, no tun_dst\n");
+ return false;
+ }
+
+ tun_dst->u.tun_info.key.tp_src = key.enc_tp.src;
+
+ if (enc_opts.key.len)
+ ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
+ enc_opts.key.data,
+ enc_opts.key.len,
+ enc_opts.key.dst_opt_type);
+
+ skb_dst_set(skb, (struct dst_entry *)tun_dst);
+ dev = dev_get_by_index(&init_net, key.filter_ifindex);
+ if (!dev) {
+ netdev_dbg(priv->netdev,
+ "Couldn't find tunnel device with ifindex: %d\n",
+ key.filter_ifindex);
return false;
}
- if (mapped_obj.type == MLX5_MAPPED_OBJ_CHAIN) {
- chain = mapped_obj.chain;
+ /* Set fwd_dev so we do dev_put() after datapath */
+ tc_priv->fwd_dev = dev;
+
+ skb->dev = dev;
+
+ return true;
+}
+
+static bool mlx5e_tc_restore_skb_chain(struct sk_buff *skb, struct mlx5_tc_ct_priv *ct_priv,
+ u32 chain, u32 zone_restore_id,
+ u32 tunnel_id, struct mlx5e_tc_update_priv *tc_priv)
+{
+ struct mlx5e_priv *priv = netdev_priv(skb->dev);
+ struct tc_skb_ext *tc_skb_ext;
+
+ if (chain) {
+ if (!mlx5e_tc_ct_restore_flow(ct_priv, skb, zone_restore_id))
+ return false;
+
tc_skb_ext = tc_skb_ext_alloc(skb);
- if (WARN_ON(!tc_skb_ext))
+ if (!tc_skb_ext) {
+ WARN_ON(1);
return false;
+ }
tc_skb_ext->chain = chain;
+ }
- zone_restore_id = (reg_b >> MLX5_REG_MAPPING_MOFFSET(NIC_ZONE_RESTORE_TO_REG)) &
- ESW_ZONE_ID_MASK;
+ if (tc_priv)
+ return mlx5e_tc_restore_tunnel(priv, skb, tc_priv, tunnel_id);
- if (!mlx5e_tc_ct_restore_flow(tc->ct, skb,
- zone_restore_id))
- return false;
- } else {
+ return true;
+}
+
+static void mlx5e_tc_restore_skb_sample(struct mlx5e_priv *priv, struct sk_buff *skb,
+ struct mlx5_mapped_obj *mapped_obj,
+ struct mlx5e_tc_update_priv *tc_priv)
+{
+ if (!mlx5e_tc_restore_tunnel(priv, skb, tc_priv, mapped_obj->sample.tunnel_id)) {
+ netdev_dbg(priv->netdev,
+ "Failed to restore tunnel info for sampled packet\n");
+ return;
+ }
+ mlx5e_tc_sample_skb(skb, mapped_obj);
+}
+
+static bool mlx5e_tc_restore_skb_int_port(struct mlx5e_priv *priv, struct sk_buff *skb,
+ struct mlx5_mapped_obj *mapped_obj,
+ struct mlx5e_tc_update_priv *tc_priv,
+ u32 tunnel_id)
+{
+ struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+ struct mlx5_rep_uplink_priv *uplink_priv;
+ struct mlx5e_rep_priv *uplink_rpriv;
+ bool forward_tx = false;
+
+ /* Tunnel restore takes precedence over int port restore */
+ if (tunnel_id)
+ return mlx5e_tc_restore_tunnel(priv, skb, tc_priv, tunnel_id);
+
+ uplink_rpriv = mlx5_eswitch_get_uplink_priv(esw, REP_ETH);
+ uplink_priv = &uplink_rpriv->uplink_priv;
+
+ if (mlx5e_tc_int_port_dev_fwd(uplink_priv->int_port_priv, skb,
+ mapped_obj->int_port_metadata, &forward_tx)) {
+ /* Set fwd_dev for future dev_put */
+ tc_priv->fwd_dev = skb->dev;
+ tc_priv->forward_tx = forward_tx;
+
+ return true;
+ }
+
+ return false;
+}
+
+bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe, struct sk_buff *skb,
+ struct mapping_ctx *mapping_ctx, u32 mapped_obj_id,
+ struct mlx5_tc_ct_priv *ct_priv,
+ u32 zone_restore_id, u32 tunnel_id,
+ struct mlx5e_tc_update_priv *tc_priv)
+{
+ struct mlx5e_priv *priv = netdev_priv(skb->dev);
+ struct mlx5_mapped_obj mapped_obj;
+ int err;
+
+ err = mapping_find(mapping_ctx, mapped_obj_id, &mapped_obj);
+ if (err) {
+ netdev_dbg(skb->dev,
+ "Couldn't find mapped object for mapped_obj_id: %d, err: %d\n",
+ mapped_obj_id, err);
+ return false;
+ }
+
+ switch (mapped_obj.type) {
+ case MLX5_MAPPED_OBJ_CHAIN:
+ return mlx5e_tc_restore_skb_chain(skb, ct_priv, mapped_obj.chain, zone_restore_id,
+ tunnel_id, tc_priv);
+ case MLX5_MAPPED_OBJ_SAMPLE:
+ mlx5e_tc_restore_skb_sample(priv, skb, &mapped_obj, tc_priv);
+ tc_priv->skb_done = true;
+ return true;
+ case MLX5_MAPPED_OBJ_INT_PORT_METADATA:
+ return mlx5e_tc_restore_skb_int_port(priv, skb, &mapped_obj, tc_priv, tunnel_id);
+ default:
netdev_dbg(priv->netdev, "Invalid mapped object type: %d\n", mapped_obj.type);
return false;
}
- return true;
+ return false;
+}
+
+bool mlx5e_tc_update_skb_nic(struct mlx5_cqe64 *cqe, struct sk_buff *skb)
+{
+ struct mlx5e_priv *priv = netdev_priv(skb->dev);
+ u32 mapped_obj_id, reg_b, zone_restore_id;
+ struct mlx5_tc_ct_priv *ct_priv;
+ struct mapping_ctx *mapping_ctx;
+ struct mlx5e_tc_table *tc;
+
+ reg_b = be32_to_cpu(cqe->ft_metadata);
+ tc = mlx5e_fs_get_tc(priv->fs);
+ mapped_obj_id = reg_b & MLX5E_TC_TABLE_CHAIN_TAG_MASK;
+ zone_restore_id = (reg_b >> MLX5_REG_MAPPING_MOFFSET(NIC_ZONE_RESTORE_TO_REG)) &
+ ESW_ZONE_ID_MASK;
+ ct_priv = tc->ct;
+ mapping_ctx = tc->mapping;
+
+ return mlx5e_tc_update_skb(cqe, skb, mapping_ctx, mapped_obj_id, ct_priv, zone_restore_id,
+ 0, NULL);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index ce516dc7f3fd..4fa5d4e024cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -59,6 +59,8 @@ int mlx5e_tc_num_filters(struct mlx5e_priv *priv, unsigned long flags);
struct mlx5e_tc_update_priv {
struct net_device *fwd_dev;
+ bool skb_done;
+ bool forward_tx;
};
struct mlx5_nic_flow_attr {
@@ -386,14 +388,19 @@ static inline bool mlx5e_cqe_regb_chain(struct mlx5_cqe64 *cqe)
return false;
}
-bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe, struct sk_buff *skb);
+bool mlx5e_tc_update_skb_nic(struct mlx5_cqe64 *cqe, struct sk_buff *skb);
+bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe, struct sk_buff *skb,
+ struct mapping_ctx *mapping_ctx, u32 mapped_obj_id,
+ struct mlx5_tc_ct_priv *ct_priv,
+ u32 zone_restore_id, u32 tunnel_id,
+ struct mlx5e_tc_update_priv *tc_priv);
#else /* CONFIG_MLX5_CLS_ACT */
static inline struct mlx5e_tc_table *mlx5e_tc_table_alloc(void) { return NULL; }
static inline void mlx5e_tc_table_free(struct mlx5e_tc_table *tc) {}
static inline bool mlx5e_cqe_regb_chain(struct mlx5_cqe64 *cqe)
{ return false; }
static inline bool
-mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe, struct sk_buff *skb)
+mlx5e_tc_update_skb_nic(struct mlx5_cqe64 *cqe, struct sk_buff *skb)
{ return true; }
#endif
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 6/7] net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
` (4 preceding siblings ...)
2023-02-05 15:49 ` [PATCH net-next v8 5/7] net/mlx5: Refactor tc miss handling to a single function Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 7/7] net/mlx5e: TC, Set CT miss to the specific ct action instance Paul Blakey
2023-02-06 12:34 ` [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Ilya Maximets
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Leon Romanovsky
This register always carries a mapped object id, which does not
necessarily contain chain info.
Rename it to properly convey what it stores.
This patch doesn't change any functionality.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
.../net/ethernet/mellanox/mlx5/core/en/tc/sample.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 6 +++---
drivers/net/ethernet/mellanox/mlx5/core/en_tc.h | 4 ++--
.../ethernet/mellanox/mlx5/core/lib/fs_chains.c | 14 +++++++-------
5 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c
index f2c2c752bd1c..558a776359af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c
@@ -237,7 +237,7 @@ sample_modify_hdr_get(struct mlx5_core_dev *mdev, u32 obj_id,
int err;
err = mlx5e_tc_match_to_reg_set(mdev, mod_acts, MLX5_FLOW_NAMESPACE_FDB,
- CHAIN_TO_REG, obj_id);
+ MAPPED_OBJ_TO_REG, obj_id);
if (err)
goto err_set_regc0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index 313df8232db7..e1a2861cc13b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -1871,7 +1871,7 @@ __mlx5_tc_ct_flow_offload(struct mlx5_tc_ct_priv *ct_priv,
ct_flow->chain_mapping = chain_mapping;
err = mlx5e_tc_match_to_reg_set(priv->mdev, pre_mod_acts, ct_priv->ns_type,
- CHAIN_TO_REG, chain_mapping);
+ MAPPED_OBJ_TO_REG, chain_mapping);
if (err) {
ct_dbg("Failed to set chain register mapping");
goto err_mapping;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index a6399dc870c2..f0ce1d1ae8ad 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -105,7 +105,7 @@ struct mlx5e_tc_table {
};
struct mlx5e_tc_attr_to_reg_mapping mlx5e_tc_attr_to_reg_mappings[] = {
- [CHAIN_TO_REG] = {
+ [MAPPED_OBJ_TO_REG] = {
.mfield = MLX5_ACTION_IN_FIELD_METADATA_REG_C_0,
.moffset = 0,
.mlen = 16,
@@ -132,7 +132,7 @@ struct mlx5e_tc_attr_to_reg_mapping mlx5e_tc_attr_to_reg_mappings[] = {
* into reg_b that is passed to SW since we don't
* jump between steering domains.
*/
- [NIC_CHAIN_TO_REG] = {
+ [NIC_MAPPED_OBJ_TO_REG] = {
.mfield = MLX5_ACTION_IN_FIELD_METADATA_REG_B,
.moffset = 0,
.mlen = 16,
@@ -1585,7 +1585,7 @@ mlx5e_tc_offload_to_slow_path(struct mlx5_eswitch *esw,
goto err_get_chain;
err = mlx5e_tc_match_to_reg_set(esw->dev, &mod_acts, MLX5_FLOW_NAMESPACE_FDB,
- CHAIN_TO_REG, chain_mapping);
+ MAPPED_OBJ_TO_REG, chain_mapping);
if (err)
goto err_reg_set;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index 4fa5d4e024cd..eb985f7bdea7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -229,7 +229,7 @@ void mlx5e_tc_update_neigh_used_value(struct mlx5e_neigh_hash_entry *nhe);
void mlx5e_tc_reoffload_flows_work(struct work_struct *work);
enum mlx5e_tc_attr_to_reg {
- CHAIN_TO_REG,
+ MAPPED_OBJ_TO_REG,
VPORT_TO_REG,
TUNNEL_TO_REG,
CTSTATE_TO_REG,
@@ -238,7 +238,7 @@ enum mlx5e_tc_attr_to_reg {
MARK_TO_REG,
LABELS_TO_REG,
FTEID_TO_REG,
- NIC_CHAIN_TO_REG,
+ NIC_MAPPED_OBJ_TO_REG,
NIC_ZONE_RESTORE_TO_REG,
PACKET_COLOR_TO_REG,
};
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
index df58cba37930..81ed91fee59b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/fs_chains.c
@@ -214,7 +214,7 @@ create_chain_restore(struct fs_chain *chain)
struct mlx5_eswitch *esw = chain->chains->dev->priv.eswitch;
u8 modact[MLX5_UN_SZ_BYTES(set_add_copy_action_in_auto)] = {};
struct mlx5_fs_chains *chains = chain->chains;
- enum mlx5e_tc_attr_to_reg chain_to_reg;
+ enum mlx5e_tc_attr_to_reg mapped_obj_to_reg;
struct mlx5_modify_hdr *mod_hdr;
u32 index;
int err;
@@ -242,7 +242,7 @@ create_chain_restore(struct fs_chain *chain)
chain->id = index;
if (chains->ns == MLX5_FLOW_NAMESPACE_FDB) {
- chain_to_reg = CHAIN_TO_REG;
+ mapped_obj_to_reg = MAPPED_OBJ_TO_REG;
chain->restore_rule = esw_add_restore_rule(esw, chain->id);
if (IS_ERR(chain->restore_rule)) {
err = PTR_ERR(chain->restore_rule);
@@ -253,7 +253,7 @@ create_chain_restore(struct fs_chain *chain)
* since we write the metadata to reg_b
* that is passed to SW directly.
*/
- chain_to_reg = NIC_CHAIN_TO_REG;
+ mapped_obj_to_reg = NIC_MAPPED_OBJ_TO_REG;
} else {
err = -EINVAL;
goto err_rule;
@@ -261,12 +261,12 @@ create_chain_restore(struct fs_chain *chain)
MLX5_SET(set_action_in, modact, action_type, MLX5_ACTION_TYPE_SET);
MLX5_SET(set_action_in, modact, field,
- mlx5e_tc_attr_to_reg_mappings[chain_to_reg].mfield);
+ mlx5e_tc_attr_to_reg_mappings[mapped_obj_to_reg].mfield);
MLX5_SET(set_action_in, modact, offset,
- mlx5e_tc_attr_to_reg_mappings[chain_to_reg].moffset);
+ mlx5e_tc_attr_to_reg_mappings[mapped_obj_to_reg].moffset);
MLX5_SET(set_action_in, modact, length,
- mlx5e_tc_attr_to_reg_mappings[chain_to_reg].mlen == 32 ?
- 0 : mlx5e_tc_attr_to_reg_mappings[chain_to_reg].mlen);
+ mlx5e_tc_attr_to_reg_mappings[mapped_obj_to_reg].mlen == 32 ?
+ 0 : mlx5e_tc_attr_to_reg_mappings[mapped_obj_to_reg].mlen);
MLX5_SET(set_action_in, modact, data, chain->id);
mod_hdr = mlx5_modify_header_alloc(chains->dev, chains->ns,
1, modact);
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH net-next v8 7/7] net/mlx5e: TC, Set CT miss to the specific ct action instance
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
` (5 preceding siblings ...)
2023-02-05 15:49 ` [PATCH net-next v8 6/7] net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG Paul Blakey
@ 2023-02-05 15:49 ` Paul Blakey
2023-02-06 12:34 ` [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Ilya Maximets
7 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-05 15:49 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
Currently, CT misses restore the missed chain on the tc skb extension so
tc will continue from the relevant chain. Instead, restore the CT action's
miss cookie on the extension, which will instruct tc to continue from
this specific CT action instance on the relevant filter's action list.
Map the CT action's miss_cookie to a new miss object (ACT_MISS), and use
this miss mapping instead of the current chain miss object (CHAIN_MISS)
for CT action misses.
To restore this new miss mapping value, add an RX restore rule for each
such mapping value.
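In condensed form, the restore-side logic this patch adds looks roughly like
the following sketch (simplified from the mlx5e_tc_restore_skb_tc_meta() hunk
below; zone restore, tunnel restore and error handling are omitted):
	/* Sketch only: resume either at a chain or at a specific action. */
	static bool restore_tc_meta_sketch(struct sk_buff *skb,
					   struct mlx5_mapped_obj *obj)
	{
		struct tc_skb_ext *ext = tc_skb_ext_alloc(skb);
		if (!ext)
			return false;
		if (obj->type == MLX5_MAPPED_OBJ_ACT_MISS) {
			/* Resume tc processing at the exact CT action instance. */
			ext->act_miss_cookie = obj->act_miss_cookie;
			ext->act_miss = 1;
		} else {
			/* Previous behaviour: resume at the start of the chain. */
			ext->chain = obj->chain;
		}
		return true;
	}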
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
---
.../ethernet/mellanox/mlx5/core/en/tc_ct.c | 32 +++++-----
.../ethernet/mellanox/mlx5/core/en/tc_ct.h | 2 +
.../net/ethernet/mellanox/mlx5/core/en_tc.c | 64 +++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en_tc.h | 6 ++
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 2 +
5 files changed, 82 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index e1a2861cc13b..71d8a906add9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -59,6 +59,7 @@ struct mlx5_tc_ct_debugfs {
struct mlx5_tc_ct_priv {
struct mlx5_core_dev *dev;
+ struct mlx5e_priv *priv;
const struct net_device *netdev;
struct mod_hdr_tbl *mod_hdr_tbl;
struct xarray tuple_ids;
@@ -85,7 +86,6 @@ struct mlx5_ct_flow {
struct mlx5_flow_attr *pre_ct_attr;
struct mlx5_flow_handle *pre_ct_rule;
struct mlx5_ct_ft *ft;
- u32 chain_mapping;
};
struct mlx5_ct_zone_rule {
@@ -1441,6 +1441,7 @@ mlx5_tc_ct_parse_action(struct mlx5_tc_ct_priv *priv,
attr->ct_attr.zone = act->ct.zone;
attr->ct_attr.ct_action = act->ct.action;
attr->ct_attr.nf_ft = act->ct.flow_table;
+ attr->ct_attr.act_miss_cookie = act->miss_cookie;
return 0;
}
@@ -1778,7 +1779,7 @@ mlx5_tc_ct_del_ft_cb(struct mlx5_tc_ct_priv *ct_priv, struct mlx5_ct_ft *ft)
* + ft prio (tc chain) +
* + original match +
* +---------------------+
- * | set chain miss mapping
+ * | set act_miss_cookie mapping
* | set fte_id
* | set tunnel_id
* | do decap
@@ -1823,7 +1824,7 @@ __mlx5_tc_ct_flow_offload(struct mlx5_tc_ct_priv *ct_priv,
struct mlx5_flow_attr *pre_ct_attr;
struct mlx5_modify_hdr *mod_hdr;
struct mlx5_ct_flow *ct_flow;
- int chain_mapping = 0, err;
+ int act_miss_mapping = 0, err;
struct mlx5_ct_ft *ft;
u16 zone;
@@ -1858,22 +1859,18 @@ __mlx5_tc_ct_flow_offload(struct mlx5_tc_ct_priv *ct_priv,
pre_ct_attr->action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST |
MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
- /* Write chain miss tag for miss in ct table as we
- * don't go though all prios of this chain as normal tc rules
- * miss.
- */
- err = mlx5_chains_get_chain_mapping(ct_priv->chains, attr->chain,
- &chain_mapping);
+ err = mlx5e_tc_action_miss_mapping_get(ct_priv->priv, attr, attr->ct_attr.act_miss_cookie,
+ &act_miss_mapping);
if (err) {
- ct_dbg("Failed to get chain register mapping for chain");
- goto err_get_chain;
+ ct_dbg("Failed to get register mapping for act miss");
+ goto err_get_act_miss;
}
- ct_flow->chain_mapping = chain_mapping;
+ attr->ct_attr.act_miss_mapping = act_miss_mapping;
err = mlx5e_tc_match_to_reg_set(priv->mdev, pre_mod_acts, ct_priv->ns_type,
- MAPPED_OBJ_TO_REG, chain_mapping);
+ MAPPED_OBJ_TO_REG, act_miss_mapping);
if (err) {
- ct_dbg("Failed to set chain register mapping");
+ ct_dbg("Failed to set act miss register mapping");
goto err_mapping;
}
@@ -1937,8 +1934,8 @@ __mlx5_tc_ct_flow_offload(struct mlx5_tc_ct_priv *ct_priv,
mlx5_modify_header_dealloc(priv->mdev, pre_ct_attr->modify_hdr);
err_mapping:
mlx5e_mod_hdr_dealloc(pre_mod_acts);
- mlx5_chains_put_chain_mapping(ct_priv->chains, ct_flow->chain_mapping);
-err_get_chain:
+ mlx5e_tc_action_miss_mapping_put(ct_priv->priv, attr, act_miss_mapping);
+err_get_act_miss:
kfree(ct_flow->pre_ct_attr);
err_alloc_pre:
mlx5_tc_ct_del_ft_cb(ct_priv, ft);
@@ -1977,7 +1974,7 @@ __mlx5_tc_ct_delete_flow(struct mlx5_tc_ct_priv *ct_priv,
mlx5_tc_rule_delete(priv, ct_flow->pre_ct_rule, pre_ct_attr);
mlx5_modify_header_dealloc(priv->mdev, pre_ct_attr->modify_hdr);
- mlx5_chains_put_chain_mapping(ct_priv->chains, ct_flow->chain_mapping);
+ mlx5e_tc_action_miss_mapping_put(ct_priv->priv, attr, attr->ct_attr.act_miss_mapping);
mlx5_tc_ct_del_ft_cb(ct_priv, ct_flow->ft);
kfree(ct_flow->pre_ct_attr);
@@ -2157,6 +2154,7 @@ mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
}
spin_lock_init(&ct_priv->ht_lock);
+ ct_priv->priv = priv;
ct_priv->ns_type = ns_type;
ct_priv->chains = chains;
ct_priv->netdev = priv->netdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
index 5bbd6b92840f..5c5ddaa83055 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h
@@ -28,6 +28,8 @@ struct mlx5_ct_attr {
struct mlx5_ct_flow *ct_flow;
struct nf_flowtable *nf_ft;
u32 ct_labels_id;
+ u32 act_miss_mapping;
+ u64 act_miss_cookie;
};
#define zone_to_reg_ct {\
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index f0ce1d1ae8ad..91798291f235 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3801,6 +3801,7 @@ mlx5e_clone_flow_attr_for_post_act(struct mlx5_flow_attr *attr,
attr2->parse_attr = parse_attr;
attr2->dest_chain = 0;
attr2->dest_ft = NULL;
+ attr2->act_id_restore_rule = NULL;
if (ns_type == MLX5_FLOW_NAMESPACE_FDB) {
attr2->esw_attr->out_count = 0;
@@ -5657,14 +5658,19 @@ static bool mlx5e_tc_restore_tunnel(struct mlx5e_priv *priv, struct sk_buff *skb
return true;
}
-static bool mlx5e_tc_restore_skb_chain(struct sk_buff *skb, struct mlx5_tc_ct_priv *ct_priv,
- u32 chain, u32 zone_restore_id,
- u32 tunnel_id, struct mlx5e_tc_update_priv *tc_priv)
+static bool mlx5e_tc_restore_skb_tc_meta(struct sk_buff *skb, struct mlx5_tc_ct_priv *ct_priv,
+ struct mlx5_mapped_obj *mapped_obj, u32 zone_restore_id,
+ u32 tunnel_id, struct mlx5e_tc_update_priv *tc_priv)
{
struct mlx5e_priv *priv = netdev_priv(skb->dev);
struct tc_skb_ext *tc_skb_ext;
+ u64 act_miss_cookie;
+ u32 chain;
- if (chain) {
+ chain = mapped_obj->type == MLX5_MAPPED_OBJ_CHAIN ? mapped_obj->chain : 0;
+ act_miss_cookie = mapped_obj->type == MLX5_MAPPED_OBJ_ACT_MISS ?
+ mapped_obj->act_miss_cookie : 0;
+ if (chain || act_miss_cookie) {
if (!mlx5e_tc_ct_restore_flow(ct_priv, skb, zone_restore_id))
return false;
@@ -5674,7 +5680,12 @@ static bool mlx5e_tc_restore_skb_chain(struct sk_buff *skb, struct mlx5_tc_ct_pr
return false;
}
- tc_skb_ext->chain = chain;
+ if (act_miss_cookie) {
+ tc_skb_ext->act_miss_cookie = act_miss_cookie;
+ tc_skb_ext->act_miss = 1;
+ } else {
+ tc_skb_ext->chain = chain;
+ }
}
if (tc_priv)
@@ -5744,8 +5755,9 @@ bool mlx5e_tc_update_skb(struct mlx5_cqe64 *cqe, struct sk_buff *skb,
switch (mapped_obj.type) {
case MLX5_MAPPED_OBJ_CHAIN:
- return mlx5e_tc_restore_skb_chain(skb, ct_priv, mapped_obj.chain, zone_restore_id,
- tunnel_id, tc_priv);
+ case MLX5_MAPPED_OBJ_ACT_MISS:
+ return mlx5e_tc_restore_skb_tc_meta(skb, ct_priv, &mapped_obj, zone_restore_id,
+ tunnel_id, tc_priv);
case MLX5_MAPPED_OBJ_SAMPLE:
mlx5e_tc_restore_skb_sample(priv, skb, &mapped_obj, tc_priv);
tc_priv->skb_done = true;
@@ -5779,3 +5791,41 @@ bool mlx5e_tc_update_skb_nic(struct mlx5_cqe64 *cqe, struct sk_buff *skb)
return mlx5e_tc_update_skb(cqe, skb, mapping_ctx, mapped_obj_id, ct_priv, zone_restore_id,
0, NULL);
}
+
+int mlx5e_tc_action_miss_mapping_get(struct mlx5e_priv *priv, struct mlx5_flow_attr *attr,
+ u64 act_miss_cookie, u32 *act_miss_mapping)
+{
+ struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+ struct mlx5_mapped_obj mapped_obj = {};
+ struct mapping_ctx *ctx;
+ int err;
+
+ ctx = esw->offloads.reg_c0_obj_pool;
+
+ mapped_obj.type = MLX5_MAPPED_OBJ_ACT_MISS;
+ mapped_obj.act_miss_cookie = act_miss_cookie;
+ err = mapping_add(ctx, &mapped_obj, act_miss_mapping);
+ if (err)
+ return err;
+
+ attr->act_id_restore_rule = esw_add_restore_rule(esw, *act_miss_mapping);
+ if (IS_ERR(attr->act_id_restore_rule))
+ goto err_rule;
+
+ return 0;
+
+err_rule:
+ mapping_remove(ctx, *act_miss_mapping);
+ return err;
+}
+
+void mlx5e_tc_action_miss_mapping_put(struct mlx5e_priv *priv, struct mlx5_flow_attr *attr,
+ u32 act_miss_mapping)
+{
+ struct mlx5_eswitch *esw = priv->mdev->priv.eswitch;
+ struct mapping_ctx *ctx;
+
+ ctx = esw->offloads.reg_c0_obj_pool;
+ mlx5_del_flow_rules(attr->act_id_restore_rule);
+ mapping_remove(ctx, act_miss_mapping);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
index eb985f7bdea7..60bff94ba2b6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h
@@ -101,6 +101,7 @@ struct mlx5_flow_attr {
struct mlx5_flow_attr *branch_true;
struct mlx5_flow_attr *branch_false;
struct mlx5_flow_attr *jumping_attr;
+ struct mlx5_flow_handle *act_id_restore_rule;
/* keep this union last */
union {
DECLARE_FLEX_ARRAY(struct mlx5_esw_flow_attr, esw_attr);
@@ -404,4 +405,9 @@ mlx5e_tc_update_skb_nic(struct mlx5_cqe64 *cqe, struct sk_buff *skb)
{ return true; }
#endif
+int mlx5e_tc_action_miss_mapping_get(struct mlx5e_priv *priv, struct mlx5_flow_attr *attr,
+ u64 act_miss_cookie, u32 *act_miss_mapping);
+void mlx5e_tc_action_miss_mapping_put(struct mlx5e_priv *priv, struct mlx5_flow_attr *attr,
+ u32 act_miss_mapping);
+
#endif /* __MLX5_EN_TC_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index 5b5a215a7dc5..747981b868bd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -52,12 +52,14 @@ enum mlx5_mapped_obj_type {
MLX5_MAPPED_OBJ_CHAIN,
MLX5_MAPPED_OBJ_SAMPLE,
MLX5_MAPPED_OBJ_INT_PORT_METADATA,
+ MLX5_MAPPED_OBJ_ACT_MISS,
};
struct mlx5_mapped_obj {
enum mlx5_mapped_obj_type type;
union {
u32 chain;
+ u64 act_miss_cookie;
struct {
u32 group_id;
u32 rate;
--
2.30.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
` (6 preceding siblings ...)
2023-02-05 15:49 ` [PATCH net-next v8 7/7] net/mlx5e: TC, Set CT miss to the specific ct action instance Paul Blakey
@ 2023-02-06 12:34 ` Ilya Maximets
2023-02-06 17:14 ` Paul Blakey
7 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-02-06 12:34 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Marcelo Leitner,
i.maximets
On 2/5/23 16:49, Paul Blakey wrote:
> Hi,
>
> This series adds support for hardware miss to instruct tc to continue execution
> in a specific tc action instance on a filter's action list. The mlx5 driver patch
> (besides the refactors) shows its usage instead of using just chain restore.
>
> Currently a filter's action list must be executed all together or
> not at all as driver are only able to tell tc to continue executing from a
> specific tc chain, and not a specific filter/action.
>
> This is troublesome with regards to action CT, where new connections should
> be sent to software (via tc chain restore), and established connections can
> be handled in hardware.
>
> Checking for new connections is done when executing the ct action in hardware
> (by checking the packet's tuple against known established tuples).
> But if there is a packet modification (pedit) action before action CT and the
> checked tuple is a new connection, hardware will need to revert the previous
> packet modifications before sending it back to software so it can
> re-match the same tc filter in software and re-execute its CT action.
>
> The following is an example configuration of stateless nat
> on mlx5 driver that isn't supported before this patchet:
>
> #Setup corrosponding mlx5 VFs in namespaces
> $ ip netns add ns0
> $ ip netns add ns1
> $ ip link set dev enp8s0f0v0 netns ns0
> $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
> $ ip link set dev enp8s0f0v1 netns ns1
> $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
>
> #Setup tc arp and ct rules on mxl5 VF representors
> $ tc qdisc add dev enp8s0f0_0 ingress
> $ tc qdisc add dev enp8s0f0_1 ingress
> $ ifconfig enp8s0f0_0 up
> $ ifconfig enp8s0f0_1 up
>
> #Original side
> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
> ct_state -trk ip_proto tcp dst_port 8888 \
> action pedit ex munge tcp dport set 5001 pipe \
> action csum ip tcp pipe \
> action ct pipe \
> action goto chain 1
> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> ct_state +trk+est \
> action mirred egress redirect dev enp8s0f0_1
> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> ct_state +trk+new \
> action ct commit pipe \
> action mirred egress redirect dev enp8s0f0_1
> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
> action mirred egress redirect dev enp8s0f0_1
>
> #Reply side
> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
> action mirred egress redirect dev enp8s0f0_0
> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
> ct_state -trk ip_proto tcp \
> action ct pipe \
> action pedit ex munge tcp sport set 8888 pipe \
> action csum ip tcp pipe \
> action mirred egress redirect dev enp8s0f0_0
>
> #Run traffic
> $ ip netns exec ns1 iperf -s -p 5001&
> $ sleep 2 #wait for iperf to fully open
> $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
>
> #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
> $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
> Sent hardware 9310116832 bytes 6149672 pkt
> Sent hardware 9310116832 bytes 6149672 pkt
> Sent hardware 9310116832 bytes 6149672 pkt
>
> A new connection executing the first filter in hardware will first rewrite
> the dst port to the new port, and then the ct action is executed,
> because this is a new connection, hardware will need to be send this back
> to software, on chain 0, to execute the first filter again in software.
> The dst port needs to be reverted otherwise it won't re-match the old
> dst port in the first filter. Because of that, currently mlx5 driver will
> reject offloading the above action ct rule.
>
> This series adds supports partial offload of a filter's action list,
> and letting tc software continue processing in the specific action instance
> where hardware left off (in the above case after the "action pedit ex munge tcp
> dport... of the first rule") allowing support for scenarios such as the above.
Hi, Paul. Not sure if this was discussed before, but don't we also need
a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
flags are not per-action. This may cause confusion among users, if flows
are reported as in_hw, while they are actually partially or even mostly
processed in SW.
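For reference, the existing per-filter offload flags in
include/uapi/linux/pkt_cls.h form a single bitmask; the bit proposed above
would be an addition to it (the name and value below are only an illustrative
sketch, nothing agreed on):
	#define TCA_CLS_FLAGS_SKIP_HW	(1 << 0) /* don't offload filter to HW */
	#define TCA_CLS_FLAGS_SKIP_SW	(1 << 1) /* don't use filter in SW */
	#define TCA_CLS_FLAGS_IN_HW	(1 << 2) /* filter is offloaded to HW */
	#define TCA_CLS_FLAGS_NOT_IN_HW	(1 << 3) /* filter isn't offloaded to HW */
	#define TCA_CLS_FLAGS_VERBOSE	(1 << 4) /* verbose logging */
	/* Hypothetical new bit as discussed above, illustrative only */
	#define TCA_CLS_FLAGS_IN_HW_PARTIAL	(1 << 5) /* offloaded, but may still miss to SW */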
What do you think?
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension
2023-02-05 15:49 ` [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension Paul Blakey
@ 2023-02-06 15:40 ` Alexander H Duyck
2023-02-06 17:16 ` Paul Blakey
0 siblings, 1 reply; 21+ messages in thread
From: Alexander H Duyck @ 2023-02-06 15:40 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On Sun, 2023-02-05 at 17:49 +0200, Paul Blakey wrote:
> Tc skb extension is a basic requirement for using tc
> offload to support correct restoration on action miss.
>
> Depend on it.
>
> Signed-off-by: Paul Blakey <paulb@nvidia.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 2 +-
> drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c | 2 --
> drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 --
> 3 files changed, 1 insertion(+), 5 deletions(-)
>
So one question I had is what about the use of the SKB_EXT check in
mlx5/core/en_tc.h? Seems like you could remove that one as well since
it is wrapped in a check for MLX5_CLS_ACT before the check for
NET_TC_SKB_EXT.
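The nesting in question has roughly the following shape (illustrative only,
not a verbatim copy of en_tc.h): the inner NET_TC_SKB_EXT guard becomes
redundant once everything under MLX5_CLS_ACT already implies the tc skb
extension via the Kconfig change:
	#if IS_ENABLED(CONFIG_MLX5_CLS_ACT)
	/* example_helper() is a made-up name, just to show the #if nesting */
	static inline bool example_helper(struct sk_buff *skb, u32 reg_b)
	{
	#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)	/* now implied by MLX5_CLS_ACT */
		return reg_b != 0;
	#else
		return false;
	#endif
	}
	#endif /* CONFIG_MLX5_CLS_ACT */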
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-06 12:34 ` [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Ilya Maximets
@ 2023-02-06 17:14 ` Paul Blakey
2023-02-07 0:20 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Paul Blakey @ 2023-02-06 17:14 UTC (permalink / raw)
To: Ilya Maximets, netdev, Saeed Mahameed, Paolo Abeni,
Jakub Kicinski, Eric Dumazet, Jamal Hadi Salim, Cong Wang,
David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov, Marcelo Leitner
On 06/02/2023 14:34, Ilya Maximets wrote:
> On 2/5/23 16:49, Paul Blakey wrote:
>> [cover letter snipped]
>
>
> Hi, Paul. Not sure if this was discussed before, but don't we also need
> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>
> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> flags are not per-action. This may cause confusion among users, if flows
> are reported as in_hw, while they are actually partially or even mostly
> processed in SW.
>
> What do you think?
>
> Best regards, Ilya Maximets.
I think it's a good idea, and I'm fine with proposing something like this
in a different series, as this isn't a new problem introduced by this series;
it existed before it, at least with CT rules.
So how about I propose it in a different series and we continue with
this one first?
Thanks,
Paul.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension
2023-02-06 15:40 ` Alexander H Duyck
@ 2023-02-06 17:16 ` Paul Blakey
0 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-06 17:16 UTC (permalink / raw)
To: Alexander H Duyck, netdev, Saeed Mahameed, Paolo Abeni,
Jakub Kicinski, Eric Dumazet, Jamal Hadi Salim, Cong Wang,
David S. Miller
Cc: Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On 06/02/2023 17:40, Alexander H Duyck wrote:
> On Sun, 2023-02-05 at 17:49 +0200, Paul Blakey wrote:
>> Tc skb extension is a basic requirement for using tc
>> offload to support correct restoration on action miss.
>>
>> Depend on it.
>>
>> Signed-off-by: Paul Blakey <paulb@nvidia.com>
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 2 +-
>> drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c | 2 --
>> drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 2 --
>> 3 files changed, 1 insertion(+), 5 deletions(-)
>>
>
> So one question I had is what about the use of the SKB_EXT check in
> mlx5/core/en_tc.h? Seems like you could remove that one as well since
> it is wrapped in a check for MLX5_CLS_ACT before the check for
> NET_TC_SKB_EXT.
Good catch, I'll remove it.
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-06 17:14 ` Paul Blakey
@ 2023-02-07 0:20 ` Ilya Maximets
2023-02-07 5:03 ` Marcelo Leitner
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-02-07 0:20 UTC (permalink / raw)
To: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller
Cc: i.maximets, Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov,
Marcelo Leitner
On 2/6/23 18:14, Paul Blakey wrote:
>
>
> On 06/02/2023 14:34, Ilya Maximets wrote:
>> On 2/5/23 16:49, Paul Blakey wrote:
>>> [cover letter snipped]
>>
>>
>> Hi, Paul. Not sure if this was discussed before, but don't we also need
>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>
>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>> flags are not per-action. This may cause confusion among users, if flows
>> are reported as in_hw, while they are actually partially or even mostly
>> processed in SW.
>>
>> What do you think?
>>
>> Best regards, Ilya Maximets.
>
> I think its a good idea, and I'm fine with proposing something like this in a
> different series, as this isn't a new problem from this series and existed before
> it, at least with CT rules.
Hmm, I didn't realize the issue already exists.
>
> So how about I'll propose it in a different series and we continue with this first?
Sounds fine to me. Thanks!
Best regards, Ilya Maximets.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-07 0:20 ` Ilya Maximets
@ 2023-02-07 5:03 ` Marcelo Leitner
2023-02-07 5:20 ` Jakub Kicinski
2023-02-08 8:41 ` Paul Blakey
0 siblings, 2 replies; 21+ messages in thread
From: Marcelo Leitner @ 2023-02-07 5:03 UTC (permalink / raw)
To: Ilya Maximets
Cc: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller,
Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> On 2/6/23 18:14, Paul Blakey wrote:
> >
> >
> > On 06/02/2023 14:34, Ilya Maximets wrote:
> >> On 2/5/23 16:49, Paul Blakey wrote:
> >>> [cover letter snipped]
> >>
> >>
> >> Hi, Paul. Not sure if this was discussed before, but don't we also need
> >> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> >>
> >> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> >> flags are not per-action. This may cause confusion among users, if flows
> >> are reported as in_hw, while they are actually partially or even mostly
> >> processed in SW.
> >>
> >> What do you think?
> >>
> >> Best regards, Ilya Maximets.
> >
> > I think its a good idea, and I'm fine with proposing something like this in a
> > different series, as this isn't a new problem from this series and existed before
> > it, at least with CT rules.
>
> Hmm, I didn't realize the issue already exists.
Maintainers: please give me up to Friday to review this patchset.
Disclaimer: I had missed this patchset and haven't even read it yet.
I don't follow. Can someone please rephrase the issue?
AFAICT, it is not that the NIC is offloading half of the action list
and never executing part of it. Instead, for established connections
the rule will work fully offloaded, while for misses in the CT action
it will simply trigger a miss, like it already does today.
>
> >
> > So how about I'll propose it in a different series and we continue with this first?
So I'm not sure either what the idea is here.
Thanks,
Marcelo
>
> Sounds fine to me. Thanks!
>
> Best regards, Ilya Maximets.
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-07 5:03 ` Marcelo Leitner
@ 2023-02-07 5:20 ` Jakub Kicinski
2023-02-08 8:41 ` Paul Blakey
1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2023-02-07 5:20 UTC (permalink / raw)
To: Marcelo Leitner
Cc: Ilya Maximets, Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller,
Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On Mon, 6 Feb 2023 21:03:15 -0800 Marcelo Leitner wrote:
> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> > On 2/6/23 18:14, Paul Blakey wrote:
> > > I think its a good idea, and I'm fine with proposing something like this in a
> > > different series, as this isn't a new problem from this series and existed before
> > > it, at least with CT rules.
> >
> > Hmm, I didn't realize the issue already exists.
>
> Maintainers: please give me up to Friday to review this patchset.
No problem, there's a v9 FWIW:
https://lore.kernel.org/all/20230206174403.32733-1-paulb@nvidia.com/
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-07 5:03 ` Marcelo Leitner
2023-02-07 5:20 ` Jakub Kicinski
@ 2023-02-08 8:41 ` Paul Blakey
2023-02-08 18:01 ` Marcelo Leitner
1 sibling, 1 reply; 21+ messages in thread
From: Paul Blakey @ 2023-02-08 8:41 UTC (permalink / raw)
To: Marcelo Leitner, Ilya Maximets
Cc: netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
Jamal Hadi Salim, Cong Wang, David S. Miller, Oz Shlomo,
Jiri Pirko, Roi Dayan, Vlad Buslov
On 07/02/2023 07:03, Marcelo Leitner wrote:
> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>> On 2/6/23 18:14, Paul Blakey wrote:
>>>
>>>
>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>> [cover letter snipped]
>>>>
>>>>
>>>> Hi, Paul. Not sure if this was discussed before, but don't we also need
>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>
>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>> flags are not per-action. This may cause confusion among users, if flows
>>>> are reported as in_hw, while they are actually partially or even mostly
>>>> processed in SW.
>>>>
>>>> What do you think?
>>>>
>>>> Best regards, Ilya Maximets.
>>>
>>> I think its a good idea, and I'm fine with proposing something like this in a
>>> different series, as this isn't a new problem from this series and existed before
>>> it, at least with CT rules.
>>
>> Hmm, I didn't realize the issue already exists.
>
> Maintainers: please give me up to Friday to review this patchset.
>
> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>
> I don't follow. Can someone please rephase the issue please?
> AFAICT, it is not that the NIC is offloading half of the action list
> and never executing a part of it. Instead, for established connections
> the rule will work fully offloaded. While for misses in the CT action,
> it will simply trigger a miss, like it already does today.
You got it right, and like you said it was like this before, so it's not
strictly related to this series and could be in a different patchset.
And I thought that the (extra) flag would mean that it can miss, compared to
other rule/action combinations that will never miss because they
don't need sw support.
>
>>
>>>
>>> So how about I'll propose it in a different series and we continue with this first?
>
> So I'm not sure either on what's the idea here.
>
> Thanks,
> Marcelo
>
>>
>> Sounds fine to me. Thanks!
>>
>> Best regards, Ilya Maximets.
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-08 8:41 ` Paul Blakey
@ 2023-02-08 18:01 ` Marcelo Leitner
2023-02-09 0:09 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Leitner @ 2023-02-08 18:01 UTC (permalink / raw)
To: Paul Blakey
Cc: Ilya Maximets, netdev, Saeed Mahameed, Paolo Abeni,
Jakub Kicinski, Eric Dumazet, Jamal Hadi Salim, Cong Wang,
David S. Miller, Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>
>
> On 07/02/2023 07:03, Marcelo Leitner wrote:
> > On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> > > On 2/6/23 18:14, Paul Blakey wrote:
> > > >
> > > >
> > > > On 06/02/2023 14:34, Ilya Maximets wrote:
> > > > > On 2/5/23 16:49, Paul Blakey wrote:
> > > > > > [cover letter snipped]
> > > > >
> > > > >
> > > > > Hi, Paul. Not sure if this was discussed before, but don't we also need
> > > > > a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> > > > >
> > > > > Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> > > > > flags are not per-action. This may cause confusion among users, if flows
> > > > > are reported as in_hw, while they are actually partially or even mostly
> > > > > processed in SW.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Best regards, Ilya Maximets.
> > > >
> > > > I think its a good idea, and I'm fine with proposing something like this in a
> > > > different series, as this isn't a new problem from this series and existed before
> > > > it, at least with CT rules.
> > >
> > > Hmm, I didn't realize the issue already exists.
> >
> > Maintainers: please give me up to Friday to review this patchset.
> >
> > Disclaimer: I had missed this patchset, and I didn't even read it yet.
> >
> > I don't follow. Can someone please rephase the issue please?
> > AFAICT, it is not that the NIC is offloading half of the action list
> > and never executing a part of it. Instead, for established connections
> > the rule will work fully offloaded. While for misses in the CT action,
> > it will simply trigger a miss, like it already does today.
>
> You got it right, and like you said it was like this before so its not
> strictly related by this series and could be in a different patchset. And I
> thought that (extra) flag would mean that it can miss, compared to other
> rules/actions combination that will never miss because they
> don't need sw support.
This is different from what I understood from Ilya's comment. Maybe I
got his comment wrong, but I have the impression that he meant it in
the sense of having some actions offloaded and some not. Which I think
is not the goal here.
But anyway, flows can have some packets matching in sw while also
being in hw. That's expected. For example, in more complex flow sets,
if a packet hits a flow with a ct action and triggers a miss, all
subsequent flows will handle this packet in sw. Or if we have queued
packets in the rx ring already and ovs just updated the datapath, these
will match in tc sw instead of going to upcall. The latter will have
only a few hits, yes, but the former will keep increasing over time.
I'm not sure how a new flag, which is probably more informative than
an actual state indication, would help here.
>
> >
> > >
> > > >
> > > > So how about I'll propose it in a different series and we continue with this first?
> >
> > So I'm not sure either on what's the idea here.
> >
> > Thanks,
> > Marcelo
> >
> > >
> > > Sounds fine to me. Thanks!
> > >
> > > Best regards, Ilya Maximets.
> > >
> >
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-08 18:01 ` Marcelo Leitner
@ 2023-02-09 0:09 ` Ilya Maximets
2023-02-09 1:09 ` Marcelo Leitner
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-02-09 0:09 UTC (permalink / raw)
To: Marcelo Leitner, Paul Blakey
Cc: i.maximets, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller,
Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On 2/8/23 19:01, Marcelo Leitner wrote:
> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>
>>
>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>
>>>>>
>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>> [cover letter snipped]
>>>>>>
>>>>>>
>>>>>> Hi, Paul. Not sure if this was discussed before, but don't we also need
>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>
>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>> flags are not per-action. This may cause confusion among users, if flows
>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>> processed in SW.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>> different series, as this isn't a new problem from this series and existed before
>>>>> it, at least with CT rules.
>>>>
>>>> Hmm, I didn't realize the issue already exists.
>>>
>>> Maintainers: please give me up to Friday to review this patchset.
>>>
>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>
>>> I don't follow. Can someone please rephase the issue please?
>>> AFAICT, it is not that the NIC is offloading half of the action list
>>> and never executing a part of it. Instead, for established connections
>>> the rule will work fully offloaded. While for misses in the CT action,
>>> it will simply trigger a miss, like it already does today.
>>
>> You got it right, and like you said it was like this before so its not
>> strictly related by this series and could be in a different patchset. And I
>> thought that (extra) flag would mean that it can miss, compared to other
>> rules/actions combination that will never miss because they
>> don't need sw support.
>
> This is different from what I understood from Ilya's comment. Maybe I
> got his comment wrong, but I have the impression that he meant it in
> the sense of having some actions offloaded and some not.
> Which I thinkit is not the goal here.
I don't really know the code around this patch set well enough, so my
thoughts might be a bit irrelevant. But after reading the cover letter
and commit messages in this patch set I imagined that if we have some
kind of miss on the N-th action in a list in HW, we could go to software
tc, find that action and continue execution from it. In this case some
actions are executed in HW and some are in SW.
From the user's perspective, if such a tc filter reports an 'in_hw' flag,
that would be a bit misleading, IMO.
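(For context, today a fully offloaded flower filter is dumped roughly
like the below; the device, handle and match are made up for
illustration:
$ tc -s filter show dev enp8s0f0_0 ingress chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
  ip_proto tcp
  in_hw in_hw_count 1
  action order 1: ...
i.e. the flag is per filter, with no per-action indication of where HW
processing actually stops.)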
If that is not what is happening here, then please ignore my comments,
as I'm not sure what this code is about then. :)
>
> But anyway, flows can have some packets matching in sw while also
> being in hw. That's expected. For example, in more complex flow sets,
> if a packet hit a flow with ct action and triggered a miss, all
> subsequent flows will handle this packet in sw. Or if we have queued
> packets in rx ring already and ovs just updated the datapath, these
> will match in tc sw instead of going to upcall. The latter will have
> only a few hits, yes, but the former will be increasing over time.
> I'm not sure how a new flag, which is probably more informative than
> an actual state indication, would help here.
These cases are related to just one or a very few packets, so for them
it's generally fine to report 'in_hw', I think. The vast majority of
traffic will be handled in HW.
My thoughts were about a case where we have a lot of traffic handled
partially in HW and in SW. Let's say we have N actions and HW doesn't
support action M. In this case, the driver may offload actions [0, M - 1],
inserting some kind of forced "HW miss" at the end, so actions [M, N]
can be executed in TC software.
But now I'm not sure if that is possible with the current implementation.
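(Purely as an illustration of the case I mean, with a made-up action
list that no driver offloads this way today, think of a filter like:
$ tc filter add dev enp8s0f0_0 ingress proto ip flower ip_proto tcp \
    action pedit ex munge ip ttl set 63 pipe \
    action csum ip pipe \
    action sample rate 100 group 5 pipe \
    action mirred egress redirect dev enp8s0f0_1
where HW would offload the pedit and csum, force a miss before sample,
and sample + mirred would always run in tc SW. Reporting plain 'in_hw'
for such a filter would hide that.)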
>
>>
>>>
>>>>
>>>>>
>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>
>>> So I'm not sure either on what's the idea here.
>>>
>>> Thanks,
>>> Marcelo
>>>
>>>>
>>>> Sounds fine to me. Thanks!
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-09 0:09 ` Ilya Maximets
@ 2023-02-09 1:09 ` Marcelo Leitner
2023-02-09 12:07 ` Ilya Maximets
0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Leitner @ 2023-02-09 1:09 UTC (permalink / raw)
To: Ilya Maximets
Cc: Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski,
Eric Dumazet, Jamal Hadi Salim, Cong Wang, David S. Miller,
Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
> On 2/8/23 19:01, Marcelo Leitner wrote:
> > On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
> >>
> >>
> >> On 07/02/2023 07:03, Marcelo Leitner wrote:
> >>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
> >>>> On 2/6/23 18:14, Paul Blakey wrote:
> >>>>>
> >>>>>
> >>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
> >>>>>> On 2/5/23 16:49, Paul Blakey wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> This series adds support for hardware miss to instruct tc to continue execution
> >>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
> >>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
> >>>>>>>
> >>>>>>> Currently a filter's action list must be executed all together or
> >>>>>>> not at all as driver are only able to tell tc to continue executing from a
> >>>>>>> specific tc chain, and not a specific filter/action.
> >>>>>>>
> >>>>>>> This is troublesome with regards to action CT, where new connections should
> >>>>>>> be sent to software (via tc chain restore), and established connections can
> >>>>>>> be handled in hardware.
> >>>>>>>
> >>>>>>> Checking for new connections is done when executing the ct action in hardware
> >>>>>>> (by checking the packet's tuple against known established tuples).
> >>>>>>> But if there is a packet modification (pedit) action before action CT and the
> >>>>>>> checked tuple is a new connection, hardware will need to revert the previous
> >>>>>>> packet modifications before sending it back to software so it can
> >>>>>>> re-match the same tc filter in software and re-execute its CT action.
> >>>>>>>
> >>>>>>> The following is an example configuration of stateless nat
> >>>>>>> on mlx5 driver that isn't supported before this patchet:
> >>>>>>>
> >>>>>>> #Setup corrosponding mlx5 VFs in namespaces
> >>>>>>> $ ip netns add ns0
> >>>>>>> $ ip netns add ns1
> >>>>>>> $ ip link set dev enp8s0f0v0 netns ns0
> >>>>>>> $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
> >>>>>>> $ ip link set dev enp8s0f0v1 netns ns1
> >>>>>>> $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
> >>>>>>>
> >>>>>>> #Setup tc arp and ct rules on mxl5 VF representors
> >>>>>>> $ tc qdisc add dev enp8s0f0_0 ingress
> >>>>>>> $ tc qdisc add dev enp8s0f0_1 ingress
> >>>>>>> $ ifconfig enp8s0f0_0 up
> >>>>>>> $ ifconfig enp8s0f0_1 up
> >>>>>>>
> >>>>>>> #Original side
> >>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
> >>>>>>> ct_state -trk ip_proto tcp dst_port 8888 \
> >>>>>>> action pedit ex munge tcp dport set 5001 pipe \
> >>>>>>> action csum ip tcp pipe \
> >>>>>>> action ct pipe \
> >>>>>>> action goto chain 1
> >>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> >>>>>>> ct_state +trk+est \
> >>>>>>> action mirred egress redirect dev enp8s0f0_1
> >>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
> >>>>>>> ct_state +trk+new \
> >>>>>>> action ct commit pipe \
> >>>>>>> action mirred egress redirect dev enp8s0f0_1
> >>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
> >>>>>>> action mirred egress redirect dev enp8s0f0_1
> >>>>>>>
> >>>>>>> #Reply side
> >>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
> >>>>>>> action mirred egress redirect dev enp8s0f0_0
> >>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
> >>>>>>> ct_state -trk ip_proto tcp \
> >>>>>>> action ct pipe \
> >>>>>>> action pedit ex munge tcp sport set 8888 pipe \
> >>>>>>> action csum ip tcp pipe \
> >>>>>>> action mirred egress redirect dev enp8s0f0_0
> >>>>>>>
> >>>>>>> #Run traffic
> >>>>>>> $ ip netns exec ns1 iperf -s -p 5001&
> >>>>>>> $ sleep 2 #wait for iperf to fully open
> >>>>>>> $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
> >>>>>>>
> >>>>>>> #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
> >>>>>>> $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
> >>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
> >>>>>>>
> >>>>>>> A new connection executing the first filter in hardware will first rewrite
> >>>>>>> the dst port to the new port, and then the ct action is executed,
> >>>>>>> because this is a new connection, hardware will need to be send this back
> >>>>>>> to software, on chain 0, to execute the first filter again in software.
> >>>>>>> The dst port needs to be reverted otherwise it won't re-match the old
> >>>>>>> dst port in the first filter. Because of that, currently mlx5 driver will
> >>>>>>> reject offloading the above action ct rule.
> >>>>>>>
> >>>>>>> This series adds supports partial offload of a filter's action list,
> >>>>>>> and letting tc software continue processing in the specific action instance
> >>>>>>> where hardware left off (in the above case after the "action pedit ex munge tcp
> >>>>>>> dport... of the first rule") allowing support for scenarios such as the above.
> >>>>>>
> >>>>>>
> >>>>>> Hi, Paul. Not sure if this was discussed before, but don't we also need
> >>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
> >>>>>>
> >>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
> >>>>>> flags are not per-action. This may cause confusion among users, if flows
> >>>>>> are reported as in_hw, while they are actually partially or even mostly
> >>>>>> processed in SW.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best regards, Ilya Maximets.
> >>>>>
> >>>>> I think its a good idea, and I'm fine with proposing something like this in a
> >>>>> different series, as this isn't a new problem from this series and existed before
> >>>>> it, at least with CT rules.
> >>>>
> >>>> Hmm, I didn't realize the issue already exists.
> >>>
> >>> Maintainers: please give me up to Friday to review this patchset.
> >>>
> >>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
> >>>
> >>> I don't follow. Can someone please rephase the issue please?
> >>> AFAICT, it is not that the NIC is offloading half of the action list
> >>> and never executing a part of it. Instead, for established connections
> >>> the rule will work fully offloaded. While for misses in the CT action,
> >>> it will simply trigger a miss, like it already does today.
> >>
> >> You got it right, and like you said it was like this before so its not
> >> strictly related by this series and could be in a different patchset. And I
> >> thought that (extra) flag would mean that it can miss, compared to other
> >> rules/actions combination that will never miss because they
> >> don't need sw support.
> >
> > This is different from what I understood from Ilya's comment. Maybe I
> > got his comment wrong, but I have the impression that he meant it in
> > the sense of having some actions offloaded and some not.
> > Which I thinkit is not the goal here.
>
> I don't really know the code around this patch set well enough, so my
> thoughts might be a bit irrelevant. But after reading the cover letter
> and commit messages in this patch set I imagined that if we have some
> kind of miss on the N-th action in a list in HW, we could go to software
> tc, find that action and continue execution from it. In this case some
> actions are executed in HW and some are in SW.
Precisely. :)
>
> From the user's perspective, if such tc filter reports an 'in_hw' flag,
> that would be a bit misleading, IMO.
I may be tainted or perhaps even biased here, but I don't see how it
can be misleading. Since we came up with skip_hw/sw I think it is
expected that packets can be handled in both datapaths. The flag is
just saying that hw has this flow. (btw, in_sw is simplified, as sw
always accepts the flow if skip_sw is not used)
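(e.g. a flow added with skip_sw, like the made-up one below, is only
ever handled by hw, while the very same flow without any skip_* flag is
accepted by both datapaths:
$ tc filter add dev enp8s0f0_0 ingress prio 1 proto ip flower skip_sw \
    ip_proto tcp \
    action mirred egress redirect dev enp8s0f0_1
so in_hw/in_sw describe where the flow is installed, not where every
packet ends up.)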
>
> If that is not what is happening here, then please ignore my comments,
> as I'm not sure what this code is about then. :)
>
> >
> > But anyway, flows can have some packets matching in sw while also
> > being in hw. That's expected. For example, in more complex flow sets,
> > if a packet hit a flow with ct action and triggered a miss, all
> > subsequent flows will handle this packet in sw. Or if we have queued
> > packets in rx ring already and ovs just updated the datapath, these
> > will match in tc sw instead of going to upcall. The latter will have
> > only a few hits, yes, but the former will be increasing over time.
> > I'm not sure how a new flag, which is probably more informative than
> > an actual state indication, would help here.
>
> These cases are related to just one or a very few packets, so for them
> it's generally fine to report 'in_hw', I think. The vast majority of
> traffic will be handled in HW.
>
> My thoughts were about a case where we have a lot of traffic handled
> partially in HW and in SW. Let's say we have N actions and HW doesn't
> support action M. In this case, driver may offload actions [0, M - 1]
> inserting some kind of forced "HW miss" at the end, so actions [M, N]
> can be executed in TC software.
Right. Let's consider this other scenario then. Say we
have these flows:
chain 0,ip,match ip X actions=ct,goto chain 1
chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
chain 2 actions=output:3
The idea here is that on chain 1, the HW doesn't support that particular
match on proto Y. That flow will never be in_hw, and that's okay.
The flow on chain 2, though, will be tagged as in_hw, and packets
following this specific sequence will get handled in sw on
chain 2.
But if we have another flow there:
chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
which is supported by the hw, then such packets would be handled by hw in
chain 2.
The flow on chain 2 has no idea of what was done before it. It can't
be tagged with _PARTIAL, as the actions in there are not expected to
trigger misses; yet, with this flow set, it is expected to handle
packets in both datapaths, despite being 'in_hw'.
I guess what I'm trying to say is that a flow being tagged with in_hw
does not mean that sw processing is unexpected.
Hopefully this makes sense?
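In tc terms the flow set above would be roughly something like the
below; matches and devices are placeholders, and the chain 1 match
stands in for the proto Y match that the hw can't offload:
$ tc filter add dev REP0 ingress chain 0 proto ip flower dst_ip 1.1.1.2 \
    action ct pipe \
    action goto chain 1
$ tc filter add dev REP0 ingress chain 1 proto ip flower ip_proto sctp \
    action ct nat pipe \
    action goto chain 2
$ tc filter add dev REP0 ingress chain 2 proto ip flower \
    action mirred egress redirect dev REP3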
>
> But now I'm not sure if that is possible with the current implementation.
AFAICT you got it all right. It was me who had misunderstood you. :)
>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> So how about I'll propose it in a different series and we continue with this first?
> >>>
> >>> So I'm not sure either on what's the idea here.
> >>>
> >>> Thanks,
> >>> Marcelo
> >>>
> >>>>
> >>>> Sounds fine to me. Thanks!
> >>>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>
> >>
> >
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-09 1:09 ` Marcelo Leitner
@ 2023-02-09 12:07 ` Ilya Maximets
2023-02-09 12:40 ` Paul Blakey
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Maximets @ 2023-02-09 12:07 UTC (permalink / raw)
To: Marcelo Leitner
Cc: i.maximets, Paul Blakey, netdev, Saeed Mahameed, Paolo Abeni,
Jakub Kicinski, Eric Dumazet, Jamal Hadi Salim, Cong Wang,
David S. Miller, Oz Shlomo, Jiri Pirko, Roi Dayan, Vlad Buslov
On 2/9/23 02:09, Marcelo Leitner wrote:
> On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
>> On 2/8/23 19:01, Marcelo Leitner wrote:
>>> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>>>
>>>>
>>>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This series adds support for hardware miss to instruct tc to continue execution
>>>>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
>>>>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
>>>>>>>>>
>>>>>>>>> Currently a filter's action list must be executed all together or
>>>>>>>>> not at all as driver are only able to tell tc to continue executing from a
>>>>>>>>> specific tc chain, and not a specific filter/action.
>>>>>>>>>
>>>>>>>>> This is troublesome with regards to action CT, where new connections should
>>>>>>>>> be sent to software (via tc chain restore), and established connections can
>>>>>>>>> be handled in hardware.
>>>>>>>>>
>>>>>>>>> Checking for new connections is done when executing the ct action in hardware
>>>>>>>>> (by checking the packet's tuple against known established tuples).
>>>>>>>>> But if there is a packet modification (pedit) action before action CT and the
>>>>>>>>> checked tuple is a new connection, hardware will need to revert the previous
>>>>>>>>> packet modifications before sending it back to software so it can
>>>>>>>>> re-match the same tc filter in software and re-execute its CT action.
>>>>>>>>>
>>>>>>>>> The following is an example configuration of stateless nat
>>>>>>>>> on mlx5 driver that isn't supported before this patchet:
>>>>>>>>>
>>>>>>>>> #Setup corrosponding mlx5 VFs in namespaces
>>>>>>>>> $ ip netns add ns0
>>>>>>>>> $ ip netns add ns1
>>>>>>>>> $ ip link set dev enp8s0f0v0 netns ns0
>>>>>>>>> $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
>>>>>>>>> $ ip link set dev enp8s0f0v1 netns ns1
>>>>>>>>> $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
>>>>>>>>>
>>>>>>>>> #Setup tc arp and ct rules on mxl5 VF representors
>>>>>>>>> $ tc qdisc add dev enp8s0f0_0 ingress
>>>>>>>>> $ tc qdisc add dev enp8s0f0_1 ingress
>>>>>>>>> $ ifconfig enp8s0f0_0 up
>>>>>>>>> $ ifconfig enp8s0f0_1 up
>>>>>>>>>
>>>>>>>>> #Original side
>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
>>>>>>>>> ct_state -trk ip_proto tcp dst_port 8888 \
>>>>>>>>> action pedit ex munge tcp dport set 5001 pipe \
>>>>>>>>> action csum ip tcp pipe \
>>>>>>>>> action ct pipe \
>>>>>>>>> action goto chain 1
>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>> ct_state +trk+est \
>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>> ct_state +trk+new \
>>>>>>>>> action ct commit pipe \
>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>
>>>>>>>>> #Reply side
>>>>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
>>>>>>>>> action mirred egress redirect dev enp8s0f0_0
>>>>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
>>>>>>>>> ct_state -trk ip_proto tcp \
>>>>>>>>> action ct pipe \
>>>>>>>>> action pedit ex munge tcp sport set 8888 pipe \
>>>>>>>>> action csum ip tcp pipe \
>>>>>>>>> action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>
>>>>>>>>> #Run traffic
>>>>>>>>> $ ip netns exec ns1 iperf -s -p 5001&
>>>>>>>>> $ sleep 2 #wait for iperf to fully open
>>>>>>>>> $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
>>>>>>>>>
>>>>>>>>> #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
>>>>>>>>> $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>
>>>>>>>>> A new connection executing the first filter in hardware will first rewrite
>>>>>>>>> the dst port to the new port, and then the ct action is executed,
>>>>>>>>> because this is a new connection, hardware will need to be send this back
>>>>>>>>> to software, on chain 0, to execute the first filter again in software.
>>>>>>>>> The dst port needs to be reverted otherwise it won't re-match the old
>>>>>>>>> dst port in the first filter. Because of that, currently mlx5 driver will
>>>>>>>>> reject offloading the above action ct rule.
>>>>>>>>>
>>>>>>>>> This series adds supports partial offload of a filter's action list,
>>>>>>>>> and letting tc software continue processing in the specific action instance
>>>>>>>>> where hardware left off (in the above case after the "action pedit ex munge tcp
>>>>>>>>> dport... of the first rule") allowing support for scenarios such as the above.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi, Paul. Not sure if this was discussed before, but don't we also need
>>>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>>>
>>>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>>>> flags are not per-action. This may cause confusion among users, if flows
>>>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>>>> processed in SW.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>>>> different series, as this isn't a new problem from this series and existed before
>>>>>>> it, at least with CT rules.
>>>>>>
>>>>>> Hmm, I didn't realize the issue already exists.
>>>>>
>>>>> Maintainers: please give me up to Friday to review this patchset.
>>>>>
>>>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>>>
>>>>> I don't follow. Can someone please rephase the issue please?
>>>>> AFAICT, it is not that the NIC is offloading half of the action list
>>>>> and never executing a part of it. Instead, for established connections
>>>>> the rule will work fully offloaded. While for misses in the CT action,
>>>>> it will simply trigger a miss, like it already does today.
>>>>
>>>> You got it right, and like you said it was like this before so its not
>>>> strictly related by this series and could be in a different patchset. And I
>>>> thought that (extra) flag would mean that it can miss, compared to other
>>>> rules/actions combination that will never miss because they
>>>> don't need sw support.
>>>
>>> This is different from what I understood from Ilya's comment. Maybe I
>>> got his comment wrong, but I have the impression that he meant it in
>>> the sense of having some actions offloaded and some not.
>>> Which I thinkit is not the goal here.
>>
>> I don't really know the code around this patch set well enough, so my
>> thoughts might be a bit irrelevant. But after reading the cover letter
>> and commit messages in this patch set I imagined that if we have some
>> kind of miss on the N-th action in a list in HW, we could go to software
>> tc, find that action and continue execution from it. In this case some
>> actions are executed in HW and some are in SW.
>
> Precisely. :)
>
>>
>> From the user's perspective, if such tc filter reports an 'in_hw' flag,
>> that would be a bit misleading, IMO.
>
> I may be tainted or perhaps even biased here, but I don't see how it
> can be misleading. Since we came up with skip_hw/sw I think it is
> expected that packets can be handled in both datapaths. The flag is
> just saying that hw has this flow. (btw, in_sw is simplified, as sw
> always accepts the flow if skip_sw is not used)
>
>>
>> If that is not what is happening here, then please ignore my comments,
>> as I'm not sure what this code is about then. :)
>>
>>>
>>> But anyway, flows can have some packets matching in sw while also
>>> being in hw. That's expected. For example, in more complex flow sets,
>>> if a packet hit a flow with ct action and triggered a miss, all
>>> subsequent flows will handle this packet in sw. Or if we have queued
>>> packets in rx ring already and ovs just updated the datapath, these
>>> will match in tc sw instead of going to upcall. The latter will have
>>> only a few hits, yes, but the former will be increasing over time.
>>> I'm not sure how a new flag, which is probably more informative than
>>> an actual state indication, would help here.
>>
>> These cases are related to just one or a very few packets, so for them
>> it's generally fine to report 'in_hw', I think. The vast majority of
>> traffic will be handled in HW.
>>
>> My thoughts were about a case where we have a lot of traffic handled
>> partially in HW and in SW. Let's say we have N actions and HW doesn't
>> support action M. In this case, driver may offload actions [0, M - 1]
>> inserting some kind of forced "HW miss" at the end, so actions [M, N]
>> can be executed in TC software.
>
> Right. Please lets consider this other scenario then. Consider that we
> have these flows:
> chain 0,ip,match ip X actions=ct,goto chain 1
> chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
> chain 2 actions=output:3
>
> The idea here is that on chain 1, the HW doesn't support that particular
> match on proto Y. That flow will never be in_hw, and that's okay. But
> the flow on chain 2, though, will be tagged as in_hw, and for packets
> following these specific sequence, they will get handled in sw on
> chain 2.
>
> But if we have another flow there:
> chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
> which is supported by the hw, such packets would be handled by hw in
> chain 2.
>
> The flow on chain 2 has no idea on what was done before it. It can't
> be tagged with _PARTIAL as the actions in there are not expected to
> trigger misses, yet, with this flow set, it is expected to handle
> packets in both datapaths, despite being 'in_hw'.
>
> I guess what I'm trying so say is that it is not because a flow is
> tagged with in_hw that sw processing is unexpected straight away.
>
> Hopefully this makes sense?
Yep. I see your point. In this case I agree that we can't really tell
if the traffic will be handled in HW or SW, and chain 2 will always be
handled in both. So, the fact that it is 'in_hw' only means that the
chain is actually in HW, i.e. the HW actually has it.
Summarizing: having something doesn't mean using it. :) So, thinking
that in_hw flows are actually fully processed in HW is a user's fault. :/
However, going back to my example where HW supports half of the actions
in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
If the actions M to N are actually not installed into HW, marking the
chain as in_hw is a bit misleading, because unlike your example, not all
the actions are actually in HW and the driver knows that. For that case,
something like a _PARTIAL suffix might still be useful.
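(Just as a naming sketch, nothing like this exists today, but such a
filter could then be dumped with something like
  in_hw_partial in_hw_count 1
instead of plain 'in_hw'.)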
>
>>
>> But now I'm not sure if that is possible with the current implementation.
>
> AFAICT you got all right. It is me that had misunderstood you. :)
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>>>
>>>>> So I'm not sure either on what's the idea here.
>>>>>
>>>>> Thanks,
>>>>> Marcelo
>>>>>
>>>>>>
>>>>>> Sounds fine to me. Thanks!
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action
2023-02-09 12:07 ` Ilya Maximets
@ 2023-02-09 12:40 ` Paul Blakey
0 siblings, 0 replies; 21+ messages in thread
From: Paul Blakey @ 2023-02-09 12:40 UTC (permalink / raw)
To: Ilya Maximets, Marcelo Leitner
Cc: netdev, Saeed Mahameed, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
Jamal Hadi Salim, Cong Wang, David S. Miller, Oz Shlomo,
Jiri Pirko, Roi Dayan, Vlad Buslov
On 09/02/2023 14:07, Ilya Maximets wrote:
> On 2/9/23 02:09, Marcelo Leitner wrote:
>> On Thu, Feb 09, 2023 at 01:09:21AM +0100, Ilya Maximets wrote:
>>> On 2/8/23 19:01, Marcelo Leitner wrote:
>>>> On Wed, Feb 08, 2023 at 10:41:39AM +0200, Paul Blakey wrote:
>>>>>
>>>>>
>>>>> On 07/02/2023 07:03, Marcelo Leitner wrote:
>>>>>> On Tue, Feb 07, 2023 at 01:20:55AM +0100, Ilya Maximets wrote:
>>>>>>> On 2/6/23 18:14, Paul Blakey wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/02/2023 14:34, Ilya Maximets wrote:
>>>>>>>>> On 2/5/23 16:49, Paul Blakey wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This series adds support for hardware miss to instruct tc to continue execution
>>>>>>>>>> in a specific tc action instance on a filter's action list. The mlx5 driver patch
>>>>>>>>>> (besides the refactors) shows its usage instead of using just chain restore.
>>>>>>>>>>
>>>>>>>>>> Currently a filter's action list must be executed all together or
>>>>>>>>>> not at all as driver are only able to tell tc to continue executing from a
>>>>>>>>>> specific tc chain, and not a specific filter/action.
>>>>>>>>>>
>>>>>>>>>> This is troublesome with regards to action CT, where new connections should
>>>>>>>>>> be sent to software (via tc chain restore), and established connections can
>>>>>>>>>> be handled in hardware.
>>>>>>>>>>
>>>>>>>>>> Checking for new connections is done when executing the ct action in hardware
>>>>>>>>>> (by checking the packet's tuple against known established tuples).
>>>>>>>>>> But if there is a packet modification (pedit) action before action CT and the
>>>>>>>>>> checked tuple is a new connection, hardware will need to revert the previous
>>>>>>>>>> packet modifications before sending it back to software so it can
>>>>>>>>>> re-match the same tc filter in software and re-execute its CT action.
>>>>>>>>>>
>>>>>>>>>> The following is an example configuration of stateless nat
>>>>>>>>>> on mlx5 driver that isn't supported before this patchet:
>>>>>>>>>>
>>>>>>>>>> #Setup corrosponding mlx5 VFs in namespaces
>>>>>>>>>> $ ip netns add ns0
>>>>>>>>>> $ ip netns add ns1
>>>>>>>>>> $ ip link set dev enp8s0f0v0 netns ns0
>>>>>>>>>> $ ip netns exec ns0 ifconfig enp8s0f0v0 1.1.1.1/24 up
>>>>>>>>>> $ ip link set dev enp8s0f0v1 netns ns1
>>>>>>>>>> $ ip netns exec ns1 ifconfig enp8s0f0v1 1.1.1.2/24 up
>>>>>>>>>>
>>>>>>>>>> #Setup tc arp and ct rules on mxl5 VF representors
>>>>>>>>>> $ tc qdisc add dev enp8s0f0_0 ingress
>>>>>>>>>> $ tc qdisc add dev enp8s0f0_1 ingress
>>>>>>>>>> $ ifconfig enp8s0f0_0 up
>>>>>>>>>> $ ifconfig enp8s0f0_1 up
>>>>>>>>>>
>>>>>>>>>> #Original side
>>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto ip flower \
>>>>>>>>>> ct_state -trk ip_proto tcp dst_port 8888 \
>>>>>>>>>> action pedit ex munge tcp dport set 5001 pipe \
>>>>>>>>>> action csum ip tcp pipe \
>>>>>>>>>> action ct pipe \
>>>>>>>>>> action goto chain 1
>>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>> ct_state +trk+est \
>>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 1 proto ip flower \
>>>>>>>>>> ct_state +trk+new \
>>>>>>>>>> action ct commit pipe \
>>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>> $ tc filter add dev enp8s0f0_0 ingress chain 0 proto arp flower \
>>>>>>>>>> action mirred egress redirect dev enp8s0f0_1
>>>>>>>>>>
>>>>>>>>>> #Reply side
>>>>>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto arp flower \
>>>>>>>>>> action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>> $ tc filter add dev enp8s0f0_1 ingress chain 0 proto ip flower \
>>>>>>>>>> ct_state -trk ip_proto tcp \
>>>>>>>>>> action ct pipe \
>>>>>>>>>> action pedit ex munge tcp sport set 8888 pipe \
>>>>>>>>>> action csum ip tcp pipe \
>>>>>>>>>> action mirred egress redirect dev enp8s0f0_0
>>>>>>>>>>
>>>>>>>>>> #Run traffic
>>>>>>>>>> $ ip netns exec ns1 iperf -s -p 5001&
>>>>>>>>>> $ sleep 2 #wait for iperf to fully open
>>>>>>>>>> $ ip netns exec ns0 iperf -c 1.1.1.2 -p 8888
>>>>>>>>>>
>>>>>>>>>> #dump tc filter stats on enp8s0f0_0 chain 0 rule and see hardware packets:
>>>>>>>>>> $ tc -s filter show dev enp8s0f0_0 ingress chain 0 proto ip | grep "hardware.*pkt"
>>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>> Sent hardware 9310116832 bytes 6149672 pkt
>>>>>>>>>>
>>>>>>>>>> A new connection executing the first filter in hardware will first rewrite
>>>>>>>>>> the dst port to the new port, and then the ct action is executed,
>>>>>>>>>> because this is a new connection, hardware will need to be send this back
>>>>>>>>>> to software, on chain 0, to execute the first filter again in software.
>>>>>>>>>> The dst port needs to be reverted otherwise it won't re-match the old
>>>>>>>>>> dst port in the first filter. Because of that, currently mlx5 driver will
>>>>>>>>>> reject offloading the above action ct rule.
>>>>>>>>>>
>>>>>>>>>> This series adds supports partial offload of a filter's action list,
>>>>>>>>>> and letting tc software continue processing in the specific action instance
>>>>>>>>>> where hardware left off (in the above case after the "action pedit ex munge tcp
>>>>>>>>>> dport... of the first rule") allowing support for scenarios such as the above.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi, Paul. Not sure if this was discussed before, but don't we also need
>>>>>>>>> a new TCA_CLS_FLAGS_IN_HW_PARTIAL flag or something like this?
>>>>>>>>>
>>>>>>>>> Currently the in_hw/not_in_hw flags are reported per filter, i.e. these
>>>>>>>>> flags are not per-action. This may cause confusion among users, if flows
>>>>>>>>> are reported as in_hw, while they are actually partially or even mostly
>>>>>>>>> processed in SW.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best regards, Ilya Maximets.
>>>>>>>>
>>>>>>>> I think its a good idea, and I'm fine with proposing something like this in a
>>>>>>>> different series, as this isn't a new problem from this series and existed before
>>>>>>>> it, at least with CT rules.
>>>>>>>
>>>>>>> Hmm, I didn't realize the issue already exists.
>>>>>>
>>>>>> Maintainers: please give me up to Friday to review this patchset.
>>>>>>
>>>>>> Disclaimer: I had missed this patchset, and I didn't even read it yet.
>>>>>>
>>>>>> I don't follow. Can someone please rephase the issue please?
>>>>>> AFAICT, it is not that the NIC is offloading half of the action list
>>>>>> and never executing a part of it. Instead, for established connections
>>>>>> the rule will work fully offloaded. While for misses in the CT action,
>>>>>> it will simply trigger a miss, like it already does today.
>>>>>
>>>>> You got it right, and like you said it was like this before so its not
>>>>> strictly related by this series and could be in a different patchset. And I
>>>>> thought that (extra) flag would mean that it can miss, compared to other
>>>>> rules/actions combination that will never miss because they
>>>>> don't need sw support.
>>>>
>>>> This is different from what I understood from Ilya's comment. Maybe I
>>>> got his comment wrong, but I have the impression that he meant it in
>>>> the sense of having some actions offloaded and some not.
>>>> Which I thinkit is not the goal here.
>>>
>>> I don't really know the code around this patch set well enough, so my
>>> thoughts might be a bit irrelevant. But after reading the cover letter
>>> and commit messages in this patch set I imagined that if we have some
>>> kind of miss on the N-th action in a list in HW, we could go to software
>>> tc, find that action and continue execution from it. In this case some
>>> actions are executed in HW and some are in SW.
>>
>> Precisely. :)
>>
>>>
>>> From the user's perspective, if such tc filter reports an 'in_hw' flag,
>>> that would be a bit misleading, IMO.
>>
>> I may be tainted or perhaps even biased here, but I don't see how it
>> can be misleading. Since we came up with skip_hw/sw I think it is
>> expected that packets can be handled in both datapaths. The flag is
>> just saying that hw has this flow. (btw, in_sw is simplified, as sw
>> always accepts the flow if skip_sw is not used)
>>
>>>
>>> If that is not what is happening here, then please ignore my comments,
>>> as I'm not sure what this code is about then. :)
>>>
>>>>
>>>> But anyway, flows can have some packets matching in sw while also
>>>> being in hw. That's expected. For example, in more complex flow sets,
>>>> if a packet hit a flow with ct action and triggered a miss, all
>>>> subsequent flows will handle this packet in sw. Or if we have queued
>>>> packets in rx ring already and ovs just updated the datapath, these
>>>> will match in tc sw instead of going to upcall. The latter will have
>>>> only a few hits, yes, but the former will be increasing over time.
>>>> I'm not sure how a new flag, which is probably more informative than
>>>> an actual state indication, would help here.
>>>
>>> These cases are related to just one or a very few packets, so for them
>>> it's generally fine to report 'in_hw', I think. The vast majority of
>>> traffic will be handled in HW.
>>>
>>> My thoughts were about a case where we have a lot of traffic handled
>>> partially in HW and in SW. Let's say we have N actions and HW doesn't
>>> support action M. In this case, driver may offload actions [0, M - 1]
>>> inserting some kind of forced "HW miss" at the end, so actions [M, N]
>>> can be executed in TC software.
>>
>> Right. Please lets consider this other scenario then. Consider that we
>> have these flows:
>> chain 0,ip,match ip X actions=ct,goto chain 1
>> chain 1,proto_Y_specific_match actions=ct(nat),goto chain 2
>> chain 2 actions=output:3
>>
>> The idea here is that on chain 1, the HW doesn't support that particular
>> match on proto Y. That flow will never be in_hw, and that's okay. But
>> the flow on chain 2, though, will be tagged as in_hw, and for packets
>> following these specific sequence, they will get handled in sw on
>> chain 2.
>>
>> But if we have another flow there:
>> chain 1,proto tcp actions=ct(nat),set_ttl,goto chain 2
>> which is supported by the hw, such packets would be handled by hw in
>> chain 2.
>>
>> The flow on chain 2 has no idea on what was done before it. It can't
>> be tagged with _PARTIAL as the actions in there are not expected to
>> trigger misses, yet, with this flow set, it is expected to handle
>> packets in both datapaths, despite being 'in_hw'.
>>
>> I guess what I'm trying so say is that it is not because a flow is
>> tagged with in_hw that sw processing is unexpected straight away.
>>
>> Hopefully this makes sense?
>
> Yep. I see your point. In this case I agree that we can't really tell
> if the traffic will be handled in HW or SW and the chain 2 will be
> always handled in both. So, the fact that it is 'in_hw' only means that
> the chain is actually in HW as that HW actually has it.
>
> Summarizing: having something doesn't mean using it. :) So, thinking
> that in_hw flows are actually fully processed in HW is a user's fault. :/
>
> However, going back to my example where HW supports half of actions
> in the chain ([0, M - 1]) and doesn't support the other half ([M, N])...
> If the actions M to N are actually not installed into HW, marking the
> chain as in_hw is a bit misleading, because unlike your example, not all
> the actions are actually in HW and driver knows that. For that case,
> something like _PARTIAL suffix might still be useful.
Right, and for now at least, no one does that :)
>
>>
>>>
>>> But now I'm not sure if that is possible with the current implementation.
>>
>> AFAICT you got all right. It is me that had misunderstood you. :)
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> So how about I'll propose it in a different series and we continue with this first?
>>>>>>
>>>>>> So I'm not sure either on what's the idea here.
>>>>>>
>>>>>> Thanks,
>>>>>> Marcelo
>>>>>>
>>>>>>>
>>>>>>> Sounds fine to me. Thanks!
>>>>>>>
>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread [~2023-02-09 12:40 UTC | newest]
Thread overview: 21+ messages
2023-02-05 15:49 [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 1/7] " Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 2/7] net/sched: flower: Move filter handle initialization earlier Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 3/7] net/sched: flower: Support hardware miss to tc action Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 4/7] net/mlx5: Kconfig: Make tc offload depend on tc skb extension Paul Blakey
2023-02-06 15:40 ` Alexander H Duyck
2023-02-06 17:16 ` Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 5/7] net/mlx5: Refactor tc miss handling to a single function Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 6/7] net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG Paul Blakey
2023-02-05 15:49 ` [PATCH net-next v8 7/7] net/mlx5e: TC, Set CT miss to the specific ct action instance Paul Blakey
2023-02-06 12:34 ` [PATCH net-next v8 0/7] net/sched: cls_api: Support hardware miss to tc action Ilya Maximets
2023-02-06 17:14 ` Paul Blakey
2023-02-07 0:20 ` Ilya Maximets
2023-02-07 5:03 ` Marcelo Leitner
2023-02-07 5:20 ` Jakub Kicinski
2023-02-08 8:41 ` Paul Blakey
2023-02-08 18:01 ` Marcelo Leitner
2023-02-09 0:09 ` Ilya Maximets
2023-02-09 1:09 ` Marcelo Leitner
2023-02-09 12:07 ` Ilya Maximets
2023-02-09 12:40 ` Paul Blakey