netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting.
@ 2024-04-08 12:57 Adrian Moreno
  2024-04-08 12:57 ` [RFC net-next v2 1/5] net: netlink: export genl private pointer getters Adrian Moreno
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

** Background **
Currently, OVS supports several packet sampling mechanisms (sFlow,
per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
userspace action that needs to be handled by ovs-vswitchd's handler
threads only to be forwarded to some third party application that
will somehow process the sample and provide observability on the
datapath.

A particularly interesting use-case is controller-driven
per-flow IPFIX sampling where the OpenFlow controller can add metadata
to samples (via two 32bit integers) and this metadata is then available
to the sample-collecting system for correlation.

** Problem **
The fact that sampled traffic share netlink sockets and handler thread
time with upcalls, apart from being a performance bottleneck in the
sample extraction itself, can severely compromise the datapath,
yielding this solution unfit for highly loaded production systems.

Users are left with little options other than guessing what sampling
rate will be OK for their traffic pattern and system load and dealing
with the lost accuracy.

Looking at available infrastructure, an obvious candidated would be
to use psample. However, it's current state does not help with the
use-case at stake because sampled packets do not contain user-defined
metadata.

** Proposal **
In this RFC_v2, I'd like to request feedback on an attempt to fix this
situation by:

- Extending the existing psample infrastructure to carry a variable
length user-defined cookie.
- Extending the tc act_sample action to use the actions' cookie as
psample metadata.
- Extending the OVS sample action with an attribute
(OVS_SAMPLE_ATTR_PSAMPLE) that contains a u32 group_id and a variable
length cookie and use the the psample infrastructure to multicast
the sample.
- Extending psample to allow group_id filtering on listening sockets so
that users can only receive samples they are interested in.

--
rfc_v1 -> rfc_v2:
- Use psample instead of a new OVS-only multicast group.
- Extend psample and tc with a user-defined cookie.


Adrian Moreno (5):
  net: netlink: export genl private pointer getters
  net: psample: add multicast filtering on group_id
  net: psample: add user cookie
  net:sched:act_sample: add action cookie to sample
  net:openvswitch: add psample support

 include/net/psample.h            |   2 +
 include/uapi/linux/openvswitch.h |  22 ++++-
 include/uapi/linux/psample.h     |   2 +
 net/netlink/genetlink.c          |   2 +
 net/openvswitch/actions.c        |  52 ++++++++++--
 net/openvswitch/datapath.c       |   2 +-
 net/openvswitch/flow_netlink.c   |  78 +++++++++++++----
 net/psample/psample.c            | 139 +++++++++++++++++++++++++++++--
 net/sched/act_sample.c           |  12 +++
 9 files changed, 277 insertions(+), 34 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC net-next v2 1/5] net: netlink: export genl private pointer getters
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
@ 2024-04-08 12:57 ` Adrian Moreno
  2024-04-09 21:48   ` Jakub Kicinski
  2024-04-08 12:57 ` [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id Adrian Moreno
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

Both "__genl_sk_priv_get" and "genl_sk_priv_get" are useful functions to
handle private pointers attached to genetlink sockets.
Export them so they can be used in modules.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 net/netlink/genetlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index feb54c63a116..beafa415a9f5 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -726,6 +726,7 @@ void *__genl_sk_priv_get(struct genl_family *family, struct sock *sk)
 		return ERR_PTR(-EINVAL);
 	return xa_load(family->sock_privs, (unsigned long) sk);
 }
+EXPORT_SYMBOL(__genl_sk_priv_get);
 
 /**
  * genl_sk_priv_get - Get family private pointer for socket
@@ -764,6 +765,7 @@ void *genl_sk_priv_get(struct genl_family *family, struct sock *sk)
 	}
 	return priv;
 }
+EXPORT_SYMBOL(genl_sk_priv_get);
 
 /**
  * genl_register_family - register a generic netlink family
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
  2024-04-08 12:57 ` [RFC net-next v2 1/5] net: netlink: export genl private pointer getters Adrian Moreno
@ 2024-04-08 12:57 ` Adrian Moreno
  2024-04-08 13:18   ` Ilya Maximets
  2024-04-10 13:06   ` Ido Schimmel
  2024-04-08 12:57 ` [RFC net-next v2 3/5] net: psample: add user cookie Adrian Moreno
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

Packet samples can come from several places (e.g: different tc sample
actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
to differentiate them.

Likewise, sample consumers that listen on the multicast group may only
be interested on a single group. However, they are currently forced to
receive all samples and discard the ones that are not relevant, causing
unnecessary overhead.

Allow users to filter on the desired group_id by adding a new command
SAMPLE_FILTER_SET that can be used to pass the desired group id.
Store this filter on the per-socket private pointer and use it for
filtering multicasted samples.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 include/uapi/linux/psample.h |   1 +
 net/psample/psample.c        | 127 +++++++++++++++++++++++++++++++++--
 2 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
index e585db5bf2d2..5e0305b1520d 100644
--- a/include/uapi/linux/psample.h
+++ b/include/uapi/linux/psample.h
@@ -28,6 +28,7 @@ enum psample_command {
 	PSAMPLE_CMD_GET_GROUP,
 	PSAMPLE_CMD_NEW_GROUP,
 	PSAMPLE_CMD_DEL_GROUP,
+	PSAMPLE_CMD_SAMPLE_FILTER_SET,
 };
 
 enum psample_tunnel_key_attr {
diff --git a/net/psample/psample.c b/net/psample/psample.c
index a5d9b8446f77..a0cef63dfdec 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -98,13 +98,84 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
 	return msg->len;
 }
 
-static const struct genl_small_ops psample_nl_ops[] = {
+struct psample_obj_desc {
+	struct rcu_head rcu;
+	u32 group_num;
+	bool group_num_valid;
+};
+
+struct psample_nl_sock_priv {
+	struct psample_obj_desc __rcu *flt;
+	spinlock_t flt_lock; /* Protects flt. */
+};
+
+static void psample_nl_sock_priv_init(void *priv)
+{
+	struct psample_nl_sock_priv *sk_priv = priv;
+
+	spin_lock_init(&sk_priv->flt_lock);
+}
+
+static void psample_nl_sock_priv_destroy(void *priv)
+{
+	struct psample_nl_sock_priv *sk_priv = priv;
+	struct psample_obj_desc *flt;
+
+	flt = rcu_dereference_protected(sk_priv->flt, true);
+	kfree_rcu(flt, rcu);
+}
+
+static int psample_nl_sample_filter_set_doit(struct sk_buff *skb,
+					     struct genl_info *info)
+{
+	struct psample_nl_sock_priv *sk_priv;
+	struct nlattr **attrs = info->attrs;
+	struct psample_obj_desc *flt;
+
+	flt = kzalloc(sizeof(*flt), GFP_KERNEL);
+
+	if (attrs[PSAMPLE_ATTR_SAMPLE_GROUP]) {
+		flt->group_num = nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_GROUP]);
+		flt->group_num_valid = true;
+	}
+
+	if (!flt->group_num_valid) {
+		kfree(flt);
+		flt = NULL;
+	}
+
+	sk_priv = genl_sk_priv_get(&psample_nl_family, NETLINK_CB(skb).sk);
+	if (IS_ERR(sk_priv)) {
+		kfree(flt);
+		return PTR_ERR(sk_priv);
+	}
+
+	spin_lock(&sk_priv->flt_lock);
+	flt = rcu_replace_pointer(sk_priv->flt, flt,
+				  lockdep_is_held(&sk_priv->flt_lock));
+	spin_unlock(&sk_priv->flt_lock);
+	kfree_rcu(flt, rcu);
+	return 0;
+}
+
+static const struct nla_policy
+	psample_sample_filter_set_policy[PSAMPLE_ATTR_SAMPLE_GROUP + 1] = {
+	[PSAMPLE_ATTR_SAMPLE_GROUP] = { .type = NLA_U32, },
+};
+
+static const struct genl_ops psample_nl_ops[] = {
 	{
 		.cmd = PSAMPLE_CMD_GET_GROUP,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
 		.dumpit = psample_nl_cmd_get_group_dumpit,
 		/* can be retrieved by unprivileged users */
-	}
+	},
+	{
+		.cmd		= PSAMPLE_CMD_SAMPLE_FILTER_SET,
+		.doit		= psample_nl_sample_filter_set_doit,
+		.policy		= psample_sample_filter_set_policy,
+		.flags		= 0,
+	},
 };
 
 static struct genl_family psample_nl_family __ro_after_init = {
@@ -114,10 +185,13 @@ static struct genl_family psample_nl_family __ro_after_init = {
 	.netnsok	= true,
 	.module		= THIS_MODULE,
 	.mcgrps		= psample_nl_mcgrps,
-	.small_ops	= psample_nl_ops,
-	.n_small_ops	= ARRAY_SIZE(psample_nl_ops),
+	.ops		= psample_nl_ops,
+	.n_ops		= ARRAY_SIZE(psample_nl_ops),
 	.resv_start_op	= PSAMPLE_CMD_GET_GROUP + 1,
 	.n_mcgrps	= ARRAY_SIZE(psample_nl_mcgrps),
+	.sock_priv_size		= sizeof(struct psample_nl_sock_priv),
+	.sock_priv_init		= psample_nl_sock_priv_init,
+	.sock_priv_destroy	= psample_nl_sock_priv_destroy,
 };
 
 static void psample_group_notify(struct psample_group *group,
@@ -360,6 +434,42 @@ static int psample_tunnel_meta_len(struct ip_tunnel_info *tun_info)
 }
 #endif
 
+static inline void psample_nl_obj_desc_init(struct psample_obj_desc *desc,
+					    u32 group_num)
+{
+	memset(desc, 0, sizeof(*desc));
+	desc->group_num = group_num;
+	desc->group_num_valid = true;
+}
+
+static bool psample_obj_desc_match(struct psample_obj_desc *desc,
+				   struct psample_obj_desc *flt)
+{
+	if (desc->group_num_valid && flt->group_num_valid &&
+	    desc->group_num != flt->group_num)
+		return false;
+	return true;
+}
+
+static int psample_nl_sample_filter(struct sock *dsk, struct sk_buff *skb,
+				    void *data)
+{
+	struct psample_obj_desc *desc = data;
+	struct psample_nl_sock_priv *sk_priv;
+	struct psample_obj_desc *flt;
+	int ret = 0;
+
+	rcu_read_lock();
+	sk_priv = __genl_sk_priv_get(&psample_nl_family, dsk);
+	if (!IS_ERR_OR_NULL(sk_priv)) {
+		flt = rcu_dereference(sk_priv->flt);
+		if (flt)
+			ret = !psample_obj_desc_match(desc, flt);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
 			   u32 sample_rate, const struct psample_metadata *md)
 {
@@ -370,6 +480,7 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
 #ifdef CONFIG_INET
 	struct ip_tunnel_info *tun_info;
 #endif
+	struct psample_obj_desc desc;
 	struct sk_buff *nl_skb;
 	int data_len;
 	int meta_len;
@@ -487,8 +598,12 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
 #endif
 
 	genlmsg_end(nl_skb, data);
-	genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
-				PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
+	psample_nl_obj_desc_init(&desc, group->group_num);
+	genlmsg_multicast_netns_filtered(&psample_nl_family,
+					 group->net, nl_skb, 0,
+					 PSAMPLE_NL_MCGRP_SAMPLE,
+					 GFP_ATOMIC, psample_nl_sample_filter,
+					 &desc);
 
 	return;
 error:
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC net-next v2 3/5] net: psample: add user cookie
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
  2024-04-08 12:57 ` [RFC net-next v2 1/5] net: netlink: export genl private pointer getters Adrian Moreno
  2024-04-08 12:57 ` [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id Adrian Moreno
@ 2024-04-08 12:57 ` Adrian Moreno
  2024-04-08 13:19   ` Ilya Maximets
  2024-04-08 12:57 ` [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample Adrian Moreno
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

Add a user cookie to the sample metadata so that sample emitters can
provide more contextual information to samples.

If present, send the user cookie in a new attribute:
PSAMPLE_ATTR_USER_COOKIE.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 include/net/psample.h        |  2 ++
 include/uapi/linux/psample.h |  1 +
 net/psample/psample.c        | 12 +++++++++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/net/psample.h b/include/net/psample.h
index 0509d2d6be67..2503ab3c92a5 100644
--- a/include/net/psample.h
+++ b/include/net/psample.h
@@ -25,6 +25,8 @@ struct psample_metadata {
 	   out_tc_occ_valid:1,
 	   latency_valid:1,
 	   unused:5;
+	u8 *user_cookie;
+	u32 user_cookie_len;
 };
 
 struct psample_group *psample_group_get(struct net *net, u32 group_num);
diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
index 5e0305b1520d..1f61fd7ef7fd 100644
--- a/include/uapi/linux/psample.h
+++ b/include/uapi/linux/psample.h
@@ -19,6 +19,7 @@ enum {
 	PSAMPLE_ATTR_LATENCY,		/* u64, nanoseconds */
 	PSAMPLE_ATTR_TIMESTAMP,		/* u64, nanoseconds */
 	PSAMPLE_ATTR_PROTO,		/* u16 */
+	PSAMPLE_ATTR_USER_COOKIE,
 
 	__PSAMPLE_ATTR_MAX
 };
diff --git a/net/psample/psample.c b/net/psample/psample.c
index a0cef63dfdec..9fdb88e01f21 100644
--- a/net/psample/psample.c
+++ b/net/psample/psample.c
@@ -497,7 +497,8 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
 		   nla_total_size(sizeof(u32)) +	/* group_num */
 		   nla_total_size(sizeof(u32)) +	/* seq */
 		   nla_total_size_64bit(sizeof(u64)) +	/* timestamp */
-		   nla_total_size(sizeof(u16));		/* protocol */
+		   nla_total_size(sizeof(u16)) +	/* protocol */
+		   nla_total_size(md->user_cookie_len);	/* user_cookie */
 
 #ifdef CONFIG_INET
 	tun_info = skb_tunnel_info(skb);
@@ -596,6 +597,15 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
 			goto error;
 	}
 #endif
+	if (md->user_cookie && md->user_cookie_len) {
+		int nla_len = nla_total_size(md->user_cookie_len);
+		struct nlattr *nla;
+
+		nla = skb_put(nl_skb, nla_len);
+		nla->nla_type = PSAMPLE_ATTR_USER_COOKIE;
+		nla->nla_len = nla_attr_size(md->user_cookie_len);
+		memcpy(nla_data(nla), md->user_cookie, md->user_cookie_len);
+	}
 
 	genlmsg_end(nl_skb, data);
 	psample_nl_obj_desc_init(&desc, group->group_num);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
                   ` (2 preceding siblings ...)
  2024-04-08 12:57 ` [RFC net-next v2 3/5] net: psample: add user cookie Adrian Moreno
@ 2024-04-08 12:57 ` Adrian Moreno
  2024-04-08 13:20   ` Ilya Maximets
  2024-04-08 12:57 ` [RFC net-next v2 5/5] net:openvswitch: add psample support Adrian Moreno
  2024-04-08 13:16 ` [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Ilya Maximets
  5 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

If the action has a user_cookie, pass it along to the sample so it can
be easily identified.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 net/sched/act_sample.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index a69b53d54039..5c3f86ec964a 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -165,9 +165,11 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
 				     const struct tc_action *a,
 				     struct tcf_result *res)
 {
+	u8 cookie_data[TC_COOKIE_MAX_SIZE] = {};
 	struct tcf_sample *s = to_sample(a);
 	struct psample_group *psample_group;
 	struct psample_metadata md = {};
+	struct tc_cookie *user_cookie;
 	int retval;
 
 	tcf_lastuse_update(&s->tcf_tm);
@@ -189,6 +191,16 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
 		if (skb_at_tc_ingress(skb) && tcf_sample_dev_ok_push(skb->dev))
 			skb_push(skb, skb->mac_len);
 
+		rcu_read_lock();
+		user_cookie = rcu_dereference(a->user_cookie);
+		if (user_cookie) {
+			memcpy(cookie_data, user_cookie->data,
+			       user_cookie->len);
+			md.user_cookie = cookie_data;
+			md.user_cookie_len = user_cookie->len;
+		}
+		rcu_read_unlock();
+
 		md.trunc_size = s->truncate ? s->trunc_size : skb->len;
 		psample_sample_packet(psample_group, skb, s->rate, &md);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
                   ` (3 preceding siblings ...)
  2024-04-08 12:57 ` [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample Adrian Moreno
@ 2024-04-08 12:57 ` Adrian Moreno
  2024-04-08 13:37   ` Ilya Maximets
  2024-04-08 13:16 ` [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Ilya Maximets
  5 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 12:57 UTC (permalink / raw)
  To: netdev
  Cc: Adrian Moreno, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets,
	aconole, echaudro, horms

Add a new attribute to the sample action, called
OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
user-defined cookie.

The maximum length of the user-defined cookie is set to 16, same as
tc_cookie to discourage using cookies that will not be offloadable.

When set, the sample action will use psample to multicast the sample.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
---
 include/uapi/linux/openvswitch.h | 22 +++++++--
 net/openvswitch/actions.c        | 52 ++++++++++++++++++---
 net/openvswitch/datapath.c       |  2 +-
 net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
 4 files changed, 127 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index efc82c318fa2..a5a32588f582 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -646,15 +646,24 @@ enum ovs_flow_attr {
  * %UINT32_MAX samples all packets and intermediate values sample intermediate
  * fractions of packets.
  * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
- * Actions are passed as nested attributes.
+ * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
+ * is not set.
+ * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
+ * OVS_SAMPLE_ATTR_ACTIONS is not set.
  *
- * Executes the specified actions with the given probability on a per-packet
- * basis.
+ * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
+ * specified.
+ *
+ * Executes the specified actions and/or sends the packet to psample
+ * with the given probability on a per-packet basis.
  */
 enum ovs_sample_attr {
 	OVS_SAMPLE_ATTR_UNSPEC,
 	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
 	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
+	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
+				      * by the user-provided cookie.
+				      */
 	__OVS_SAMPLE_ATTR_MAX,
 
 #ifdef __KERNEL__
@@ -675,6 +684,13 @@ struct sample_arg {
 };
 #endif
 
+#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
+struct ovs_psample {
+	__u32 group_id;		/* The group used for packet sampling. */
+	__u32 user_cookie_len;	/* The length of the user-provided cookie. */
+	__u8 user_cookie[];	/* The user-provided cookie. */
+};
+
 /**
  * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
  * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 6fcd7e2ca81f..45d2b325b76a 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -24,6 +24,7 @@
 #include <net/checksum.h>
 #include <net/dsfield.h>
 #include <net/mpls.h>
+#include <net/psample.h>
 #include <net/sctp/checksum.h>
 
 #include "datapath.h"
@@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
 	return 0;
 }
 
+static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
+			      struct ovs_psample *psample, struct sk_buff *skb,
+			      u32 rate)
+{
+	struct psample_group psample_group = {};
+	struct psample_metadata md = {};
+	struct vport *input_vport;
+
+	psample_group.group_num = psample->group_id;
+	psample_group.net = ovs_dp_get_net(dp);
+
+	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
+	if (!input_vport)
+		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
+
+	md.in_ifindex = input_vport->dev->ifindex;
+	md.user_cookie = psample->user_cookie;
+	md.user_cookie_len = psample->user_cookie_len;
+	md.trunc_size = skb->len;
+
+	psample_sample_packet(&psample_group, skb, rate, &md);
+
+	return 0;
+}
+
 /* When 'last' is true, sample() should always consume the 'skb'.
  * Otherwise, sample() should keep 'skb' intact regardless what
  * actions are executed within sample().
@@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 		  struct sw_flow_key *key, const struct nlattr *attr,
 		  bool last)
 {
-	struct nlattr *actions;
+	const struct sample_arg *arg;
 	struct nlattr *sample_arg;
 	int rem = nla_len(attr);
-	const struct sample_arg *arg;
+	struct nlattr *next;
 	bool clone_flow_key;
+	int ret;
 
 	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
 	sample_arg = nla_data(attr);
 	arg = nla_data(sample_arg);
-	actions = nla_next(sample_arg, &rem);
+	next = nla_next(sample_arg, &rem);
 
 	if ((arg->probability != U32_MAX) &&
 	    (!arg->probability || get_random_u32() > arg->probability)) {
@@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 		return 0;
 	}
 
-	clone_flow_key = !arg->exec;
-	return clone_execute(dp, skb, key, 0, actions, rem, last,
-			     clone_flow_key);
+	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {
+		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
+					 arg->probability);
+		if (last)
+			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
+		if (ret)
+			return ret;
+		next = nla_next(next, &rem);
+	}
+
+	if (nla_ok(next, rem)) {
+		clone_flow_key = !arg->exec;
+		ret = clone_execute(dp, skb, key, 0, next, rem, last,
+				    clone_flow_key);
+	}
+	return ret;
 }
 
 /* When 'last' is true, clone() should always consume the 'skb'.
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 99d72543abd3..b5b560c2e74b 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	struct sw_flow_match match;
 	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
 	int error;
-	bool log = !a[OVS_FLOW_ATTR_PROBE];
+	bool log = true;
 
 	/* Must have key and actions. */
 	error = -EINVAL;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index f224d9bcea5e..f540686271b7 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
 
 	switch (nla_type(a)) {
 	case OVS_SAMPLE_ATTR_ARG:
-		/* The real list of actions follows this attribute. */
 		a = nla_next(a, &rem);
+
+		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
+		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
+			a = nla_next(a, &rem);
+
 		ovs_nla_free_nested_actions(a, rem);
 		break;
 	}
@@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 				  u32 mpls_label_count, bool log,
 				  u32 depth);
 
+static int copy_action(const struct nlattr *from,
+		       struct sw_flow_actions **sfa, bool log);
+
 static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 				    const struct sw_flow_key *key,
 				    struct sw_flow_actions **sfa,
@@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 				    u32 depth)
 {
 	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
-	const struct nlattr *probability, *actions;
+	const struct nlattr *probability, *actions, *psample;
 	const struct nlattr *a;
-	int rem, start, err;
 	struct sample_arg arg;
+	int rem, start, err;
 
 	memset(attrs, 0, sizeof(attrs));
 	nla_for_each_nested(a, attr, rem) {
@@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 		return -EINVAL;
 
 	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
-	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
+	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
+		return -EINVAL;
+
+	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
+	if (psample) {
+		struct ovs_psample *ovs_ps;
+
+		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
+			return -EINVAL;
+
+		ovs_ps = nla_data(psample);
+		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
+		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
+			return -EINVAL;
+	}
+
+	if (!psample && !actions)
 		return -EINVAL;
 
 	/* validation done, copy sample action. */
@@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 	 * If the sample is the last action, it can always be excuted
 	 * rather than deferred.
 	 */
-	arg.exec = last || !actions_may_change_flow(actions);
+	if (actions)
+		arg.exec = last || !actions_may_change_flow(actions);
+
 	arg.probability = nla_get_u32(probability);
 
 	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
@@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
 	if (err)
 		return err;
 
-	err = __ovs_nla_copy_actions(net, actions, key, sfa,
-				     eth_type, vlan_tci, mpls_label_count, log,
-				     depth + 1);
+	if (psample)
+		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
+					 nla_data(psample), nla_len(psample),
+					 log);
+	if (err)
+		return err;
 
+	if (actions)
+		err = __ovs_nla_copy_actions(net, actions, key, sfa,
+					     eth_type, vlan_tci,
+					     mpls_label_count, log, depth + 1);
 	if (err)
 		return err;
 
@@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
 	struct nlattr *start, *ac_start = NULL, *sample_arg;
 	int err = 0, rem = nla_len(attr);
 	const struct sample_arg *arg;
-	struct nlattr *actions;
+	struct nlattr *next;
 
 	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
 	if (!start)
@@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
 
 	sample_arg = nla_data(attr);
 	arg = nla_data(sample_arg);
-	actions = nla_next(sample_arg, &rem);
+	next = nla_next(sample_arg, &rem);
 
 	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
 		err = -EMSGSIZE;
 		goto out;
 	}
 
-	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
-	if (!ac_start) {
-		err = -EMSGSIZE;
-		goto out;
+	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
+		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
+			    nla_data(next))) {
+			err = -EMSGSIZE;
+			goto out;
+		}
+		next = nla_next(next, &rem);
 	}
 
-	err = ovs_nla_put_actions(actions, rem, skb);
+	if (nla_ok(next, rem)) {
+		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
+		if (!ac_start) {
+			err = -EMSGSIZE;
+			goto out;
+		}
+		err = ovs_nla_put_actions(next, rem, skb);
+	}
 
 out:
 	if (err) {
-		nla_nest_cancel(skb, ac_start);
+		if (ac_start)
+			nla_nest_cancel(skb, ac_start);
 		nla_nest_cancel(skb, start);
 	} else {
-		nla_nest_end(skb, ac_start);
+		if (ac_start)
+			nla_nest_end(skb, ac_start);
 		nla_nest_end(skb, start);
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting.
  2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
                   ` (4 preceding siblings ...)
  2024-04-08 12:57 ` [RFC net-next v2 5/5] net:openvswitch: add psample support Adrian Moreno
@ 2024-04-08 13:16 ` Ilya Maximets
  5 siblings, 0 replies; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 13:16 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

On 4/8/24 14:57, Adrian Moreno wrote:
> ** Background **
> Currently, OVS supports several packet sampling mechanisms (sFlow,
> per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
> userspace action that needs to be handled by ovs-vswitchd's handler
> threads only to be forwarded to some third party application that
> will somehow process the sample and provide observability on the
> datapath.
> 
> A particularly interesting use-case is controller-driven
> per-flow IPFIX sampling where the OpenFlow controller can add metadata
> to samples (via two 32bit integers) and this metadata is then available
> to the sample-collecting system for correlation.
> 
> ** Problem **
> The fact that sampled traffic share netlink sockets and handler thread
> time with upcalls, apart from being a performance bottleneck in the
> sample extraction itself, can severely compromise the datapath,
> yielding this solution unfit for highly loaded production systems.
> 
> Users are left with little options other than guessing what sampling
> rate will be OK for their traffic pattern and system load and dealing
> with the lost accuracy.
> 
> Looking at available infrastructure, an obvious candidated would be
> to use psample. However, it's current state does not help with the
> use-case at stake because sampled packets do not contain user-defined
> metadata.
> 
> ** Proposal **
> In this RFC_v2, I'd like to request feedback on an attempt to fix this
> situation by:
> 
> - Extending the existing psample infrastructure to carry a variable
> length user-defined cookie.
> - Extending the tc act_sample action to use the actions' cookie as
> psample metadata.
> - Extending the OVS sample action with an attribute
> (OVS_SAMPLE_ATTR_PSAMPLE) that contains a u32 group_id and a variable
> length cookie and use the the psample infrastructure to multicast
> the sample.
> - Extending psample to allow group_id filtering on listening sockets so
> that users can only receive samples they are interested in.
> 
> --
> rfc_v1 -> rfc_v2:
> - Use psample instead of a new OVS-only multicast group.
> - Extend psample and tc with a user-defined cookie.
> 
> 
> Adrian Moreno (5):
>   net: netlink: export genl private pointer getters
>   net: psample: add multicast filtering on group_id
>   net: psample: add user cookie
>   net:sched:act_sample: add action cookie to sample
>   net:openvswitch: add psample support
> 
>  include/net/psample.h            |   2 +
>  include/uapi/linux/openvswitch.h |  22 ++++-
>  include/uapi/linux/psample.h     |   2 +
>  net/netlink/genetlink.c          |   2 +
>  net/openvswitch/actions.c        |  52 ++++++++++--
>  net/openvswitch/datapath.c       |   2 +-
>  net/openvswitch/flow_netlink.c   |  78 +++++++++++++----
>  net/psample/psample.c            | 139 +++++++++++++++++++++++++++++--
>  net/sched/act_sample.c           |  12 +++
>  9 files changed, 277 insertions(+), 34 deletions(-)
> 

[copying my previous reply since this version actually has netdev@ in Cc]

Thanks, Adrian!

FWIW, I think this is a reasonable approach to the problem.

I won't do a full review since it's an RFC, but I'll leave some
comments for individual patches.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-08 12:57 ` [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id Adrian Moreno
@ 2024-04-08 13:18   ` Ilya Maximets
  2024-04-08 19:24     ` Adrian Moreno
  2024-04-10 13:06   ` Ido Schimmel
  1 sibling, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 13:18 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

[copying my previous reply since this version actually has netdev@ in Cc]

On 4/8/24 14:57, Adrian Moreno wrote:
> Packet samples can come from several places (e.g: different tc sample
> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
> to differentiate them.
> 
> Likewise, sample consumers that listen on the multicast group may only
> be interested on a single group. However, they are currently forced to
> receive all samples and discard the ones that are not relevant, causing
> unnecessary overhead.
> 
> Allow users to filter on the desired group_id by adding a new command
> SAMPLE_FILTER_SET that can be used to pass the desired group id.
> Store this filter on the per-socket private pointer and use it for
> filtering multicasted samples.
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  include/uapi/linux/psample.h |   1 +
>  net/psample/psample.c        | 127 +++++++++++++++++++++++++++++++++--
>  2 files changed, 122 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
> index e585db5bf2d2..5e0305b1520d 100644
> --- a/include/uapi/linux/psample.h
> +++ b/include/uapi/linux/psample.h
> @@ -28,6 +28,7 @@ enum psample_command {
>  	PSAMPLE_CMD_GET_GROUP,
>  	PSAMPLE_CMD_NEW_GROUP,
>  	PSAMPLE_CMD_DEL_GROUP,
> +	PSAMPLE_CMD_SAMPLE_FILTER_SET,
Other commands are names as PSAMPLE_CMD_VERB_NOUN, so this new one
should be PSAMPLE_CMD_SET_FILTER.  (The SAMPLE part seems unnecessary.)
Some functions/structures need to be renamed accordingly.

>  };
>  
>  enum psample_tunnel_key_attr {
> diff --git a/net/psample/psample.c b/net/psample/psample.c
> index a5d9b8446f77..a0cef63dfdec 100644
> --- a/net/psample/psample.c
> +++ b/net/psample/psample.c
> @@ -98,13 +98,84 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
>  	return msg->len;
>  }
>  
> -static const struct genl_small_ops psample_nl_ops[] = {
> +struct psample_obj_desc {
> +	struct rcu_head rcu;
> +	u32 group_num;
> +	bool group_num_valid;
> +};
> +
> +struct psample_nl_sock_priv {
> +	struct psample_obj_desc __rcu *flt;

Can we call it 'fileter' ?  I find it hard to read the code with
this unnecessary abbreviation.  Same for the lock below.

> +	spinlock_t flt_lock; /* Protects flt. */
> +};
> +
> +static void psample_nl_sock_priv_init(void *priv)
> +{
> +	struct psample_nl_sock_priv *sk_priv = priv;
> +
> +	spin_lock_init(&sk_priv->flt_lock);
> +}
> +
> +static void psample_nl_sock_priv_destroy(void *priv)
> +{
> +	struct psample_nl_sock_priv *sk_priv = priv;
> +	struct psample_obj_desc *flt;
> +
> +	flt = rcu_dereference_protected(sk_priv->flt, true);
> +	kfree_rcu(flt, rcu);
> +}
> +
> +static int psample_nl_sample_filter_set_doit(struct sk_buff *skb,
> +					     struct genl_info *info)
> +{
> +	struct psample_nl_sock_priv *sk_priv;
> +	struct nlattr **attrs = info->attrs;
> +	struct psample_obj_desc *flt;
> +
> +	flt = kzalloc(sizeof(*flt), GFP_KERNEL);
> +
> +	if (attrs[PSAMPLE_ATTR_SAMPLE_GROUP]) {
> +		flt->group_num = nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_GROUP]);
> +		flt->group_num_valid = true;
> +	}
> +
> +	if (!flt->group_num_valid) {
> +		kfree(flt);

Might be better to not allocate it in the first place.

> +		flt = NULL;
> +	}
> +
> +	sk_priv = genl_sk_priv_get(&psample_nl_family, NETLINK_CB(skb).sk);
> +	if (IS_ERR(sk_priv)) {
> +		kfree(flt);
> +		return PTR_ERR(sk_priv);
> +	}
> +
> +	spin_lock(&sk_priv->flt_lock);
> +	flt = rcu_replace_pointer(sk_priv->flt, flt,
> +				  lockdep_is_held(&sk_priv->flt_lock));
> +	spin_unlock(&sk_priv->flt_lock);
> +	kfree_rcu(flt, rcu);
> +	return 0;
> +}
> +
> +static const struct nla_policy
> +	psample_sample_filter_set_policy[PSAMPLE_ATTR_SAMPLE_GROUP + 1] = {
> +	[PSAMPLE_ATTR_SAMPLE_GROUP] = { .type = NLA_U32, },

This indentation is confusing, though I'm not sure what's a better way.

> +};
> +
> +static const struct genl_ops psample_nl_ops[] = {
>  	{
>  		.cmd = PSAMPLE_CMD_GET_GROUP,
>  		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>  		.dumpit = psample_nl_cmd_get_group_dumpit,
>  		/* can be retrieved by unprivileged users */
> -	}
> +	},
> +	{
> +		.cmd		= PSAMPLE_CMD_SAMPLE_FILTER_SET,
> +		.doit		= psample_nl_sample_filter_set_doit,
> +		.policy		= psample_sample_filter_set_policy,
> +		.flags		= 0,
> +	},
>  };
>  
>  static struct genl_family psample_nl_family __ro_after_init = {
> @@ -114,10 +185,13 @@ static struct genl_family psample_nl_family __ro_after_init = {
>  	.netnsok	= true,
>  	.module		= THIS_MODULE,
>  	.mcgrps		= psample_nl_mcgrps,
> -	.small_ops	= psample_nl_ops,
> -	.n_small_ops	= ARRAY_SIZE(psample_nl_ops),
> +	.ops		= psample_nl_ops,
> +	.n_ops		= ARRAY_SIZE(psample_nl_ops),
>  	.resv_start_op	= PSAMPLE_CMD_GET_GROUP + 1,
>  	.n_mcgrps	= ARRAY_SIZE(psample_nl_mcgrps),
> +	.sock_priv_size		= sizeof(struct psample_nl_sock_priv),
> +	.sock_priv_init		= psample_nl_sock_priv_init,
> +	.sock_priv_destroy	= psample_nl_sock_priv_destroy,
>  };
>  
>  static void psample_group_notify(struct psample_group *group,
> @@ -360,6 +434,42 @@ static int psample_tunnel_meta_len(struct ip_tunnel_info *tun_info)
>  }
>  #endif
>  
> +static inline void psample_nl_obj_desc_init(struct psample_obj_desc *desc,
> +					    u32 group_num)
> +{
> +	memset(desc, 0, sizeof(*desc));
> +	desc->group_num = group_num;
> +	desc->group_num_valid = true;
> +}
> +
> +static bool psample_obj_desc_match(struct psample_obj_desc *desc,
> +				   struct psample_obj_desc *flt)
> +{
> +	if (desc->group_num_valid && flt->group_num_valid &&
> +	    desc->group_num != flt->group_num)
> +		return false;
> +	return true;

This fucntion returns 'true' if one of the arguments is not valid.
I'd not expect such behavior from a 'match' function.

I understand the intention that psample should sample everything
to sockets that do not request filters, but that should not be part
of the 'match' logic, or more appropriate function name should be
chosen.  Also, if the group is not initialized, but the filter is,
it should not match, logically.  The validity on filter and the
current sample is not symmetric.

And I'm not really sure if the 'group_num_valid' is actually needed.
Can the NULL pointer be used as an indicator?  If so, then maybe
the whole psample_obj_desc structure is not needed as it will
contain a single field.

> +}
> +
> +static int psample_nl_sample_filter(struct sock *dsk, struct sk_buff *skb,
> +				    void *data)
> +{
> +	struct psample_obj_desc *desc = data;
> +	struct psample_nl_sock_priv *sk_priv;
> +	struct psample_obj_desc *flt;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +	sk_priv = __genl_sk_priv_get(&psample_nl_family, dsk);
> +	if (!IS_ERR_OR_NULL(sk_priv)) {
> +		flt = rcu_dereference(sk_priv->flt);
> +		if (flt)
> +			ret = !psample_obj_desc_match(desc, flt);
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>  			   u32 sample_rate, const struct psample_metadata *md)
>  {
> @@ -370,6 +480,7 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>  #ifdef CONFIG_INET
>  	struct ip_tunnel_info *tun_info;
>  #endif
> +	struct psample_obj_desc desc;
>  	struct sk_buff *nl_skb;
>  	int data_len;
>  	int meta_len;
> @@ -487,8 +598,12 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>  #endif
>  
>  	genlmsg_end(nl_skb, data);
> -	genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
> -				PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
> +	psample_nl_obj_desc_init(&desc, group->group_num);
> +	genlmsg_multicast_netns_filtered(&psample_nl_family,
> +					 group->net, nl_skb, 0,
> +					 PSAMPLE_NL_MCGRP_SAMPLE,
> +					 GFP_ATOMIC, psample_nl_sample_filter,
> +					 &desc);
>  
>  	return;
>  error:


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 3/5] net: psample: add user cookie
  2024-04-08 12:57 ` [RFC net-next v2 3/5] net: psample: add user cookie Adrian Moreno
@ 2024-04-08 13:19   ` Ilya Maximets
  2024-04-08 19:28     ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 13:19 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

[copying my previous reply since this version actually has netdev@ in Cc]

On 4/8/24 14:57, Adrian Moreno wrote:
> Add a user cookie to the sample metadata so that sample emitters can
> provide more contextual information to samples.
> 
> If present, send the user cookie in a new attribute:
> PSAMPLE_ATTR_USER_COOKIE.
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  include/net/psample.h        |  2 ++
>  include/uapi/linux/psample.h |  1 +
>  net/psample/psample.c        | 12 +++++++++++-
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/psample.h b/include/net/psample.h
> index 0509d2d6be67..2503ab3c92a5 100644
> --- a/include/net/psample.h
> +++ b/include/net/psample.h
> @@ -25,6 +25,8 @@ struct psample_metadata {
>  	   out_tc_occ_valid:1,
>  	   latency_valid:1,
>  	   unused:5;
> +	u8 *user_cookie;

The code doesn't take ownership of this data.  It should probably
be a const pointer in this case.

This should also allow us to avoid a double copy first to the
psample_metadata and then to netlink skb, as users will know that
it is a const pointer and so they can hand the data directly.

> +	u32 user_cookie_len;
>  };
>  
>  struct psample_group *psample_group_get(struct net *net, u32 group_num);
> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
> index 5e0305b1520d..1f61fd7ef7fd 100644
> --- a/include/uapi/linux/psample.h
> +++ b/include/uapi/linux/psample.h
> @@ -19,6 +19,7 @@ enum {
>  	PSAMPLE_ATTR_LATENCY,		/* u64, nanoseconds */
>  	PSAMPLE_ATTR_TIMESTAMP,		/* u64, nanoseconds */
>  	PSAMPLE_ATTR_PROTO,		/* u16 */
> +	PSAMPLE_ATTR_USER_COOKIE,

A comment would be nice.

>  
>  	__PSAMPLE_ATTR_MAX
>  };
> diff --git a/net/psample/psample.c b/net/psample/psample.c
> index a0cef63dfdec..9fdb88e01f21 100644
> --- a/net/psample/psample.c
> +++ b/net/psample/psample.c
> @@ -497,7 +497,8 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>  		   nla_total_size(sizeof(u32)) +	/* group_num */
>  		   nla_total_size(sizeof(u32)) +	/* seq */
>  		   nla_total_size_64bit(sizeof(u64)) +	/* timestamp */
> -		   nla_total_size(sizeof(u16));		/* protocol */
> +		   nla_total_size(sizeof(u16)) +	/* protocol */
> +		   nla_total_size(md->user_cookie_len);	/* user_cookie */
>  
>  #ifdef CONFIG_INET
>  	tun_info = skb_tunnel_info(skb);
> @@ -596,6 +597,15 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>  			goto error;
>  	}
>  #endif
> +	if (md->user_cookie && md->user_cookie_len) {
> +		int nla_len = nla_total_size(md->user_cookie_len);
> +		struct nlattr *nla;
> +
> +		nla = skb_put(nl_skb, nla_len);
> +		nla->nla_type = PSAMPLE_ATTR_USER_COOKIE;
> +		nla->nla_len = nla_attr_size(md->user_cookie_len);
> +		memcpy(nla_data(nla), md->user_cookie, md->user_cookie_len);
> +	}
>  
>  	genlmsg_end(nl_skb, data);
>  	psample_nl_obj_desc_init(&desc, group->group_num);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample
  2024-04-08 12:57 ` [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample Adrian Moreno
@ 2024-04-08 13:20   ` Ilya Maximets
  2024-04-11  8:40     ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 13:20 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

[copying my previous reply since this version actually has netdev@ in Cc]

On 4/8/24 14:57, Adrian Moreno wrote:
> If the action has a user_cookie, pass it along to the sample so it can
> be easily identified.
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  net/sched/act_sample.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)

Nit: some spaces in the subject would be nice.

> 
> diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
> index a69b53d54039..5c3f86ec964a 100644
> --- a/net/sched/act_sample.c
> +++ b/net/sched/act_sample.c
> @@ -165,9 +165,11 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
>  				     const struct tc_action *a,
>  				     struct tcf_result *res)
>  {
> +	u8 cookie_data[TC_COOKIE_MAX_SIZE] = {};
>  	struct tcf_sample *s = to_sample(a);
>  	struct psample_group *psample_group;
>  	struct psample_metadata md = {};
> +	struct tc_cookie *user_cookie;
>  	int retval;
>  
>  	tcf_lastuse_update(&s->tcf_tm);
> @@ -189,6 +191,16 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
>  		if (skb_at_tc_ingress(skb) && tcf_sample_dev_ok_push(skb->dev))
>  			skb_push(skb, skb->mac_len);
>  
> +		rcu_read_lock();
> +		user_cookie = rcu_dereference(a->user_cookie);
> +		if (user_cookie) {
> +			memcpy(cookie_data, user_cookie->data,
> +			       user_cookie->len);
> +			md.user_cookie = cookie_data;
> +			md.user_cookie_len = user_cookie->len;
> +		}
> +		rcu_read_unlock();

Not sure what is better - extending rcu critical section
beyond psample_sample_packet() or copying the cookie...
What do you think?

> +
>  		md.trunc_size = s->truncate ? s->trunc_size : skb->len;
>  		psample_sample_packet(psample_group, skb, s->rate, &md);
>  


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-08 12:57 ` [RFC net-next v2 5/5] net:openvswitch: add psample support Adrian Moreno
@ 2024-04-08 13:37   ` Ilya Maximets
  2024-04-08 19:48     ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 13:37 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

On 4/8/24 14:57, Adrian Moreno wrote:
> Add a new attribute to the sample action, called
> OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
> user-defined cookie.
> 
> The maximum length of the user-defined cookie is set to 16, same as
> tc_cookie to discourage using cookies that will not be offloadable.
> 
> When set, the sample action will use psample to multicast the sample.
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
> ---
>  include/uapi/linux/openvswitch.h | 22 +++++++--
>  net/openvswitch/actions.c        | 52 ++++++++++++++++++---
>  net/openvswitch/datapath.c       |  2 +-
>  net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
>  4 files changed, 127 insertions(+), 27 deletions(-)

This cpatch is missing a few bits:
 - Update for Documentation/netlink/specs/ovs_flow.yaml
 - Maybe update for tools/testing/selftests/net/openvswitch/ovs-dpctl.py
 - Maybe some basic selftests.

> 
> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
> index efc82c318fa2..a5a32588f582 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -646,15 +646,24 @@ enum ovs_flow_attr {
>   * %UINT32_MAX samples all packets and intermediate values sample intermediate
>   * fractions of packets.
>   * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
> - * Actions are passed as nested attributes.
> + * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
> + * is not set.

'is set' probably.

> + * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
> + * OVS_SAMPLE_ATTR_ACTIONS is not set.

Same here.

>   *
> - * Executes the specified actions with the given probability on a per-packet
> - * basis.
> + * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
> + * specified.
> + *
> + * Executes the specified actions and/or sends the packet to psample
> + * with the given probability on a per-packet basis.
>   */
>  enum ovs_sample_attr {
>  	OVS_SAMPLE_ATTR_UNSPEC,
>  	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
>  	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
> +	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
> +				      * by the user-provided cookie.
> +				      */
>  	__OVS_SAMPLE_ATTR_MAX,
>  
>  #ifdef __KERNEL__
> @@ -675,6 +684,13 @@ struct sample_arg {
>  };
>  #endif
>  
> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
> +struct ovs_psample {
> +	__u32 group_id;		/* The group used for packet sampling. */
> +	__u32 user_cookie_len;	/* The length of the user-provided cookie. */
> +	__u8 user_cookie[];	/* The user-provided cookie. */
> +};

Structures are not a good approach for modern netlink.
use nested attributes instead.  This way we can also
eliminate the need for variable-length array and the
length field, if the length can be taken from a netlink
attribute directly, e.g. similar to NLA_BINARY in tc.

If necessary, there could be a structure in the private
header to store the data for internal use.

> +
>  /**
>   * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
>   * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 6fcd7e2ca81f..45d2b325b76a 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -24,6 +24,7 @@
>  #include <net/checksum.h>
>  #include <net/dsfield.h>
>  #include <net/mpls.h>
> +#include <net/psample.h>
>  #include <net/sctp/checksum.h>
>  
>  #include "datapath.h"
> @@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
>  	return 0;
>  }
>  
> +static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
> +			      struct ovs_psample *psample, struct sk_buff *skb,
> +			      u32 rate)
> +{
> +	struct psample_group psample_group = {};
> +	struct psample_metadata md = {};
> +	struct vport *input_vport;
> +
> +	psample_group.group_num = psample->group_id;
> +	psample_group.net = ovs_dp_get_net(dp);
> +
> +	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
> +	if (!input_vport)
> +		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
> +
> +	md.in_ifindex = input_vport->dev->ifindex;
> +	md.user_cookie = psample->user_cookie;
> +	md.user_cookie_len = psample->user_cookie_len;
> +	md.trunc_size = skb->len;
> +
> +	psample_sample_packet(&psample_group, skb, rate, &md);
> +
> +	return 0;
> +}
> +
>  /* When 'last' is true, sample() should always consume the 'skb'.
>   * Otherwise, sample() should keep 'skb' intact regardless what
>   * actions are executed within sample().
> @@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>  		  struct sw_flow_key *key, const struct nlattr *attr,
>  		  bool last)
>  {
> -	struct nlattr *actions;
> +	const struct sample_arg *arg;
>  	struct nlattr *sample_arg;
>  	int rem = nla_len(attr);
> -	const struct sample_arg *arg;
> +	struct nlattr *next;
>  	bool clone_flow_key;
> +	int ret;
>  
>  	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
>  	sample_arg = nla_data(attr);
>  	arg = nla_data(sample_arg);
> -	actions = nla_next(sample_arg, &rem);
> +	next = nla_next(sample_arg, &rem);
>  
>  	if ((arg->probability != U32_MAX) &&
>  	    (!arg->probability || get_random_u32() > arg->probability)) {
> @@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>  		return 0;
>  	}
>  
> -	clone_flow_key = !arg->exec;
> -	return clone_execute(dp, skb, key, 0, actions, rem, last,
> -			     clone_flow_key);
> +	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {

Maybe add a commnet that OVS_SAMPLE_ATTR_PSAMPLE is always a sencond
argument when present.

Is there a better way to handle this?

> +		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
> +					 arg->probability);
> +		if (last)
> +			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
> +		if (ret)
> +			return ret;
> +		next = nla_next(next, &rem);
> +	}
> +
> +	if (nla_ok(next, rem)) {
> +		clone_flow_key = !arg->exec;
> +		ret = clone_execute(dp, skb, key, 0, next, rem, last,
> +				    clone_flow_key);
> +	}
> +	return ret;
>  }
>  
>  /* When 'last' is true, clone() should always consume the 'skb'.
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 99d72543abd3..b5b560c2e74b 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>  	struct sw_flow_match match;
>  	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
>  	int error;
> -	bool log = !a[OVS_FLOW_ATTR_PROBE];
> +	bool log = true;

Debugging artifact?

>  
>  	/* Must have key and actions. */
>  	error = -EINVAL;
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index f224d9bcea5e..f540686271b7 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
>  
>  	switch (nla_type(a)) {
>  	case OVS_SAMPLE_ATTR_ARG:
> -		/* The real list of actions follows this attribute. */

Please, don't remove this comment.  Maybe extend it instead.

>  		a = nla_next(a, &rem);
> +
> +		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
> +		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
> +			a = nla_next(a, &rem);
> +
>  		ovs_nla_free_nested_actions(a, rem);
>  		break;
>  	}
> @@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>  				  u32 mpls_label_count, bool log,
>  				  u32 depth);
>  
> +static int copy_action(const struct nlattr *from,
> +		       struct sw_flow_actions **sfa, bool log);
> +
>  static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>  				    const struct sw_flow_key *key,
>  				    struct sw_flow_actions **sfa,
> @@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>  				    u32 depth)
>  {
>  	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
> -	const struct nlattr *probability, *actions;
> +	const struct nlattr *probability, *actions, *psample;
>  	const struct nlattr *a;
> -	int rem, start, err;
>  	struct sample_arg arg;
> +	int rem, start, err;
>  
>  	memset(attrs, 0, sizeof(attrs));
>  	nla_for_each_nested(a, attr, rem) {
> @@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>  		return -EINVAL;
>  
>  	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
> -	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
> +	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
> +		return -EINVAL;
> +
> +	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
> +	if (psample) {
> +		struct ovs_psample *ovs_ps;
> +
> +		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
> +			return -EINVAL;
> +
> +		ovs_ps = nla_data(psample);
> +		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
> +		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
> +			return -EINVAL;
> +	}
> +
> +	if (!psample && !actions)
>  		return -EINVAL;
>  
>  	/* validation done, copy sample action. */
> @@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>  	 * If the sample is the last action, it can always be excuted
>  	 * rather than deferred.
>  	 */
> -	arg.exec = last || !actions_may_change_flow(actions);
> +	if (actions)
> +		arg.exec = last || !actions_may_change_flow(actions);

'arg.exec' will remain uninitialized.

> +
>  	arg.probability = nla_get_u32(probability);
>  
>  	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
> @@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>  	if (err)
>  		return err;
>  
> -	err = __ovs_nla_copy_actions(net, actions, key, sfa,
> -				     eth_type, vlan_tci, mpls_label_count, log,
> -				     depth + 1);
> +	if (psample)
> +		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
> +					 nla_data(psample), nla_len(psample),
> +					 log);
> +	if (err)

Can be used uninitialized.

> +		return err;
>  
> +	if (actions)
> +		err = __ovs_nla_copy_actions(net, actions, key, sfa,
> +					     eth_type, vlan_tci,
> +					     mpls_label_count, log, depth + 1);
>  	if (err)
>  		return err;
>  
> @@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
>  	struct nlattr *start, *ac_start = NULL, *sample_arg;
>  	int err = 0, rem = nla_len(attr);
>  	const struct sample_arg *arg;
> -	struct nlattr *actions;
> +	struct nlattr *next;
>  
>  	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
>  	if (!start)
> @@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
>  
>  	sample_arg = nla_data(attr);
>  	arg = nla_data(sample_arg);
> -	actions = nla_next(sample_arg, &rem);
> +	next = nla_next(sample_arg, &rem);
>  
>  	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
>  		err = -EMSGSIZE;
>  		goto out;
>  	}
>  
> -	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
> -	if (!ac_start) {
> -		err = -EMSGSIZE;
> -		goto out;
> +	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
> +		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
> +			    nla_data(next))) {
> +			err = -EMSGSIZE;
> +			goto out;
> +		}
> +		next = nla_next(next, &rem);
>  	}
>  
> -	err = ovs_nla_put_actions(actions, rem, skb);
> +	if (nla_ok(next, rem)) {
> +		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
> +		if (!ac_start) {
> +			err = -EMSGSIZE;
> +			goto out;
> +		}
> +		err = ovs_nla_put_actions(next, rem, skb);
> +	}
>  
>  out:
>  	if (err) {
> -		nla_nest_cancel(skb, ac_start);
> +		if (ac_start)
> +			nla_nest_cancel(skb, ac_start);
>  		nla_nest_cancel(skb, start);
>  	} else {
> -		nla_nest_end(skb, ac_start);
> +		if (ac_start)
> +			nla_nest_end(skb, ac_start);
>  		nla_nest_end(skb, start);
>  	}
>  


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-08 13:18   ` Ilya Maximets
@ 2024-04-08 19:24     ` Adrian Moreno
  2024-04-09 14:43       ` Aaron Conole
  0 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 19:24 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro, horms



On 4/8/24 15:18, Ilya Maximets wrote:
> [copying my previous reply since this version actually has netdev@ in Cc]
> 
> On 4/8/24 14:57, Adrian Moreno wrote:
>> Packet samples can come from several places (e.g: different tc sample
>> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
>> to differentiate them.
>>
>> Likewise, sample consumers that listen on the multicast group may only
>> be interested on a single group. However, they are currently forced to
>> receive all samples and discard the ones that are not relevant, causing
>> unnecessary overhead.
>>
>> Allow users to filter on the desired group_id by adding a new command
>> SAMPLE_FILTER_SET that can be used to pass the desired group id.
>> Store this filter on the per-socket private pointer and use it for
>> filtering multicasted samples.
>>
>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>> ---
>>   include/uapi/linux/psample.h |   1 +
>>   net/psample/psample.c        | 127 +++++++++++++++++++++++++++++++++--
>>   2 files changed, 122 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
>> index e585db5bf2d2..5e0305b1520d 100644
>> --- a/include/uapi/linux/psample.h
>> +++ b/include/uapi/linux/psample.h
>> @@ -28,6 +28,7 @@ enum psample_command {
>>   	PSAMPLE_CMD_GET_GROUP,
>>   	PSAMPLE_CMD_NEW_GROUP,
>>   	PSAMPLE_CMD_DEL_GROUP,
>> +	PSAMPLE_CMD_SAMPLE_FILTER_SET,
> Other commands are names as PSAMPLE_CMD_VERB_NOUN, so this new one
> should be PSAMPLE_CMD_SET_FILTER.  (The SAMPLE part seems unnecessary.)
> Some functions/structures need to be renamed accordingly.
> 

Sure, I'll rename it when I sent the next version.

>>   };
>>   
>>   enum psample_tunnel_key_attr {
>> diff --git a/net/psample/psample.c b/net/psample/psample.c
>> index a5d9b8446f77..a0cef63dfdec 100644
>> --- a/net/psample/psample.c
>> +++ b/net/psample/psample.c
>> @@ -98,13 +98,84 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
>>   	return msg->len;
>>   }
>>   
>> -static const struct genl_small_ops psample_nl_ops[] = {
>> +struct psample_obj_desc {
>> +	struct rcu_head rcu;
>> +	u32 group_num;
>> +	bool group_num_valid;
>> +};
>> +
>> +struct psample_nl_sock_priv {
>> +	struct psample_obj_desc __rcu *flt;
> 
> Can we call it 'fileter' ?  I find it hard to read the code with
> this unnecessary abbreviation.  Same for the lock below.
> 

Sure.

>> +	spinlock_t flt_lock; /* Protects flt. */
>> +};
>> +
>> +static void psample_nl_sock_priv_init(void *priv)
>> +{
>> +	struct psample_nl_sock_priv *sk_priv = priv;
>> +
>> +	spin_lock_init(&sk_priv->flt_lock);
>> +}
>> +
>> +static void psample_nl_sock_priv_destroy(void *priv)
>> +{
>> +	struct psample_nl_sock_priv *sk_priv = priv;
>> +	struct psample_obj_desc *flt;
>> +
>> +	flt = rcu_dereference_protected(sk_priv->flt, true);
>> +	kfree_rcu(flt, rcu);
>> +}
>> +
>> +static int psample_nl_sample_filter_set_doit(struct sk_buff *skb,
>> +					     struct genl_info *info)
>> +{
>> +	struct psample_nl_sock_priv *sk_priv;
>> +	struct nlattr **attrs = info->attrs;
>> +	struct psample_obj_desc *flt;
>> +
>> +	flt = kzalloc(sizeof(*flt), GFP_KERNEL);
>> +
>> +	if (attrs[PSAMPLE_ATTR_SAMPLE_GROUP]) {
>> +		flt->group_num = nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_GROUP]);
>> +		flt->group_num_valid = true;
>> +	}
>> +
>> +	if (!flt->group_num_valid) {
>> +		kfree(flt);
> 
> Might be better to not allocate it in the first place.
> 

Absolutely.

>> +		flt = NULL;
>> +	}
>> +
>> +	sk_priv = genl_sk_priv_get(&psample_nl_family, NETLINK_CB(skb).sk);
>> +	if (IS_ERR(sk_priv)) {
>> +		kfree(flt);
>> +		return PTR_ERR(sk_priv);
>> +	}
>> +
>> +	spin_lock(&sk_priv->flt_lock);
>> +	flt = rcu_replace_pointer(sk_priv->flt, flt,
>> +				  lockdep_is_held(&sk_priv->flt_lock));
>> +	spin_unlock(&sk_priv->flt_lock);
>> +	kfree_rcu(flt, rcu);
>> +	return 0;
>> +}
>> +
>> +static const struct nla_policy
>> +	psample_sample_filter_set_policy[PSAMPLE_ATTR_SAMPLE_GROUP + 1] = {
>> +	[PSAMPLE_ATTR_SAMPLE_GROUP] = { .type = NLA_U32, },
> 
> This indentation is confusing, though I'm not sure what's a better way.
> 

I now! I'll try to move it around see if it improves things.

>> +};
>> +
>> +static const struct genl_ops psample_nl_ops[] = {
>>   	{
>>   		.cmd = PSAMPLE_CMD_GET_GROUP,
>>   		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>>   		.dumpit = psample_nl_cmd_get_group_dumpit,
>>   		/* can be retrieved by unprivileged users */
>> -	}
>> +	},
>> +	{
>> +		.cmd		= PSAMPLE_CMD_SAMPLE_FILTER_SET,
>> +		.doit		= psample_nl_sample_filter_set_doit,
>> +		.policy		= psample_sample_filter_set_policy,
>> +		.flags		= 0,
>> +	},
>>   };
>>   
>>   static struct genl_family psample_nl_family __ro_after_init = {
>> @@ -114,10 +185,13 @@ static struct genl_family psample_nl_family __ro_after_init = {
>>   	.netnsok	= true,
>>   	.module		= THIS_MODULE,
>>   	.mcgrps		= psample_nl_mcgrps,
>> -	.small_ops	= psample_nl_ops,
>> -	.n_small_ops	= ARRAY_SIZE(psample_nl_ops),
>> +	.ops		= psample_nl_ops,
>> +	.n_ops		= ARRAY_SIZE(psample_nl_ops),
>>   	.resv_start_op	= PSAMPLE_CMD_GET_GROUP + 1,
>>   	.n_mcgrps	= ARRAY_SIZE(psample_nl_mcgrps),
>> +	.sock_priv_size		= sizeof(struct psample_nl_sock_priv),
>> +	.sock_priv_init		= psample_nl_sock_priv_init,
>> +	.sock_priv_destroy	= psample_nl_sock_priv_destroy,
>>   };
>>   
>>   static void psample_group_notify(struct psample_group *group,
>> @@ -360,6 +434,42 @@ static int psample_tunnel_meta_len(struct ip_tunnel_info *tun_info)
>>   }
>>   #endif
>>   
>> +static inline void psample_nl_obj_desc_init(struct psample_obj_desc *desc,
>> +					    u32 group_num)
>> +{
>> +	memset(desc, 0, sizeof(*desc));
>> +	desc->group_num = group_num;
>> +	desc->group_num_valid = true;
>> +}
>> +
>> +static bool psample_obj_desc_match(struct psample_obj_desc *desc,
>> +				   struct psample_obj_desc *flt)
>> +{
>> +	if (desc->group_num_valid && flt->group_num_valid &&
>> +	    desc->group_num != flt->group_num)
>> +		return false;
>> +	return true;
> 
> This fucntion returns 'true' if one of the arguments is not valid.
> I'd not expect such behavior from a 'match' function.
> 
> I understand the intention that psample should sample everything
> to sockets that do not request filters, but that should not be part
> of the 'match' logic, or more appropriate function name should be
> chosen.  Also, if the group is not initialized, but the filter is,
> it should not match, logically.  The validity on filter and the
> current sample is not symmetric.
> 

The descriptor should always be initialized but I think double checking should 
be OK as in the context of this particular function, it might not be clear it is.

> And I'm not really sure if the 'group_num_valid' is actually needed.
> Can the NULL pointer be used as an indicator?  If so, then maybe
> the whole psample_obj_desc structure is not needed as it will
> contain a single field.

If we only filter on group_id, then yes. However, as I was writing this, I 
thought maybe opening the door to filtering on more fields such as the protocol 
in/out interfaces, etc. Now that I read this I understand the current code is 
confusing: I should have left a comment or mention it in the commit message.

> 
>> +}
>> +
>> +static int psample_nl_sample_filter(struct sock *dsk, struct sk_buff *skb,
>> +				    void *data)
>> +{
>> +	struct psample_obj_desc *desc = data;
>> +	struct psample_nl_sock_priv *sk_priv;
>> +	struct psample_obj_desc *flt;
>> +	int ret = 0;
>> +
>> +	rcu_read_lock();
>> +	sk_priv = __genl_sk_priv_get(&psample_nl_family, dsk);
>> +	if (!IS_ERR_OR_NULL(sk_priv)) {
>> +		flt = rcu_dereference(sk_priv->flt);
>> +		if (flt)
>> +			ret = !psample_obj_desc_match(desc, flt);
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}
>> +
>>   void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>   			   u32 sample_rate, const struct psample_metadata *md)
>>   {
>> @@ -370,6 +480,7 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>   #ifdef CONFIG_INET
>>   	struct ip_tunnel_info *tun_info;
>>   #endif
>> +	struct psample_obj_desc desc;
>>   	struct sk_buff *nl_skb;
>>   	int data_len;
>>   	int meta_len;
>> @@ -487,8 +598,12 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>   #endif
>>   
>>   	genlmsg_end(nl_skb, data);
>> -	genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
>> -				PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
>> +	psample_nl_obj_desc_init(&desc, group->group_num);
>> +	genlmsg_multicast_netns_filtered(&psample_nl_family,
>> +					 group->net, nl_skb, 0,
>> +					 PSAMPLE_NL_MCGRP_SAMPLE,
>> +					 GFP_ATOMIC, psample_nl_sample_filter,
>> +					 &desc);
>>   
>>   	return;
>>   error:
> 

-- 
Adrián Moreno


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 3/5] net: psample: add user cookie
  2024-04-08 13:19   ` Ilya Maximets
@ 2024-04-08 19:28     ` Adrian Moreno
  2024-04-08 20:28       ` Ilya Maximets
  0 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 19:28 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro, horms



On 4/8/24 15:19, Ilya Maximets wrote:
> [copying my previous reply since this version actually has netdev@ in Cc]
> 
> On 4/8/24 14:57, Adrian Moreno wrote:
>> Add a user cookie to the sample metadata so that sample emitters can
>> provide more contextual information to samples.
>>
>> If present, send the user cookie in a new attribute:
>> PSAMPLE_ATTR_USER_COOKIE.
>>
>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>> ---
>>   include/net/psample.h        |  2 ++
>>   include/uapi/linux/psample.h |  1 +
>>   net/psample/psample.c        | 12 +++++++++++-
>>   3 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/psample.h b/include/net/psample.h
>> index 0509d2d6be67..2503ab3c92a5 100644
>> --- a/include/net/psample.h
>> +++ b/include/net/psample.h
>> @@ -25,6 +25,8 @@ struct psample_metadata {
>>   	   out_tc_occ_valid:1,
>>   	   latency_valid:1,
>>   	   unused:5;
>> +	u8 *user_cookie;
> 
> The code doesn't take ownership of this data.  It should probably
> be a const pointer in this case.
> 
> This should also allow us to avoid a double copy first to the
> psample_metadata and then to netlink skb, as users will know that
> it is a const pointer and so they can hand the data directly.
> 

Yep. You're right.


>> +	u32 user_cookie_len;
>>   };
>>   
>>   struct psample_group *psample_group_get(struct net *net, u32 group_num);
>> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
>> index 5e0305b1520d..1f61fd7ef7fd 100644
>> --- a/include/uapi/linux/psample.h
>> +++ b/include/uapi/linux/psample.h
>> @@ -19,6 +19,7 @@ enum {
>>   	PSAMPLE_ATTR_LATENCY,		/* u64, nanoseconds */
>>   	PSAMPLE_ATTR_TIMESTAMP,		/* u64, nanoseconds */
>>   	PSAMPLE_ATTR_PROTO,		/* u16 */
>> +	PSAMPLE_ATTR_USER_COOKIE,
> 
> A comment would be nice.
> 

Definitely.

>>   
>>   	__PSAMPLE_ATTR_MAX
>>   };
>> diff --git a/net/psample/psample.c b/net/psample/psample.c
>> index a0cef63dfdec..9fdb88e01f21 100644
>> --- a/net/psample/psample.c
>> +++ b/net/psample/psample.c
>> @@ -497,7 +497,8 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>   		   nla_total_size(sizeof(u32)) +	/* group_num */
>>   		   nla_total_size(sizeof(u32)) +	/* seq */
>>   		   nla_total_size_64bit(sizeof(u64)) +	/* timestamp */
>> -		   nla_total_size(sizeof(u16));		/* protocol */
>> +		   nla_total_size(sizeof(u16)) +	/* protocol */
>> +		   nla_total_size(md->user_cookie_len);	/* user_cookie */
>>   
>>   #ifdef CONFIG_INET
>>   	tun_info = skb_tunnel_info(skb);
>> @@ -596,6 +597,15 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>   			goto error;
>>   	}
>>   #endif
>> +	if (md->user_cookie && md->user_cookie_len) {
>> +		int nla_len = nla_total_size(md->user_cookie_len);
>> +		struct nlattr *nla;
>> +
>> +		nla = skb_put(nl_skb, nla_len);
>> +		nla->nla_type = PSAMPLE_ATTR_USER_COOKIE;
>> +		nla->nla_len = nla_attr_size(md->user_cookie_len);
>> +		memcpy(nla_data(nla), md->user_cookie, md->user_cookie_len);
>> +	}
>>   
>>   	genlmsg_end(nl_skb, data);
>>   	psample_nl_obj_desc_init(&desc, group->group_num);
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-08 13:37   ` Ilya Maximets
@ 2024-04-08 19:48     ` Adrian Moreno
  2024-04-08 20:40       ` Ilya Maximets
  0 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-08 19:48 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro, horms



On 4/8/24 15:37, Ilya Maximets wrote:
> On 4/8/24 14:57, Adrian Moreno wrote:
>> Add a new attribute to the sample action, called
>> OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
>> user-defined cookie.
>>
>> The maximum length of the user-defined cookie is set to 16, same as
>> tc_cookie to discourage using cookies that will not be offloadable.
>>
>> When set, the sample action will use psample to multicast the sample.
>>
>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>> ---
>>   include/uapi/linux/openvswitch.h | 22 +++++++--
>>   net/openvswitch/actions.c        | 52 ++++++++++++++++++---
>>   net/openvswitch/datapath.c       |  2 +-
>>   net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
>>   4 files changed, 127 insertions(+), 27 deletions(-)
> 
> This cpatch is missing a few bits:
>   - Update for Documentation/netlink/specs/ovs_flow.yaml
>   - Maybe update for tools/testing/selftests/net/openvswitch/ovs-dpctl.py
>   - Maybe some basic selftests.
> 

Absolutely. I surely plan to add it on the first non-rfc version.

>>
>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>> index efc82c318fa2..a5a32588f582 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -646,15 +646,24 @@ enum ovs_flow_attr {
>>    * %UINT32_MAX samples all packets and intermediate values sample intermediate
>>    * fractions of packets.
>>    * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
>> - * Actions are passed as nested attributes.
>> + * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
>> + * is not set.
> 
> 'is set' probably. >
>> + * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
>> + * OVS_SAMPLE_ATTR_ACTIONS is not set.
> 
> Same here.
>

Good catch, I rewrote those comments a bunch of times and the were left 
expressing the exact opposite!


>>    *
>> - * Executes the specified actions with the given probability on a per-packet
>> - * basis.
>> + * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
>> + * specified.
>> + *
>> + * Executes the specified actions and/or sends the packet to psample
>> + * with the given probability on a per-packet basis.
>>    */
>>   enum ovs_sample_attr {
>>   	OVS_SAMPLE_ATTR_UNSPEC,
>>   	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
>>   	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
>> +	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
>> +				      * by the user-provided cookie.
>> +				      */
>>   	__OVS_SAMPLE_ATTR_MAX,
>>   
>>   #ifdef __KERNEL__
>> @@ -675,6 +684,13 @@ struct sample_arg {
>>   };
>>   #endif
>>   
>> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
>> +struct ovs_psample {
>> +	__u32 group_id;		/* The group used for packet sampling. */
>> +	__u32 user_cookie_len;	/* The length of the user-provided cookie. */
>> +	__u8 user_cookie[];	/* The user-provided cookie. */
>> +};
> 
> Structures are not a good approach for modern netlink.
> use nested attributes instead.  This way we can also
> eliminate the need for variable-length array and the
> length field, if the length can be taken from a netlink
> attribute directly, e.g. similar to NLA_BINARY in tc.
> 
> If necessary, there could be a structure in the private
> header to store the data for internal use.
> 

Interesting. Thanks for the suggestion. I think it will also help action validation.

>> +
>>   /**
>>    * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
>>    * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 6fcd7e2ca81f..45d2b325b76a 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -24,6 +24,7 @@
>>   #include <net/checksum.h>
>>   #include <net/dsfield.h>
>>   #include <net/mpls.h>
>> +#include <net/psample.h>
>>   #include <net/sctp/checksum.h>
>>   
>>   #include "datapath.h"
>> @@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
>>   	return 0;
>>   }
>>   
>> +static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
>> +			      struct ovs_psample *psample, struct sk_buff *skb,
>> +			      u32 rate)
>> +{
>> +	struct psample_group psample_group = {};
>> +	struct psample_metadata md = {};
>> +	struct vport *input_vport;
>> +
>> +	psample_group.group_num = psample->group_id;
>> +	psample_group.net = ovs_dp_get_net(dp);
>> +
>> +	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
>> +	if (!input_vport)
>> +		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
>> +
>> +	md.in_ifindex = input_vport->dev->ifindex;
>> +	md.user_cookie = psample->user_cookie;
>> +	md.user_cookie_len = psample->user_cookie_len;
>> +	md.trunc_size = skb->len;
>> +
>> +	psample_sample_packet(&psample_group, skb, rate, &md);
>> +
>> +	return 0;
>> +}
>> +
>>   /* When 'last' is true, sample() should always consume the 'skb'.
>>    * Otherwise, sample() should keep 'skb' intact regardless what
>>    * actions are executed within sample().
>> @@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>   		  struct sw_flow_key *key, const struct nlattr *attr,
>>   		  bool last)
>>   {
>> -	struct nlattr *actions;
>> +	const struct sample_arg *arg;
>>   	struct nlattr *sample_arg;
>>   	int rem = nla_len(attr);
>> -	const struct sample_arg *arg;
>> +	struct nlattr *next;
>>   	bool clone_flow_key;
>> +	int ret;
>>   
>>   	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
>>   	sample_arg = nla_data(attr);
>>   	arg = nla_data(sample_arg);
>> -	actions = nla_next(sample_arg, &rem);
>> +	next = nla_next(sample_arg, &rem);
>>   
>>   	if ((arg->probability != U32_MAX) &&
>>   	    (!arg->probability || get_random_u32() > arg->probability)) {
>> @@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>   		return 0;
>>   	}
>>   
>> -	clone_flow_key = !arg->exec;
>> -	return clone_execute(dp, skb, key, 0, actions, rem, last,
>> -			     clone_flow_key);
>> +	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {
> 
> Maybe add a commnet that OVS_SAMPLE_ATTR_PSAMPLE is always a sencond
> argument when present.
> 
> Is there a better way to handle this?
> 

I also dislike it. The fact that actions are not nested but concatenated to 
makes the internal representation a bit flaky.

The alternative I considered was adding the group_id and cookie to the 
internal-only OVS_SAMPLE_ATTR_ARG. However, now I wonder:
- Should we also use nested attributes instead of structs for this internal one?

And, probably off-topic: what's the story behind using netlink attributes to 
store action arguments internally? Has it ever been discussed using a union for 
instance?

>> +		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
>> +					 arg->probability);
>> +		if (last)
>> +			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
>> +		if (ret)
>> +			return ret;
>> +		next = nla_next(next, &rem);
>> +	}
>> +
>> +	if (nla_ok(next, rem)) {
>> +		clone_flow_key = !arg->exec;
>> +		ret = clone_execute(dp, skb, key, 0, next, rem, last,
>> +				    clone_flow_key);
>> +	}
>> +	return ret;
>>   }
>>   
>>   /* When 'last' is true, clone() should always consume the 'skb'.
>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>> index 99d72543abd3..b5b560c2e74b 100644
>> --- a/net/openvswitch/datapath.c
>> +++ b/net/openvswitch/datapath.c
>> @@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>>   	struct sw_flow_match match;
>>   	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
>>   	int error;
>> -	bool log = !a[OVS_FLOW_ATTR_PROBE];
>> +	bool log = true;
> 
> Debugging artifact?
> 

Yep, sorry.

>>   
>>   	/* Must have key and actions. */
>>   	error = -EINVAL;
>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>> index f224d9bcea5e..f540686271b7 100644
>> --- a/net/openvswitch/flow_netlink.c
>> +++ b/net/openvswitch/flow_netlink.c
>> @@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
>>   
>>   	switch (nla_type(a)) {
>>   	case OVS_SAMPLE_ATTR_ARG:
>> -		/* The real list of actions follows this attribute. */
> 
> Please, don't remove this comment.  Maybe extend it instead.
> 
>>   		a = nla_next(a, &rem);
>> +
>> +		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
>> +		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
>> +			a = nla_next(a, &rem);
>> +
>>   		ovs_nla_free_nested_actions(a, rem);
>>   		break;
>>   	}
>> @@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>>   				  u32 mpls_label_count, bool log,
>>   				  u32 depth);
>>   
>> +static int copy_action(const struct nlattr *from,
>> +		       struct sw_flow_actions **sfa, bool log);
>> +
>>   static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>   				    const struct sw_flow_key *key,
>>   				    struct sw_flow_actions **sfa,
>> @@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>   				    u32 depth)
>>   {
>>   	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
>> -	const struct nlattr *probability, *actions;
>> +	const struct nlattr *probability, *actions, *psample;
>>   	const struct nlattr *a;
>> -	int rem, start, err;
>>   	struct sample_arg arg;
>> +	int rem, start, err;
>>   
>>   	memset(attrs, 0, sizeof(attrs));
>>   	nla_for_each_nested(a, attr, rem) {
>> @@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>   		return -EINVAL;
>>   
>>   	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
>> -	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
>> +	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
>> +		return -EINVAL;
>> +
>> +	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
>> +	if (psample) {
>> +		struct ovs_psample *ovs_ps;
>> +
>> +		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
>> +			return -EINVAL;
>> +
>> +		ovs_ps = nla_data(psample);
>> +		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
>> +		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
>> +			return -EINVAL;
>> +	}
>> +
>> +	if (!psample && !actions)
>>   		return -EINVAL;
>>   
>>   	/* validation done, copy sample action. */
>> @@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>   	 * If the sample is the last action, it can always be excuted
>>   	 * rather than deferred.
>>   	 */
>> -	arg.exec = last || !actions_may_change_flow(actions);
>> +	if (actions)
>> +		arg.exec = last || !actions_may_change_flow(actions);
> 
> 'arg.exec' will remain uninitialized.
> 

Yes. I just wanted to avoid the call to actions_may_change_flow(NULL) in which 
case arg.exec is not used. I'll make sure to zero-initialize it nevertheless.

>> +
>>   	arg.probability = nla_get_u32(probability);
>>   
>>   	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
>> @@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>   	if (err)
>>   		return err;
>>   
>> -	err = __ovs_nla_copy_actions(net, actions, key, sfa,
>> -				     eth_type, vlan_tci, mpls_label_count, log,
>> -				     depth + 1);
>> +	if (psample)
>> +		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
>> +					 nla_data(psample), nla_len(psample),
>> +					 log);
>> +	if (err)
> 
> Can be used uninitialized.
> 

You're right, it should be inside the if above, although err should have been 
initialized to output of ovs_nla_add_action.

>> +		return err;
>>   
>> +	if (actions)
>> +		err = __ovs_nla_copy_actions(net, actions, key, sfa,
>> +					     eth_type, vlan_tci,
>> +					     mpls_label_count, log, depth + 1);
>>   	if (err)
>>   		return err;
>>   
>> @@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>   	struct nlattr *start, *ac_start = NULL, *sample_arg;
>>   	int err = 0, rem = nla_len(attr);
>>   	const struct sample_arg *arg;
>> -	struct nlattr *actions;
>> +	struct nlattr *next;
>>   
>>   	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
>>   	if (!start)
>> @@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>   
>>   	sample_arg = nla_data(attr);
>>   	arg = nla_data(sample_arg);
>> -	actions = nla_next(sample_arg, &rem);
>> +	next = nla_next(sample_arg, &rem);
>>   
>>   	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
>>   		err = -EMSGSIZE;
>>   		goto out;
>>   	}
>>   
>> -	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>> -	if (!ac_start) {
>> -		err = -EMSGSIZE;
>> -		goto out;
>> +	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
>> +		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
>> +			    nla_data(next))) {
>> +			err = -EMSGSIZE;
>> +			goto out;
>> +		}
>> +		next = nla_next(next, &rem);
>>   	}
>>   
>> -	err = ovs_nla_put_actions(actions, rem, skb);
>> +	if (nla_ok(next, rem)) {
>> +		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>> +		if (!ac_start) {
>> +			err = -EMSGSIZE;
>> +			goto out;
>> +		}
>> +		err = ovs_nla_put_actions(next, rem, skb);
>> +	}
>>   
>>   out:
>>   	if (err) {
>> -		nla_nest_cancel(skb, ac_start);
>> +		if (ac_start)
>> +			nla_nest_cancel(skb, ac_start);
>>   		nla_nest_cancel(skb, start);
>>   	} else {
>> -		nla_nest_end(skb, ac_start);
>> +		if (ac_start)
>> +			nla_nest_end(skb, ac_start);
>>   		nla_nest_end(skb, start);
>>   	}
>>   
> 

-- 
Adrián Moreno


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 3/5] net: psample: add user cookie
  2024-04-08 19:28     ` Adrian Moreno
@ 2024-04-08 20:28       ` Ilya Maximets
  0 siblings, 0 replies; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 20:28 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

On 4/8/24 21:28, Adrian Moreno wrote:
> 
> 
> On 4/8/24 15:19, Ilya Maximets wrote:
>> [copying my previous reply since this version actually has netdev@ in Cc]
>>
>> On 4/8/24 14:57, Adrian Moreno wrote:
>>> Add a user cookie to the sample metadata so that sample emitters can
>>> provide more contextual information to samples.
>>>
>>> If present, send the user cookie in a new attribute:
>>> PSAMPLE_ATTR_USER_COOKIE.
>>>
>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>> ---
>>>   include/net/psample.h        |  2 ++
>>>   include/uapi/linux/psample.h |  1 +
>>>   net/psample/psample.c        | 12 +++++++++++-
>>>   3 files changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/net/psample.h b/include/net/psample.h
>>> index 0509d2d6be67..2503ab3c92a5 100644
>>> --- a/include/net/psample.h
>>> +++ b/include/net/psample.h
>>> @@ -25,6 +25,8 @@ struct psample_metadata {
>>>   	   out_tc_occ_valid:1,
>>>   	   latency_valid:1,
>>>   	   unused:5;
>>> +	u8 *user_cookie;
>>
>> The code doesn't take ownership of this data.  It should probably
>> be a const pointer in this case.
>>
>> This should also allow us to avoid a double copy first to the
>> psample_metadata and then to netlink skb, as users will know that
>> it is a const pointer and so they can hand the data directly.
>>
> 
> Yep. You're right.
> 
> 
>>> +	u32 user_cookie_len;

BTW, u32 looks a little strange here and by extension in tc and OVS
as well.  We can't push move than 64K through netlink, AFAIK, so
u16 should be plenty here.  In practice, I hope, we'll not go over
u8 though.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-08 19:48     ` Adrian Moreno
@ 2024-04-08 20:40       ` Ilya Maximets
  2024-04-09  8:16         ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-08 20:40 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

On 4/8/24 21:48, Adrian Moreno wrote:
> 
> 
> On 4/8/24 15:37, Ilya Maximets wrote:
>> On 4/8/24 14:57, Adrian Moreno wrote:
>>> Add a new attribute to the sample action, called
>>> OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
>>> user-defined cookie.
>>>
>>> The maximum length of the user-defined cookie is set to 16, same as
>>> tc_cookie to discourage using cookies that will not be offloadable.
>>>
>>> When set, the sample action will use psample to multicast the sample.
>>>
>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>> ---
>>>   include/uapi/linux/openvswitch.h | 22 +++++++--
>>>   net/openvswitch/actions.c        | 52 ++++++++++++++++++---
>>>   net/openvswitch/datapath.c       |  2 +-
>>>   net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
>>>   4 files changed, 127 insertions(+), 27 deletions(-)
>>
>> This cpatch is missing a few bits:
>>   - Update for Documentation/netlink/specs/ovs_flow.yaml
>>   - Maybe update for tools/testing/selftests/net/openvswitch/ovs-dpctl.py
>>   - Maybe some basic selftests.
>>
> 
> Absolutely. I surely plan to add it on the first non-rfc version.
> 
>>>
>>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>>> index efc82c318fa2..a5a32588f582 100644
>>> --- a/include/uapi/linux/openvswitch.h
>>> +++ b/include/uapi/linux/openvswitch.h
>>> @@ -646,15 +646,24 @@ enum ovs_flow_attr {
>>>    * %UINT32_MAX samples all packets and intermediate values sample intermediate
>>>    * fractions of packets.
>>>    * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
>>> - * Actions are passed as nested attributes.
>>> + * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
>>> + * is not set.
>>
>> 'is set' probably. >
>>> + * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
>>> + * OVS_SAMPLE_ATTR_ACTIONS is not set.
>>
>> Same here.
>>
> 
> Good catch, I rewrote those comments a bunch of times and the were left 
> expressing the exact opposite!
> 
> 
>>>    *
>>> - * Executes the specified actions with the given probability on a per-packet
>>> - * basis.
>>> + * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
>>> + * specified.
>>> + *
>>> + * Executes the specified actions and/or sends the packet to psample
>>> + * with the given probability on a per-packet basis.
>>>    */
>>>   enum ovs_sample_attr {
>>>   	OVS_SAMPLE_ATTR_UNSPEC,
>>>   	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
>>>   	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
>>> +	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
>>> +				      * by the user-provided cookie.
>>> +				      */
>>>   	__OVS_SAMPLE_ATTR_MAX,
>>>   
>>>   #ifdef __KERNEL__
>>> @@ -675,6 +684,13 @@ struct sample_arg {
>>>   };
>>>   #endif
>>>   
>>> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
>>> +struct ovs_psample {
>>> +	__u32 group_id;		/* The group used for packet sampling. */
>>> +	__u32 user_cookie_len;	/* The length of the user-provided cookie. */

Here as well, not sure if u32 makes sense as userspace can't
actually supply a cookie this large via netlink.

>>> +	__u8 user_cookie[];	/* The user-provided cookie. */
>>> +};
>>
>> Structures are not a good approach for modern netlink.
>> use nested attributes instead.  This way we can also
>> eliminate the need for variable-length array and the
>> length field, if the length can be taken from a netlink
>> attribute directly, e.g. similar to NLA_BINARY in tc.
>>
>> If necessary, there could be a structure in the private
>> header to store the data for internal use.
>>
> 
> Interesting. Thanks for the suggestion. I think it will also help action validation.
> 
>>> +
>>>   /**
>>>    * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
>>>    * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>> index 6fcd7e2ca81f..45d2b325b76a 100644
>>> --- a/net/openvswitch/actions.c
>>> +++ b/net/openvswitch/actions.c
>>> @@ -24,6 +24,7 @@
>>>   #include <net/checksum.h>
>>>   #include <net/dsfield.h>
>>>   #include <net/mpls.h>
>>> +#include <net/psample.h>
>>>   #include <net/sctp/checksum.h>
>>>   
>>>   #include "datapath.h"
>>> @@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
>>>   	return 0;
>>>   }
>>>   
>>> +static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
>>> +			      struct ovs_psample *psample, struct sk_buff *skb,
>>> +			      u32 rate)
>>> +{
>>> +	struct psample_group psample_group = {};
>>> +	struct psample_metadata md = {};
>>> +	struct vport *input_vport;
>>> +
>>> +	psample_group.group_num = psample->group_id;
>>> +	psample_group.net = ovs_dp_get_net(dp);
>>> +
>>> +	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
>>> +	if (!input_vport)
>>> +		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
>>> +
>>> +	md.in_ifindex = input_vport->dev->ifindex;
>>> +	md.user_cookie = psample->user_cookie;
>>> +	md.user_cookie_len = psample->user_cookie_len;
>>> +	md.trunc_size = skb->len;
>>> +
>>> +	psample_sample_packet(&psample_group, skb, rate, &md);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>   /* When 'last' is true, sample() should always consume the 'skb'.
>>>    * Otherwise, sample() should keep 'skb' intact regardless what
>>>    * actions are executed within sample().
>>> @@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>   		  struct sw_flow_key *key, const struct nlattr *attr,
>>>   		  bool last)
>>>   {
>>> -	struct nlattr *actions;
>>> +	const struct sample_arg *arg;
>>>   	struct nlattr *sample_arg;
>>>   	int rem = nla_len(attr);
>>> -	const struct sample_arg *arg;
>>> +	struct nlattr *next;
>>>   	bool clone_flow_key;
>>> +	int ret;
>>>   
>>>   	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
>>>   	sample_arg = nla_data(attr);
>>>   	arg = nla_data(sample_arg);
>>> -	actions = nla_next(sample_arg, &rem);
>>> +	next = nla_next(sample_arg, &rem);
>>>   
>>>   	if ((arg->probability != U32_MAX) &&
>>>   	    (!arg->probability || get_random_u32() > arg->probability)) {
>>> @@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>   		return 0;
>>>   	}
>>>   
>>> -	clone_flow_key = !arg->exec;
>>> -	return clone_execute(dp, skb, key, 0, actions, rem, last,
>>> -			     clone_flow_key);
>>> +	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {
>>
>> Maybe add a commnet that OVS_SAMPLE_ATTR_PSAMPLE is always a sencond
>> argument when present.
>>
>> Is there a better way to handle this?
>>
> 
> I also dislike it. The fact that actions are not nested but concatenated to 
> makes the internal representation a bit flaky.
> 
> The alternative I considered was adding the group_id and cookie to the 
> internal-only OVS_SAMPLE_ATTR_ARG. However, now I wonder:
> - Should we also use nested attributes instead of structs for this internal one?
> 
> And, probably off-topic: what's the story behind using netlink attributes to 
> store action arguments internally? Has it ever been discussed using a union for 
> instance?

I'm not sure why it was originally done this way, but it is probably
the most efficient of convenient ways to pack a tree-like structure.
If we had a union, we would likely need a tree of actions with each
action bing a separately allocated node, since we have clone() and
even check_pkt_len() or other actions that can fork the pipeline.
Or we'll need to use some dummy actions as parethesis, prectically
emulating what we already have with netlink, but with structures.

Netlink format is fast to scan, since it's in a single liner chunk
of memory.  Tree or list structure may have higher memory footprint
and be slower to iterate due to cache misses.

There might be a better way to store all this for sure, but will
require some careful performance and memory consumption testing.

> 
>>> +		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
>>> +					 arg->probability);
>>> +		if (last)
>>> +			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
>>> +		if (ret)
>>> +			return ret;
>>> +		next = nla_next(next, &rem);
>>> +	}
>>> +
>>> +	if (nla_ok(next, rem)) {
>>> +		clone_flow_key = !arg->exec;
>>> +		ret = clone_execute(dp, skb, key, 0, next, rem, last,
>>> +				    clone_flow_key);
>>> +	}
>>> +	return ret;
>>>   }
>>>   
>>>   /* When 'last' is true, clone() should always consume the 'skb'.
>>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>>> index 99d72543abd3..b5b560c2e74b 100644
>>> --- a/net/openvswitch/datapath.c
>>> +++ b/net/openvswitch/datapath.c
>>> @@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>>>   	struct sw_flow_match match;
>>>   	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
>>>   	int error;
>>> -	bool log = !a[OVS_FLOW_ATTR_PROBE];
>>> +	bool log = true;
>>
>> Debugging artifact?
>>
> 
> Yep, sorry.
> 
>>>   
>>>   	/* Must have key and actions. */
>>>   	error = -EINVAL;
>>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>>> index f224d9bcea5e..f540686271b7 100644
>>> --- a/net/openvswitch/flow_netlink.c
>>> +++ b/net/openvswitch/flow_netlink.c
>>> @@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
>>>   
>>>   	switch (nla_type(a)) {
>>>   	case OVS_SAMPLE_ATTR_ARG:
>>> -		/* The real list of actions follows this attribute. */
>>
>> Please, don't remove this comment.  Maybe extend it instead.
>>
>>>   		a = nla_next(a, &rem);
>>> +
>>> +		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
>>> +		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
>>> +			a = nla_next(a, &rem);
>>> +
>>>   		ovs_nla_free_nested_actions(a, rem);
>>>   		break;
>>>   	}
>>> @@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>>>   				  u32 mpls_label_count, bool log,
>>>   				  u32 depth);
>>>   
>>> +static int copy_action(const struct nlattr *from,
>>> +		       struct sw_flow_actions **sfa, bool log);
>>> +
>>>   static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>   				    const struct sw_flow_key *key,
>>>   				    struct sw_flow_actions **sfa,
>>> @@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>   				    u32 depth)
>>>   {
>>>   	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
>>> -	const struct nlattr *probability, *actions;
>>> +	const struct nlattr *probability, *actions, *psample;
>>>   	const struct nlattr *a;
>>> -	int rem, start, err;
>>>   	struct sample_arg arg;
>>> +	int rem, start, err;
>>>   
>>>   	memset(attrs, 0, sizeof(attrs));
>>>   	nla_for_each_nested(a, attr, rem) {
>>> @@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>   		return -EINVAL;
>>>   
>>>   	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
>>> -	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
>>> +	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
>>> +		return -EINVAL;
>>> +
>>> +	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
>>> +	if (psample) {
>>> +		struct ovs_psample *ovs_ps;
>>> +
>>> +		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
>>> +			return -EINVAL;
>>> +
>>> +		ovs_ps = nla_data(psample);
>>> +		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
>>> +		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
>>> +			return -EINVAL;
>>> +	}
>>> +
>>> +	if (!psample && !actions)
>>>   		return -EINVAL;
>>>   
>>>   	/* validation done, copy sample action. */
>>> @@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>   	 * If the sample is the last action, it can always be excuted
>>>   	 * rather than deferred.
>>>   	 */
>>> -	arg.exec = last || !actions_may_change_flow(actions);
>>> +	if (actions)
>>> +		arg.exec = last || !actions_may_change_flow(actions);
>>
>> 'arg.exec' will remain uninitialized.
>>
> 
> Yes. I just wanted to avoid the call to actions_may_change_flow(NULL) in which 
> case arg.exec is not used. I'll make sure to zero-initialize it nevertheless.
> 
>>> +
>>>   	arg.probability = nla_get_u32(probability);
>>>   
>>>   	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
>>> @@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>   	if (err)
>>>   		return err;
>>>   
>>> -	err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>> -				     eth_type, vlan_tci, mpls_label_count, log,
>>> -				     depth + 1);
>>> +	if (psample)
>>> +		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
>>> +					 nla_data(psample), nla_len(psample),
>>> +					 log);
>>> +	if (err)
>>
>> Can be used uninitialized.
>>
> 
> You're right, it should be inside the if above, although err should have been 
> initialized to output of ovs_nla_add_action.

Ah, I missed the previous assingnent.  But yes, it is still a little strange
to handle errors this way.

> 
>>> +		return err;
>>>   
>>> +	if (actions)
>>> +		err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>> +					     eth_type, vlan_tci,
>>> +					     mpls_label_count, log, depth + 1);
>>>   	if (err)
>>>   		return err;
>>>   
>>> @@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>   	struct nlattr *start, *ac_start = NULL, *sample_arg;
>>>   	int err = 0, rem = nla_len(attr);
>>>   	const struct sample_arg *arg;
>>> -	struct nlattr *actions;
>>> +	struct nlattr *next;
>>>   
>>>   	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
>>>   	if (!start)
>>> @@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>   
>>>   	sample_arg = nla_data(attr);
>>>   	arg = nla_data(sample_arg);
>>> -	actions = nla_next(sample_arg, &rem);
>>> +	next = nla_next(sample_arg, &rem);
>>>   
>>>   	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
>>>   		err = -EMSGSIZE;
>>>   		goto out;
>>>   	}
>>>   
>>> -	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>> -	if (!ac_start) {
>>> -		err = -EMSGSIZE;
>>> -		goto out;
>>> +	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
>>> +		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
>>> +			    nla_data(next))) {
>>> +			err = -EMSGSIZE;
>>> +			goto out;
>>> +		}
>>> +		next = nla_next(next, &rem);
>>>   	}
>>>   
>>> -	err = ovs_nla_put_actions(actions, rem, skb);
>>> +	if (nla_ok(next, rem)) {
>>> +		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>> +		if (!ac_start) {
>>> +			err = -EMSGSIZE;
>>> +			goto out;
>>> +		}
>>> +		err = ovs_nla_put_actions(next, rem, skb);
>>> +	}
>>>   
>>>   out:
>>>   	if (err) {
>>> -		nla_nest_cancel(skb, ac_start);
>>> +		if (ac_start)
>>> +			nla_nest_cancel(skb, ac_start);
>>>   		nla_nest_cancel(skb, start);
>>>   	} else {
>>> -		nla_nest_end(skb, ac_start);
>>> +		if (ac_start)
>>> +			nla_nest_end(skb, ac_start);
>>>   		nla_nest_end(skb, start);
>>>   	}
>>>   
>>
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-08 20:40       ` Ilya Maximets
@ 2024-04-09  8:16         ` Adrian Moreno
  2024-04-09  9:35           ` Ilya Maximets
  0 siblings, 1 reply; 26+ messages in thread
From: Adrian Moreno @ 2024-04-09  8:16 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro, horms



On 4/8/24 22:40, Ilya Maximets wrote:
> On 4/8/24 21:48, Adrian Moreno wrote:
>>
>>
>> On 4/8/24 15:37, Ilya Maximets wrote:
>>> On 4/8/24 14:57, Adrian Moreno wrote:
>>>> Add a new attribute to the sample action, called
>>>> OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
>>>> user-defined cookie.
>>>>
>>>> The maximum length of the user-defined cookie is set to 16, same as
>>>> tc_cookie to discourage using cookies that will not be offloadable.
>>>>
>>>> When set, the sample action will use psample to multicast the sample.
>>>>
>>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>>> ---
>>>>    include/uapi/linux/openvswitch.h | 22 +++++++--
>>>>    net/openvswitch/actions.c        | 52 ++++++++++++++++++---
>>>>    net/openvswitch/datapath.c       |  2 +-
>>>>    net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
>>>>    4 files changed, 127 insertions(+), 27 deletions(-)
>>>
>>> This cpatch is missing a few bits:
>>>    - Update for Documentation/netlink/specs/ovs_flow.yaml
>>>    - Maybe update for tools/testing/selftests/net/openvswitch/ovs-dpctl.py
>>>    - Maybe some basic selftests.
>>>
>>
>> Absolutely. I surely plan to add it on the first non-rfc version.
>>
>>>>
>>>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>>>> index efc82c318fa2..a5a32588f582 100644
>>>> --- a/include/uapi/linux/openvswitch.h
>>>> +++ b/include/uapi/linux/openvswitch.h
>>>> @@ -646,15 +646,24 @@ enum ovs_flow_attr {
>>>>     * %UINT32_MAX samples all packets and intermediate values sample intermediate
>>>>     * fractions of packets.
>>>>     * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
>>>> - * Actions are passed as nested attributes.
>>>> + * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
>>>> + * is not set.
>>>
>>> 'is set' probably. >
>>>> + * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
>>>> + * OVS_SAMPLE_ATTR_ACTIONS is not set.
>>>
>>> Same here.
>>>
>>
>> Good catch, I rewrote those comments a bunch of times and the were left
>> expressing the exact opposite!
>>
>>
>>>>     *
>>>> - * Executes the specified actions with the given probability on a per-packet
>>>> - * basis.
>>>> + * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
>>>> + * specified.
>>>> + *
>>>> + * Executes the specified actions and/or sends the packet to psample
>>>> + * with the given probability on a per-packet basis.
>>>>     */
>>>>    enum ovs_sample_attr {
>>>>    	OVS_SAMPLE_ATTR_UNSPEC,
>>>>    	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
>>>>    	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
>>>> +	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
>>>> +				      * by the user-provided cookie.
>>>> +				      */
>>>>    	__OVS_SAMPLE_ATTR_MAX,
>>>>    
>>>>    #ifdef __KERNEL__
>>>> @@ -675,6 +684,13 @@ struct sample_arg {
>>>>    };
>>>>    #endif
>>>>    
>>>> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
>>>> +struct ovs_psample {
>>>> +	__u32 group_id;		/* The group used for packet sampling. */
>>>> +	__u32 user_cookie_len;	/* The length of the user-provided cookie. */
> 
> Here as well, not sure if u32 makes sense as userspace can't
> actually supply a cookie this large via netlink.
> 
>>>> +	__u8 user_cookie[];	/* The user-provided cookie. */
>>>> +};
>>>
>>> Structures are not a good approach for modern netlink.
>>> use nested attributes instead.  This way we can also
>>> eliminate the need for variable-length array and the
>>> length field, if the length can be taken from a netlink
>>> attribute directly, e.g. similar to NLA_BINARY in tc.
>>>
>>> If necessary, there could be a structure in the private
>>> header to store the data for internal use.
>>>
>>
>> Interesting. Thanks for the suggestion. I think it will also help action validation.
>>
>>>> +
>>>>    /**
>>>>     * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
>>>>     * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>>> index 6fcd7e2ca81f..45d2b325b76a 100644
>>>> --- a/net/openvswitch/actions.c
>>>> +++ b/net/openvswitch/actions.c
>>>> @@ -24,6 +24,7 @@
>>>>    #include <net/checksum.h>
>>>>    #include <net/dsfield.h>
>>>>    #include <net/mpls.h>
>>>> +#include <net/psample.h>
>>>>    #include <net/sctp/checksum.h>
>>>>    
>>>>    #include "datapath.h"
>>>> @@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
>>>>    	return 0;
>>>>    }
>>>>    
>>>> +static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
>>>> +			      struct ovs_psample *psample, struct sk_buff *skb,
>>>> +			      u32 rate)
>>>> +{
>>>> +	struct psample_group psample_group = {};
>>>> +	struct psample_metadata md = {};
>>>> +	struct vport *input_vport;
>>>> +
>>>> +	psample_group.group_num = psample->group_id;
>>>> +	psample_group.net = ovs_dp_get_net(dp);
>>>> +
>>>> +	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
>>>> +	if (!input_vport)
>>>> +		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
>>>> +
>>>> +	md.in_ifindex = input_vport->dev->ifindex;
>>>> +	md.user_cookie = psample->user_cookie;
>>>> +	md.user_cookie_len = psample->user_cookie_len;
>>>> +	md.trunc_size = skb->len;
>>>> +
>>>> +	psample_sample_packet(&psample_group, skb, rate, &md);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    /* When 'last' is true, sample() should always consume the 'skb'.
>>>>     * Otherwise, sample() should keep 'skb' intact regardless what
>>>>     * actions are executed within sample().
>>>> @@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>>    		  struct sw_flow_key *key, const struct nlattr *attr,
>>>>    		  bool last)
>>>>    {
>>>> -	struct nlattr *actions;
>>>> +	const struct sample_arg *arg;
>>>>    	struct nlattr *sample_arg;
>>>>    	int rem = nla_len(attr);
>>>> -	const struct sample_arg *arg;
>>>> +	struct nlattr *next;
>>>>    	bool clone_flow_key;
>>>> +	int ret;
>>>>    
>>>>    	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
>>>>    	sample_arg = nla_data(attr);
>>>>    	arg = nla_data(sample_arg);
>>>> -	actions = nla_next(sample_arg, &rem);
>>>> +	next = nla_next(sample_arg, &rem);
>>>>    
>>>>    	if ((arg->probability != U32_MAX) &&
>>>>    	    (!arg->probability || get_random_u32() > arg->probability)) {
>>>> @@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>>    		return 0;
>>>>    	}
>>>>    
>>>> -	clone_flow_key = !arg->exec;
>>>> -	return clone_execute(dp, skb, key, 0, actions, rem, last,
>>>> -			     clone_flow_key);
>>>> +	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {
>>>
>>> Maybe add a commnet that OVS_SAMPLE_ATTR_PSAMPLE is always a sencond
>>> argument when present.
>>>
>>> Is there a better way to handle this?
>>>
>>
>> I also dislike it. The fact that actions are not nested but concatenated to
>> makes the internal representation a bit flaky.
>>
>> The alternative I considered was adding the group_id and cookie to the
>> internal-only OVS_SAMPLE_ATTR_ARG. However, now I wonder:
>> - Should we also use nested attributes instead of structs for this internal one?
>>
>> And, probably off-topic: what's the story behind using netlink attributes to
>> store action arguments internally? Has it ever been discussed using a union for
>> instance?
> 
> I'm not sure why it was originally done this way, but it is probably
> the most efficient of convenient ways to pack a tree-like structure.
> If we had a union, we would likely need a tree of actions with each
> action bing a separately allocated node, since we have clone() and
> even check_pkt_len() or other actions that can fork the pipeline.
> Or we'll need to use some dummy actions as parethesis, prectically
> emulating what we already have with netlink, but with structures.
> 
> Netlink format is fast to scan, since it's in a single liner chunk
> of memory.  Tree or list structure may have higher memory footprint
> and be slower to iterate due to cache misses.
> 

That's true. Iteration over a number of actions is easy in netlink. And nesting 
easily enables tree-like actions. But in the context of:

 > Structures are not a good approach for modern netlink. use nested attributes 
instead.

If we try to implement all our actions following this way, and we keep just 
copying the incoming actions into the internal representation, we incur in 
unnecessary memory overhead (e.g: storing 2x struct nlaattr + padding of extra 
memory to store 2 integers).

I don't want to derail the discussion into historical or futuristic changes, 
just saying that the approach taken in the SAMPLE action (not including this 
patch) of exposing arguments as attributes but having a kernel-only struct to 
store them seems to me a good compromise.


> There might be a better way to store all this for sure, but will
> require some careful performance and memory consumption testing.
> 
>>
>>>> +		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
>>>> +					 arg->probability);
>>>> +		if (last)
>>>> +			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
>>>> +		if (ret)
>>>> +			return ret;
>>>> +		next = nla_next(next, &rem);
>>>> +	}
>>>> +
>>>> +	if (nla_ok(next, rem)) {
>>>> +		clone_flow_key = !arg->exec;
>>>> +		ret = clone_execute(dp, skb, key, 0, next, rem, last,
>>>> +				    clone_flow_key);
>>>> +	}
>>>> +	return ret;
>>>>    }
>>>>    
>>>>    /* When 'last' is true, clone() should always consume the 'skb'.
>>>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>>>> index 99d72543abd3..b5b560c2e74b 100644
>>>> --- a/net/openvswitch/datapath.c
>>>> +++ b/net/openvswitch/datapath.c
>>>> @@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>>>>    	struct sw_flow_match match;
>>>>    	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
>>>>    	int error;
>>>> -	bool log = !a[OVS_FLOW_ATTR_PROBE];
>>>> +	bool log = true;
>>>
>>> Debugging artifact?
>>>
>>
>> Yep, sorry.
>>
>>>>    
>>>>    	/* Must have key and actions. */
>>>>    	error = -EINVAL;
>>>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>>>> index f224d9bcea5e..f540686271b7 100644
>>>> --- a/net/openvswitch/flow_netlink.c
>>>> +++ b/net/openvswitch/flow_netlink.c
>>>> @@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
>>>>    
>>>>    	switch (nla_type(a)) {
>>>>    	case OVS_SAMPLE_ATTR_ARG:
>>>> -		/* The real list of actions follows this attribute. */
>>>
>>> Please, don't remove this comment.  Maybe extend it instead.
>>>
>>>>    		a = nla_next(a, &rem);
>>>> +
>>>> +		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
>>>> +		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
>>>> +			a = nla_next(a, &rem);
>>>> +
>>>>    		ovs_nla_free_nested_actions(a, rem);
>>>>    		break;
>>>>    	}
>>>> @@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>>>>    				  u32 mpls_label_count, bool log,
>>>>    				  u32 depth);
>>>>    
>>>> +static int copy_action(const struct nlattr *from,
>>>> +		       struct sw_flow_actions **sfa, bool log);
>>>> +
>>>>    static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>    				    const struct sw_flow_key *key,
>>>>    				    struct sw_flow_actions **sfa,
>>>> @@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>    				    u32 depth)
>>>>    {
>>>>    	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
>>>> -	const struct nlattr *probability, *actions;
>>>> +	const struct nlattr *probability, *actions, *psample;
>>>>    	const struct nlattr *a;
>>>> -	int rem, start, err;
>>>>    	struct sample_arg arg;
>>>> +	int rem, start, err;
>>>>    
>>>>    	memset(attrs, 0, sizeof(attrs));
>>>>    	nla_for_each_nested(a, attr, rem) {
>>>> @@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>    		return -EINVAL;
>>>>    
>>>>    	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
>>>> -	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
>>>> +	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
>>>> +		return -EINVAL;
>>>> +
>>>> +	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
>>>> +	if (psample) {
>>>> +		struct ovs_psample *ovs_ps;
>>>> +
>>>> +		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
>>>> +			return -EINVAL;
>>>> +
>>>> +		ovs_ps = nla_data(psample);
>>>> +		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
>>>> +		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
>>>> +			return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (!psample && !actions)
>>>>    		return -EINVAL;
>>>>    
>>>>    	/* validation done, copy sample action. */
>>>> @@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>    	 * If the sample is the last action, it can always be excuted
>>>>    	 * rather than deferred.
>>>>    	 */
>>>> -	arg.exec = last || !actions_may_change_flow(actions);
>>>> +	if (actions)
>>>> +		arg.exec = last || !actions_may_change_flow(actions);
>>>
>>> 'arg.exec' will remain uninitialized.
>>>
>>
>> Yes. I just wanted to avoid the call to actions_may_change_flow(NULL) in which
>> case arg.exec is not used. I'll make sure to zero-initialize it nevertheless.
>>
>>>> +
>>>>    	arg.probability = nla_get_u32(probability);
>>>>    
>>>>    	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
>>>> @@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>    	if (err)
>>>>    		return err;
>>>>    
>>>> -	err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>>> -				     eth_type, vlan_tci, mpls_label_count, log,
>>>> -				     depth + 1);
>>>> +	if (psample)
>>>> +		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
>>>> +					 nla_data(psample), nla_len(psample),
>>>> +					 log);
>>>> +	if (err)
>>>
>>> Can be used uninitialized.
>>>
>>
>> You're right, it should be inside the if above, although err should have been
>> initialized to output of ovs_nla_add_action.
> 
> Ah, I missed the previous assingnent.  But yes, it is still a little strange
> to handle errors this way.
> 
>>
>>>> +		return err;
>>>>    
>>>> +	if (actions)
>>>> +		err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>>> +					     eth_type, vlan_tci,
>>>> +					     mpls_label_count, log, depth + 1);
>>>>    	if (err)
>>>>    		return err;
>>>>    
>>>> @@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>>    	struct nlattr *start, *ac_start = NULL, *sample_arg;
>>>>    	int err = 0, rem = nla_len(attr);
>>>>    	const struct sample_arg *arg;
>>>> -	struct nlattr *actions;
>>>> +	struct nlattr *next;
>>>>    
>>>>    	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
>>>>    	if (!start)
>>>> @@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>>    
>>>>    	sample_arg = nla_data(attr);
>>>>    	arg = nla_data(sample_arg);
>>>> -	actions = nla_next(sample_arg, &rem);
>>>> +	next = nla_next(sample_arg, &rem);
>>>>    
>>>>    	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
>>>>    		err = -EMSGSIZE;
>>>>    		goto out;
>>>>    	}
>>>>    
>>>> -	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>>> -	if (!ac_start) {
>>>> -		err = -EMSGSIZE;
>>>> -		goto out;
>>>> +	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
>>>> +		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
>>>> +			    nla_data(next))) {
>>>> +			err = -EMSGSIZE;
>>>> +			goto out;
>>>> +		}
>>>> +		next = nla_next(next, &rem);
>>>>    	}
>>>>    
>>>> -	err = ovs_nla_put_actions(actions, rem, skb);
>>>> +	if (nla_ok(next, rem)) {
>>>> +		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>>> +		if (!ac_start) {
>>>> +			err = -EMSGSIZE;
>>>> +			goto out;
>>>> +		}
>>>> +		err = ovs_nla_put_actions(next, rem, skb);
>>>> +	}
>>>>    
>>>>    out:
>>>>    	if (err) {
>>>> -		nla_nest_cancel(skb, ac_start);
>>>> +		if (ac_start)
>>>> +			nla_nest_cancel(skb, ac_start);
>>>>    		nla_nest_cancel(skb, start);
>>>>    	} else {
>>>> -		nla_nest_end(skb, ac_start);
>>>> +		if (ac_start)
>>>> +			nla_nest_end(skb, ac_start);
>>>>    		nla_nest_end(skb, start);
>>>>    	}
>>>>    
>>>
>>
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-09  8:16         ` Adrian Moreno
@ 2024-04-09  9:35           ` Ilya Maximets
  2024-04-09 21:49             ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Maximets @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Adrian Moreno, netdev
  Cc: i.maximets, jiri, xiyou.wangcong, cmi, yotam.gi, aconole,
	echaudro, horms

On 4/9/24 10:16, Adrian Moreno wrote:
> 
> 
> On 4/8/24 22:40, Ilya Maximets wrote:
>> On 4/8/24 21:48, Adrian Moreno wrote:
>>>
>>>
>>> On 4/8/24 15:37, Ilya Maximets wrote:
>>>> On 4/8/24 14:57, Adrian Moreno wrote:
>>>>> Add a new attribute to the sample action, called
>>>>> OVS_SAMPLE_ATTR_PSAMPLE to allow userspace to pass a group_id and a
>>>>> user-defined cookie.
>>>>>
>>>>> The maximum length of the user-defined cookie is set to 16, same as
>>>>> tc_cookie to discourage using cookies that will not be offloadable.
>>>>>
>>>>> When set, the sample action will use psample to multicast the sample.
>>>>>
>>>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>>>> ---
>>>>>    include/uapi/linux/openvswitch.h | 22 +++++++--
>>>>>    net/openvswitch/actions.c        | 52 ++++++++++++++++++---
>>>>>    net/openvswitch/datapath.c       |  2 +-
>>>>>    net/openvswitch/flow_netlink.c   | 78 +++++++++++++++++++++++++-------
>>>>>    4 files changed, 127 insertions(+), 27 deletions(-)
>>>>
>>>> This cpatch is missing a few bits:
>>>>    - Update for Documentation/netlink/specs/ovs_flow.yaml
>>>>    - Maybe update for tools/testing/selftests/net/openvswitch/ovs-dpctl.py
>>>>    - Maybe some basic selftests.
>>>>
>>>
>>> Absolutely. I surely plan to add it on the first non-rfc version.
>>>
>>>>>
>>>>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>>>>> index efc82c318fa2..a5a32588f582 100644
>>>>> --- a/include/uapi/linux/openvswitch.h
>>>>> +++ b/include/uapi/linux/openvswitch.h
>>>>> @@ -646,15 +646,24 @@ enum ovs_flow_attr {
>>>>>     * %UINT32_MAX samples all packets and intermediate values sample intermediate
>>>>>     * fractions of packets.
>>>>>     * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event.
>>>>> - * Actions are passed as nested attributes.
>>>>> + * Actions are passed as nested attributes. Optional if OVS_SAMPLE_ATTR_PSAMPLE
>>>>> + * is not set.
>>>>
>>>> 'is set' probably. >
>>>>> + * @OVS_SAMPLE_ATTR_PSAMPLE: Arguments to be passed to psample. Optional if
>>>>> + * OVS_SAMPLE_ATTR_ACTIONS is not set.
>>>>
>>>> Same here.
>>>>
>>>
>>> Good catch, I rewrote those comments a bunch of times and the were left
>>> expressing the exact opposite!
>>>
>>>
>>>>>     *
>>>>> - * Executes the specified actions with the given probability on a per-packet
>>>>> - * basis.
>>>>> + * Either OVS_SAMPLE_ATTR_USER_COOKIE or OVS_SAMPLE_ATTR_USER_COOKIE must be
>>>>> + * specified.
>>>>> + *
>>>>> + * Executes the specified actions and/or sends the packet to psample
>>>>> + * with the given probability on a per-packet basis.
>>>>>     */
>>>>>    enum ovs_sample_attr {
>>>>>    	OVS_SAMPLE_ATTR_UNSPEC,
>>>>>    	OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */
>>>>>    	OVS_SAMPLE_ATTR_ACTIONS,     /* Nested OVS_ACTION_ATTR_* attributes. */
>>>>> +	OVS_SAMPLE_ATTR_PSAMPLE,     /* struct ovs_psample followed
>>>>> +				      * by the user-provided cookie.
>>>>> +				      */
>>>>>    	__OVS_SAMPLE_ATTR_MAX,
>>>>>    
>>>>>    #ifdef __KERNEL__
>>>>> @@ -675,6 +684,13 @@ struct sample_arg {
>>>>>    };
>>>>>    #endif
>>>>>    
>>>>> +#define OVS_PSAMPLE_COOKIE_MAX_SIZE 16
>>>>> +struct ovs_psample {
>>>>> +	__u32 group_id;		/* The group used for packet sampling. */
>>>>> +	__u32 user_cookie_len;	/* The length of the user-provided cookie. */
>>
>> Here as well, not sure if u32 makes sense as userspace can't
>> actually supply a cookie this large via netlink.
>>
>>>>> +	__u8 user_cookie[];	/* The user-provided cookie. */
>>>>> +};
>>>>
>>>> Structures are not a good approach for modern netlink.
>>>> use nested attributes instead.  This way we can also
>>>> eliminate the need for variable-length array and the
>>>> length field, if the length can be taken from a netlink
>>>> attribute directly, e.g. similar to NLA_BINARY in tc.
>>>>
>>>> If necessary, there could be a structure in the private
>>>> header to store the data for internal use.
>>>>
>>>
>>> Interesting. Thanks for the suggestion. I think it will also help action validation.
>>>
>>>>> +
>>>>>    /**
>>>>>     * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action.
>>>>>     * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION
>>>>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>>>>> index 6fcd7e2ca81f..45d2b325b76a 100644
>>>>> --- a/net/openvswitch/actions.c
>>>>> +++ b/net/openvswitch/actions.c
>>>>> @@ -24,6 +24,7 @@
>>>>>    #include <net/checksum.h>
>>>>>    #include <net/dsfield.h>
>>>>>    #include <net/mpls.h>
>>>>> +#include <net/psample.h>
>>>>>    #include <net/sctp/checksum.h>
>>>>>    
>>>>>    #include "datapath.h"
>>>>> @@ -1025,6 +1026,31 @@ static int dec_ttl_exception_handler(struct datapath *dp, struct sk_buff *skb,
>>>>>    	return 0;
>>>>>    }
>>>>>    
>>>>> +static int ovs_psample_packet(struct datapath *dp, struct sw_flow_key *key,
>>>>> +			      struct ovs_psample *psample, struct sk_buff *skb,
>>>>> +			      u32 rate)
>>>>> +{
>>>>> +	struct psample_group psample_group = {};
>>>>> +	struct psample_metadata md = {};
>>>>> +	struct vport *input_vport;
>>>>> +
>>>>> +	psample_group.group_num = psample->group_id;
>>>>> +	psample_group.net = ovs_dp_get_net(dp);
>>>>> +
>>>>> +	input_vport = ovs_vport_rcu(dp, key->phy.in_port);
>>>>> +	if (!input_vport)
>>>>> +		input_vport = ovs_vport_rcu(dp, OVSP_LOCAL);
>>>>> +
>>>>> +	md.in_ifindex = input_vport->dev->ifindex;
>>>>> +	md.user_cookie = psample->user_cookie;
>>>>> +	md.user_cookie_len = psample->user_cookie_len;
>>>>> +	md.trunc_size = skb->len;
>>>>> +
>>>>> +	psample_sample_packet(&psample_group, skb, rate, &md);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>    /* When 'last' is true, sample() should always consume the 'skb'.
>>>>>     * Otherwise, sample() should keep 'skb' intact regardless what
>>>>>     * actions are executed within sample().
>>>>> @@ -1033,16 +1059,17 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>>>    		  struct sw_flow_key *key, const struct nlattr *attr,
>>>>>    		  bool last)
>>>>>    {
>>>>> -	struct nlattr *actions;
>>>>> +	const struct sample_arg *arg;
>>>>>    	struct nlattr *sample_arg;
>>>>>    	int rem = nla_len(attr);
>>>>> -	const struct sample_arg *arg;
>>>>> +	struct nlattr *next;
>>>>>    	bool clone_flow_key;
>>>>> +	int ret;
>>>>>    
>>>>>    	/* The first action is always 'OVS_SAMPLE_ATTR_ARG'. */
>>>>>    	sample_arg = nla_data(attr);
>>>>>    	arg = nla_data(sample_arg);
>>>>> -	actions = nla_next(sample_arg, &rem);
>>>>> +	next = nla_next(sample_arg, &rem);
>>>>>    
>>>>>    	if ((arg->probability != U32_MAX) &&
>>>>>    	    (!arg->probability || get_random_u32() > arg->probability)) {
>>>>> @@ -1051,9 +1078,22 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
>>>>>    		return 0;
>>>>>    	}
>>>>>    
>>>>> -	clone_flow_key = !arg->exec;
>>>>> -	return clone_execute(dp, skb, key, 0, actions, rem, last,
>>>>> -			     clone_flow_key);
>>>>> +	if (next->nla_type == OVS_SAMPLE_ATTR_PSAMPLE) {
>>>>
>>>> Maybe add a commnet that OVS_SAMPLE_ATTR_PSAMPLE is always a sencond
>>>> argument when present.
>>>>
>>>> Is there a better way to handle this?
>>>>
>>>
>>> I also dislike it. The fact that actions are not nested but concatenated to
>>> makes the internal representation a bit flaky.
>>>
>>> The alternative I considered was adding the group_id and cookie to the
>>> internal-only OVS_SAMPLE_ATTR_ARG. However, now I wonder:
>>> - Should we also use nested attributes instead of structs for this internal one?
>>>
>>> And, probably off-topic: what's the story behind using netlink attributes to
>>> store action arguments internally? Has it ever been discussed using a union for
>>> instance?
>>
>> I'm not sure why it was originally done this way, but it is probably
>> the most efficient of convenient ways to pack a tree-like structure.
>> If we had a union, we would likely need a tree of actions with each
>> action bing a separately allocated node, since we have clone() and
>> even check_pkt_len() or other actions that can fork the pipeline.
>> Or we'll need to use some dummy actions as parethesis, prectically
>> emulating what we already have with netlink, but with structures.
>>
>> Netlink format is fast to scan, since it's in a single liner chunk
>> of memory.  Tree or list structure may have higher memory footprint
>> and be slower to iterate due to cache misses.
>>
> 
> That's true. Iteration over a number of actions is easy in netlink. And nesting 
> easily enables tree-like actions. But in the context of:
> 
>  > Structures are not a good approach for modern netlink. use nested attributes 
> instead.
> 
> If we try to implement all our actions following this way, and we keep just 
> copying the incoming actions into the internal representation, we incur in 
> unnecessary memory overhead (e.g: storing 2x struct nlaattr + padding of extra 
> memory to store 2 integers).
> 
> I don't want to derail the discussion into historical or futuristic changes, 
> just saying that the approach taken in the SAMPLE action (not including this 
> patch) of exposing arguments as attributes but having a kernel-only struct to 
> store them seems to me a good compromise.

Sure.  As I said, it's fine to have internal structures.  My comment
was mainly about uAPI part.  We should avoid structures in uAPI if
possible, as they are very hard to maintain and keep compatible with
older userspace in case some changes will be needed in the future.

> 
> 
>> There might be a better way to store all this for sure, but will
>> require some careful performance and memory consumption testing.
>>
>>>
>>>>> +		ret = ovs_psample_packet(dp, key, nla_data(next), skb,
>>>>> +					 arg->probability);
>>>>> +		if (last)
>>>>> +			ovs_kfree_skb_reason(skb, OVS_DROP_LAST_ACTION);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +		next = nla_next(next, &rem);
>>>>> +	}
>>>>> +
>>>>> +	if (nla_ok(next, rem)) {
>>>>> +		clone_flow_key = !arg->exec;
>>>>> +		ret = clone_execute(dp, skb, key, 0, next, rem, last,
>>>>> +				    clone_flow_key);
>>>>> +	}
>>>>> +	return ret;
>>>>>    }
>>>>>    
>>>>>    /* When 'last' is true, clone() should always consume the 'skb'.
>>>>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>>>>> index 99d72543abd3..b5b560c2e74b 100644
>>>>> --- a/net/openvswitch/datapath.c
>>>>> +++ b/net/openvswitch/datapath.c
>>>>> @@ -976,7 +976,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>>>>>    	struct sw_flow_match match;
>>>>>    	u32 ufid_flags = ovs_nla_get_ufid_flags(a[OVS_FLOW_ATTR_UFID_FLAGS]);
>>>>>    	int error;
>>>>> -	bool log = !a[OVS_FLOW_ATTR_PROBE];
>>>>> +	bool log = true;
>>>>
>>>> Debugging artifact?
>>>>
>>>
>>> Yep, sorry.
>>>
>>>>>    
>>>>>    	/* Must have key and actions. */
>>>>>    	error = -EINVAL;
>>>>> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>>>>> index f224d9bcea5e..f540686271b7 100644
>>>>> --- a/net/openvswitch/flow_netlink.c
>>>>> +++ b/net/openvswitch/flow_netlink.c
>>>>> @@ -2381,8 +2381,12 @@ static void ovs_nla_free_sample_action(const struct nlattr *action)
>>>>>    
>>>>>    	switch (nla_type(a)) {
>>>>>    	case OVS_SAMPLE_ATTR_ARG:
>>>>> -		/* The real list of actions follows this attribute. */
>>>>
>>>> Please, don't remove this comment.  Maybe extend it instead.
>>>>
>>>>>    		a = nla_next(a, &rem);
>>>>> +
>>>>> +		/* OVS_SAMPLE_ATTR_PSAMPLE may be present. */
>>>>> +		if (nla_type(a) == OVS_SAMPLE_ATTR_PSAMPLE)
>>>>> +			a = nla_next(a, &rem);
>>>>> +
>>>>>    		ovs_nla_free_nested_actions(a, rem);
>>>>>    		break;
>>>>>    	}
>>>>> @@ -2561,6 +2565,9 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>>>>>    				  u32 mpls_label_count, bool log,
>>>>>    				  u32 depth);
>>>>>    
>>>>> +static int copy_action(const struct nlattr *from,
>>>>> +		       struct sw_flow_actions **sfa, bool log);
>>>>> +
>>>>>    static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>>    				    const struct sw_flow_key *key,
>>>>>    				    struct sw_flow_actions **sfa,
>>>>> @@ -2569,10 +2576,10 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>>    				    u32 depth)
>>>>>    {
>>>>>    	const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1];
>>>>> -	const struct nlattr *probability, *actions;
>>>>> +	const struct nlattr *probability, *actions, *psample;
>>>>>    	const struct nlattr *a;
>>>>> -	int rem, start, err;
>>>>>    	struct sample_arg arg;
>>>>> +	int rem, start, err;
>>>>>    
>>>>>    	memset(attrs, 0, sizeof(attrs));
>>>>>    	nla_for_each_nested(a, attr, rem) {
>>>>> @@ -2589,7 +2596,23 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>>    		return -EINVAL;
>>>>>    
>>>>>    	actions = attrs[OVS_SAMPLE_ATTR_ACTIONS];
>>>>> -	if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN))
>>>>> +	if (actions && (!nla_len(actions) || nla_len(actions) < NLA_HDRLEN))
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	psample = attrs[OVS_SAMPLE_ATTR_PSAMPLE];
>>>>> +	if (psample) {
>>>>> +		struct ovs_psample *ovs_ps;
>>>>> +
>>>>> +		if (!nla_len(psample) || nla_len(psample) < sizeof(*ovs_ps))
>>>>> +			return -EINVAL;
>>>>> +
>>>>> +		ovs_ps = nla_data(psample);
>>>>> +		if (ovs_ps->user_cookie_len > OVS_PSAMPLE_COOKIE_MAX_SIZE ||
>>>>> +		    nla_len(psample) != sizeof(*ovs_ps) + ovs_ps->user_cookie_len)
>>>>> +			return -EINVAL;
>>>>> +	}
>>>>> +
>>>>> +	if (!psample && !actions)
>>>>>    		return -EINVAL;
>>>>>    
>>>>>    	/* validation done, copy sample action. */
>>>>> @@ -2608,7 +2631,9 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>>    	 * If the sample is the last action, it can always be excuted
>>>>>    	 * rather than deferred.
>>>>>    	 */
>>>>> -	arg.exec = last || !actions_may_change_flow(actions);
>>>>> +	if (actions)
>>>>> +		arg.exec = last || !actions_may_change_flow(actions);
>>>>
>>>> 'arg.exec' will remain uninitialized.
>>>>
>>>
>>> Yes. I just wanted to avoid the call to actions_may_change_flow(NULL) in which
>>> case arg.exec is not used. I'll make sure to zero-initialize it nevertheless.
>>>
>>>>> +
>>>>>    	arg.probability = nla_get_u32(probability);
>>>>>    
>>>>>    	err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_ARG, &arg, sizeof(arg),
>>>>> @@ -2616,10 +2641,17 @@ static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
>>>>>    	if (err)
>>>>>    		return err;
>>>>>    
>>>>> -	err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>>>> -				     eth_type, vlan_tci, mpls_label_count, log,
>>>>> -				     depth + 1);
>>>>> +	if (psample)
>>>>> +		err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PSAMPLE,
>>>>> +					 nla_data(psample), nla_len(psample),
>>>>> +					 log);
>>>>> +	if (err)
>>>>
>>>> Can be used uninitialized.
>>>>
>>>
>>> You're right, it should be inside the if above, although err should have been
>>> initialized to output of ovs_nla_add_action.
>>
>> Ah, I missed the previous assingnent.  But yes, it is still a little strange
>> to handle errors this way.
>>
>>>
>>>>> +		return err;
>>>>>    
>>>>> +	if (actions)
>>>>> +		err = __ovs_nla_copy_actions(net, actions, key, sfa,
>>>>> +					     eth_type, vlan_tci,
>>>>> +					     mpls_label_count, log, depth + 1);
>>>>>    	if (err)
>>>>>    		return err;
>>>>>    
>>>>> @@ -3538,7 +3570,7 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>>>    	struct nlattr *start, *ac_start = NULL, *sample_arg;
>>>>>    	int err = 0, rem = nla_len(attr);
>>>>>    	const struct sample_arg *arg;
>>>>> -	struct nlattr *actions;
>>>>> +	struct nlattr *next;
>>>>>    
>>>>>    	start = nla_nest_start_noflag(skb, OVS_ACTION_ATTR_SAMPLE);
>>>>>    	if (!start)
>>>>> @@ -3546,27 +3578,39 @@ static int sample_action_to_attr(const struct nlattr *attr,
>>>>>    
>>>>>    	sample_arg = nla_data(attr);
>>>>>    	arg = nla_data(sample_arg);
>>>>> -	actions = nla_next(sample_arg, &rem);
>>>>> +	next = nla_next(sample_arg, &rem);
>>>>>    
>>>>>    	if (nla_put_u32(skb, OVS_SAMPLE_ATTR_PROBABILITY, arg->probability)) {
>>>>>    		err = -EMSGSIZE;
>>>>>    		goto out;
>>>>>    	}
>>>>>    
>>>>> -	ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>>>> -	if (!ac_start) {
>>>>> -		err = -EMSGSIZE;
>>>>> -		goto out;
>>>>> +	if (nla_type(next) == OVS_SAMPLE_ATTR_PSAMPLE) {
>>>>> +		if (nla_put(skb, OVS_SAMPLE_ATTR_PSAMPLE, nla_len(next),
>>>>> +			    nla_data(next))) {
>>>>> +			err = -EMSGSIZE;
>>>>> +			goto out;
>>>>> +		}
>>>>> +		next = nla_next(next, &rem);
>>>>>    	}
>>>>>    
>>>>> -	err = ovs_nla_put_actions(actions, rem, skb);
>>>>> +	if (nla_ok(next, rem)) {
>>>>> +		ac_start = nla_nest_start_noflag(skb, OVS_SAMPLE_ATTR_ACTIONS);
>>>>> +		if (!ac_start) {
>>>>> +			err = -EMSGSIZE;
>>>>> +			goto out;
>>>>> +		}
>>>>> +		err = ovs_nla_put_actions(next, rem, skb);
>>>>> +	}
>>>>>    
>>>>>    out:
>>>>>    	if (err) {
>>>>> -		nla_nest_cancel(skb, ac_start);
>>>>> +		if (ac_start)
>>>>> +			nla_nest_cancel(skb, ac_start);
>>>>>    		nla_nest_cancel(skb, start);
>>>>>    	} else {
>>>>> -		nla_nest_end(skb, ac_start);
>>>>> +		if (ac_start)
>>>>> +			nla_nest_end(skb, ac_start);
>>>>>    		nla_nest_end(skb, start);
>>>>>    	}
>>>>>    
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-08 19:24     ` Adrian Moreno
@ 2024-04-09 14:43       ` Aaron Conole
  2024-04-10 13:32         ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Aaron Conole @ 2024-04-09 14:43 UTC (permalink / raw)
  To: Adrian Moreno
  Cc: Ilya Maximets, netdev, jiri, xiyou.wangcong, cmi, yotam.gi,
	echaudro, horms

Adrian Moreno <amorenoz@redhat.com> writes:

> On 4/8/24 15:18, Ilya Maximets wrote:
>> [copying my previous reply since this version actually has netdev@ in Cc]
>> On 4/8/24 14:57, Adrian Moreno wrote:
>>> Packet samples can come from several places (e.g: different tc sample
>>> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
>>> to differentiate them.
>>>
>>> Likewise, sample consumers that listen on the multicast group may only
>>> be interested on a single group. However, they are currently forced to
>>> receive all samples and discard the ones that are not relevant, causing
>>> unnecessary overhead.
>>>
>>> Allow users to filter on the desired group_id by adding a new command
>>> SAMPLE_FILTER_SET that can be used to pass the desired group id.
>>> Store this filter on the per-socket private pointer and use it for
>>> filtering multicasted samples.
>>>
>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>> ---
>>>   include/uapi/linux/psample.h |   1 +
>>>   net/psample/psample.c        | 127 +++++++++++++++++++++++++++++++++--
>>>   2 files changed, 122 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
>>> index e585db5bf2d2..5e0305b1520d 100644
>>> --- a/include/uapi/linux/psample.h
>>> +++ b/include/uapi/linux/psample.h
>>> @@ -28,6 +28,7 @@ enum psample_command {
>>>   	PSAMPLE_CMD_GET_GROUP,
>>>   	PSAMPLE_CMD_NEW_GROUP,
>>>   	PSAMPLE_CMD_DEL_GROUP,
>>> +	PSAMPLE_CMD_SAMPLE_FILTER_SET,
>> Other commands are names as PSAMPLE_CMD_VERB_NOUN, so this new one
>> should be PSAMPLE_CMD_SET_FILTER.  (The SAMPLE part seems unnecessary.)
>> Some functions/structures need to be renamed accordingly.
>> 
>
> Sure, I'll rename it when I sent the next version.
>
>>>   };
>>>     enum psample_tunnel_key_attr {
>>> diff --git a/net/psample/psample.c b/net/psample/psample.c
>>> index a5d9b8446f77..a0cef63dfdec 100644
>>> --- a/net/psample/psample.c
>>> +++ b/net/psample/psample.c
>>> @@ -98,13 +98,84 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
>>>   	return msg->len;
>>>   }
>>>   -static const struct genl_small_ops psample_nl_ops[] = {
>>> +struct psample_obj_desc {
>>> +	struct rcu_head rcu;
>>> +	u32 group_num;
>>> +	bool group_num_valid;
>>> +};
>>> +
>>> +struct psample_nl_sock_priv {
>>> +	struct psample_obj_desc __rcu *flt;
>> Can we call it 'fileter' ?  I find it hard to read the code with
>> this unnecessary abbreviation.  Same for the lock below.
>> 
>
> Sure.
>
>>> +	spinlock_t flt_lock; /* Protects flt. */
>>> +};
>>> +
>>> +static void psample_nl_sock_priv_init(void *priv)
>>> +{
>>> +	struct psample_nl_sock_priv *sk_priv = priv;
>>> +
>>> +	spin_lock_init(&sk_priv->flt_lock);
>>> +}
>>> +
>>> +static void psample_nl_sock_priv_destroy(void *priv)
>>> +{
>>> +	struct psample_nl_sock_priv *sk_priv = priv;
>>> +	struct psample_obj_desc *flt;
>>> +
>>> +	flt = rcu_dereference_protected(sk_priv->flt, true);
>>> +	kfree_rcu(flt, rcu);
>>> +}
>>> +
>>> +static int psample_nl_sample_filter_set_doit(struct sk_buff *skb,
>>> +					     struct genl_info *info)
>>> +{
>>> +	struct psample_nl_sock_priv *sk_priv;
>>> +	struct nlattr **attrs = info->attrs;
>>> +	struct psample_obj_desc *flt;
>>> +
>>> +	flt = kzalloc(sizeof(*flt), GFP_KERNEL);
>>> +
>>> +	if (attrs[PSAMPLE_ATTR_SAMPLE_GROUP]) {
>>> +		flt->group_num = nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_GROUP]);
>>> +		flt->group_num_valid = true;
>>> +	}
>>> +
>>> +	if (!flt->group_num_valid) {
>>> +		kfree(flt);
>> Might be better to not allocate it in the first place.
>> 
>
> Absolutely.
>
>>> +		flt = NULL;
>>> +	}
>>> +
>>> +	sk_priv = genl_sk_priv_get(&psample_nl_family, NETLINK_CB(skb).sk);
>>> +	if (IS_ERR(sk_priv)) {
>>> +		kfree(flt);
>>> +		return PTR_ERR(sk_priv);
>>> +	}
>>> +
>>> +	spin_lock(&sk_priv->flt_lock);
>>> +	flt = rcu_replace_pointer(sk_priv->flt, flt,
>>> +				  lockdep_is_held(&sk_priv->flt_lock));
>>> +	spin_unlock(&sk_priv->flt_lock);
>>> +	kfree_rcu(flt, rcu);
>>> +	return 0;
>>> +}
>>> +
>>> +static const struct nla_policy
>>> +	psample_sample_filter_set_policy[PSAMPLE_ATTR_SAMPLE_GROUP + 1] = {
>>> +	[PSAMPLE_ATTR_SAMPLE_GROUP] = { .type = NLA_U32, },
>> This indentation is confusing, though I'm not sure what's a better
>> way.
>> 
>
> I now! I'll try to move it around see if it improves things.
>
>>> +};
>>> +
>>> +static const struct genl_ops psample_nl_ops[] = {
>>>   	{
>>>   		.cmd = PSAMPLE_CMD_GET_GROUP,
>>>   		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>>>   		.dumpit = psample_nl_cmd_get_group_dumpit,
>>>   		/* can be retrieved by unprivileged users */
>>> -	}
>>> +	},
>>> +	{
>>> +		.cmd		= PSAMPLE_CMD_SAMPLE_FILTER_SET,
>>> +		.doit		= psample_nl_sample_filter_set_doit,
>>> +		.policy		= psample_sample_filter_set_policy,
>>> +		.flags		= 0,
>>> +	},
>>>   };
>>>     static struct genl_family psample_nl_family __ro_after_init = {
>>> @@ -114,10 +185,13 @@ static struct genl_family psample_nl_family __ro_after_init = {
>>>   	.netnsok	= true,
>>>   	.module		= THIS_MODULE,
>>>   	.mcgrps		= psample_nl_mcgrps,
>>> -	.small_ops	= psample_nl_ops,
>>> -	.n_small_ops	= ARRAY_SIZE(psample_nl_ops),
>>> +	.ops		= psample_nl_ops,
>>> +	.n_ops		= ARRAY_SIZE(psample_nl_ops),
>>>   	.resv_start_op	= PSAMPLE_CMD_GET_GROUP + 1,
>>>   	.n_mcgrps	= ARRAY_SIZE(psample_nl_mcgrps),
>>> +	.sock_priv_size		= sizeof(struct psample_nl_sock_priv),
>>> +	.sock_priv_init		= psample_nl_sock_priv_init,
>>> +	.sock_priv_destroy	= psample_nl_sock_priv_destroy,
>>>   };
>>>     static void psample_group_notify(struct psample_group *group,
>>> @@ -360,6 +434,42 @@ static int psample_tunnel_meta_len(struct ip_tunnel_info *tun_info)
>>>   }
>>>   #endif
>>>   +static inline void psample_nl_obj_desc_init(struct
>>> psample_obj_desc *desc,
>>> +					    u32 group_num)
>>> +{
>>> +	memset(desc, 0, sizeof(*desc));
>>> +	desc->group_num = group_num;
>>> +	desc->group_num_valid = true;
>>> +}
>>> +
>>> +static bool psample_obj_desc_match(struct psample_obj_desc *desc,
>>> +				   struct psample_obj_desc *flt)
>>> +{
>>> +	if (desc->group_num_valid && flt->group_num_valid &&
>>> +	    desc->group_num != flt->group_num)
>>> +		return false;
>>> +	return true;
>> This fucntion returns 'true' if one of the arguments is not valid.
>> I'd not expect such behavior from a 'match' function.
>> I understand the intention that psample should sample everything
>> to sockets that do not request filters, but that should not be part
>> of the 'match' logic, or more appropriate function name should be
>> chosen.  Also, if the group is not initialized, but the filter is,
>> it should not match, logically.  The validity on filter and the
>> current sample is not symmetric.
>> 
>
> The descriptor should always be initialized but I think double
> checking should be OK as in the context of this particular function,
> it might not be clear it is.
>
>> And I'm not really sure if the 'group_num_valid' is actually needed.
>> Can the NULL pointer be used as an indicator?  If so, then maybe
>> the whole psample_obj_desc structure is not needed as it will
>> contain a single field.
>
> If we only filter on group_id, then yes. However, as I was writing
> this, I thought maybe opening the door to filtering on more fields
> such as the protocol in/out interfaces, etc. Now that I read this I
> understand the current code is confusing: I should have left a comment
> or mention it in the commit message.

If you want to have such filtering options, does it make sense to
instead have the listening program send a set of bpf instructions for
filtering instead?  I think the data should be available at the point
where simple bpf is attached (SO_ATTACH_BPF to the psample socket, and
the filter should run as part of the broadcast message IIRC since it
populates the sk_filter field).

>> 
>>> +}
>>> +
>>> +static int psample_nl_sample_filter(struct sock *dsk, struct sk_buff *skb,
>>> +				    void *data)
>>> +{
>>> +	struct psample_obj_desc *desc = data;
>>> +	struct psample_nl_sock_priv *sk_priv;
>>> +	struct psample_obj_desc *flt;
>>> +	int ret = 0;
>>> +
>>> +	rcu_read_lock();
>>> +	sk_priv = __genl_sk_priv_get(&psample_nl_family, dsk);
>>> +	if (!IS_ERR_OR_NULL(sk_priv)) {
>>> +		flt = rcu_dereference(sk_priv->flt);
>>> +		if (flt)
>>> +			ret = !psample_obj_desc_match(desc, flt);
>>> +	}
>>> +	rcu_read_unlock();
>>> +	return ret;
>>> +}
>>> +
>>>   void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>   			   u32 sample_rate, const struct psample_metadata *md)
>>>   {
>>> @@ -370,6 +480,7 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>   #ifdef CONFIG_INET
>>>   	struct ip_tunnel_info *tun_info;
>>>   #endif
>>> +	struct psample_obj_desc desc;
>>>   	struct sk_buff *nl_skb;
>>>   	int data_len;
>>>   	int meta_len;
>>> @@ -487,8 +598,12 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>   #endif
>>>     	genlmsg_end(nl_skb, data);
>>> -	genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
>>> -				PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
>>> +	psample_nl_obj_desc_init(&desc, group->group_num);
>>> +	genlmsg_multicast_netns_filtered(&psample_nl_family,
>>> +					 group->net, nl_skb, 0,
>>> +					 PSAMPLE_NL_MCGRP_SAMPLE,
>>> +					 GFP_ATOMIC, psample_nl_sample_filter,
>>> +					 &desc);
>>>     	return;
>>>   error:
>> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 1/5] net: netlink: export genl private pointer getters
  2024-04-08 12:57 ` [RFC net-next v2 1/5] net: netlink: export genl private pointer getters Adrian Moreno
@ 2024-04-09 21:48   ` Jakub Kicinski
  0 siblings, 0 replies; 26+ messages in thread
From: Jakub Kicinski @ 2024-04-09 21:48 UTC (permalink / raw)
  To: Adrian Moreno
  Cc: netdev, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets, aconole,
	echaudro, horms

On Mon,  8 Apr 2024 14:57:40 +0200 Adrian Moreno wrote:
> Both "__genl_sk_priv_get" and "genl_sk_priv_get" are useful functions to
> handle private pointers attached to genetlink sockets.
> Export them so they can be used in modules.
> 
> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>

Reviewed-by: Jakub Kicinski <kuba@kernel.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-09  9:35           ` Ilya Maximets
@ 2024-04-09 21:49             ` Jakub Kicinski
  2024-04-10 13:44               ` Adrian Moreno
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2024-04-09 21:49 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Adrian Moreno, netdev, jiri, xiyou.wangcong, cmi, yotam.gi,
	aconole, echaudro, horms

On Tue, 9 Apr 2024 11:35:04 +0200 Ilya Maximets wrote:
> > If we try to implement all our actions following this way, and we keep just 
> > copying the incoming actions into the internal representation, we incur in 
> > unnecessary memory overhead (e.g: storing 2x struct nlaattr + padding of extra 
> > memory to store 2 integers).
> > 
> > I don't want to derail the discussion into historical or futuristic changes, 
> > just saying that the approach taken in the SAMPLE action (not including this 
> > patch) of exposing arguments as attributes but having a kernel-only struct to 
> > store them seems to me a good compromise.  
> 
> Sure.  As I said, it's fine to have internal structures.  My comment
> was mainly about uAPI part.  We should avoid structures in uAPI if
> possible, as they are very hard to maintain and keep compatible with
> older userspace in case some changes will be needed in the future.

FWIW there are some YAML specs for ovs under
Documentation/netlink/specs/ovs_*
perhaps they should also be updated?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-08 12:57 ` [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id Adrian Moreno
  2024-04-08 13:18   ` Ilya Maximets
@ 2024-04-10 13:06   ` Ido Schimmel
  2024-04-10 13:42     ` Adrian Moreno
  1 sibling, 1 reply; 26+ messages in thread
From: Ido Schimmel @ 2024-04-10 13:06 UTC (permalink / raw)
  To: Adrian Moreno
  Cc: netdev, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets, aconole,
	echaudro, horms

On Mon, Apr 08, 2024 at 02:57:41PM +0200, Adrian Moreno wrote:
> Packet samples can come from several places (e.g: different tc sample
> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
> to differentiate them.
> 
> Likewise, sample consumers that listen on the multicast group may only
> be interested on a single group. However, they are currently forced to
> receive all samples and discard the ones that are not relevant, causing
> unnecessary overhead.
> 
> Allow users to filter on the desired group_id by adding a new command
> SAMPLE_FILTER_SET that can be used to pass the desired group id.
> Store this filter on the per-socket private pointer and use it for
> filtering multicasted samples.

Did you consider using BPF for this type of filtering instead of new
uAPI?

See example here:
https://github.com/Mellanox/libpsample/blob/master/src/psample.c#L290

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-09 14:43       ` Aaron Conole
@ 2024-04-10 13:32         ` Adrian Moreno
  0 siblings, 0 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-10 13:32 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Ilya Maximets, netdev, jiri, xiyou.wangcong, cmi, yotam.gi,
	echaudro, horms



On 4/9/24 16:43, Aaron Conole wrote:
> Adrian Moreno <amorenoz@redhat.com> writes:
> 
>> On 4/8/24 15:18, Ilya Maximets wrote:
>>> [copying my previous reply since this version actually has netdev@ in Cc]
>>> On 4/8/24 14:57, Adrian Moreno wrote:
>>>> Packet samples can come from several places (e.g: different tc sample
>>>> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
>>>> to differentiate them.
>>>>
>>>> Likewise, sample consumers that listen on the multicast group may only
>>>> be interested on a single group. However, they are currently forced to
>>>> receive all samples and discard the ones that are not relevant, causing
>>>> unnecessary overhead.
>>>>
>>>> Allow users to filter on the desired group_id by adding a new command
>>>> SAMPLE_FILTER_SET that can be used to pass the desired group id.
>>>> Store this filter on the per-socket private pointer and use it for
>>>> filtering multicasted samples.
>>>>
>>>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>>>> ---
>>>>    include/uapi/linux/psample.h |   1 +
>>>>    net/psample/psample.c        | 127 +++++++++++++++++++++++++++++++++--
>>>>    2 files changed, 122 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/psample.h b/include/uapi/linux/psample.h
>>>> index e585db5bf2d2..5e0305b1520d 100644
>>>> --- a/include/uapi/linux/psample.h
>>>> +++ b/include/uapi/linux/psample.h
>>>> @@ -28,6 +28,7 @@ enum psample_command {
>>>>    	PSAMPLE_CMD_GET_GROUP,
>>>>    	PSAMPLE_CMD_NEW_GROUP,
>>>>    	PSAMPLE_CMD_DEL_GROUP,
>>>> +	PSAMPLE_CMD_SAMPLE_FILTER_SET,
>>> Other commands are names as PSAMPLE_CMD_VERB_NOUN, so this new one
>>> should be PSAMPLE_CMD_SET_FILTER.  (The SAMPLE part seems unnecessary.)
>>> Some functions/structures need to be renamed accordingly.
>>>
>>
>> Sure, I'll rename it when I sent the next version.
>>
>>>>    };
>>>>      enum psample_tunnel_key_attr {
>>>> diff --git a/net/psample/psample.c b/net/psample/psample.c
>>>> index a5d9b8446f77..a0cef63dfdec 100644
>>>> --- a/net/psample/psample.c
>>>> +++ b/net/psample/psample.c
>>>> @@ -98,13 +98,84 @@ static int psample_nl_cmd_get_group_dumpit(struct sk_buff *msg,
>>>>    	return msg->len;
>>>>    }
>>>>    -static const struct genl_small_ops psample_nl_ops[] = {
>>>> +struct psample_obj_desc {
>>>> +	struct rcu_head rcu;
>>>> +	u32 group_num;
>>>> +	bool group_num_valid;
>>>> +};
>>>> +
>>>> +struct psample_nl_sock_priv {
>>>> +	struct psample_obj_desc __rcu *flt;
>>> Can we call it 'fileter' ?  I find it hard to read the code with
>>> this unnecessary abbreviation.  Same for the lock below.
>>>
>>
>> Sure.
>>
>>>> +	spinlock_t flt_lock; /* Protects flt. */
>>>> +};
>>>> +
>>>> +static void psample_nl_sock_priv_init(void *priv)
>>>> +{
>>>> +	struct psample_nl_sock_priv *sk_priv = priv;
>>>> +
>>>> +	spin_lock_init(&sk_priv->flt_lock);
>>>> +}
>>>> +
>>>> +static void psample_nl_sock_priv_destroy(void *priv)
>>>> +{
>>>> +	struct psample_nl_sock_priv *sk_priv = priv;
>>>> +	struct psample_obj_desc *flt;
>>>> +
>>>> +	flt = rcu_dereference_protected(sk_priv->flt, true);
>>>> +	kfree_rcu(flt, rcu);
>>>> +}
>>>> +
>>>> +static int psample_nl_sample_filter_set_doit(struct sk_buff *skb,
>>>> +					     struct genl_info *info)
>>>> +{
>>>> +	struct psample_nl_sock_priv *sk_priv;
>>>> +	struct nlattr **attrs = info->attrs;
>>>> +	struct psample_obj_desc *flt;
>>>> +
>>>> +	flt = kzalloc(sizeof(*flt), GFP_KERNEL);
>>>> +
>>>> +	if (attrs[PSAMPLE_ATTR_SAMPLE_GROUP]) {
>>>> +		flt->group_num = nla_get_u32(attrs[PSAMPLE_ATTR_SAMPLE_GROUP]);
>>>> +		flt->group_num_valid = true;
>>>> +	}
>>>> +
>>>> +	if (!flt->group_num_valid) {
>>>> +		kfree(flt);
>>> Might be better to not allocate it in the first place.
>>>
>>
>> Absolutely.
>>
>>>> +		flt = NULL;
>>>> +	}
>>>> +
>>>> +	sk_priv = genl_sk_priv_get(&psample_nl_family, NETLINK_CB(skb).sk);
>>>> +	if (IS_ERR(sk_priv)) {
>>>> +		kfree(flt);
>>>> +		return PTR_ERR(sk_priv);
>>>> +	}
>>>> +
>>>> +	spin_lock(&sk_priv->flt_lock);
>>>> +	flt = rcu_replace_pointer(sk_priv->flt, flt,
>>>> +				  lockdep_is_held(&sk_priv->flt_lock));
>>>> +	spin_unlock(&sk_priv->flt_lock);
>>>> +	kfree_rcu(flt, rcu);
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static const struct nla_policy
>>>> +	psample_sample_filter_set_policy[PSAMPLE_ATTR_SAMPLE_GROUP + 1] = {
>>>> +	[PSAMPLE_ATTR_SAMPLE_GROUP] = { .type = NLA_U32, },
>>> This indentation is confusing, though I'm not sure what's a better
>>> way.
>>>
>>
>> I now! I'll try to move it around see if it improves things.
>>
>>>> +};
>>>> +
>>>> +static const struct genl_ops psample_nl_ops[] = {
>>>>    	{
>>>>    		.cmd = PSAMPLE_CMD_GET_GROUP,
>>>>    		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>>>>    		.dumpit = psample_nl_cmd_get_group_dumpit,
>>>>    		/* can be retrieved by unprivileged users */
>>>> -	}
>>>> +	},
>>>> +	{
>>>> +		.cmd		= PSAMPLE_CMD_SAMPLE_FILTER_SET,
>>>> +		.doit		= psample_nl_sample_filter_set_doit,
>>>> +		.policy		= psample_sample_filter_set_policy,
>>>> +		.flags		= 0,
>>>> +	},
>>>>    };
>>>>      static struct genl_family psample_nl_family __ro_after_init = {
>>>> @@ -114,10 +185,13 @@ static struct genl_family psample_nl_family __ro_after_init = {
>>>>    	.netnsok	= true,
>>>>    	.module		= THIS_MODULE,
>>>>    	.mcgrps		= psample_nl_mcgrps,
>>>> -	.small_ops	= psample_nl_ops,
>>>> -	.n_small_ops	= ARRAY_SIZE(psample_nl_ops),
>>>> +	.ops		= psample_nl_ops,
>>>> +	.n_ops		= ARRAY_SIZE(psample_nl_ops),
>>>>    	.resv_start_op	= PSAMPLE_CMD_GET_GROUP + 1,
>>>>    	.n_mcgrps	= ARRAY_SIZE(psample_nl_mcgrps),
>>>> +	.sock_priv_size		= sizeof(struct psample_nl_sock_priv),
>>>> +	.sock_priv_init		= psample_nl_sock_priv_init,
>>>> +	.sock_priv_destroy	= psample_nl_sock_priv_destroy,
>>>>    };
>>>>      static void psample_group_notify(struct psample_group *group,
>>>> @@ -360,6 +434,42 @@ static int psample_tunnel_meta_len(struct ip_tunnel_info *tun_info)
>>>>    }
>>>>    #endif
>>>>    +static inline void psample_nl_obj_desc_init(struct
>>>> psample_obj_desc *desc,
>>>> +					    u32 group_num)
>>>> +{
>>>> +	memset(desc, 0, sizeof(*desc));
>>>> +	desc->group_num = group_num;
>>>> +	desc->group_num_valid = true;
>>>> +}
>>>> +
>>>> +static bool psample_obj_desc_match(struct psample_obj_desc *desc,
>>>> +				   struct psample_obj_desc *flt)
>>>> +{
>>>> +	if (desc->group_num_valid && flt->group_num_valid &&
>>>> +	    desc->group_num != flt->group_num)
>>>> +		return false;
>>>> +	return true;
>>> This fucntion returns 'true' if one of the arguments is not valid.
>>> I'd not expect such behavior from a 'match' function.
>>> I understand the intention that psample should sample everything
>>> to sockets that do not request filters, but that should not be part
>>> of the 'match' logic, or more appropriate function name should be
>>> chosen.  Also, if the group is not initialized, but the filter is,
>>> it should not match, logically.  The validity on filter and the
>>> current sample is not symmetric.
>>>
>>
>> The descriptor should always be initialized but I think double
>> checking should be OK as in the context of this particular function,
>> it might not be clear it is.
>>
>>> And I'm not really sure if the 'group_num_valid' is actually needed.
>>> Can the NULL pointer be used as an indicator?  If so, then maybe
>>> the whole psample_obj_desc structure is not needed as it will
>>> contain a single field.
>>
>> If we only filter on group_id, then yes. However, as I was writing
>> this, I thought maybe opening the door to filtering on more fields
>> such as the protocol in/out interfaces, etc. Now that I read this I
>> understand the current code is confusing: I should have left a comment
>> or mention it in the commit message.
> 
> If you want to have such filtering options, does it make sense to
> instead have the listening program send a set of bpf instructions for
> filtering instead?  I think the data should be available at the point
> where simple bpf is attached (SO_ATTACH_BPF to the psample socket, and
> the filter should run as part of the broadcast message IIRC since it
> populates the sk_filter field).
> 

That's a good point. I hope parsing the netlink messages won't be too cumbersome.
So let's limit it to group_ids. How about filtering on a number of group_ids? Is 
that worth it?


>>>
>>>> +}
>>>> +
>>>> +static int psample_nl_sample_filter(struct sock *dsk, struct sk_buff *skb,
>>>> +				    void *data)
>>>> +{
>>>> +	struct psample_obj_desc *desc = data;
>>>> +	struct psample_nl_sock_priv *sk_priv;
>>>> +	struct psample_obj_desc *flt;
>>>> +	int ret = 0;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	sk_priv = __genl_sk_priv_get(&psample_nl_family, dsk);
>>>> +	if (!IS_ERR_OR_NULL(sk_priv)) {
>>>> +		flt = rcu_dereference(sk_priv->flt);
>>>> +		if (flt)
>>>> +			ret = !psample_obj_desc_match(desc, flt);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +	return ret;
>>>> +}
>>>> +
>>>>    void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>>    			   u32 sample_rate, const struct psample_metadata *md)
>>>>    {
>>>> @@ -370,6 +480,7 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>>    #ifdef CONFIG_INET
>>>>    	struct ip_tunnel_info *tun_info;
>>>>    #endif
>>>> +	struct psample_obj_desc desc;
>>>>    	struct sk_buff *nl_skb;
>>>>    	int data_len;
>>>>    	int meta_len;
>>>> @@ -487,8 +598,12 @@ void psample_sample_packet(struct psample_group *group, struct sk_buff *skb,
>>>>    #endif
>>>>      	genlmsg_end(nl_skb, data);
>>>> -	genlmsg_multicast_netns(&psample_nl_family, group->net, nl_skb, 0,
>>>> -				PSAMPLE_NL_MCGRP_SAMPLE, GFP_ATOMIC);
>>>> +	psample_nl_obj_desc_init(&desc, group->group_num);
>>>> +	genlmsg_multicast_netns_filtered(&psample_nl_family,
>>>> +					 group->net, nl_skb, 0,
>>>> +					 PSAMPLE_NL_MCGRP_SAMPLE,
>>>> +					 GFP_ATOMIC, psample_nl_sample_filter,
>>>> +					 &desc);
>>>>      	return;
>>>>    error:
>>>
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id
  2024-04-10 13:06   ` Ido Schimmel
@ 2024-04-10 13:42     ` Adrian Moreno
  0 siblings, 0 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-10 13:42 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, jiri, xiyou.wangcong, cmi, yotam.gi, i.maximets, aconole,
	echaudro, horms



On 4/10/24 15:06, Ido Schimmel wrote:
> On Mon, Apr 08, 2024 at 02:57:41PM +0200, Adrian Moreno wrote:
>> Packet samples can come from several places (e.g: different tc sample
>> actions), typically using the sample group (PSAMPLE_ATTR_SAMPLE_GROUP)
>> to differentiate them.
>>
>> Likewise, sample consumers that listen on the multicast group may only
>> be interested on a single group. However, they are currently forced to
>> receive all samples and discard the ones that are not relevant, causing
>> unnecessary overhead.
>>
>> Allow users to filter on the desired group_id by adding a new command
>> SAMPLE_FILTER_SET that can be used to pass the desired group id.
>> Store this filter on the per-socket private pointer and use it for
>> filtering multicasted samples.
> 
> Did you consider using BPF for this type of filtering instead of new
> uAPI?
>

Yes. I ended up going for a uAPI change because, since the group_id is part of 
the psample uAPI semantics, requiring users to load ebpf programs for that 
seemed a bit excessive. Given devlink already uses this mechanism [1], I thought 
it would make things easier for users that already just use netlink.

[1] https://lore.kernel.org/netdev/20231214181549.1270696-9-jiri@resnulli.us/

> See example here:
> https://github.com/Mellanox/libpsample/blob/master/src/psample.c#L290
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 5/5] net:openvswitch: add psample support
  2024-04-09 21:49             ` Jakub Kicinski
@ 2024-04-10 13:44               ` Adrian Moreno
  0 siblings, 0 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-10 13:44 UTC (permalink / raw)
  To: Jakub Kicinski, Ilya Maximets
  Cc: netdev, jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro,
	horms



On 4/9/24 23:49, Jakub Kicinski wrote:
> On Tue, 9 Apr 2024 11:35:04 +0200 Ilya Maximets wrote:
>>> If we try to implement all our actions following this way, and we keep just
>>> copying the incoming actions into the internal representation, we incur in
>>> unnecessary memory overhead (e.g: storing 2x struct nlaattr + padding of extra
>>> memory to store 2 integers).
>>>
>>> I don't want to derail the discussion into historical or futuristic changes,
>>> just saying that the approach taken in the SAMPLE action (not including this
>>> patch) of exposing arguments as attributes but having a kernel-only struct to
>>> store them seems to me a good compromise.
>>
>> Sure.  As I said, it's fine to have internal structures.  My comment
>> was mainly about uAPI part.  We should avoid structures in uAPI if
>> possible, as they are very hard to maintain and keep compatible with
>> older userspace in case some changes will be needed in the future.
> 
> FWIW there are some YAML specs for ovs under
> Documentation/netlink/specs/ovs_*
> perhaps they should also be updated?
> 

Sure, I will update them in the next version.
Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample
  2024-04-08 13:20   ` Ilya Maximets
@ 2024-04-11  8:40     ` Adrian Moreno
  0 siblings, 0 replies; 26+ messages in thread
From: Adrian Moreno @ 2024-04-11  8:40 UTC (permalink / raw)
  To: Ilya Maximets, netdev
  Cc: jiri, xiyou.wangcong, cmi, yotam.gi, aconole, echaudro, horms



On 4/8/24 15:20, Ilya Maximets wrote:
> [copying my previous reply since this version actually has netdev@ in Cc]
> 
> On 4/8/24 14:57, Adrian Moreno wrote:
>> If the action has a user_cookie, pass it along to the sample so it can
>> be easily identified.
>>
>> Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
>> ---
>>   net/sched/act_sample.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
> 
> Nit: some spaces in the subject would be nice.
> 
>>
>> diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
>> index a69b53d54039..5c3f86ec964a 100644
>> --- a/net/sched/act_sample.c
>> +++ b/net/sched/act_sample.c
>> @@ -165,9 +165,11 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
>>   				     const struct tc_action *a,
>>   				     struct tcf_result *res)
>>   {
>> +	u8 cookie_data[TC_COOKIE_MAX_SIZE] = {};
>>   	struct tcf_sample *s = to_sample(a);
>>   	struct psample_group *psample_group;
>>   	struct psample_metadata md = {};
>> +	struct tc_cookie *user_cookie;
>>   	int retval;
>>   
>>   	tcf_lastuse_update(&s->tcf_tm);
>> @@ -189,6 +191,16 @@ TC_INDIRECT_SCOPE int tcf_sample_act(struct sk_buff *skb,
>>   		if (skb_at_tc_ingress(skb) && tcf_sample_dev_ok_push(skb->dev))
>>   			skb_push(skb, skb->mac_len);
>>   
>> +		rcu_read_lock();
>> +		user_cookie = rcu_dereference(a->user_cookie);
>> +		if (user_cookie) {
>> +			memcpy(cookie_data, user_cookie->data,
>> +			       user_cookie->len);
>> +			md.user_cookie = cookie_data;
>> +			md.user_cookie_len = user_cookie->len;
>> +		}
>> +		rcu_read_unlock();
> 
> Not sure what is better - extending rcu critical section
> beyond psample_sample_packet() or copying the cookie...
> What do you think?

Ha! I also scratched my head on this one.

I tried to follow all the possible paths and, as far as I could tell, there is 
no explicit sleeping involved. However, there is a call to yield() towards the 
end of netlink_broadcast_filtered() that made me think it's safer to copy the 
cookie which, in this particular case, is fairly limited in size.

Have I missed something?

> 
>> +
>>   		md.trunc_size = s->truncate ? s->trunc_size : skb->len;
>>   		psample_sample_packet(psample_group, skb, s->rate, &md);
>>   
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-04-11  8:40 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-08 12:57 [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Adrian Moreno
2024-04-08 12:57 ` [RFC net-next v2 1/5] net: netlink: export genl private pointer getters Adrian Moreno
2024-04-09 21:48   ` Jakub Kicinski
2024-04-08 12:57 ` [RFC net-next v2 2/5] net: psample: add multicast filtering on group_id Adrian Moreno
2024-04-08 13:18   ` Ilya Maximets
2024-04-08 19:24     ` Adrian Moreno
2024-04-09 14:43       ` Aaron Conole
2024-04-10 13:32         ` Adrian Moreno
2024-04-10 13:06   ` Ido Schimmel
2024-04-10 13:42     ` Adrian Moreno
2024-04-08 12:57 ` [RFC net-next v2 3/5] net: psample: add user cookie Adrian Moreno
2024-04-08 13:19   ` Ilya Maximets
2024-04-08 19:28     ` Adrian Moreno
2024-04-08 20:28       ` Ilya Maximets
2024-04-08 12:57 ` [RFC net-next v2 4/5] net:sched:act_sample: add action cookie to sample Adrian Moreno
2024-04-08 13:20   ` Ilya Maximets
2024-04-11  8:40     ` Adrian Moreno
2024-04-08 12:57 ` [RFC net-next v2 5/5] net:openvswitch: add psample support Adrian Moreno
2024-04-08 13:37   ` Ilya Maximets
2024-04-08 19:48     ` Adrian Moreno
2024-04-08 20:40       ` Ilya Maximets
2024-04-09  8:16         ` Adrian Moreno
2024-04-09  9:35           ` Ilya Maximets
2024-04-09 21:49             ` Jakub Kicinski
2024-04-10 13:44               ` Adrian Moreno
2024-04-08 13:16 ` [RFC net-next v2 0/5] net: openvswitch: Add sample multicasting Ilya Maximets

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).