Netdev List
* [PATCH net-next v5 09/11] net: sched: use reference counting action init
From: Vlad Buslov @ 2018-06-01 15:15 UTC
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, vladbu, keescook, marcelo.leitner, kliteyn, Jiri Pirko
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

Change the action API to assume that an action init function always takes
a reference to the action, even when overwriting an existing action. This
is necessary because the action API continues to use the action pointer
after the init function is done. At that point the action becomes
accessible for concurrent modification, so the caller must always hold a
reference to it.

Implement a helper, tcf_action_put_lst(), that puts every action on a
list once the action API init code is done using them.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
Changes from V1 to V2:
- Resplit action lookup/release code to prevent memory leaks in
  individual patches.
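
The reference rule described in the commit message can be sketched in
userspace as below. This is only an illustration under stated assumptions:
struct action, action_take_ref(), action_put() and action_put_list() are
hypothetical stand-ins, a plain int replaces refcount_t, and a singly
linked list replaces list_head. It is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tc_action: a count and a list link. */
struct action {
	int refcnt;
	struct action *next;	/* plain singly linked list, not list_head */
};

/* Models the new init rule: a reference is always taken, whether the
 * action was just created or already existed and is being overwritten.
 */
static void action_take_ref(struct action *a)
{
	a->refcnt++;
}

/* Drop one reference; return 1 when the last reference went away. */
static int action_put(struct action *a)
{
	assert(a->refcnt > 0);
	return --a->refcnt == 0;
}

/* Models the put-list helper: put every action on the list and return
 * how many of them dropped to zero references.
 */
static int action_put_list(struct action *head)
{
	struct action *a, *tmp;
	int released = 0;

	for (a = head; a; a = tmp) {
		tmp = a->next;	/* read the link before the put */
		released += action_put(a);
	}
	return released;
}
```

In the overwrite case each existing action already holds its base
reference, so the extra reference taken during init is the one that the
list helper gives back after the notification has been sent.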

 net/sched/act_api.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f019f0464cec..eefe8c2fe667 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -627,6 +627,18 @@ static int tcf_action_put(struct tc_action *p)
 	return __tcf_action_put(p, false);
 }
 
+static void tcf_action_put_lst(struct list_head *actions)
+{
+	struct tc_action *a, *tmp;
+
+	list_for_each_entry_safe(a, tmp, actions, list) {
+		const struct tc_action_ops *ops = a->ops;
+
+		if (tcf_action_put(a))
+			module_put(ops->owner);
+	}
+}
+
 int
 tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
 {
@@ -835,17 +847,6 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	return ERR_PTR(err);
 }
 
-static void cleanup_a(struct list_head *actions, int ovr)
-{
-	struct tc_action *a;
-
-	if (!ovr)
-		return;
-
-	list_for_each_entry(a, actions, list)
-		refcount_dec(&a->tcfa_refcnt);
-}
-
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
 		    struct list_head *actions, size_t *attr_size,
@@ -874,11 +875,6 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 	}
 
 	*attr_size = tcf_action_full_attrs_size(sz);
-
-	/* Remove the temp refcnt which was necessary to protect against
-	 * destroying an existing action which was being replaced
-	 */
-	cleanup_a(actions, ovr);
 	return 0;
 
 err:
@@ -1209,7 +1205,7 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 		return ret;
 	}
 err:
-	tcf_action_destroy(&actions, 0);
+	tcf_action_put_lst(&actions);
 	return ret;
 }
 
@@ -1251,8 +1247,11 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 			      &attr_size, true, extack);
 	if (ret)
 		return ret;
+	ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+	if (ovr)
+		tcf_action_put_lst(&actions);
 
-	return tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+	return ret;
 }
 
 static u32 tcaa_root_flags_allowed = TCA_FLAG_LARGE_DUMP_ON;
-- 
2.7.5


* [PATCH net-next v5 10/11] net: sched: atomically check-allocate action
From: Vlad Buslov @ 2018-06-01 15:15 UTC
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, vladbu, keescook, marcelo.leitner, kliteyn, Jiri Pirko
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

Implement a function that atomically checks whether an action exists and
either takes a reference to it or allocates an idr slot for the action
index, to prevent concurrent allocation of actions with the same index.
Use an EBUSY error pointer to indicate that the idr slot is reserved.

Implement a cleanup helper that removes the temporary error pointer from
the idr, for the case where an error occurs between idr allocation and
insertion of the newly created action at the specified index.

Refactor all action init functions to insert new actions into the idr
using this API.

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
Changes from V1 to V2:
- Remove unique idr insertion function. Change original idr insert to do
  the same thing.
- Refactor action check-alloc code into standalone function.
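
The check-or-reserve protocol can be modelled in userspace roughly as
follows. This is a sketch under stated assumptions, not the kernel code: a
fixed array replaces the idr, the address of a static object replaces
ERR_PTR(-EBUSY), and the names check_alloc(), slot_insert(), slot_cleanup()
and NSLOTS are hypothetical. The kernel version holds idrinfo->lock and
retries when it finds a reserved slot, whereas this single-threaded model
simply reports -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NSLOTS 8	/* hypothetical table size; index 0 means "pick one" */

struct action { int refcnt; };

/* Sentinel playing the role of ERR_PTR(-EBUSY): reserved, not ready. */
static struct action busy_marker;
#define SLOT_BUSY (&busy_marker)

static struct action *slots[NSLOTS];

/* Returns 1 and takes a reference if *index holds an action, 0 after
 * reserving a slot (writing the chosen index back), or a negative errno.
 */
static int check_alloc(unsigned int *index, struct action **a)
{
	unsigned int i;

	*a = NULL;
	if (*index) {
		if (*index >= NSLOTS)
			return -EINVAL;	/* assumption of this sketch */
		if (slots[*index] == SLOT_BUSY)
			return -EBUSY;	/* the kernel code retries here */
		if (slots[*index]) {
			*a = slots[*index];
			(*a)->refcnt++;
			return 1;
		}
		slots[*index] = SLOT_BUSY;
		return 0;
	}
	for (i = 1; i < NSLOTS; i++) {
		if (!slots[i]) {
			slots[i] = SLOT_BUSY;
			*index = i;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Publish a fully built action into its reserved slot. */
static void slot_insert(unsigned int index, struct action *a)
{
	assert(slots[index] == SLOT_BUSY);
	slots[index] = a;
}

/* Give the reservation back when creation fails part-way. */
static void slot_cleanup(unsigned int index)
{
	assert(slots[index] == SLOT_BUSY);
	slots[index] = NULL;
}
```

A caller follows the same three-way branch as the converted init
functions: on 0 it builds the action and either inserts it or cleans up
the reservation, on a positive value it works with the existing action,
and on a negative value it returns the error.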

 include/net/act_api.h      |  3 ++
 net/sched/act_api.c        | 92 ++++++++++++++++++++++++++++++++++++----------
 net/sched/act_bpf.c        | 11 ++++--
 net/sched/act_connmark.c   | 10 +++--
 net/sched/act_csum.c       | 11 ++++--
 net/sched/act_gact.c       | 11 ++++--
 net/sched/act_ife.c        |  6 ++-
 net/sched/act_ipt.c        | 13 ++++++-
 net/sched/act_mirred.c     | 16 ++++++--
 net/sched/act_nat.c        | 11 ++++--
 net/sched/act_pedit.c      | 15 ++++++--
 net/sched/act_police.c     |  9 ++++-
 net/sched/act_sample.c     | 11 ++++--
 net/sched/act_simple.c     | 11 +++++-
 net/sched/act_skbedit.c    | 11 +++++-
 net/sched/act_skbmod.c     | 11 +++++-
 net/sched/act_tunnel_key.c |  9 ++++-
 net/sched/act_vlan.c       | 17 ++++++++-
 18 files changed, 218 insertions(+), 60 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d256e20507b9..cd4547476074 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -154,6 +154,9 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 		   int bind, bool cpustats);
 void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
 
+void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
+int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
+			struct tc_action **a, int bind);
 int tcf_idr_delete_index(struct tc_action_net *tn, u32 index);
 int __tcf_idr_release(struct tc_action *a, bool bind, bool strict);
 
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index eefe8c2fe667..9511502e1cbb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -303,7 +303,9 @@ static bool __tcf_idr_check(struct tc_action_net *tn, u32 index,
 
 	spin_lock(&idrinfo->lock);
 	p = idr_find(&idrinfo->action_idr, index);
-	if (p) {
+	if (IS_ERR(p)) {
+		p = NULL;
+	} else if (p) {
 		refcount_inc(&p->tcfa_refcnt);
 		if (bind)
 			atomic_inc(&p->tcfa_bindcnt);
@@ -371,7 +373,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 {
 	struct tc_action *p = kzalloc(ops->size, GFP_KERNEL);
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
-	struct idr *idr = &idrinfo->action_idr;
 	int err = -ENOMEM;
 
 	if (unlikely(!p))
@@ -389,20 +390,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 			goto err2;
 	}
 	spin_lock_init(&p->tcfa_lock);
-	idr_preload(GFP_KERNEL);
-	spin_lock(&idrinfo->lock);
-	/* user doesn't specify an index */
-	if (!index) {
-		index = 1;
-		err = idr_alloc_u32(idr, NULL, &index, UINT_MAX, GFP_ATOMIC);
-	} else {
-		err = idr_alloc_u32(idr, NULL, &index, index, GFP_ATOMIC);
-	}
-	spin_unlock(&idrinfo->lock);
-	idr_preload_end();
-	if (err)
-		goto err3;
-
 	p->tcfa_index = index;
 	p->tcfa_tm.install = jiffies;
 	p->tcfa_tm.lastuse = jiffies;
@@ -412,7 +399,7 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 					&p->tcfa_rate_est,
 					&p->tcfa_lock, NULL, est);
 		if (err)
-			goto err4;
+			goto err3;
 	}
 
 	p->idrinfo = idrinfo;
@@ -420,8 +407,6 @@ int tcf_idr_create(struct tc_action_net *tn, u32 index, struct nlattr *est,
 	INIT_LIST_HEAD(&p->list);
 	*a = p;
 	return 0;
-err4:
-	idr_remove(idr, index);
 err3:
 	free_percpu(p->cpu_qstats);
 err2:
@@ -437,11 +422,78 @@ void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a)
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 
 	spin_lock(&idrinfo->lock);
-	idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+	/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
+	WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
 	spin_unlock(&idrinfo->lock);
 }
 EXPORT_SYMBOL(tcf_idr_insert);
 
+/* Cleanup idr index that was allocated but not initialized. */
+
+void tcf_idr_cleanup(struct tc_action_net *tn, u32 index)
+{
+	struct tcf_idrinfo *idrinfo = tn->idrinfo;
+
+	spin_lock(&idrinfo->lock);
+	/* Remove ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
+	WARN_ON(!IS_ERR(idr_remove(&idrinfo->action_idr, index)));
+	spin_unlock(&idrinfo->lock);
+}
+EXPORT_SYMBOL(tcf_idr_cleanup);
+
+/* Check if action with specified index exists. If action is found, increment
+ * its reference and bind counters, and return 1. Otherwise insert temporary
+ * error pointer (to prevent concurrent users from inserting actions with same
+ * index) and return 0.
+ */
+
+int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
+			struct tc_action **a, int bind)
+{
+	struct tcf_idrinfo *idrinfo = tn->idrinfo;
+	struct tc_action *p;
+	int ret;
+
+again:
+	spin_lock(&idrinfo->lock);
+	if (*index) {
+		p = idr_find(&idrinfo->action_idr, *index);
+		if (IS_ERR(p)) {
+			/* This means that another process allocated
+			 * index but did not assign the pointer yet.
+			 */
+			spin_unlock(&idrinfo->lock);
+			goto again;
+		}
+
+		if (p) {
+			refcount_inc(&p->tcfa_refcnt);
+			if (bind)
+				atomic_inc(&p->tcfa_bindcnt);
+			*a = p;
+			ret = 1;
+		} else {
+			*a = NULL;
+			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
+					    *index, GFP_ATOMIC);
+			if (!ret)
+				idr_replace(&idrinfo->action_idr,
+					    ERR_PTR(-EBUSY), *index);
+		}
+	} else {
+		*index = 1;
+		*a = NULL;
+		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
+				    UINT_MAX, GFP_ATOMIC);
+		if (!ret)
+			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
+				    *index);
+	}
+	spin_unlock(&idrinfo->lock);
+	return ret;
+}
+EXPORT_SYMBOL(tcf_idr_check_alloc);
+
 void tcf_idrinfo_destroy(const struct tc_action_ops *ops,
 			 struct tcf_idrinfo *idrinfo)
 {
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index d3f4ac6f2c4b..06f743d8ed41 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -299,14 +299,17 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 
 	parm = nla_data(tb[TCA_ACT_BPF_PARMS]);
 
-	if (!tcf_idr_check(tn, parm->index, act, bind)) {
+	ret = tcf_idr_check_alloc(tn, &parm->index, act, bind);
+	if (!ret) {
 		ret = tcf_idr_create(tn, parm->index, est, act,
 				     &act_bpf_ops, bind, true);
-		if (ret < 0)
+		if (ret < 0) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		res = ACT_P_CREATED;
-	} else {
+	} else if (ret > 0) {
 		/* Don't override defaults. */
 		if (bind)
 			return 0;
@@ -315,6 +318,8 @@ static int tcf_bpf_init(struct net *net, struct nlattr *nla,
 			tcf_idr_release(*act, bind);
 			return -EEXIST;
 		}
+	} else {
+		return ret;
 	}
 
 	is_bpf = tb[TCA_ACT_BPF_OPS_LEN] && tb[TCA_ACT_BPF_OPS];
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 701e90244eff..1e31f0e448e2 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -118,11 +118,14 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 
 	parm = nla_data(tb[TCA_CONNMARK_PARMS]);
 
-	if (!tcf_idr_check(tn, parm->index, a, bind)) {
+	ret = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (!ret) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_connmark_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		ci = to_connmark(*a);
 		ci->tcf_action = parm->action;
@@ -131,7 +134,7 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 
 		tcf_idr_insert(tn, *a);
 		ret = ACT_P_CREATED;
-	} else {
+	} else if (ret > 0) {
 		ci = to_connmark(*a);
 		if (bind)
 			return 0;
@@ -142,6 +145,7 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla,
 		/* replacing action and zone */
 		ci->tcf_action = parm->action;
 		ci->zone = parm->zone;
+		ret = 0;
 	}
 
 	return ret;
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 5dbee136b0a1..bd232d3bd022 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -67,19 +67,24 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 	parm = nla_data(tb[TCA_CSUM_PARMS]);
 
-	if (!tcf_idr_check(tn, parm->index, a, bind)) {
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (!err) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_csum_ops, bind, true);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
-	} else {
+	} else if (err > 0) {
 		if (bind)/* dont override defaults */
 			return 0;
 		if (!ovr) {
 			tcf_idr_release(*a, bind);
 			return -EEXIST;
 		}
+	} else {
+		return err;
 	}
 
 	p = to_tcf_csum(*a);
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index 11c4de3f344e..661b72b9147d 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -91,19 +91,24 @@ static int tcf_gact_init(struct net *net, struct nlattr *nla,
 	}
 #endif
 
-	if (!tcf_idr_check(tn, parm->index, a, bind)) {
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (!err) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_gact_ops, bind, true);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
-	} else {
+	} else if (err > 0) {
 		if (bind)/* dont override defaults */
 			return 0;
 		if (!ovr) {
 			tcf_idr_release(*a, bind);
 			return -EEXIST;
 		}
+	} else {
+		return err;
 	}
 
 	gact = to_gact(*a);
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 3dd3d79c5a4b..5bf0e79796c0 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -483,7 +483,10 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
 	if (!p)
 		return -ENOMEM;
 
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind) {
 		kfree(p);
 		return 0;
@@ -493,6 +496,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
 		ret = tcf_idr_create(tn, parm->index, est, a, &act_ife_ops,
 				     bind, true);
 		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			kfree(p);
 			return ret;
 		}
diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 85e85dfba401..0dc787a57798 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -119,13 +119,18 @@ static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 	if (tb[TCA_IPT_INDEX] != NULL)
 		index = nla_get_u32(tb[TCA_IPT_INDEX]);
 
-	exists = tcf_idr_check(tn, index, a, bind);
+	err = tcf_idr_check_alloc(tn, &index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (tb[TCA_IPT_HOOK] == NULL || tb[TCA_IPT_TARG] == NULL) {
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, index);
 		return -EINVAL;
 	}
 
@@ -133,14 +138,18 @@ static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
 	if (nla_len(tb[TCA_IPT_TARG]) < td->u.target_size) {
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, index);
 		return -EINVAL;
 	}
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, index, est, a, ops, bind,
 				     false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
 	} else {
 		if (bind)/* dont override defaults */
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index e08aed06d7f8..6afd89a36c69 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -79,7 +79,7 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	struct tcf_mirred *m;
 	struct net_device *dev;
 	bool exists = false;
-	int ret;
+	int ret, err;
 
 	if (!nla) {
 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
@@ -94,7 +94,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	}
 	parm = nla_data(tb[TCA_MIRRED_PARMS]);
 
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
@@ -107,6 +110,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 	default:
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, parm->index);
 		NL_SET_ERR_MSG_MOD(extack, "Unknown mirred option");
 		return -EINVAL;
 	}
@@ -115,6 +120,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		if (dev == NULL) {
 			if (exists)
 				tcf_idr_release(*a, bind);
+			else
+				tcf_idr_cleanup(tn, parm->index);
 			return -ENODEV;
 		}
 		mac_header_xmit = dev_is_mac_header_xmit(dev);
@@ -124,13 +131,16 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 
 	if (!exists) {
 		if (!dev) {
+			tcf_idr_cleanup(tn, parm->index);
 			NL_SET_ERR_MSG_MOD(extack, "Specified device does not exist");
 			return -EINVAL;
 		}
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_mirred_ops, bind, true);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
 		tcf_idr_release(*a, bind);
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
index 1f91e8e66c0f..4dd9188a72fd 100644
--- a/net/sched/act_nat.c
+++ b/net/sched/act_nat.c
@@ -57,19 +57,24 @@ static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est,
 		return -EINVAL;
 	parm = nla_data(tb[TCA_NAT_PARMS]);
 
-	if (!tcf_idr_check(tn, parm->index, a, bind)) {
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (!err) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_nat_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
-	} else {
+	} else if (err > 0) {
 		if (bind)
 			return 0;
 		if (!ovr) {
 			tcf_idr_release(*a, bind);
 			return -EEXIST;
 		}
+	} else {
+		return err;
 	}
 	p = to_tcf_nat(*a);
 
diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index fbf283f2ac34..2bd1d3f61488 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -167,13 +167,18 @@ static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 	if (IS_ERR(keys_ex))
 		return PTR_ERR(keys_ex);
 
-	if (!tcf_idr_check(tn, parm->index, a, bind)) {
-		if (!parm->nkeys)
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (!err) {
+		if (!parm->nkeys) {
+			tcf_idr_cleanup(tn, parm->index);
 			return -EINVAL;
+		}
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_pedit_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		p = to_pedit(*a);
 		keys = kmalloc(ksize, GFP_KERNEL);
 		if (keys == NULL) {
@@ -182,7 +187,7 @@ static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 			return -ENOMEM;
 		}
 		ret = ACT_P_CREATED;
-	} else {
+	} else if (err > 0) {
 		if (bind)
 			return 0;
 		if (!ovr) {
@@ -197,6 +202,8 @@ static int tcf_pedit_init(struct net *net, struct nlattr *nla,
 				return -ENOMEM;
 			}
 		}
+	} else {
+		return err;
 	}
 
 	spin_lock_bh(&p->tcf_lock);
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 99335cca739e..1f3192ea8df7 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -101,15 +101,20 @@ static int tcf_act_police_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 
 	parm = nla_data(tb[TCA_POLICE_TBF]);
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, NULL, a,
 				     &act_police_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
 		tcf_idr_release(*a, bind);
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index a8582e1347db..3079e7be5bde 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -46,7 +46,7 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
 	struct tc_sample *parm;
 	struct tcf_sample *s;
 	bool exists = false;
-	int ret;
+	int ret, err;
 
 	if (!nla)
 		return -EINVAL;
@@ -59,15 +59,20 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
 
 	parm = nla_data(tb[TCA_SAMPLE_PARMS]);
 
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_sample_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
 		tcf_idr_release(*a, bind);
diff --git a/net/sched/act_simple.c b/net/sched/act_simple.c
index 78fffd329ed9..2cc874f791df 100644
--- a/net/sched/act_simple.c
+++ b/net/sched/act_simple.c
@@ -101,13 +101,18 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 
 	parm = nla_data(tb[TCA_DEF_PARMS]);
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (tb[TCA_DEF_DATA] == NULL) {
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, parm->index);
 		return -EINVAL;
 	}
 
@@ -116,8 +121,10 @@ static int tcf_simp_init(struct net *net, struct nlattr *nla,
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_simp_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		d = to_defact(*a);
 		ret = alloc_defdata(d, defdata);
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index c0607d1319eb..29a15172a99d 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -117,21 +117,28 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
 
 	parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
 
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (!flags) {
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, parm->index);
 		return -EINVAL;
 	}
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_skbedit_ops, bind, false);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		d = to_skbedit(*a);
 		ret = ACT_P_CREATED;
diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index e844381af066..cdc6bacfb190 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -128,21 +128,28 @@ static int tcf_skbmod_init(struct net *net, struct nlattr *nla,
 	if (parm->flags & SKBMOD_F_SWAPMAC)
 		lflags = SKBMOD_F_SWAPMAC;
 
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
 	if (!lflags) {
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, parm->index);
 		return -EINVAL;
 	}
 
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_skbmod_ops, bind, true);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index bd53f39a345b..4679b620af12 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -99,7 +99,10 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 
 	parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
@@ -162,7 +165,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_tunnel_key_ops, bind, true);
 		if (ret)
-			return ret;
+			goto err_out;
 
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
@@ -198,6 +201,8 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 err_out:
 	if (exists)
 		tcf_idr_release(*a, bind);
+	else
+		tcf_idr_cleanup(tn, parm->index);
 	return ret;
 }
 
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 9b600faaccbb..ad37f308175a 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -134,7 +134,10 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 	if (!tb[TCA_VLAN_PARMS])
 		return -EINVAL;
 	parm = nla_data(tb[TCA_VLAN_PARMS]);
-	exists = tcf_idr_check(tn, parm->index, a, bind);
+	err = tcf_idr_check_alloc(tn, &parm->index, a, bind);
+	if (err < 0)
+		return err;
+	exists = err;
 	if (exists && bind)
 		return 0;
 
@@ -146,12 +149,16 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 		if (!tb[TCA_VLAN_PUSH_VLAN_ID]) {
 			if (exists)
 				tcf_idr_release(*a, bind);
+			else
+				tcf_idr_cleanup(tn, parm->index);
 			return -EINVAL;
 		}
 		push_vid = nla_get_u16(tb[TCA_VLAN_PUSH_VLAN_ID]);
 		if (push_vid >= VLAN_VID_MASK) {
 			if (exists)
 				tcf_idr_release(*a, bind);
+			else
+				tcf_idr_cleanup(tn, parm->index);
 			return -ERANGE;
 		}
 
@@ -164,6 +171,8 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 			default:
 				if (exists)
 					tcf_idr_release(*a, bind);
+				else
+					tcf_idr_cleanup(tn, parm->index);
 				return -EPROTONOSUPPORT;
 			}
 		} else {
@@ -176,6 +185,8 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 	default:
 		if (exists)
 			tcf_idr_release(*a, bind);
+		else
+			tcf_idr_cleanup(tn, parm->index);
 		return -EINVAL;
 	}
 	action = parm->v_action;
@@ -183,8 +194,10 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_vlan_ops, bind, true);
-		if (ret)
+		if (ret) {
+			tcf_idr_cleanup(tn, parm->index);
 			return ret;
+		}
 
 		ret = ACT_P_CREATED;
 	} else if (!ovr) {
-- 
2.7.5


* [PATCH net-next v5 11/11] net: sched: change action API to use array of pointers to actions
From: Vlad Buslov @ 2018-06-01 15:15 UTC
  To: netdev
  Cc: davem, jhs, xiyou.wangcong, jiri, pablo, kadlec, fw, ast, daniel,
	edumazet, vladbu, keescook, marcelo.leitner, kliteyn
In-Reply-To: <1527866118-21312-1-git-send-email-vladbu@mellanox.com>

The act API used a linked list to pass a set of actions to functions. It
is an intrusive data structure that stores list nodes inside the action
structure itself, which means it is not safe to modify such a list
concurrently. However, the action API doesn't use any linked-list-specific
operations on this set of actions, so it can be safely refactored into a
plain pointer array.

Refactor the action API to use an array of pointers to tc_actions instead
of a linked list. Change the type of the 'actions' argument of the
exported action init, destroy and dump functions.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
Changes from V4 to V5:
- Change action delete API to track actions that were deleted, to
  prevent releasing them on error.

Changes from V3 to V4:
- Reduce actions array size in tcf_action_init_1.
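
The array convention used by the converted loops (stop at the first NULL
entry or at the capacity limit, whichever comes first) can be illustrated
with the small userspace sketch below. MAX_PRIO, struct action and
put_many() are hypothetical stand-ins for TCA_ACT_MAX_PRIO, struct
tc_action and the put helper, and the plain decrement replaces the real
reference-count handling.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIO 4	/* stand-in for TCA_ACT_MAX_PRIO */

struct action { int refcnt; };

/* Walk the array until the first NULL entry or the capacity limit,
 * dropping one reference per action. Returns the number of entries
 * visited.
 */
static int put_many(struct action *actions[])
{
	int i;

	for (i = 0; i < MAX_PRIO && actions[i]; i++)
		actions[i]->refcnt--;
	return i;
}
```

Because the bound is checked before the NULL test, a completely full
array needs no terminating entry, and a partially filled one only needs
its unused tail zeroed.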

 include/net/act_api.h |  7 ++--
 net/sched/act_api.c   | 88 ++++++++++++++++++++++++++++-----------------------
 net/sched/cls_api.c   | 21 ++++--------
 3 files changed, 59 insertions(+), 57 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index cd4547476074..43dfa5e1b3b3 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -168,19 +168,20 @@ static inline int tcf_idr_release(struct tc_action *a, bool bind)
 int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
 int tcf_unregister_action(struct tc_action_ops *a,
 			  struct pernet_operations *ops);
-int tcf_action_destroy(struct list_head *actions, int bind);
+int tcf_action_destroy(struct tc_action *actions[], int bind);
 int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
 		    int nr_actions, struct tcf_result *res);
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
-		    struct list_head *actions, size_t *attr_size,
+		    struct tc_action *actions[], size_t *attr_size,
 		    bool rtnl_held, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    struct nlattr *nla, struct nlattr *est,
 				    char *name, int ovr, int bind,
 				    bool rtnl_held,
 				    struct netlink_ext_ack *extack);
-int tcf_action_dump(struct sk_buff *skb, struct list_head *, int, int);
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
+		    int ref);
 int tcf_action_dump_old(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int, int);
 int tcf_action_copy_stats(struct sk_buff *, struct tc_action *, int);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9511502e1cbb..3874fcbdecd6 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -657,13 +657,14 @@ int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
 }
 EXPORT_SYMBOL(tcf_action_exec);
 
-int tcf_action_destroy(struct list_head *actions, int bind)
+int tcf_action_destroy(struct tc_action *actions[], int bind)
 {
 	const struct tc_action_ops *ops;
-	struct tc_action *a, *tmp;
-	int ret = 0;
+	struct tc_action *a;
+	int ret = 0, i;
 
-	list_for_each_entry_safe(a, tmp, actions, list) {
+	for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+		a = actions[i];
 		ops = a->ops;
 		ret = __tcf_idr_release(a, bind, true);
 		if (ret == ACT_P_DELETED)
@@ -679,11 +680,12 @@ static int tcf_action_put(struct tc_action *p)
 	return __tcf_action_put(p, false);
 }
 
-static void tcf_action_put_lst(struct list_head *actions)
+static void tcf_action_put_many(struct tc_action *actions[])
 {
-	struct tc_action *a, *tmp;
+	int i;
 
-	list_for_each_entry_safe(a, tmp, actions, list) {
+	for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+		struct tc_action *a = actions[i];
 		const struct tc_action_ops *ops = a->ops;
 
 		if (tcf_action_put(a))
@@ -735,14 +737,15 @@ tcf_action_dump_1(struct sk_buff *skb, struct tc_action *a, int bind, int ref)
 }
 EXPORT_SYMBOL(tcf_action_dump_1);
 
-int tcf_action_dump(struct sk_buff *skb, struct list_head *actions,
+int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[],
 		    int bind, int ref)
 {
 	struct tc_action *a;
-	int err = -EINVAL;
+	int err = -EINVAL, i;
 	struct nlattr *nest;
 
-	list_for_each_entry(a, actions, list) {
+	for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+		a = actions[i];
 		nest = nla_nest_start(skb, a->order);
 		if (nest == NULL)
 			goto nla_put_failure;
@@ -878,10 +881,9 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
 		err = tcf_action_goto_chain_init(a, tp);
 		if (err) {
-			LIST_HEAD(actions);
+			struct tc_action *actions[] = { a, NULL };
 
-			list_add_tail(&a->list, &actions);
-			tcf_action_destroy(&actions, bind);
+			tcf_action_destroy(actions, bind);
 			NL_SET_ERR_MSG(extack, "Failed to init TC action chain");
 			return ERR_PTR(err);
 		}
@@ -899,9 +901,11 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	return ERR_PTR(err);
 }
 
+/* Returns the number of initialized actions or a negative error. */
+
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est, char *name, int ovr, int bind,
-		    struct list_head *actions, size_t *attr_size,
+		    struct tc_action *actions[], size_t *attr_size,
 		    bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
@@ -923,11 +927,12 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		}
 		act->order = i;
 		sz += tcf_action_fill_size(act);
-		list_add_tail(&act->list, actions);
+		/* Start from index 0 */
+		actions[i - 1] = act;
 	}
 
 	*attr_size = tcf_action_full_attrs_size(sz);
-	return 0;
+	return i - 1;
 
 err:
 	tcf_action_destroy(actions, bind);
@@ -978,7 +983,7 @@ int tcf_action_copy_stats(struct sk_buff *skb, struct tc_action *p,
 	return -1;
 }
 
-static int tca_get_fill(struct sk_buff *skb, struct list_head *actions,
+static int tca_get_fill(struct sk_buff *skb, struct tc_action *actions[],
 			u32 portid, u32 seq, u16 flags, int event, int bind,
 			int ref)
 {
@@ -1014,7 +1019,7 @@ static int tca_get_fill(struct sk_buff *skb, struct list_head *actions,
 
 static int
 tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr *n,
-	       struct list_head *actions, int event,
+	       struct tc_action *actions[], int event,
 	       struct netlink_ext_ack *extack)
 {
 	struct sk_buff *skb;
@@ -1150,14 +1155,14 @@ static int tca_action_flush(struct net *net, struct nlattr *nla,
 	return err;
 }
 
-static int tcf_action_delete(struct net *net, struct list_head *actions,
-			     struct netlink_ext_ack *extack)
+static int tcf_action_delete(struct net *net, struct tc_action *actions[],
+			     int *acts_deleted, struct netlink_ext_ack *extack)
 {
-	struct tc_action *a, *tmp;
 	u32 act_index;
-	int ret;
+	int ret, i;
 
-	list_for_each_entry_safe(a, tmp, actions, list) {
+	for (i = 0; i < TCA_ACT_MAX_PRIO && actions[i]; i++) {
+		struct tc_action *a = actions[i];
 		const struct tc_action_ops *ops = a->ops;
 
 		/* Actions can be deleted concurrently so we must save their
@@ -1165,23 +1170,26 @@ static int tcf_action_delete(struct net *net, struct list_head *actions,
 		 */
 		act_index = a->tcfa_index;
 
-		list_del(&a->list);
 		if (tcf_action_put(a)) {
 			/* last reference, action was deleted concurrently */
 			module_put(ops->owner);
 		} else  {
 			/* now do the delete */
 			ret = ops->delete(net, act_index);
-			if (ret < 0)
+			if (ret < 0) {
+				*acts_deleted = i + 1;
 				return ret;
+			}
 		}
 	}
+	*acts_deleted = i;
 	return 0;
 }
 
 static int
-tcf_del_notify(struct net *net, struct nlmsghdr *n, struct list_head *actions,
-	       u32 portid, size_t attr_size, struct netlink_ext_ack *extack)
+tcf_del_notify(struct net *net, struct nlmsghdr *n, struct tc_action *actions[],
+	       int *acts_deleted, u32 portid, size_t attr_size,
+	       struct netlink_ext_ack *extack)
 {
 	int ret;
 	struct sk_buff *skb;
@@ -1199,7 +1207,7 @@ tcf_del_notify(struct net *net, struct nlmsghdr *n, struct list_head *actions,
 	}
 
 	/* now do the delete */
-	ret = tcf_action_delete(net, actions, extack);
+	ret = tcf_action_delete(net, actions, acts_deleted, extack);
 	if (ret < 0) {
 		NL_SET_ERR_MSG(extack, "Failed to delete TC action");
 		kfree_skb(skb);
@@ -1221,7 +1229,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1];
 	struct tc_action *act;
 	size_t attr_size = 0;
-	LIST_HEAD(actions);
+	struct tc_action *actions[TCA_ACT_MAX_PRIO + 1] = {};
+	int acts_deleted = 0;
 
 	ret = nla_parse_nested(tb, TCA_ACT_MAX_PRIO, nla, NULL, extack);
 	if (ret < 0)
@@ -1243,26 +1252,27 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 		}
 		act->order = i;
 		attr_size += tcf_action_fill_size(act);
-		list_add_tail(&act->list, &actions);
+		actions[i - 1] = act;
 	}
 
 	attr_size = tcf_action_full_attrs_size(attr_size);
 
 	if (event == RTM_GETACTION)
-		ret = tcf_get_notify(net, portid, n, &actions, event, extack);
+		ret = tcf_get_notify(net, portid, n, actions, event, extack);
 	else { /* delete */
-		ret = tcf_del_notify(net, n, &actions, portid, attr_size, extack);
+		ret = tcf_del_notify(net, n, actions, &acts_deleted, portid,
+				     attr_size, extack);
 		if (ret)
 			goto err;
 		return ret;
 	}
 err:
-	tcf_action_put_lst(&actions);
+	tcf_action_put_many(&actions[acts_deleted]);
 	return ret;
 }
 
 static int
-tcf_add_notify(struct net *net, struct nlmsghdr *n, struct list_head *actions,
+tcf_add_notify(struct net *net, struct nlmsghdr *n, struct tc_action *actions[],
 	       u32 portid, size_t attr_size, struct netlink_ext_ack *extack)
 {
 	struct sk_buff *skb;
@@ -1293,15 +1303,15 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 {
 	size_t attr_size = 0;
 	int ret = 0;
-	LIST_HEAD(actions);
+	struct tc_action *actions[TCA_ACT_MAX_PRIO] = {};
 
-	ret = tcf_action_init(net, NULL, nla, NULL, NULL, ovr, 0, &actions,
+	ret = tcf_action_init(net, NULL, nla, NULL, NULL, ovr, 0, actions,
 			      &attr_size, true, extack);
-	if (ret)
+	if (ret < 0)
 		return ret;
-	ret = tcf_add_notify(net, n, &actions, portid, attr_size, extack);
+	ret = tcf_add_notify(net, n, actions, portid, attr_size, extack);
 	if (ovr)
-		tcf_action_put_lst(&actions);
+		tcf_action_put_many(actions);
 
 	return ret;
 }
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 90b758778756..30af17e80dd7 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1416,10 +1416,7 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 void tcf_exts_destroy(struct tcf_exts *exts)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	LIST_HEAD(actions);
-
-	tcf_exts_to_list(exts, &actions);
-	tcf_action_destroy(&actions, TCA_ACT_UNBIND);
+	tcf_action_destroy(exts->actions, TCA_ACT_UNBIND);
 	kfree(exts->actions);
 	exts->nr_actions = 0;
 #endif
@@ -1446,18 +1443,15 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 			exts->actions[0] = act;
 			exts->nr_actions = 1;
 		} else if (exts->action && tb[exts->action]) {
-			LIST_HEAD(actions);
-			int err, i = 0;
+			int err;
 
 			err = tcf_action_init(net, tp, tb[exts->action],
 					      rate_tlv, NULL, ovr, TCA_ACT_BIND,
-					      &actions, &attr_size, true,
+					      exts->actions, &attr_size, true,
 					      extack);
-			if (err)
+			if (err < 0)
 				return err;
-			list_for_each_entry(act, &actions, list)
-				exts->actions[i++] = act;
-			exts->nr_actions = i;
+			exts->nr_actions = err;
 		}
 		exts->net = net;
 	}
@@ -1506,14 +1500,11 @@ int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts)
 		 * tc data even if iproute2  was newer - jhs
 		 */
 		if (exts->type != TCA_OLD_COMPAT) {
-			LIST_HEAD(actions);
-
 			nest = nla_nest_start(skb, exts->action);
 			if (nest == NULL)
 				goto nla_put_failure;
 
-			tcf_exts_to_list(exts, &actions);
-			if (tcf_action_dump(skb, &actions, 0, 0) < 0)
+			if (tcf_action_dump(skb, exts->actions, 0, 0) < 0)
 				goto nla_put_failure;
 			nla_nest_end(skb, nest);
 		} else if (exts->police) {
-- 
2.7.5
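
For readers following the conversion above: the patch replaces list walks
with fixed-size, NULL-terminated pointer arrays. A minimal userspace sketch
of that iteration pattern follows. struct action, MAX_PRIO and put_many()
are hypothetical stand-ins for struct tc_action, TCA_ACT_MAX_PRIO and
tcf_action_put_many(); the value 32 is assumed here.

```c
#include <stddef.h>

#define MAX_PRIO 32 /* stand-in for TCA_ACT_MAX_PRIO; value assumed for this sketch */

/* Hypothetical stand-in for struct tc_action: only a reference count. */
struct action {
	int refcnt;
};

/* Drop one reference from each action, stopping at the first NULL slot
 * or after MAX_PRIO entries, whichever comes first.
 * Returns the number of entries visited.
 */
static int put_many(struct action *actions[])
{
	int i;

	for (i = 0; i < MAX_PRIO && actions[i]; i++)
		actions[i]->refcnt--;
	return i;
}
```

Passing &actions[n] skips the first n entries, which is how tca_action_gd()
above avoids releasing actions that were already deleted. That only stays in
bounds if the array has a NULL slot past the last entry, which is presumably
why that caller sizes its array TCA_ACT_MAX_PRIO + 1 and zero-initializes it.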


* [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev

This patch series adds a JIT layer to translate nft expressions
to ebpf programs.

From the commit phase, we spawn a userspace program (using the recently
added UMH infrastructure).

We then provide rules that came in this transaction to the helper via pipe,
using same nf_tables netlink that nftables already uses.

The userspace helper translates the rules, and, if successful, installs the
generated program(s) via bpf syscall.

For each rule, a small response containing the corresponding ebpf file
descriptor (can be -1 on failure) and an attribute count (how many
expressions were jitted) gets sent back to the kernel via the pipe.
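
A rough userspace sketch of such a per-rule response, assuming a fixed-size
record with the same two fields as struct nft_jit_data_from_user from patch
3/5. struct jit_response, send_response() and recv_response() are
hypothetical names used only for illustration; this is not the helper's
actual code.

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical mirror of the per-rule response record. */
struct jit_response {
	int32_t ebpf_fd;	/* program fd, or -1 if translation failed */
	uint32_t expr_count;	/* number of expressions that were jitted */
};

/* Write one complete response record; returns 0 on success, -1 on error. */
static int send_response(int pipe_fd, int32_t ebpf_fd, uint32_t expr_count)
{
	struct jit_response resp = {
		.ebpf_fd = ebpf_fd,
		.expr_count = expr_count,
	};
	ssize_t n = write(pipe_fd, &resp, sizeof(resp));

	return n == (ssize_t)sizeof(resp) ? 0 : -1;
}

/* Read one complete response record; returns 0 on success, -1 on error. */
static int recv_response(int pipe_fd, struct jit_response *resp)
{
	ssize_t n = read(pipe_fd, resp, sizeof(*resp));

	return n == (ssize_t)sizeof(*resp) ? 0 : -1;
}
```

A failed translation would be reported as send_response(fd, -1, 0), in which
case the receiver keeps the rule on the interpreter path.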

If translation fails, the rule will be processed by the nf_tables
interpreter (as before this patch).

If translation succeeded, nf_tables fetches the bpf program using the file
descriptor identifier, allocates a new rule blob containing the new 'ebpf'
expression (and possible trailing un-translated expressions).
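
To illustrate the layout of the replacement rule, here is a small sketch
that works on expression names instead of real rule blobs. It assumes the
jitted expressions always form a prefix of the original rule, which matches
how nft_jit_rule() in patch 3/5 skips the first N expressions.
build_ebpf_rule() is a hypothetical helper, not part of the series.

```c
#include <stddef.h>

/* Build the expression list of the replacement rule: one "ebpf" entry
 * followed by every original expression that was not jitted.
 * Assumes the first n_jitted expressions were translated.
 * Returns the number of entries written, or 0 if out[] is too small.
 */
static size_t build_ebpf_rule(const char *const orig[], size_t n_orig,
			      size_t n_jitted, const char *out[],
			      size_t out_cap)
{
	size_t i, n = 0;

	if (n_jitted > n_orig)
		n_jitted = n_orig;
	if (out_cap < 1 + (n_orig - n_jitted))
		return 0;

	out[n++] = "ebpf";		/* stands in for the whole jitted prefix */
	for (i = n_jitted; i < n_orig; i++)
		out[n++] = orig[i];	/* untranslated tail is copied as-is */
	return n;
}
```

With an original rule of four expressions of which two were jitted, the new
rule holds three: the ebpf expression plus the two trailing ones.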

It then replaces the original rule in the transaction log with the new
'ebpf-rule'.  The original rule is retained in a private area inside the ebpf
expression to be able to present the original expressions back to userspace
on 'nft list ruleset'.

For easier review, this contains the kernel-side only.
nf_tables_jit_work() will not do anything, yet.

Unresolved issues:
 - maps and sets.
   It might be possible to add a new ebpf map type that just wraps
   the nft set infrastructure for lookups.
   This would allow nft userspace to continue to work as-is while
   not requiring new ebpf helper.
   Anonymous set should be a lot easier as they're immutable
   and could probably be handled already by existing infra.

 - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
   I'm also abusing skb->cb[] to pass network and transport header offsets.
   It's not 'public' api so this can be changed later.

 - always uses BPF_PROG_TYPE_SCHED_CLS.
   This is because it "works" for current RFC purposes.

 - we should eventually support translating multiple (adjacent) rules
   into single program.

   If we do this, the kernel will need to track the mapping of rules to
   programs (to re-jit when a rule is changed).  This isn't implemented
   so far, but can be added later.  Alternatively, one could also add a
   'readonly' table switch to just prevent further updates.

   We will also need to dump the 'next' generation of the
   to-be-translated table.  The kernel has this information, so it's only
   a matter of serializing it back to userspace from the commit phase.
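
Regarding the skb->cb[] item in the list above, the idea can be sketched in
userspace as "save the scratch area, write the arguments, run the program,
restore". The field names follow struct nft_jit_args_inet_cb from patch 3/5;
CB_PRIV_LEN (20 bytes, taken from the "12 bytes left" comment there),
run_with_args() and read_thoff() are assumptions made for this sketch only.

```c
#include <stdint.h>
#include <string.h>

#define CB_PRIV_LEN 20 /* assumed size of the scratch area */

/* Header offsets handed to the program through the scratch area. */
struct jit_args {
	uint16_t thoff;		/* transport header offset, 0: unset */
	uint16_t lloff;		/* link layer offset, 0: unset */
	uint16_t l4proto;	/* only meaningful when thoff is set */
	uint16_t reserved;
};

_Static_assert(sizeof(struct jit_args) <= CB_PRIV_LEN,
	       "arguments must fit into the scratch area");

/* Save the scratch area, place the arguments in it, run the program,
 * then restore the previous content so other users are not disturbed.
 */
static int run_with_args(uint8_t cb[CB_PRIV_LEN], const struct jit_args *args,
			 int (*prog)(const uint8_t *cb))
{
	uint8_t saved[CB_PRIV_LEN];
	int ret;

	memcpy(saved, cb, CB_PRIV_LEN);
	memcpy(cb, args, sizeof(*args));
	ret = prog(cb);
	memcpy(cb, saved, CB_PRIV_LEN);
	return ret;
}

/* Example "program": returns the transport header offset it was given. */
static int read_thoff(const uint8_t *cb)
{
	struct jit_args a;

	memcpy(&a, cb, sizeof(a));
	return a.thoff;
}
```

The restore step is what keeps this usable as a stopgap: whatever owned the
scratch area before the call sees its data unchanged afterwards.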

The jitter is still limited.  So far it supports:

 * payload expression for network and transport header
 * meta mark, nfproto, l4proto
 * 32 bit immediates
 * 32 bit bitmask ops
 * accept/drop verdicts

As this uses netlink, there is also no technical requirement for
libnftnl; it's simply used here for convenience.

It doesn't need any userspace changes. Patches for libnftnl and nftables
make debug info available (e.g. to map rule to its bpf prog id).

Comments welcome.

Florian Westphal (5):
      bpf: add bpf_prog_get_type_dev_file
      netfilter: nf_tables: add ebpf expression
      netfilter: nf_tables: add rule ebpf jit infrastructure
      netfilter: nf_tables_jit: add dumping of original rule
      netfilter: nf_tables_jit: add userspace nft to ebpf translator

 include/linux/bpf.h                              |   11 
 include/net/netfilter/nf_tables_core.h           |   22 
 include/uapi/linux/netfilter/nf_tables.h         |   18 
 kernel/bpf/syscall.c                             |   18 
 net/netfilter/Kconfig                            |    7 
 net/netfilter/Makefile                           |    5 
 net/netfilter/nf_tables_api.c                    |   16 
 net/netfilter/nf_tables_core.c                   |   61 +
 net/netfilter/nf_tables_jit.c                    |  242 +++
 net/netfilter/nf_tables_jit/Makefile             |   19 
 net/netfilter/nf_tables_jit/imr.c                | 1401 +++++++++++++++++++++++
 net/netfilter/nf_tables_jit/imr.h                |   96 +
 net/netfilter/nf_tables_jit/main.c               |  579 +++++++++
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  175 ++
 14 files changed, 2670 insertions(+)


* [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

Same as bpf_prog_get_type_dev, but gets struct file* instead of fd.

In case of nf_tables jit, a file descriptor representing the ebpf program
gets passed to kernel via a pipe from the (userspace) jit helper,
not 'current', so existing bpf_prog_get_type_dev() doesn't work.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/bpf.h  | 11 +++++++++++
 kernel/bpf/syscall.c | 18 ++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbe297436e5d..be7796ac48ac 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,7 @@ struct bpf_map;
 struct sock;
 struct seq_file;
 struct btf;
+struct file;
 
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
@@ -417,6 +418,9 @@ extern const struct bpf_verifier_ops xdp_analyzer_ops;
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 				       bool attach_drv);
+struct bpf_prog *bpf_prog_get_type_dev_file(struct file *,
+					    enum bpf_prog_type type,
+					    bool attach_drv);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -523,6 +527,13 @@ static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct bpf_prog *bpf_prog_get_type_dev_file(struct file *f,
+							  enum bpf_prog_type type,
+							  bool b)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog,
 							  int i)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 388d4feda348..3fcfd26f0290 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1203,6 +1203,24 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev);
 
+struct bpf_prog *bpf_prog_get_type_dev_file(struct file *f,
+					    enum bpf_prog_type type,
+					    bool attach_drv)
+{
+	struct bpf_prog *prog;
+
+	if (f->f_op != &bpf_prog_fops)
+		return ERR_PTR(-EINVAL);
+
+	prog = f->private_data;
+
+	if (!bpf_prog_get_ok(prog, &type, attach_drv))
+		return ERR_PTR(-EINVAL);
+
+	return bpf_prog_inc(prog);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev_file);
+
 /* Initially all BPF programs could be loaded w/o specifying
  * expected_attach_type. Later for some of them specifying expected_attach_type
  * at load time became required so that program could be validated properly.
-- 
2.16.4


* [RFC nf-next 2/5] netfilter: nf_tables: add ebpf expression
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

This expression serves two purposes:
1. a middleman to invoke BPF_PROG_RUN() from nf_tables main eval loop
2. to expose the bpf program id via netlink, so userspace
   can map nftables rules to their corresponding ebpf program.

2) is added in a followup patch.

It's currently not possible to attach arbitrary ebpf programs from
userspace, but this limitation is easy to remove if needed.
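
As a reading aid for purpose 1, the way the middleman turns the program's
return value into an nf_tables verdict can be sketched as below. The
constants carry the values from the kernel's netfilter uapi headers;
map_prog_result() is a hypothetical helper that only mirrors the switch
statement in nft_ebpf_fast_eval() from this patch.

```c
#include <stdbool.h>

/* Verdict values as defined in the netfilter uapi headers. */
enum { NF_DROP = 0, NF_ACCEPT = 1 };
enum { NFT_CONTINUE = -1, NFT_BREAK = -2 };

/* Map the program's return value to a verdict.
 * Returns true if *verdict was written, false if evaluation should simply
 * continue with the next expression. Unknown values are treated as drop.
 */
static bool map_prog_result(int ret, int *verdict)
{
	switch (ret) {
	case NF_DROP:
	case NF_ACCEPT:
	case NFT_BREAK:
		*verdict = ret;
		return true;
	case NFT_CONTINUE:
		return false;
	default:
		*verdict = NF_DROP;	/* fail closed on unexpected values */
		return true;
	}
}
```

Dropping on unknown return values means a misbehaving program cannot let
packets through by accident.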

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h   |  9 ++++++
 include/uapi/linux/netfilter/nf_tables.h | 18 ++++++++++++
 net/netfilter/Makefile                   |  3 +-
 net/netfilter/nf_tables_core.c           | 33 ++++++++++++++++++++++
 net/netfilter/nf_tables_jit.c            | 48 ++++++++++++++++++++++++++++++++
 5 files changed, 110 insertions(+), 1 deletion(-)
 create mode 100644 net/netfilter/nf_tables_jit.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index e0c0c2558ec4..90087a84f127 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -15,6 +15,7 @@ extern struct nft_expr_type nft_range_type;
 extern struct nft_expr_type nft_meta_type;
 extern struct nft_expr_type nft_rt_type;
 extern struct nft_expr_type nft_exthdr_type;
+extern struct nft_expr_type nft_ebpf_type;
 
 int nf_tables_core_module_init(void);
 void nf_tables_core_module_exit(void);
@@ -62,6 +63,14 @@ struct nft_payload_set {
 
 extern const struct nft_expr_ops nft_payload_fast_ops;
 
+struct nft_ebpf {
+	struct bpf_prog *prog;
+	u8 expressions;
+	const struct nft_rule *original;
+};
+
+extern const struct nft_expr_ops nft_ebpf_fast_ops;
+
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
 
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 9c71f024f9cc..e05799652a4c 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -718,6 +718,24 @@ enum nft_payload_attributes {
 };
 #define NFTA_PAYLOAD_MAX	(__NFTA_PAYLOAD_MAX - 1)
 
+/**
+ * enum nft_ebpf_attributes - nf_tables ebpf expression netlink attributes
+ *
+ * @NFTA_EBPF_FD: file descriptor holding ebpf program (NLA_S32)
+ * @NFTA_EBPF_ID: bpf program id (NLA_U32)
+ * @NFTA_EBPF_TAG: bpf tag (NLA_BINARY)
+ * @NFTA_EBPF_EXPR_COUNT: expressions covered by this jit (NLA_U32)
+ */
+enum nft_ebpf_attributes {
+	NFTA_EBPF_UNSPEC,
+	NFTA_EBPF_FD,
+	NFTA_EBPF_ID,
+	NFTA_EBPF_TAG,
+	NFTA_EBPF_EXPR_COUNT,
+	__NFTA_EBPF_MAX,
+};
+#define NFTA_EBPF_MAX	(__NFTA_EBPF_MAX - 1)
+
 enum nft_exthdr_flags {
 	NFT_EXTHDR_F_PRESENT = (1 << 0),
 };
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 9b3434360d49..49c6e0a535f9 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,7 +76,8 @@ obj-$(CONFIG_NF_DUP_NETDEV)	+= nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
 		  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
-		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o
+		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o \
+		  nf_tables_jit.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 47cf667b15ca..038a15243508 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/filter.h>
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/skbuff.h>
@@ -92,6 +93,35 @@ static bool nft_payload_fast_eval(const struct nft_expr *expr,
 	return true;
 }
 
+static void nft_ebpf_fast_eval(const struct nft_expr *expr,
+			       struct nft_regs *regs,
+			       const struct nft_pktinfo *pkt)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	struct bpf_skb_data_end cb_saved;
+	int ret;
+
+	memcpy(&cb_saved, pkt->skb->cb, sizeof(cb_saved));
+	bpf_compute_data_pointers(pkt->skb);
+
+	ret = BPF_PROG_RUN(priv->prog, pkt->skb);
+
+	memcpy(pkt->skb->cb, &cb_saved, sizeof(cb_saved));
+
+	switch (ret) {
+	case NF_DROP:
+	case NF_ACCEPT:
+	case NFT_BREAK:
+		regs->verdict.code = ret;
+		return;
+	case NFT_CONTINUE:
+		return;
+	default:
+		pr_debug("Unknown verdict %d\n", ret);
+		regs->verdict.code = NF_DROP;
+		break;
+	}
+}
 DEFINE_STATIC_KEY_FALSE(nft_counters_enabled);
 
 static noinline void nft_update_chain_stats(const struct nft_chain *chain,
@@ -151,6 +181,8 @@ nft_do_chain(struct nft_pktinfo *pkt, void *priv)
 		nft_rule_for_each_expr(expr, last, rule) {
 			if (expr->ops == &nft_cmp_fast_ops)
 				nft_cmp_fast_eval(expr, &regs);
+			else if (expr->ops == &nft_ebpf_fast_ops)
+				nft_ebpf_fast_eval(expr, &regs, pkt);
 			else if (expr->ops != &nft_payload_fast_ops ||
 				 !nft_payload_fast_eval(expr, &regs, pkt))
 				expr->ops->eval(expr, &regs, pkt);
@@ -232,6 +264,7 @@ static struct nft_expr_type *nft_basic_types[] = {
 	&nft_meta_type,
 	&nft_rt_type,
 	&nft_exthdr_type,
+	&nft_ebpf_type,
 };
 
 int __init nf_tables_core_module_init(void)
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
new file mode 100644
index 000000000000..415c2acfa471
--- /dev/null
+++ b/net/netfilter/nf_tables_jit.c
@@ -0,0 +1,48 @@
+#include <linux/bpf.h>
+#include <linux/netfilter.h>
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+
+struct nft_ebpf_expression {
+	struct nft_expr e;
+	struct nft_ebpf priv;
+};
+
+static const struct nla_policy nft_ebpf_policy[NFTA_EBPF_MAX + 1] = {
+	[NFTA_EBPF_FD]			= { .type = NLA_S32 },
+	[NFTA_EBPF_ID]			= { .type = NLA_U32 },
+	[NFTA_EBPF_EXPR_COUNT]		= { .type = NLA_U32 },
+	[NFTA_EBPF_TAG]			= { .type = NLA_BINARY,
+					    .len = BPF_TAG_SIZE, },
+};
+
+static int nft_ebpf_init(const struct nft_ctx *ctx,
+			 const struct nft_expr *expr,
+			 const struct nlattr * const tb[])
+{
+	return -EOPNOTSUPP;
+}
+
+static void nft_ebpf_destroy(const struct nft_ctx *ctx,
+			     const struct nft_expr *expr)
+{
+	struct nft_ebpf *priv = nft_expr_priv(expr);
+
+	bpf_prog_put(priv->prog);
+	kfree(priv->original);
+}
+
+const struct nft_expr_ops nft_ebpf_fast_ops = {
+	.type		= &nft_ebpf_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_ebpf)),
+	.init		= nft_ebpf_init,
+	.destroy	= nft_ebpf_destroy,
+};
+
+struct nft_expr_type nft_ebpf_type __read_mostly = {
+	.name		= "ebpf",
+	.ops		= &nft_ebpf_fast_ops,
+	.policy		= nft_ebpf_policy,
+	.maxattr	= NFTA_EBPF_MAX,
+	.owner		= THIS_MODULE,
+};
-- 
2.16.4


* [RFC nf-next 3/5] netfilter: nf_tables: add rule ebpf jit infrastructure
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

This adds a JIT helper infrastructure to translate nft expressions to ebpf
programs.

From the commit phase, we spawn the jit module (a userspace program), and
then provide the rules that came in this transaction to that program via a
pipe (in nf_tables netlink format).

The userspace helper translates the rules if possible, and installs the
program(s) via bpf syscall.

For each rule, a small response containing the corresponding file descriptor
(can be -1 on failure) and an attribute count (how many expressions were
jitted) gets sent back to the kernel via the pipe.

If translation fails, the rule will be processed by the nf_tables
interpreter (as before this patch).

If translation succeeded, nf_tables fetches the bpf program using the file
descriptor identifier, allocates a new rule blob containing the new 'ebpf'
expression (and possible trailing un-translated expressions).

It then replaces the original rule in the transaction log with the new
'ebpf-rule'.
The original rule is retained in a private area inside the ebpf expression
to be able to present the original expressions to userspace when
'nft list ruleset' is called.

For easier review, this contains the kernel-side only.
nf_tables_jit_work() will not do anything, yet.

Unresolved issues:
 - maps and sets.
   It might be possible to add a new ebpf map type that just wraps
   the nft set infrastructure for lookups.
   This would allow nft userspace to continue to work as-is while
   not requiring new ebpf helper.

 - we should eventually support translating multiple (adjacent) rules
   into single program.

   If we do this, the kernel will need to track the mapping of rules to
   programs (to re-jit when a rule is changed).  This isn't implemented
   so far, but can be added later.

   We will also need to dump the 'next' generation of the
   to-be-translated table.  The kernel has this information, so it's only
   a matter of serializing it back to userspace from the commit phase.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h           |  12 ++
 net/netfilter/Kconfig                            |   7 ++
 net/netfilter/Makefile                           |   8 +-
 net/netfilter/nf_tables_api.c                    |   5 +
 net/netfilter/nf_tables_core.c                   |  31 ++++-
 net/netfilter/nf_tables_jit.c                    | 139 +++++++++++++++++++++++
 net/netfilter/nf_tables_jit/Makefile             |  18 +++
 net/netfilter/nf_tables_jit/main.c               |  21 ++++
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  33 ++++++
 9 files changed, 270 insertions(+), 4 deletions(-)
 create mode 100644 net/netfilter/nf_tables_jit/Makefile
 create mode 100644 net/netfilter/nf_tables_jit/main.c
 create mode 100644 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 90087a84f127..e9b5cc20ec45 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -71,6 +71,18 @@ struct nft_ebpf {
 
 extern const struct nft_expr_ops nft_ebpf_fast_ops;
 
+struct nft_jit_data_from_user {
+	int ebpf_fd;		/* fd to get program from, or < 0 if jitter error */
+	u32 expr_count;		/* number of translated expressions */
+};
+
+#if IS_ENABLED(CONFIG_NF_TABLES_JIT)
+int nft_jit_commit(struct net *net);
+#else
+static inline int nft_jit_commit(struct net *net) { return 0; }
+#endif
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e);
+
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
 
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 3ec8886850b2..82162fe931bb 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -473,6 +473,13 @@ config NF_TABLES_NETDEV
 	help
 	  This option enables support for the "netdev" table.
 
+config NF_TABLES_JIT
+	bool "Netfilter nf_tables jit infrastructure"
+	depends on BPF
+	help
+	  This option enables support for translation of nf_tables
+	  expressions to ebpf.
+
 config NFT_NUMGEN
 	tristate "Netfilter nf_tables number generator module"
 	help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 49c6e0a535f9..ecb371160cf7 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,8 +76,12 @@ obj-$(CONFIG_NF_DUP_NETDEV)	+= nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
 		  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
-		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o \
-		  nf_tables_jit.o
+		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o
+
+obj-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit.o
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/nf_tables_jit_kern.o
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/nf_tables_jit_umh.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 89e61b2d048b..40c2de230400 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -6092,6 +6092,11 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 	struct nft_trans_elem *te;
 	struct nft_chain *chain;
 	struct nft_table *table;
+	int ret;
+
+	ret = nft_jit_commit(net);
+	if (ret < 0)
+		return ret;
 
 	/* 1.  Allocate space for next generation rules_gen_X[] */
 	list_for_each_entry_safe(trans, next, &net->nft.commit_list, list) {
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 038a15243508..5557b2709f98 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -93,19 +93,46 @@ static bool nft_payload_fast_eval(const struct nft_expr *expr,
 	return true;
 }
 
+/* Dirty hack: pass nft_pktinfo in skb->cb[] */
+struct nft_jit_args_inet_cb {
+	/* cb[0] */
+	u16 thoff;	 /* 0: unset */
+	u16 lloff;	 /* 0: unset */
+
+	/* cb[1] */
+	u16 l4proto;	/* thoff = 0? unset */
+	u16 reserved;
+
+	/* 12 bytes left */
+};
+
 static void nft_ebpf_fast_eval(const struct nft_expr *expr,
 			       struct nft_regs *regs,
 			       const struct nft_pktinfo *pkt)
 {
 	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	struct nft_jit_args_inet_cb *jit_args;
 	struct bpf_skb_data_end cb_saved;
 	int ret;
 
+	BUILD_BUG_ON(sizeof(struct nft_jit_args_inet_cb) > QDISC_CB_PRIV_LEN);
+
 	memcpy(&cb_saved, pkt->skb->cb, sizeof(cb_saved));
+
+	jit_args = (void *)bpf_skb_cb(pkt->skb);
+	memset(jit_args, 0, sizeof(*jit_args));
+
+	if (skb_mac_header_was_set(pkt->skb))
+		jit_args->lloff = skb_mac_header_len(pkt->skb);
+
+	if (pkt->tprot_set) {
+		jit_args->thoff = pkt->xt.thoff;
+		jit_args->l4proto = pkt->tprot;
+	}
+
 	bpf_compute_data_pointers(pkt->skb);
 
 	ret = BPF_PROG_RUN(priv->prog, pkt->skb);
-
 	memcpy(pkt->skb->cb, &cb_saved, sizeof(cb_saved));
 
 	switch (ret) {
@@ -119,9 +146,9 @@ static void nft_ebpf_fast_eval(const struct nft_expr *expr,
 	default:
 		pr_debug("Unknown verdict %d\n", ret);
 		regs->verdict.code = NF_DROP;
-		break;
 	}
 }
+
 DEFINE_STATIC_KEY_FALSE(nft_counters_enabled);
 
 static noinline void nft_update_chain_stats(const struct nft_chain *chain,
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
index 415c2acfa471..a8f4696249bf 100644
--- a/net/netfilter/nf_tables_jit.c
+++ b/net/netfilter/nf_tables_jit.c
@@ -1,13 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
+#include <linux/filter.h>
 #include <linux/netfilter.h>
 #include <net/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
+#include <linux/file.h>
+
+static int nft_jit_dump_ruleinfo(struct sk_buff *skb,
+				 const struct nft_ctx *ctx, const struct nft_rule *rule)
+{
+	const struct nft_expr *expr, *next;
+	struct nfgenmsg *nfmsg;
+	struct nlmsghdr *nlh;
+	struct nlattr *list;
+	int ret;
+	u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWRULE);
+
+	nlh = nlmsg_put(skb, ctx->portid, ctx->seq, type, sizeof(struct nfgenmsg), 0);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	nfmsg = nlmsg_data(nlh);
+	nfmsg->nfgen_family = ctx->family;
+	nfmsg->version = NFNETLINK_V0;
+	nfmsg->res_id = htons(ctx->net->nft.base_seq & 0xffff);
+
+	ret = nla_put_string(skb, NFTA_RULE_TABLE, ctx->table->name);
+	if (ret < 0)
+		return ret;
+	ret = nla_put_string(skb, NFTA_RULE_CHAIN, ctx->chain->name);
+	if (ret < 0)
+		return ret;
+	ret = nla_put_be64(skb, NFTA_RULE_HANDLE, cpu_to_be64(rule->handle),
+			   NFTA_RULE_PAD);
+	if (ret < 0)
+		return ret;
+
+	list = nla_nest_start(skb, NFTA_RULE_EXPRESSIONS);
+	if (list == NULL)
+		return -EMSGSIZE;
+
+	nft_rule_for_each_expr(expr, next, rule) {
+		ret = nft_expr_dump(skb, NFTA_LIST_ELEM, expr);
+		if (ret)
+			return ret;
+	}
+	nla_nest_end(skb, list);
+	nlmsg_end(skb, nlh);
+	return 0;
+}
 
 struct nft_ebpf_expression {
 	struct nft_expr e;
 	struct nft_ebpf priv;
 };
 
+static int nft_jit_rule(struct nft_trans *trans, struct sk_buff *skb)
+{
+	const struct nft_rule *r = nft_trans_rule(trans);
+	const struct nft_expr *e, *last;
+	struct nft_ebpf_expression ebpf = { 0 };
+	struct nft_rule *rule;
+	struct nft_expr *new;
+	unsigned int size = sizeof(ebpf);
+	int err, expr_count;
+
+	err = nft_jit_dump_ruleinfo(skb, &trans->ctx, nft_trans_rule(trans));
+	if (err < 0)
+		return err;
+
+	err = nf_tables_jit_work(skb, &ebpf.priv);
+	if (err < 0)
+		return err;
+
+	if (!ebpf.priv.prog)
+		return 0;
+
+	ebpf.priv.original = r;
+
+	if (r->udata) {
+		struct nft_userdata *udata = nft_userdata(r);
+
+		size += udata->len + 1;
+	}
+
+	rule = kmalloc(sizeof(*rule) + r->dlen + size, GFP_KERNEL);
+	if (!rule) {
+		bpf_prog_put(ebpf.priv.prog);
+		return -ENOMEM;
+	}
+
+	memcpy(rule, r, sizeof(*r));
+	rule->dlen = r->dlen + sizeof(ebpf);
+
+	new = nft_expr_first(rule);
+	memcpy(new, &ebpf, sizeof(ebpf));
+	new->ops = &nft_ebpf_fast_ops;
+	size = sizeof(ebpf);
+
+	expr_count = 0;
+	nft_rule_for_each_expr(e, last, r) {
+		++expr_count;
+		if (expr_count <= ebpf.priv.expressions)
+			continue; /* expression was jitted */
+
+		new = nft_expr_next(new);
+		memcpy(new, e, e->ops->size);
+		size += e->ops->size;
+	}
+
+	rule->dlen = size;
+	if (r->udata) {
+		const struct nft_userdata *udata = nft_userdata(r);
+
+		memcpy(nft_userdata(rule), udata, udata->len + 1);
+	}
+
+	list_replace_rcu(&nft_trans_rule(trans)->list, &rule->list);
+	nft_trans_rule(trans) = rule;
+
+	return 0;
+}
+
+int nft_jit_commit(struct net *net)
+{
+	struct nft_trans *trans;
+	struct sk_buff *skb;
+	int ret = 0;
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	list_for_each_entry(trans, &net->nft.commit_list, list) {
+		if (trans->msg_type != NFT_MSG_NEWRULE)
+			continue;
+
+		ret = nft_jit_rule(trans, skb);
+		if (ret < 0)
+			break;
+		skb->head = skb->data;
+		skb_reset_tail_pointer(skb);
+	}
+
+	kfree_skb(skb);
+	return ret;
+}
+
 static const struct nla_policy nft_ebpf_policy[NFTA_EBPF_MAX + 1] = {
 	[NFTA_EBPF_FD]			= { .type = NLA_S32 },
 	[NFTA_EBPF_ID]			= { .type = NLA_U32 },
diff --git a/net/netfilter/nf_tables_jit/Makefile b/net/netfilter/nf_tables_jit/Makefile
new file mode 100644
index 000000000000..aa7509e49589
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/Makefile
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+
+hostprogs-y := nf_tables_jit_umh
+nf_tables_jit_umh-objs := main.o
+HOSTCFLAGS += -I. -Itools/include/
+
+quiet_cmd_copy_umh = GEN $@
+      cmd_copy_umh = echo ':' > $(obj)/.nf_tables_jit_umh.o.cmd; \
+      $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+      --rename-section .data=.rodata $< $@
+
+$(obj)/nf_tables_jit_umh.o: $(obj)/nf_tables_jit_umh
+	$(call cmd,copy_umh)
+
+obj-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit.o
+nf_tables_jit-objs += nf_tables_jit_kern.o nf_tables_jit_umh.o
diff --git a/net/netfilter/nf_tables_jit/main.c b/net/netfilter/nf_tables_jit/main.c
new file mode 100644
index 000000000000..6f6a4423c2e4
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/main.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+
+int main(void)
+{
+	static struct {
+		int fd, count;
+	} response;
+
+	response.fd = -1;
+	for (;;) {
+		char buf[8192];
+
+		if (read(0, buf, sizeof(buf)) < 0)
+			return 1;
+		if (write(1, &response, sizeof(response)) != sizeof(response))
+			return 2;
+	}
+
+	return 0;
+}
diff --git a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
new file mode 100644
index 000000000000..4778f53b2683
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/umh.h>
+#include <linux/netfilter/nfnetlink.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+
+#define UMH_start _binary_net_netfilter_nf_tables_jit_nf_tables_jit_umh_start
+#define UMH_end _binary_net_netfilter_nf_tables_jit_nf_tables_jit_umh_end
+
+extern char UMH_start;
+extern char UMH_end;
+
+static struct umh_info info;
+
+static int nft_jit_load_umh(void)
+{
+	return fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+}
+
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+{
+	if (!info.pipe_to_umh) {
+		int ret = nft_jit_load_umh();
+		if (ret)
+			return ret;
+
+		if (WARN_ON(!info.pipe_to_umh))
+			return -EINVAL;
+	}
+
+	return 0;
+}
-- 
2.16.4


* [RFC nf-next 4/5] netfilter: nf_tables_jit: add dumping of original rule
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

After the previous patch, userspace can no longer discover the original
rules when listing a rule.

This change adds a dump callback to the ebpf expression and
a special handling in the main dumper loop.

When we see an ebpf expression in a rule, we skip normal dump handling
and leave it to the nft ebpf expression -- it has a copy of the
original expressions and can then simply add them back.

In order to allow userspace to discover presence of auto-jit,
and to map the rule to the ebpf program, we still include the ebpf
expression itself as the first expression in the dump.

For now, we expose the ebpf tag and the ebpf id plus the
number of expressions that are supposedly covered by the program.
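
The resulting dump order can be pictured with a small standalone sketch.
This is not the kernel code: the struct and function names below are
hypothetical, and expressions are reduced to their names. It only shows
that the ebpf expression comes first, followed by the saved original
expressions rather than the expressions left in the packet-path rule.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for a rule and its expressions. */
struct sketch_expr {
	const char *name;
};

struct sketch_rule {
	const struct sketch_expr *exprs;
	size_t count;
};

/*
 * Write expression names to out[] in the order a dump would list them.
 * If the active (packet path) rule starts with "ebpf", emit that name
 * first and then switch to the saved original rule.  Otherwise the
 * active rule is emitted unchanged.  Returns the number of names written.
 */
static size_t sketch_dump_order(const struct sketch_rule *active,
				const struct sketch_rule *original,
				const char **out, size_t max)
{
	const struct sketch_rule *src = active;
	size_t n = 0, i;

	assert(active && out);

	if (active->count && strcmp(active->exprs[0].name, "ebpf") == 0) {
		assert(original);
		if (n < max)
			out[n++] = active->exprs[0].name;
		src = original;
	}

	for (i = 0; i < src->count && n < max; i++)
		out[n++] = src->exprs[i].name;

	return n;
}
```

With an original rule of meta, cmp, immediate and a jitted rule of
ebpf, immediate, the sketch lists ebpf, meta, cmp, immediate.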

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 11 +++++++++
 net/netfilter/nf_tables_jit.c | 55 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 40c2de230400..4c5acd5d1cab 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2054,6 +2054,17 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
 	if (list == NULL)
 		goto nla_put_failure;
 	nft_rule_for_each_expr(expr, next, rule) {
+		/*
+		 * special case: ebpf_fast_ops will add original expressions
+		 * to the netlink message, it will call
+		 * nf_tables_fill_expr_info() itself.
+		 */
+		if (expr->ops == &nft_ebpf_fast_ops) {
+			if (expr->ops->dump(skb, expr) < 0)
+				goto nla_put_failure;
+			break;
+		}
+
 		if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr) < 0)
 			goto nla_put_failure;
 	}
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
index a8f4696249bf..864331aaee6b 100644
--- a/net/netfilter/nf_tables_jit.c
+++ b/net/netfilter/nf_tables_jit.c
@@ -171,11 +171,66 @@ static void nft_ebpf_destroy(const struct nft_ctx *ctx,
 	kfree(priv->original);
 }
 
+static int nft_ebpf_dump(struct sk_buff *skb, const struct nft_expr *expr)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	const struct bpf_prog *prog = priv->prog;
+	const struct nft_expr *next;
+	struct nlattr *nest, *data;
+	int ret;
+
+	/*
+	 * From the netlink perspective, dumps of a normal and an ebpf-jitted
+	 * rule are the same, except that the ebpf-jitted rule has the ebpf
+	 * expression prepended to it.  The ebpf expression allows us to
+	 * propagate the ebpf tag and some other meta data back to userspace.
+	 *
+	 * After the ebpf expression we serialize the expressions of the
+	 * original rule (rather than the ebpf-rule blob used in packet path).
+	 */
+	nest = nla_nest_start(skb, NFTA_LIST_ELEM);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (nla_put_string(skb, NFTA_EXPR_NAME, expr->ops->type->name))
+		return -EMSGSIZE;
+
+	/* first, add ebpf expr meta data */
+	data = nla_nest_start(skb, NFTA_EXPR_DATA);
+	if (data == NULL)
+		return -EMSGSIZE;
+
+	ret = nla_put_be32(skb, NFTA_EBPF_ID, htonl(prog->aux->id));
+	if (ret)
+		return ret;
+
+	ret = nla_put(skb, NFTA_EBPF_TAG, sizeof(prog->tag), prog->tag);
+	if (ret)
+		return ret;
+
+	ret = nla_put_be32(skb, NFTA_EBPF_EXPR_COUNT, htonl(priv->expressions));
+	if (ret)
+		return ret;
+	nla_nest_end(skb, data);
+	nla_nest_end(skb, nest);
+
+	/* ... followed by the expressions that made up the original rule. */
+	nft_rule_for_each_expr(expr, next, priv->original) {
+		if (WARN_ON(expr->ops->dump == nft_ebpf_dump))
+			break;
+		if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr) < 0)
+			return -EMSGSIZE;
+	}
+
+	return 0;
+}
+
 const struct nft_expr_ops nft_ebpf_fast_ops = {
 	.type		= &nft_ebpf_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_ebpf)),
 	.init		= nft_ebpf_init,
 	.destroy	= nft_ebpf_destroy,
+	.dump		= nft_ebpf_dump,
 };
 
 struct nft_expr_type nft_ebpf_type __read_mostly = {
-- 
2.16.4


* [RFC nf-next 5/5] netfilter: nf_tables_jit: add userspace nft to ebpf translator
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal
In-Reply-To: <20180601153216.10901-1-fw@strlen.de>

The translator is currently rather limited.

It supports:
 * payload expression for network and transport header
 * meta mark, nfproto, l4proto
 * 32 bit immediates
 * 32 bit bitmask ops
 * accept/drop verdicts
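
As a rough illustration of how a comparison against a 32 bit immediate
ends in a verdict: the generated program branches past the verdict when
the negated condition holds, so "eq" becomes "jump if not equal". The
sketch below is plain C rather than BPF instructions, and every name in
it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_cmp {
	SKETCH_EQ, SKETCH_NE, SKETCH_LT, SKETCH_LTE, SKETCH_GT, SKETCH_GTE,
};

enum sketch_verdict { SKETCH_NEXT, SKETCH_ACCEPT, SKETCH_DROP };

/* Map a comparison to the condition under which the rule is left early. */
static enum sketch_cmp sketch_negate(enum sketch_cmp op)
{
	switch (op) {
	case SKETCH_EQ:  return SKETCH_NE;
	case SKETCH_NE:  return SKETCH_EQ;
	case SKETCH_LT:  return SKETCH_GTE;
	case SKETCH_LTE: return SKETCH_GT;
	case SKETCH_GT:  return SKETCH_LTE;
	case SKETCH_GTE: return SKETCH_LT;
	}
	assert(0);
	return op;
}

/* Evaluate "a op b" on unsigned 32 bit values. */
static bool sketch_holds(enum sketch_cmp op, uint32_t a, uint32_t b)
{
	switch (op) {
	case SKETCH_EQ:  return a == b;
	case SKETCH_NE:  return a != b;
	case SKETCH_LT:  return a < b;
	case SKETCH_LTE: return a <= b;
	case SKETCH_GT:  return a > b;
	case SKETCH_GTE: return a >= b;
	}
	return false;
}

/*
 * Model of one rule "value op imm -> verdict": if the negated test is
 * true we skip the verdict and continue with the next rule.
 */
static enum sketch_verdict sketch_rule(enum sketch_cmp op, uint32_t value,
				       uint32_t imm, enum sketch_verdict v)
{
	if (sketch_holds(sketch_negate(op), value, imm))
		return SKETCH_NEXT;
	return v;
}
```

For example, sketch_rule(SKETCH_EQ, mark, 0x2a, SKETCH_ACCEPT) only
yields SKETCH_ACCEPT when mark is 0x2a.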

Currently kernel will emit each rule on its own.
However, jitter is (eventually) supposed to also cope with complete
chains (including goto/jump).
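
Because each rule is emitted on its own, a failed match only has to
skip to the end of that rule's instructions. One way to resolve such
forward jumps is to emit them with offset 0 and patch them once the
rule is complete. A minimal sketch, assuming a simplified instruction
type (the struct and function names are hypothetical, not the real
struct bpf_insn):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down instruction: only what the fixup needs. */
struct sketch_insn {
	uint8_t is_jump;	/* non-zero for a conditional jump */
	int16_t off;		/* relative offset, 0 means "not resolved yet" */
};

/*
 * Patch all unresolved jumps in img[start..end) so that they land on
 * instruction index 'end', i.e. right behind the current rule.  Jump
 * offsets are relative to the instruction following the jump, hence
 * the "- 1".  Jumps that already carry an offset are left alone.
 */
static void sketch_fixup_jumps(struct sketch_insn *img, size_t start,
			       size_t end)
{
	size_t i;

	assert(img && start <= end);

	for (i = start; i < end; i++) {
		if (!img[i].is_jump || img[i].off)
			continue;
		img[i].off = (int16_t)(end - i - 1);
	}
}
```

A jump in the last slot of the rule keeps offset 0, which already
means "continue with the next instruction".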

It also lacks support for any kind of sets; anonymous sets would
be a good initial target as they can't change.

As this uses netlink, there is also no technical requirement for
libnftnl; it is simply used for convenience.

This doesn't need any userspace changes to work, however,
a libnftnl and nft patch will make debug info available
(e.g. to match a rule with its bpf program id).

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h           |    1 +
 net/netfilter/nf_tables_core.c                   |    1 +
 net/netfilter/nf_tables_jit/Makefile             |    3 +-
 net/netfilter/nf_tables_jit/imr.c                | 1401 ++++++++++++++++++++++
 net/netfilter/nf_tables_jit/imr.h                |   96 ++
 net/netfilter/nf_tables_jit/main.c               |  582 ++++++++-
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  146 ++-
 7 files changed, 2215 insertions(+), 15 deletions(-)
 create mode 100644 net/netfilter/nf_tables_jit/imr.c
 create mode 100644 net/netfilter/nf_tables_jit/imr.h

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index e9b5cc20ec45..f3e85e6c8cc6 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -82,6 +82,7 @@ int nft_jit_commit(struct net *net);
 static inline int nft_jit_commit(struct net *net) { return 0; }
 #endif
 int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e);
+void nft_jit_stop_umh(void);
 
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 5557b2709f98..8956f873a8cb 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -319,4 +319,5 @@ void nf_tables_core_module_exit(void)
 	i = ARRAY_SIZE(nft_basic_types);
 	while (i-- > 0)
 		nft_unregister_expr(nft_basic_types[i]);
+	nft_jit_stop_umh();
 }
diff --git a/net/netfilter/nf_tables_jit/Makefile b/net/netfilter/nf_tables_jit/Makefile
index aa7509e49589..a1b8eb5a4c45 100644
--- a/net/netfilter/nf_tables_jit/Makefile
+++ b/net/netfilter/nf_tables_jit/Makefile
@@ -2,8 +2,9 @@
 #
 
 hostprogs-y := nf_tables_jit_umh
-nf_tables_jit_umh-objs := main.o
+nf_tables_jit_umh-objs := main.o imr.o
 HOSTCFLAGS += -I. -Itools/include/
+HOSTLOADLIBES_nf_tables_jit_umh = `pkg-config --libs libnftnl libmnl`
 
 quiet_cmd_copy_umh = GEN $@
       cmd_copy_umh = echo ':' > $(obj)/.nf_tables_jit_umh.o.cmd; \
diff --git a/net/netfilter/nf_tables_jit/imr.c b/net/netfilter/nf_tables_jit/imr.c
new file mode 100644
index 000000000000..2242bc7379ee
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/imr.c
@@ -0,0 +1,1401 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdbool.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <linux/if_ether.h>
+#include <arpa/inet.h>
+#include <linux/netfilter.h>
+
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+
+#include "imr.h"
+
+#define div_round_up(n, d)      (((n) + (d) - 1) / (d))
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define EMIT(ctx, x)							\
+	do {								\
+		struct bpf_insn __tmp[] = { x };			\
+		if ((ctx)->len_cur + ARRAY_SIZE(__tmp) > BPF_MAXINSNS)	\
+			return -ENOMEM;					\
+		memcpy((ctx)->img + (ctx)->len_cur, &__tmp, sizeof(__tmp));		\
+		(ctx)->len_cur += ARRAY_SIZE(__tmp);			\
+	} while (0)
+
+struct imr_object {
+	enum imr_obj_type type:8;
+	uint8_t len;
+	uint8_t refcnt;
+
+	union {
+		struct {
+			union {
+				uint64_t value_large[8];
+				uint64_t value64;
+				uint32_t value32;
+			};
+		} imm;
+		struct {
+			uint16_t offset;
+			enum imr_payload_base base:8;
+		} payload;
+		struct {
+			enum imr_verdict verdict;
+		} verdict;
+		struct {
+			enum imr_meta_key key:8;
+		} meta;
+		struct {
+			struct imr_object *left;
+			struct imr_object *right;
+			enum imr_alu_op op:8;
+		} alu;
+	};
+};
+
+struct imr_state {
+	struct bpf_insn	*img;
+	uint16_t	len_cur;
+	uint16_t	num_objects;
+	uint8_t		nfproto;
+	uint8_t		regcount;
+
+	/* payload access <= headlen will use direct skb->data access.
+	 * Normally set to either sizeof(iphdr) or sizeof(ipv6hdr).
+	 *
+	 * Access >= headlen will need to go through skb_header_pointer().
+	 */
+	uint8_t		headlen;
+
+	/* where skb->data points to at start
+	 * of program.  Usually this is IMR_PAYLOAD_BASE_NH.
+	 */
+	enum imr_payload_base base:8;
+
+	/* hints to emitter */
+	bool reload_r2;
+
+	struct imr_object *registers[IMR_REG_COUNT];
+
+	struct imr_object **objects;
+};
+
+static int imr_jit_object(struct imr_state *, const struct imr_object *o);
+
+static void internal_error(const char *s)
+{
+	fprintf(stderr, "FIXME: internal error %s\n", s);
+	exit(1);
+}
+
+static unsigned int imr_regs_needed(unsigned int len)
+{
+	return div_round_up(len, sizeof(uint64_t));
+}
+
+static int imr_register_alloc(struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+	uint8_t reg = s->regcount;
+
+	if (s->regcount + regs_needed >= IMR_REG_COUNT) {
+		internal_error("out of BPF registers");
+		return -1;
+	}
+
+	s->regcount += regs_needed;
+
+	return reg;
+}
+
+static int imr_register_get(const struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+
+	if (s->regcount < regs_needed)
+		internal_error("not enough registers in use");
+
+	return s->regcount - regs_needed;
+}
+
+static int bpf_reg_width(unsigned int len)
+{
+	switch (len) {
+	case sizeof(uint8_t): return BPF_B;
+	case sizeof(uint16_t): return BPF_H;
+	case sizeof(uint32_t): return BPF_W;
+	case sizeof(uint64_t): return BPF_DW;
+	default:
+		internal_error("reg size not supported");
+	}
+
+	return -EINVAL;
+}
+
+/* map op to negated bpf opcode.
+ * This is because if we want to check 'eq', we need
+ * to jump to end of rule ('break') on inequality, i.e.
+ * 'branch if NOT equal'.
+ */
+static int alu_jmp_get_negated_bpf_opcode(enum imr_alu_op op)
+{
+	switch (op) {
+	case IMR_ALU_OP_EQ:
+		return BPF_JNE;
+	case IMR_ALU_OP_NE:
+		return BPF_JEQ;
+	case IMR_ALU_OP_LT:
+		return BPF_JGE;
+	case IMR_ALU_OP_LTE:
+		return BPF_JGT;
+	case IMR_ALU_OP_GT:
+		return BPF_JLE;
+	case IMR_ALU_OP_GTE:
+		return BPF_JLT;
+	case IMR_ALU_OP_LSHIFT:
+	case IMR_ALU_OP_AND:
+		break;
+	}
+
+	internal_error("invalid imr alu op");
+	return -EINVAL;
+}
+
+static void imr_register_release(struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+
+	if (s->regcount < regs_needed)
+		internal_error("regcount underflow");
+	s->regcount -= regs_needed;
+}
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num reg, struct imr_object *o)
+{
+	struct imr_object *old;
+
+	old = s->registers[reg];
+	if (old)
+		imr_object_free(old);
+
+	s->registers[reg] = o;
+}
+
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num reg)
+{
+	struct imr_object *o = s->registers[reg];
+
+	if (!o)
+		internal_error("empty register");
+
+	if (!o->refcnt)
+		internal_error("already free'd object in register");
+
+	o->refcnt++;
+	return o;
+}
+
+struct imr_state *imr_state_alloc(void)
+{
+	struct imr_state *s = calloc(1, sizeof(*s));
+
+	return s;
+}
+
+void imr_state_free(struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++)
+		imr_object_free(s->objects[i]);
+
+	free(s->objects);
+	free(s->img);
+	free(s);
+}
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t)
+{
+	struct imr_object *o = calloc(1, sizeof(*o));
+
+	if (!o)
+		return NULL;
+
+	o->refcnt = 1;
+	o->type = t;
+	return o;
+}
+
+static struct imr_object *imr_object_copy(const struct imr_object *old)
+{
+	struct imr_object *o = imr_object_alloc(old->type);
+
+	if (!o)
+		return NULL;
+
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+	case IMR_OBJ_TYPE_IMMEDIATE:
+	case IMR_OBJ_TYPE_PAYLOAD:
+	case IMR_OBJ_TYPE_META:
+		memcpy(o, old, sizeof(*o));
+		o->refcnt = 1;
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		o->alu.left = imr_object_copy(old->alu.left);
+		o->alu.right = imr_object_copy(old->alu.right);
+		if (!o->alu.left || !o->alu.right) {
+			imr_object_free(o);
+			return NULL;
+		}
+		break;
+	}
+
+	o->len = old->len;
+	return o;
+}
+
+static struct imr_object *imr_object_split64(struct imr_object *to_split)
+{
+	struct imr_object *o = NULL;
+
+	if (to_split->len < sizeof(uint64_t))
+		internal_error("bogus split of size < uint64_t");
+
+	to_split->len -= sizeof(uint64_t);
+
+	switch (to_split->type) {
+	case IMR_OBJ_TYPE_IMMEDIATE: {
+		uint64_t tmp;
+
+		o = imr_object_copy(to_split);
+		o->imm.value64 = to_split->imm.value_large[0];
+
+		switch (to_split->len) {
+		case 0:
+			break;
+		case sizeof(uint32_t):
+			tmp = to_split->imm.value_large[1];
+			to_split->imm.value32 = tmp;
+			break;
+		case sizeof(uint64_t):
+			tmp = to_split->imm.value_large[1];
+			to_split->imm.value64 = tmp;
+			break;
+		default:
+			memmove(to_split->imm.value_large, &to_split->imm.value_large[1],
+				sizeof(to_split->imm.value_large) - sizeof(to_split->imm.value_large[0]));
+			break;
+		}
+		}
+		break;
+	case IMR_OBJ_TYPE_PAYLOAD:
+		o = imr_object_copy(to_split);
+		to_split->payload.offset += sizeof(uint64_t);
+		break;
+	case IMR_OBJ_TYPE_META:
+		internal_error("can't split meta");
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		o = imr_object_alloc(to_split->type);
+		o->alu.left = imr_object_split64(to_split->alu.left);
+		o->alu.right = imr_object_split64(to_split->alu.right);
+
+		if (!o->alu.left || !o->alu.right) {
+			imr_object_free(o);
+			return NULL; /* Can't recover */
+
+		}
+		break;
+	case IMR_OBJ_TYPE_VERDICT:
+		internal_error("can't split type");
+	}
+
+	if (o)
+		o->len = sizeof(uint64_t);
+	return o;
+}
+
+void imr_object_free(struct imr_object *o)
+{
+	if (!o)
+		return;
+
+	if (o->refcnt == 0) {
+		internal_error("double-free, refcnt already zero");
+		o->refcnt--;
+	}
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+	case IMR_OBJ_TYPE_IMMEDIATE:
+	case IMR_OBJ_TYPE_PAYLOAD:
+	case IMR_OBJ_TYPE_META:
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		imr_object_free(o->alu.left);
+		imr_object_free(o->alu.right);
+		break;
+	}
+
+	o->refcnt--;
+	if (o->refcnt > 0)
+		return;
+
+	free(o);
+}
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->imm.value32 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_imm64(uint64_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->imm.value64 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_imm(const uint32_t *data, unsigned int len)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+	unsigned int left = len;
+	int i = 0;
+
+	if (!o)
+		return NULL;
+
+	while (left >= sizeof(uint64_t)) {
+		uint64_t value = *data;
+
+		left -= sizeof(uint64_t);
+
+		value <<= 32;
+		data++;
+		value |= *data;
+		data++;
+
+		if (i >= ARRAY_SIZE(o->imm.value_large)) {
+			internal_error("value too large");
+			imr_object_free(o);
+			return NULL;
+		}
+		o->imm.value_large[i++] = value;
+	}
+
+	if (left) {
+		if (left != sizeof(uint32_t))
+			internal_error("values are expected in 4-byte chunks at least");
+
+		if (i >= ARRAY_SIZE(o->imm.value_large)) {
+			internal_error("value too large");
+			imr_object_free(o);
+			return NULL;
+		}
+		o->imm.value_large[i] = *data;
+	}
+
+	o->len = len;
+	return o;
+}
+
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_VERDICT);
+
+	if (!o)
+		return NULL;
+
+	o->verdict.verdict = v;
+	o->len = sizeof(v);
+
+	return o;
+}
+
+static const char *alu_op_to_str(enum imr_alu_op op)
+{
+	switch (op) {
+	case IMR_ALU_OP_EQ: return "eq";
+	case IMR_ALU_OP_NE: return "ne";
+	case IMR_ALU_OP_LT: return "<";
+	case IMR_ALU_OP_LTE: return "<=";
+	case IMR_ALU_OP_GT: return ">";
+	case IMR_ALU_OP_GTE: return ">=";
+	case IMR_ALU_OP_AND: return "&";
+	case IMR_ALU_OP_LSHIFT: return "<<";
+	}
+
+	return "?";
+}
+
+static const char *verdict_to_str(enum imr_verdict v)
+{
+	switch (v) {
+	case IMR_VERDICT_NONE: return "none";
+	case IMR_VERDICT_NEXT: return "next";
+	case IMR_VERDICT_PASS: return "pass";
+	case IMR_VERDICT_DROP: return "drop";
+	}
+
+	return "invalid";
+}
+
+static int imr_object_print_imm(FILE *fp, const struct imr_object *o)
+{
+	switch (o->len) {
+	case sizeof(uint64_t):
+		return fprintf(fp, "(0x%016llx)", (unsigned long long)o->imm.value64);
+	case sizeof(uint32_t):
+		return fprintf(fp, "(0x%08x)", (unsigned int)o->imm.value32);
+	default:
+		return fprintf(fp, "(0x%llx?)", (unsigned long long)o->imm.value64);
+	}
+}
+
+static const char *meta_to_str(enum imr_meta_key k)
+{
+	switch (k) {
+	case IMR_META_NFMARK:
+		return "nfmark";
+	case IMR_META_NFPROTO:
+		return "nfproto";
+	case IMR_META_L4PROTO:
+		return "l4proto";
+	}
+
+	return "unknown";
+}
+
+static const char *type_to_str(enum imr_obj_type t)
+{
+	switch (t) {
+	case IMR_OBJ_TYPE_VERDICT: return "verdict";
+	case IMR_OBJ_TYPE_IMMEDIATE: return "imm";
+	case IMR_OBJ_TYPE_PAYLOAD: return "payload";
+	case IMR_OBJ_TYPE_ALU: return "alu";
+	case IMR_OBJ_TYPE_META: return "meta";
+	}
+
+	return "unknown";
+}
+
+static int imr_object_print(FILE *fp, const struct imr_object *o)
+{
+	int ret, total = 0;
+
+	ret = fprintf(fp, "%s", type_to_str(o->type));
+	if (ret < 0)
+		return ret;
+	total += ret;
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		ret = fprintf(fp, "(%s)", verdict_to_str(o->verdict.verdict));
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_PAYLOAD:
+		ret = fprintf(fp, "(base %d, off %d, len %d)",
+				o->payload.base, o->payload.offset, o->len);
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		ret = imr_object_print_imm(fp, o);
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		ret = fprintf(fp, "(");
+		if (ret < 0)
+			break;
+		total += ret;
+		ret = imr_object_print(fp, o->alu.left);
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = fprintf(fp, " %s ", alu_op_to_str(o->alu.op));
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = imr_object_print(fp, o->alu.right);
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = fprintf(fp, ") ");
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_META:
+		ret = fprintf(fp, " %s ", meta_to_str(o->meta.key));
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	default:
+		internal_error("missing print support");
+		break;
+	}
+
+	return total;
+}
+
+void imr_state_print(FILE *fp, struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++) {
+		imr_object_print(fp, s->objects[i]);
+		putc('\n', fp);
+	}
+}
+
+struct imr_object *imr_object_alloc_meta(enum imr_meta_key k)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_META);
+
+	o->meta.key = k;
+
+	switch (k) {
+	case IMR_META_L4PROTO:
+		o->len = sizeof(uint16_t);
+		break;
+	case IMR_META_NFPROTO:
+		o->len = sizeof(uint8_t);
+		break;
+	case IMR_META_NFMARK:
+		o->len = sizeof(uint32_t);
+		break;
+	}
+
+	return o;
+}
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_PAYLOAD);
+
+	if (!o)
+		return NULL;
+
+	o->payload.base = b;
+	o->payload.offset = off;
+	if (len == 0)
+		internal_error("payload length is 0");
+
+	if (len > 16) {
+		imr_object_free(o);
+		return NULL;
+	}
+
+	o->len = len;
+
+	return o;
+}
+
+struct imr_object *imr_object_alloc_alu(enum imr_alu_op op, struct imr_object *l, struct imr_object *r)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_ALU);
+
+	if (!o)
+		return NULL;
+
+	if (l == r)
+		internal_error("same operands");
+
+	o->alu.op = op;
+	o->alu.left = l;
+	o->alu.right = r;
+
+	if (l->len == 0 || r->len == 0)
+		internal_error("alu op with 0 op length");
+
+	o->len = l->len;
+	if (r->len > o->len)
+		o->len = r->len;
+
+	return o;
+}
+
+static int imr_state_add_obj_alu(struct imr_state *s, struct imr_object *o)
+{
+	struct imr_object *old;
+
+	if (s->num_objects == 0 || o->len > sizeof(uint64_t))
+		return -EINVAL;
+
+	old = s->objects[s->num_objects - 1];
+
+	if (old->type != IMR_OBJ_TYPE_ALU)
+		return -EINVAL;
+	if (old->alu.left != o->alu.left)
+		return -EINVAL;
+
+	imr_object_free(o->alu.left);
+	o->alu.left = old;
+	s->objects[s->num_objects - 1] = o;
+
+	if (old->len != o->len)
+		internal_error("different op len but same src");
+	return 0;
+}
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o)
+{
+	struct imr_object **new;
+	uint32_t slot = s->num_objects;
+
+	if (s->num_objects >= 0xffff / sizeof(*o))
+		return -1;
+
+	if (o->type == IMR_OBJ_TYPE_ALU &&
+	    imr_state_add_obj_alu(s, o) == 0)
+		return 0;
+
+	s->num_objects++;
+
+	new = realloc(s->objects, sizeof(o) * s->num_objects);
+	if (!new) {
+		imr_object_free(o);
+		return -1;
+	}
+
+	new[slot] = o;
+	if (new != s->objects)
+		s->objects = new;
+
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s)
+{
+	uint32_t slot = s->num_objects;
+	struct imr_object *last;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(s->registers); i++) {
+		last = s->registers[i];
+		if (last)
+			imr_register_store(s, i, NULL);
+	}
+
+	if (slot == 0)
+		internal_error("rule end, but no objects present");
+	last = s->objects[slot - 1];
+
+	if (last->type == IMR_OBJ_TYPE_VERDICT)
+		return 0;
+
+	return imr_state_add_obj(s, imr_object_alloc_verdict(IMR_VERDICT_NEXT));
+}
+
+static int imr_jit_obj_immediate(struct imr_state *s,
+				 const struct imr_object *o)
+{
+	int bpf_reg = imr_register_get(s, o->len);
+
+	switch (o->len) {
+	case sizeof(uint32_t):
+		EMIT(s, BPF_MOV32_IMM(bpf_reg, o->imm.value32));
+		return 0;
+	case sizeof(uint64_t):
+		EMIT(s, BPF_LD_IMM64(bpf_reg, o->imm.value64));
+		return 0;
+	default:
+		break;
+	}
+
+	internal_error("unhandled immediate size");
+	return -EINVAL;
+}
+
+static int imr_jit_verdict(struct imr_state *s, int verdict)
+{
+	EMIT(s, BPF_MOV32_IMM(BPF_REG_0, verdict));
+	EMIT(s, BPF_EXIT_INSN());
+	return 0;
+}
+
+static int imr_jit_obj_verdict(struct imr_state *s,
+			       const struct imr_object *o)
+{
+	int verdict = o->verdict.verdict;
+
+	switch (o->verdict.verdict) {
+	case IMR_VERDICT_NEXT: /* no-op: continue with next rule */
+		return 0;
+	case IMR_VERDICT_PASS:
+		verdict = NF_ACCEPT;
+		break;
+	case IMR_VERDICT_DROP:
+		verdict = NF_DROP;
+		break;
+	case IMR_VERDICT_NONE:
+		verdict = -1; /* NFT_CONTINUE */
+		break;
+	default:
+		internal_error("unhandled verdict");
+	}
+
+	return imr_jit_verdict(s, verdict);
+}
+
+static unsigned int align_for_stack(uint16_t len)
+{
+	return div_round_up(len, sizeof(uint64_t)) * sizeof(uint64_t);
+}
+
+static int imr_reload_skb_data(struct imr_state *state)
+{
+	int tmp_reg = imr_register_alloc(state, sizeof(uint64_t));
+
+	/* headlen tells how many bytes we can expect to reside
+	 * in the skb linear area.
+	 *
+	 * Used to decide when to prefer direct access vs.
+	 * bpf equivalent of skb_header_pointer().
+	 */
+	EMIT(state, BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+			 offsetof(struct __sk_buff, data)));
+	EMIT(state, BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+			 offsetof(struct __sk_buff, data_end)));
+
+	EMIT(state, BPF_MOV64_REG(tmp_reg, BPF_REG_2));
+	EMIT(state, BPF_ALU64_IMM(BPF_ADD, tmp_reg, state->headlen));
+
+	/* This is so that verifier can mark accesses to
+	 * skb->data as safe provided they don't exceed data_end (R3).
+	 *
+	 * IMR makes sure it switches to bpf_skb_load_bytes helper for
+	 * accesses that are larger, else verifier rejects program.
+	 *
+	 * R3 and R4 are only used temporarily here, no need to preserve them.
+	 */
+	EMIT(state, BPF_JMP_REG(BPF_JLE, tmp_reg, BPF_REG_3, 2));
+
+	imr_register_release(state, sizeof(uint64_t));
+
+	/*
+	 * ((R2 (data) + headlen) > R3 data_end.
+	 * Should never happen for nf hook points, ip/ipv6 stack pulls
+	 * at least ip(6) header into linear area, and caller will
+	 * pass this header size as headlen.
+	 */
+	EMIT(state, BPF_MOV32_IMM(BPF_REG_0, NF_DROP));
+	EMIT(state, BPF_EXIT_INSN());
+	return 0;
+}
+
+static int imr_load_thoff(struct imr_state *s, int bpfreg)
+{
+	/* fetch 16bit off cb[0] */
+	EMIT(s, BPF_LDX_MEM(BPF_H, bpfreg, BPF_REG_1, offsetof(struct __sk_buff, cb[0])));
+	return 0;
+}
+
+static int imr_maybe_reload_skb_data(struct imr_state *state)
+{
+	if (state->reload_r2) {
+		state->reload_r2 = false;
+		return imr_reload_skb_data(state);
+	}
+
+	return 0;
+}
+
+/*
+ * Though R10 is correct read-only register and has type PTR_TO_STACK
+ * and R10 - 4 is within stack bounds, there were no stores into that location.
+ */
+static int bpf_skb_load_bytes(struct imr_state *state,
+			      uint16_t offset, uint16_t olen,
+			      int bpf_reg_hdr_off)
+{
+	int len = align_for_stack(olen);
+	int tmp_reg;
+
+	tmp_reg = imr_register_alloc(state, sizeof(uint64_t));
+	if (tmp_reg < 0)
+		return -ENOSPC;
+
+	EMIT(state, BPF_MOV64_IMM(BPF_REG_2, offset));
+	state->reload_r2 = true;
+
+	EMIT(state, BPF_ALU64_REG(BPF_ADD, BPF_REG_2, bpf_reg_hdr_off));
+
+	EMIT(state, BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10));
+	EMIT(state, BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -len));
+
+	EMIT(state, BPF_MOV64_IMM(BPF_REG_4, olen));
+
+	EMIT(state, BPF_MOV64_REG(tmp_reg, BPF_REG_1));
+
+	EMIT(state, BPF_EMIT_CALL(BPF_FUNC_skb_load_bytes));
+
+	/* 0: ok, so move to next rule on error */
+	EMIT(state, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0));
+
+	EMIT(state, BPF_MOV64_REG(BPF_REG_1, tmp_reg));
+	imr_register_release(state, sizeof(uint64_t));
+	return 0;
+}
+
+static int imr_jit_obj_payload(struct imr_state *state,
+			       const struct imr_object *o)
+{
+	int base = o->payload.base;
+	int offset = o->payload.offset;
+	int bpf_width = bpf_reg_width(o->len);
+	int bpf_reg = imr_register_get(state, o->len);
+	int ret, bpf_reg_hdr_off;
+
+	switch (base) {
+	case IMR_PAYLOAD_BASE_LL: /* XXX: */
+		internal_error("can't handle ll yet");
+		return -ENOTSUP;
+	case IMR_PAYLOAD_BASE_NH:
+		if (state->base == base &&
+		    offset <= state->headlen) {
+			ret = imr_maybe_reload_skb_data(state);
+			if (ret < 0)
+				return ret;
+			EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_2, offset));
+			return 0;
+		}
+		/* XXX: use bpf_load_bytes helper if offset is too big */
+		internal_error("can't handle nonlinear yet");
+		return -ENOTSUP;
+	case IMR_PAYLOAD_BASE_TH:
+		if (o->len > sizeof(uint64_t))
+			internal_error("can't handle size exceeding 8 bytes");
+
+		bpf_reg_hdr_off = imr_register_alloc(state, sizeof(uint16_t));
+		if (bpf_reg_hdr_off < 0)
+			return -ENOSPC;
+
+		ret = imr_load_thoff(state, bpf_reg_hdr_off);
+		if (ret < 0) {
+			imr_register_release(state, sizeof(uint16_t));
+			return ret;
+		}
+
+		ret = bpf_skb_load_bytes(state, offset,
+						o->len, bpf_reg_hdr_off);
+		imr_register_release(state, sizeof(uint16_t));
+
+		if (ret)
+			return ret;
+
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_10,
+		     - align_for_stack(o->len)));
+		return 0;
+	}
+
+	internal_error("invalid base");
+	return -ENOTSUP;
+}
+
+static void imr_fixup_jumps(struct imr_state *state, unsigned int poc_start)
+{
+	unsigned int pc, pc_end, i;
+
+	if (poc_start >= state->len_cur)
+		internal_error("old poc >= current one");
+
+	pc = 0;
+	pc_end = state->len_cur - poc_start;
+
+	for (i = poc_start; pc < pc_end; pc++, i++) {
+		if (BPF_CLASS(state->img[i].code) == BPF_JMP) {
+			if (state->img[i].code == (BPF_EXIT | BPF_JMP))
+				continue;
+			if (state->img[i].code == (BPF_CALL | BPF_JMP))
+				continue;
+
+			if (state->img[i].off)
+				continue;
+			state->img[i].off = pc_end - pc - 1;
+		}
+	}
+}
+
+
+#if 0
+static void nft_cmp_eval(const struct nft_expr *expr,
+                         struct nft_regs *regs,
+                         const struct nft_pktinfo *pkt)
+{
+        const struct nft_cmp_expr *priv = nft_expr_priv(expr);
+        int d;
+
+        d = memcmp(&regs->data[priv->sreg], &priv->data, priv->len);
+        switch (priv->op) {
+        case NFT_CMP_EQ:
+                if (d != 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_NEQ:
+                if (d == 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_LT:
+                if (d == 0)
+                        goto mismatch;
+                /* fall through */
+        case NFT_CMP_LTE:
+                if (d > 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_GT:
+                if (d == 0)
+                        goto mismatch;
+                /* fall through */
+        case NFT_CMP_GTE:
+                if (d < 0)
+                        goto mismatch;
+                break;
+        }
+        return;
+
+mismatch:
+        regs->verdict.code = NFT_BREAK;
+}
+#endif
+
+static int __imr_jit_memcmp_sub64(struct imr_state *state,
+				  struct imr_object *sub,
+				  int regl)
+{
+	int ret = imr_jit_object(state, sub->alu.left);
+	int regr = imr_register_alloc(state, sizeof(uint64_t));
+
+	if (ret < 0)
+		return ret;
+
+	ret = imr_jit_object(state, sub->alu.right);
+	if (ret == 0) {
+		EMIT(state, BPF_ALU64_REG(BPF_SUB, regl, regr));
+	}
+	imr_register_release(state, sizeof(uint64_t));
+	return ret;
+}
+
+static int __imr_jit_memcmp_sub32(struct imr_state *state,
+				  struct imr_object *sub,
+				  int regl)
+{
+	const struct imr_object *right = sub->alu.right;
+	int regr, ret = imr_jit_object(state, sub->alu.left);
+
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE && right->len) {
+		EMIT(state, BPF_ALU32_IMM(BPF_SUB, regl, right->imm.value32));
+		return 0;
+	}
+
+	regr = imr_register_alloc(state, sizeof(uint32_t));
+	if (ret < 0)
+		return ret;
+
+	ret = imr_jit_object(state, right);
+	if (ret < 0) {
+		imr_register_release(state, sizeof(uint32_t));
+		return ret;
+	}
+
+	EMIT(state, BPF_ALU32_REG(BPF_SUB, regl, regr));
+	return 0;
+}
+
+static int imr_jit_alu_bigcmp(struct imr_state *state, const struct imr_object *o)
+{
+	struct imr_object *copy = imr_object_copy(o);
+	unsigned int start_insn = state->len_cur;
+	int regl, ret;
+
+	if (!copy)
+		return -ENOMEM;
+
+	regl = imr_register_alloc(state, sizeof(uint64_t));
+	do {
+		struct imr_object *tmp;
+
+		tmp = imr_object_split64(copy);
+		if (!tmp) {
+			imr_register_release(state, sizeof(uint64_t));
+			imr_object_free(copy);
+			return -ENOMEM;
+		}
+
+		ret = __imr_jit_memcmp_sub64(state, tmp, regl);
+		imr_object_free(tmp);
+		if (ret < 0) {
+			imr_register_release(state, sizeof(uint64_t));
+			imr_object_free(copy);
+			return ret;
+		}
+		/* XXX: 64bit */
+		EMIT(state, BPF_JMP_IMM(BPF_JNE, regl, 0, 0));
+	} while (copy->len >= sizeof(uint64_t));
+
+	if (copy->len && copy->len != sizeof(uint64_t)) {
+		ret = __imr_jit_memcmp_sub32(state, copy, regl);
+
+		if (ret < 0) {
+			imr_object_free(copy);
+			imr_register_release(state, sizeof(uint64_t));
+			return ret;
+		}
+	}
+
+	imr_object_free(copy);
+	imr_fixup_jumps(state, start_insn);
+
+	switch (o->alu.op) {
+	case IMR_ALU_OP_AND:
+	case IMR_ALU_OP_LSHIFT:
+		internal_error("not a jump");
+	case IMR_ALU_OP_EQ:
+	case IMR_ALU_OP_NE:
+	case IMR_ALU_OP_LT:
+	case IMR_ALU_OP_LTE:
+	case IMR_ALU_OP_GT:
+	case IMR_ALU_OP_GTE:
+		EMIT(state, BPF_JMP_IMM(alu_jmp_get_negated_bpf_opcode(o->alu.op), regl, 0, 0));
+		break;
+	}
+
+	imr_register_release(state, sizeof(uint64_t));
+	return 0;
+}
+
+static int __imr_jit_obj_alu_jmp(struct imr_state *state,
+			         const struct imr_object *o,
+				 int regl)
+{
+	const struct imr_object *right;
+	int regr;	/* int, not enum: imr_register_alloc() may return < 0 */
+	int op, ret;
+
+	right = o->alu.right;
+
+	op = alu_jmp_get_negated_bpf_opcode(o->alu.op);
+
+	/* avoid 2nd register if possible */
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE) {
+		switch (right->len) {
+		case sizeof(uint32_t):
+			EMIT(state, BPF_JMP_IMM(op, regl, right->imm.value32, 0));
+			return 0;
+		}
+	}
+
+	regr = imr_register_alloc(state, right->len);
+	if (regr < 0)
+		return -ENOSPC;
+
+	ret = imr_jit_object(state, right);
+	if (ret == 0) {
+		EMIT(state, BPF_MOV32_IMM(BPF_REG_0, -2)); /* NFT_BREAK */
+		EMIT(state, BPF_JMP_REG(op, regl, regr, 0));
+	}
+
+	imr_register_release(state, right->len);
+	return ret;
+}
+
+static int imr_jit_obj_alu_jmp(struct imr_state *state,
+			       const struct imr_object *o,
+			       int regl)
+
+{
+	int ret;
+
+	/* multiple tests on same source? */
+	if (o->alu.left->type == IMR_OBJ_TYPE_ALU) {
+		ret = imr_jit_obj_alu_jmp(state, o->alu.left, regl);
+		if (ret < 0)
+			return ret;
+	} else {
+		ret = imr_jit_object(state, o->alu.left);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = __imr_jit_obj_alu_jmp(state, o, regl);
+
+	return ret;
+}
+
+static int imr_jit_obj_alu(struct imr_state *state, const struct imr_object *o)
+{
+	const struct imr_object *right;
+	int regl;	/* int, not enum: register helpers may return < 0 */
+	int ret, op;
+
+
+	switch (o->alu.op) {
+	case IMR_ALU_OP_AND:
+		op = BPF_AND;
+		break;
+	case IMR_ALU_OP_LSHIFT:
+		op = BPF_LSH;
+		break;
+	case IMR_ALU_OP_EQ:
+	case IMR_ALU_OP_NE:
+	case IMR_ALU_OP_LT:
+	case IMR_ALU_OP_LTE:
+	case IMR_ALU_OP_GT:
+	case IMR_ALU_OP_GTE:
+		if (o->len > sizeof(uint64_t))
+			return imr_jit_alu_bigcmp(state, o);
+
+		regl = imr_register_alloc(state, o->len);
+		if (regl < 0)
+			return -ENOSPC;
+
+		ret = imr_jit_obj_alu_jmp(state, o, regl);
+		imr_register_release(state, o->len);
+		return ret;
+	}
+
+	ret = imr_jit_object(state, o->alu.left);
+	if (ret)
+		return ret;
+
+	regl = imr_register_get(state, o->len);
+	if (regl < 0)
+		return -EINVAL;
+
+	right = o->alu.right;
+
+	/* avoid 2nd register if possible */
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE) {
+		switch (right->len) {
+		case sizeof(uint32_t):
+			EMIT(state, BPF_ALU32_IMM(op, regl, right->imm.value32));
+			return 0;
+		}
+	}
+
+	internal_error("alu bitops only handle 32bit immediate RHS");
+	return -EINVAL;
+}
+
+static int imr_jit_obj_meta(struct imr_state *state, const struct imr_object *o)
+{
+	int bpf_reg = imr_register_get(state, o->len);
+	int bpf_width = bpf_reg_width(o->len);
+	int ret;
+
+	switch (o->meta.key) {
+	case IMR_META_NFMARK:
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_1,
+					 offsetof(struct __sk_buff, mark)));
+		break;
+	case IMR_META_L4PROTO:
+		ret = imr_load_thoff(state, bpf_reg);
+		if (ret < 0)
+			return ret;
+
+		EMIT(state, BPF_JMP_IMM(BPF_JEQ, bpf_reg, 0, 0)); /* th == 0? L4PROTO undefined. */
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_1,
+					 offsetof(struct __sk_buff, cb[1])));
+		break;
+	case IMR_META_NFPROTO:
+		switch (state->nfproto) {
+		case NFPROTO_IPV4:
+		case NFPROTO_IPV6:
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, state->nfproto));
+			break;
+		case NFPROTO_INET:	/* first need to check ihl->version */
+			ret = imr_maybe_reload_skb_data(state);
+			if (ret < 0)
+				return ret;
+
+			/* bpf_reg = iph->version & 0xf0 */
+			EMIT(state, BPF_LDX_MEM(BPF_B, bpf_reg, BPF_REG_2, 0));		/* ihl->version/hdrlen */
+			EMIT(state, BPF_ALU32_IMM(BPF_AND, bpf_reg, 0xf0));		/* retain version */
+
+			EMIT(state, BPF_JMP_IMM(BPF_JNE, bpf_reg, 4 << 4, 2));		/* ipv4? */
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, NFPROTO_IPV4));
+			EMIT(state, BPF_JMP_IMM(BPF_JA, 0, 0, 5));			/* skip NF_DROP */
+
+			EMIT(state, BPF_JMP_IMM(BPF_JNE, bpf_reg, 6 << 4, 4));		/* ipv6? */
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, NFPROTO_IPV6));
+			EMIT(state, BPF_JMP_IMM(BPF_JA, 0, 0, 2));			/* skip NF_DROP */
+
+			EMIT(state, BPF_MOV32_IMM(BPF_REG_0, NF_DROP));
+			EMIT(state, BPF_EXIT_INSN());
+			/* Not ipv4, not ipv6? Should not happen: INET hooks from ipv4/ipv6 stack */
+			break;
+		default:
+			internal_error("unsupported family");
+		}
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int imr_jit_object(struct imr_state *s, const struct imr_object *o)
+{
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		return imr_jit_obj_verdict(s, o);
+	case IMR_OBJ_TYPE_PAYLOAD:
+		return imr_jit_obj_payload(s, o);
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		return imr_jit_obj_immediate(s, o);
+	case IMR_OBJ_TYPE_ALU:
+		return imr_jit_obj_alu(s, o);
+	case IMR_OBJ_TYPE_META:
+		return imr_jit_obj_meta(s, o);
+	}
+
+	return -EINVAL;
+}
+
+static int imr_jit_rule(struct imr_state *state, int i)
+{
+	unsigned int start, end, count, len_cur;
+
+	end = state->num_objects;
+	if (i >= end)
+		return -EINVAL;
+
+	len_cur = state->len_cur;
+
+	start = i;
+	count = 0;
+
+	for (i = start; i < end; i++) {
+		int ret = imr_jit_object(state, state->objects[i]);
+
+		if (ret < 0) {
+			fprintf(stderr, "failed to JIT object type %d\n",  state->objects[i]->type);
+			return ret;
+		}
+
+		count++;
+
+		if (state->objects[i]->type == IMR_OBJ_TYPE_VERDICT)
+			break;
+	}
+
+	if (i == end)
+		internal_error("no verdict found in rule");
+
+	imr_fixup_jumps(state, len_cur);
+
+	return count;
+}
+
+/* R0: return value.
+ * R1: __sk_buff (BPF_RUN_PROG() argument).
+ * R2-R5 are unused (caller-saved registers).
+ *   imr_state_init sets R2 to be start of skb->data.
+ * R2-R5 are invalidated after BPF function calls.
+ *
+ * R6-R9 are callee saved registers.
+ */
+int imr_state_init(struct imr_state *state, int family)
+{
+	if (!state->img) {
+		state->img = calloc(BPF_MAXINSNS, sizeof(struct bpf_insn));
+		if (!state->img)
+			return -ENOMEM;
+	}
+
+	state->len_cur = 0;
+	state->nfproto = family;
+
+	switch (family) {
+	case NFPROTO_INET:
+	case NFPROTO_IPV4:
+		state->headlen = sizeof(struct iphdr);
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	case NFPROTO_IPV6:
+		state->headlen = sizeof(struct ip6_hdr);
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	default:
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	}
+
+	if (state->headlen) {
+		int ret = imr_reload_skb_data(state);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+struct bpf_insn	*imr_translate(struct imr_state *s, unsigned int *insn_count)
+{
+	struct bpf_insn *img;
+	int ret = 0, i = 0;
+
+	if (!s->img) {
+		ret = imr_state_init(s, s->nfproto);
+		if (ret < 0)
+			return NULL;
+	}
+
+	/* Only use R6..R9 for now to simplify helper calls (R1..R5 will be clobbered) */
+	s->regcount = 6;
+
+	do {
+		int insns = imr_jit_rule(s, i);
+		if (insns < 0) {
+			ret = insns;
+			break;
+		}
+		if (insns == 0)
+			internal_error("rule jit yields 0 insns");
+
+		i += insns;
+	} while (i < s->num_objects);
+
+	if (ret != 0)
+		return NULL;
+
+	ret = imr_jit_verdict(s, -2); /* XXX: policy support. -2: NFT_BREAK */
+	if (ret < 0)
+		return NULL;
+
+	*insn_count = s->len_cur;
+	img = s->img;
+
+	s->img = NULL;
+	s->len_cur = 0;
+
+	return img;
+}
diff --git a/net/netfilter/nf_tables_jit/imr.h b/net/netfilter/nf_tables_jit/imr.h
new file mode 100644
index 000000000000..7ebbf78526f9
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/imr.h
@@ -0,0 +1,96 @@
+#ifndef IMR_HDR
+#define IMR_HDR
+#include <stdint.h>
+#include <stdio.h>
+
+/* map 1:1 to BPF regs. */
+enum imr_reg_num {
+	IMR_REG_0,
+	IMR_REG_1,
+	IMR_REG_2,
+	IMR_REG_3,
+	IMR_REG_4,
+	IMR_REG_5,
+	IMR_REG_6,
+	IMR_REG_7,
+	IMR_REG_8,
+	IMR_REG_9,
+	/* R10 is frame pointer */
+	IMR_REG_COUNT,
+};
+
+struct imr_state;
+struct imr_object;
+
+enum imr_obj_type {
+	IMR_OBJ_TYPE_VERDICT,
+	IMR_OBJ_TYPE_IMMEDIATE,
+	IMR_OBJ_TYPE_PAYLOAD,
+	IMR_OBJ_TYPE_ALU,
+	IMR_OBJ_TYPE_META,
+};
+
+enum imr_alu_op {
+	IMR_ALU_OP_EQ,
+	IMR_ALU_OP_NE,
+	IMR_ALU_OP_LT,
+	IMR_ALU_OP_LTE,
+	IMR_ALU_OP_GT,
+	IMR_ALU_OP_GTE,
+	IMR_ALU_OP_AND,
+	IMR_ALU_OP_LSHIFT,
+};
+
+enum imr_verdict {
+	IMR_VERDICT_NONE,	/* partially translated rule, no verdict */
+	IMR_VERDICT_NEXT,	/* move to next rule */
+	IMR_VERDICT_PASS,	/* end processing, accept packet */
+	IMR_VERDICT_DROP,	/* end processing, drop packet */
+};
+
+enum imr_payload_base {
+	IMR_PAYLOAD_BASE_INVALID,
+	IMR_PAYLOAD_BASE_LL,
+	IMR_PAYLOAD_BASE_NH,
+	IMR_PAYLOAD_BASE_TH,
+};
+
+enum imr_meta_key {
+	IMR_META_L4PROTO,
+	IMR_META_NFPROTO,
+	IMR_META_NFMARK,
+};
+
+struct imr_state *imr_state_alloc(void);
+void imr_state_free(struct imr_state *s);
+void imr_state_print(FILE *fp, struct imr_state *s);
+
+static inline int imr_state_rule_begin(struct imr_state *s)
+{
+	/* nothing for now */
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s);
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num r, struct imr_object *o);
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num r);
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t);
+void imr_object_free(struct imr_object *o);
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value);
+struct imr_object *imr_object_alloc_imm64(uint64_t value);
+struct imr_object *imr_object_alloc_imm(const uint32_t *data, unsigned int len);
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v);
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len);
+struct imr_object *imr_object_alloc_alu(enum imr_alu_op op, struct imr_object *l, struct imr_object *r);
+struct imr_object *imr_object_alloc_meta(enum imr_meta_key k);
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o);
+
+int imr_state_init(struct imr_state *state, int family);
+struct bpf_insn	*imr_translate(struct imr_state *s, unsigned int *insn_count);
+
+#endif /* IMR_HDR */
diff --git a/net/netfilter/nf_tables_jit/main.c b/net/netfilter/nf_tables_jit/main.c
index 6f6a4423c2e4..42b9d6d5d1fb 100644
--- a/net/netfilter/nf_tables_jit/main.c
+++ b/net/netfilter/nf_tables_jit/main.c
@@ -1,20 +1,579 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <time.h>
+#include <string.h>
+#include <netinet/in.h>
+#include <errno.h>
 
-int main(void)
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <linux/netfilter/nfnetlink.h>
+
+#include <libmnl/libmnl.h>
+#include <libnftnl/common.h>
+#include <libnftnl/ruleset.h>
+#include <libnftnl/table.h>
+#include <libnftnl/chain.h>
+#include <libnftnl/set.h>
+#include <libnftnl/expr.h>
+#include <libnftnl/rule.h>
+
+#include <linux/if_ether.h>
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+
+#include "imr.h"
+
+struct nft_jit_data_from_user {
+	int ebpf_fd;		/* fd to get program from, or < 0 if jitter error */
+	uint32_t expr_count;	/* number of translated expressions */
+};
+
+static FILE *log_file;
+#define NFTNL_EXPR_EBPF_FD      NFTNL_EXPR_BASE
+
+static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
 {
-	static struct {
-		int fd, count;
-	} response;
+#ifndef __NR_bpf
+#define __NR_bpf 321 /* x86_64 */
+#endif
+	return syscall(__NR_bpf, cmd, attr, size);
+}
 
-	response.fd = -1;
-	for (;;) {
-		char buf[8192];
+struct nft_ebpf_prog {
+	enum bpf_prog_type type;
+	const struct bpf_insn *insn;
+	unsigned int len;
+};
+
+struct cb_args {
+	unsigned int buflen;
+	uint32_t exprs_seen;
+	uint32_t stmt_exprs;
+	struct imr_state *s;
+	int fd;
+};
+
+static void memory_allocation_error(void) { perror("allocation failed"); exit(1); }
+
+static int bpf_prog_load(const struct nft_ebpf_prog *prog)
+{
+	union bpf_attr attr = {};
+	char *log;
+	int ret;
+
+	attr.prog_type  = prog->type;
+	attr.insns      = (uint64_t)prog->insn;
+	attr.insn_cnt   = prog->len;
+	attr.license    = (uint64_t)("GPL");
+
+	log = malloc(8192);
+	attr.log_buf = (uint64_t)log;
+	attr.log_size = 8192;
+	attr.log_level = 1;
+
+	ret = bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+	if (ret < 0)
+		fprintf(log_file, "bpf errlog: %s\n", log);
+	free(log);
+	return ret;
+}
+
+
+static int nft_reg_to_imr_reg(int nfreg)
+{
+	switch (nfreg) {
+	case NFT_REG_VERDICT:
+		return IMR_REG_0;
+	/* old register numbers, 4 128 bit registers. */
+	case NFT_REG_1:
+		return IMR_REG_4;
+	case NFT_REG_2:
+		return IMR_REG_6;
+	case NFT_REG_3:
+		return IMR_REG_8;
+	case NFT_REG_4:
+		break;
+#ifdef NFT_REG32_SIZE
+	/* new register numbers, 16 32 bit registers, map to old ones */
+	case NFT_REG32_00:
+		return IMR_REG_4;
+	case NFT_REG32_01:
+		return IMR_REG_5;
+	case NFT_REG32_02:
+		return IMR_REG_6;
+#endif
+	default:
+		return -1;
+	}
+	return -1;
+}
+
+static int netlink_parse_immediate(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	struct imr_object *o = NULL;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_DATA)) {
+		uint32_t len;
+		int reg;
+
+		nftnl_expr_get(nle, NFTNL_EXPR_IMM_DATA, &len);
+
+		switch (len) {
+		case sizeof(uint32_t):
+			o = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		case sizeof(uint64_t):
+			o = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		default:
+			return -ENOTSUP;
+		}
+		reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle,
+					     NFTNL_EXPR_IMM_DREG));
+		if (reg < 0) {
+			imr_object_free(o);
+			return reg;
+		}
+
+		imr_register_store(state, reg, o);
+		return 0;
+	} else if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_VERDICT)) {
+		uint32_t verdict;
+		int ret;
+
+		if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_CHAIN))
+			return -ENOTSUP;
+
+		verdict = nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_VERDICT);
+
+		switch (verdict) {
+		case NF_ACCEPT:
+			o = imr_object_alloc_verdict(IMR_VERDICT_PASS);
+			break;
+		case NF_DROP:
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		default:
+			fprintf(log_file, "Unhandled verdict %d\n", verdict);
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		}
+
+		ret = imr_state_add_obj(state, o);
+		if (ret < 0)
+			imr_object_free(o);
+
+		return ret;
+	}
+
+	return -ENOTSUP;
+}
+
+static int netlink_parse_cmp(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_object *o, *imm, *left;
+	const uint32_t *raw;
+	uint32_t tmp, len;
+	struct imr_state *state = out;
+	enum imr_alu_op op;
+	int ret;
+	op = nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_OP);
+
+	switch (op) {
+	case NFT_CMP_EQ:
+		op = IMR_ALU_OP_EQ;
+		break;
+	case NFT_CMP_NEQ:
+		op = IMR_ALU_OP_NE;
+		break;
+	case NFT_CMP_LT:
+		op = IMR_ALU_OP_LT;
+		break;
+	case NFT_CMP_LTE:
+		op = IMR_ALU_OP_LTE;
+		break;
+	case NFT_CMP_GT:
+		op = IMR_ALU_OP_GT;
+		break;
+	case NFT_CMP_GTE:
+		op = IMR_ALU_OP_GTE;
+		break;
+	default:
+		return -ENOTSUP;
+	}
+
+	raw = nftnl_expr_get(nle, NFTNL_EXPR_CMP_DATA, &len);
+	switch (len) {
+	case sizeof(uint64_t):
+		imm = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint32_t):
+		imm = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint16_t):
+		tmp = nftnl_expr_get_u16(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	case sizeof(uint8_t):
+		tmp = nftnl_expr_get_u8(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	default:
+		imm = imr_object_alloc_imm(raw, len);
+		break;
+	}
+
+	if (!imm)
+		return -ENOMEM;
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_SREG));
+	if (ret < 0) {
+		imr_object_free(imm);
+		return ret;
+	}
+
+	left = imr_register_load(state, ret);
+	if (!left) {
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		return -EINVAL;
+	}
+
+	o = imr_object_alloc_alu(op, left, imm);
+
+	return imr_state_add_obj(state, o);
+}
+
+static int netlink_parse_meta(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	struct imr_object *meta;
+	enum imr_meta_key key;
+	int ret;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_META_SREG))
+		return -EOPNOTSUPP;
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_META_DREG));
+	if (ret < 0)
+		return ret;
+
+	switch (nftnl_expr_get_u32(nle, NFTNL_EXPR_META_KEY)) {
+	case NFT_META_NFPROTO:
+		key = IMR_META_NFPROTO;
+		break;
+	case NFT_META_L4PROTO:
+		key = IMR_META_L4PROTO;
+		break;
+	case NFT_META_MARK:
+		key = IMR_META_NFMARK;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	meta = imr_object_alloc_meta(key);
+	if (!meta)
+		return -ENOMEM;
+
+	imr_register_store(state, ret, meta);
+	return 0;
+}
+
+static int netlink_parse_payload(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	enum imr_payload_base imr_base;
+	uint32_t base, offset, len;
+	struct imr_object *payload;
+	int ret;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_SREG) ||
+	    nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_FLAGS))
+		return -EOPNOTSUPP;
+
+	base = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_BASE);
+	offset = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_OFFSET);
+	len = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_LEN);
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_DREG));
+	if (ret < 0)
+		return ret;
+
+	switch (base) {
+	case NFT_PAYLOAD_LL_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_LL;
+		break;
+	case NFT_PAYLOAD_NETWORK_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_NH;
+		break;
+	case NFT_PAYLOAD_TRANSPORT_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_TH;
+		break;
+	default:
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		return -EINVAL;
+	}
+
+	payload = imr_object_alloc_payload(imr_base, offset, len);
+	if (!payload)
+		return -ENOMEM;
+
+	imr_register_store(state, ret, payload);
+	return 0;
+}
+
+static int netlink_parse_bitwise(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_object *imm, *alu, *left;
+	struct imr_state *state = out;
+	uint32_t len_mask, len_xor;
+	int reg;
+
+	reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_SREG));
+	if (reg < 0)
+		return reg;
+
+	left = imr_register_load(state, reg);
+	if (!left) {
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		return -EINVAL;
+	}
+
+	nftnl_expr_get(nle, NFTNL_EXPR_BITWISE_XOR, &len_xor);
+	switch (len_xor) {
+	case sizeof(uint32_t):
+		if (nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_XOR) != 0)
+			return -EOPNOTSUPP;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	nftnl_expr_get(nle, NFTNL_EXPR_BITWISE_MASK, &len_mask);
+	switch (len_mask) {
+	case sizeof(uint32_t):
+		imm = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_MASK));
+		if (!imm)
+			return -ENOMEM;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	alu = imr_object_alloc_alu(IMR_ALU_OP_AND, left, imm);
+	if (!alu) {
+		imr_object_free(imm);
+		return -ENOMEM;
+	}
+
+	imr_register_store(state, reg, alu);
+	return 0;
+}
+
+static const struct {
+	const char *name;
+	int (*parse)(const struct nftnl_expr *nle,
+				 void *);
+} netlink_parsers[] = {
+	{ .name = "immediate",	.parse = netlink_parse_immediate },
+	{ .name = "cmp",	.parse = netlink_parse_cmp },
+	{ .name = "payload",	.parse = netlink_parse_payload },
+	{ .name = "bitwise",	.parse = netlink_parse_bitwise },
+	{ .name = "meta",	.parse = netlink_parse_meta },
+};
+
+static int expr_parse_cb(struct nftnl_expr *expr, void *data)
+{
+	const char *name = nftnl_expr_get_str(expr, NFTNL_EXPR_NAME);
+	struct cb_args *args = data;
+	struct imr_state *state = args->s;
+	unsigned int i;
+
+	if (!name)
+		return -1;
+
+	for (i = 0; i < MNL_ARRAY_SIZE(netlink_parsers); i++) {
+		int ret;
+
+		if (strcmp(netlink_parsers[i].name, name))
+			continue;
+
+		ret = netlink_parsers[i].parse(expr, state);
+		if (ret == 0) {
+			args->exprs_seen++;
+
+			if (strcmp(netlink_parsers[i].name, "cmp") == 0 ||
+			    strcmp(netlink_parsers[i].name, "immediate") == 0) {
+
+				args->stmt_exprs += args->exprs_seen;
+				args->exprs_seen = 0;
+			}
+		}
+
+		fprintf(log_file, "parse: %s got %d\n", name, ret);
+		return ret;
+	}
+
+	fprintf(log_file, "cannot handle expression %s\n", name);
+	return -EOPNOTSUPP;
+}
+
+static int nlmsg_parse_newrule(const struct nlmsghdr *nlh, struct cb_args *args)
+{
+	struct nft_ebpf_prog prog;
+	struct imr_state *state;
+	struct nftnl_rule *rule;
+	int ret = -ENOMEM;
+
+	rule = nftnl_rule_alloc();
+	if (!rule)
+		memory_allocation_error();
+
+	if (nftnl_rule_nlmsg_parse(nlh, rule) < 0)
+		goto err_free;
+
+	state = imr_state_alloc();
+	if (!state)
+		goto err_free;
+
+	ret = imr_state_init(state,
+			     nftnl_rule_get_u32(rule, NFTNL_RULE_FAMILY));
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
 
-		if (read(0, buf, sizeof(buf)) < 0)
-			return 1;
-		if (write(1, &response, sizeof(response)) != sizeof(response))
-			return 2;
+	ret = imr_state_rule_begin(state);
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
+
+	args->s = state;
+	ret = nftnl_expr_foreach(rule, expr_parse_cb, args);
+	if (ret == 0) {
+		fprintf(log_file, "completed translation, %d stmt_exprs and %d partial\n",
+				  args->stmt_exprs, args->exprs_seen);
+	} else {
+		fprintf(log_file, "failed translation, %d stmt_exprs and %d partial\n",
+				  args->stmt_exprs, args->exprs_seen);
+		if (args->stmt_exprs) {
+			ret = imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_NONE));
+			if (ret < 0) {
+				imr_state_free(state);
+				goto err_free;
+			}
+		}
+	}
+
+	ret = imr_state_rule_end(state);
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
+
+	imr_state_print(log_file, state);
+
+	if (args->stmt_exprs) {
+		prog.type = BPF_PROG_TYPE_SCHED_CLS;
+		prog.insn = imr_translate(state, &prog.len);
+
+		imr_state_free(state);
+		if (!prog.insn)
+			goto err_free;
+
+		args->fd = bpf_prog_load(&prog);
+		free((void*)prog.insn);
+		if (args->fd < 0)
+			goto err_free;
+		ret = 0;
+	} else {
+		imr_state_free(state);
+	}
+
+err_free:
+	nftnl_rule_free(rule);
+	return ret;
+}
+
+static int nlmsg_parse(const struct nlmsghdr *nlh, void *data)
+{
+	struct cb_args *args = data;
+
+	fprintf(log_file, "%s:%d, buflen %d, nlh %d, nl len %d\n", __FILE__, __LINE__,
+		(int)args->buflen, (int)sizeof(*nlh) , (int) nlh->nlmsg_len);
+	if (args->buflen < sizeof(*nlh) || args->buflen < nlh->nlmsg_len) {
+		// nftjit.c:517, buflen 428, nlh 16, nl len 20
+		fprintf(log_file, "%s:%d- ERROR: buflen %d, nlh %d, nl len %d\n", __FILE__, __LINE__,
+		(int)args->buflen, (int)sizeof(*nlh) , (int) nlh->nlmsg_len);
+		return -EINVAL;
+	}
+
+	switch (NFNL_MSG_TYPE(nlh->nlmsg_type)) {
+	case NFT_MSG_NEWRULE:
+		return nlmsg_parse_newrule(nlh, args);
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int doit(void)
+{
+	struct cb_args args;
+	struct nft_jit_data_from_user to_kernel = { .ebpf_fd = -1 };
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	ssize_t len;
+	int ret;
+
+	fprintf(log_file, "block in read, pid %d\n", (int) getpid());
+	len = read(0, buf, sizeof(buf));
+	if (len <= 0)
+		return 1;
+
+	memset(&args, 0, sizeof(args));
+	args.buflen = len;
+	args.fd = -1;
+
+	ret = len;
+	ret = mnl_cb_run(buf, ret, 0, 0, nlmsg_parse, &args);
+	to_kernel.ebpf_fd = args.fd;
+	to_kernel.expr_count = args.stmt_exprs;
+	if (ret < 0)
+		fprintf(log_file, "%s: mnl_cb_run: %d\n", __func__, ret);
+
+	if (write(1, &to_kernel, sizeof(to_kernel)) != (int)sizeof(to_kernel))
+		return 2;
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	log_file = fopen("/tmp/debug.log", "a");
+	if (!log_file)
+		return 1;
+
+	fd = -1;
+	for (;;) {
+		int ret = doit();
+		if (ret != 0)
+			return ret;
+		close(fd);
+		fd = ret;
 	}
 
 	return 0;
diff --git a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
index 4778f53b2683..bd319d41e2d1 100644
--- a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
+++ b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
@@ -1,6 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/umh.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/skbuff.h>
+#include <linux/bpf.h>
+
 #include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
@@ -18,8 +26,54 @@ static int nft_jit_load_umh(void)
 	return fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
 }
 
-int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+static void nft_jit_fd_to_prog(struct nft_ebpf *e, int fd, u32 expr_count)
+{
+	struct task_struct *task = pid_task(find_vpid(info.pid), PIDTYPE_PID);
+	struct files_struct *files;
+	struct bpf_prog *p;
+	struct file *file;
+
+	if (WARN_ON_ONCE(!task) || expr_count > 128) {
+		nft_jit_stop_umh();
+		return;
+	}
+
+	if (expr_count == 0) /* could not translate */
+		return;
+
+	task_lock(task);
+	files = task->files;
+	if (!files)
+		goto out_unlock;
+
+	file = fcheck_files(files, fd);
+	if (file && !get_file_rcu(file))
+		file = NULL;
+
+	if (!file)
+		goto out_unlock;
+
+	p = bpf_prog_get_type_dev_file(file, BPF_PROG_TYPE_SCHED_CLS, false);
+
+	task_unlock(task);
+
+	if (!IS_ERR(p)) {
+		e->prog = p;
+		e->expressions = expr_count;
+	}
+
+	fput(file);
+	return;
+out_unlock:
+	task_unlock(task);
+	nft_jit_stop_umh();
+}
+
+static int nft_jit_write_rule_info(const struct sk_buff *nlskb)
 {
+	const char *addr = nlskb->data;
+	ssize_t w, n, nr = nlskb->len;
+
 	if (!info.pipe_to_umh) {
 		int ret = nft_jit_load_umh();
 		if (ret)
@@ -29,5 +83,93 @@ int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
 			return -EINVAL;
 	}
 
-	return 0;
+	w = 0;
+	do {
+		loff_t pos = 0;
+
+		n = __kernel_write(info.pipe_to_umh, addr, nr, &pos);
+		if (n < 0)
+			return n;
+		w += n;
+		nr -= n;
+		if (nr == 0)
+			break;
+		addr += n;
+	} while (!signal_pending(current));
+
+	if (w == nlskb->len)
+		return 0;
+
+	return -EINTR;
+}
+
+static int nft_jit_read_result(struct nft_jit_data_from_user *res)
+{
+	ssize_t r, n, nr = sizeof(*res);
+
+	r = 0;
+
+	do {
+		loff_t pos = 0;
+
+		n = kernel_read(info.pipe_from_umh, res, nr, &pos);
+		if (n < 0)
+			return n;
+		if (n == 0)
+			return -EPIPE;
+		r += n;
+		nr -= n;
+		if (nr == 0)
+			break;
+	} while (!signal_pending(current));
+
+	if (r == (ssize_t)sizeof(*res))
+		return 0;
+
+	return -EINTR;
+}
+
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+{
+	struct nft_jit_data_from_user from_usr;
+	int ret;
+
+	ret = nft_jit_write_rule_info(nlskb);
+	if (ret < 0) {
+		nft_jit_stop_umh();
+		pr_info("write rule info: ret %d\n", ret);
+		return ret;
+	}
+
+	ret = nft_jit_read_result(&from_usr);
+	if (ret < 0) {
+		pr_info("read rule info: ret %d\n", ret);
+		nft_jit_stop_umh();
+		return ret;
+	}
+
+	if (from_usr.ebpf_fd >= 0) {
+		rcu_read_lock();
+		nft_jit_fd_to_prog(e, from_usr.ebpf_fd, from_usr.expr_count);
+		rcu_read_unlock();
+		return 0;
+	}
+
+	return ret;
+}
+
+void nft_jit_stop_umh(void)
+{
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = pid_task(find_vpid(info.pid), PIDTYPE_PID);
+	if (tsk)
+		force_sig(SIGKILL, tsk);
+	rcu_read_unlock();
+	fput(info.pipe_to_umh);
+	fput(info.pipe_from_umh);
+	memset(&info, 0, sizeof(info));
+
+	info.pid = -1;
 }
-- 
2.16.4

^ permalink raw reply related

* [PATCH net] vrf: check the original netdevice for generating redirect
From: Stephen Suryaputra @ 2018-06-01  4:05 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Suryaputra

Use the original ingress device to determine whether a redirect should be
sent, especially when using vrf. Use the same device when sending the redirect.
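
A minimal userspace sketch of the comparison this patch changes, not the
kernel code itself. The structs fake_dev and fake_skb and both function names
are hypothetical stand-ins for struct net_device, struct sk_buff and the
check in ip6_forward(). It assumes, as the patch implies, that under a vrf
skb->dev points at the l3 master device while the recorded iif still names
the port the packet arrived on.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct fake_dev {
	int ifindex;
	const char *name;
};

struct fake_skb {
	const struct fake_dev *dev;	/* may be the vrf (l3 master) device */
	int iif;			/* ifindex of the original ingress port */
};

/* Old check: compare device pointers; misses when skb->dev is the vrf. */
static bool redirect_by_dev(const struct fake_skb *skb,
			    const struct fake_dev *out)
{
	return skb->dev == out;
}

/* New check: compare the recorded ingress ifindex with the egress device. */
static bool redirect_by_iif(const struct fake_skb *skb,
			    const struct fake_dev *out)
{
	return skb->iif == out->ifindex;
}
```

For a packet received on a port with ifindex 2 whose skb->dev has been
switched to a vrf device, the pointer comparison rejects the redirect even
though the packet leaves through the port it came in on; the ifindex
comparison accepts it.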

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 net/ipv6/ip6_output.c | 3 ++-
 net/ipv6/ndisc.c      | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7b6d168..af49f6c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -507,7 +507,8 @@ int ip6_forward(struct sk_buff *skb)
 	   send redirects to source routed frames.
 	   We don't send redirects to frames decapsulated from IPsec.
 	 */
-	if (skb->dev == dst->dev && opt->srcrt == 0 && !skb_sec_path(skb)) {
+	if (IP6CB(skb)->iif == dst->dev->ifindex &&
+	    opt->srcrt == 0 && !skb_sec_path(skb)) {
 		struct in6_addr *target = NULL;
 		struct inet_peer *peer;
 		struct rt6_info *rt;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 9de4dfb1..525051a 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1576,6 +1576,12 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 	   ops_data_buf[NDISC_OPS_REDIRECT_DATA_SPACE], *ops_data = NULL;
 	bool ret;
 
+	if (netif_is_l3_master(skb->dev)) {
+		dev = __dev_get_by_index(dev_net(skb->dev), IPCB(skb)->iif);
+		if (!dev)
+			return;
+	}
+
 	if (ipv6_get_lladdr(dev, &saddr_buf, IFA_F_TENTATIVE)) {
 		ND_PRINTK(2, warn, "Redirect: no link-local address on %s\n",
 			  dev->name);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH iproute2 net-next] iproute: ip route get support for sport, dport and ipproto match
From: David Ahern @ 2018-06-01 15:40 UTC (permalink / raw)
  To: Roopa Prabhu; +Cc: netdev
In-Reply-To: <1527699978-359-1-git-send-email-roopa@cumulusnetworks.com>

On 5/30/18 11:06 AM, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
> ---
>  ip/iproute.c           | 26 +++++++++++++++++++++++++-
>  man/man8/ip-route.8.in | 20 +++++++++++++++++++-
>  2 files changed, 44 insertions(+), 2 deletions(-)

applied to iproute2-next. Thanks

^ permalink raw reply

* Re: [PATCH V2 mlx5-next 0/2] Mellanox, mlx5 new device events
From: David Miller @ 2018-06-01 15:45 UTC (permalink / raw)
  To: dledford; +Cc: saeedm, netdev, linux-rdma, leonro, jgg
In-Reply-To: <82ddad11b6a77aa07238ff50ae8769360bfb583b.camel@redhat.com>

From: Doug Ledford <dledford@redhat.com>
Date: Fri, 01 Jun 2018 11:08:24 -0400

> On Thu, 2018-05-31 at 15:36 -0400, David Miller wrote:
>> From: Saeed Mahameed <saeedm@mellanox.com>
>> Date: Wed, 30 May 2018 10:59:48 -0700
>> 
>> > The following series is for mlx5-next tree [1], it adds the support of two
>> > new device events, from Ilan Tayari:
>> > 
>> > 1. High temperature warnings.
>> > 2. FPGA QP error event.
>> > 
>> > In case of no objection this series will be applied to mlx5-next tree
>> > and will be sent later as a pull request to both rdma and net trees.
>> > 
>> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next
>> > 
>> > v1->v2:
>> >   - improve commit message of the FPGA QP error event patch.
>> 
>> Series applied, thanks.
> 
> Hi Dave,
> 
> Although in this case it doesn't really matter and we can work around
> it, this was supposed to be a case of the new methodology that Saeed and
> Jason had worked out with you.  Specifically, when Saeed says in the
> cover letter:
> 
>> In case of no objection this series will be applied to mlx5-next tree
>> and will be sent later as a pull request to both rdma and net trees.
> 
> then it is intended for you to ack the original patch series, not apply
> it, and when acks from both the net and rdma side have been received,
> then we will get a pull request of just that series.

Sorry, I saw your ACK and misinterpreted the situation.

I'll be more careful next time.

^ permalink raw reply

* Re: [PATCH net-next] net: mvpp2: Split the PPv2 driver to a dedicated directory
From: David Miller @ 2018-06-01 15:48 UTC (permalink / raw)
  To: maxime.chevallier
  Cc: netdev, linux-kernel, antoine.tenart, thomas.petazzoni,
	gregory.clement, miquel.raynal, nadavh, stefanc, ymarkman, mw
In-Reply-To: <20180531080743.4218-1-maxime.chevallier@bootlin.com>

From: Maxime Chevallier <maxime.chevallier@bootlin.com>
Date: Thu, 31 May 2018 10:07:43 +0200

> As the mvpp2 driver is growing, move this driver to a dedicated
> directory and split it into several files.
> 
> Since this driver has a lot of register defines and structure
> definitions, it can benefit from having all of this into a dedicated
> header file, named mvpp2.h.
> 
> A good chunk of the mvpp2 code is dedicated to Header Parser handling, so
> we introduce mvpp2_prs.h where all Header Parser definitions are located,
> and mvpp2_prs.c containing the related code.
> 
> In the same way, mvpp2_cls.h and mvpp2_cls.c are created to contain
> Classifier and RSS related code.
> 
> The former 'mvpp2.c' file is renamed 'mvpp2_main.c' so that we can keep
> the driver binary named 'mvpp2'.
> 
> This commit is only about splitting the driver into multiple files and
> doesn't introduce any new function, feature or fix besides removing
> 'static' keywords when needed.
> 
> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
> ---
> This is based on the latest net-next. I expect no conflicts, but due to
> the amount of code this patch touches, it can easily stop applying cleanly.
> 
> Tell me if that's the case, I'll be happy to rebase it if needed.

Applied, thank you.

^ permalink raw reply

* [PATCH iproute2] iplink_vrf: Save device index from response for return code
From: dsahern @ 2018-06-01 15:50 UTC (permalink / raw)
  To: stephen, netdev; +Cc: David Ahern, Hangbin Liu, Phil Sutter

From: David Ahern <dsahern@gmail.com>

A recent commit changed rtnl_talk_* to return the response message in
allocated memory, so callers need to free it. The change to name_is_vrf
did not save the device index before freeing: ifi points to a struct
inside the answer buffer, so reading ifi->ifi_index after the free
returns garbage in some cases.

Fix by using a stack variable to save the return value and only set
it to ifi->ifi_index after all checks are done and before the answer
buffer is freed.
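
For illustration only, a minimal user-space sketch of the same pattern.
The names struct reply, fetch_reply() and index_if_vrf() are hypothetical
stand-ins, not iproute2 code; the point is just that the value is copied
to a stack variable before the buffer is freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the netlink answer; not the iproute2 type. */
struct reply {
	int index;
	char kind[8];
};

/* Hypothetical helper: returns a heap-allocated reply the caller must free. */
static struct reply *fetch_reply(int index, const char *kind)
{
	struct reply *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->index = index;
	strncpy(r->kind, kind, sizeof(r->kind) - 1);
	return r;
}

/* Returns the index if the reply is of kind "vrf", otherwise 0. */
static int index_if_vrf(int index, const char *kind)
{
	struct reply *r;
	int ifindex = 0;	/* result for every early exit */

	assert(kind != NULL);
	r = fetch_reply(index, kind);
	if (!r)
		return 0;
	if (strcmp(r->kind, "vrf"))
		goto out;

	ifindex = r->index;	/* copy while the buffer is still valid */
out:
	free(r);
	return ifindex;		/* r is never read after free() */
}
```

With this shape, every path through the function returns a value that
lives on the stack, so the order of free() and return no longer matters.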

Fixes: 86bf43c7c2fdc ("lib/libnetlink: update rtnl_talk to support malloc buff at run time")
Cc: Hangbin Liu <liuhangbin@gmail.com>
Cc: Phil Sutter <phil@nwl.cc>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 ip/iplink_vrf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index e9dd0df98412..6004bb4f305e 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -191,6 +191,7 @@ int name_is_vrf(const char *name)
 	struct rtattr *tb[IFLA_MAX+1];
 	struct rtattr *li[IFLA_INFO_MAX+1];
 	struct ifinfomsg *ifi;
+	int ifindex = 0;
 	int len;
 
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
@@ -218,7 +219,8 @@ int name_is_vrf(const char *name)
 	if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
 		goto out;
 
+	ifindex = ifi->ifi_index;
 out:
 	free(answer);
-	return ifi->ifi_index;
+	return ifindex;
 }
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH] qtnfmac: fix NULL pointer dereference
From: Kalle Valo @ 2018-06-01 15:52 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: Igor Mitsyanko, Avinash Patil, Sergey Matyukevich,
	David S. Miller, linux-wireless, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20180601132408.GA2572@embeddedor.com>

"Gustavo A. R. Silva" <gustavo@embeddedor.com> writes:

> In case *vif* is NULL at 655: if (!vif), the execution path jumps to
> label out, where *vif* is dereferenced at 679:
>
> if (vif->sta_state == QTNF_STA_CONNECTING)
>
> Fix this by immediately returning when *vif* is NULL instead of
> jumping to label out.
>
> Addresses-Coverity-ID: 1469567 ("Dereference after null check")
> Fixes: 480daa9cb62c ("qtnfmac: fix invalid STA state on EAPOL failure")
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

As commit 480daa9cb62c was recently applied to wireless-drivers-next
I'll queue this to 4.18.

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH 04/18] rhashtable: detect when object movement might have invalidated a lookup
From: Herbert Xu @ 2018-06-01 16:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Thomas Graf, netdev, linux-kernel, Eric Dumazet, David S. Miller
In-Reply-To: <152782824943.30340.8224535954517915320.stgit@noble>

On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
> Some users of rhashtable might need to change the key
> of an object and move it to a different location in the table.
> Other users might want to allocate objects using
> SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
> being used for a different (type-compatible) purpose and similarly
> end up in a different hash-chain.
>
> To support these, we store a unique NULLS_MARKER at the end of
> each chain, and when a search fails to find a match, we check
> if the NULLS marker found was the expected one.  If not,
> the search is repeated.
> 
> The unique NULLS_MARKER is derived from the address of the
> head of the chain.

Yes, I think this makes a lot more sense than the existing rhashtable
nulls code.  The current rhashtable nulls code harks back to the
time of the old rhashtable implementation, where the same chain
existed in two different tables; that is no longer the case.
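
As a rough user-space illustration of the scheme in the quoted text (a
simplified sketch, not the patch's code): the end-of-chain marker is
derived from the bucket address, and a failed search checks which marker
it ended on.  The encoding below (bucket address with the low bit set)
and all names are assumptions made for the example; the kernel's
NULLS_MARKER encoding differs, and a real lookup would run under RCU and
retry on MOVED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	int key;
	struct node *next;	/* low bit set: end-of-chain marker, not a node */
};

struct bucket {
	struct node *head;
};

enum result { FOUND, MISS, MOVED };

/* Marker unique to this bucket: its own address with the low bit set. */
static struct node *nulls_marker(const struct bucket *b)
{
	return (struct node *)((uintptr_t)b | 1UL);
}

static int is_nulls(const struct node *p)
{
	return ((uintptr_t)p & 1UL) != 0;
}

static void bucket_init(struct bucket *b)
{
	b->head = nulls_marker(b);
}

static void bucket_insert(struct bucket *b, struct node *n)
{
	assert(!is_nulls(n));	/* node pointers must be at least 2-byte aligned */
	n->next = b->head;
	b->head = n;
}

/*
 * Walk the chain.  If the key is not found, report MISS only when the walk
 * ended on this bucket's own marker; any other marker means an object moved
 * to another chain under us, so the caller should repeat the search.
 */
static enum result chain_search(const struct bucket *b, int key)
{
	const struct node *p;

	for (p = b->head; !is_nulls(p); p = p->next)
		if (p->key == key)
			return FOUND;
	return p == nulls_marker(b) ? MISS : MOVED;
}
```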

> If an object is removed and re-added to the same hash chain, we won't
> notice by looking at the NULLS marker.  In this case we must be sure

This is not currently required by TCP/UDP.  I'd rather not add
extra constraints that aren't actually used.  The only requirement
for nulls is to allow sockets to float from the listening table to
the established table within one RCU grace period.  There is no
shuttling back and forth, i.e., the only exit path for a socket in
the established table is to be freed at the end of the RCU grace
period.

Adding Eric/Dave to the cc list in case they do/will need such
functionality.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* ndo_start_xmit() - reserved head and tailroom
From: Alexander Aring @ 2018-06-01 16:14 UTC (permalink / raw)
  To: netdev; +Cc: stefan, linux-wpan, kernel

Hi netdev community,

I am fixing bugs in the 6lowpan branch again and had assumed that the
needed_headroom and needed_tailroom of my net_device are available inside
my ndo_start_xmit() callback.

For a UDP socket, that was not the case. I am now sending a fix that uses
skb_copy_expand() to make sure this space is available (I need to
unshare the buffer anyway).

I am just curious: is this supposed to work like that? I cannot believe
that I need to run a realloc() inside my ndo_start_xmit(). If the
net_device is known at the socket layer, the socket layer should allocate
the necessary space, and I thought this was the designed transmit flow.

- Alex

^ permalink raw reply

* Re: [PATCH V2 mlx5-next 0/2] Mellanox, mlx5 new device events
From: Leon Romanovsky @ 2018-06-01 16:21 UTC (permalink / raw)
  To: David Miller; +Cc: dledford, saeedm, netdev, linux-rdma, jgg
In-Reply-To: <20180601.114558.569563972122502762.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2433 bytes --]

On Fri, Jun 01, 2018 at 11:45:58AM -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Fri, 01 Jun 2018 11:08:24 -0400
>
> > On Thu, 2018-05-31 at 15:36 -0400, David Miller wrote:
> >> From: Saeed Mahameed <saeedm@mellanox.com>
> >> Date: Wed, 30 May 2018 10:59:48 -0700
> >>
> >> > The following series is for mlx5-next tree [1], it adds the support of two
> >> > new device events, from Ilan Tayari:
> >> >
> >> > 1. High temperature warnings.
> >> > 2. FPGA QP error event.
> >> >
> >> > In case of no objection this series will be applied to mlx5-next tree
> >> > and will be sent later as a pull request to both rdma and net trees.
> >> >
> >> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next
> >> >
> >> > v1->v2:
> >> >   - improve commit message of the FPGA QP error event patch.
> >>
> >> Series applied, thanks.
> >
> > Hi Dave,
> >
> > Although in this case it doesn't really matter and we can work around
> > it, this was supposed to be a case of the new methodology that Saeed and
> > Jason had worked out with you.  Specifically, when Saeed says in the
> > cover letter:
> >
> > >> In case of no objection this series will be applied to mlx5-next tree
> >> and will be sent later as a pull request to both rdma and net trees.
> >
> > then it is intended for you to ack the original patch series, not apply
> > it, and when acks from both the net and rdma side have been received,
> > then we will get a pull request of just that series.
>
> Sorry, I saw your ACK and misinterpreted the situation.
>
> I'll be more careful next time.

Doug, Dave

I would like to clarify this point: we intend to send a pull request to only
one maintainer, the one who actually needs the accepted patches.

Let's take the RDMA flow counters series as an example:
https://www.spinics.net/lists/linux-rdma/msg65620.html

This series includes 12 patches for RDMA and 2 patches of shared code
(mlx5-next). The RDMA side will receive those two patches as part
of a specific pull request, and netdev will get them only if some netdev
feature down the road needs the mlx5-next branch.

Of course, it is harmless for both of you to pull, but it looks like
extra work that you do not need.

Thanks

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* [PATCH net] net/packet: refine check for priv area size
From: Eric Dumazet @ 2018-06-01 16:23 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

syzbot was able to trick af_packet again [1]

Various commits tried to address the problem in the past,
but failed to take into account V3 header size.
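
A user-space sketch of the arithmetic involved, for illustration only.
The helper names and the example sizes (4096, 4072, 48) are assumptions,
not values from af_packet.c, and max_frame_len = 24 is inferred from the
first line of the report [1] (4294967224 == 24 - 96 modulo 2^32).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical version of the ring sanity check: a block is accepted only
 * if it is strictly larger than the block header plus private area plus
 * one per-frame header.  Passing hdr_size == 0 models the old check.
 */
static int ring_request_ok(uint32_t block_size, uint64_t blk_plus_priv,
			   uint64_t hdr_size)
{
	return (uint64_t)block_size > blk_plus_priv + hdr_size;
}

/* Room left for packet data; wraps around if macoff > max_frame_len. */
static uint32_t room_after_mac(uint32_t max_frame_len, uint32_t macoff)
{
	return max_frame_len - macoff;
}

/* Reproduces the clamp value printed by tpacket_rcv in the report. */
static void show_wraparound(void)
{
	assert(room_after_mac(24, 96) == 4294967224u);
}
```

With the example sizes, a 4096-byte block and a 4072-byte header plus
private area pass the old form of the check but fail once a 48-byte frame
header is added to the right-hand side.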

[1]

tpacket_rcv: packet too big, clamped from 72 to 4294967224. macoff=96
BUG: KASAN: use-after-free in prb_run_all_ft_ops net/packet/af_packet.c:1016 [inline]
BUG: KASAN: use-after-free in prb_fill_curr_block.isra.59+0x4e5/0x5c0 net/packet/af_packet.c:1039
Write of size 2 at addr ffff8801cb62000e by task kworker/1:2/2106

CPU: 1 PID: 2106 Comm: kworker/1:2 Not tainted 4.17.0-rc7+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_store2_noabort+0x17/0x20 mm/kasan/report.c:436
 prb_run_all_ft_ops net/packet/af_packet.c:1016 [inline]
 prb_fill_curr_block.isra.59+0x4e5/0x5c0 net/packet/af_packet.c:1039
 __packet_lookup_frame_in_block net/packet/af_packet.c:1094 [inline]
 packet_current_rx_frame net/packet/af_packet.c:1117 [inline]
 tpacket_rcv+0x1866/0x3340 net/packet/af_packet.c:2282
 dev_queue_xmit_nit+0x891/0xb90 net/core/dev.c:2018
 xmit_one net/core/dev.c:3049 [inline]
 dev_hard_start_xmit+0x16b/0xc10 net/core/dev.c:3069
 __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
 neigh_resolve_output+0x679/0xad0 net/core/neighbour.c:1358
 neigh_output include/net/neighbour.h:482 [inline]
 ip6_finish_output2+0xc9c/0x2810 net/ipv6/ip6_output.c:120
 ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
 NF_HOOK_COND include/linux/netfilter.h:277 [inline]
 ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
 dst_output include/net/dst.h:444 [inline]
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ndisc_send_skb+0x100d/0x1570 net/ipv6/ndisc.c:491
 ndisc_send_ns+0x3c1/0x8d0 net/ipv6/ndisc.c:633
 addrconf_dad_work+0xbef/0x1340 net/ipv6/addrconf.c:4033
 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
 kthread+0x345/0x410 kernel/kthread.c:240
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

The buggy address belongs to the page:
page:ffffea00072d8800 count:0 mapcount:-127 mapping:0000000000000000 index:0xffff8801cb620e80
flags: 0x2fffc0000000000()
raw: 02fffc0000000000 0000000000000000 ffff8801cb620e80 00000000ffffff80
raw: ffffea00072e3820 ffffea0007132d20 0000000000000002 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801cb61ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff8801cb61ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8801cb620000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                      ^
 ffff8801cb620080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff8801cb620100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Fixes: 2b6867c2ce76 ("net/packet: fix overflow in check for priv area size")
Fixes: dc808110bb62 ("packet: handle too big packets for PACKET_V3")
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index acb7b86574cd3d6f13790550c00f11616caff2e3..60c2a252bdf527eae439c6953293a5f3e3ea48b9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4282,7 +4282,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			goto out;
 		if (po->tp_version >= TPACKET_V3 &&
 		    req->tp_block_size <=
-			  BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv))
+		    BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + sizeof(struct tpacket3_hdr))
 			goto out;
 		if (unlikely(req->tp_frame_size < po->tp_hdrlen +
 					po->tp_reserve))
-- 
2.17.0.921.gf22659ad46-goog

^ permalink raw reply related

* Re: [PATCH 05/18] rhashtable: simplify INIT_RHT_NULLS_HEAD()
From: Herbert Xu @ 2018-06-01 16:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152782824947.30340.2254332178823222989.stgit@noble>

On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
> The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are
> no longer used - so drop them.  This allows us to also
> from the nhash argument from nested_table_alloc().

The second sentence is missing a word.

> Signed-off-by: NeilBrown <neilb@suse.com>

Otherwise,

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 06/18] rhashtable: simplify nested_table_alloc() and rht_bucket_nested_insert()
From: Herbert Xu @ 2018-06-01 16:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152782824950.30340.10831876745236207066.stgit@noble>

On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
> Now that we don't use the hash value or shift in nested_table_alloc()
> there is room for simplification.
> We only need to pass a "is this a leaf" flag to nested_table_alloc(),
> and don't need to track as much information in
> rht_bucket_nested_insert().
> 
> Note there is another minor cleanup in nested_table_alloc() here.
> The number of elements in a page of "union nested_tables" is most naturally
> 
>   PAGE_SIZE / sizeof(ntbl[0])
> 
> The previous code had
> 
>   PAGE_SIZE / sizeof(ntbl[0].bucket)
> 
> which happens to be the correct value only because the bucket uses all
> the space in the union.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net] ipv6: omit traffic class when calculating flow hash
From: David Ahern @ 2018-06-01 16:42 UTC (permalink / raw)
  To: Michal Kubecek, David S. Miller
  Cc: netdev, linux-kernel, Nicolas Dichtel, Tom Herbert
In-Reply-To: <20180601112948.93BE7A0C48@unicorn.suse.cz>

On 6/1/18 4:34 AM, Michal Kubecek wrote:
> Some of the code paths calculating flow hash for IPv6 use flowlabel member
> of struct flowi6 which, despite its name, encodes both flow label and
> traffic class. If traffic class changes within a TCP connection (as e.g.
> ssh does), ECMP route can switch between paths. It's also inconsistent with
> other code paths where ip6_flowlabel() (returning only flow label) is used
> to feed the key.
> 
> Use only flow label everywhere, including one place where hash key is set
> using ip6_flowinfo().
> 
> Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
> Fixes: f70ea018da06 ("net: Add functions to get skb->hash based on flow structures")
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> ---
>  net/core/flow_dissector.c | 3 ++-
>  net/ipv6/route.c          | 5 +++--
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index d29f09bc5ff9..441d3db76e8e 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -1334,7 +1334,8 @@ __u32 __get_hash_from_flowi6(const struct flowi6 *fl6, struct flow_keys *keys)
>  	keys->ports.src = fl6->fl6_sport;
>  	keys->ports.dst = fl6->fl6_dport;
>  	keys->keyid.keyid = fl6->fl6_gre_key;
> -	keys->tags.flow_label = (__force u32)fl6->flowlabel;
> +	keys->tags.flow_label = (__force u32)(fl6->flowlabel &
> +					      IPV6_FLOWLABEL_MASK);
>  	keys->basic.ip_proto = fl6->flowi6_proto;
>  
>  	return flow_hash_from_keys(keys);
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f4d61736c41a..fcbacf1677f8 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1868,7 +1868,7 @@ static void ip6_multipath_l3_keys(const struct sk_buff *skb,
>  	} else {
>  		keys->addrs.v6addrs.src = key_iph->saddr;
>  		keys->addrs.v6addrs.dst = key_iph->daddr;
> -		keys->tags.flow_label = ip6_flowinfo(key_iph);
> +		keys->tags.flow_label = ip6_flowlabel(key_iph);
>  		keys->basic.ip_proto = key_iph->nexthdr;
>  	}
>  }
> @@ -1889,7 +1889,8 @@ u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6,
>  		} else {
>  			hash_keys.addrs.v6addrs.src = fl6->saddr;
>  			hash_keys.addrs.v6addrs.dst = fl6->daddr;
> -			hash_keys.tags.flow_label = (__force u32)fl6->flowlabel;
> +			hash_keys.tags.flow_label = (__force u32)(fl6->flowlabel &
> +								  IPV6_FLOWLABEL_MASK);
>  			hash_keys.basic.ip_proto = fl6->flowi6_proto;
>  		}
>  		break;
> 

Can you make an inline for the flowlabel conversion? Something like this:

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 798558fd1681..e36eca2f8531 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -284,6 +284,11 @@ struct ip6_flowlabel {
 #define IPV6_FLOWLABEL_MASK            cpu_to_be32(0x000FFFFF)
 #define IPV6_FLOWLABEL_STATELESS_FLAG  cpu_to_be32(0x00080000)

+static inline u32 flowi6_get_flowlabel(const struct flowi6 *fl6)
+{
+       return (__force u32)(fl6->flowlabel & IPV6_FLOWLABEL_MASK);
+}
+
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
 #define IPV6_TCLASS_SHIFT      20

From there we can fix the flow struct to have flowinfo instead of
flowlabel and use the helper to hide the conversion.
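
To show why the mask matters, here is a small user-space sketch (not
kernel code). It assumes the usual layout of the first 32 bits of an IPv6
header in network byte order: 4 bits version, 8 bits traffic class,
20 bits flow label. make_flowinfo() and flowlabel_only() are names made
up for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK	htonl(0x000FFFFFu)	/* low 20 bits, network order */
#define TCLASS_SHIFT	20

/* Build a big-endian flowinfo word from a traffic class and a flow label. */
static uint32_t make_flowinfo(uint8_t tclass, uint32_t label)
{
	assert(label <= 0x000FFFFFu);
	return htonl(((uint32_t)tclass << TCLASS_SHIFT) | label);
}

/* Keep only the flow label bits, dropping the traffic class. */
static uint32_t flowlabel_only(uint32_t flowinfo_be)
{
	return flowinfo_be & FLOWLABEL_MASK;
}
```

Two flowinfo words that differ only in traffic class compare unequal as a
whole but equal after flowlabel_only(), so a hash fed with the masked
value stays stable when the traffic class changes mid-connection.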

^ permalink raw reply related

* Re: [PATCH 07/18] rhashtable: use cmpxchg() to protect ->future_tbl.
From: Herbert Xu @ 2018-06-01 16:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152782824954.30340.10107132482367263068.stgit@noble>

On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
> Rather than borrowing one of the bucket locks to
> protect ->future_tbl updates, use cmpxchg().
> This gives more freedom to change how bucket locking
> is implemented.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH V2 mlx5-next 0/2] Mellanox, mlx5 new device events
From: Doug Ledford @ 2018-06-01 16:48 UTC (permalink / raw)
  To: Leon Romanovsky, David Miller; +Cc: saeedm, netdev, linux-rdma, jgg
In-Reply-To: <20180601162126.GB2843@mtr-leonro.mtl.com>

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

On Fri, 2018-06-01 at 19:21 +0300, Leon Romanovsky wrote:
> On Fri, Jun 01, 2018 at 11:45:58AM -0400, David Miller wrote:
> > From: Doug Ledford <dledford@redhat.com>
> > Date: Fri, 01 Jun 2018 11:08:24 -0400
> > 
> > > On Thu, 2018-05-31 at 15:36 -0400, David Miller wrote:
> > > > From: Saeed Mahameed <saeedm@mellanox.com>
> > > > Date: Wed, 30 May 2018 10:59:48 -0700
> > > > 
> > > > > The following series is for mlx5-next tree [1], it adds the support of two
> > > > > new device events, from Ilan Tayari:
> > > > > 
> > > > > 1. High temperature warnings.
> > > > > 2. FPGA QP error event.
> > > > > 
> > > > > In case of no objection this series will be applied to mlx5-next tree
> > > > > and will be sent later as a pull request to both rdma and net trees.
> > > > > 
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next
> > > > > 
> > > > > v1->v2:
> > > > >   - improve commit message of the FPGA QP error event patch.
> > > > 
> > > > Series applied, thanks.
> > > 
> > > Hi Dave,
> > > 
> > > Although in this case it doesn't really matter and we can work around
> > > it, this was supposed to be a case of the new methodology that Saeed and
> > > Jason had worked out with you.  Specifically, when Saeed says in the
> > > cover letter:
> > > 
> > > > In case of no objection this series will be applied to mlx5-next tree
> > > > and will be sent later as a pull request to both rdma and net trees.
> > > 
> > > then it is intended for you to ack the original patch series, not apply
> > > it, and when acks from both the net and rdma side have been received,
> > > then we will get a pull request of just that series.
> > 
> > Sorry, I saw your ACK and misinterpreted the situation.
> > 
> > I'll be more careful next time.
> 
> Doug, Dave
> 
> I would like to clarify this point: we intend to send a pull request to only
> one maintainer, the one who actually needs the accepted patches.
> 
> Let's take the RDMA flow counters series as an example:
> https://www.spinics.net/lists/linux-rdma/msg65620.html
> 
> This series includes 12 patches for RDMA and 2 patches of shared code
> (mlx5-next). The RDMA side will receive those two patches as part
> of a specific pull request, and netdev will get them only if some netdev
> feature down the road needs the mlx5-next branch.
> 
> Of course, it is harmless for both of you to pull, but it looks like
> extra work that you do not need.

Ok, thanks for the clarification Leon.


-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH V2 mlx5-next 0/2] Mellanox, mlx5 new device events
From: Doug Ledford @ 2018-06-01 16:49 UTC (permalink / raw)
  To: David Miller; +Cc: saeedm, netdev, linux-rdma, leonro, jgg
In-Reply-To: <20180601.114558.569563972122502762.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1838 bytes --]

On Fri, 2018-06-01 at 11:45 -0400, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Fri, 01 Jun 2018 11:08:24 -0400
> 
> > On Thu, 2018-05-31 at 15:36 -0400, David Miller wrote:
> >> From: Saeed Mahameed <saeedm@mellanox.com>
> >> Date: Wed, 30 May 2018 10:59:48 -0700
> >> 
> >> > The following series is for mlx5-next tree [1], it adds the support of two
> >> > new device events, from Ilan Tayari:
> >> > 
> >> > 1. High temperature warnings.
> >> > 2. FPGA QP error event.
> >> > 
> >> > In case of no objection this series will be applied to mlx5-next tree
> >> > and will be sent later as a pull request to both rdma and net trees.
> >> > 
> >> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-next
> >> > 
> >> > v1->v2:
> >> >   - improve commit message of the FPGA QP error event patch.
> >> 
> >> Series applied, thanks.
> > 
> > Hi Dave,
> > 
> > Although in this case it doesn't really matter and we can work around
> > it, this was supposed to be a case of the new methodology that Saeed and
> > Jason had worked out with you.  Specifically, when Saeed says in the
> > cover letter:
> > 
> > >> In case of no objection this series will be applied to mlx5-next tree
> >> and will be sent later as a pull request to both rdma and net trees.
> > 
> > then it is intended for you to ack the original patch series, not apply
> > it, and when acks from both the net and rdma side have been received,
> > then we will get a pull request of just that series.
> 
> Sorry, I saw your ACK and misinterpreted the situation.
> 
> I'll be more careful next time.

Understandable, thanks ;-)

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

