* [PATCH net-next v8 00/15] Introducing P4TC
@ 2023-11-16 14:59 Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 01/15] net: sched: act_api: Introduce dynamic actions list Jamal Hadi Salim
` (15 more replies)
0 siblings, 16 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, Vipin.Jain, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong,
davem, edumazet, kuba, pabeni, vladbu, horms, daniel, bpf,
khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
We are seeking community feedback on P4TC patches.
We have reduced the number of commits in this patchset, leaving out all the
testcases and secondary patches, in order to ease review.
We feel we have completed the migration from the V1 scriptable version to eBPF
and that now is a good time to remove the RFC tag.
Changes In RFC Version 2
-------------------------
Version 2 is the initial integration of the eBPF datapath.
We took into consideration the suggestions to use eBPF and put effort into
analyzing eBPF as a datapath, which involved extensive testing.
We implemented 6 approaches with eBPF, ran performance analysis on each of the
6 vs the scriptable P4TC, and presented our results at the P4 2023 workshop in
Santa Clara [see: 1, 3]. We concluded that 2 of the approaches are sensible
(4 if you account for XDP or TC separately).
Conclusions from the exercise: We lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying the eBPF path please
scroll down to "Restating Our Requirements" and "Challenges".
This patchset presented two modes.
mode1: the parser is entirely based on eBPF, whereas the rest of the
SW datapath stays as _scriptable_ as in Version 1.
mode2: all of the kernel s/w datapath (including the parser) is in eBPF.
The key eBPF ingredient that we did not have access to in the past is kfuncs
(they made a big difference in our decision to reconsider eBPF).
In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).
Changes In RFC Version 3
-------------------------
These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that were used in V1 and V2 but are no
longer used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:
1) Feedback we got in V2 was to try to stick to one of the two modes. In this
version we take one more step and go down the path of mode2, whereas V2 had both modes.
2) The P4 Register extern is no longer standalone. Instead, as part of integrating
into eBPF we introduce another kfunc which encapsulates Register as part of the
extern interface.
3) We have improved our CICD to include tools pointed out to us by Simon. See
"Testing" further below. Thanks to Simon for that and for other issues he caught.
Simon, we discussed issue [7] but decided to keep that log since we think
it is useful.
4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
re-discuss though; see: [5], [6].
5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
6) Clarify an ambiguity flagged by smatch in an if(A) else if(B) condition. We are
guaranteed that either A or B must exist; however, let's make smatch happy.
Thanks to Simon and Dan Carpenter.
Changes In RFC Version 4
-------------------------
1) More integration from scriptable to eBPF. Small bug fixes.
2) More streamlining of extern support via kfuncs (one additional kfunc).
3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
There is more eBPF integration coming. One thing we looked at, which is not in
this patchset but should be in the next one, is the use of eBPF links in our
loading (see "challenge #1" further below).
Changes In RFC Version 5
-------------------------
1) More integration from scriptable view to eBPF. Small bug fixes from last
integration.
2) More streamlining of extern support via kfuncs (create-on-miss, etc).
3) eBPF linking for XDP.
There is more eBPF integration/streamlining coming (we are getting close to
completing the conversion from the scriptable domain).
Changes In RFC Version 6
-------------------------
1) Completed the integration from the scriptable view to eBPF, including the
integration of externs.
2) Small bug fixes from v5 based on testing.
Changes In Version 7
-------------------------
0) First time removing the RFC tag!
1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!),
that using bpf links is sufficient to protect us from someone replacing or
deleting an eBPF program after it has been bound to a netdev.
2) Add some reviewed-bys from Vlad.
3) Small bug fixes from v6 based on eBPF testing.
4) Added the counter extern as a sample extern. We illustrate this example
because it is slightly complex: it can be invoked directly from the P4TC
domain (in the case of direct counters) or from eBPF (indirect counters).
It is not exactly the most efficient implementation (a reasonable counter
implementation should be per-cpu).
Changes In Version 8
---------------------
1) Fix all the patchwork warnings and improve our CI to catch them in the future.
2) Reduce the number of patches to a maximum of 15 to ease review.
What is P4?
-----------
Programming Protocol-independent Packet Processors (P4) is an open-source,
domain-specific programming language for specifying data plane behavior.
The P4 ecosystem includes an extensive range of deployments, products, projects,
services, etc. [9][10][11][12].
__What is P4TC?__
P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
run independently in different namespaces alongside their appropriate state. The
implementation builds on top of many years of Linux TC experiences.
On why P4 - see small treatise here:[4].
There have been many discussions and meetings since about 2015 regarding P4
over TC [2], and we are finally proving to the naysayers that we do get stuff
done!
A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
**In this patch series we focus on s/w datapath only**.
__P4TC Workflow__
These patches enable kernel and user space code change _independence_ for any
new P4 program that describes a new datapath. The workflow is as follows:
1) A developer writes a P4 program, "myprog"
2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
a) shell script(s) which form template definitions for the different P4
objects "myprog" utilizes (tables, externs, actions etc).
b) the parser and the rest of the datapath are generated
in eBPF and need to be compiled into binaries.
c) A json introspection file used for the control plane (by iproute2/tc).
3) The developer (or operator) executes the shell script(s) to manifest the
functional "myprog" into the kernel.
4) The developer (or operator) instantiates "myprog" via the tc P4 filter
to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
Example1: parser is an action:
"tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
action bpf obj $PARSER.o section parser/tc-ingress \
action bpf obj $PROGNAME.o section p4prog/tc"
Example2: parser explicitly bound and rest of dpath as an action:
"tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
prog tc obj $PARSER.o section parser/tc-ingress \
action bpf obj $PROGNAME.o section p4prog/tc"
Example3: parser is at XDP, rest of dpath as an action:
"tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
prog type xdp obj $PARSER.o section parser/xdp-ingress \
pinned_link /path/to/xdp-prog-link \
action bpf obj $PROGNAME.o section p4prog/tc"
Example4: parser+prog at XDP:
"tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
prog type xdp obj $PROGNAME.o section p4prog/xdp \
pinned_link /path/to/xdp-prog-link"
See the individual patches for more examples (tc vs xdp, etc). Also see the
section on "challenges" (in this cover letter).
Once "myprog" P4 program is instantiated one can start updating table entries
that are associated with myprog's table named "mytable". Example:
tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
action send_to_port param port eno1
A packet arriving on ingress of any of the ports on block 22 will first be
run through the (eBPF) parser to find the headers and extract the IP
destination address.
The rest of the eBPF datapath uses the resulting dstAddr as a key to do a
lookup in myprog's mytable, which returns the action params that are then used
to execute the action in the eBPF datapath (eventually sending out packets to
eno1).
On a table miss, mytable's default miss action is executed.
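For illustration only, the compiler-generated eBPF side of that flow might look
roughly like the sketch below. The kfunc name, the key/param struct layouts and
the action-execution step are placeholders standing in for the interface that
patch #13 and the generated code actually provide, not the real API:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Placeholder layouts and kfunc: only meant to illustrate the control flow */
struct mytable_key {
        __u32 keysz;                    /* key size in bits */
        __be32 dstAddr;                 /* value extracted by the parser */
} __attribute__((packed));

struct mytable_act {
        __u32 act_id;
        __u32 port_ifindex;             /* "port" param of send_to_port */
};

extern struct mytable_act *
p4tc_tbl_lookup(struct __sk_buff *skb, void *key, __u32 key__sz) __ksym;

SEC("p4prog/tc")
int myprog_datapath(struct __sk_buff *skb)
{
        struct mytable_key key = {
                .keysz = 32,
                .dstAddr = bpf_htonl(0x0a000102), /* 10.0.1.2, normally parsed */
        };
        struct mytable_act *act;

        act = p4tc_tbl_lookup(skb, &key, sizeof(key));
        if (!act)
                return TC_ACT_SHOT;     /* table miss -> default miss action */

        return bpf_redirect(act->port_ifindex, 0); /* send_to_port(eno1) */
}

char _license[] SEC("license") = "GPL";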
__Description of Patches__
P4TC is designed to have no impact on the core code for other users
of TC. IOW, you can compile it out, but even if it is compiled in and you don't
use it, there should be no impact on your performance.
We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
actions that can be created on the fly based on the P4 program's requirements.
This patch makes a small incision into act_api which shouldn't affect the
performance (or functionality) of the existing actions. Patches 2-4 and 6-7 are
minimalist enablers for P4TC and have no effect on the classical tc actions.
Patch 5 adds infrastructure support for preallocation of dynamic actions.
The core P4TC code implements several P4 objects.
1) Patch #8 introduces P4 data types which are consumed by the rest of the code.
2) Patch #9 introduces the concept of templating Pipelines, i.e. CRUD commands
for P4 pipelines.
3) Patch #10 introduces the concept of action templates and associated
CRUD commands.
4) Patch #11 introduces the concept of P4 table templates and associated
CRUD commands for tables.
5) Patch #12 introduces table entries and associated CRUD commands.
6) Patch #13 introduces interaction of eBPF with P4TC tables via kfuncs.
7) Patch #14 introduces the P4 TC classifier used at runtime.
8) Patch #15 introduces extern interfacing (both template and runtime).
__Testing__
Speaking of testing - we have ~300 tdc test cases. This number is growing as
we adjust to accommodate eBPF.
These tests are run on our CICD system on pull requests and after commits are
approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
input), including:
checkpatch, sparse, smatch, coccinelle, and 32-bit and 64-bit builds tested on
X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
CICD to catch performance regressions (currently only on the control path, but
in the future for the datapath as well).
Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
memory sanitizer but recently added support for the concurrency sanitizer.
Before main releases we ensure each patch will compile on its own, to help with
git bisect, and we run the xmas tree tool. We eventually run the code through
coverity.
In addition we are working on a tool that will take a P4 program, run it
through the compiler, and generate permutations of traffic patterns via
symbolic execution that will test both positive and negative datapath code
paths. The test generator tool is still a work in progress and will eventually
be emitted by the P4 compiler.
Note: we have other code that tests parallelization, etc, which we are trying
to find a fit for in the kernel tree's testing infra.
__Restating Our Requirements__
The initial release made in January/2023 had a "scriptable" datapath (think u32
classifier and pedit action). In this section we review the scriptable version
against the current implementation we are pushing upstream which uses eBPF.
Our intention is to target the TC crowd, essentially developers and ops people
deploying TC-based infra.
More importantly, the original intent for P4TC was to enable _ops folks_ more
than devs (given that the code is generated and doesn't need humans to write it).
With TC, we get the whole "familiar" package of match-action pipeline
abstraction++, meaning everything from the control plane all the way to the
tooling infra, i.e. the iproute2/tc CLI, the netlink infra (request/response,
event subscribe/multicast-publish, congestion control, etc), s/w and h/w
symbiosis, the autonomous kernel control, etc.
The main advantage is that we have a singular vendor-neutral interface via the
kernel, using well-understood mechanisms based on deployment experience (and
at least this part doesn't need retraining).
1) Supporting expressibility of the universe set of P4 progs
It is a must to support 100% of all possible P4 programs. In the past the eBPF
verifier had to be worked around, and even then there were cases where we
couldn't avoid path explosion when branching is involved. Kfuncs solve these
issues for us. Note, there are still challenges running all potential P4
programs at the XDP level - the solution to that is to have the compiler
generate XDP-based code only if it is possible to map it to that layer.
2) Support for P4 HW and SW equivalence.
This feature continues to work even in the presence of eBPF as the s/w
datapath. There are some square-peg-in-a-round-hole scenarios, but those are
implementation issues we can live with.
3) Operational usability
By maintaining the TC control plane (even in the presence of an eBPF datapath)
the runtime aspects remain unchanged. So for our target audience of folks
who have deployed tc, including offloads, the comfort zone is unchanged.
There is also the comfort zone of continuing to use the tried-and-true netlink
interfacing.
There is some loss in operational usability because we now have more knobs:
the extra compilation, loading and syncing of eBPF binaries, etc.
IOW, I can no longer just ship someone a shell script in an email, say "go run
this", and "myprog" will just work.
4) Operational and development Debuggability
If something goes wrong, the tc craftsperson is now required to have additional
knowledge of eBPF code and processes. This applies both to the operational
person and to someone who wrote a driver. We don't believe this is solvable.
5) Opportunity for rapid prototyping of new ideas
During the P4TC development phase something that came naturally was to often
handcode the template scripts, because the compiler backend (which is P4-arch
specific) wasn't ready to generate certain things. Then you would read back the
template and diff to ensure the kernel didn't get something wrong. So this
started as a debug feature. During development, we wrote scripts that
covered a range of P4 architectures (PSA, V1, etc) which required no kernel
code changes.
Over time the debug feature morphed into: a) start by handcoding scripts, then
b) read them back, and then c) generate the P4 code.
It means one could start with the template scripts outside of the constraints
of a P4 architecture spec (PNA/PSA), or even within a P4 architecture, then
test some ideas and eventually feed the concepts back to the compiler authors,
or modify or create a new P4 architecture and share it with the P4 standards
folks.
To summarize, in the presence of eBPF: the debugging idea is probably still
alive. One could dump, with proper tooling (bpftool for example), the loaded
eBPF code and check for differences. But this is not the interesting part.
The concept of going back from what's in the kernel to P4 is a lot more
difficult to implement, mostly due to the scoping of a DSL vs a general-purpose
language. It may be lost.
We have been thinking of ways to use BTF and to embed annotations in the eBPF
code and binary, but more thought is required and we welcome suggestions.
6) Supporting per-namespace programs
This requirement is still met (by virtue of keeping P4 control objects within
the TC domain).
__Challenges__
1) The concept of a tc block is _very tedious_ to implement in XDP. It would be
nice if we could use the concept there as well, since we expect P4 to work with
many ports. It will likely require some core patches to fix this.
2) Right now we are using the "packed" construct to enforce alignment in kfunc
data exchange, but we're wondering if there is potential to use BTF to
understand parameters and their offsets and to encode this information at the
compiler level.
3) At the moment we are creating a static buffer of 128B to retrieve the action
parameters (see the sketch after this list). If you have a lot of table entries
and individual (non-shared) action instances with actions that require very
little (or no) param space, a lot of memory is wasted. There may also be cases
where 128B is not enough (likely this is something we can teach the P4C
compiler). If we could have dynamic pointers instead of fixed-length kfunc
parameterization, then this issue would be resolvable.
4) See "Restating Our Requirements" #5.
We would really appreciate ideas/suggestions, etc.
__References__
[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main
Jamal Hadi Salim (15):
net: sched: act_api: Introduce dynamic actions list
net/sched: act_api: increase action kind string length
net/sched: act_api: Update tc_action_ops to account for dynamic
actions
net/sched: act_api: add struct p4tc_action_ops as a parameter to
lookup callback
net: sched: act_api: Add support for preallocated dynamic action
instances
net: introduce rcu_replace_pointer_rtnl
rtnl: add helper to check if group has listeners
p4tc: add P4 data types
p4tc: add template pipeline create, get, update, delete
p4tc: add action template create, update, delete, get, flush and dump
p4tc: add template table create, update, delete, get, flush and dump
p4tc: add runtime table entry create, update, get, delete, flush and
dump
p4tc: add set of P4TC table kfuncs
p4tc: add P4 classifier
p4tc: Add P4 extern interface
include/linux/bitops.h | 1 +
include/linux/rtnetlink.h | 19 +
include/net/act_api.h | 22 +-
include/net/p4tc.h | 744 ++++++++
include/net/p4tc_ext_api.h | 199 ++
include/net/p4tc_types.h | 88 +
include/net/tc_act/p4tc.h | 52 +
include/uapi/linux/p4tc.h | 406 ++++
include/uapi/linux/p4tc_ext.h | 36 +
include/uapi/linux/pkt_cls.h | 19 +
include/uapi/linux/rtnetlink.h | 18 +
net/sched/Kconfig | 23 +
net/sched/Makefile | 3 +
net/sched/act_api.c | 195 +-
net/sched/cls_api.c | 2 +-
net/sched/cls_p4.c | 447 +++++
net/sched/p4tc/Makefile | 8 +
net/sched/p4tc/p4tc_action.c | 2308 +++++++++++++++++++++++
net/sched/p4tc/p4tc_bpf.c | 414 +++++
net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++
net/sched/p4tc/p4tc_pipeline.c | 707 +++++++
net/sched/p4tc/p4tc_runtime_api.c | 153 ++
net/sched/p4tc/p4tc_table.c | 1634 ++++++++++++++++
net/sched/p4tc/p4tc_tbl_entry.c | 2870 +++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_tmpl_api.c | 611 ++++++
net/sched/p4tc/p4tc_tmpl_ext.c | 2221 ++++++++++++++++++++++
net/sched/p4tc/p4tc_types.c | 1247 +++++++++++++
net/sched/p4tc/trace.c | 10 +
net/sched/p4tc/trace.h | 44 +
security/selinux/nlmsgtab.c | 10 +-
30 files changed, 16676 insertions(+), 39 deletions(-)
create mode 100644 include/net/p4tc.h
create mode 100644 include/net/p4tc_ext_api.h
create mode 100644 include/net/p4tc_types.h
create mode 100644 include/net/tc_act/p4tc.h
create mode 100644 include/uapi/linux/p4tc.h
create mode 100644 include/uapi/linux/p4tc_ext.h
create mode 100644 net/sched/cls_p4.c
create mode 100644 net/sched/p4tc/Makefile
create mode 100644 net/sched/p4tc/p4tc_action.c
create mode 100644 net/sched/p4tc/p4tc_bpf.c
create mode 100644 net/sched/p4tc/p4tc_ext.c
create mode 100644 net/sched/p4tc/p4tc_pipeline.c
create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
create mode 100644 net/sched/p4tc/p4tc_table.c
create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
create mode 100644 net/sched/p4tc/p4tc_tmpl_ext.c
create mode 100644 net/sched/p4tc/p4tc_types.c
create mode 100644 net/sched/p4tc/trace.c
create mode 100644 net/sched/p4tc/trace.h
--
2.34.1
* [PATCH net-next v8 01/15] net: sched: act_api: Introduce dynamic actions list
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
` (14 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
In P4 we need to generate new actions "on the fly" based on the
specified P4 action definition. Dynamic action kinds, like the pipeline
they are attached to, must be per net namespace, as opposed to native
action kinds which are global. For that reason, we chose to create a
separate structure to store dynamic actions.
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
---
include/net/act_api.h | 7 ++-
net/sched/act_api.c | 123 +++++++++++++++++++++++++++++++++++++-----
net/sched/cls_api.c | 2 +-
3 files changed, 115 insertions(+), 17 deletions(-)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index 4ae0580b6..3d40adef1 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
struct tc_action_ops {
struct list_head head;
+ struct list_head dyn_head;
char kind[IFNAMSIZ];
enum tca_id id; /* identifier should match kind */
unsigned int net_id;
@@ -198,8 +199,10 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
int tcf_idr_release(struct tc_action *a, bool bind);
int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
+int tcf_register_dyn_action(struct net *net, struct tc_action_ops *act);
int tcf_unregister_action(struct tc_action_ops *a,
struct pernet_operations *ops);
+void tcf_unregister_dyn_action(struct net *net, struct tc_action_ops *act);
int tcf_action_destroy(struct tc_action *actions[], int bind);
int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
int nr_actions, struct tcf_result *res);
@@ -207,8 +210,8 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
struct nlattr *est,
struct tc_action *actions[], int init_res[], size_t *attr_size,
u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
-struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, bool police,
- bool rtnl_held,
+struct tc_action_ops *tc_action_load_ops(struct net *net, struct nlattr *nla,
+ bool police, bool rtnl_held,
struct netlink_ext_ack *extack);
struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index c39252d61..443c49116 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
kfree(cookie);
}
+static unsigned int dyn_act_net_id;
+
+struct tcf_dyn_act_net {
+ struct list_head act_base;
+ rwlock_t act_mod_lock;
+};
+
+static __net_init int tcf_dyn_act_base_init_net(struct net *net)
+{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
+
+ INIT_LIST_HEAD(&dyn_base_net->act_base);
+ rwlock_init(&dyn_base_net->act_mod_lock);
+
+ return 0;
+}
+
+static void __net_exit tcf_dyn_act_base_exit_net(struct net *net)
+{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
+ struct tc_action_ops *ops, *tmp;
+
+ list_for_each_entry_safe(ops, tmp, &dyn_base_net->act_base, dyn_head) {
+ list_del(&ops->dyn_head);
+ }
+}
+
+static struct pernet_operations tcf_dyn_act_base_net_ops = {
+ .init = tcf_dyn_act_base_init_net,
+ .exit = tcf_dyn_act_base_exit_net,
+ .id = &dyn_act_net_id,
+ .size = sizeof(struct tc_action_ops),
+};
+
static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
struct tc_cookie *new_cookie)
{
@@ -941,6 +975,48 @@ static void tcf_pernet_del_id_list(unsigned int id)
mutex_unlock(&act_id_mutex);
}
+static struct tc_action_ops *tc_lookup_dyn_action(struct net *net, char *kind)
+{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
+ struct tc_action_ops *a, *res = NULL;
+
+ read_lock(&dyn_base_net->act_mod_lock);
+ list_for_each_entry(a, &dyn_base_net->act_base, dyn_head) {
+ if (strcmp(kind, a->kind) == 0) {
+ if (try_module_get(a->owner))
+ res = a;
+ break;
+ }
+ }
+ read_unlock(&dyn_base_net->act_mod_lock);
+
+ return res;
+}
+
+void tcf_unregister_dyn_action(struct net *net, struct tc_action_ops *act)
+{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
+
+ write_lock(&dyn_base_net->act_mod_lock);
+ list_del(&act->dyn_head);
+ write_unlock(&dyn_base_net->act_mod_lock);
+}
+EXPORT_SYMBOL(tcf_unregister_dyn_action);
+
+int tcf_register_dyn_action(struct net *net, struct tc_action_ops *act)
+{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
+
+ if (tc_lookup_dyn_action(net, act->kind))
+ return -EEXIST;
+
+ write_lock(&dyn_base_net->act_mod_lock);
+ list_add(&act->dyn_head, &dyn_base_net->act_base);
+ write_unlock(&dyn_base_net->act_mod_lock);
+
+ return 0;
+}
+
int tcf_register_action(struct tc_action_ops *act,
struct pernet_operations *ops)
{
@@ -1011,7 +1087,7 @@ int tcf_unregister_action(struct tc_action_ops *act,
EXPORT_SYMBOL(tcf_unregister_action);
/* lookup by name */
-static struct tc_action_ops *tc_lookup_action_n(char *kind)
+static struct tc_action_ops *tc_lookup_action_n(struct net *net, char *kind)
{
struct tc_action_ops *a, *res = NULL;
@@ -1019,31 +1095,48 @@ static struct tc_action_ops *tc_lookup_action_n(char *kind)
read_lock(&act_mod_lock);
list_for_each_entry(a, &act_base, head) {
if (strcmp(kind, a->kind) == 0) {
- if (try_module_get(a->owner))
- res = a;
- break;
+ if (try_module_get(a->owner)) {
+ read_unlock(&act_mod_lock);
+ return a;
+ }
}
}
read_unlock(&act_mod_lock);
+
+ return tc_lookup_dyn_action(net, kind);
}
+
return res;
}
/* lookup by nlattr */
-static struct tc_action_ops *tc_lookup_action(struct nlattr *kind)
+static struct tc_action_ops *tc_lookup_action(struct net *net,
+ struct nlattr *kind)
{
+ struct tcf_dyn_act_net *dyn_base_net = net_generic(net, dyn_act_net_id);
struct tc_action_ops *a, *res = NULL;
if (kind) {
read_lock(&act_mod_lock);
list_for_each_entry(a, &act_base, head) {
+ if (nla_strcmp(kind, a->kind) == 0) {
+ if (try_module_get(a->owner)) {
+ read_unlock(&act_mod_lock);
+ return a;
+ }
+ }
+ }
+ read_unlock(&act_mod_lock);
+
+ read_lock(&dyn_base_net->act_mod_lock);
+ list_for_each_entry(a, &dyn_base_net->act_base, dyn_head) {
if (nla_strcmp(kind, a->kind) == 0) {
if (try_module_get(a->owner))
res = a;
break;
}
}
- read_unlock(&act_mod_lock);
+ read_unlock(&dyn_base_net->act_mod_lock);
}
return res;
}
@@ -1294,8 +1387,8 @@ void tcf_idr_insert_many(struct tc_action *actions[])
}
}
-struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, bool police,
- bool rtnl_held,
+struct tc_action_ops *tc_action_load_ops(struct net *net, struct nlattr *nla,
+ bool police, bool rtnl_held,
struct netlink_ext_ack *extack)
{
struct nlattr *tb[TCA_ACT_MAX + 1];
@@ -1326,7 +1419,7 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, bool police,
}
}
- a_o = tc_lookup_action_n(act_name);
+ a_o = tc_lookup_action_n(net, act_name);
if (a_o == NULL) {
#ifdef CONFIG_MODULES
if (rtnl_held)
@@ -1335,7 +1428,7 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, bool police,
if (rtnl_held)
rtnl_lock();
- a_o = tc_lookup_action_n(act_name);
+ a_o = tc_lookup_action_n(net, act_name);
/* We dropped the RTNL semaphore in order to
* perform the module load. So, even if we
@@ -1445,7 +1538,8 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
struct tc_action_ops *a_o;
- a_o = tc_action_load_ops(tb[i], flags & TCA_ACT_FLAGS_POLICE,
+ a_o = tc_action_load_ops(net, tb[i],
+ flags & TCA_ACT_FLAGS_POLICE,
!(flags & TCA_ACT_FLAGS_NO_RTNL),
extack);
if (IS_ERR(a_o)) {
@@ -1655,7 +1749,7 @@ static struct tc_action *tcf_action_get_1(struct net *net, struct nlattr *nla,
index = nla_get_u32(tb[TCA_ACT_INDEX]);
err = -EINVAL;
- ops = tc_lookup_action(tb[TCA_ACT_KIND]);
+ ops = tc_lookup_action(net, tb[TCA_ACT_KIND]);
if (!ops) { /* could happen in batch of actions */
NL_SET_ERR_MSG(extack, "Specified TC action kind not found");
goto err_out;
@@ -1703,7 +1797,7 @@ static int tca_action_flush(struct net *net, struct nlattr *nla,
err = -EINVAL;
kind = tb[TCA_ACT_KIND];
- ops = tc_lookup_action(kind);
+ ops = tc_lookup_action(net, kind);
if (!ops) { /*some idjot trying to flush unknown action */
NL_SET_ERR_MSG(extack, "Cannot flush unknown TC action");
goto err_out;
@@ -2109,7 +2203,7 @@ static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
return 0;
}
- a_o = tc_lookup_action(kind);
+ a_o = tc_lookup_action(net, kind);
if (a_o == NULL)
return 0;
@@ -2176,6 +2270,7 @@ static int __init tc_action_init(void)
rtnl_register(PF_UNSPEC, RTM_GETACTION, tc_ctl_action, tc_dump_action,
0);
+ register_pernet_subsys(&tcf_dyn_act_base_net_ops);
return 0;
}
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1976bd163..2db3c13c7 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3293,7 +3293,7 @@ int tcf_exts_validate_ex(struct net *net, struct tcf_proto *tp, struct nlattr **
if (exts->police && tb[exts->police]) {
struct tc_action_ops *a_o;
- a_o = tc_action_load_ops(tb[exts->police], true,
+ a_o = tc_action_load_ops(net, tb[exts->police], true,
!(flags & TCA_ACT_FLAGS_NO_RTNL),
extack);
if (IS_ERR(a_o))
--
2.34.1
* [PATCH net-next v8 02/15] net/sched: act_api: increase action kind string length
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 01/15] net: sched: act_api: Introduce dynamic actions list Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 03/15] net/sched: act_api: Update tc_action_ops to account for dynamic actions Jamal Hadi Salim
` (13 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Increase the action kind string length from IFNAMSIZ to 64.
The new P4TC dynamic actions, created via templates, will have longer names
of the format "pipeline_name/act_name". IFNAMSIZ is currently 16 and is most
of the time undersized for the above format.
So, to conform to this new format, we increase the maximum name length
to account for this extra string (pipeline name) and the '/' character.
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
---
include/net/act_api.h | 2 +-
include/uapi/linux/pkt_cls.h | 1 +
net/sched/act_api.c | 6 +++---
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index 3d40adef1..b38a7029a 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -106,7 +106,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
struct tc_action_ops {
struct list_head head;
struct list_head dyn_head;
- char kind[IFNAMSIZ];
+ char kind[ACTNAMSIZ];
enum tca_id id; /* identifier should match kind */
unsigned int net_id;
size_t size;
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index c7082cc60..75bf73742 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -6,6 +6,7 @@
#include <linux/pkt_sched.h>
#define TC_COOKIE_MAX_SIZE 16
+#define ACTNAMSIZ 64
/* Action attributes */
enum {
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 443c49116..641e14df3 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -476,7 +476,7 @@ static size_t tcf_action_shared_attrs_size(const struct tc_action *act)
rcu_read_unlock();
return nla_total_size(0) /* action number nested */
- + nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */
+ + nla_total_size(ACTNAMSIZ) /* TCA_ACT_KIND */
+ cookie_len /* TCA_ACT_COOKIE */
+ nla_total_size(sizeof(struct nla_bitfield32)) /* TCA_ACT_HW_STATS */
+ nla_total_size(0) /* TCA_ACT_STATS nested */
@@ -1393,7 +1393,7 @@ struct tc_action_ops *tc_action_load_ops(struct net *net, struct nlattr *nla,
{
struct nlattr *tb[TCA_ACT_MAX + 1];
struct tc_action_ops *a_o;
- char act_name[IFNAMSIZ];
+ char act_name[ACTNAMSIZ];
struct nlattr *kind;
int err;
@@ -1408,7 +1408,7 @@ struct tc_action_ops *tc_action_load_ops(struct net *net, struct nlattr *nla,
NL_SET_ERR_MSG(extack, "TC action kind must be specified");
return ERR_PTR(err);
}
- if (nla_strscpy(act_name, kind, IFNAMSIZ) < 0) {
+ if (nla_strscpy(act_name, kind, ACTNAMSIZ) < 0) {
NL_SET_ERR_MSG(extack, "TC action name too long");
return ERR_PTR(err);
}
--
2.34.1
* [PATCH net-next v8 03/15] net/sched: act_api: Update tc_action_ops to account for dynamic actions
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 01/15] net: sched: act_api: Introduce dynamic actions list Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
` (12 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
The initialisation of P4TC action instances requires access to a struct
p4tc_act (which appears in later patches) to help us retrieve
information like the dynamic action parameters, etc. In order to retrieve
struct p4tc_act we need the pipeline name or id and the action name or id.
Also recall that P4TC action IDs are dynamic and are net-namespace
specific. The init callback in tc_action_ops had no way of
supplying us with that information. To solve this issue, we decided to create a
new tc_action_ops callback (init_ops) that provides us with the
tc_action_ops struct, which then provides us with the pipeline and action
name. In addition we add a new refcount to struct tc_action_ops called
dyn_ref, which accounts for how many action instances we have of a specific
action.
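As a rough illustration (the p4tc_act_* callbacks and struct p4tc_act are
placeholders for code that appears later in the series), a templated action
kind would be registered with the new callback along these lines:

/* Sketch only: ops for a templated kind are allocated and filled at runtime */
static int example_register_templated_kind(struct net *net)
{
        struct tc_action_ops *ops;

        ops = kzalloc(sizeof(*ops), GFP_KERNEL);
        if (!ops)
                return -ENOMEM;

        strscpy(ops->kind, "myprog/send_nh", ACTNAMSIZ);
        ops->owner = THIS_MODULE;
        ops->act = p4tc_act_run;                /* placeholder callbacks */
        ops->dump = p4tc_act_dump;
        ops->lookup = p4tc_act_lookup;
        /* init_ops instead of init: the ops pointer passed back lets the
         * callback map "pipeline/action" to its struct p4tc_act in this netns.
         */
        ops->init_ops = p4tc_act_init_ops;
        refcount_set(&ops->dyn_ref, 1);

        return tcf_register_dyn_action(net, ops);
}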
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
---
include/net/act_api.h | 6 ++++++
net/sched/act_api.c | 14 +++++++++++---
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index b38a7029a..1fdf502a5 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -109,6 +109,7 @@ struct tc_action_ops {
char kind[ACTNAMSIZ];
enum tca_id id; /* identifier should match kind */
unsigned int net_id;
+ refcount_t dyn_ref;
size_t size;
struct module *owner;
int (*act)(struct sk_buff *, const struct tc_action *,
@@ -120,6 +121,11 @@ struct tc_action_ops {
struct nlattr *est, struct tc_action **act,
struct tcf_proto *tp,
u32 flags, struct netlink_ext_ack *extack);
+ /* This should be merged with the original init action */
+ int (*init_ops)(struct net *net, struct nlattr *nla,
+ struct nlattr *est, struct tc_action **act,
+ struct tcf_proto *tp, struct tc_action_ops *ops,
+ u32 flags, struct netlink_ext_ack *extack);
int (*walk)(struct net *, struct sk_buff *,
struct netlink_callback *, int,
const struct tc_action_ops *,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 641e14df3..41cd84146 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1023,7 +1023,7 @@ int tcf_register_action(struct tc_action_ops *act,
struct tc_action_ops *a;
int ret;
- if (!act->act || !act->dump || !act->init)
+ if (!act->act || !act->dump || (!act->init && !act->init_ops))
return -EINVAL;
/* We have to register pernet ops before making the action ops visible,
@@ -1484,8 +1484,16 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
}
}
- err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, tp,
- userflags.value | flags, extack);
+ /* When we arrive here we guarantee that a_o->init or
+ * a_o->init_ops exist.
+ */
+ if (a_o->init)
+ err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, tp,
+ userflags.value | flags, extack);
+ else
+ err = a_o->init_ops(net, tb[TCA_ACT_OPTIONS], est, &a,
+ tp, a_o, userflags.value | flags,
+ extack);
} else {
err = a_o->init(net, nla, est, &a, tp, userflags.value | flags,
extack);
--
2.34.1
* [PATCH net-next v8 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (2 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 03/15] net/sched: act_api: Update tc_action_ops to account for dynamic actions Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 05/15] net: sched: act_api: Add support for preallocated dynamic action instances Jamal Hadi Salim
` (11 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
For P4TC dynamic actions, we require information from struct tc_action_ops,
specifically the action kind, to find and locate the dynamic action
information for the lookup operation.
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
---
include/net/act_api.h | 3 ++-
net/sched/act_api.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index 1fdf502a5..90e215f10 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -116,7 +116,8 @@ struct tc_action_ops {
struct tcf_result *); /* called under RCU BH lock*/
int (*dump)(struct sk_buff *, struct tc_action *, int, int);
void (*cleanup)(struct tc_action *);
- int (*lookup)(struct net *net, struct tc_action **a, u32 index);
+ int (*lookup)(struct net *net, const struct tc_action_ops *ops,
+ struct tc_action **a, u32 index);
int (*init)(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action **act,
struct tcf_proto *tp,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 41cd84146..b277accc3 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -726,7 +726,7 @@ static int __tcf_idr_search(struct net *net,
struct tc_action_net *tn = net_generic(net, ops->net_id);
if (unlikely(ops->lookup))
- return ops->lookup(net, a, index);
+ return ops->lookup(net, ops, a, index);
return tcf_idr_search(tn, a, index);
}
--
2.34.1
* [PATCH net-next v8 05/15] net: sched: act_api: Add support for preallocated dynamic action instances
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (3 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 06/15] net: introduce rcu_replace_pointer_rtnl Jamal Hadi Salim
` (10 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
In P4, actions are assumed to pre-exist and to have an upper bound on the
number of instances. Typically, if you have 1M table entries you want to
allocate enough action instances to cover the 1M entries. However, this is a
big waste of memory if the action instances are not in use. So for our case,
we allow the user to specify a minimal amount of actions in the template,
and then if more dynamic action instances are needed they will be
added on demand, as in the current approach with the tc filter-action
relationship.
Add the necessary code to preallocate actions instances for dynamic
actions.
We add 2 new action flags:
- TCA_ACT_FLAGS_PREALLOC: Indicates the action instance is a dynamic action
and was preallocated for future use in the templating phase of P4TC.
- TCA_ACT_FLAGS_UNREFERENCED: Indicates the action instance was
preallocated and is currently not being referenced by any other object,
which means it won't show up in an action instance dump.
Once an action instance is created we don't free it when the last table
entry referring to it is deleted.
Instead we add it to the pool/cache of action instances for
that specific action, i.e. it counts as if it were preallocated.
Preallocated actions can't be deleted by the tc actions runtime commands,
and a dump or a get will only show preallocated action
instances which are being used (TCA_ACT_FLAGS_UNREFERENCED == false).
The preallocated actions will be deleted once the pipeline is deleted
(which will purge the dynamic action kind and its instances).
For example, if we were to create a dynamic action that preallocates 128
elements and dumped:
$ tc -j p4template get action/myprog/send_nh | jq .
We'd see the following:
[
{
"obj": "action template",
"pname": "myprog",
"pipeid": 1
},
{
"templates": [
{
"aname": "myprog/send_nh",
"actid": 1,
"params": [
{
"name": "port",
"type": "dev",
"id": 1
}
],
"prealloc": 128
}
]
}
]
If we try to dump the dynamic action instances, we won't see any:
$ tc -j actions ls action myprog/send_nh | jq .
[]
However, if we create a table entry which references this action kind:
$ tc p4ctrl create myprog/table/cb/FDB \
dstAddr d2:96:91:5d:02:86 action myprog/send_nh \
param port type dev dummy0
Dumping the action instance will now show this one instance which is
associated with the table entry:
$ tc -j actions ls action myprog/send_nh | jq .
[
{
"total acts": 1
},
{
"actions": [
{
"order": 0,
"kind": "myprog/send_nh",
"index": 1,
"ref": 1,
"bind": 1,
"params": [
{
"name": "port",
"type": "dev",
"value": "dummy0",
"id": 1
}
],
"not_in_hw": true
}
]
}
]
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
---
include/net/act_api.h | 3 +++
net/sched/act_api.c | 50 ++++++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 12 deletions(-)
diff --git a/include/net/act_api.h b/include/net/act_api.h
index 90e215f10..cd5a8e86f 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -68,6 +68,8 @@ struct tc_action {
#define TCA_ACT_FLAGS_REPLACE (1U << (TCA_ACT_FLAGS_USER_BITS + 2))
#define TCA_ACT_FLAGS_NO_RTNL (1U << (TCA_ACT_FLAGS_USER_BITS + 3))
#define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
+#define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
+#define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
/* Update lastuse only if needed, to avoid dirtying a cache line.
* We use a temp variable to avoid fetching jiffies twice.
@@ -200,6 +202,7 @@ int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
const struct tc_action_ops *ops, int bind,
u32 flags);
void tcf_idr_insert_many(struct tc_action *actions[]);
+void tcf_idr_insert_n(struct tc_action *actions[], const u32 n);
void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
struct tc_action **a, int bind);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index b277accc3..3fe399384 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -560,6 +560,8 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
continue;
if (IS_ERR(p))
continue;
+ if (p->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+ continue;
if (jiffy_since &&
time_after(jiffy_since,
@@ -640,6 +642,9 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
idr_for_each_entry_ul(idr, p, tmp, id) {
if (IS_ERR(p))
continue;
+ if (p->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)
+ continue;
+
ret = tcf_idr_release_unsafe(p);
if (ret == ACT_P_DELETED)
module_put(ops->owner);
@@ -1367,26 +1372,38 @@ static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
[TCA_ACT_HW_STATS] = NLA_POLICY_BITFIELD32(TCA_ACT_HW_STATS_ANY),
};
+static void tcf_idr_insert_1(struct tc_action *a)
+{
+ struct tcf_idrinfo *idrinfo;
+
+ idrinfo = a->idrinfo;
+ mutex_lock(&idrinfo->lock);
+ /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc if
+ * it is just created, otherwise this is just a nop.
+ */
+ idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+ mutex_unlock(&idrinfo->lock);
+}
+
void tcf_idr_insert_many(struct tc_action *actions[])
{
int i;
for (i = 0; i < TCA_ACT_MAX_PRIO; i++) {
- struct tc_action *a = actions[i];
- struct tcf_idrinfo *idrinfo;
-
- if (!a)
+ if (!actions[i])
continue;
- idrinfo = a->idrinfo;
- mutex_lock(&idrinfo->lock);
- /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc if
- * it is just created, otherwise this is just a nop.
- */
- idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
- mutex_unlock(&idrinfo->lock);
+ tcf_idr_insert_1(actions[i]);
}
}
+void tcf_idr_insert_n(struct tc_action *actions[], const u32 n)
+{
+ int i;
+
+ for (i = 0; i < n; i++)
+ tcf_idr_insert_1(actions[i]);
+}
+
struct tc_action_ops *tc_action_load_ops(struct net *net, struct nlattr *nla,
bool police, bool rtnl_held,
struct netlink_ext_ack *extack)
@@ -2033,8 +2050,17 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
ret = PTR_ERR(act);
goto err;
}
- attr_size += tcf_action_fill_size(act);
actions[i - 1] = act;
+
+ if (event == RTM_DELACTION &&
+ act->tcfa_flags & TCA_ACT_FLAGS_PREALLOC) {
+ ret = -EINVAL;
+ NL_SET_ERR_MSG_FMT(extack,
+ "Unable to delete preallocated action %s",
+ act->ops->kind);
+ goto err;
+ }
+ attr_size += tcf_action_fill_size(act);
}
attr_size = tcf_action_full_attrs_size(attr_size);
--
2.34.1
* [PATCH net-next v8 06/15] net: introduce rcu_replace_pointer_rtnl
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (4 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 05/15] net: sched: act_api: Add support for preallocated dynamic action instances Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 07/15] rtnl: add helper to check if group has listeners Jamal Hadi Salim
` (9 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
We use rcu_replace_pointer(rcu_ptr, ptr, lockdep_rtnl_is_held()) throughout
the P4TC infrastructure code.
It may be useful for other use cases, so we create a helper.
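A minimal usage sketch (the surrounding structures and field names here are
made up purely for illustration):

/* Illustrative only: swap a config object under RTNL, free the old one
 * after a grace period.
 */
struct example_cfg {
	struct rcu_head rcu;
	u32 default_act_id;
};

struct example_table {
	struct example_cfg __rcu *cfg;
};

static void example_table_replace_cfg(struct example_table *table,
				      struct example_cfg *new_cfg)
{
	struct example_cfg *old;

	/* Caller holds RTNL; the helper expands to
	 * rcu_replace_pointer(table->cfg, new_cfg, lockdep_rtnl_is_held()).
	 */
	old = rcu_replace_pointer_rtnl(table->cfg, new_cfg);
	if (old)
		kfree_rcu(old, rcu);
}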
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/linux/rtnetlink.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 3d6cf306c..971055e66 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -62,6 +62,18 @@ static inline bool lockdep_rtnl_is_held(void)
#define rcu_dereference_rtnl(p) \
rcu_dereference_check(p, lockdep_rtnl_is_held())
+/**
+ * rcu_replace_pointer_rtnl - replace an RCU pointer under rtnl_lock, returning
+ * its old value
+ * @rcu_ptr: RCU pointer, whose old value is returned
+ * @ptr: regular pointer
+ *
+ * Perform a replacement under rtnl_lock, where @rcu_ptr is an RCU-annotated
+ * pointer. The old value of @rcu_ptr is returned, and @rcu_ptr is set to @ptr
+ */
+#define rcu_replace_pointer_rtnl(rcu_ptr, ptr) \
+ rcu_replace_pointer(rcu_ptr, ptr, lockdep_rtnl_is_held())
+
/**
* rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
* @p: The pointer to read, prior to dereferencing
--
2.34.1
* [PATCH net-next v8 07/15] rtnl: add helper to check if group has listeners
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (5 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 06/15] net: introduce rcu_replace_pointer_rtnl Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 08/15] p4tc: add P4 data types Jamal Hadi Salim
` (8 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
As of today, rtnl code creates a new skb and unconditionally fills and
broadcasts it to the relevant group. For most operations this is okay
and doesn't waste resources in general.
For P4TC, it's interesting to know if the TC group has any listeners
when adding/updating/deleting table entries, as we can optimize for the
most likely case, in which it has none. This not only improves our processing
speed, it also reduces pressure on system memory as we completely
avoid the broadcast skb allocation.
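As an illustration, the intended usage pattern looks roughly like the sketch
below (the function and the message-filling step are placeholders, not code
from this series):

/* Illustrative only: skip building the notification when nobody listens */
static int example_notify_entry_change(struct net *net, u32 portid)
{
	struct sk_buff *skb;

	if (!rtnl_has_listeners(net, RTNLGRP_TC) && !portid)
		return 0;	/* no subscribers, no requester: fast path */

	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
	if (!skb)
		return -ENOMEM;

	/* ... fill in the netlink message describing the entry change ... */

	return rtnetlink_send(skb, net, portid, RTNLGRP_TC, 0);
}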
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/linux/rtnetlink.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 971055e66..487e45f8a 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -142,4 +142,11 @@ extern int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
extern void rtnl_offload_xstats_notify(struct net_device *dev);
+static inline int rtnl_has_listeners(const struct net *net, u32 group)
+{
+ struct sock *rtnl = net->rtnl;
+
+ return netlink_has_listeners(rtnl, group);
+}
+
#endif /* __LINUX_RTNETLINK_H */
--
2.34.1
* [PATCH net-next v8 08/15] p4tc: add P4 data types
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (6 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 07/15] rtnl: add helper to check if group has listeners Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 16:03 ` Jiri Pirko
2023-11-16 14:59 ` [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
` (7 subsequent siblings)
15 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Introduce an abstraction that represents P4 data types.
This also introduces the Kconfig and Makefile which later patches use.
Types can be little-endian, host-order or big-endian definitions. The
abstraction also supports defining:
a) bitstrings using P4 annotations that look like "bit<X>" where X
is the number of bits defined in a type
b) bitslices such that one can define in P4 as bit<8>[0-3] and
bit<16>[4-9]. A 4-bit slice from bits 0-3 and a 6-bit slice from bits
4-9 respectively.
Each type has a bitsize, a name (for debugging purposes), an ID and
methods/ops. The P4 types will be used by externs, dynamic actions, packet
headers and other parts of P4TC.
Each type has four ops:
- validate_p4t: Which validates if a given value of a specific type
meets valid boundary conditions.
- create_bitops: Which, given a bitsize, bitstart and bitend allocates and
returns a mask and a shift value. For example, if we have type
bit<8>[3-3] meaning bitstart = 3 and bitend = 3, we'll create a mask
which would only give us the fourth bit of a bit8 value, that is, 0x08.
Since we are interested in the fourth bit, the bit shift value will be 3.
This is also useful if an "irregular" bitsize is used, for example,
bit24. In that case bitstart = 0 and bitend = 23. Shift will be 0 and
the mask will be 0xFFFFFF00 if the machine is big endian. (A generic
sketch of this mask/shift computation appears after this list of ops.)
- host_read : Which reads the value of a given type and transforms it to
host order (if needed)
- host_write : Which writes a provided host order value and transforms it
to the type's native order (if needed)
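For example, the mask/shift pair for a slice of an unsigned container could be
derived along the lines of the sketch below (illustration only, not the exact
code in this patch):

/* Sketch: mask/shift for a bit<X>[bitstart-bitend] slice of an unsigned
 * 32-bit container, e.g. bit<8>[3-3] -> mask 0x08, shift 3.
 */
static int example_bitops(u16 bitstart, u16 bitend, u32 *mask, u8 *shift)
{
	if (bitend < bitstart || bitend > 31)
		return -EINVAL;

	*mask = GENMASK(bitend, bitstart);
	*shift = bitstart;
	return 0;
}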
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/p4tc_types.h | 88 +++
include/uapi/linux/p4tc.h | 33 +
net/sched/Kconfig | 11 +
net/sched/Makefile | 2 +
net/sched/p4tc/Makefile | 3 +
net/sched/p4tc/p4tc_types.c | 1247 +++++++++++++++++++++++++++++++++++
6 files changed, 1384 insertions(+)
create mode 100644 include/net/p4tc_types.h
create mode 100644 include/uapi/linux/p4tc.h
create mode 100644 net/sched/p4tc/Makefile
create mode 100644 net/sched/p4tc/p4tc_types.c
diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
new file mode 100644
index 000000000..8f6f002ae
--- /dev/null
+++ b/include/net/p4tc_types.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_P4TYPES_H
+#define __NET_P4TYPES_H
+
+#include <linux/netlink.h>
+#include <linux/pkt_cls.h>
+#include <linux/types.h>
+
+#include <uapi/linux/p4tc.h>
+
+#define P4T_MAX_BITSZ 128
+
+struct p4tc_type_mask_shift {
+ void *mask;
+ u8 shift;
+};
+
+struct p4tc_type;
+struct p4tc_type_ops {
+ int (*validate_p4t)(struct p4tc_type *container, void *value, u16 startbit,
+ u16 endbit, struct netlink_ext_ack *extack);
+ struct p4tc_type_mask_shift *(*create_bitops)(u16 bitsz,
+ u16 bitstart,
+ u16 bitend,
+ struct netlink_ext_ack *extack);
+ void (*host_read)(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval);
+ void (*host_write)(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval);
+ void (*print)(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val);
+};
+
+#define P4T_MAX_STR_SZ 32
+struct p4tc_type {
+ char name[P4T_MAX_STR_SZ];
+ const struct p4tc_type_ops *ops;
+ size_t container_bitsz;
+ size_t bitsz;
+ int typeid;
+};
+
+struct p4tc_type *p4type_find_byid(int id);
+bool p4tc_is_type_unsigned(int typeid);
+
+void p4t_copy(struct p4tc_type_mask_shift *dst_mask_shift,
+ struct p4tc_type *dst_t, void *dstv,
+ struct p4tc_type_mask_shift *src_mask_shift,
+ struct p4tc_type *src_t, void *srcv);
+int p4t_cmp(struct p4tc_type_mask_shift *dst_mask_shift,
+ struct p4tc_type *dst_t, void *dstv,
+ struct p4tc_type_mask_shift *src_mask_shift,
+ struct p4tc_type *src_t, void *srcv);
+void p4t_release(struct p4tc_type_mask_shift *mask_shift);
+
+int p4tc_register_types(void);
+void p4tc_unregister_types(void);
+
+#ifdef CONFIG_RETPOLINE
+void __p4tc_type_host_read(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval);
+void __p4tc_type_host_write(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval);
+#else
+static inline void __p4tc_type_host_read(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift,
+ void *sval, void *dval)
+{
+ return ops->host_read(container, mask_shift, sval, dval);
+}
+
+static inline void __p4tc_type_host_write(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift,
+ void *sval, void *dval)
+{
+ return ops->host_write(container, mask_shift, sval, dval);
+}
+#endif
+
+#endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
new file mode 100644
index 000000000..ba32dba66
--- /dev/null
+++ b/include/uapi/linux/p4tc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_P4TC_H
+#define __LINUX_P4TC_H
+
+#define P4TC_MAX_KEYSZ 512
+
+enum {
+ P4T_UNSPEC,
+ P4T_U8,
+ P4T_U16,
+ P4T_U32,
+ P4T_U64,
+ P4T_STRING,
+ P4T_S8,
+ P4T_S16,
+ P4T_S32,
+ P4T_S64,
+ P4T_MACADDR,
+ P4T_IPV4ADDR,
+ P4T_BE16,
+ P4T_BE32,
+ P4T_BE64,
+ P4T_U128,
+ P4T_S128,
+ P4T_BOOL,
+ P4T_DEV,
+ P4T_KEY,
+ __P4T_MAX,
+};
+
+#define P4T_MAX (__P4T_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 470c70def..df6d5e15f 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -675,6 +675,17 @@ config NET_EMATCH_IPT
To compile this code as a module, choose M here: the
module will be called em_ipt.
+config NET_P4_TC
+ bool "P4 TC support"
+ select NET_CLS_ACT
+ help
+ Say Y here if you want to use P4 features on top of TC.
+ P4 is an open source, domain-specific programming language for
+ specifying data plane behavior. By enabling P4TC you will be able to
+ write a P4 program, use a P4 compiler that supports the P4TC backend to
+ generate all the needed artifacts which, when loaded, allow you to
+ introduce a new kernel datapath that can be controlled via TC.
+
config NET_CLS_ACT
bool "Actions"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b5fd49641..937b8f8a9 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -82,3 +82,5 @@ obj-$(CONFIG_NET_EMATCH_TEXT) += em_text.o
obj-$(CONFIG_NET_EMATCH_CANID) += em_canid.o
obj-$(CONFIG_NET_EMATCH_IPSET) += em_ipset.o
obj-$(CONFIG_NET_EMATCH_IPT) += em_ipt.o
+
+obj-$(CONFIG_NET_P4_TC) += p4tc/
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
new file mode 100644
index 000000000..dd1358c9e
--- /dev/null
+++ b/net/sched/p4tc/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := p4tc_types.o
diff --git a/net/sched/p4tc/p4tc_types.c b/net/sched/p4tc/p4tc_types.c
new file mode 100644
index 000000000..4c8b58fc2
--- /dev/null
+++ b/net/sched/p4tc/p4tc_types.c
@@ -0,0 +1,1247 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_types.c - P4 datatypes
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+#include <net/act_api.h>
+#include <net/p4tc_types.h>
+#include <linux/etherdevice.h>
+
+static DEFINE_IDR(p4tc_types_idr);
+
+static void p4tc_types_put(void)
+{
+ unsigned long tmp, typeid;
+ struct p4tc_type *type;
+
+ idr_for_each_entry_ul(&p4tc_types_idr, type, tmp, typeid) {
+ idr_remove(&p4tc_types_idr, typeid);
+ kfree(type);
+ }
+}
+
+struct p4tc_type *p4type_find_byid(int typeid)
+{
+ return idr_find(&p4tc_types_idr, typeid);
+}
+
+static struct p4tc_type *p4type_find_byname(const char *name)
+{
+ unsigned long tmp, typeid;
+ struct p4tc_type *type;
+
+ idr_for_each_entry_ul(&p4tc_types_idr, type, tmp, typeid) {
+ if (!strncmp(type->name, name, P4T_MAX_STR_SZ))
+ return type;
+ }
+
+ return NULL;
+}
+
+bool p4tc_is_type_unsigned(int typeid)
+{
+ switch (typeid) {
+ case P4T_U8:
+ case P4T_U16:
+ case P4T_U32:
+ case P4T_U64:
+ case P4T_U128:
+ case P4T_BOOL:
+ return true;
+ default:
+ return false;
+ }
+}
+
+void p4t_copy(struct p4tc_type_mask_shift *dst_mask_shift,
+ struct p4tc_type *dst_t, void *dstv,
+ struct p4tc_type_mask_shift *src_mask_shift,
+ struct p4tc_type *src_t, void *srcv)
+{
+ u64 readval[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+ const struct p4tc_type_ops *srco, *dsto;
+
+ dsto = dst_t->ops;
+ srco = src_t->ops;
+
+ __p4tc_type_host_read(srco, src_t, src_mask_shift, srcv,
+ &readval);
+ __p4tc_type_host_write(dsto, dst_t, dst_mask_shift, &readval,
+ dstv);
+}
+
+int p4t_cmp(struct p4tc_type_mask_shift *dst_mask_shift,
+ struct p4tc_type *dst_t, void *dstv,
+ struct p4tc_type_mask_shift *src_mask_shift,
+ struct p4tc_type *src_t, void *srcv)
+{
+ u64 a[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+ u64 b[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+ const struct p4tc_type_ops *srco, *dsto;
+
+ dsto = dst_t->ops;
+ srco = src_t->ops;
+
+ __p4tc_type_host_read(dsto, dst_t, dst_mask_shift, dstv, a);
+ __p4tc_type_host_read(srco, src_t, src_mask_shift, srcv, b);
+
+ return memcmp(a, b, sizeof(a));
+}
+
+void p4t_release(struct p4tc_type_mask_shift *mask_shift)
+{
+ kfree(mask_shift->mask);
+ kfree(mask_shift);
+}
+
+static int p4t_validate_bitpos(u16 bitstart, u16 bitend, u16 maxbitstart,
+ u16 maxbitend, struct netlink_ext_ack *extack)
+{
+ if (bitstart > maxbitstart) {
+ NL_SET_ERR_MSG_MOD(extack, "bitstart too high");
+ return -EINVAL;
+ }
+
+ if (bitend > maxbitend) {
+ NL_SET_ERR_MSG_MOD(extack, "bitend too high");
+ return -EINVAL;
+ }
+
+ if (bitstart > bitend) {
+ NL_SET_ERR_MSG_MOD(extack, "bitstart > bitend");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int p4t_u32_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ u32 container_maxsz = U32_MAX;
+ u32 *val = value;
+ size_t maxval;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 31, 31, extack);
+ if (ret < 0)
+ return ret;
+
+ maxval = GENMASK(bitend, 0);
+ if (val && (*val > container_maxsz || *val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "U32 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u32_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_type_mask_shift *mask_shift;
+ u32 mask = GENMASK(bitend, bitstart);
+ u32 *cmask;
+
+ mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+ if (!mask_shift)
+ return ERR_PTR(-ENOMEM);
+
+ cmask = kzalloc(sizeof(u32), GFP_KERNEL);
+ if (!cmask) {
+ kfree(mask_shift);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ *cmask = mask;
+
+ mask_shift->mask = cmask;
+ mask_shift->shift = bitstart;
+
+ return mask_shift;
+}
+
+static void p4t_u32_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u32 maskedst = 0;
+ u32 *dst = dval;
+ u32 *src = sval;
+ u8 shift = 0;
+
+ if (mask_shift) {
+ u32 *dmask = mask_shift->mask;
+
+ maskedst = *dst & ~*dmask;
+ shift = mask_shift->shift;
+ }
+
+ *dst = maskedst | (*src << shift);
+}
+
+static void p4t_u32_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u32 *v = val;
+
+ pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u32_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u32 *dst = dval;
+ u32 *src = sval;
+
+ if (mask_shift) {
+ u32 *smask = mask_shift->mask;
+ u8 shift = mask_shift->shift;
+
+ *dst = (*src & *smask) >> shift;
+ } else {
+ *dst = *src;
+ }
+}
+
+static int p4t_s32_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ s32 minsz = S32_MIN, maxsz = S32_MAX;
+ s32 *val = value;
+
+ if (val && (*val > maxsz || *val < minsz)) {
+ NL_SET_ERR_MSG_MOD(extack, "S32 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_s32_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ s32 *dst = dval;
+ s32 *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_s32_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ s32 *dst = dval;
+ s32 *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_s32_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ s32 *v = val;
+
+ pr_info("%s %x\n", prefix, *v);
+}
+
+static void p4t_s64_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ s64 *v = val;
+
+ pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static int p4t_be32_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ size_t container_maxsz = U32_MAX;
+ __be32 *val_u32 = value;
+ __u32 val = 0;
+ size_t maxval;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 31, 31, extack);
+ if (ret < 0)
+ return ret;
+
+ if (value)
+ val = be32_to_cpu(*val_u32);
+
+ maxval = GENMASK(bitend, 0);
+ if (val && (val > container_maxsz || val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "BE32 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_be32_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be32 *src = sval;
+ u32 *dst = dval;
+
+ *dst = be32_to_cpu(*src);
+}
+
+static void p4t_be32_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be32 *dst = dval;
+ u32 *src = sval;
+
+ *dst = cpu_to_be32(*src);
+}
+
+static void p4t_be32_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ __be32 *v = val;
+
+ pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_be64_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be64 *src = sval;
+ u64 *dst = dval;
+
+ *dst = be64_to_cpu(*src);
+}
+
+static void p4t_be64_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be64 *dst = dval;
+ u64 *src = sval;
+
+ *dst = cpu_to_be64(*src);
+}
+
+static void p4t_be64_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ __be64 *v = val;
+
+ pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static int p4t_u16_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ u16 container_maxsz = U16_MAX;
+ u16 *val = value;
+ u16 maxval;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 15, 15, extack);
+ if (ret < 0)
+ return ret;
+
+ maxval = GENMASK(bitend, 0);
+ if (val && (*val > container_maxsz || *val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "U16 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u16_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_type_mask_shift *mask_shift;
+ u16 mask = GENMASK(bitend, bitstart);
+ u16 *cmask;
+
+ mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+ if (!mask_shift)
+ return ERR_PTR(-ENOMEM);
+
+ cmask = kzalloc(sizeof(u16), GFP_KERNEL);
+ if (!cmask) {
+ kfree(mask_shift);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ *cmask = mask;
+
+ mask_shift->mask = cmask;
+ mask_shift->shift = bitstart;
+
+ return mask_shift;
+}
+
+static void p4t_u16_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u16 maskedst = 0;
+ u16 *dst = dval;
+ u16 *src = sval;
+ u8 shift = 0;
+
+ if (mask_shift) {
+ u16 *dmask = mask_shift->mask;
+
+ maskedst = *dst & ~*dmask;
+ shift = mask_shift->shift;
+ }
+
+ *dst = maskedst | (*src << shift);
+}
+
+static void p4t_u16_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u16 *v = val;
+
+ pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u16_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u16 *dst = dval;
+ u16 *src = sval;
+
+ if (mask_shift) {
+ u16 *smask = mask_shift->mask;
+ u8 shift = mask_shift->shift;
+
+ *dst = (*src & *smask) >> shift;
+ } else {
+ *dst = *src;
+ }
+}
+
+static int p4t_s16_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ s16 minsz = S16_MIN, maxsz = S16_MAX;
+ s16 *val = value;
+
+ if (val && (*val > maxsz || *val < minsz)) {
+ NL_SET_ERR_MSG_MOD(extack, "S16 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_s16_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ s16 *dst = dval;
+ s16 *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_s16_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ s16 *dst = dval;
+ s16 *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_s16_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ s16 *v = val;
+
+ pr_info("%s %d\n", prefix, *v);
+}
+
+static int p4t_be16_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ u16 container_maxsz = U16_MAX;
+ __be16 *val_u16 = value;
+ size_t maxval;
+ u16 val = 0;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 15, 15, extack);
+ if (ret < 0)
+ return ret;
+
+ if (value)
+ val = be16_to_cpu(*val_u16);
+
+ maxval = GENMASK(bitend, 0);
+ if (val && (val > container_maxsz || val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "BE16 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_be16_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be16 *src = sval;
+ u16 *dst = dval;
+
+ *dst = be16_to_cpu(*src);
+}
+
+static void p4t_be16_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ __be16 *dst = dval;
+ u16 *src = sval;
+
+ *dst = cpu_to_be16(*src);
+}
+
+static void p4t_be16_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ __be16 *v = val;
+
+ pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static int p4t_u8_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ size_t container_maxsz = U8_MAX;
+ u8 *val = value;
+ u8 maxval;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 7, 7, extack);
+ if (ret < 0)
+ return ret;
+
+ maxval = GENMASK(bitend, 0);
+ if (val && (*val > container_maxsz || *val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "U8 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u8_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_type_mask_shift *mask_shift;
+ u8 mask = GENMASK(bitend, bitstart);
+ u8 *cmask;
+
+ mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+ if (!mask_shift)
+ return ERR_PTR(-ENOMEM);
+
+ cmask = kzalloc(sizeof(u8), GFP_KERNEL);
+ if (!cmask) {
+ kfree(mask_shift);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ *cmask = mask;
+
+ mask_shift->mask = cmask;
+ mask_shift->shift = bitstart;
+
+ return mask_shift;
+}
+
+static void p4t_u8_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u8 maskedst = 0;
+ u8 *dst = dval;
+ u8 *src = sval;
+ u8 shift = 0;
+
+ if (mask_shift) {
+ u8 *dmask = (u8 *)mask_shift->mask;
+
+ maskedst = *dst & ~*dmask;
+ shift = mask_shift->shift;
+ }
+
+ *dst = maskedst | (*src << shift);
+}
+
+static void p4t_u8_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u8 *v = val;
+
+ pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u8_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u8 *dst = dval;
+ u8 *src = sval;
+
+ if (mask_shift) {
+ u8 *smask = mask_shift->mask;
+ u8 shift = mask_shift->shift;
+
+ *dst = (*src & *smask) >> shift;
+ } else {
+ *dst = *src;
+ }
+}
+
+static int p4t_s8_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ s8 minsz = S8_MIN, maxsz = S8_MAX;
+ s8 *val = value;
+
+ if (val && (*val > maxsz || *val < minsz)) {
+ NL_SET_ERR_MSG_MOD(extack, "S8 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_s8_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ s8 *dst = dval;
+ s8 *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_s8_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ s8 *v = val;
+
+ pr_info("%s %d\n", prefix, *v);
+}
+
+static int p4t_u64_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ u64 container_maxsz = U64_MAX;
+ u64 *val = value;
+ u64 maxval;
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 63, 63, extack);
+ if (ret < 0)
+ return ret;
+
+ maxval = GENMASK_ULL(bitend, 0);
+ if (val && (*val > container_maxsz || *val > maxval)) {
+ NL_SET_ERR_MSG_MOD(extack, "U64 value out of range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u64_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_type_mask_shift *mask_shift;
+ u64 mask = GENMASK_ULL(bitend, bitstart);
+ u64 *cmask;
+
+ mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+ if (!mask_shift)
+ return ERR_PTR(-ENOMEM);
+
+ cmask = kzalloc(sizeof(u64), GFP_KERNEL);
+ if (!cmask) {
+ kfree(mask_shift);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ *cmask = mask;
+
+ mask_shift->mask = cmask;
+ mask_shift->shift = bitstart;
+
+ return mask_shift;
+}
+
+static void p4t_u64_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u64 maskedst = 0;
+ u64 *dst = dval;
+ u64 *src = sval;
+ u8 shift = 0;
+
+ if (mask_shift) {
+ u64 *dmask = (u64 *)mask_shift->mask;
+
+ maskedst = *dst & ~*dmask;
+ shift = mask_shift->shift;
+ }
+
+ *dst = maskedst | (*src << shift);
+}
+
+static void p4t_u64_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u64 *v = val;
+
+ pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static void p4t_u64_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u64 *dst = dval;
+ u64 *src = sval;
+
+ if (mask_shift) {
+ u64 *smask = mask_shift->mask;
+ u8 shift = mask_shift->shift;
+
+ *dst = (*src & *smask) >> shift;
+ } else {
+ *dst = *src;
+ }
+}
+
+/* As of now, we are not allowing bitops for u128 */
+static int p4t_u128_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ if (bitstart != 0 || bitend != 127) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Only valid bit type larger than bit64 is bit128");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_u128_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_u128_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_u128_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u64 *v = val;
+
+ pr_info("%s[0-63] %16llx", prefix, v[0]);
+ pr_info("%s[64-127] %16llx", prefix, v[1]);
+}
+
+static int p4t_ipv4_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ /* Not allowing bit-slices for now */
+ if (bitstart != 0 || bitend != 31) {
+ NL_SET_ERR_MSG_MOD(extack, "Invalid bitstart or bitend");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_ipv4_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u32 *v32h = val;
+ __be32 v32;
+ u8 *v;
+
+ v32 = cpu_to_be32(*v32h);
+ v = (u8 *)&v32;
+
+ pr_info("%s %u.%u.%u.%u\n", prefix, v[0], v[1], v[2], v[3]);
+}
+
+static int p4t_mac_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ if (bitstart != 0 || bitend != 47) {
+ NL_SET_ERR_MSG_MOD(extack, "Invalid bitstart or bitend");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_mac_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u8 *v = val;
+
+ pr_info("%s %02X:%02x:%02x:%02x:%02x:%02x\n", prefix, v[0], v[1], v[2],
+ v[3], v[4], v[5]);
+}
+
+static int p4t_dev_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ if (bitstart != 0 || bitend != 31) {
+ NL_SET_ERR_MSG_MOD(extack, "Invalid start or endbit values");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void p4t_dev_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u32 *src = sval;
+ u32 *dst = dval;
+
+ *dst = *src;
+}
+
+static void p4t_dev_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ u32 *src = sval;
+ u32 *dst = dval;
+
+ *dst = *src;
+}
+
+static void p4t_dev_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ const u32 *ifindex = val;
+ struct net_device *dev;
+
+ dev = dev_get_by_index_rcu(net, *ifindex);
+ if (dev)
+ pr_info("%s %s\n", prefix, dev->name);
+}
+
+static void p4t_key_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ memcpy(dval, sval, BITS_TO_BYTES(container->bitsz));
+}
+
+static void p4t_key_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ memcpy(dval, sval, BITS_TO_BYTES(container->bitsz));
+}
+
+static void p4t_key_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ u16 bitstart = 0, bitend = 63;
+ u64 *v = val;
+ int i;
+
+ for (i = 0; i < BITS_TO_U64(container->bitsz); i++) {
+ pr_info("%s[%u-%u] %16llx\n", prefix, bitstart, bitend, v[i]);
+ bitstart += 64;
+ bitend += 64;
+ }
+}
+
+static int p4t_key_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ if (p4t_validate_bitpos(bitstart, bitend, 0, P4TC_MAX_KEYSZ, extack))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int p4t_bool_validate(struct p4tc_type *container, void *value,
+ u16 bitstart, u16 bitend,
+ struct netlink_ext_ack *extack)
+{
+ int ret;
+
+ ret = p4t_validate_bitpos(bitstart, bitend, 7, 7, extack);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+static void p4t_bool_hread(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ bool *dst = dval;
+ bool *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_bool_write(struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ bool *dst = dval;
+ bool *src = sval;
+
+ *dst = *src;
+}
+
+static void p4t_bool_print(struct net *net, struct p4tc_type *container,
+ const char *prefix, void *val)
+{
+ bool *v = val;
+
+ pr_info("%s %s", prefix, *v ? "true" : "false");
+}
+
+static const struct p4tc_type_ops u8_ops = {
+ .validate_p4t = p4t_u8_validate,
+ .create_bitops = p4t_u8_bitops,
+ .host_read = p4t_u8_hread,
+ .host_write = p4t_u8_write,
+ .print = p4t_u8_print,
+};
+
+static const struct p4tc_type_ops u16_ops = {
+ .validate_p4t = p4t_u16_validate,
+ .create_bitops = p4t_u16_bitops,
+ .host_read = p4t_u16_hread,
+ .host_write = p4t_u16_write,
+ .print = p4t_u16_print,
+};
+
+static const struct p4tc_type_ops u32_ops = {
+ .validate_p4t = p4t_u32_validate,
+ .create_bitops = p4t_u32_bitops,
+ .host_read = p4t_u32_hread,
+ .host_write = p4t_u32_write,
+ .print = p4t_u32_print,
+};
+
+static const struct p4tc_type_ops u64_ops = {
+ .validate_p4t = p4t_u64_validate,
+ .create_bitops = p4t_u64_bitops,
+ .host_read = p4t_u64_hread,
+ .host_write = p4t_u64_write,
+ .print = p4t_u64_print,
+};
+
+static const struct p4tc_type_ops u128_ops = {
+ .validate_p4t = p4t_u128_validate,
+ .host_read = p4t_u128_hread,
+ .host_write = p4t_u128_write,
+ .print = p4t_u128_print,
+};
+
+static const struct p4tc_type_ops s8_ops = {
+ .validate_p4t = p4t_s8_validate,
+ .host_read = p4t_s8_hread,
+ .print = p4t_s8_print,
+};
+
+static const struct p4tc_type_ops s16_ops = {
+ .validate_p4t = p4t_s16_validate,
+ .host_read = p4t_s16_hread,
+ .host_write = p4t_s16_write,
+ .print = p4t_s16_print,
+};
+
+static const struct p4tc_type_ops s32_ops = {
+ .validate_p4t = p4t_s32_validate,
+ .host_read = p4t_s32_hread,
+ .host_write = p4t_s32_write,
+ .print = p4t_s32_print,
+};
+
+static const struct p4tc_type_ops s64_ops = {
+ .print = p4t_s64_print,
+};
+
+static const struct p4tc_type_ops s128_ops = {};
+
+static const struct p4tc_type_ops be16_ops = {
+ .validate_p4t = p4t_be16_validate,
+ .create_bitops = p4t_u16_bitops,
+ .host_read = p4t_be16_hread,
+ .host_write = p4t_be16_write,
+ .print = p4t_be16_print,
+};
+
+static const struct p4tc_type_ops be32_ops = {
+ .validate_p4t = p4t_be32_validate,
+ .create_bitops = p4t_u32_bitops,
+ .host_read = p4t_be32_hread,
+ .host_write = p4t_be32_write,
+ .print = p4t_be32_print,
+};
+
+static const struct p4tc_type_ops be64_ops = {
+ .validate_p4t = p4t_u64_validate,
+ .host_read = p4t_be64_hread,
+ .host_write = p4t_be64_write,
+ .print = p4t_be64_print,
+};
+
+static const struct p4tc_type_ops string_ops = {};
+
+static const struct p4tc_type_ops mac_ops = {
+ .validate_p4t = p4t_mac_validate,
+ .create_bitops = p4t_u64_bitops,
+ .host_read = p4t_u64_hread,
+ .host_write = p4t_u64_write,
+ .print = p4t_mac_print,
+};
+
+static const struct p4tc_type_ops ipv4_ops = {
+ .validate_p4t = p4t_ipv4_validate,
+ .host_read = p4t_be32_hread,
+ .host_write = p4t_be32_write,
+ .print = p4t_ipv4_print,
+};
+
+static const struct p4tc_type_ops bool_ops = {
+ .validate_p4t = p4t_bool_validate,
+ .host_read = p4t_bool_hread,
+ .host_write = p4t_bool_write,
+ .print = p4t_bool_print,
+};
+
+static const struct p4tc_type_ops dev_ops = {
+ .validate_p4t = p4t_dev_validate,
+ .host_read = p4t_dev_hread,
+ .host_write = p4t_dev_write,
+ .print = p4t_dev_print,
+};
+
+static const struct p4tc_type_ops key_ops = {
+ .validate_p4t = p4t_key_validate,
+ .host_read = p4t_key_hread,
+ .host_write = p4t_key_write,
+ .print = p4t_key_print,
+};
+
+#ifdef CONFIG_RETPOLINE
+void __p4tc_type_host_read(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ #define HREAD(cops) \
+ do { \
+ if (ops == &(cops)) \
+ return (cops).host_read(container, mask_shift, sval, dval); \
+ } while (0)
+
+ HREAD(u8_ops);
+ HREAD(u16_ops);
+ HREAD(u32_ops);
+ HREAD(u64_ops);
+ HREAD(u128_ops);
+ HREAD(s8_ops);
+ HREAD(s16_ops);
+ HREAD(s32_ops);
+ HREAD(be16_ops);
+ HREAD(be32_ops);
+ HREAD(mac_ops);
+ HREAD(ipv4_ops);
+ HREAD(bool_ops);
+ HREAD(dev_ops);
+ HREAD(key_ops);
+
+ return ops->host_read(container, mask_shift, sval, dval);
+}
+
+void __p4tc_type_host_write(const struct p4tc_type_ops *ops,
+ struct p4tc_type *container,
+ struct p4tc_type_mask_shift *mask_shift, void *sval,
+ void *dval)
+{
+ #define HWRITE(cops) \
+ do { \
+ if (ops == &(cops)) \
+ return (cops).host_write(container, mask_shift, sval, dval); \
+ } while (0)
+
+ HWRITE(u8_ops);
+ HWRITE(u16_ops);
+ HWRITE(u32_ops);
+ HWRITE(u64_ops);
+ HWRITE(u128_ops);
+ HWRITE(s16_ops);
+ HWRITE(s32_ops);
+ HWRITE(be16_ops);
+ HWRITE(be32_ops);
+ HWRITE(mac_ops);
+ HWRITE(ipv4_ops);
+ HWRITE(bool_ops);
+ HWRITE(dev_ops);
+ HWRITE(key_ops);
+
+ return ops->host_write(container, mask_shift, sval, dval);
+}
+#endif
+
+static int ___p4tc_register_type(int typeid, size_t bitsz,
+ size_t container_bitsz,
+ const char *t_name,
+ const struct p4tc_type_ops *ops)
+{
+ struct p4tc_type *type;
+ int err;
+
+ if (typeid > P4T_MAX)
+ return -EINVAL;
+
+ if (p4type_find_byid(typeid) || p4type_find_byname(t_name))
+ return -EEXIST;
+
+ if (bitsz > P4T_MAX_BITSZ)
+ return -E2BIG;
+
+ if (container_bitsz > P4T_MAX_BITSZ)
+ return -E2BIG;
+
+ type = kzalloc(sizeof(*type), GFP_ATOMIC);
+ if (!type)
+ return -ENOMEM;
+
+ err = idr_alloc_u32(&p4tc_types_idr, type, &typeid, typeid, GFP_ATOMIC);
+ if (err < 0) {
+ kfree(type);
+ return err;
+ }
+
+ strscpy(type->name, t_name, P4T_MAX_STR_SZ);
+ type->typeid = typeid;
+ type->bitsz = bitsz;
+ type->container_bitsz = container_bitsz;
+ type->ops = ops;
+
+ return 0;
+}
+
+static int __p4tc_register_type(int typeid, size_t bitsz,
+ size_t container_bitsz,
+ const char *t_name,
+ const struct p4tc_type_ops *ops)
+{
+ if (___p4tc_register_type(typeid, bitsz, container_bitsz, t_name, ops) <
+ 0) {
+ pr_err("Unable to allocate p4 type %s\n", t_name);
+ p4tc_types_put();
+ return -1;
+ }
+
+ return 0;
+}
+
+#define p4tc_register_type(...) \
+ do { \
+ if (__p4tc_register_type(__VA_ARGS__) < 0) \
+ return -1; \
+ } while (0)
+
+int p4tc_register_types(void)
+{
+ p4tc_register_type(P4T_U8, 8, 8, "u8", &u8_ops);
+ p4tc_register_type(P4T_U16, 16, 16, "u16", &u16_ops);
+ p4tc_register_type(P4T_U32, 32, 32, "u32", &u32_ops);
+ p4tc_register_type(P4T_U64, 64, 64, "u64", &u64_ops);
+ p4tc_register_type(P4T_U128, 128, 128, "u128", &u128_ops);
+ p4tc_register_type(P4T_S8, 8, 8, "s8", &s8_ops);
+ p4tc_register_type(P4T_BE16, 16, 16, "be16", &be16_ops);
+ p4tc_register_type(P4T_BE32, 32, 32, "be32", &be32_ops);
+ p4tc_register_type(P4T_BE64, 64, 64, "be64", &be64_ops);
+ p4tc_register_type(P4T_S16, 16, 16, "s16", &s16_ops);
+ p4tc_register_type(P4T_S32, 32, 32, "s32", &s32_ops);
+ p4tc_register_type(P4T_S64, 64, 64, "s64", &s64_ops);
+ p4tc_register_type(P4T_S128, 128, 128, "s128", &s128_ops);
+ p4tc_register_type(P4T_STRING, P4T_MAX_STR_SZ * 4, P4T_MAX_STR_SZ * 4,
+ "string", &string_ops);
+ p4tc_register_type(P4T_MACADDR, 48, 64, "mac", &mac_ops);
+ p4tc_register_type(P4T_IPV4ADDR, 32, 32, "ipv4", &ipv4_ops);
+ p4tc_register_type(P4T_BOOL, 32, 32, "bool", &bool_ops);
+ p4tc_register_type(P4T_DEV, 32, 32, "dev", &dev_ops);
+ p4tc_register_type(P4T_KEY, P4TC_MAX_KEYSZ, P4TC_MAX_KEYSZ, "key",
+ &key_ops);
+
+ return 0;
+}
+
+void p4tc_unregister_types(void)
+{
+ p4tc_types_put();
+}
--
2.34.1
* [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (7 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 08/15] p4tc: add P4 data types Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 16:11 ` Jiri Pirko
2023-11-16 14:59 ` [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump Jamal Hadi Salim
` (6 subsequent siblings)
15 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
__Introducing P4 TC Pipeline__
This commit introduces P4 TC pipelines, which emulate the semantics of a
P4 program/pipeline using the TC infrastructure.
One can refer to P4 programs/pipelines using their names or their
specific pipeline ids (pipeid)
P4 template CRUD (Create, Read/get, Update and Delete) commands apply on a
pipeline.
As an example, to create a P4 program/pipeline named aP4proggie with a
single table in its pipeline, one would use the following command from user
space tc (as generated by the compiler):
tc p4template create pipeline/aP4proggie numtables 1 pipeid 1
Note that, in the above command, numtables is set to 1; the default
is 0 because it is feasible to have a P4 program with no tables at all.
The P4 compiler will generate the pipeid; however, if none is specified,
the kernel will issue one, as in the following example:
tc p4template create pipeline/aP4proggie numtables 1
To Read pipeline aP4proggie attributes, one would retrieve those details as
follows:
tc p4template get pipeline/[aP4proggie] [pipeid 1]
Note that in the above command one may specify pipeline ID, name or
both.
To Update the aP4proggie pipeline from 1 to 10 tables, one would use the
following command:
tc p4template update pipeline/[aP4proggie] [pipeid 1] numtables 10
Note that, in the above command, one could use the P4 program/pipeline
name, id or both to specify which P4 program/pipeline to update.
To Delete a P4 program/pipeline named aP4proggie
with a pipeid of 1, one would use the following command:
tc p4template del pipeline/[aP4proggie] [pipeid 1]
Note that, in the above command, one could use the P4 program/pipeline
name, id or both to specify which P4 program/pipeline to delete.
If one wished to dump all the created P4 programs/pipelines, one would
use the following command:
tc p4template get pipeline/
__Pipeline Lifetime__
After Create is issued, one can Read/get, Update and Delete; however,
the pipeline can only be put to use after it is "sealed".
To seal a pipeline, one would issue the following command:
tc p4template update pipeline/aP4proggie state ready
After a pipeline is sealed it can be put to use via the TC P4 classifier.
For example:
tc filter add dev $DEV ingress protocol any prio 6 p4 pname aP4proggie \
action bpf obj $PARSER.o section prog/tc-parser \
action bpf obj $PROGNAME.o section prog/tc-ingress
This instantiates aP4proggie on the ingress of $DEV. One could also attach it
to a block of ports (for example, tc block 22) as follows:
tc filter add block 22 ingress protocol all prio 6 p4 pname aP4proggie \
action bpf obj $PARSER.o section prog/tc-parser \
action bpf obj $PROGNAME.o section prog/tc-ingress
After that, we can add a table entry. For example:
tc p4ctrl create aP4proggie/table/cb/aP4table \
dstAddr 10.10.10.0/24 srcAddr 192.168.0.0/16 prio 16 \
action drop
Once the pipeline is attached to a device or block it cannot be deleted;
it becomes Read-only from the control plane/user space. The pipeline can
only be deleted once there are no users left.
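The following is a rough sketch (not part of the patch; the example_
function names and calling context are invented) of how a P4 classifier
is expected to pin and release a pipeline via the refcount helpers
declared in include/net/p4tc.h:

  /* Hypothetical classifier-side usage of the pipeline lifetime API */
  static struct p4tc_pipeline *
  example_attach(struct net *net, const char *pname,
                 struct netlink_ext_ack *extack)
  {
          struct p4tc_pipeline *pipeline;

          /* Takes p_ctrl_ref; fails if the pipeline is going away */
          pipeline = p4tc_pipeline_find_get(net, pname, 0, extack);
          if (IS_ERR(pipeline))
                  return pipeline;

          /* Only a sealed pipeline may be instantiated */
          if (pipeline->p_state != P4TC_STATE_READY) {
                  NL_SET_ERR_MSG(extack, "Pipeline must be sealed first");
                  p4tc_pipeline_put(pipeline);
                  return ERR_PTR(-EINVAL);
          }

          return pipeline;
  }

  static void example_detach(struct p4tc_pipeline *pipeline)
  {
          /* Drops p_ctrl_ref; the last put tears the pipeline down */
          p4tc_pipeline_put(pipeline);
  }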
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/p4tc.h | 126 +++++++
include/uapi/linux/p4tc.h | 66 ++++
include/uapi/linux/rtnetlink.h | 9 +
net/sched/p4tc/Makefile | 2 +-
net/sched/p4tc/p4tc_pipeline.c | 611 +++++++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_tmpl_api.c | 585 +++++++++++++++++++++++++++++++
security/selinux/nlmsgtab.c | 6 +-
7 files changed, 1403 insertions(+), 2 deletions(-)
create mode 100644 include/net/p4tc.h
create mode 100644 net/sched/p4tc/p4tc_pipeline.c
create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
new file mode 100644
index 000000000..ccb54d842
--- /dev/null
+++ b/include/net/p4tc.h
@@ -0,0 +1,126 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_P4TC_H
+#define __NET_P4TC_H
+
+#include <uapi/linux/p4tc.h>
+#include <linux/workqueue.h>
+#include <net/sch_generic.h>
+#include <net/net_namespace.h>
+#include <linux/refcount.h>
+#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
+
+#define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
+#define P4TC_DEFAULT_MAX_RULES 1
+#define P4TC_PATH_MAX 3
+
+#define P4TC_KERNEL_PIPEID 0
+
+#define P4TC_PID_IDX 0
+
+struct p4tc_dump_ctx {
+ u32 ids[P4TC_PATH_MAX];
+};
+
+struct p4tc_template_common;
+
+struct p4tc_path_nlattrs {
+ char *pname;
+ u32 *ids;
+ bool pname_passed;
+};
+
+struct p4tc_pipeline;
+struct p4tc_template_ops {
+ void (*init)(void);
+ struct p4tc_template_common *(*cu)(struct net *net, struct nlmsghdr *n,
+ struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_pname,
+ struct netlink_ext_ack *extack);
+ int (*put)(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack);
+ int (*gd)(struct net *net, struct sk_buff *skb, struct nlmsghdr *n,
+ struct nlattr *nla, struct p4tc_path_nlattrs *nl_pname,
+ struct netlink_ext_ack *extack);
+ int (*fill_nlmsg)(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack);
+ int (*dump)(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name, u32 *ids,
+ struct netlink_ext_ack *extack);
+ int (*dump_1)(struct sk_buff *skb, struct p4tc_template_common *common);
+};
+
+struct p4tc_template_common {
+ char name[TEMPLATENAMSZ];
+ struct p4tc_template_ops *ops;
+ u32 p_id;
+ u32 PAD0;
+};
+
+extern const struct p4tc_template_ops p4tc_pipeline_ops;
+
+struct p4tc_pipeline {
+ struct p4tc_template_common common;
+ struct rcu_head rcu;
+ struct net *net;
+ /* Accounts for how many entities are referencing this pipeline.
+ * As for now only P4 filters can refer to pipelines.
+ */
+ refcount_t p_ctrl_ref;
+ u16 num_tables;
+ u16 curr_tables;
+ u8 p_state;
+};
+
+struct p4tc_pipeline_net {
+ struct idr pipeline_idr;
+};
+
+static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
+{
+ return n->nlmsg_type == RTM_UPDATEP4TEMPLATE;
+}
+
+int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct idr *idr, int idx,
+ struct netlink_ext_ack *extack);
+
+struct p4tc_pipeline *p4tc_pipeline_find_byany(struct net *net,
+ const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack);
+struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net,
+ const u32 pipeid);
+struct p4tc_pipeline *p4tc_pipeline_find_get(struct net *net,
+ const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack);
+
+static inline bool p4tc_pipeline_get(struct p4tc_pipeline *pipeline)
+{
+ return refcount_inc_not_zero(&pipeline->p_ctrl_ref);
+}
+
+void p4tc_pipeline_put(struct p4tc_pipeline *pipeline);
+struct p4tc_pipeline *
+p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack);
+
+static inline int p4tc_action_destroy(struct tc_action **acts)
+{
+ int ret = 0;
+
+ if (acts) {
+ ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
+ kfree(acts);
+ }
+
+ return ret;
+}
+
+#define to_pipeline(t) ((struct p4tc_pipeline *)t)
+
+#endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index ba32dba66..4d33f44c1 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -2,8 +2,71 @@
#ifndef __LINUX_P4TC_H
#define __LINUX_P4TC_H
+#include <linux/types.h>
+#include <linux/pkt_sched.h>
+
+/* pipeline header */
+struct p4tcmsg {
+ __u32 pipeid;
+ __u32 obj;
+};
+
+#define P4TC_MAXPIPELINE_COUNT 32
+#define P4TC_MAXTABLES_COUNT 32
+#define P4TC_MINTABLES_COUNT 0
+#define P4TC_MSGBATCH_SIZE 16
+
#define P4TC_MAX_KEYSZ 512
+#define TEMPLATENAMSZ 32
+#define PIPELINENAMSIZ TEMPLATENAMSZ
+
+/* Root attributes */
+enum {
+ P4TC_ROOT_UNSPEC,
+ P4TC_ROOT, /* nested messages */
+ P4TC_ROOT_PNAME, /* string */
+ __P4TC_ROOT_MAX,
+};
+
+#define P4TC_ROOT_MAX (__P4TC_ROOT_MAX - 1)
+
+/* P4 Object types */
+enum {
+ P4TC_OBJ_UNSPEC,
+ P4TC_OBJ_PIPELINE,
+ __P4TC_OBJ_MAX,
+};
+
+#define P4TC_OBJ_MAX (__P4TC_OBJ_MAX - 1)
+
+/* P4 attributes */
+enum {
+ P4TC_UNSPEC,
+ P4TC_PATH,
+ P4TC_PARAMS,
+ __P4TC_MAX,
+};
+
+#define P4TC_MAX (__P4TC_MAX - 1)
+
+/* PIPELINE attributes */
+enum {
+ P4TC_PIPELINE_UNSPEC,
+ P4TC_PIPELINE_NUMTABLES, /* u16 */
+ P4TC_PIPELINE_STATE, /* u8 */
+ P4TC_PIPELINE_NAME, /* string only used for pipeline dump */
+ __P4TC_PIPELINE_MAX
+};
+
+#define P4TC_PIPELINE_MAX (__P4TC_PIPELINE_MAX - 1)
+
+/* PIPELINE states */
+enum {
+ P4TC_STATE_NOT_READY,
+ P4TC_STATE_READY,
+};
+
enum {
P4T_UNSPEC,
P4T_U8,
@@ -30,4 +93,7 @@ enum {
#define P4T_MAX (__P4T_MAX - 1)
+#define P4TC_RTA(r) \
+ ((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
+
#endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 3b687d20c..4f9ebe3e7 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -194,6 +194,15 @@ enum {
RTM_GETTUNNEL,
#define RTM_GETTUNNEL RTM_GETTUNNEL
+ RTM_CREATEP4TEMPLATE = 124,
+#define RTM_CREATEP4TEMPLATE RTM_CREATEP4TEMPLATE
+ RTM_DELP4TEMPLATE,
+#define RTM_DELP4TEMPLATE RTM_DELP4TEMPLATE
+ RTM_GETP4TEMPLATE,
+#define RTM_GETP4TEMPLATE RTM_GETP4TEMPLATE
+ RTM_UPDATEP4TEMPLATE,
+#define RTM_UPDATEP4TEMPLATE RTM_UPDATEP4TEMPLATE
+
__RTM_MAX,
#define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
};
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index dd1358c9e..0881a7563 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y := p4tc_types.o
+obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
new file mode 100644
index 000000000..fc6e49573
--- /dev/null
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -0,0 +1,611 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_pipeline.c P4 TC PIPELINE
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+#include <net/p4tc_types.h>
+
+static unsigned int pipeline_net_id;
+static struct p4tc_pipeline *root_pipeline;
+
+static __net_init int pipeline_init_net(struct net *net)
+{
+ struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+
+ idr_init(&pipe_net->pipeline_idr);
+
+ return 0;
+}
+
+static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack);
+
+static void __net_exit pipeline_exit_net(struct net *net)
+{
+ struct p4tc_pipeline_net *pipe_net;
+ struct p4tc_pipeline *pipeline;
+ unsigned long pipeid, tmp;
+
+ rtnl_lock();
+ pipe_net = net_generic(net, pipeline_net_id);
+ idr_for_each_entry_ul(&pipe_net->pipeline_idr, pipeline, tmp, pipeid) {
+ __p4tc_pipeline_put(pipeline, &pipeline->common, NULL);
+ }
+ idr_destroy(&pipe_net->pipeline_idr);
+ rtnl_unlock();
+}
+
+static struct pernet_operations pipeline_net_ops = {
+ .init = pipeline_init_net,
+ .pre_exit = pipeline_exit_net,
+ .id = &pipeline_net_id,
+ .size = sizeof(struct p4tc_pipeline_net),
+};
+
+static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
+ [P4TC_PIPELINE_NUMTABLES] =
+ NLA_POLICY_RANGE(NLA_U16, P4TC_MINTABLES_COUNT, P4TC_MAXTABLES_COUNT),
+ [P4TC_PIPELINE_STATE] = { .type = NLA_U8 },
+};
+
+static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
+{
+ kfree(pipeline);
+}
+
+static void p4tc_pipeline_destroy_rcu(struct rcu_head *head)
+{
+ struct p4tc_pipeline *pipeline;
+ struct net *net;
+
+ pipeline = container_of(head, struct p4tc_pipeline, rcu);
+
+ net = pipeline->net;
+ p4tc_pipeline_destroy(pipeline);
+ put_net(net);
+}
+
+static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = pipeline->net;
+ struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+ struct net *pipeline_net = maybe_get_net(net);
+
+ idr_remove(&pipe_net->pipeline_idr, pipeline->common.p_id);
+
+ /* If we are on netns cleanup we can't touch the pipeline_idr.
+ * On pre_exit we will destroy the idr but never call into teardown
+ * if filters are active which makes pipeline pointers dangle until
+ * the filters ultimately destroy them.
+ */
+ if (pipeline_net) {
+ idr_remove(&pipe_net->pipeline_idr, pipeline->common.p_id);
+ call_rcu(&pipeline->rcu, p4tc_pipeline_destroy_rcu);
+ } else {
+ p4tc_pipeline_destroy(pipeline);
+ }
+}
+
+static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack)
+{
+ /* The lifetime of the pipeline can be terminated in two cases:
+ * - netns cleanup (system driven)
+ * - pipeline delete (user driven)
+ *
+ * When the pipeline is referenced by one or more p4 classifiers we need
+ * to make sure the pipeline and its components are alive while the classifier
+ * is still visible by the datapath.
+ * In the netns cleanup, we cannot destroy the pipeline in our netns exit callback
+ * as the netdevs and filters are still visible in the datapath.
+ * In such case, it's the filter's job to destroy the pipeline.
+ *
+ * To accommodate such scenario, whichever put call reaches '0' first will
+ * destroy the pipeline and its components.
+ *
+ * On netns cleanup we guarantee no table entries operations are in flight.
+ */
+ if (!refcount_dec_and_test(&pipeline->p_ctrl_ref)) {
+ NL_SET_ERR_MSG(extack, "Can't delete referenced pipeline");
+ return -EBUSY;
+ }
+
+ p4tc_pipeline_teardown(pipeline, extack);
+
+ return 0;
+}
+
+static inline int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ if (pipeline->curr_tables != pipeline->num_tables) {
+ NL_SET_ERR_MSG(extack,
+ "Must have all table defined to update state to ready");
+ return -EINVAL;
+ }
+
+ pipeline->p_state = P4TC_STATE_READY;
+ return 0;
+}
+
+static inline bool pipeline_sealed(struct p4tc_pipeline *pipeline)
+{
+ return pipeline->p_state == P4TC_STATE_READY;
+}
+
+struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net, const u32 pipeid)
+{
+ struct p4tc_pipeline_net *pipe_net;
+
+ if (pipeid == P4TC_KERNEL_PIPEID)
+ return root_pipeline;
+
+ pipe_net = net_generic(net, pipeline_net_id);
+
+ return idr_find(&pipe_net->pipeline_idr, pipeid);
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_find_byid);
+
+static struct p4tc_pipeline *p4tc_pipeline_find_byname(struct net *net,
+ const char *name)
+{
+ struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+ struct p4tc_pipeline *pipeline;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(&pipe_net->pipeline_idr, pipeline, tmp, id) {
+ /* Don't show kernel pipeline */
+ if (id == P4TC_KERNEL_PIPEID)
+ continue;
+ if (strncmp(pipeline->common.name, name, PIPELINENAMSIZ) == 0)
+ return pipeline;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_pipeline *p4tc_pipeline_create(struct net *net,
+ struct nlmsghdr *n,
+ struct nlattr *nla,
+ const char *p_name, u32 pipeid,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+ struct nlattr *tb[P4TC_PIPELINE_MAX + 1];
+ struct p4tc_pipeline *pipeline;
+ int ret = 0;
+
+ ret = nla_parse_nested(tb, P4TC_PIPELINE_MAX, nla, tc_pipeline_policy,
+ extack);
+
+ if (ret < 0)
+ goto out;
+
+ pipeline = p4tc_pipeline_find_byany(net, p_name, pipeid, NULL);
+ if (pipeid != P4TC_KERNEL_PIPEID && !IS_ERR(pipeline)) {
+ NL_SET_ERR_MSG(extack, "Pipeline exists");
+ ret = -EEXIST;
+ goto out;
+ }
+
+ pipeline = kzalloc(sizeof(*pipeline), GFP_KERNEL);
+ if (unlikely(!pipeline))
+ return ERR_PTR(-ENOMEM);
+
+ if (!p_name || p_name[0] == '\0') {
+ NL_SET_ERR_MSG(extack, "Must specify pipeline name");
+ ret = -EINVAL;
+ goto err;
+ }
+
+ strscpy(pipeline->common.name, p_name, PIPELINENAMSIZ);
+
+ if (pipeid) {
+ ret = idr_alloc_u32(&pipe_net->pipeline_idr, pipeline, &pipeid,
+ pipeid, GFP_KERNEL);
+ } else {
+ pipeid = 1;
+ ret = idr_alloc_u32(&pipe_net->pipeline_idr, pipeline, &pipeid,
+ UINT_MAX, GFP_KERNEL);
+ }
+
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate pipeline id");
+ goto idr_rm;
+ }
+
+ pipeline->common.p_id = pipeid;
+
+ if (tb[P4TC_PIPELINE_NUMTABLES])
+ pipeline->num_tables =
+ nla_get_u16(tb[P4TC_PIPELINE_NUMTABLES]);
+ else
+ pipeline->num_tables = P4TC_DEFAULT_NUM_TABLES;
+
+ pipeline->p_state = P4TC_STATE_NOT_READY;
+
+ pipeline->net = net;
+
+ refcount_set(&pipeline->p_ctrl_ref, 1);
+
+ pipeline->common.ops = (struct p4tc_template_ops *)&p4tc_pipeline_ops;
+
+ return pipeline;
+
+idr_rm:
+ idr_remove(&pipe_net->pipeline_idr, pipeid);
+
+err:
+ kfree(pipeline);
+
+out:
+ return ERR_PTR(ret);
+}
+
+struct p4tc_pipeline *p4tc_pipeline_find_byany(struct net *net,
+ const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_pipeline *pipeline = NULL;
+
+ if (pipeid) {
+ pipeline = p4tc_pipeline_find_byid(net, pipeid);
+ if (!pipeline) {
+ NL_SET_ERR_MSG(extack, "Unable to find pipeline by id");
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ if (p_name) {
+ pipeline = p4tc_pipeline_find_byname(net, p_name);
+ if (!pipeline) {
+ NL_SET_ERR_MSG(extack,
+ "Pipeline name not found");
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify pipeline name or id");
+ return ERR_PTR(-EINVAL);
+ }
+ }
+
+ return pipeline;
+}
+
+struct p4tc_pipeline *p4tc_pipeline_find_get(struct net *net, const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_pipeline *pipeline =
+ p4tc_pipeline_find_byany(net, p_name, pipeid, extack);
+
+ if (IS_ERR(pipeline))
+ return pipeline;
+
+ if (!p4tc_pipeline_get(pipeline)) {
+ NL_SET_ERR_MSG(extack, "Pipeline is stale");
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pipeline;
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_find_get);
+
+void p4tc_pipeline_put(struct p4tc_pipeline *pipeline)
+{
+ __p4tc_pipeline_put(pipeline, &pipeline->common, NULL);
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_put);
+
+struct p4tc_pipeline *
+p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
+ const u32 pipeid,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_pipeline *pipeline =
+ p4tc_pipeline_find_byany(net, p_name, pipeid, extack);
+ if (IS_ERR(pipeline))
+ return pipeline;
+
+ if (pipeline_sealed(pipeline)) {
+ NL_SET_ERR_MSG(extack, "Pipeline is sealed");
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pipeline;
+}
+
+static struct p4tc_pipeline *
+p4tc_pipeline_update(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ const char *p_name, const u32 pipeid,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_PIPELINE_MAX + 1];
+ struct p4tc_pipeline *pipeline;
+ u16 num_tables = 0;
+ int ret = 0;
+
+ ret = nla_parse_nested(tb, P4TC_PIPELINE_MAX, nla, tc_pipeline_policy,
+ extack);
+
+ if (ret < 0)
+ goto out;
+
+ pipeline =
+ p4tc_pipeline_find_byany_unsealed(net, p_name, pipeid, extack);
+ if (IS_ERR(pipeline))
+ return pipeline;
+
+ if (tb[P4TC_PIPELINE_NUMTABLES])
+ num_tables = nla_get_u16(tb[P4TC_PIPELINE_NUMTABLES]);
+
+ if (tb[P4TC_PIPELINE_STATE]) {
+ ret = pipeline_try_set_state_ready(pipeline, extack);
+ if (ret < 0)
+ goto out;
+ }
+
+ if (num_tables)
+ pipeline->num_tables = num_tables;
+
+ return pipeline;
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_pipeline_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid = ids[P4TC_PID_IDX];
+ struct p4tc_pipeline *pipeline;
+
+ switch (n->nlmsg_type) {
+ case RTM_CREATEP4TEMPLATE:
+ pipeline = p4tc_pipeline_create(net, n, nla,
+ nl_path_attrs->pname,
+ pipeid, extack);
+ break;
+ case RTM_UPDATEP4TEMPLATE:
+ pipeline = p4tc_pipeline_update(net, n, nla,
+ nl_path_attrs->pname,
+ pipeid, extack);
+ break;
+ default:
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ if (IS_ERR(pipeline))
+ goto out;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+out:
+ return (struct p4tc_template_common *)pipeline;
+}
+
+static int _p4tc_pipeline_fill_nlmsg(struct sk_buff *skb,
+ const struct p4tc_pipeline *pipeline)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+ if (nla_put_u16(skb, P4TC_PIPELINE_NUMTABLES, pipeline->num_tables))
+ goto out_nlmsg_trim;
+ if (nla_put_u8(skb, P4TC_PIPELINE_STATE, pipeline->p_state))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int p4tc_pipeline_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack)
+{
+ const struct p4tc_pipeline *pipeline = to_pipeline(template);
+
+ if (_p4tc_pipeline_fill_nlmsg(skb, pipeline) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for pipeline");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int p4tc_pipeline_del_one(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ /* User driven pipeline put doesn't transfer the lifetime
+ * of the pipeline to other ref holders. In case of unlocked
+ * table entries, it shall never teardown the pipeline so
+ * need to do an atomic transition here.
+ *
+ * System driven put will serialize with rtnl_lock and
+ * table entries are guaranteed to not be in flight.
+ */
+ if (!refcount_dec_if_one(&pipeline->p_ctrl_ref)) {
+ NL_SET_ERR_MSG(extack, "Pipeline in use");
+ return -EAGAIN;
+ }
+
+ p4tc_pipeline_teardown(pipeline, extack);
+
+ return 0;
+}
+
+static int p4tc_pipeline_gd(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_template_common *tmpl;
+ struct p4tc_pipeline *pipeline;
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid = ids[P4TC_PID_IDX];
+ int ret = 0;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE &&
+ (n->nlmsg_flags & NLM_F_ROOT)) {
+ NL_SET_ERR_MSG(extack, "Pipeline flush not supported");
+ return -EOPNOTSUPP;
+ }
+
+ pipeline = p4tc_pipeline_find_byany(net, nl_path_attrs->pname, pipeid,
+ extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ tmpl = (struct p4tc_template_common *)pipeline;
+ if (p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack) < 0)
+ return -1;
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+ ret = p4tc_pipeline_del_one(pipeline, extack);
+ if (ret < 0)
+ goto out_nlmsg_trim;
+ }
+
+ return ret;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int p4tc_pipeline_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = sock_net(skb->sk);
+ struct p4tc_pipeline_net *pipe_net;
+
+ pipe_net = net_generic(net, pipeline_net_id);
+
+ return p4tc_tmpl_generic_dump(skb, ctx, &pipe_net->pipeline_idr,
+ P4TC_PID_IDX, extack);
+}
+
+static int p4tc_pipeline_dump_1(struct sk_buff *skb,
+ struct p4tc_template_common *common)
+{
+ struct p4tc_pipeline *pipeline = to_pipeline(common);
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *param;
+
+ /* Don't show kernel pipeline in dump */
+ if (pipeline->common.p_id == P4TC_KERNEL_PIPEID)
+ return 1;
+
+ param = nla_nest_start(skb, P4TC_PARAMS);
+ if (!param)
+ goto out_nlmsg_trim;
+ if (nla_put_string(skb, P4TC_PIPELINE_NAME, pipeline->common.name))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, param);
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -ENOMEM;
+}
+
+static int register_pipeline_pernet(void)
+{
+ return register_pernet_subsys(&pipeline_net_ops);
+}
+
+static void __p4tc_pipeline_init(void)
+{
+ int pipeid = P4TC_KERNEL_PIPEID;
+
+ root_pipeline = kzalloc(sizeof(*root_pipeline), GFP_ATOMIC);
+ if (unlikely(!root_pipeline)) {
+ pr_err("Unable to register kernel pipeline\n");
+ return;
+ }
+
+ strscpy(root_pipeline->common.name, "kernel", PIPELINENAMSIZ);
+
+ root_pipeline->common.ops =
+ (struct p4tc_template_ops *)&p4tc_pipeline_ops;
+
+ root_pipeline->common.p_id = pipeid;
+
+ root_pipeline->p_state = P4TC_STATE_READY;
+}
+
+static void p4tc_pipeline_init(void)
+{
+ if (register_pipeline_pernet() < 0)
+ pr_err("Failed to register per net pipeline IDR");
+
+ if (p4tc_register_types() < 0)
+ pr_err("Failed to register P4 types");
+
+ __p4tc_pipeline_init();
+}
+
+const struct p4tc_template_ops p4tc_pipeline_ops = {
+ .init = p4tc_pipeline_init,
+ .cu = p4tc_pipeline_cu,
+ .fill_nlmsg = p4tc_pipeline_fill_nlmsg,
+ .gd = p4tc_pipeline_gd,
+ .put = __p4tc_pipeline_put,
+ .dump = p4tc_pipeline_dump,
+ .dump_1 = p4tc_pipeline_dump_1,
+};
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
new file mode 100644
index 000000000..c6eaaf47b
--- /dev/null
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -0,0 +1,585 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_api.c P4 TC API
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+static const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
+ [P4TC_ROOT] = { .type = NLA_NESTED },
+ [P4TC_ROOT_PNAME] = { .type = NLA_STRING, .len = PIPELINENAMSIZ },
+};
+
+static const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
+ [P4TC_PATH] = { .type = NLA_BINARY,
+ .len = P4TC_PATH_MAX * sizeof(u32) },
+ [P4TC_PARAMS] = { .type = NLA_NESTED },
+};
+
+static bool obj_is_valid(u32 obj)
+{
+ switch (obj) {
+ case P4TC_OBJ_PIPELINE:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static const struct p4tc_template_ops *p4tc_ops[P4TC_OBJ_MAX + 1] = {
+ [P4TC_OBJ_PIPELINE] = &p4tc_pipeline_ops,
+};
+
+int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct idr *idr, int idx,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_template_common *common;
+ unsigned long id = 0;
+ unsigned long tmp;
+ int i = 0;
+
+ id = ctx->ids[idx];
+
+ idr_for_each_entry_continue_ul(idr, common, tmp, id) {
+ struct nlattr *count;
+ int ret;
+
+ if (i == P4TC_MSGBATCH_SIZE)
+ break;
+
+ count = nla_nest_start(skb, i + 1);
+ if (!count)
+ goto out_nlmsg_trim;
+ ret = common->ops->dump_1(skb, common);
+ if (ret < 0) {
+ goto out_nlmsg_trim;
+ } else if (ret) {
+ nla_nest_cancel(skb, count);
+ continue;
+ }
+ nla_nest_end(skb, count);
+
+ i++;
+ }
+
+ if (i == 0) {
+ if (!ctx->ids[idx])
+ NL_SET_ERR_MSG(extack,
+ "There are no pipeline components");
+ return 0;
+ }
+
+ ctx->ids[idx] = id;
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -ENOMEM;
+}
+
+static int tc_ctl_p4_tmpl_gd_1(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct nlattr *tb[P4TC_MAX + 1];
+ struct p4tc_template_ops *op;
+ u32 ids[P4TC_PATH_MAX] = {};
+ int ret;
+
+ if (!obj_is_valid(t->obj)) {
+ NL_SET_ERR_MSG(extack, "Invalid object type");
+ return -EINVAL;
+ }
+
+ ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ ids[P4TC_PID_IDX] = t->pipeid;
+
+ nl_path_attrs->ids = ids;
+
+ op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+
+ ret = op->gd(net, skb, n, tb[P4TC_PARAMS], nl_path_attrs, extack);
+ if (ret < 0)
+ return ret;
+
+ if (!t->pipeid)
+ t->pipeid = ids[P4TC_PID_IDX];
+
+ return ret;
+}
+
+static int tc_ctl_p4_tmpl_gd_n(struct sk_buff *skb, struct nlmsghdr *n,
+ char *p_name, struct nlattr *nla, int event,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct p4tc_path_nlattrs nl_path_attrs = {0};
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ struct net *net = sock_net(skb->sk);
+ u32 portid = NETLINK_CB(skb).portid;
+ struct p4tcmsg *t_new;
+ struct sk_buff *nskb;
+ struct nlmsghdr *nlh;
+ struct nlattr *pnatt;
+ struct nlattr *root;
+ int ret = 0;
+ int i;
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+ if (ret < 0)
+ return ret;
+
+ nskb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+ if (!nskb)
+ return -ENOMEM;
+
+ nlh = nlmsg_put(nskb, portid, n->nlmsg_seq, event, sizeof(*t),
+ n->nlmsg_flags);
+ if (!nlh) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ t_new = nlmsg_data(nlh);
+ t_new->pipeid = t->pipeid;
+ t_new->obj = t->obj;
+
+ pnatt = nla_reserve(nskb, P4TC_ROOT_PNAME, PIPELINENAMSIZ);
+ if (!pnatt) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ nl_path_attrs.pname = nla_data(pnatt);
+ if (!p_name) {
+ /* Filled up by the operation or forced failure */
+ memset(nl_path_attrs.pname, 0, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = false;
+ } else {
+ strscpy(nl_path_attrs.pname, p_name, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = true;
+ }
+
+ root = nla_nest_start(nskb, P4TC_ROOT);
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ struct nlattr *nest = nla_nest_start(nskb, i);
+
+ ret = tc_ctl_p4_tmpl_gd_1(net, nskb, nlh, tb[i], &nl_path_attrs,
+ extack);
+ if (n->nlmsg_flags & NLM_F_ROOT && event == RTM_DELP4TEMPLATE) {
+ if (ret <= 0)
+ goto out;
+ } else {
+ if (ret < 0)
+ goto out;
+ }
+ nla_nest_end(nskb, nest);
+ }
+ nla_nest_end(nskb, root);
+
+ nlmsg_end(nskb, nlh);
+
+ if (event == RTM_GETP4TEMPLATE)
+ return rtnl_unicast(nskb, net, portid);
+
+ return rtnetlink_send(nskb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+out:
+ kfree_skb(nskb);
+ return ret;
+}
+
+static int tc_ctl_p4_tmpl_get(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ int ret;
+
+ ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(extack,
+ "Netlink P4TC template attributes missing");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ return tc_ctl_p4_tmpl_gd_n(skb, n, p_name, tb[P4TC_ROOT],
+ RTM_GETP4TEMPLATE, extack);
+}
+
+static int tc_ctl_p4_tmpl_delete(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ int ret;
+
+ if (!netlink_capable(skb, CAP_NET_ADMIN))
+ return -EPERM;
+
+ ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(extack,
+ "Netlink P4TC template attributes missing");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ return tc_ctl_p4_tmpl_gd_n(skb, n, p_name, tb[P4TC_ROOT],
+ RTM_DELP4TEMPLATE, extack);
+}
+
+static int p4tc_template_put(struct net *net,
+ struct p4tc_template_common *common,
+ struct netlink_ext_ack *extack)
+{
+ /* Every created template is bound to a pipeline */
+ struct p4tc_pipeline *pipeline =
+ p4tc_pipeline_find_byid(net, common->p_id);
+ return common->ops->put(pipeline, common, extack);
+}
+
+static struct p4tc_template_common *
+p4tc_tmpl_cu_1(struct sk_buff *skb, struct net *net, struct nlmsghdr *n,
+ struct p4tc_path_nlattrs *nl_path_attrs, struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct p4tc_template_common *tmpl;
+ struct nlattr *tb[P4TC_MAX + 1];
+ struct p4tc_template_ops *op;
+ u32 ids[P4TC_PATH_MAX] = {};
+ int ret;
+
+ if (!obj_is_valid(t->obj)) {
+ NL_SET_ERR_MSG(extack, "Invalid object type");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = nla_parse_nested(tb, P4TC_MAX, nla, p4tc_policy, extack);
+ if (ret < 0)
+ goto out;
+
+ if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PARAMS)) {
+ NL_SET_ERR_MSG(extack, "Must specify object attributes");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ids[P4TC_PID_IDX] = t->pipeid;
+ nl_path_attrs->ids = ids;
+
+ op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+ tmpl = op->cu(net, n, tb[P4TC_PARAMS], nl_path_attrs, extack);
+ if (IS_ERR(tmpl))
+ return tmpl;
+
+ ret = op->fill_nlmsg(net, skb, tmpl, extack);
+ if (ret < 0)
+ goto put;
+
+ if (!t->pipeid)
+ t->pipeid = ids[P4TC_PID_IDX];
+
+ return tmpl;
+
+put:
+ p4tc_template_put(net, tmpl, extack);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static int p4tc_tmpl_cu_n(struct sk_buff *skb, struct nlmsghdr *n,
+ struct nlattr *nla, char *p_name,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_template_common *tmpls[P4TC_MSGBATCH_SIZE];
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ bool update = p4tc_tmpl_msg_is_update(n);
+ struct net *net = sock_net(skb->sk);
+ u32 portid = NETLINK_CB(skb).portid;
+ struct p4tc_path_nlattrs nl_path_attrs;
+ struct p4tcmsg *t_new;
+ struct sk_buff *nskb;
+ struct nlmsghdr *nlh;
+ struct nlattr *pnatt;
+ struct nlattr *root;
+ int ret;
+ int i;
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+ if (ret < 0)
+ return ret;
+
+ nskb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+ if (!nskb)
+ return -ENOMEM;
+
+ nlh = nlmsg_put(nskb, portid, n->nlmsg_seq, n->nlmsg_type,
+ sizeof(*t), n->nlmsg_flags);
+ if (!nlh) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ t_new = nlmsg_data(nlh);
+ if (!t_new) {
+ NL_SET_ERR_MSG(extack, "Message header is missing");
+ ret = -EINVAL;
+ goto out;
+ }
+ t_new->pipeid = t->pipeid;
+ t_new->obj = t->obj;
+
+ pnatt = nla_reserve(nskb, P4TC_ROOT_PNAME, PIPELINENAMSIZ);
+ if (!pnatt) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ nl_path_attrs.pname = nla_data(pnatt);
+ if (!p_name) {
+ /* Filled up by the operation or forced failure */
+ memset(nl_path_attrs.pname, 0, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = false;
+ } else {
+ strscpy(nl_path_attrs.pname, p_name, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = true;
+ }
+
+ root = nla_nest_start(nskb, P4TC_ROOT);
+ if (!root) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0; i < P4TC_MSGBATCH_SIZE && tb[i + 1]; i++) {
+ struct nlattr *nest = nla_nest_start(nskb, i + 1);
+
+ tmpls[i] = p4tc_tmpl_cu_1(nskb, net, nlh, &nl_path_attrs,
+ tb[i + 1], extack);
+ if (IS_ERR(tmpls[i])) {
+ ret = PTR_ERR(tmpls[i]);
+ if (i > 0 && update) {
+ nla_nest_cancel(nskb, nest);
+ goto nest_end_root;
+ }
+ goto undo_prev;
+ }
+
+ nla_nest_end(nskb, nest);
+ }
+nest_end_root:
+ nla_nest_end(nskb, root);
+
+ if (!t_new->pipeid)
+ t_new->pipeid = ret;
+
+ nlmsg_end(nskb, nlh);
+
+ return rtnetlink_send(nskb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+
+undo_prev:
+ if (!update) {
+ while (i-- > 0) {
+ struct p4tc_template_common *tmpl = tmpls[i];
+
+ p4tc_template_put(net, tmpl, extack);
+ }
+ }
+
+out:
+ kfree_skb(nskb);
+ return ret;
+}
+
+static int tc_ctl_p4_tmpl_cu(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ int ret = 0;
+
+ if (!netlink_capable(skb, CAP_NET_ADMIN))
+ return -EPERM;
+
+ ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(extack,
+ "Netlink P4TC template attributes missing");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ return p4tc_tmpl_cu_n(skb, n, tb[P4TC_ROOT], p_name, extack);
+}
+
+static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
+ char *p_name, struct netlink_callback *cb)
+{
+ struct p4tc_dump_ctx *ctx = (void *)cb->ctx;
+ struct netlink_ext_ack *extack = cb->extack;
+ u32 portid = NETLINK_CB(cb->skb).portid;
+ const struct nlmsghdr *n = cb->nlh;
+ struct nlattr *tb[P4TC_MAX + 1];
+ struct p4tc_template_ops *op;
+ u32 ids[P4TC_PATH_MAX] = {};
+ struct p4tcmsg *t_new;
+ struct nlmsghdr *nlh;
+ struct nlattr *root;
+ struct p4tcmsg *t;
+ int ret;
+
+ ret = nla_parse_nested_deprecated(tb, P4TC_MAX, arg, p4tc_policy,
+ extack);
+ if (ret < 0)
+ return ret;
+
+ t = (struct p4tcmsg *)nlmsg_data(n);
+ if (!obj_is_valid(t->obj)) {
+ NL_SET_ERR_MSG(extack, "Invalid object type");
+ return -EINVAL;
+ }
+
+ nlh = nlmsg_put(skb, portid, n->nlmsg_seq, n->nlmsg_type,
+ sizeof(*t), n->nlmsg_flags);
+ if (!nlh)
+ return -ENOSPC;
+
+ t_new = nlmsg_data(nlh);
+ t_new->pipeid = t->pipeid;
+ t_new->obj = t->obj;
+
+ root = nla_nest_start(skb, P4TC_ROOT);
+
+ ids[P4TC_PID_IDX] = t->pipeid;
+
+ op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+ ret = op->dump(skb, ctx, tb[P4TC_PARAMS], &p_name, ids, extack);
+ if (ret <= 0)
+ goto out;
+ nla_nest_end(skb, root);
+
+ if (p_name) {
+ if (nla_put_string(skb, P4TC_ROOT_PNAME, p_name)) {
+ ret = -1;
+ goto out;
+ }
+ }
+
+ if (!t_new->pipeid)
+ t_new->pipeid = ids[P4TC_PID_IDX];
+
+ nlmsg_end(skb, nlh);
+
+ return ret;
+
+out:
+ nlmsg_cancel(skb, nlh);
+ return ret;
+}
+
+static int tc_ctl_p4_tmpl_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ int ret;
+
+ ret = nlmsg_parse(cb->nlh, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, cb->extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(cb->extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(cb->extack,
+ "Netlink P4TC template attributes missing");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ return tc_ctl_p4_tmpl_dump_1(skb, tb[P4TC_ROOT], p_name, cb);
+}
+
+static int __init p4tc_template_init(void)
+{
+ u32 obj_id;
+
+ rtnl_register(PF_UNSPEC, RTM_CREATEP4TEMPLATE, tc_ctl_p4_tmpl_cu, NULL,
+ 0);
+ rtnl_register(PF_UNSPEC, RTM_UPDATEP4TEMPLATE, tc_ctl_p4_tmpl_cu, NULL,
+ 0);
+ rtnl_register(PF_UNSPEC, RTM_DELP4TEMPLATE, tc_ctl_p4_tmpl_delete, NULL,
+ 0);
+ rtnl_register(PF_UNSPEC, RTM_GETP4TEMPLATE, tc_ctl_p4_tmpl_get,
+ tc_ctl_p4_tmpl_dump, 0);
+
+ for (obj_id = P4TC_OBJ_PIPELINE; obj_id < P4TC_OBJ_MAX + 1; obj_id++) {
+ const struct p4tc_template_ops *op = p4tc_ops[obj_id];
+
+ if (!op)
+ continue;
+
+ if (!obj_is_valid(obj_id))
+ continue;
+
+ if (op->init)
+ op->init();
+ }
+
+ return 0;
+}
+
+subsys_initcall(p4tc_template_init);
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 8ff670cf1..e50a1c1ff 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -94,6 +94,10 @@ static const struct nlmsg_perm nlmsg_route_perms[] = {
{ RTM_NEWTUNNEL, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
{ RTM_DELTUNNEL, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
{ RTM_GETTUNNEL, NETLINK_ROUTE_SOCKET__NLMSG_READ },
+ { RTM_CREATEP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_DELP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_GETP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_READ },
+ { RTM_UPDATEP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
};
static const struct nlmsg_perm nlmsg_tcpdiag_perms[] = {
@@ -177,7 +181,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
* structures at the top of this file with the new mappings
* before updating the BUILD_BUG_ON() macro!
*/
- BUILD_BUG_ON(RTM_MAX != (RTM_NEWTUNNEL + 3));
+ BUILD_BUG_ON(RTM_MAX != (RTM_CREATEP4TEMPLATE + 3));
err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
sizeof(nlmsg_route_perms));
break;
--
2.34.1
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (8 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 16:28 ` Jiri Pirko
2023-11-17 6:51 ` John Fastabend
2023-11-16 14:59 ` [PATCH net-next v8 11/15] p4tc: add template table " Jamal Hadi Salim
` (5 subsequent siblings)
15 siblings, 2 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
This commit allows users to create, update, delete, get, flush and dump
dynamic action kinds based on a P4 action definition.
At the moment dynamic actions are tied to P4 programs only and cannot be
used outside of a P4 program definition.
Visualize the following action in a P4 program:
action ipv4_forward(@tc_type("macaddr") bit<48> dstAddr, @tc_type("dev") bit<8> port)
{
// Action code (generated by the compiler)
}
The above is an action called ipv4_forward which receives as parameters
a bit<48> dstAddr (a MAC address) and a bit<8> port (something close to
an ifindex). It is invoked on a P4 table match as such:
table mytable {
key = {
hdr.ipv4.dstAddr @tc_type("ipv4"): lpm;
}
actions = {
ipv4_forward;
drop;
NoAction;
}
size = 1024;
}
We don't have an equivalent built-in "ipv4_forward" action in TC, so we
create this action dynamically.
The mechanics of dynamic actions follow the CRUD semantics.
___DYNAMIC ACTION KIND CREATION___
In this stage we issue the creation command for the dynamic action which
specifies the action name, its ID, parameters and the parameter types.
So for the ipv4_forward action, the creation would look something like
this:
tc p4template create action/aP4proggie/ipv4_forward \
param dstAddr type macaddr id 1 param port type dev id 2
Note1: Although the P4 program defines dstAddr as type bit<48>, we use
our own type called macaddr (likewise for port) - see the commit on P4
types for details.
Note2: All the template commands (tc p4template) are generated by the
p4c compiler.
Note that in the template creation op we usually just specify the action
name, the parameters and their respective types. Also note that we specify
a pipeline name in the template creation command. As an example, the
above command creates an action template that is bound to the
pipeline (program) named aP4proggie.
Note: In P4, actions are assumed to pre-exist and to have an upper bound on
the number of instances. Typically, if you have 1M table entries you want to
allocate enough action instances to cover the 1M entries. However, this is a
big waste of memory if the action instances are not in use. So in our case we
allow the user to specify a minimal number of action instances in the
template; if more dynamic action instances are needed, they are added on
demand, as in the current tc filter-action relationship.
For example, if one were to create the action ipv4_forward preallocating
128 instances, one would issue the following command:
tc p4template create action/aP4proggie/ipv4_forward num_prealloc 128 \
param dstAddr type macaddr id 1 param port type dev id 2
By default, 16 action instances will be preallocated.
If the user wishes to have more action instances, they will have to be
created individually by the control plane using the tc actions command.
For example:
tc actions add action aP4proggie/ipv4_forward \
param dstAddr AA:BB:CC:DD:EE:DD param port eth1
Only then can they issue a table entry creation command using this newly
created action instance.
Note: this does not prevent a user from binding to an existing action
instance. For example:
tc p4ctrl create aP4proggie/table/mycontrol/mytable \
srcAddr 10.10.10.0/24 action ipv4_forward index 1
___ACTION KIND ACTIVATION___
Once we have provided all the necessary information for the new dynamic
action, we can go to the final stage, which is action activation. In this
stage, we activate the dynamic action and make it available for
instantiation.
To activate the action template, we issue the following command:
tc p4template update action aP4proggie/ipv4_forward state active
After the above command, the action is ready to be instantiated.
___RUNTIME___
This next section deals with the runtime part of action templates, which
handles action instantiation and binding.
To instantiate a new action that was created from a template, we use the
following command:
tc actions add action aP4proggie/ipv4_forward \
param dstAddr AA:BB:CC:DD:EE:FF param port eth0 index 1
Observe these are the same semantics tc already provides today, with the
caveat that we have a keyword "param" preceding the appropriate
parameters. As with classical tc actions, specifying the index is
optional (the kernel provides one when unspecified), as shown below.
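For example, the same instantiation can be issued without an explicit
index and the kernel will pick one:
tc actions add action aP4proggie/ipv4_forward \
param dstAddr AA:BB:CC:DD:EE:FF param port eth0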
As previously stated, we refer to the action by its "full name"
(pipeline_name/action_name). Here we are creating an instance of the
ipv4_forward action, specifying as parameter values AA:BB:CC:DD:EE:FF for
dstAddr and eth0 for port. We can create as many instances of an action
template as we wish.
To bind the above instantiated action to a table entry, you can use the
same classical approach used to bind ordinary actions to filters, for
example:
tc p4ctrl create aP4proggie/table/mycontrol/mytable \
srcAddr 10.10.10.0/24 action ipv4_forward index 1
The above command will bind our newly instantiated action to a table
entry which is executed if there's a match.
Of course one could have created the table entry as:
tc p4ctrl create aP4proggie/table/mycontrol/mytable \
srcAddr 10.10.10.0/24 \
action ipv4_forward param dstAddr AA:BB:CC:DD:EE:FF param port eth0
Actions from other P4 control blocks (in the same pipeline) may be
referenced, since the action index is global within a pipeline.
___OTHER CONTROL COMMANDS___
The lifetime of the dynamic action is tied to its pipeline.
As with all pipeline components, write operations to action templates, such
as create, update and delete, can only be executed if the pipeline is not
sealed. Read/get can be issued even after the pipeline is sealed.
If, after we are done with our action template, we want to delete it, we
should issue the following command:
tc p4template del action/aP4proggie/ipv4_forward
Note: If any instance was created for this action (as illustrated
earlier), then this action cannot be deleted unless you delete all its
instances first.
If we had created more action templates and wanted to flush all of the
action templates from pipeline aP4proggie, one would use the following
command:
tc p4template del action/aP4proggie/
After creating or updating a dynamic action, if one wishes to verify that
it was created correctly, one would use the following command:
tc p4template get action/aP4proggie/ipv4_forward
The above command will display the relevant data for the action,
such as parameter names, types, etc.
If one wanted to check which action templates are associated with a
specific pipeline, one could use the following command:
tc p4template get action/aP4proggie/
Note that this command will only display the names of these action
templates. To verify their specific details, one should use the get
command described above.
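For reference, a typical end-to-end lifecycle, combining the commands
illustrated above (the pipeline, control block, table and action names
are the illustrative ones used throughout this message), looks like:
tc p4template create action/aP4proggie/ipv4_forward \
param dstAddr type macaddr id 1 param port type dev id 2
tc p4template update action aP4proggie/ipv4_forward state active
tc actions add action aP4proggie/ipv4_forward \
param dstAddr AA:BB:CC:DD:EE:FF param port eth0 index 1
tc p4ctrl create aP4proggie/table/mycontrol/mytable \
srcAddr 10.10.10.0/24 action ipv4_forward index 1
tc p4template get action/aP4proggie/ipv4_forward
Deleting the action template (tc p4template del
action/aP4proggie/ipv4_forward) is only possible once all of its
instances have been removed, as noted earlier.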
Tested-by: "Khan, Mohd Arif" <mohd.arif.khan@intel.com>
Tested-by: "Pottimurthy, Sathya Narayana" <sathya.narayana.pottimurthy@intel.com>
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/act_api.h | 1 +
include/net/p4tc.h | 149 ++-
include/net/tc_act/p4tc.h | 28 +
include/uapi/linux/p4tc.h | 56 +
net/sched/p4tc/Makefile | 3 +-
net/sched/p4tc/p4tc_action.c | 2242 ++++++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_pipeline.c | 16 +-
net/sched/p4tc/p4tc_tmpl_api.c | 18 +
8 files changed, 2509 insertions(+), 4 deletions(-)
create mode 100644 include/net/tc_act/p4tc.h
create mode 100644 net/sched/p4tc/p4tc_action.c
diff --git a/include/net/act_api.h b/include/net/act_api.h
index cd5a8e86f..b95a9bc29 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -70,6 +70,7 @@ struct tc_action {
#define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
#define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
#define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
/* Update lastuse only if needed, to avoid dirtying a cache line.
* We use a temp variable to avoid fetching jiffies twice.
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index ccb54d842..68b00fa72 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -9,17 +9,23 @@
#include <linux/refcount.h>
#include <linux/rhashtable.h>
#include <linux/rhashtable-types.h>
+#include <net/tc_act/p4tc.h>
+#include <net/p4tc_types.h>
#define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
#define P4TC_DEFAULT_MAX_RULES 1
#define P4TC_PATH_MAX 3
+#define P4TC_MAX_TENTRIES 33554432
#define P4TC_KERNEL_PIPEID 0
#define P4TC_PID_IDX 0
+#define P4TC_AID_IDX 1
+#define P4TC_PARSEID_IDX 1
struct p4tc_dump_ctx {
u32 ids[P4TC_PATH_MAX];
+ struct rhashtable_iter *iter;
};
struct p4tc_template_common;
@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
struct p4tc_pipeline {
struct p4tc_template_common common;
+ struct idr p_act_idr;
struct rcu_head rcu;
struct net *net;
+ u32 num_created_acts;
/* Accounts for how many entities are referencing this pipeline.
* As for now only P4 filters can refer to pipelines.
*/
@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
const u32 pipeid,
struct netlink_ext_ack *extack);
+struct p4tc_act *tcf_p4_find_act(struct net *net,
+ const struct tc_action_ops *a_o,
+ struct netlink_ext_ack *extack);
+void
+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
+
static inline int p4tc_action_destroy(struct tc_action **acts)
{
+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
int ret = 0;
if (acts) {
- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
+ int j = 0;
+ int i;
+
+ for (i = 0; i < TCA_ACT_MAX_PRIO && acts[i]; i++) {
+ if (acts[i]->tcfa_flags & TCA_ACT_FLAGS_PREALLOC) {
+ const struct tc_action_ops *ops;
+ struct tcf_p4act *p4act;
+ struct p4tc_act *act;
+ struct net *net;
+
+ p4act = (struct tcf_p4act *)acts[i];
+ net = acts[i]->idrinfo->net;
+ ops = acts[i]->ops;
+
+ act = tcf_p4_find_act(net, ops, NULL);
+ tcf_p4_put_prealloc_act(act, p4act);
+ } else {
+ acts_non_prealloc[j] = acts[i];
+ j++;
+ }
+ }
+
+ ret = tcf_action_destroy(acts_non_prealloc, TCA_ACT_UNBIND);
kfree(acts);
}
return ret;
}
+struct p4tc_act_param {
+ struct list_head head;
+ struct rcu_head rcu;
+ void *value;
+ void *mask;
+ struct p4tc_type *type;
+ u32 id;
+ u32 index;
+ u16 bitend;
+ u8 flags;
+ u8 PAD0;
+ char name[ACTPARAMNAMSIZ];
+};
+
+struct p4tc_act_param_ops {
+ int (*init_value)(struct net *net, struct p4tc_act_param_ops *op,
+ struct p4tc_act_param *nparam, struct nlattr **tb,
+ struct netlink_ext_ack *extack);
+ int (*dump_value)(struct sk_buff *skb, struct p4tc_act_param_ops *op,
+ struct p4tc_act_param *param);
+ void (*free)(struct p4tc_act_param *param);
+ u32 len;
+ u32 alloc_len;
+};
+
+struct p4tc_act {
+ struct p4tc_template_common common;
+ struct tc_action_ops ops;
+ struct tc_action_net *tn;
+ struct p4tc_pipeline *pipeline;
+ struct idr params_idr;
+ struct tcf_exts exts;
+ struct list_head head;
+ struct list_head prealloc_list;
+ /* Locks the preallocated actions list.
+ * The list will be used whenever a table entry with an action or a
+ * table default action gets created, updated or deleted. Note that
+ * table entries may be added by both control and data path, so the
+ * list can be modified from both contexts.
+ */
+ spinlock_t list_lock;
+ u32 a_id;
+ u32 num_params;
+ u32 num_prealloc_acts;
+ /* Accounts for how many entities refer to this action. Usually just the
+ * pipeline it belongs to.
+ */
+ refcount_t a_ref;
+ bool active;
+ char full_act_name[ACTNAMSIZ];
+};
+
+extern const struct p4tc_template_ops p4tc_act_ops;
+
+static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
+ struct tc_action *acts[], u32 pipeid,
+ u32 flags, struct netlink_ext_ack *extack)
+{
+ int init_res[TCA_ACT_MAX_PRIO];
+ size_t attrs_size;
+ int ret;
+ int i;
+
+ /* If action was already created, just bind to existing one */
+ flags |= TCA_ACT_FLAGS_BIND;
+ flags |= TCA_ACT_FLAGS_FROM_P4TC;
+ ret = tcf_action_init(net, NULL, nla, NULL, acts, init_res, &attrs_size,
+ flags, 0, extack);
+
+ /* Check if we are trying to bind to dynamic action from different pipeline */
+ for (i = 0; i < TCA_ACT_MAX_PRIO && acts[i]; i++) {
+ struct tc_action *a = acts[i];
+ struct tcf_p4act *p;
+
+ if (a->ops->id <= TCA_ID_MAX)
+ continue;
+
+ p = to_p4act(a);
+ if (p->p_id != pipeid) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to bind to dynact from different pipeline");
+ ret = -EPERM;
+ goto destroy_acts;
+ }
+ }
+
+ return ret;
+
+destroy_acts:
+ p4tc_action_destroy(acts);
+ return ret;
+}
+
+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
+ const char *act_name, const u32 a_id,
+ struct netlink_ext_ack *extack);
+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
+ const u32 a_id);
+
+static inline bool p4tc_action_put_ref(struct p4tc_act *act)
+{
+ return refcount_dec_not_one(&act->a_ref);
+}
+
+struct tcf_p4act *
+tcf_p4_get_next_prealloc_act(struct p4tc_act *act);
+void tcf_p4_set_init_flags(struct tcf_p4act *p4act);
+
#define to_pipeline(t) ((struct p4tc_pipeline *)t)
+#define to_hdrfield(t) ((struct p4tc_hdrfield *)t)
+#define to_act(t) ((struct p4tc_act *)t)
#endif
diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
new file mode 100644
index 000000000..6447fe5ce
--- /dev/null
+++ b/include/net/tc_act/p4tc.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_TC_ACT_P4_H
+#define __NET_TC_ACT_P4_H
+
+#include <net/pkt_cls.h>
+#include <net/act_api.h>
+
+struct tcf_p4act_params {
+ struct tcf_exts exts;
+ struct idr params_idr;
+ struct p4tc_act_param **params_array;
+ struct rcu_head rcu;
+ u32 num_params;
+ u32 tot_params_sz;
+};
+
+struct tcf_p4act {
+ struct tc_action common;
+ /* Params IDR reference passed during runtime */
+ struct tcf_p4act_params __rcu *params;
+ u32 p_id;
+ u32 act_id;
+ struct list_head node;
+};
+
+#define to_p4act(a) ((struct tcf_p4act *)a)
+
+#endif /* __NET_TC_ACT_P4_H */
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 4d33f44c1..7b89229a7 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -4,6 +4,7 @@
#include <linux/types.h>
#include <linux/pkt_sched.h>
+#include <linux/pkt_cls.h>
/* pipeline header */
struct p4tcmsg {
@@ -17,9 +18,12 @@ struct p4tcmsg {
#define P4TC_MSGBATCH_SIZE 16
#define P4TC_MAX_KEYSZ 512
+#define P4TC_DEFAULT_NUM_PREALLOC 16
#define TEMPLATENAMSZ 32
#define PIPELINENAMSIZ TEMPLATENAMSZ
+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
/* Root attributes */
enum {
@@ -35,6 +39,7 @@ enum {
enum {
P4TC_OBJ_UNSPEC,
P4TC_OBJ_PIPELINE,
+ P4TC_OBJ_ACT,
__P4TC_OBJ_MAX,
};
@@ -45,6 +50,7 @@ enum {
P4TC_UNSPEC,
P4TC_PATH,
P4TC_PARAMS,
+ P4TC_COUNT,
__P4TC_MAX,
};
@@ -93,6 +99,56 @@ enum {
#define P4T_MAX (__P4T_MAX - 1)
+/* Action attributes */
+enum {
+ P4TC_ACT_UNSPEC,
+ P4TC_ACT_NAME, /* string */
+ P4TC_ACT_PARMS, /* nested params */
+ P4TC_ACT_OPT, /* action opt */
+ P4TC_ACT_TM, /* action tm */
+ P4TC_ACT_ACTIVE, /* u8 */
+ P4TC_ACT_NUM_PREALLOC, /* u32 num preallocated action instances */
+ P4TC_ACT_PAD,
+ __P4TC_ACT_MAX
+};
+
+#define P4TC_ACT_MAX (__P4TC_ACT_MAX - 1)
+
+/* Action params attributes */
+enum {
+ P4TC_ACT_PARAMS_VALUE_UNSPEC,
+ P4TC_ACT_PARAMS_VALUE_RAW, /* binary */
+ __P4TC_ACT_PARAMS_VALUE_MAX
+};
+
+#define P4TC_ACT_VALUE_PARAMS_MAX (__P4TC_ACT_PARAMS_VALUE_MAX - 1)
+
+enum {
+ P4TC_ACT_PARAMS_TYPE_UNSPEC,
+ P4TC_ACT_PARAMS_TYPE_BITEND, /* u16 */
+ P4TC_ACT_PARAMS_TYPE_CONTAINER_ID, /* u32 */
+ __P4TC_ACT_PARAMS_TYPE_MAX
+};
+
+#define P4TC_ACT_PARAMS_TYPE_MAX (__P4TC_ACT_PARAMS_TYPE_MAX - 1)
+
+/* Action params attributes */
+enum {
+ P4TC_ACT_PARAMS_UNSPEC,
+ P4TC_ACT_PARAMS_NAME, /* string */
+ P4TC_ACT_PARAMS_ID, /* u32 */
+ P4TC_ACT_PARAMS_VALUE, /* bytes */
+ P4TC_ACT_PARAMS_MASK, /* bytes */
+ P4TC_ACT_PARAMS_TYPE, /* nested type */
+ __P4TC_ACT_PARAMS_MAX
+};
+
+#define P4TC_ACT_PARAMS_MAX (__P4TC_ACT_PARAMS_MAX - 1)
+
+struct tc_act_dyna {
+ tc_gen;
+};
+
#define P4TC_RTA(r) \
((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 0881a7563..7dbcf8915 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,3 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
-obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o
+obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
+ p4tc_action.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
new file mode 100644
index 000000000..19db0772c
--- /dev/null
+++ b/net/sched/p4tc/p4tc_action.c
@@ -0,0 +1,2242 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kmod.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <net/flow_offload.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/sch_generic.h>
+#include <net/sock.h>
+#include <net/tc_act/p4tc.h>
+
+static LIST_HEAD(dynact_list);
+
+#define SEPARATOR "/"
+
+static void set_param_indices(struct idr *params_idr)
+{
+ struct p4tc_act_param *param;
+ unsigned long tmp, id;
+ int i = 0;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id) {
+ param->index = i;
+ i++;
+ }
+}
+
+#define P4TC_ACT_CREATED 1
+#define P4TC_ACT_PREALLOC 2
+#define P4TC_ACT_PREALLOC_UNINIT 3
+
+static int __tcf_p4_dyna_init(struct net *net, struct nlattr *est,
+ struct p4tc_act *act, struct tc_act_dyna *parm,
+ struct tc_action **a, struct tcf_proto *tp,
+ struct tc_action_ops *a_o,
+ struct tcf_chain **goto_ch, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ bool from_p4tc = flags & TCA_ACT_FLAGS_FROM_P4TC;
+ bool prealloc = flags & TCA_ACT_FLAGS_PREALLOC;
+ bool replace = flags & TCA_ACT_FLAGS_REPLACE;
+ bool bind = flags & TCA_ACT_FLAGS_BIND;
+ struct p4tc_pipeline *pipeline;
+ struct tcf_p4act *p4act;
+ u32 index = parm->index;
+ bool exists = false;
+ int ret = 0;
+ int err;
+
+ if ((from_p4tc && !prealloc && !replace && !index)) {
+ p4act = tcf_p4_get_next_prealloc_act(act);
+
+ if (p4act) {
+ tcf_p4_set_init_flags(p4act);
+ *a = &p4act->common;
+ return P4TC_ACT_PREALLOC_UNINIT;
+ }
+ }
+
+ err = tcf_idr_check_alloc(act->tn, &index, a, bind);
+ if (err < 0)
+ return err;
+
+ exists = err;
+ if (!exists) {
+ struct tcf_p4act *p;
+
+ ret = tcf_idr_create(act->tn, index, est, a, a_o, bind, true,
+ flags);
+ if (ret) {
+ tcf_idr_cleanup(act->tn, index);
+ return ret;
+ }
+
+ /* dyn_ref here should never be 0, because if we are here, it
+ * means that a template action of this kind was created. Thus
+ * dyn_ref should be at least 1. Also since this operation and
+ * others that add or delete action templates run with
+ * rtnl_lock held, we cannot do this op and a deletion op in
+ * parallel.
+ */
+ WARN_ON(!refcount_inc_not_zero(&a_o->dyn_ref));
+
+ pipeline = act->pipeline;
+
+ p = to_p4act(*a);
+ p->p_id = pipeline->common.p_id;
+ p->act_id = act->a_id;
+
+ p->common.tcfa_flags |= TCA_ACT_FLAGS_PREALLOC;
+ if (!prealloc && !bind) {
+ spin_lock_bh(&act->list_lock);
+ list_add_tail(&p->node, &act->prealloc_list);
+ spin_unlock_bh(&act->list_lock);
+ }
+
+ ret = P4TC_ACT_CREATED;
+ } else {
+ if (bind) {
+ if (((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+ if ((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED) {
+ p4act = to_p4act(*a);
+ tcf_p4_set_init_flags(p4act);
+ return P4TC_ACT_PREALLOC_UNINIT;
+ }
+
+ return P4TC_ACT_PREALLOC;
+ }
+
+ return 0;
+ }
+
+ if (replace) {
+ if (((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+ if ((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED) {
+ p4act = to_p4act(*a);
+ tcf_p4_set_init_flags(p4act);
+ ret = P4TC_ACT_PREALLOC_UNINIT;
+ } else {
+ ret = P4TC_ACT_PREALLOC;
+ }
+ }
+ } else {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Action %s with index %u was already created",
+ (*a)->ops->kind, index);
+ tcf_idr_release(*a, bind);
+ return -EEXIST;
+ }
+ }
+
+ err = tcf_action_check_ctrlact(parm->action, tp, goto_ch, extack);
+ if (err < 0) {
+ tcf_idr_release(*a, bind);
+ return err;
+ }
+
+ return ret;
+}
+
+static void generic_free_param_value(struct p4tc_act_param *param)
+{
+ kfree(param->value);
+ kfree(param->mask);
+}
+
+static const struct nla_policy p4tc_act_params_value_policy[P4TC_ACT_VALUE_PARAMS_MAX + 1] = {
+ [P4TC_ACT_PARAMS_VALUE_RAW] = { .type = NLA_BINARY },
+};
+
+static const struct nla_policy p4tc_act_params_type_policy[P4TC_ACT_PARAMS_TYPE_MAX + 1] = {
+ [P4TC_ACT_PARAMS_TYPE_BITEND] = { .type = NLA_U16 },
+ [P4TC_ACT_PARAMS_TYPE_CONTAINER_ID] = { .type = NLA_U32 },
+};
+
+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
+ struct p4tc_act_param *nparam,
+ struct nlattr **tb,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb_value[P4TC_ACT_VALUE_PARAMS_MAX + 1];
+ u32 value_len;
+ u32 *ifindex;
+ int err;
+
+ if (!tb[P4TC_ACT_PARAMS_VALUE]) {
+ NL_SET_ERR_MSG(extack, "Must specify param value");
+ return -EINVAL;
+ }
+ err = nla_parse_nested(tb_value, P4TC_ACT_VALUE_PARAMS_MAX,
+ tb[P4TC_ACT_PARAMS_VALUE],
+ p4tc_act_params_value_policy, extack);
+ if (err < 0)
+ return err;
+
+ value_len = nla_len(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+ if (value_len != sizeof(u32)) {
+ NL_SET_ERR_MSG(extack, "Value length differs from template's");
+ return -EINVAL;
+ }
+
+ ifindex = nla_data(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+ rcu_read_lock();
+ if (!dev_get_by_index_rcu(net, *ifindex)) {
+ NL_SET_ERR_MSG(extack, "Invalid ifindex");
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+ rcu_read_unlock();
+
+ nparam->value = kmemdup(ifindex, sizeof(*ifindex), GFP_KERNEL);
+ if (!nparam->value)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int dev_dump_param_value(struct sk_buff *skb,
+ struct p4tc_act_param_ops *op,
+ struct p4tc_act_param *param)
+{
+ const u32 *ifindex = param->value;
+ struct nlattr *nest;
+ int ret;
+
+ nest = nla_nest_start(skb, P4TC_ACT_PARAMS_VALUE);
+ if (nla_put_u32(skb, P4TC_ACT_PARAMS_VALUE_RAW, *ifindex)) {
+ ret = -EINVAL;
+ goto out_nla_cancel;
+ }
+ nla_nest_end(skb, nest);
+
+ return 0;
+
+out_nla_cancel:
+ nla_nest_cancel(skb, nest);
+ return ret;
+}
+
+static void dev_free_param_value(struct p4tc_act_param *param)
+{
+ kfree(param->value);
+}
+
+static const struct p4tc_act_param_ops param_ops[P4T_MAX + 1] = {
+ [P4T_DEV] = {
+ .init_value = dev_init_param_value,
+ .dump_value = dev_dump_param_value,
+ .free = dev_free_param_value,
+ },
+};
+
+static void tcf_p4_act_params_destroy(struct tcf_p4act_params *params)
+{
+ struct p4tc_act_param *param;
+ unsigned long param_id, tmp;
+
+ idr_for_each_entry_ul(&params->params_idr, param, tmp, param_id) {
+ struct p4tc_act_param_ops *op;
+
+ idr_remove(&params->params_idr, param_id);
+ op = (struct p4tc_act_param_ops *)&param_ops[param->type->typeid];
+ if (op->free)
+ op->free(param);
+ else
+ generic_free_param_value(param);
+ kfree(param);
+ }
+
+ kfree(params->params_array);
+ idr_destroy(&params->params_idr);
+
+ kfree(params);
+}
+
+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
+{
+ struct tcf_p4act_params *params;
+
+ params = container_of(head, struct tcf_p4act_params, rcu);
+ tcf_p4_act_params_destroy(params);
+}
+
+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
+ struct tcf_p4act_params *params,
+ struct tcf_chain *goto_ch,
+ struct tc_act_dyna *parm, bool exists,
+ struct netlink_ext_ack *extack)
+{
+ struct tcf_p4act_params *params_old;
+ struct tcf_p4act *p;
+
+ p = to_p4act(*a);
+
+ /* sparse is fooled by lock under conditionals.
+ * To avoid false positives, we are repeating these two lines in both
+ * branches of the if-statement
+ */
+ if (exists) {
+ spin_lock_bh(&p->tcf_lock);
+ goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
+ params_old = rcu_replace_pointer(p->params, params, 1);
+ spin_unlock_bh(&p->tcf_lock);
+ } else {
+ goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
+ params_old = rcu_replace_pointer(p->params, params, 1);
+ }
+
+ if (goto_ch)
+ tcf_chain_put_by_act(goto_ch);
+
+ if (params_old)
+ call_rcu(&params_old->rcu, tcf_p4_act_params_destroy_rcu);
+
+ return 0;
+}
+
+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
+ struct p4tc_act *act,
+ struct idr *params_idr,
+ struct list_head *params_lst,
+ struct tc_act_dyna *parm, u32 flags,
+ struct netlink_ext_ack *extack);
+
+static struct tcf_p4act_params *tcf_p4_act_params_init(struct p4tc_act *act)
+{
+ struct tcf_p4act_params *params;
+
+ params = kzalloc(sizeof(*params), GFP_KERNEL);
+ if (!params)
+ return ERR_PTR(-ENOMEM);
+
+ params->params_array = kcalloc(act->num_params,
+ sizeof(struct p4tc_act_param *),
+ GFP_KERNEL);
+ if (!params->params_array) {
+ kfree(params);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ return params;
+}
+
+static struct p4tc_act_param *
+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
+ struct p4tc_act_param *param,
+ unsigned long *param_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *nparam;
+ void *value;
+
+ nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+ if (!nparam)
+ return ERR_PTR(-ENOMEM);
+
+ value = kzalloc(BITS_TO_BYTES(param->type->container_bitsz),
+ GFP_KERNEL);
+ if (!value) {
+ kfree(nparam);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ strscpy(nparam->name, param->name, ACTPARAMNAMSIZ);
+ nparam->id = *param_id;
+ nparam->value = value;
+ nparam->type = param->type;
+
+ return nparam;
+}
+
+static void p4tc_param_put(struct p4tc_act_param *param)
+{
+ kfree(param);
+}
+
+static void free_intermediate_param(struct p4tc_act_param *param)
+{
+ kfree(param->value);
+ p4tc_param_put(param);
+}
+
+static void free_intermediate_params_list(struct list_head *params_list)
+{
+ struct p4tc_act_param *nparam, *p;
+
+ list_for_each_entry_safe(nparam, p, params_list, head) {
+ free_intermediate_param(nparam);
+ }
+}
+
+static int init_prealloc_params(struct p4tc_act *act,
+ struct idr *params_idr,
+ struct list_head *params_lst,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *param;
+ unsigned long param_id = 0;
+ unsigned long tmp;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, param_id) {
+ struct p4tc_act_param *nparam;
+
+ nparam = init_prealloc_param(act, params_idr, param, &param_id,
+ extack);
+ if (IS_ERR(nparam))
+ return PTR_ERR(nparam);
+
+ list_add_tail(&nparam->head, params_lst);
+ }
+
+ return 0;
+}
+
+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
+ const u32 a_id)
+{
+ return idr_find(&pipeline->p_act_idr, a_id);
+}
+
+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
+ struct tc_action **acts,
+ u32 num_prealloc_acts)
+{
+ int i;
+
+ for (i = 0; i < num_prealloc_acts; i++) {
+ struct tcf_p4act *p4act = to_p4act(acts[i]);
+
+ list_add_tail(&p4act->node, &act_tmpl->prealloc_list);
+ }
+
+ tcf_idr_insert_n(acts, num_prealloc_acts);
+}
+
+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
+ struct idr *params_idr,
+ struct tc_action **acts,
+ const u32 num_prealloc_acts,
+ struct netlink_ext_ack *extack)
+{
+ int err;
+ int i;
+
+ for (i = 0; i < num_prealloc_acts; i++) {
+ u32 flags = TCA_ACT_FLAGS_PREALLOC | TCA_ACT_FLAGS_UNREFERENCED;
+ struct tc_action *a = acts[i];
+ struct tc_act_dyna parm = {0};
+ struct list_head params_lst;
+
+ parm.index = i + 1;
+ parm.action = TC_ACT_PIPE;
+
+ INIT_LIST_HEAD(&params_lst);
+
+ err = init_prealloc_params(act, params_idr, &params_lst,
+ extack);
+ if (err < 0) {
+ free_intermediate_params_list(&params_lst);
+ goto destroy_acts;
+ }
+
+ err = tcf_p4_dyna_template_init(net, &a, act, params_idr,
+ &params_lst, &parm, flags,
+ extack);
+ free_intermediate_params_list(&params_lst);
+ if (err < 0)
+ goto destroy_acts;
+
+ acts[i] = a;
+ }
+
+ return 0;
+
+destroy_acts:
+ tcf_action_destroy(acts, false);
+
+ return err;
+}
+
+/* Need to implement after preallocating */
+struct tcf_p4act *
+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
+{
+ struct tcf_p4act *p4_act;
+
+ spin_lock_bh(&act->list_lock);
+ p4_act = list_first_entry_or_null(&act->prealloc_list, struct tcf_p4act,
+ node);
+ if (p4_act) {
+ list_del_init(&p4_act->node);
+ refcount_set(&p4_act->common.tcfa_refcnt, 1);
+ atomic_set(&p4_act->common.tcfa_bindcnt, 1);
+ }
+ spin_unlock_bh(&act->list_lock);
+
+ return p4_act;
+}
+
+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
+{
+ struct tc_action *a;
+
+ a = (struct tc_action *)p4act;
+ a->tcfa_flags &= ~TCA_ACT_FLAGS_UNREFERENCED;
+}
+
+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
+ struct tcf_p4act *p4act)
+{
+ struct tcf_p4act_params *p4act_params;
+ struct p4tc_act_param *param;
+ unsigned long param_id, tmp;
+
+ spin_lock_bh(&p4act->tcf_lock);
+ p4act_params = rcu_dereference_protected(p4act->params, 1);
+ if (p4act_params) {
+ idr_for_each_entry_ul(&p4act_params->params_idr, param, tmp,
+ param_id) {
+ const struct p4tc_type *type = param->type;
+ u32 type_bytesz = BITS_TO_BYTES(type->container_bitsz);
+
+ memset(param->value, 0, type_bytesz);
+ }
+ }
+ p4act->common.tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+ spin_unlock_bh(&p4act->tcf_lock);
+
+ spin_lock_bh(&act->list_lock);
+ list_add_tail(&p4act->node, &act->prealloc_list);
+ spin_unlock_bh(&act->list_lock);
+}
+
+void
+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
+{
+ if (refcount_read(&p4act->common.tcfa_refcnt) == 1) {
+ __tcf_p4_put_prealloc_act(act, p4act);
+ } else {
+ refcount_dec(&p4act->common.tcfa_refcnt);
+ atomic_dec(&p4act->common.tcfa_bindcnt);
+ }
+}
+
+static const struct nla_policy p4tc_act_params_policy[P4TC_ACT_PARAMS_MAX + 1] = {
+ [P4TC_ACT_PARAMS_NAME] = { .type = NLA_STRING, .len = ACTPARAMNAMSIZ },
+ [P4TC_ACT_PARAMS_ID] = { .type = NLA_U32 },
+ [P4TC_ACT_PARAMS_VALUE] = { .type = NLA_NESTED },
+ [P4TC_ACT_PARAMS_MASK] = { .type = NLA_BINARY },
+ [P4TC_ACT_PARAMS_TYPE] = { .type = NLA_NESTED },
+};
+
+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
+ struct p4tc_act_param *param)
+{
+ const u32 bytesz = BITS_TO_BYTES(type->container_bitsz);
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *nla_value;
+
+ nla_value = nla_nest_start(skb, P4TC_ACT_PARAMS_VALUE);
+ if (nla_put(skb, P4TC_ACT_PARAMS_VALUE_RAW, bytesz,
+ param->value))
+ goto out_nlmsg_trim;
+ nla_nest_end(skb, nla_value);
+
+ if (param->mask &&
+ nla_put(skb, P4TC_ACT_PARAMS_MASK, bytesz, param->mask))
+ goto out_nlmsg_trim;
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int generic_init_param_value(struct p4tc_act_param *nparam,
+ struct p4tc_type *type, struct nlattr **tb,
+ struct netlink_ext_ack *extack)
+{
+ const u32 alloc_len = BITS_TO_BYTES(type->container_bitsz);
+ struct nlattr *tb_value[P4TC_ACT_VALUE_PARAMS_MAX + 1];
+ const u32 len = BITS_TO_BYTES(type->bitsz);
+ void *value;
+ int err;
+
+ if (!tb[P4TC_ACT_PARAMS_VALUE]) {
+ NL_SET_ERR_MSG(extack, "Must specify param value");
+ return -EINVAL;
+ }
+
+ err = nla_parse_nested(tb_value, P4TC_ACT_VALUE_PARAMS_MAX,
+ tb[P4TC_ACT_PARAMS_VALUE],
+ p4tc_act_params_value_policy, extack);
+ if (err < 0)
+ return err;
+
+ value = nla_data(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+ if (type->ops->validate_p4t) {
+ err = type->ops->validate_p4t(type, value, 0, nparam->bitend,
+ extack);
+ if (err < 0)
+ return err;
+ }
+
+ if (nla_len(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]) != len)
+ return -EINVAL;
+
+ nparam->value = kzalloc(alloc_len, GFP_KERNEL);
+ if (!nparam->value)
+ return -ENOMEM;
+
+ memcpy(nparam->value, value, len);
+
+ if (tb[P4TC_ACT_PARAMS_MASK]) {
+ const void *mask = nla_data(tb[P4TC_ACT_PARAMS_MASK]);
+
+ if (nla_len(tb[P4TC_ACT_PARAMS_MASK]) != len) {
+ NL_SET_ERR_MSG(extack,
+ "Mask length differs from template's");
+ err = -EINVAL;
+ goto free_value;
+ }
+
+ nparam->mask = kzalloc(alloc_len, GFP_KERNEL);
+ if (!nparam->mask) {
+ err = -ENOMEM;
+ goto free_value;
+ }
+
+ memcpy(nparam->mask, mask, len);
+ }
+
+ return 0;
+
+free_value:
+ kfree(nparam->value);
+ return err;
+}
+
+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
+ const char *param_name)
+{
+ struct p4tc_act_param *param;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id) {
+ if (param == ERR_PTR(-EBUSY))
+ continue;
+ if (strncmp(param->name, param_name, ACTPARAMNAMSIZ) == 0)
+ return param;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_act_param *tcf_param_find_byid(struct idr *params_idr,
+ const u32 param_id)
+{
+ return idr_find(params_idr, param_id);
+}
+
+static struct p4tc_act_param *
+tcf_param_find_byany(struct p4tc_act *act,
+ const char *param_name,
+ const u32 param_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *param;
+ int err;
+
+ if (param_id) {
+ param = tcf_param_find_byid(&act->params_idr, param_id);
+ if (!param) {
+ NL_SET_ERR_MSG(extack, "Unable to find param by id");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (param_name) {
+ param = param_find_byname(&act->params_idr, param_name);
+ if (!param) {
+ NL_SET_ERR_MSG(extack, "Param name not found");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return param;
+
+out:
+ return ERR_PTR(err);
+}
+
+static struct p4tc_act_param *
+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
+ const u32 param_id, struct netlink_ext_ack *extack)
+{
+ char *param_name = NULL;
+
+ if (name_attr)
+ param_name = nla_data(name_attr);
+
+ return tcf_param_find_byany(act, param_name, param_id, extack);
+}
+
+static int __p4_init_param_type(struct p4tc_act_param *param,
+ struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ACT_PARAMS_TYPE_MAX + 1];
+ struct p4tc_type *type;
+ u16 bitend;
+ u32 container_id;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_ACT_PARAMS_TYPE_MAX, nla,
+ p4tc_act_params_type_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (tb[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID]) {
+ container_id =
+ nla_get_u32(tb[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID]);
+
+ type = p4type_find_byid(container_id);
+ if (!type) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Invalid container type id %u\n",
+ container_id);
+ return -EINVAL;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify type container id");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ACT_PARAMS_TYPE_BITEND]) {
+ bitend = nla_get_u16(tb[P4TC_ACT_PARAMS_TYPE_BITEND]);
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify bitend");
+ return -EINVAL;
+ }
+
+ param->type = type;
+ param->bitend = bitend;
+
+ return 0;
+}
+
+static int tcf_p4_act_init_param(struct net *net,
+ struct tcf_p4act_params *params,
+ struct p4tc_act *act, struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ACT_PARAMS_MAX + 1];
+ struct p4tc_act_param *param, *nparam;
+ struct p4tc_act_param_ops *op;
+ u32 param_id = 0;
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_ACT_PARAMS_MAX, nla,
+ p4tc_act_params_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (tb[P4TC_ACT_PARAMS_ID])
+ param_id = nla_get_u32(tb[P4TC_ACT_PARAMS_ID]);
+
+ param = tcf_param_find_byanyattr(act, tb[P4TC_ACT_PARAMS_NAME],
+ param_id, extack);
+ if (IS_ERR(param))
+ return PTR_ERR(param);
+
+ nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+ if (!nparam)
+ return -ENOMEM;
+
+ err = __p4_init_param_type(nparam, tb[P4TC_ACT_PARAMS_TYPE], extack);
+ if (err < 0)
+ goto free;
+
+ if (nparam->type != param->type) {
+ NL_SET_ERR_MSG(extack,
+ "Param type differs from template");
+ err = -EINVAL;
+ goto free;
+ }
+
+ if (nparam->bitend != param->bitend) {
+ NL_SET_ERR_MSG(extack,
+ "Param bitend differs from template");
+ err = -EINVAL;
+ goto free;
+ }
+
+ strscpy(nparam->name, param->name, ACTPARAMNAMSIZ);
+
+ op = (struct p4tc_act_param_ops *)&param_ops[param->type->typeid];
+ if (op->init_value)
+ err = op->init_value(net, op, nparam, tb, extack);
+ else
+ err = generic_init_param_value(nparam, nparam->type, tb, extack);
+
+ if (err < 0)
+ goto free;
+
+ nparam->id = param->id;
+ nparam->index = param->index;
+
+ err = idr_alloc_u32(&params->params_idr, nparam, &nparam->id,
+ nparam->id, GFP_KERNEL);
+ if (err < 0)
+ goto free_val;
+
+ params->params_array[param->index] = nparam;
+
+ return 0;
+
+free_val:
+ if (op->free)
+ op->free(nparam);
+ else
+ generic_free_param_value(nparam);
+
+free:
+ kfree(nparam);
+ return err;
+}
+
+static int tcf_p4_act_init_params(struct net *net,
+ struct tcf_p4act_params *params,
+ struct p4tc_act *act, struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ int err;
+ int i;
+
+ err = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, NULL);
+ if (err < 0)
+ return err;
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ err = tcf_p4_act_init_param(net, params, act, tb[i], extack);
+ if (err < 0)
+ return err;
+ }
+
+ return 0;
+}
+
+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ char full_act_name[ACTNAMSIZ];
+ unsigned long tmp, id;
+ int num_bytes_written;
+ struct p4tc_act *act;
+
+ num_bytes_written = snprintf(full_act_name, ACTNAMSIZ, "%s/%s",
+ pipeline->common.name, act_name);
+ if (num_bytes_written >= ACTNAMSIZ) {
+ NL_SET_ERR_MSG_FMT(extack, "%s/%s is longer than %u\n",
+ pipeline->common.name, act_name, ACTNAMSIZ);
+ return NULL;
+ }
+
+ idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, id)
+ if (strncmp(act->full_act_name, full_act_name, ACTNAMSIZ) == 0)
+ return act;
+
+ return NULL;
+}
+
+struct p4tc_act *tcf_p4_find_act(struct net *net,
+ const struct tc_action_ops *a_o,
+ struct netlink_ext_ack *extack)
+{
+ char *act_name_clone, *act_name, *p_name;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_act *act;
+ int err;
+
+ act_name_clone = act_name = kstrdup(a_o->kind, GFP_KERNEL);
+ if (!act_name)
+ return ERR_PTR(-ENOMEM);
+
+ p_name = strsep(&act_name, SEPARATOR);
+ pipeline = p4tc_pipeline_find_byany(net, p_name, 0, NULL);
+ if (IS_ERR(pipeline)) {
+ err = -ENOENT;
+ goto free_act_name;
+ }
+
+ act = p4tc_action_find_byname(act_name, pipeline, extack);
+ if (!act) {
+ err = -ENOENT;
+ goto free_act_name;
+ }
+ kfree(act_name_clone);
+
+ return act;
+
+free_act_name:
+ kfree(act_name_clone);
+ return ERR_PTR(err);
+}
+
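+/* init_ops callback for dynamic actions: called through the TC actions API
+ * whenever an instance of this action kind is created or bound at runtime.
+ */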
+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
+ struct nlattr *est, struct tc_action **a,
+ struct tcf_proto *tp, struct tc_action_ops *a_o,
+ u32 flags, struct netlink_ext_ack *extack)
+{
+ bool bind = flags & TCA_ACT_FLAGS_BIND;
+ struct nlattr *tb[P4TC_ACT_MAX + 1];
+ struct tcf_chain *goto_ch = NULL;
+ struct tcf_p4act_params *params;
+ struct tcf_p4act *prealloc_act;
+ struct tc_act_dyna *parm;
+ struct p4tc_act *act;
+ bool exists = false;
+ int ret = 0;
+ int err;
+
+ if (flags & TCA_ACT_FLAGS_BIND &&
+ !(flags & TCA_ACT_FLAGS_FROM_P4TC)) {
+ NL_SET_ERR_MSG(extack,
+ "Can only bind to dynamic action from P4TC objects");
+ return -EPERM;
+ }
+
+ if (unlikely(!nla)) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify action netlink attributes");
+ return -EINVAL;
+ }
+
+ err = nla_parse_nested(tb, P4TC_ACT_MAX, nla, NULL, extack);
+ if (err < 0)
+ return err;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ACT_OPT)) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify option netlink attributes");
+ return -EINVAL;
+ }
+
+ act = tcf_p4_find_act(net, a_o, extack);
+ if (IS_ERR(act))
+ return PTR_ERR(act);
+
+ if (!act->active) {
+ NL_SET_ERR_MSG(extack,
+ "Dynamic action must be active to create instance");
+ return -EINVAL;
+ }
+
+ parm = nla_data(tb[P4TC_ACT_OPT]);
+
+ ret = __tcf_p4_dyna_init(net, est, act, parm, a, tp, a_o, &goto_ch,
+ flags, extack);
+ if (ret < 0)
+ return ret;
+ /* If trying to bind to an uninitialised preallocated action, must init below */
+ if (bind && ret == P4TC_ACT_PREALLOC)
+ return 0;
+
+ err = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack);
+ if (err < 0)
+ goto release_idr;
+
+ params = tcf_p4_act_params_init(act);
+ if (IS_ERR(params)) {
+ err = PTR_ERR(params);
+ goto release_idr;
+ }
+
+ idr_init(&params->params_idr);
+ if (tb[P4TC_ACT_PARMS]) {
+ err = tcf_p4_act_init_params(net, params, act,
+ tb[P4TC_ACT_PARMS], extack);
+ if (err < 0)
+ goto release_params;
+ } else {
+ if (!idr_is_empty(&act->params_idr)) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify action parameters");
+ err = -EINVAL;
+ goto release_params;
+ }
+ }
+
+ exists = ret != P4TC_ACT_CREATED;
+ err = __tcf_p4_dyna_init_set(act, a, params, goto_ch, parm, exists,
+ extack);
+ if (err < 0)
+ goto release_params;
+
+ return ret;
+
+release_params:
+ tcf_p4_act_params_destroy(params);
+
+release_idr:
+ if (ret == P4TC_ACT_PREALLOC) {
+ prealloc_act = to_p4act(*a);
+ tcf_p4_put_prealloc_act(act, prealloc_act);
+ (*a)->tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+ } else if (!bind && !exists &&
+ ((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+ prealloc_act = to_p4act(*a);
+ list_del_init(&prealloc_act->node);
+ tcf_idr_release(*a, bind);
+ } else {
+ tcf_idr_release(*a, bind);
+ }
+
+ return err;
+}
+
+static int tcf_p4_act_init_params_list(struct p4tc_act *act,
+ struct tcf_p4act_params *params,
+ struct list_head *params_lst)
+{
+ struct p4tc_act_param *nparam, *tmp;
+ u32 tot_params_sz = 0;
+ int err;
+
+ list_for_each_entry_safe(nparam, tmp, params_lst, head) {
+ err = idr_alloc_u32(&params->params_idr, nparam, &nparam->id,
+ nparam->id, GFP_KERNEL);
+ if (err < 0)
+ return err;
+ list_del(&nparam->head);
+ params->num_params++;
+ tot_params_sz += nparam->type->container_bitsz;
+ }
+ /* Account for the act_id (a u32, counted in bits) on top of the params */
+ params->tot_params_sz = tot_params_sz + (sizeof(u32) << 3);
+
+ return 0;
+}
+
+/* This is the action instantiation that is invoked from the template code,
+ * specifically when initialising preallocated dynamic actions.
+ * This function is analogous to tcf_p4_dyna_init.
+ */
+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
+ struct p4tc_act *act,
+ struct idr *params_idr,
+ struct list_head *params_lst,
+ struct tc_act_dyna *parm, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ bool bind = flags & TCA_ACT_FLAGS_BIND;
+ struct tc_action_ops *a_o = &act->ops;
+ struct tcf_chain *goto_ch = NULL;
+ struct tcf_p4act_params *params;
+ struct tcf_p4act *prealloc_act;
+ bool exists = false;
+ int ret;
+ int err;
+
+ /* Don't need to check if action is active because we only call this
+ * when we are on our way to activating the action.
+ */
+ ret = __tcf_p4_dyna_init(net, NULL, act, parm, a, NULL, a_o, &goto_ch,
+ flags, extack);
+ if (ret < 0)
+ return ret;
+
+ params = tcf_p4_act_params_init(act);
+ if (IS_ERR(params)) {
+ err = PTR_ERR(params);
+ goto release_idr;
+ }
+
+ idr_init(&params->params_idr);
+ if (params_idr) {
+ err = tcf_p4_act_init_params_list(act, params, params_lst);
+ if (err < 0)
+ goto release_params;
+ } else {
+ if (!idr_is_empty(&act->params_idr)) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify action parameters");
+ err = -EINVAL;
+ goto release_params;
+ }
+ }
+
+ exists = ret != P4TC_ACT_CREATED;
+ err = __tcf_p4_dyna_init_set(act, a, params, goto_ch, parm, exists,
+ extack);
+ if (err < 0)
+ goto release_params;
+
+ return err;
+
+release_params:
+ tcf_p4_act_params_destroy(params);
+
+release_idr:
+ if (ret == P4TC_ACT_PREALLOC) {
+ prealloc_act = to_p4act(*a);
+ tcf_p4_put_prealloc_act(act, prealloc_act);
+ (*a)->tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+ } else if (!bind && !exists &&
+ ((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+ prealloc_act = to_p4act(*a);
+ list_del_init(&prealloc_act->node);
+ tcf_idr_release(*a, bind);
+ } else {
+ tcf_idr_release(*a, bind);
+ }
+
+ return err;
+}
+
+static int tcf_p4_dyna_act(struct sk_buff *skb, const struct tc_action *a,
+ struct tcf_result *res)
+{
+ struct tcf_p4act *dynact = to_p4act(a);
+
+ tcf_lastuse_update(&dynact->tcf_tm);
+ tcf_action_update_bstats(&dynact->common, skb);
+
+ return 0;
+}
+
+static int tcf_act_fill_param_type(struct sk_buff *skb,
+ struct p4tc_act_param *param)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+
+ if (nla_put_u16(skb, P4TC_ACT_PARAMS_TYPE_BITEND, param->bitend))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, P4TC_ACT_PARAMS_TYPE_CONTAINER_ID,
+ param->type->typeid))
+ goto nla_put_failure;
+
+ return 0;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int tcf_p4_dyna_dump(struct sk_buff *skb, struct tc_action *a, int bind,
+ int ref)
+{
+ struct tcf_p4act *dynact = to_p4act(a);
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct tc_act_dyna opt = {
+ .index = dynact->tcf_index,
+ .refcnt = refcount_read(&dynact->tcf_refcnt) - ref,
+ .bindcnt = atomic_read(&dynact->tcf_bindcnt) - bind,
+ };
+ struct tcf_p4act_params *params;
+ struct p4tc_act_param *parm;
+ struct nlattr *nest_parms;
+ struct tcf_t t;
+ int i = 1;
+ int id;
+
+ spin_lock_bh(&dynact->tcf_lock);
+
+ opt.action = dynact->tcf_action;
+ if (nla_put(skb, P4TC_ACT_OPT, sizeof(opt), &opt))
+ goto nla_put_failure;
+
+ if (nla_put_string(skb, P4TC_ACT_NAME, a->ops->kind))
+ goto nla_put_failure;
+
+ tcf_tm_dump(&t, &dynact->tcf_tm);
+ if (nla_put_64bit(skb, P4TC_ACT_TM, sizeof(t), &t, P4TC_ACT_PAD))
+ goto nla_put_failure;
+
+ nest_parms = nla_nest_start(skb, P4TC_ACT_PARMS);
+ if (!nest_parms)
+ goto nla_put_failure;
+
+ params = rcu_dereference_protected(dynact->params, 1);
+ if (params) {
+ idr_for_each_entry(&params->params_idr, parm, id) {
+ struct p4tc_act_param_ops *op;
+ struct nlattr *nest_count;
+ struct nlattr *nest_type;
+
+ nest_count = nla_nest_start(skb, i);
+ if (!nest_count)
+ goto nla_put_failure;
+
+ if (nla_put_string(skb, P4TC_ACT_PARAMS_NAME,
+ parm->name))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, P4TC_ACT_PARAMS_ID, parm->id))
+ goto nla_put_failure;
+
+ op = (struct p4tc_act_param_ops *)&param_ops[parm->type->typeid];
+ if (op->dump_value) {
+ if (op->dump_value(skb, op, parm) < 0)
+ goto nla_put_failure;
+ } else {
+ if (generic_dump_param_value(skb, parm->type, parm))
+ goto nla_put_failure;
+ }
+
+ nest_type = nla_nest_start(skb, P4TC_ACT_PARAMS_TYPE);
+ if (!nest_type)
+ goto nla_put_failure;
+
+ tcf_act_fill_param_type(skb, parm);
+ nla_nest_end(skb, nest_type);
+
+ nla_nest_end(skb, nest_count);
+ i++;
+ }
+ }
+ nla_nest_end(skb, nest_parms);
+
+ spin_unlock_bh(&dynact->tcf_lock);
+
+ return skb->len;
+
+nla_put_failure:
+ spin_unlock_bh(&dynact->tcf_lock);
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int tcf_p4_dyna_lookup(struct net *net, const struct tc_action_ops *ops,
+ struct tc_action **a, u32 index)
+{
+ struct p4tc_act *act;
+ int err;
+
+ act = tcf_p4_find_act(net, ops, NULL);
+ if (IS_ERR(act))
+ return PTR_ERR(act);
+
+ err = tcf_idr_search(act->tn, a, index);
+ if (!err)
+ return err;
+
+ if ((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+ return 0;
+
+ return err;
+}
+
+static int tcf_p4_dyna_walker(struct net *net, struct sk_buff *skb,
+ struct netlink_callback *cb, int type,
+ const struct tc_action_ops *ops,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act *act;
+
+ act = tcf_p4_find_act(net, ops, extack);
+ if (IS_ERR(act))
+ return PTR_ERR(act);
+
+ return tcf_generic_walker(act->tn, skb, cb, type, ops, extack);
+}
+
+static void tcf_p4_dyna_cleanup(struct tc_action *a)
+{
+ struct tc_action_ops *ops = (struct tc_action_ops *)a->ops;
+ struct tcf_p4act *m = to_p4act(a);
+ struct tcf_p4act_params *params;
+
+ params = rcu_dereference_protected(m->params, 1);
+
+ if (refcount_read(&ops->dyn_ref) > 1)
+ refcount_dec(&ops->dyn_ref);
+
+ if (params)
+ call_rcu(&params->rcu, tcf_p4_act_params_destroy_rcu);
+}
+
+static struct p4tc_act *
+p4tc_action_find_byany(struct p4tc_pipeline *pipeline,
+ const char *act_name, const u32 a_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act *act;
+ int err;
+
+ if (a_id) {
+ act = p4tc_action_find_byid(pipeline, a_id);
+ if (!act) {
+ NL_SET_ERR_MSG(extack, "Unable to find action by id");
+ err = -ENOENT;
+ goto out;
+ }
+ } else {
+ if (act_name) {
+ act = p4tc_action_find_byname(act_name, pipeline, extack);
+ if (!act) {
+ NL_SET_ERR_MSG(extack, "Action name not found");
+ err = -ENOENT;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify action name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return act;
+
+out:
+ return ERR_PTR(err);
+}
+
+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
+ const char *act_name, const u32 a_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act *act;
+
+ act = p4tc_action_find_byany(pipeline, act_name, a_id, extack);
+ if (IS_ERR(act))
+ return act;
+
+ if (!refcount_inc_not_zero(&act->a_ref)) {
+ NL_SET_ERR_MSG(extack, "Action is stale");
+ return ERR_PTR(-EBUSY);
+ }
+
+ return act;
+}
+
+static struct p4tc_act *
+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ char *act_name = NULL;
+
+ if (act_name_attr)
+ act_name = nla_data(act_name_attr);
+
+ return p4tc_action_find_byany(pipeline, act_name, a_id, extack);
+}
+
+static void p4_put_many_params(struct idr *params_idr)
+{
+ struct p4tc_act_param *param;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id)
+ p4tc_param_put(param);
+}
+
+static int p4_init_param_type(struct p4tc_act_param *param,
+ struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+ struct p4tc_type *type;
+ int ret;
+
+ ret = __p4_init_param_type(param, nla, extack);
+ if (ret < 0)
+ return ret;
+
+ type = param->type;
+ ret = type->ops->validate_p4t(type, NULL, 0, param->bitend, extack);
+ if (ret < 0)
+ return ret;
+
+ return 0;
+}
+
+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
+ struct idr *params_idr,
+ struct nlattr **tb, u32 param_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *param;
+ char *name;
+ int ret;
+
+ if (tb[P4TC_ACT_PARAMS_NAME]) {
+ name = nla_data(tb[P4TC_ACT_PARAMS_NAME]);
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param name");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ param = kmalloc(sizeof(*param), GFP_KERNEL);
+ if (!param) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (tcf_param_find_byid(&act->params_idr, param_id) ||
+ param_find_byname(&act->params_idr, name)) {
+ NL_SET_ERR_MSG(extack, "Param already exists");
+ ret = -EEXIST;
+ goto free;
+ }
+
+ if (tb[P4TC_ACT_PARAMS_TYPE]) {
+ ret = p4_init_param_type(param, tb[P4TC_ACT_PARAMS_TYPE],
+ extack);
+ if (ret < 0)
+ goto free;
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param type");
+ ret = -EINVAL;
+ goto free;
+ }
+
+ if (param_id) {
+ ret = idr_alloc_u32(params_idr, param, &param_id,
+ param_id, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+ goto free;
+ }
+ param->id = param_id;
+ } else {
+ param->id = 1;
+
+ ret = idr_alloc_u32(params_idr, param, &param->id,
+ UINT_MAX, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+ goto free;
+ }
+ }
+
+ strscpy(param->name, name, ACTPARAMNAMSIZ);
+
+ return param;
+
+free:
+ kfree(param);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
+ struct nlattr **tb,
+ struct idr *params_idr,
+ u32 param_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *param_old, *param;
+ int ret;
+
+ param_old = tcf_param_find_byanyattr(act, tb[P4TC_ACT_PARAMS_NAME],
+ param_id, extack);
+ if (IS_ERR(param_old))
+ return param_old;
+
+ param = kzalloc(sizeof(*param), GFP_KERNEL);
+ if (!param) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ strscpy(param->name, param_old->name, ACTPARAMNAMSIZ);
+ param->id = param_old->id;
+
+ if (tb[P4TC_ACT_PARAMS_TYPE]) {
+ ret = p4_init_param_type(param, tb[P4TC_ACT_PARAMS_TYPE],
+ extack);
+ if (ret < 0)
+ goto free;
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param type");
+ ret = -EINVAL;
+ goto free;
+ }
+
+ ret = idr_alloc_u32(params_idr, param, &param->id,
+ param->id, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+ goto free;
+ }
+
+ return param;
+
+free:
+ kfree(param);
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
+ struct nlattr *nla,
+ struct idr *params_idr,
+ bool update,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ACT_PARAMS_MAX + 1];
+ u32 param_id = 0;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_ACT_PARAMS_MAX, nla, NULL, extack);
+ if (ret < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (tb[P4TC_ACT_PARAMS_ID])
+ param_id = nla_get_u32(tb[P4TC_ACT_PARAMS_ID]);
+
+ if (update)
+ return p4_update_param(act, tb, params_idr, param_id, extack);
+ else
+ return p4_create_param(act, params_idr, tb, param_id, extack);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static int p4_act_init_params(struct p4tc_act *act, struct nlattr *nla,
+ struct idr *params_idr, bool update,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ int ret;
+ int i;
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+ if (ret < 0)
+ return -EINVAL;
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ struct p4tc_act_param *param;
+
+ param = p4_act_init_param(act, tb[i], params_idr, update,
+ extack);
+ if (IS_ERR(param)) {
+ ret = PTR_ERR(param);
+ goto params_del;
+ }
+ }
+
+ return i - 1;
+
+params_del:
+ p4_put_many_params(params_idr);
+ return ret;
+}
+
+static int p4_act_init(struct p4tc_act *act, struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ int num_params = 0;
+ int ret;
+
+ idr_init(&act->params_idr);
+
+ if (nla) {
+ num_params =
+ p4_act_init_params(act, nla, &act->params_idr, false,
+ extack);
+ if (num_params < 0) {
+ ret = num_params;
+ goto idr_destroy;
+ }
+ }
+
+ return num_params;
+
+idr_destroy:
+ p4_put_many_params(&act->params_idr);
+ idr_destroy(&act->params_idr);
+ return ret;
+}
+
+static const struct nla_policy p4tc_act_policy[P4TC_ACT_MAX + 1] = {
+ [P4TC_ACT_NAME] = { .type = NLA_STRING, .len = ACTTMPLNAMSIZ },
+ [P4TC_ACT_PARMS] = { .type = NLA_NESTED },
+ [P4TC_ACT_OPT] = NLA_POLICY_EXACT_LEN(sizeof(struct tc_act_dyna)),
+ [P4TC_ACT_NUM_PREALLOC] = NLA_POLICY_MIN(NLA_U32, 1),
+ [P4TC_ACT_ACTIVE] = { .type = NLA_U8 },
+};
+
+static void p4tc_action_net_exit(struct tc_action_net *tn)
+{
+ tcf_idrinfo_destroy(tn->ops, tn->idrinfo);
+ kfree(tn->idrinfo);
+ kfree(tn);
+}
+
+static void p4_act_params_put(struct p4tc_act *act)
+{
+ struct p4tc_act_param *act_param;
+ unsigned long param_id, tmp;
+
+ idr_for_each_entry_ul(&act->params_idr, act_param, tmp, param_id) {
+ idr_remove(&act->params_idr, param_id);
+ kfree(act_param);
+ }
+}
+
+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
+ struct p4tc_act *act, bool teardown,
+ struct netlink_ext_ack *extack)
+{
+ struct tcf_p4act *p4act, *tmp_act;
+
+ if (!teardown && (refcount_read(&act->ops.dyn_ref) > 1 ||
+ refcount_read(&act->a_ref) > 1)) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to delete referenced action template");
+ return -EBUSY;
+ }
+
+ p4_act_params_put(act);
+
+ tcf_unregister_dyn_action(net, &act->ops);
+ /* Free preallocated acts */
+ list_for_each_entry_safe(p4act, tmp_act, &act->prealloc_list, node) {
+ list_del_init(&p4act->node);
+ if (p4act->common.tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+ tcf_idr_release(&p4act->common, true);
+ }
+ p4tc_action_net_exit(act->tn);
+
+ idr_remove(&pipeline->p_act_idr, act->a_id);
+
+ list_del(&act->head);
+
+ kfree(act);
+
+ pipeline->num_created_acts--;
+
+ return 0;
+}
+
+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_act *act)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_act_param *param;
+ struct nlattr *nest, *parms;
+ unsigned long param_id, tmp;
+ int i = 1;
+
+ if (nla_put_u32(skb, P4TC_PATH, act->a_id))
+ goto out_nlmsg_trim;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ if (nla_put_string(skb, P4TC_ACT_NAME, act->full_act_name))
+ goto out_nlmsg_trim;
+
+ if (nla_put_u32(skb, P4TC_ACT_NUM_PREALLOC, act->num_prealloc_acts))
+ goto out_nlmsg_trim;
+
+ parms = nla_nest_start(skb, P4TC_ACT_PARMS);
+ if (!parms)
+ goto out_nlmsg_trim;
+
+ idr_for_each_entry_ul(&act->params_idr, param, tmp, param_id) {
+ struct nlattr *nest_count;
+ struct nlattr *nest_type;
+
+ nest_count = nla_nest_start(skb, i);
+ if (!nest_count)
+ goto out_nlmsg_trim;
+
+ if (nla_put_string(skb, P4TC_ACT_PARAMS_NAME, param->name))
+ goto out_nlmsg_trim;
+
+ if (nla_put_u32(skb, P4TC_ACT_PARAMS_ID, param->id))
+ goto out_nlmsg_trim;
+
+ nest_type = nla_nest_start(skb, P4TC_ACT_PARAMS_TYPE);
+ if (!nest_type)
+ goto out_nlmsg_trim;
+
+ tcf_act_fill_param_type(skb, param);
+ nla_nest_end(skb, nest_type);
+
+ nla_nest_end(skb, nest_count);
+ i++;
+ }
+ nla_nest_end(skb, parms);
+
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack)
+{
+ return _tcf_act_fill_nlmsg(net, skb, to_act(tmpl));
+}
+
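+/* Handles RTM_DELP4TEMPLATE with NLM_F_ROOT: tries to delete every action
+ * template in the pipeline and reports how many were actually flushed.
+ */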
+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ unsigned long tmp, act_id;
+ struct p4tc_act *act;
+ int ret = 0;
+ int i = 0;
+
+ if (nla_put_u32(skb, P4TC_PATH, 0))
+ goto out_nlmsg_trim;
+
+ if (idr_is_empty(&pipeline->p_act_idr)) {
+ NL_SET_ERR_MSG(extack,
+ "There are not action templates to flush");
+ goto out_nlmsg_trim;
+ }
+
+ idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, act_id) {
+ if (__tcf_act_put(net, pipeline, act, false, extack) < 0) {
+ ret = -EBUSY;
+ continue;
+ }
+ i++;
+ }
+
+ nla_put_u32(skb, P4TC_COUNT, i);
+
+ if (ret < 0) {
+ if (i == 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to flush any action template");
+ goto out_nlmsg_trim;
+ } else {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Flushed only %u action templates",
+ i);
+ }
+ }
+
+ return i;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int tcf_act_gd(struct net *net, struct sk_buff *skb, struct nlmsghdr *n,
+ struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ const u32 pipeid = ids[P4TC_PID_IDX], a_id = ids[P4TC_AID_IDX];
+ struct nlattr *tb[P4TC_ACT_MAX + 1] = { NULL };
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_act *act;
+ int ret = 0;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE)
+ pipeline = p4tc_pipeline_find_byany_unsealed(net,
+ nl_path_attrs->pname,
+ pipeid, extack);
+ else
+ pipeline = p4tc_pipeline_find_byany(net,
+ nl_path_attrs->pname,
+ pipeid, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_ACT_MAX, nla, p4tc_act_policy,
+ extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE && (n->nlmsg_flags & NLM_F_ROOT))
+ return tcf_act_flush(skb, net, pipeline, extack);
+
+ act = p4tc_action_find_byanyattr(tb[P4TC_ACT_NAME], a_id, pipeline,
+ extack);
+ if (IS_ERR(act))
+ return PTR_ERR(act);
+
+ if (_tcf_act_fill_nlmsg(net, skb, act) < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for template action");
+ return -EINVAL;
+ }
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+ ret = __tcf_act_put(net, pipeline, act, false, extack);
+ if (ret < 0)
+ goto out_nlmsg_trim;
+ }
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int tcf_act_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act *act = to_act(tmpl);
+
+ return __tcf_act_put(pipeline->net, pipeline, act, true, extack);
+}
+
+static void p4tc_params_replace_many(struct p4tc_act *act,
+ struct idr *params_idr)
+{
+ struct p4tc_act_param *param;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id) {
+ idr_remove(params_idr, param->id);
+ param = idr_replace(&act->params_idr, param, param->id);
+ p4tc_param_put(param);
+ }
+}
+
+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
+ struct p4tc_pipeline *pipeline, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ u32 a_id = ids[P4TC_AID_IDX];
+ size_t num_copied_bytes;
+ struct p4tc_act *act;
+ int num_params = 0;
+ char *act_name;
+ int ret = 0;
+
+ if (tb[P4TC_ACT_NAME]) {
+ act_name = nla_data(tb[P4TC_ACT_NAME]);
+ } else {
+ NL_SET_ERR_MSG(extack, "Must supply action name");
+ return ERR_PTR(-EINVAL);
+ }
+
+ if ((p4tc_action_find_byname(act_name, pipeline, extack))) {
+ NL_SET_ERR_MSG(extack, "Action already exists with same name");
+ return ERR_PTR(-EEXIST);
+ }
+
+ if (p4tc_action_find_byid(pipeline, a_id)) {
+ NL_SET_ERR_MSG(extack, "Action already exists with same id");
+ return ERR_PTR(-EEXIST);
+ }
+
+ act = kzalloc(sizeof(*act), GFP_KERNEL);
+ if (!act)
+ return ERR_PTR(-ENOMEM);
+
+ act->ops.owner = THIS_MODULE;
+ act->ops.act = tcf_p4_dyna_act;
+ act->ops.dump = tcf_p4_dyna_dump;
+ act->ops.cleanup = tcf_p4_dyna_cleanup;
+ act->ops.init_ops = tcf_p4_dyna_init;
+ act->ops.lookup = tcf_p4_dyna_lookup;
+ act->ops.walk = tcf_p4_dyna_walker;
+ act->ops.size = sizeof(struct tcf_p4act);
+ INIT_LIST_HEAD(&act->head);
+
+ act->tn = kzalloc(sizeof(*act->tn), GFP_KERNEL);
+ if (!act->tn) {
+ ret = -ENOMEM;
+ goto free_act_ops;
+ }
+
+ ret = tc_action_net_init(net, act->tn, &act->ops);
+ if (ret < 0) {
+ kfree(act->tn);
+ goto free_act_ops;
+ }
+ act->tn->ops = &act->ops;
+
+ snprintf(act->ops.kind, ACTNAMSIZ, "%s/%s", pipeline->common.name,
+ act_name);
+
+ if (a_id) {
+ ret = idr_alloc_u32(&pipeline->p_act_idr, act, &a_id, a_id,
+ GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to alloc action id");
+ goto free_action_net;
+ }
+
+ act->a_id = a_id;
+ } else {
+ act->a_id = 1;
+
+ ret = idr_alloc_u32(&pipeline->p_act_idr, act, &act->a_id,
+ UINT_MAX, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to alloc action id");
+ goto free_action_net;
+ }
+ }
+
+ /* We are only preallocating the instances once the action template is
+ * activated during update.
+ */
+ if (tb[P4TC_ACT_NUM_PREALLOC]) {
+ u32 *num_prealloc_acts = nla_data(tb[P4TC_ACT_NUM_PREALLOC]);
+
+ if (*num_prealloc_acts > P4TC_MAX_TENTRIES) {
+ ret = -EINVAL;
+ NL_SET_ERR_MSG_FMT(extack,
+ "num_prealloc_acts can't be > %u",
+ P4TC_MAX_TENTRIES);
+ goto idr_rm;
+ }
+ act->num_prealloc_acts = *num_prealloc_acts;
+ } else {
+ act->num_prealloc_acts = P4TC_DEFAULT_NUM_PREALLOC;
+ }
+
+ refcount_set(&act->ops.dyn_ref, 1);
+ ret = tcf_register_dyn_action(net, &act->ops);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to register new action template");
+ goto idr_rm;
+ }
+
+ num_params = p4_act_init(act, tb[P4TC_ACT_PARMS], extack);
+ if (num_params < 0) {
+ ret = num_params;
+ goto unregister;
+ }
+ act->num_params = num_params;
+
+ set_param_indices(&act->params_idr);
+
+ act->pipeline = pipeline;
+
+ pipeline->num_created_acts++;
+
+ act->common.p_id = pipeline->common.p_id;
+ num_copied_bytes = snprintf(act->full_act_name, ACTNAMSIZ, "%s/%s",
+ pipeline->common.name, act_name);
+ if (num_copied_bytes >= ACTNAMSIZ) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Full action name should fit in %u bytes",
+ ACTNAMSIZ);
+ ret = -E2BIG;
+ goto params_put;
+ }
+ num_copied_bytes = snprintf(act->common.name, ACTTMPLNAMSIZ, "%s",
+ act_name);
+ if (num_copied_bytes >= ACTTMPLNAMSIZ) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Action name size should fit in %u bytes",
+ ACTTMPLNAMSIZ);
+ ret = -E2BIG;
+ goto params_put;
+ }
+ act->common.ops = (struct p4tc_template_ops *)&p4tc_act_ops;
+
+ refcount_set(&act->a_ref, 1);
+
+ list_add_tail(&act->head, &dynact_list);
+ INIT_LIST_HEAD(&act->prealloc_list);
+ spin_lock_init(&act->list_lock);
+
+ return act;
+
+params_put:
+ p4_act_params_put(act);
+
+unregister:
+ tcf_unregister_dyn_action(net, &act->ops);
+
+idr_rm:
+ idr_remove(&pipeline->p_act_idr, act->a_id);
+
+free_action_net:
+ p4tc_action_net_exit(act->tn);
+
+free_act_ops:
+ kfree(act);
+
+ return ERR_PTR(ret);
+}
+
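+/* Updates to an action template are only accepted while it is inactive.
+ * Setting P4TC_ACT_ACTIVE to 1 preallocates action instances and activates
+ * the template; setting it to 0 deactivates it if no instances exist.
+ */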
+static struct p4tc_act *tcf_act_update(struct net *net, struct nlattr **tb,
+ struct p4tc_pipeline *pipeline, u32 *ids,
+ u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ const u32 a_id = ids[P4TC_AID_IDX];
+ struct tc_action **prealloc_acts;
+ bool updates_params = false;
+ struct idr params_idr;
+ u32 num_prealloc_acts;
+ struct p4tc_act *act;
+ int num_params = 0;
+ s8 active = -1;
+ int ret = 0;
+
+ act = p4tc_action_find_byanyattr(tb[P4TC_ACT_NAME], a_id, pipeline,
+ extack);
+ if (IS_ERR(act))
+ return act;
+
+ if (tb[P4TC_ACT_ACTIVE])
+ active = nla_get_u8(tb[P4TC_ACT_ACTIVE]);
+
+ if (act->active) {
+ if (!active) {
+ if (refcount_read(&act->ops.dyn_ref) > 1) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to inactivate action with instances");
+ return ERR_PTR(-EINVAL);
+ }
+ act->active = false;
+ return act;
+ }
+ NL_SET_ERR_MSG(extack, "Unable to update active action");
+
+ ret = -EINVAL;
+ goto out;
+ }
+
+ idr_init(&params_idr);
+ if (tb[P4TC_ACT_PARMS]) {
+ num_params = p4_act_init_params(act, tb[P4TC_ACT_PARMS],
+ &params_idr, true, extack);
+ if (num_params < 0) {
+ ret = num_params;
+ goto idr_destroy;
+ }
+ set_param_indices(&params_idr);
+ updates_params = true;
+ }
+
+ if (tb[P4TC_ACT_NUM_PREALLOC]) {
+ num_prealloc_acts = nla_get_u32(tb[P4TC_ACT_NUM_PREALLOC]);
+ if (num_prealloc_acts > P4TC_MAX_TENTRIES) {
+ ret = -EINVAL;
+ NL_SET_ERR_MSG_FMT(extack,
+ "num_prealloc_acts can't be > %u",
+ P4TC_MAX_TENTRIES);
+ goto params_del;
+ }
+ } else {
+ num_prealloc_acts = act->num_prealloc_acts;
+ }
+
+ act->pipeline = pipeline;
+ if (active == 1) {
+ struct idr *chosen_idr = updates_params ?
+ &params_idr : &act->params_idr;
+
+ prealloc_acts = kcalloc(num_prealloc_acts,
+ sizeof(*prealloc_acts),
+ GFP_KERNEL);
+ if (!prealloc_acts) {
+ ret = -ENOMEM;
+ goto params_del;
+ }
+
+ ret = tcf_p4_prealloc_acts(pipeline->net, act, chosen_idr,
+ prealloc_acts, num_prealloc_acts,
+ extack);
+ if (ret < 0)
+ goto free_prealloc_acts;
+
+ tcf_p4_prealloc_list_add(act, prealloc_acts,
+ num_prealloc_acts);
+
+ kfree(prealloc_acts);
+
+ act->active = true;
+ } else if (!active) {
+ NL_SET_ERR_MSG(extack, "Action is already inactive");
+ ret = -EINVAL;
+ goto params_del;
+ }
+
+ act->num_prealloc_acts = num_prealloc_acts;
+
+ if (updates_params)
+ p4tc_params_replace_many(act, &params_idr);
+
+ idr_destroy(&params_idr);
+
+ return act;
+
+free_prealloc_acts:
+ kfree(prealloc_acts);
+
+params_del:
+ p4_put_many_params(&params_idr);
+
+idr_destroy:
+ idr_destroy(&params_idr);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ const u32 pipeid = ids[P4TC_PID_IDX];
+ struct nlattr *tb[P4TC_ACT_MAX + 1];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_act *act;
+ int ret;
+
+ pipeline = p4tc_pipeline_find_byany_unsealed(net, nl_path_attrs->pname,
+ pipeid, extack);
+ if (IS_ERR(pipeline))
+ return (void *)pipeline;
+
+ ret = nla_parse_nested(tb, P4TC_ACT_MAX, nla, p4tc_act_policy, extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ switch (n->nlmsg_type) {
+ case RTM_CREATEP4TEMPLATE:
+ act = tcf_act_create(net, tb, pipeline, ids, extack);
+ break;
+ case RTM_UPDATEP4TEMPLATE:
+ act = tcf_act_update(net, tb, pipeline, ids, n->nlmsg_flags,
+ extack);
+ break;
+ default:
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ if (IS_ERR(act))
+ goto out;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+out:
+ return (struct p4tc_template_common *)act;
+}
+
+static int tcf_act_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = sock_net(skb->sk);
+ struct p4tc_pipeline *pipeline;
+
+ if (!ctx->ids[P4TC_PID_IDX]) {
+ pipeline = p4tc_pipeline_find_byany(net, *p_name,
+ ids[P4TC_PID_IDX], extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+ ctx->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+ } else {
+ pipeline = p4tc_pipeline_find_byid(net, ctx->ids[P4TC_PID_IDX]);
+ }
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!(*p_name))
+ *p_name = pipeline->common.name;
+
+ return p4tc_tmpl_generic_dump(skb, ctx, &pipeline->p_act_idr,
+ P4TC_AID_IDX, extack);
+}
+
+static int tcf_act_dump_1(struct sk_buff *skb,
+ struct p4tc_template_common *common)
+{
+ struct nlattr *param = nla_nest_start(skb, P4TC_PARAMS);
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_act *act = to_act(common);
+
+ if (!param)
+ goto out_nlmsg_trim;
+
+ if (nla_put_string(skb, P4TC_ACT_NAME, act->full_act_name))
+ goto out_nlmsg_trim;
+
+ if (nla_put_u8(skb, P4TC_ACT_ACTIVE, act->active))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, param);
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -ENOMEM;
+}
+
+const struct p4tc_template_ops p4tc_act_ops = {
+ .init = NULL,
+ .cu = tcf_act_cu,
+ .put = tcf_act_put,
+ .gd = tcf_act_gd,
+ .fill_nlmsg = tcf_act_fill_nlmsg,
+ .dump = tcf_act_dump,
+ .dump_1 = tcf_act_dump_1,
+};
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index fc6e49573..4c34c8534 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -74,6 +74,8 @@ static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
{
+ idr_destroy(&pipeline->p_act_idr);
+
kfree(pipeline);
}
@@ -95,8 +97,12 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
struct net *net = pipeline->net;
struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
struct net *pipeline_net = maybe_get_net(net);
+ unsigned long iter_act_id;
+ struct p4tc_act *act;
+ unsigned long tmp;
- idr_remove(&pipe_net->pipeline_idr, pipeline->common.p_id);
+ idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, iter_act_id)
+ act->common.ops->put(pipeline, &act->common, extack);
/* If we are on netns cleanup we can't touch the pipeline_idr.
* On pre_exit we will destroy the idr but never call into teardown
@@ -151,6 +157,7 @@ static inline int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
}
pipeline->p_state = P4TC_STATE_READY;
+
return true;
}
@@ -248,6 +255,10 @@ static struct p4tc_pipeline *p4tc_pipeline_create(struct net *net,
else
pipeline->num_tables = P4TC_DEFAULT_NUM_TABLES;
+ idr_init(&pipeline->p_act_idr);
+
+ pipeline->num_created_acts = 0;
+
pipeline->p_state = P4TC_STATE_NOT_READY;
pipeline->net = net;
@@ -502,7 +513,8 @@ static int p4tc_pipeline_gd(struct net *net, struct sk_buff *skb,
return PTR_ERR(pipeline);
tmpl = (struct p4tc_template_common *)pipeline;
- if (p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack) < 0)
+ ret = p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack);
+ if (ret < 0)
return -1;
if (!ids[P4TC_PID_IDX])
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index c6eaaf47b..ad81a7089 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -42,6 +42,7 @@ static bool obj_is_valid(u32 obj)
{
switch (obj) {
case P4TC_OBJ_PIPELINE:
+ case P4TC_OBJ_ACT:
return true;
default:
return false;
@@ -50,6 +51,7 @@ static bool obj_is_valid(u32 obj)
static const struct p4tc_template_ops *p4tc_ops[P4TC_OBJ_MAX + 1] = {
[P4TC_OBJ_PIPELINE] = &p4tc_pipeline_ops,
+ [P4TC_OBJ_ACT] = &p4tc_act_ops,
};
int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
@@ -124,6 +126,11 @@ static int tc_ctl_p4_tmpl_gd_1(struct net *net, struct sk_buff *skb,
ids[P4TC_PID_IDX] = t->pipeid;
+ if (tb[P4TC_PATH]) {
+ const u32 *arg_ids = nla_data(tb[P4TC_PATH]);
+
+ memcpy(&ids[P4TC_PID_IDX + 1], arg_ids, nla_len(tb[P4TC_PATH]));
+ }
nl_path_attrs->ids = ids;
op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
@@ -311,6 +318,12 @@ p4tc_tmpl_cu_1(struct sk_buff *skb, struct net *net, struct nlmsghdr *n,
}
ids[P4TC_PID_IDX] = t->pipeid;
+
+ if (tb[P4TC_PATH]) {
+ const u32 *arg_ids = nla_data(tb[P4TC_PATH]);
+
+ memcpy(&ids[P4TC_PID_IDX + 1], arg_ids, nla_len(tb[P4TC_PATH]));
+ }
nl_path_attrs->ids = ids;
op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
@@ -504,6 +517,11 @@ static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
root = nla_nest_start(skb, P4TC_ROOT);
ids[P4TC_PID_IDX] = t->pipeid;
+ if (tb[P4TC_PATH]) {
+ const u32 *arg_ids = nla_data(tb[P4TC_PATH]);
+
+ memcpy(&ids[P4TC_PID_IDX + 1], arg_ids, nla_len(tb[P4TC_PATH]));
+ }
op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
ret = op->dump(skb, ctx, tb[P4TC_PARAMS], &p_name, ids, extack);
--
2.34.1
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH net-next v8 11/15] p4tc: add template table create, update, delete, get, flush and dump
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (9 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 12/15] p4tc: add runtime table entry create, update, get, delete, " Jamal Hadi Salim
` (4 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
This commit introduces code for the creation and maintenance of P4 tables
defined in a P4 program.
As with all other P4TC objects, tables' lifetimes conform to extended CRUD
operations and are maintained via templates.
It's important to note that write operations, such as create, update and
delete can only be made if the pipeline is not sealed.
Per the P4 specification, tables prefix their name with the control block
(although this could be overridden by P4 annotations).
As an example, if one were to create a table named table1 in a
pipeline named myprog1, on control block "mycontrol", one would use
the following command:
tc p4template create table/myprog1/mycontrol/table1 tblid 1 \
keysz 32 nummasks 8 tentries 8192
Above says that we are creating a table (table1) attached to pipeline
myprog1 on control block mycontrol. Table1's key size is 32 bits wide
and it can have up to 8 associated masks and 8192 entries. The table id
for table1 is 1. The table id is typically provided by the compiler.
Parameters such as nummasks (number of masks this table may have) and
tentries (maximum number of entries this table may have) may also be
omitted in which case 8 masks and 256 entries will be assumed.
Other attributes include:
- Table aging: The aging for the table entries belonging to the table
- Table type: The match type of the table (exact, LPM, or ternary)
- Direct Counter: Table counter instances used directly by the table
- Direct Meter: Table meter instances used directly by the table
If one were to retrieve the table named table1 (before or after the
pipeline is sealed) one would use the following command:
tc -j p4template get table/myprog1/mycontrol/table1 | jq .
If one were to dump all the tables from a pipeline named myprog1, one would
use the following command:
tc p4template get table/myprog1
If one were to update table1 (before the pipeline is sealed) one would use
the following command:
tc p4template update table/myprog1/mycontrol/table1 ....
If one were to delete table1 (before the pipeline is sealed) one would use
the following command:
tc p4template del table/myprog1/mycontrol/table1
If one were to flush all the tables from a pipeline named myprog1, control
block "mycontrol" one would use the following command:
tc p4template del table/myprog1/mycontrol/
___Table Permissions___
Tables can have permissions which apply to all the entries in the specified
table. Permissions are defined for what both the control plane (user space)
and the data path are allowed to do.
The permissions field is a 16-bit value which will hold CRUDXPS (create,
read, update, delete, execute, publish and subscribe) permissions for
control and data path. Bits 13-7 will have the CRUDXPS values for control
and bits 6-0 will have CRUDXPS values for data path. By default each table
has the following permissions:
CRUD-PS-R--X--
This means the control plane can perform CRUDPS operations whereas the
data path can only Read and execute on the entries.
The user can override these permissions when creating the table or when
updating.
For example, the following command will create a table which will not allow
the datapath to create, update or delete entries but give full CRUDP
permissions for the control plane.
$TC p4template create table/aP4proggie/cb/tname tblid 1 keysz 64 type lpm \
permissions 0x3D24 ...
Recall that these permissions come in the form of CRUDXPSCRUDXPS, where the
first CRUDXPS block is for control and the last is for data path.
So 0x3D24 is equivalent to CRUD-P--R--X--
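To make the bit layout concrete, here is a minimal user-space sketch (not part
of this patch; the helper name is made up for illustration) that decodes a
16-bit permissions value into the CRUDXPS string used above, assuming the
layout from the uapi header in this series (control bits 13-7, data bits 6-0):

#include <stdio.h>

/* Illustrative only: mirrors the P4TC_CTRL_PERM_*_BIT / P4TC_DATA_PERM_*_BIT
 * layout from include/uapi/linux/p4tc.h added by this series.
 */
static void p4tc_perm_print(unsigned short perm)
{
	const char letters[] = "CRUDXPS";
	char out[15];
	int i;

	/* Walk bits 13..0; the same CRUDXPS letters repeat for control and data */
	for (i = 0; i < 14; i++)
		out[i] = (perm & (1 << (13 - i))) ? letters[i % 7] : '-';
	out[14] = '\0';
	printf("0x%04x -> %s\n", perm, out);
}

int main(void)
{
	p4tc_perm_print(0x3D24);	/* prints CRUD-P--R--X-- */
	return 0;
}

Compiled standalone, it prints CRUD-P--R--X-- for 0x3D24, matching the
permissions string in the dump output below.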
If we were to issue a read command on a table (tname):
$TC -j p4template get table/aP4proggie/cb/tname | jq .
The output would be the following:
[
{
"obj": "table",
"pname": "aP4Proggie",
"pipeid": 22
},
{
"templates": [
{
"tblid": 1,
"tname": "cb/tname",
"keysz": 64,
"max_entries": 256,
"masks": 8,
"entries": 0,
"permissions": "CRUD-P--R--X--",
"table_type": "lpm",
"acts_list": []
}
]
}
]
Note that the permissions concept is more powerful than the classical const
definition currently taken by P4, which makes everything in a table
read-only.
___Initial Table Entries___
Templating can create initial table entries. For example:
tc p4template update table/myprog/cb/tname \
entry srcAddr 10.10.10.10/24 dstAddr 1.1.1.0/24 prio 17
In this command we are "updating" table cb/tname with a new entry. This
entry has as its key srcAddr concatenated with dstAddr
(both IPv4 addresses) and prio 17.
If one were to read back the entry by issuing the following command:
tc p4template get table/myprog/cb/tname
They would get:
pipeline id 22
table id 1
table name cb/tname
key_sz 64
max entries 256
masks 8
table entries 1
permissions CRUD-P--R--X--
entry:
entry priority 17[permissions-RUD-P--R--X--]
entry key
srcAddr id:1 size:32b type:ipv4 exact fieldval 10.10.10.10/32
dstAddr id:2 size:32b type:ipv4 exact fieldval 1.1.1.0/24
___Table Actions List___
P4 tables allow certain actions, but not others, to be part of a match entry
on a table. P4 also defines default actions to be executed when no entries
match; we have extended this concept with a default hit, which is executed
upon matching an entry that has no action associated with it.
We also allow flags for each of the actions in this list that specify if
the action can be added only as a table entry (tableonly), or only as a
default action (defaultonly). If no flags are specified, it is assumed
that the action can be used in both contexts.
Both default hit and default miss are optional.
An example of specifying a default miss action is as follows:
tc p4template update table/myprog/cb/mytable \
default_miss_action permissions 0x1124 action drop
The above will drop packets if the entry is not found in mytable.
Note that the above makes the default action a const, meaning the control
plane can neither replace it nor delete it.
tc p4template update table/myprog/mytable \
default_hit_action permissions 0x3004 action ok
Whereas the above allows a default hit action to accept the packet.
The permission 0x3004 (binary 11000000000100) means we have only Create and
Read permissions in the control plane and eXecute permissions in the data
plane. This means, for example, that the control plane can create and read
the default hit action but can no longer update or delete it.
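As a sanity check, the permission helper macros added to
include/uapi/linux/p4tc.h in this patch can be used to verify the 0x3004
example from user space (a standalone sketch, assuming the uapi header from
this series is installed as <linux/p4tc.h>):

#include <stdio.h>
#include <linux/p4tc.h>	/* p4tc_ctrl_*_ok() / p4tc_data_*_ok() macros */

int main(void)
{
	unsigned short perm = 0x3004;

	/* Control plane: only create and read are set */
	printf("ctrl C:%d R:%d U:%d D:%d\n",
	       !!p4tc_ctrl_create_ok(perm), !!p4tc_ctrl_read_ok(perm),
	       !!p4tc_ctrl_update_ok(perm), !!p4tc_ctrl_delete_ok(perm));
	/* Data path: only execute is set */
	printf("data X:%d R:%d\n",
	       !!p4tc_data_exec_ok(perm), !!p4tc_data_read_ok(perm));
	return 0;
}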
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/p4tc.h | 136 ++-
include/net/p4tc_types.h | 2 +-
include/uapi/linux/p4tc.h | 122 +++
net/sched/p4tc/Makefile | 2 +-
net/sched/p4tc/p4tc_action.c | 13 +-
net/sched/p4tc/p4tc_pipeline.c | 23 +-
net/sched/p4tc/p4tc_table.c | 1545 ++++++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_tmpl_api.c | 2 +
8 files changed, 1828 insertions(+), 17 deletions(-)
create mode 100644 net/sched/p4tc/p4tc_table.c
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 68b00fa72..9521708e6 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -16,10 +16,18 @@
#define P4TC_DEFAULT_MAX_RULES 1
#define P4TC_PATH_MAX 3
#define P4TC_MAX_TENTRIES 33554432
+#define P4TC_DEFAULT_TENTRIES 256
+#define P4TC_MAX_TMASKS 1024
+#define P4TC_DEFAULT_TMASKS 8
+#define P4TC_MAX_T_AGING 864000000
+#define P4TC_DEFAULT_T_AGING 30000
+
+#define P4TC_MAX_PERMISSION (GENMASK(P4TC_PERM_MAX_BIT, 0))
#define P4TC_KERNEL_PIPEID 0
#define P4TC_PID_IDX 0
+#define P4TC_TBLID_IDX 1
#define P4TC_AID_IDX 1
#define P4TC_PARSEID_IDX 1
@@ -70,6 +78,7 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
struct p4tc_pipeline {
struct p4tc_template_common common;
struct idr p_act_idr;
+ struct idr p_tbl_idr;
struct rcu_head rcu;
struct net *net;
u32 num_created_acts;
@@ -123,6 +132,11 @@ struct p4tc_act *tcf_p4_find_act(struct net *net,
void
tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
+static inline bool pipeline_sealed(struct p4tc_pipeline *pipeline)
+{
+ return pipeline->p_state == P4TC_STATE_READY;
+}
+
static inline int p4tc_action_destroy(struct tc_action **acts)
{
struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
@@ -158,6 +172,63 @@ static inline int p4tc_action_destroy(struct tc_action **acts)
return ret;
}
+#define P4TC_CONTROL_PERMISSIONS (GENMASK(13, 7))
+#define P4TC_DATA_PERMISSIONS (GENMASK(6, 0))
+
+#define P4TC_TABLE_PERMISSIONS \
+ ((GENMASK(P4TC_CTRL_PERM_C_BIT, P4TC_CTRL_PERM_D_BIT)) | \
+ P4TC_CTRL_PERM_P | P4TC_CTRL_PERM_S | P4TC_DATA_PERM_R | \
+ P4TC_DATA_PERM_X)
+
+#define P4TC_PERMISSIONS_UNINIT (1 << P4TC_PERM_MAX_BIT)
+
+struct p4tc_table_defact {
+ struct tc_action **default_acts;
+ /* Will have two 7-bit blocks containing CRUDXPS (create, read, update,
+ * delete, execute, publish and subscribe) permissions for the control plane
+ * and the data plane. The first 7 bits are for control and the next 7
+ * are for the data plane. |crudxpscrudxps| if we were to denote it as UNIX
+ * permission flags.
+ */
+ __u16 permissions;
+ struct rcu_head rcu;
+};
+
+struct p4tc_table_perm {
+ __u16 permissions;
+ struct rcu_head rcu;
+};
+
+struct p4tc_table {
+ struct p4tc_template_common common;
+ struct list_head tbl_acts_list;
+ struct idr tbl_masks_idr;
+ struct idr tbl_prio_idr;
+ struct rhltable tbl_entries;
+ struct p4tc_table_defact __rcu *tbl_default_hitact;
+ struct p4tc_table_defact __rcu *tbl_default_missact;
+ struct p4tc_table_perm __rcu *tbl_permissions;
+ struct p4tc_table_entry_mask __rcu **tbl_masks_array;
+ unsigned long __rcu *tbl_free_masks_bitmap;
+ u64 tbl_aging;
+ /* Locks the available masks IDR which will be used when adding and
+ * deleting table entries.
+ */
+ spinlock_t tbl_masks_idr_lock;
+ u32 tbl_keysz;
+ u32 tbl_id;
+ u32 tbl_max_entries;
+ u32 tbl_max_masks;
+ u32 tbl_curr_num_masks;
+ /* Accounts for how many entities refer to this table. Usually just the
+ * pipeline it belongs to.
+ */
+ refcount_t tbl_ctrl_ref;
+ u16 tbl_type;
+};
+
+extern const struct p4tc_template_ops p4tc_table_ops;
+
struct p4tc_act_param {
struct list_head head;
struct rcu_head rcu;
@@ -210,6 +281,12 @@ struct p4tc_act {
char full_act_name[ACTNAMSIZ];
};
+struct p4tc_table_act {
+ struct list_head node;
+ struct tc_action_ops *ops;
+ u8 flags;
+};
+
extern const struct p4tc_template_ops p4tc_act_ops;
static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
@@ -262,12 +339,69 @@ static inline bool p4tc_action_put_ref(struct p4tc_act *act)
return refcount_dec_not_one(&act->a_ref);
}
+struct p4tc_act_param *tcf_param_find_byid(struct idr *params_idr,
+ const u32 param_id);
+struct p4tc_act_param *tcf_param_find_byany(struct p4tc_act *act,
+ const char *param_name,
+ const u32 param_id,
+ struct netlink_ext_ack *extack);
+
+struct p4tc_table *p4tc_table_find_byany(struct p4tc_pipeline *pipeline,
+ const char *tblname, const u32 tbl_id,
+ struct netlink_ext_ack *extack);
+struct p4tc_table *p4tc_table_find_byid(struct p4tc_pipeline *pipeline,
+ const u32 tbl_id);
+int p4tc_table_try_set_state_ready(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack);
+void p4tc_table_put_mask_array(struct p4tc_pipeline *pipeline);
+struct p4tc_table *p4tc_table_find_get(struct p4tc_pipeline *pipeline,
+ const char *tblname, const u32 tbl_id,
+ struct netlink_ext_ack *extack);
+
+static inline bool p4tc_table_put_ref(struct p4tc_table *table)
+{
+ return refcount_dec_not_one(&table->tbl_ctrl_ref);
+}
+
+struct p4tc_table_default_act_params {
+ struct p4tc_table_defact *default_hitact;
+ struct p4tc_table_defact *default_missact;
+ struct nlattr *default_hit_attr;
+ struct nlattr *default_miss_attr;
+};
+
+int
+p4tc_table_init_default_acts(struct net *net,
+ struct p4tc_table_default_act_params *def_params,
+ struct p4tc_table *table,
+ struct list_head *acts_list,
+ struct netlink_ext_ack *extack);
+
+static inline void
+p4tc_table_defacts_acts_copy(struct p4tc_table_defact *defact_copy,
+ struct p4tc_table_defact *defact_orig)
+{
+ defact_copy->default_acts = defact_orig->default_acts;
+}
+
+void
+p4tc_table_replace_default_acts(struct p4tc_table *table,
+ struct p4tc_table_default_act_params *def_params,
+ bool lock_rtnl);
+
+struct p4tc_table_perm *
+p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
+ struct netlink_ext_ack *extack);
+void p4tc_table_replace_permissions(struct p4tc_table *table,
+ struct p4tc_table_perm *tbl_perm,
+ bool lock_rtnl);
+
struct tcf_p4act *
tcf_p4_get_next_prealloc_act(struct p4tc_act *act);
void tcf_p4_set_init_flags(struct tcf_p4act *p4act);
#define to_pipeline(t) ((struct p4tc_pipeline *)t)
-#define to_hdrfield(t) ((struct p4tc_hdrfield *)t)
#define to_act(t) ((struct p4tc_act *)t)
+#define to_table(t) ((struct p4tc_table *)t)
#endif
diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
index 8f6f002ae..a77576dfe 100644
--- a/include/net/p4tc_types.h
+++ b/include/net/p4tc_types.h
@@ -8,7 +8,7 @@
#include <uapi/linux/p4tc.h>
-#define P4T_MAX_BITSZ 128
+#define P4T_MAX_BITSZ P4TC_MAX_KEYSZ
struct p4tc_type_mask_shift {
void *mask;
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 7b89229a7..9b9937a94 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -24,6 +24,87 @@ struct p4tcmsg {
#define PIPELINENAMSIZ TEMPLATENAMSZ
#define ACTTMPLNAMSIZ TEMPLATENAMSZ
#define ACTPARAMNAMSIZ TEMPLATENAMSZ
+#define TABLENAMSIZ TEMPLATENAMSZ
+
+#define P4TC_TABLE_FLAGS_KEYSZ (1 << 0)
+#define P4TC_TABLE_FLAGS_MAX_ENTRIES (1 << 1)
+#define P4TC_TABLE_FLAGS_MAX_MASKS (1 << 2)
+#define P4TC_TABLE_FLAGS_DEFAULT_KEY (1 << 3)
+#define P4TC_TABLE_FLAGS_PERMISSIONS (1 << 4)
+#define P4TC_TABLE_FLAGS_TYPE (1 << 5)
+#define P4TC_TABLE_FLAGS_AGING (1 << 6)
+
+enum {
+ P4TC_TABLE_TYPE_UNSPEC,
+ P4TC_TABLE_TYPE_EXACT = 1,
+ P4TC_TABLE_TYPE_LPM = 2,
+ P4TC_TABLE_TYPE_TERNARY = 3,
+ __P4TC_TABLE_TYPE_MAX,
+};
+
+#define P4TC_TABLE_TYPE_MAX (__P4TC_TABLE_TYPE_MAX - 1)
+
+#define P4TC_CTRL_PERM_C_BIT 13
+#define P4TC_CTRL_PERM_R_BIT 12
+#define P4TC_CTRL_PERM_U_BIT 11
+#define P4TC_CTRL_PERM_D_BIT 10
+#define P4TC_CTRL_PERM_X_BIT 9
+#define P4TC_CTRL_PERM_P_BIT 8
+#define P4TC_CTRL_PERM_S_BIT 7
+
+#define P4TC_DATA_PERM_C_BIT 6
+#define P4TC_DATA_PERM_R_BIT 5
+#define P4TC_DATA_PERM_U_BIT 4
+#define P4TC_DATA_PERM_D_BIT 3
+#define P4TC_DATA_PERM_X_BIT 2
+#define P4TC_DATA_PERM_P_BIT 1
+#define P4TC_DATA_PERM_S_BIT 0
+
+#define P4TC_PERM_MAX_BIT P4TC_CTRL_PERM_C_BIT
+
+#define P4TC_CTRL_PERM_C (1 << P4TC_CTRL_PERM_C_BIT)
+#define P4TC_CTRL_PERM_R (1 << P4TC_CTRL_PERM_R_BIT)
+#define P4TC_CTRL_PERM_U (1 << P4TC_CTRL_PERM_U_BIT)
+#define P4TC_CTRL_PERM_D (1 << P4TC_CTRL_PERM_D_BIT)
+#define P4TC_CTRL_PERM_X (1 << P4TC_CTRL_PERM_X_BIT)
+#define P4TC_CTRL_PERM_P (1 << P4TC_CTRL_PERM_P_BIT)
+#define P4TC_CTRL_PERM_S (1 << P4TC_CTRL_PERM_S_BIT)
+
+#define P4TC_DATA_PERM_C (1 << P4TC_DATA_PERM_C_BIT)
+#define P4TC_DATA_PERM_R (1 << P4TC_DATA_PERM_R_BIT)
+#define P4TC_DATA_PERM_U (1 << P4TC_DATA_PERM_U_BIT)
+#define P4TC_DATA_PERM_D (1 << P4TC_DATA_PERM_D_BIT)
+#define P4TC_DATA_PERM_X (1 << P4TC_DATA_PERM_X_BIT)
+#define P4TC_DATA_PERM_P (1 << P4TC_DATA_PERM_P_BIT)
+#define P4TC_DATA_PERM_S (1 << P4TC_DATA_PERM_S_BIT)
+
+#define p4tc_ctrl_create_ok(perm) ((perm) & P4TC_CTRL_PERM_C)
+#define p4tc_ctrl_read_ok(perm) ((perm) & P4TC_CTRL_PERM_R)
+#define p4tc_ctrl_update_ok(perm) ((perm) & P4TC_CTRL_PERM_U)
+#define p4tc_ctrl_delete_ok(perm) ((perm) & P4TC_CTRL_PERM_D)
+#define p4tc_ctrl_exec_ok(perm) ((perm) & P4TC_CTRL_PERM_X)
+#define p4tc_ctrl_pub_ok(perm) ((perm) & P4TC_CTRL_PERM_P)
+#define p4tc_ctrl_sub_ok(perm) ((perm) & P4TC_CTRL_PERM_S)
+
+#define p4tc_data_create_ok(perm) ((perm) & P4TC_DATA_PERM_C)
+#define p4tc_data_read_ok(perm) ((perm) & P4TC_DATA_PERM_R)
+#define p4tc_data_update_ok(perm) ((perm) & P4TC_DATA_PERM_U)
+#define p4tc_data_delete_ok(perm) ((perm) & P4TC_DATA_PERM_D)
+#define p4tc_data_exec_ok(perm) ((perm) & P4TC_DATA_PERM_X)
+#define p4tc_data_pub_ok(perm) ((perm) & P4TC_DATA_PERM_P)
+#define p4tc_data_sub_ok(perm) ((perm) & P4TC_DATA_PERM_S)
+
+struct p4tc_table_parm {
+ __u64 tbl_aging;
+ __u32 tbl_keysz;
+ __u32 tbl_max_entries;
+ __u32 tbl_max_masks;
+ __u32 tbl_flags;
+ __u32 tbl_num_entries;
+ __u16 tbl_permissions;
+ __u8 tbl_type;
+ __u8 PAD0;
+};
/* Root attributes */
enum {
@@ -40,6 +121,7 @@ enum {
P4TC_OBJ_UNSPEC,
P4TC_OBJ_PIPELINE,
P4TC_OBJ_ACT,
+ P4TC_OBJ_TABLE,
__P4TC_OBJ_MAX,
};
@@ -99,6 +181,46 @@ enum {
#define P4T_MAX (__P4T_MAX - 1)
+enum {
+ P4TC_TABLE_DEFAULT_UNSPEC,
+ P4TC_TABLE_DEFAULT_ACTION,
+ P4TC_TABLE_DEFAULT_PERMISSIONS,
+ __P4TC_TABLE_DEFAULT_MAX
+};
+
+#define P4TC_TABLE_DEFAULT_MAX (__P4TC_TABLE_DEFAULT_MAX - 1)
+
+enum {
+ P4TC_TABLE_ACTS_DEFAULT_ONLY,
+ P4TC_TABLE_ACTS_TABLE_ONLY,
+ __P4TC_TABLE_ACTS_FLAGS_MAX,
+};
+
+#define P4TC_TABLE_ACTS_FLAGS_MAX (__P4TC_TABLE_ACTS_FLAGS_MAX - 1)
+
+enum {
+ P4TC_TABLE_ACT_UNSPEC,
+ P4TC_TABLE_ACT_FLAGS, /* u8 */
+ P4TC_TABLE_ACT_NAME, /* string */
+ __P4TC_TABLE_ACT_MAX
+};
+
+#define P4TC_TABLE_ACT_MAX (__P4TC_TABLE_ACT_MAX - 1)
+
+/* Table type attributes */
+enum {
+ P4TC_TABLE_UNSPEC,
+ P4TC_TABLE_NAME, /* string */
+ P4TC_TABLE_INFO, /* struct p4tc_table_parm */
+ P4TC_TABLE_DEFAULT_HIT, /* nested default hit action attributes */
+ P4TC_TABLE_DEFAULT_MISS, /* nested default miss action attributes */
+ P4TC_TABLE_CONST_ENTRY, /* nested const table entry */
+ P4TC_TABLE_ACTS_LIST, /* nested table actions list */
+ __P4TC_TABLE_MAX
+};
+
+#define P4TC_TABLE_MAX (__P4TC_TABLE_MAX - 1)
+
/* Action attributes */
enum {
P4TC_ACT_UNSPEC,
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 7dbcf8915..7a9c13f86 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,4 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
- p4tc_action.o
+ p4tc_action.o p4tc_table.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
index 19db0772c..4912a6a11 100644
--- a/net/sched/p4tc/p4tc_action.c
+++ b/net/sched/p4tc/p4tc_action.c
@@ -655,17 +655,16 @@ static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
return NULL;
}
-static struct p4tc_act_param *tcf_param_find_byid(struct idr *params_idr,
- const u32 param_id)
+struct p4tc_act_param *tcf_param_find_byid(struct idr *params_idr,
+ const u32 param_id)
{
return idr_find(params_idr, param_id);
}
-static struct p4tc_act_param *
-tcf_param_find_byany(struct p4tc_act *act,
- const char *param_name,
- const u32 param_id,
- struct netlink_ext_ack *extack)
+struct p4tc_act_param *tcf_param_find_byany(struct p4tc_act *act,
+ const char *param_name,
+ const u32 param_id,
+ struct netlink_ext_ack *extack)
{
struct p4tc_act_param *param;
int err;
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index 4c34c8534..b589bd9c2 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -75,6 +75,7 @@ static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
{
idr_destroy(&pipeline->p_act_idr);
+ idr_destroy(&pipeline->p_tbl_idr);
kfree(pipeline);
}
@@ -97,9 +98,13 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
struct net *net = pipeline->net;
struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
struct net *pipeline_net = maybe_get_net(net);
- unsigned long iter_act_id;
+ unsigned long iter_act_id, tmp;
+ struct p4tc_table *table;
struct p4tc_act *act;
- unsigned long tmp;
+ unsigned long tbl_id;
+
+ idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, tbl_id)
+ table->common.ops->put(pipeline, &table->common, extack);
idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, iter_act_id)
act->common.ops->put(pipeline, &act->common, extack);
@@ -150,22 +155,23 @@ static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
static inline int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
struct netlink_ext_ack *extack)
{
+ int ret;
+
if (pipeline->curr_tables != pipeline->num_tables) {
NL_SET_ERR_MSG(extack,
"Must have all table defined to update state to ready");
return -EINVAL;
}
+ ret = p4tc_table_try_set_state_ready(pipeline, extack);
+ if (ret < 0)
+ return ret;
+
pipeline->p_state = P4TC_STATE_READY;
return true;
}
-static inline bool pipeline_sealed(struct p4tc_pipeline *pipeline)
-{
- return pipeline->p_state == P4TC_STATE_READY;
-}
-
struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net, const u32 pipeid)
{
struct p4tc_pipeline_net *pipe_net;
@@ -257,6 +263,9 @@ static struct p4tc_pipeline *p4tc_pipeline_create(struct net *net,
idr_init(&pipeline->p_act_idr);
+ idr_init(&pipeline->p_tbl_idr);
+ pipeline->curr_tables = 0;
+
pipeline->num_created_acts = 0;
pipeline->p_state = P4TC_STATE_NOT_READY;
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
new file mode 100644
index 000000000..291988858
--- /dev/null
+++ b/net/sched/p4tc/p4tc_table.c
@@ -0,0 +1,1545 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_table.c P4 TC TABLE
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+static int __p4tc_table_try_set_state_ready(struct p4tc_table *table,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table_entry_mask __rcu **masks_array;
+ unsigned long *tbl_free_masks_bitmap;
+
+ masks_array = kcalloc(table->tbl_max_masks,
+ sizeof(*table->tbl_masks_array),
+ GFP_KERNEL);
+ if (!masks_array)
+ return -ENOMEM;
+
+ tbl_free_masks_bitmap =
+ bitmap_alloc(P4TC_MAX_TMASKS, GFP_KERNEL);
+ if (!tbl_free_masks_bitmap) {
+ kfree(masks_array);
+ return -ENOMEM;
+ }
+
+ bitmap_fill(tbl_free_masks_bitmap, P4TC_MAX_TMASKS);
+
+ table->tbl_masks_array = masks_array;
+ rcu_replace_pointer_rtnl(table->tbl_free_masks_bitmap,
+ tbl_free_masks_bitmap);
+
+ return 0;
+}
+
+static void free_table_cache_array(struct p4tc_table **set_tables,
+ int num_tables)
+{
+ int i;
+
+ for (i = 0; i < num_tables; i++) {
+ struct p4tc_table_entry_mask __rcu **masks_array;
+ struct p4tc_table *table = set_tables[i];
+ unsigned long *free_masks_bitmap;
+
+ masks_array = table->tbl_masks_array;
+
+ kfree(masks_array);
+ free_masks_bitmap =
+ rtnl_dereference(table->tbl_free_masks_bitmap);
+ bitmap_free(free_masks_bitmap);
+ }
+}
+
+int p4tc_table_try_set_state_ready(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table **set_tables;
+ struct p4tc_table *table;
+ unsigned long tmp, id;
+ int i = 0;
+ int ret;
+
+ set_tables = kcalloc(pipeline->num_tables, sizeof(*set_tables),
+ GFP_KERNEL);
+ if (!set_tables)
+ return -ENOMEM;
+
+ idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id) {
+ ret = __p4tc_table_try_set_state_ready(table, extack);
+ if (ret < 0)
+ goto free_set_tables;
+ set_tables[i] = table;
+ i++;
+ }
+ kfree(set_tables);
+
+ return 0;
+
+free_set_tables:
+ free_table_cache_array(set_tables, i);
+ kfree(set_tables);
+ return ret;
+}
+
+static const struct nla_policy p4tc_table_policy[P4TC_TABLE_MAX + 1] = {
+ [P4TC_TABLE_NAME] = { .type = NLA_STRING, .len = TABLENAMSIZ },
+ [P4TC_TABLE_INFO] =
+ NLA_POLICY_EXACT_LEN(sizeof(struct p4tc_table_parm)),
+ [P4TC_TABLE_DEFAULT_HIT] = { .type = NLA_NESTED },
+ [P4TC_TABLE_DEFAULT_MISS] = { .type = NLA_NESTED },
+ [P4TC_TABLE_ACTS_LIST] = { .type = NLA_NESTED },
+ [P4TC_TABLE_CONST_ENTRY] = { .type = NLA_NESTED },
+};
+
+static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_table_parm parm = {0};
+ struct p4tc_table_perm *tbl_perm;
+ struct p4tc_table_act *table_act;
+ struct nlattr *nested_tbl_acts;
+ struct nlattr *default_missact;
+ struct nlattr *default_hitact;
+ struct nlattr *nested_count;
+ struct nlattr *nest;
+ int i = 1;
+
+ if (nla_put_u32(skb, P4TC_PATH, table->tbl_id))
+ goto out_nlmsg_trim;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ if (nla_put_string(skb, P4TC_TABLE_NAME, table->common.name))
+ goto out_nlmsg_trim;
+
+ parm.tbl_keysz = table->tbl_keysz;
+ parm.tbl_max_entries = table->tbl_max_entries;
+ parm.tbl_max_masks = table->tbl_max_masks;
+ parm.tbl_type = table->tbl_type;
+ parm.tbl_aging = table->tbl_aging;
+
+ tbl_perm = rcu_dereference_rtnl(table->tbl_permissions);
+ parm.tbl_permissions = tbl_perm->permissions;
+
+ if (table->tbl_default_hitact) {
+ struct p4tc_table_defact *hitact;
+
+ default_hitact = nla_nest_start(skb, P4TC_TABLE_DEFAULT_HIT);
+ rcu_read_lock();
+ hitact = rcu_dereference_rtnl(table->tbl_default_hitact);
+ if (hitact->default_acts) {
+ struct nlattr *nest_defact;
+
+ nest_defact = nla_nest_start(skb,
+ P4TC_TABLE_DEFAULT_ACTION);
+ if (tcf_action_dump(skb, hitact->default_acts, 0, 0,
+ false) < 0) {
+ rcu_read_unlock();
+ goto out_nlmsg_trim;
+ }
+ nla_nest_end(skb, nest_defact);
+ }
+ if (nla_put_u16(skb, P4TC_TABLE_DEFAULT_PERMISSIONS,
+ hitact->permissions) < 0) {
+ rcu_read_unlock();
+ goto out_nlmsg_trim;
+ }
+ rcu_read_unlock();
+ nla_nest_end(skb, default_hitact);
+ }
+
+ if (table->tbl_default_missact) {
+ struct p4tc_table_defact *missact;
+
+ default_missact = nla_nest_start(skb, P4TC_TABLE_DEFAULT_MISS);
+ rcu_read_lock();
+ missact = rcu_dereference_rtnl(table->tbl_default_missact);
+ if (missact->default_acts) {
+ struct nlattr *nest_defact;
+
+ nest_defact = nla_nest_start(skb,
+ P4TC_TABLE_DEFAULT_ACTION);
+ if (tcf_action_dump(skb, missact->default_acts, 0, 0,
+ false) < 0) {
+ rcu_read_unlock();
+ goto out_nlmsg_trim;
+ }
+ nla_nest_end(skb, nest_defact);
+ }
+ if (nla_put_u16(skb, P4TC_TABLE_DEFAULT_PERMISSIONS,
+ missact->permissions) < 0) {
+ rcu_read_unlock();
+ goto out_nlmsg_trim;
+ }
+ rcu_read_unlock();
+ nla_nest_end(skb, default_missact);
+ }
+
+ nested_tbl_acts = nla_nest_start(skb, P4TC_TABLE_ACTS_LIST);
+ list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+ nested_count = nla_nest_start(skb, i);
+ if (nla_put_string(skb, P4TC_TABLE_ACT_NAME,
+ table_act->ops->kind) < 0)
+ goto out_nlmsg_trim;
+ if (nla_put_u32(skb, P4TC_TABLE_ACT_FLAGS,
+ table_act->flags) < 0)
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, nested_count);
+ i++;
+ }
+ nla_nest_end(skb, nested_tbl_acts);
+
+ if (nla_put(skb, P4TC_TABLE_INFO, sizeof(parm), &parm))
+ goto out_nlmsg_trim;
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int p4tc_table_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table *table = to_table(template);
+
+ if (_p4tc_table_fill_nlmsg(skb, table) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for table");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static inline void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
+{
+ if (defact) {
+ p4tc_action_destroy(defact->default_acts);
+ kfree(defact);
+ }
+}
+
+static void p4tc_table_acts_list_destroy(struct list_head *acts_list)
+{
+ struct p4tc_table_act *table_act, *tmp;
+
+ list_for_each_entry_safe(table_act, tmp, acts_list, node) {
+ struct p4tc_act *act;
+
+ act = container_of(table_act->ops, typeof(*act), ops);
+ list_del(&table_act->node);
+ kfree(table_act);
+ p4tc_action_put_ref(act);
+ }
+}
+
+static void p4tc_table_acts_list_replace(struct list_head *orig,
+ struct list_head *acts_list)
+{
+ struct p4tc_table_act *table_act, *tmp;
+
+ p4tc_table_acts_list_destroy(orig);
+
+ list_for_each_entry_safe(table_act, tmp, acts_list, node) {
+ list_del_init(&table_act->node);
+ list_add_tail(&table_act->node, orig);
+ }
+}
+
+static void __p4tc_table_put_mask_array(struct p4tc_table *table)
+{
+ unsigned long *free_masks_bitmap;
+
+ kfree(table->tbl_masks_array);
+
+ free_masks_bitmap = rcu_dereference_rtnl(table->tbl_free_masks_bitmap);
+ bitmap_free(free_masks_bitmap);
+}
+
+void p4tc_table_put_mask_array(struct p4tc_pipeline *pipeline)
+{
+ struct p4tc_table *table;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id) {
+ __p4tc_table_put_mask_array(table);
+ }
+}
+
+static int _p4tc_table_put(struct net *net, struct nlattr **tb,
+ struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct netlink_ext_ack *extack)
+{
+ bool default_act_del = false;
+ struct p4tc_table_perm *perm;
+
+ if (tb)
+ default_act_del = tb[P4TC_TABLE_DEFAULT_HIT] ||
+ tb[P4TC_TABLE_DEFAULT_MISS];
+
+ if (!default_act_del) {
+ if (!refcount_dec_if_one(&table->tbl_ctrl_ref)) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to delete referenced table");
+ return -EBUSY;
+ }
+ }
+
+ if (tb && tb[P4TC_TABLE_DEFAULT_HIT]) {
+ struct p4tc_table_defact *hitact;
+
+ rcu_read_lock();
+ hitact = rcu_dereference(table->tbl_default_hitact);
+ if (hitact && !p4tc_ctrl_delete_ok(hitact->permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Permission denied: Unable to delete default hitact");
+ rcu_read_unlock();
+ return -EPERM;
+ }
+ rcu_read_unlock();
+ }
+
+ if (tb && tb[P4TC_TABLE_DEFAULT_MISS]) {
+ struct p4tc_table_defact *missact;
+
+ rcu_read_lock();
+ missact = rcu_dereference(table->tbl_default_missact);
+ if (missact && !p4tc_ctrl_delete_ok(missact->permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Permission denied: Unable to delete default missact");
+ rcu_read_unlock();
+ return -EPERM;
+ }
+ rcu_read_unlock();
+ }
+
+ if (!default_act_del || tb[P4TC_TABLE_DEFAULT_HIT]) {
+ struct p4tc_table_defact *hitact;
+
+ hitact = rtnl_dereference(table->tbl_default_hitact);
+ if (hitact) {
+ rcu_replace_pointer_rtnl(table->tbl_default_hitact,
+ NULL);
+ synchronize_rcu();
+ p4tc_table_defact_destroy(hitact);
+ }
+ }
+
+ if (!default_act_del || tb[P4TC_TABLE_DEFAULT_MISS]) {
+ struct p4tc_table_defact *missact;
+
+ missact = rtnl_dereference(table->tbl_default_missact);
+ if (missact) {
+ rcu_replace_pointer_rtnl(table->tbl_default_missact,
+ NULL);
+ synchronize_rcu();
+ p4tc_table_defact_destroy(missact);
+ }
+ }
+
+ if (default_act_del)
+ return 0;
+
+ p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+
+ idr_destroy(&table->tbl_masks_idr);
+ idr_destroy(&table->tbl_prio_idr);
+
+ perm = rcu_replace_pointer_rtnl(table->tbl_permissions, NULL);
+ kfree_rcu(perm, rcu);
+
+ idr_remove(&pipeline->p_tbl_idr, table->tbl_id);
+ pipeline->curr_tables -= 1;
+
+ __p4tc_table_put_mask_array(table);
+
+ kfree(table);
+
+ return 0;
+}
+
+static int p4tc_table_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table *table = to_table(tmpl);
+
+ return _p4tc_table_put(pipeline->net, NULL, pipeline, table, extack);
+}
+
+struct p4tc_table *p4tc_table_find_byid(struct p4tc_pipeline *pipeline,
+ const u32 tbl_id)
+{
+ return idr_find(&pipeline->p_tbl_idr, tbl_id);
+}
+
+static struct p4tc_table *p4tc_table_find_byname(const char *tblname,
+ struct p4tc_pipeline *pipeline)
+{
+ struct p4tc_table *table;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id)
+ if (strncmp(table->common.name, tblname, TABLENAMSIZ) == 0)
+ return table;
+
+ return NULL;
+}
+
+#define SEPARATOR '/'
+struct p4tc_table *p4tc_table_find_byany(struct p4tc_pipeline *pipeline,
+ const char *tblname, const u32 tbl_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table *table;
+ int err;
+
+ if (tbl_id) {
+ table = p4tc_table_find_byid(pipeline, tbl_id);
+ if (!table) {
+ NL_SET_ERR_MSG(extack, "Unable to find table by id");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (tblname) {
+ table = p4tc_table_find_byname(tblname, pipeline);
+ if (!table) {
+ NL_SET_ERR_MSG(extack, "Table name not found");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify table name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return table;
+out:
+ return ERR_PTR(err);
+}
+
+static int p4tc_table_get(struct p4tc_table *table)
+{
+ return refcount_inc_not_zero(&table->tbl_ctrl_ref);
+}
+
+struct p4tc_table *p4tc_table_find_get(struct p4tc_pipeline *pipeline,
+ const char *tblname, const u32 tbl_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table *table;
+
+ table = p4tc_table_find_byany(pipeline, tblname, tbl_id, extack);
+ if (IS_ERR(table))
+ return table;
+
+ if (!p4tc_table_get(table)) {
+ NL_SET_ERR_MSG(extack, "Table is marked for deletion");
+ return ERR_PTR(-EBUSY);
+ }
+
+ return table;
+}
+
+/* Permissions can also be updated by runtime command */
+static int __p4tc_table_init_default_act(struct net *net, struct nlattr **tb,
+ struct p4tc_table_defact **default_act,
+ u32 pipeid, __u16 curr_permissions,
+ struct netlink_ext_ack *extack)
+{
+ int ret;
+
+ *default_act = kzalloc(sizeof(**default_act), GFP_KERNEL);
+ if (!(*default_act))
+ return -ENOMEM;
+
+ if (tb[P4TC_TABLE_DEFAULT_PERMISSIONS]) {
+ __u16 *permissions;
+
+ permissions = nla_data(tb[P4TC_TABLE_DEFAULT_PERMISSIONS]);
+ if (!p4tc_ctrl_read_ok(*permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Default action must have ctrl path read permissions");
+ ret = -EINVAL;
+ goto default_act_free;
+ }
+ if (!p4tc_data_read_ok(*permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Default action must have data path read permissions");
+ ret = -EINVAL;
+ goto default_act_free;
+ }
+ if (!p4tc_data_exec_ok(*permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Default action must have data path execute permissions");
+ ret = -EINVAL;
+ goto default_act_free;
+ }
+ (*default_act)->permissions = *permissions;
+ } else {
+ (*default_act)->permissions = curr_permissions;
+ }
+
+ if (tb[P4TC_TABLE_DEFAULT_ACTION]) {
+ struct tc_action **default_acts;
+
+ if (!p4tc_ctrl_update_ok(curr_permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Permission denied: Unable to update default hit action");
+ ret = -EPERM;
+ goto default_act_free;
+ }
+
+ default_acts = kcalloc(TCA_ACT_MAX_PRIO,
+ sizeof(struct tc_action *), GFP_KERNEL);
+ if (!default_acts) {
+ ret = -ENOMEM;
+ goto default_act_free;
+ }
+
+ ret = p4tc_action_init(net, tb[P4TC_TABLE_DEFAULT_ACTION],
+ default_acts, pipeid, 0, extack);
+ if (ret < 0) {
+ kfree(default_acts);
+ goto default_act_free;
+ } else if (ret > 1) {
+ NL_SET_ERR_MSG(extack, "Can only have one hit action");
+ p4tc_action_destroy(default_acts);
+ ret = -EINVAL;
+ goto default_act_free;
+ }
+ (*default_act)->default_acts = default_acts;
+ }
+
+ return 0;
+
+default_act_free:
+ kfree(*default_act);
+
+ return ret;
+}
+
+static int p4tc_table_check_defacts(struct tc_action *defact,
+ struct list_head *acts_list)
+{
+ struct p4tc_table_act *table_act;
+
+ list_for_each_entry(table_act, acts_list, node) {
+ if (table_act->ops->id == defact->ops->id &&
+ !(table_act->flags & BIT(P4TC_TABLE_ACTS_TABLE_ONLY)))
+ return true;
+ }
+
+ return false;
+}
+
+static struct nla_policy p4tc_table_default_policy[P4TC_TABLE_DEFAULT_MAX + 1] = {
+ [P4TC_TABLE_DEFAULT_ACTION] = { .type = NLA_NESTED },
+ [P4TC_TABLE_DEFAULT_PERMISSIONS] =
+ NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+};
+
+/* Runtime and template call this */
+static int
+p4tc_table_init_default_act(struct net *net, struct nlattr *nla,
+ struct p4tc_table *table,
+ u16 curr_permissions,
+ struct p4tc_table_defact **default_act,
+ struct list_head *acts_list,
+ struct netlink_ext_ack *extack)
+{
+ u16 permissions = P4TC_CONTROL_PERMISSIONS | P4TC_DATA_PERMISSIONS;
+ struct nlattr *tb[P4TC_TABLE_DEFAULT_MAX + 1];
+ int ret;
+
+ if (curr_permissions)
+ permissions = curr_permissions;
+
+ ret = nla_parse_nested(tb, P4TC_TABLE_DEFAULT_MAX, nla,
+ p4tc_table_default_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (!tb[P4TC_TABLE_DEFAULT_ACTION] &&
+ !tb[P4TC_TABLE_DEFAULT_PERMISSIONS])
+ return 0;
+
+ ret = __p4tc_table_init_default_act(net, tb,
+ default_act,
+ table->common.p_id, permissions,
+ extack);
+ if (ret < 0)
+ return ret;
+ if ((*default_act)->default_acts &&
+ !p4tc_table_check_defacts((*default_act)->default_acts[0],
+ acts_list)) {
+ NL_SET_ERR_MSG(extack,
+ "Action is not allowed as default hit action");
+ p4tc_table_defact_destroy(*default_act);
+ return -EPERM;
+ }
+
+ return 0;
+}
+
+struct p4tc_table_perm *
+p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table_perm *tbl_perm;
+ int ret;
+
+ if (permissions > P4TC_MAX_PERMISSION) {
+ NL_SET_ERR_MSG(extack,
+ "Permission may only have 14 bits turned on");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ tbl_perm = kzalloc(sizeof(*tbl_perm), GFP_KERNEL);
+ if (!tbl_perm) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ tbl_perm->permissions = permissions;
+
+ return tbl_perm;
+
+out:
+ return ERR_PTR(ret);
+}
+
+void p4tc_table_replace_permissions(struct p4tc_table *table,
+ struct p4tc_table_perm *tbl_perm,
+ bool lock_rtnl)
+{
+ if (!tbl_perm)
+ return;
+
+ if (lock_rtnl)
+ rtnl_lock();
+ tbl_perm = rcu_replace_pointer_rtnl(table->tbl_permissions, tbl_perm);
+ if (lock_rtnl)
+ rtnl_unlock();
+ kfree_rcu(tbl_perm, rcu);
+}
+
+int
+p4tc_table_init_default_acts(struct net *net,
+ struct p4tc_table_default_act_params *def_params,
+ struct p4tc_table *table,
+ struct list_head *acts_list,
+ struct netlink_ext_ack *extack)
+{
+ u16 permissions;
+ int ret;
+
+ def_params->default_missact = NULL;
+ def_params->default_hitact = NULL;
+
+ if (def_params->default_hit_attr) {
+ struct p4tc_table_defact *tmp_default_hitact;
+
+ permissions = P4TC_CONTROL_PERMISSIONS | P4TC_DATA_PERMISSIONS;
+
+ rcu_read_lock();
+ if (table->tbl_default_hitact) {
+ tmp_default_hitact = rcu_dereference(table->tbl_default_hitact);
+ permissions = tmp_default_hitact->permissions;
+ }
+ rcu_read_unlock();
+
+ ret = p4tc_table_init_default_act(net,
+ def_params->default_hit_attr,
+ table, permissions,
+ &def_params->default_hitact,
+ acts_list, extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (def_params->default_miss_attr) {
+ struct p4tc_table_defact *tmp_default_missact;
+
+ permissions = P4TC_CONTROL_PERMISSIONS | P4TC_DATA_PERMISSIONS;
+
+ rcu_read_lock();
+ if (table->tbl_default_missact) {
+ tmp_default_missact = rcu_dereference(table->tbl_default_missact);
+ permissions = tmp_default_missact->permissions;
+ }
+ rcu_read_unlock();
+
+ ret = p4tc_table_init_default_act(net,
+ def_params->default_miss_attr,
+ table, permissions,
+ &def_params->default_missact,
+ acts_list, extack);
+ if (ret < 0)
+ goto default_hitacts_free;
+ }
+
+ return 0;
+
+default_hitacts_free:
+ p4tc_table_defact_destroy(def_params->default_hitact);
+
+ return ret;
+}
+
+static const struct nla_policy p4tc_acts_list_policy[P4TC_TABLE_MAX + 1] = {
+ [P4TC_TABLE_ACT_FLAGS] =
+ NLA_POLICY_RANGE(NLA_U8, 0, BIT(P4TC_TABLE_ACTS_FLAGS_MAX)),
+ [P4TC_TABLE_ACT_NAME] = { .type = NLA_STRING, .len = ACTNAMSIZ },
+};
+
+static struct p4tc_table_act *p4tc_table_act_init(struct nlattr *nla,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_TABLE_ACT_MAX + 1];
+ struct p4tc_table_act *table_act;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_TABLE_ACT_MAX, nla,
+ p4tc_acts_list_policy, extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ table_act = kzalloc(sizeof(*table_act), GFP_KERNEL);
+ if (unlikely(!table_act))
+ return ERR_PTR(-ENOMEM);
+
+ if (tb[P4TC_TABLE_ACT_NAME]) {
+ const char *actname = nla_data(tb[P4TC_TABLE_ACT_NAME]);
+ char *act_name_clone, *act_name, *p_name;
+ struct p4tc_act *act;
+
+ act_name_clone = act_name = kstrdup(actname, GFP_KERNEL);
+ if (unlikely(!act_name)) {
+ ret = -ENOMEM;
+ goto free_table_act;
+ }
+
+ p_name = strsep(&act_name, "/");
+ if (!act_name) {
+ NL_SET_ERR_MSG(extack,
+ "Action name must have format pname/actname");
+ ret = -EINVAL;
+ kfree(act_name_clone);
+ goto free_table_act;
+ }
+ if (strncmp(pipeline->common.name, p_name, PIPELINENAMSIZ)) {
+ NL_SET_ERR_MSG_FMT(extack, "Pipeline name must be %s\n",
+ pipeline->common.name);
+ ret = -EINVAL;
+ kfree(act_name_clone);
+ goto free_table_act;
+ }
+
+ act = p4tc_action_find_get(pipeline, act_name, 0, extack);
+ kfree(act_name_clone);
+ if (IS_ERR(act)) {
+ ret = PTR_ERR(act);
+ goto free_table_act;
+ }
+
+ table_act->ops = &act->ops;
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify allowed table action name");
+ ret = -EINVAL;
+ goto free_table_act;
+ }
+
+ if (tb[P4TC_TABLE_ACT_FLAGS]) {
+ u8 *flags = nla_data(tb[P4TC_TABLE_ACT_FLAGS]);
+
+ table_act->flags = *flags;
+ }
+
+ return table_act;
+
+free_table_act:
+ kfree(table_act);
+ return ERR_PTR(ret);
+}
+
+void
+p4tc_table_replace_default_acts(struct p4tc_table *table,
+ struct p4tc_table_default_act_params *def_params,
+ bool lock_rtnl)
+{
+ if (def_params->default_hitact) {
+ bool updated_actions = !!def_params->default_hitact->default_acts;
+ struct p4tc_table_defact *hitact;
+
+ if (lock_rtnl)
+ rtnl_lock();
+ if (!updated_actions) {
+ hitact = rcu_dereference_rtnl(table->tbl_default_hitact);
+ p4tc_table_defacts_acts_copy(def_params->default_hitact,
+ hitact);
+ }
+ hitact = rcu_replace_pointer_rtnl(table->tbl_default_hitact,
+ def_params->default_hitact);
+ if (lock_rtnl)
+ rtnl_unlock();
+ if (hitact) {
+ synchronize_rcu();
+ if (updated_actions)
+ p4tc_table_defact_destroy(hitact);
+ else
+ kfree(hitact);
+ }
+ }
+
+ if (def_params->default_missact) {
+ bool updated_actions = !!def_params->default_missact->default_acts;
+ struct p4tc_table_defact *missact;
+
+ if (lock_rtnl)
+ rtnl_lock();
+ if (!updated_actions) {
+ missact = rcu_dereference_rtnl(table->tbl_default_missact);
+ p4tc_table_defacts_acts_copy(def_params->default_missact,
+ missact);
+ }
+ missact = rcu_replace_pointer_rtnl(table->tbl_default_missact,
+ def_params->default_missact);
+ if (lock_rtnl)
+ rtnl_unlock();
+ if (missact) {
+ synchronize_rcu();
+ if (updated_actions)
+ p4tc_table_defact_destroy(missact);
+ else
+ kfree(missact);
+ }
+ }
+}
+
+static int p4tc_table_acts_list_init(struct nlattr *nla,
+ struct p4tc_pipeline *pipeline,
+ struct list_head *acts_list,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ struct p4tc_table_act *table_act;
+ int ret;
+ int i;
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+ if (ret < 0)
+ return ret;
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ table_act = p4tc_table_act_init(tb[i], pipeline, extack);
+ if (IS_ERR(table_act)) {
+ ret = PTR_ERR(table_act);
+ goto free_acts_list_list;
+ }
+ list_add_tail(&table_act->node, acts_list);
+ }
+
+ return 0;
+
+free_acts_list_list:
+ p4tc_table_acts_list_destroy(acts_list);
+
+ return ret;
+}
+
+static struct p4tc_table *
+p4tc_table_find_byanyattr(struct p4tc_pipeline *pipeline,
+ struct nlattr *name_attr, const u32 tbl_id,
+ struct netlink_ext_ack *extack)
+{
+ char *tblname = NULL;
+
+ if (name_attr)
+ tblname = nla_data(name_attr);
+
+ return p4tc_table_find_byany(pipeline, tblname, tbl_id, extack);
+}
+
+static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
+ u32 tbl_id,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table_default_act_params def_params = {0};
+ struct p4tc_table_perm *tbl_init_perms = NULL;
+ struct p4tc_table_parm *parm;
+ struct p4tc_table *table;
+ char *tblname;
+ int ret;
+
+ if (pipeline->curr_tables == pipeline->num_tables) {
+ NL_SET_ERR_MSG(extack,
+ "Table range exceeded max allowed value");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Name has the following syntax cb/tname */
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_TABLE_NAME)) {
+ NL_SET_ERR_MSG(extack, "Must specify table name");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ tblname =
+ strnchr(nla_data(tb[P4TC_TABLE_NAME]), TABLENAMSIZ, SEPARATOR);
+ if (!tblname) {
+ NL_SET_ERR_MSG(extack, "Table name must contain control block");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ tblname += 1;
+ if (tblname[0] == '\0') {
+ NL_SET_ERR_MSG(extack, "Control block name is too big");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+ NULL);
+ if (!IS_ERR(table)) {
+ NL_SET_ERR_MSG(extack, "Table already exists");
+ ret = -EEXIST;
+ goto out;
+ }
+
+ table = kzalloc(sizeof(*table), GFP_KERNEL);
+ if (!table) {
+ NL_SET_ERR_MSG(extack, "Unable to create table");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ table->common.p_id = pipeline->common.p_id;
+ strscpy(table->common.name, nla_data(tb[P4TC_TABLE_NAME]), TABLENAMSIZ);
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_TABLE_INFO)) {
+ ret = -EINVAL;
+ NL_SET_ERR_MSG(extack, "Missing table info");
+ goto free;
+ }
+
+ parm = nla_data(tb[P4TC_TABLE_INFO]);
+ if (!parm->tbl_keysz) {
+ NL_SET_ERR_MSG(extack, "Table keysz cannot be zero");
+ ret = -EINVAL;
+ goto free;
+ }
+ if (parm->tbl_keysz > P4TC_MAX_KEYSZ) {
+ NL_SET_ERR_MSG(extack,
+ "Table keysz exceeds maximum keysz");
+ ret = -EINVAL;
+ goto free;
+ }
+ table->tbl_keysz = parm->tbl_keysz;
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_MAX_ENTRIES) {
+ if (!parm->tbl_max_entries) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_entries cannot be zero");
+ ret = -EINVAL;
+ goto free;
+ }
+ if (parm->tbl_max_entries > P4TC_MAX_TENTRIES) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_entries exceeds maximum value");
+ ret = -EINVAL;
+ goto free;
+ }
+ table->tbl_max_entries = parm->tbl_max_entries;
+ } else {
+ table->tbl_max_entries = P4TC_DEFAULT_TENTRIES;
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_MAX_MASKS) {
+ if (!parm->tbl_max_masks) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_masks cannot be zero");
+ ret = -EINVAL;
+ goto free;
+ }
+ if (parm->tbl_max_masks > P4TC_MAX_TMASKS) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_masks exceeds maximum value");
+ ret = -EINVAL;
+ goto free;
+ }
+ table->tbl_max_masks = parm->tbl_max_masks;
+ } else {
+ table->tbl_max_masks = P4TC_DEFAULT_TMASKS;
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_PERMISSIONS) {
+ u16 tbl_perms = parm->tbl_permissions;
+
+ tbl_init_perms = p4tc_table_init_permissions(table, tbl_perms,
+ extack);
+ if (IS_ERR(tbl_init_perms)) {
+ ret = PTR_ERR(tbl_init_perms);
+ goto free;
+ }
+ rcu_assign_pointer(table->tbl_permissions, tbl_init_perms);
+ } else {
+ u16 tbl_perms = P4TC_TABLE_PERMISSIONS;
+
+ tbl_init_perms = p4tc_table_init_permissions(table,
+ tbl_perms,
+ extack);
+ if (IS_ERR(tbl_init_perms)) {
+ ret = PTR_ERR(tbl_init_perms);
+ goto free;
+ }
+ rcu_assign_pointer(table->tbl_permissions, tbl_init_perms);
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_TYPE) {
+ if (parm->tbl_type > P4TC_TABLE_TYPE_MAX) {
+ NL_SET_ERR_MSG(extack, "Table type must be Exact / LPM / Ternary");
+ ret = -EINVAL;
+ goto free_permissions;
+ }
+ table->tbl_type = parm->tbl_type;
+ } else {
+ table->tbl_type = P4TC_TABLE_TYPE_EXACT;
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_AGING) {
+ if (!parm->tbl_aging) {
+ NL_SET_ERR_MSG(extack,
+ "Table aging can't be zero");
+ ret = -EINVAL;
+ goto free;
+ }
+ if (parm->tbl_aging > P4TC_MAX_T_AGING) {
+ NL_SET_ERR_MSG(extack,
+ "Table aging exceeds maximum value");
+ ret = -EINVAL;
+ goto free;
+ }
+ table->tbl_aging = parm->tbl_aging;
+ } else {
+ table->tbl_aging = P4TC_DEFAULT_T_AGING;
+ }
+
+ refcount_set(&table->tbl_ctrl_ref, 1);
+
+ if (tbl_id) {
+ table->tbl_id = tbl_id;
+ ret = idr_alloc_u32(&pipeline->p_tbl_idr, table, &table->tbl_id,
+ table->tbl_id, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate table id");
+ goto free_permissions;
+ }
+ } else {
+ table->tbl_id = 1;
+ ret = idr_alloc_u32(&pipeline->p_tbl_idr, table, &table->tbl_id,
+ UINT_MAX, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate table id");
+ goto free_permissions;
+ }
+ }
+
+ INIT_LIST_HEAD(&table->tbl_acts_list);
+ if (tb[P4TC_TABLE_ACTS_LIST]) {
+ ret = p4tc_table_acts_list_init(tb[P4TC_TABLE_ACTS_LIST],
+ pipeline, &table->tbl_acts_list,
+ extack);
+ if (ret < 0)
+ goto idr_rm;
+ }
+
+ def_params.default_hit_attr = tb[P4TC_TABLE_DEFAULT_HIT];
+ def_params.default_miss_attr = tb[P4TC_TABLE_DEFAULT_MISS];
+
+ ret = p4tc_table_init_default_acts(net, &def_params, table,
+ &table->tbl_acts_list, extack);
+ if (ret < 0)
+ goto idr_rm;
+
+ rcu_replace_pointer_rtnl(table->tbl_default_hitact,
+ def_params.default_hitact);
+ rcu_replace_pointer_rtnl(table->tbl_default_missact,
+ def_params.default_missact);
+
+ if (def_params.default_hitact &&
+ !def_params.default_hitact->default_acts) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify defaults_hit_actions's action values");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+
+ if (def_params.default_missact &&
+ !def_params.default_missact->default_acts) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify defaults_miss_actions's action values");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+
+ idr_init(&table->tbl_masks_idr);
+ idr_init(&table->tbl_prio_idr);
+ spin_lock_init(&table->tbl_masks_idr_lock);
+
+ pipeline->curr_tables += 1;
+
+ table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
+
+ return table;
+
+defaultacts_destroy:
+ p4tc_table_defact_destroy(def_params.default_hitact);
+ p4tc_table_defact_destroy(def_params.default_missact);
+
+idr_rm:
+ idr_remove(&pipeline->p_tbl_idr, table->tbl_id);
+
+free_permissions:
+ kfree(tbl_init_perms);
+
+ p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+
+free:
+ kfree(table);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
+ u32 tbl_id,
+ struct p4tc_pipeline *pipeline,
+ u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ u32 tbl_max_masks = 0, tbl_max_entries = 0, tbl_keysz = 0;
+ struct p4tc_table_default_act_params def_params = {0};
+ struct list_head *tbl_acts_list = NULL;
+ struct p4tc_table_perm *perm = NULL;
+ struct p4tc_table_parm *parm = NULL;
+ struct p4tc_table *table;
+ u64 tbl_aging = 0;
+ u8 tbl_type;
+ int ret = 0;
+
+ table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+ extack);
+ if (IS_ERR(table))
+ return table;
+
+ /* Check if we are replacing this at the end */
+ if (tb[P4TC_TABLE_ACTS_LIST]) {
+ tbl_acts_list = kzalloc(sizeof(*tbl_acts_list), GFP_KERNEL);
+ if (!tbl_acts_list) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ INIT_LIST_HEAD(tbl_acts_list);
+ ret = p4tc_table_acts_list_init(tb[P4TC_TABLE_ACTS_LIST],
+ pipeline, tbl_acts_list, extack);
+ if (ret < 0)
+ goto table_acts_destroy;
+ }
+
+ def_params.default_hit_attr = tb[P4TC_TABLE_DEFAULT_HIT];
+ def_params.default_miss_attr = tb[P4TC_TABLE_DEFAULT_MISS];
+
+ if (tbl_acts_list)
+ ret = p4tc_table_init_default_acts(net, &def_params, table,
+ tbl_acts_list, extack);
+ else
+ ret = p4tc_table_init_default_acts(net, &def_params, table,
+ &table->tbl_acts_list,
+ extack);
+ if (ret < 0)
+ goto table_acts_destroy;
+
+ tbl_type = table->tbl_type;
+
+ if (tb[P4TC_TABLE_INFO]) {
+ parm = nla_data(tb[P4TC_TABLE_INFO]);
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_KEYSZ) {
+ if (!parm->tbl_keysz) {
+ NL_SET_ERR_MSG(extack,
+ "Table keysz cannot be zero");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ if (parm->tbl_keysz > P4TC_MAX_KEYSZ) {
+ NL_SET_ERR_MSG(extack,
+ "Table keysz exceeds maximum keysz");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ tbl_keysz = parm->tbl_keysz;
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_MAX_ENTRIES) {
+ if (!parm->tbl_max_entries) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_entries cannot be zero");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ if (parm->tbl_max_entries > P4TC_MAX_TENTRIES) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_entries exceeds maximum value");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ tbl_max_entries = parm->tbl_max_entries;
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_MAX_MASKS) {
+ if (!parm->tbl_max_masks) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_masks cannot be zero");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ if (parm->tbl_max_masks > P4TC_MAX_TMASKS) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_masks exceeds maximum value");
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+ tbl_max_masks = parm->tbl_max_masks;
+ }
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_PERMISSIONS) {
+ perm = p4tc_table_init_permissions(table,
+ parm->tbl_permissions,
+ extack);
+ if (IS_ERR(perm)) {
+ ret = PTR_ERR(perm);
+ goto defaultacts_destroy;
+ }
+ }
+
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_TYPE) {
+ if (parm->tbl_type > P4TC_TABLE_TYPE_MAX) {
+ NL_SET_ERR_MSG(extack, "Table type can only be exact or LPM");
+ ret = -EINVAL;
+ goto free_perm;
+ }
+ tbl_type = parm->tbl_type;
+ }
+ if (parm->tbl_flags & P4TC_TABLE_FLAGS_AGING) {
+ if (!parm->tbl_aging) {
+ NL_SET_ERR_MSG(extack,
+ "Table aging can't be zero");
+ ret = -EINVAL;
+ goto free_perm;
+ }
+ if (parm->tbl_aging > P4TC_MAX_T_AGING) {
+ NL_SET_ERR_MSG(extack,
+ "Table max_masks exceeds maximum value");
+ ret = -EINVAL;
+ goto free_perm;
+ }
+ tbl_aging = parm->tbl_aging;
+ }
+ }
+
+ p4tc_table_replace_default_acts(table, &def_params, false);
+ p4tc_table_replace_permissions(table, perm, false);
+
+ if (tbl_keysz)
+ table->tbl_keysz = tbl_keysz;
+ if (tbl_max_entries)
+ table->tbl_max_entries = tbl_max_entries;
+ if (tbl_max_masks)
+ table->tbl_max_masks = tbl_max_masks;
+ table->tbl_type = tbl_type;
+ if (tbl_aging)
+ table->tbl_aging = tbl_aging;
+
+ if (tbl_acts_list)
+ p4tc_table_acts_list_replace(&table->tbl_acts_list,
+ tbl_acts_list);
+
+ return table;
+
+free_perm:
+ kfree(perm);
+
+defaultacts_destroy:
+ p4tc_table_defact_destroy(def_params.default_missact);
+ p4tc_table_defact_destroy(def_params.default_hitact);
+
+table_acts_destroy:
+ if (tbl_acts_list) {
+ p4tc_table_acts_list_destroy(tbl_acts_list);
+ kfree(tbl_acts_list);
+ }
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_table_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid = ids[P4TC_PID_IDX], tbl_id = ids[P4TC_TBLID_IDX];
+ struct nlattr *tb[P4TC_TABLE_MAX + 1];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table *table;
+ int ret;
+
+ pipeline = p4tc_pipeline_find_byany_unsealed(net, nl_path_attrs->pname,
+ pipeid, extack);
+ if (IS_ERR(pipeline))
+ return (void *)pipeline;
+
+ ret = nla_parse_nested(tb, P4TC_TABLE_MAX, nla, p4tc_table_policy,
+ extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ switch (n->nlmsg_type) {
+ case RTM_CREATEP4TEMPLATE:
+ table = p4tc_table_create(net, tb, tbl_id, pipeline, extack);
+ break;
+ case RTM_UPDATEP4TEMPLATE:
+ table = p4tc_table_update(net, tb, tbl_id, pipeline,
+ n->nlmsg_flags, extack);
+ break;
+ default:
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ if (IS_ERR(table))
+ goto out;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!ids[P4TC_TBLID_IDX])
+ ids[P4TC_TBLID_IDX] = table->tbl_id;
+
+out:
+ return (struct p4tc_template_common *)table;
+}
+
+static int p4tc_table_flush(struct net *net, struct sk_buff *skb,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ unsigned long tmp, tbl_id;
+ struct p4tc_table *table;
+ int ret = 0;
+ int i = 0;
+
+ if (nla_put_u32(skb, P4TC_PATH, 0))
+ goto out_nlmsg_trim;
+
+ if (idr_is_empty(&pipeline->p_tbl_idr)) {
+ NL_SET_ERR_MSG(extack, "There are no tables to flush");
+ goto out_nlmsg_trim;
+ }
+
+ idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, tbl_id) {
+ if (_p4tc_table_put(net, NULL, pipeline, table, extack) < 0) {
+ ret = -EBUSY;
+ continue;
+ }
+ i++;
+ }
+
+ nla_put_u32(skb, P4TC_COUNT, i);
+
+ if (ret < 0) {
+ if (i == 0) {
+ NL_SET_ERR_MSG(extack, "Unable to flush any table");
+ goto out_nlmsg_trim;
+ } else {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Flushed only %u tables", i);
+ }
+ }
+
+ return i;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int p4tc_table_gd(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid = ids[P4TC_PID_IDX], tbl_id = ids[P4TC_TBLID_IDX];
+ struct nlattr *tb[P4TC_TABLE_MAX + 1] = {};
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table *table;
+ int ret = 0;
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_TABLE_MAX, nla,
+ p4tc_table_policy, extack);
+
+ if (ret < 0)
+ return ret;
+ }
+
+ if (n->nlmsg_type == RTM_GETP4TEMPLATE)
+ pipeline = p4tc_pipeline_find_byany(net,
+ nl_path_attrs->pname,
+ pipeid,
+ extack);
+ else
+ pipeline = p4tc_pipeline_find_byany_unsealed(net,
+ nl_path_attrs->pname,
+ pipeid, extack);
+
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE && (n->nlmsg_flags & NLM_F_ROOT))
+ return p4tc_table_flush(net, skb, pipeline, extack);
+
+ table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+ extack);
+ if (IS_ERR(table))
+ return PTR_ERR(table);
+
+ if (_p4tc_table_fill_nlmsg(skb, table) < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for table");
+ return -EINVAL;
+ }
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+ ret = _p4tc_table_put(net, tb, pipeline, table, extack);
+ if (ret < 0)
+ goto out_nlmsg_trim;
+ }
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int p4tc_table_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = sock_net(skb->sk);
+ struct p4tc_pipeline *pipeline;
+
+ if (!ctx->ids[P4TC_PID_IDX]) {
+ pipeline = p4tc_pipeline_find_byany(net, *p_name,
+ ids[P4TC_PID_IDX], extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+ ctx->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+ } else {
+ pipeline = p4tc_pipeline_find_byid(net, ctx->ids[P4TC_PID_IDX]);
+ }
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!(*p_name))
+ *p_name = pipeline->common.name;
+
+ return p4tc_tmpl_generic_dump(skb, ctx, &pipeline->p_tbl_idr,
+ P4TC_TBLID_IDX, extack);
+}
+
+static int p4tc_table_dump_1(struct sk_buff *skb,
+ struct p4tc_template_common *common)
+{
+ struct nlattr *nest = nla_nest_start(skb, P4TC_PARAMS);
+ struct p4tc_table *table = to_table(common);
+
+ if (!nest)
+ return -ENOMEM;
+
+ if (nla_put_string(skb, P4TC_TABLE_NAME, table->common.name)) {
+ nla_nest_cancel(skb, nest);
+ return -ENOMEM;
+ }
+
+ nla_nest_end(skb, nest);
+
+ return 0;
+}
+
+const struct p4tc_template_ops p4tc_table_ops = {
+ .init = NULL,
+ .cu = p4tc_table_cu,
+ .fill_nlmsg = p4tc_table_fill_nlmsg,
+ .gd = p4tc_table_gd,
+ .put = p4tc_table_put,
+ .dump = p4tc_table_dump,
+ .dump_1 = p4tc_table_dump_1,
+};
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index ad81a7089..e52c06c5e 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -43,6 +43,7 @@ static bool obj_is_valid(u32 obj)
switch (obj) {
case P4TC_OBJ_PIPELINE:
case P4TC_OBJ_ACT:
+ case P4TC_OBJ_TABLE:
return true;
default:
return false;
@@ -52,6 +53,7 @@ static bool obj_is_valid(u32 obj)
static const struct p4tc_template_ops *p4tc_ops[P4TC_OBJ_MAX + 1] = {
[P4TC_OBJ_PIPELINE] = &p4tc_pipeline_ops,
[P4TC_OBJ_ACT] = &p4tc_act_ops,
+ [P4TC_OBJ_TABLE] = &p4tc_table_ops,
};
int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
--
2.34.1
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH net-next v8 12/15] p4tc: add runtime table entry create, update, get, delete, flush and dump
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (10 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 11/15] p4tc: add template table " Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
` (3 subsequent siblings)
15 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Tables are conceptually similar to TCAMs and this implementation could be
labelled as an "algorithmic" TCAM. Tables have a key of a specific size,
a maximum number of entries and a maximum number of allowed masks. The basic P4
key types are supported (exact, LPM, ternary, and ranges), although the kernel
side is oblivious to all that and sees only bit blobs which it masks before a
lookup is performed.
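To make the masking step concrete, here is a small userspace sketch (illustrative
only, not part of the patch; every name in it is made up): each entry's mask blob
is ANDed with the incoming key blob and, when several entries match, the one with
the lowest priority number wins. The in-kernel datapath keeps the masks in a
per-table array and looks entries up in an rhashtable rather than scanning
linearly, but the masking and tie-break follow the same idea.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DEMO_KEYSZ 8				/* 64-bit key, as in the example below */

struct demo_entry {
	uint8_t value[DEMO_KEYSZ];		/* key blob, stored already masked */
	uint8_t mask[DEMO_KEYSZ];		/* mask blob */
	uint32_t prio;				/* lower number wins on multiple matches */
};

static const struct demo_entry *
demo_lookup(const struct demo_entry *entries, size_t n,
	    const uint8_t pkt_key[DEMO_KEYSZ])
{
	const struct demo_entry *best = NULL;
	uint8_t masked[DEMO_KEYSZ];
	size_t i, j;

	for (i = 0; i < n; i++) {
		/* mask the packet's key blob with this entry's mask blob */
		for (j = 0; j < DEMO_KEYSZ; j++)
			masked[j] = pkt_key[j] & entries[i].mask[j];
		if (!memcmp(masked, entries[i].value, DEMO_KEYSZ) &&
		    (!best || entries[i].prio < best->prio))
			best = &entries[i];
	}
	return best;
}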
This commit allows users to create, update, delete, get, flush and dump
table _entries_ (templates were described in an earlier patch).
Note that table entries can only be created once the pipeline template is
sealed.
For example, a user issuing the following command:
tc p4ctrl create myprog/table/cb/tname \
dstAddr 10.10.10.0/24 srcAddr 192.168.0.0/16 prio 16 \
action send param port port1
indicates we are creating a table entry in table "cb/tname" (a table
residing in control block "cb") on a pipeline named "myprog".
User space tc will create a key which has a value of 0x0a0a0a00c0a80000
(10.10.10.0 concatenated with 192.168.0.0) and a mask value of
0xffffff00ffff0000 (/24 concatenated with /16) that will be sent to the
kernel. In addition a priority field of 16 is passed to the kernel as
well as the action definition.
The priority field is needed to disambiguate in case two entries
match. In that case, the kernel will choose the one with the lowest priority
number.
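As a hedged illustration of that concatenation (not part of the patch; the
helper below is hypothetical, only the P4TC_ENTRY_KEY_BLOB and
P4TC_ENTRY_MASK_BLOB attribute names come from this series), user space could
assemble the 64-bit key and mask blobs like this:

#include <stdint.h>
#include <arpa/inet.h>

/* dstAddr occupies the upper 32 bits, srcAddr the lower 32 bits */
static uint64_t demo_pack(uint32_t addr_hi_be, uint32_t addr_lo_be)
{
	return ((uint64_t)ntohl(addr_hi_be) << 32) | ntohl(addr_lo_be);
}

/* demo_pack(inet_addr("10.10.10.0"), inet_addr("192.168.0.0"))
 *	== 0x0a0a0a00c0a80000			(the key blob)
 * demo_pack(inet_addr("255.255.255.0"), inet_addr("255.255.0.0"))
 *	== 0xffffff00ffff0000			(the /24|/16 mask blob)
 */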
If the user wanted to, for example, update the entry we just created with
a different action, they'd issue the following command:
tc p4ctrl update myprog/table/cb/tname srcAddr 10.10.10.0/24 \
dstAddr 192.168.0.0/16 prio 16 action send param port port5
In this case, the user needs to specify the pipeline name, the table name,
the key and the priority, so that we can locate the table entry.
If the user wanted to, for example, get the table entry that we just
updated, they'd issue the following command:
tc p4ctrl get myprog/table/cb/tname srcAddr 10.10.10.0/24 \
dstAddr 192.168.0.0/16 prio 16
Note that, again, we need to specify the pipeline name, the table name,
the key and the priority, so that we can locate the table entry.
If the user wanted to delete the table entry we created, they'd issue the
following command:
tc p4ctrl del myprog/table/cb/tname srcAddr 10.10.10.0/24 \
dstAddr 192.168.0.0/16 prio 16
Note that, again, we need to specify the pipeline name, the table
name, the key and the priority, so that we can
locate the table entry.
We can also flush all the table entries from a specific table.
To flush the table entries of table tname and pipeline myprog,
the user would issue the following command:
tc p4ctrl del myprog/table/cb/tname
We can also dump all the table entries from a specific table.
To dump the table entries of table tname and pipeline myprog, the user
would issue the following command:
tc p4ctrl get myprog/table/cb/tname
__Table Entry Permissions__
Table entries can have permissions specified when they are being added.
Caveat: we are doing a lot more than what P4 defines because we feel it is
necessary.
It should be noted that there are two types of permissions:
- Table permissions which are a property of the table (think directory in
file systems). These are set by the template (see earlier patch on table
template).
- Table entry permissions which are specific to a table entry (think a
file in a directory). This patch describes those permissions.
Furthermore, in both cases the permissions are split into datapath vs
control path. The template definition can set either one. For example, one
could allow the datapath to add table entries when PNA
add-on-miss is needed.
By default table entries have control plane RUD permissions, meaning the control
plane can Read, Update or Delete entries. By default, as well, the control plane
can create new entries unless the template specifies otherwise.
Let's see an example which creates the table "cb/tname" at template time:
tc p4template create table/aP4proggie/cb/tname tblid 1 keysz 64 \
permissions 0x3C24 ...
The above sets table tname's permissions to 0x3C24, which is equivalent to
CRUD----R--X--, meaning:
the control plane can Create, Read, Update and Delete;
the datapath can only Read and Execute table entries.
If one was to dump this table with:
tc -j p4template get table/aP4proggie/cb/tname | jq .
The output would be the following:
[
{
"obj": "table",
"pname": "aP4proggie",
"pipeid": 22
},
{
"templates": [
{
"tblid": 1,
"tname": "cb/tname",
"keysz": 64,
"max_entries": 256,
"masks": 8,
"entries": 0,
"permissions": "CRUD----R--X--",
"table_type": "exact",
"acts_list": []
}
]
}
]
The expressed permissions above are probably the most practical for most
use cases.
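As a rough sketch of how a permission word such as 0x3C24 maps onto the
CRUD----R--X-- string shown above (not part of the patch; the bit layout -
control-plane bits in the upper 7 bits, datapath bits in the lower 7, each in
CRUDXPS order - is inferred from this example, and the kernel uses its
p4tc_ctrl_*_ok()/p4tc_data_*_ok() helpers for the individual checks):

#include <stdint.h>

static void demo_perm_to_str(uint16_t perm, char out[15])
{
	static const char sym[7] = { 'C', 'R', 'U', 'D', 'X', 'P', 'S' };
	int i;

	/* walk the 14 permission bits from most to least significant */
	for (i = 0; i < 14; i++)
		out[i] = (perm & (1u << (13 - i))) ? sym[i % 7] : '-';
	out[14] = '\0';
}

/* demo_perm_to_str(0x3C24, buf) fills buf with "CRUD----R--X--" */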
__Constant Tables And P4-programmed Defined Entries__
If one wanted to restrict the table to be the equivalent of a "const" then
the permissions would be set to be: -R------R--X--
In such a case, typically the P4 program will have some entries defined
(see the famous P4 calculate example). The "initial entries" specified in
the P4 program will have to be added by the template (as generated by the
compiler), as follows:
tc p4template update table/aP4proggie/cb/tname \
entry srcAddr 10.10.10.10/24 dstAddr 1.1.1.0/24 prio 17
This table cannot be updated at runtime. Any attempt to add an entry to a
table which is read-only at runtime will get a permission denied response
back from the kernel.
Note: If one wanted to create the equivalent of the PNA add-on-miss feature for
this table, then the template would set the table permissions to
-R-----CR--X--. PNA doesn't specify whether the datapath can also delete or
update entries, but if it did then more appropriate permissions would be
-R-----CRUDX--.
__Mix And Match of RW vs Constant Entries__
Let's look at other scenarios; let's say the table has CRUD----R--X--
permissions as defined by the template...
At runtime the user could add entries which are "const", by specifying the
entry's permissions as -R------R--X--, for example:
tc p4ctrl create aP4proggie/table/cb/tname srcAddr 10.10.10.10/24 \
dstAddr 1.1.1.0/24 prio 17 permissions 0x1024 action drop
or not specify permissions at all, as follows:
tc p4ctrl create aP4proggie/table/cb/tname srcAddr 10.10.10.10/24 \
dstAddr 1.1.1.0/24 prio 17 action drop
in which case the table's permissions defined at template
time (CRUD----R--X--) are assumed, meaning the table entry can be deleted or
updated by the control plane.
__Entry Permissions Allowed On Table Entry Creation At Runtime__
When an entry is added with explicit permissions, it can have at most what
the table's template definition expressed, but it may ask for fewer
permissions. For example, assuming a table whose template specified
permissions of CR-D----R--X--:
An entry created at runtime with permission of -R------R--X-- is allowed
but an entry with -RUD----R--X-- will be rejected.
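A minimal sketch of that rule (not from the patch; the helper name is
hypothetical) is a subset test on the permission bits:

#include <stdbool.h>
#include <stdint.h>

static bool demo_entry_perms_ok(uint16_t tbl_perms, uint16_t entry_perms)
{
	/* reject any permission bit the table itself does not grant */
	return (entry_perms & ~tbl_perms) == 0;
}

/* With table permissions CR-D----R--X--:
 *   -R------R--X--  is accepted (a subset of the table's permissions)
 *   -RUD----R--X--  is rejected (it asks for Update, which the table lacks)
 */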
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/p4tc.h | 124 +-
include/uapi/linux/p4tc.h | 68 +-
include/uapi/linux/rtnetlink.h | 9 +
net/sched/p4tc/Makefile | 3 +-
net/sched/p4tc/p4tc_runtime_api.c | 145 ++
net/sched/p4tc/p4tc_table.c | 54 +-
net/sched/p4tc/p4tc_tbl_entry.c | 2569 +++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_tmpl_api.c | 4 +-
security/selinux/nlmsgtab.c | 6 +-
9 files changed, 2965 insertions(+), 17 deletions(-)
create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 9521708e6..24f8b4873 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -120,6 +120,11 @@ static inline bool p4tc_pipeline_get(struct p4tc_pipeline *pipeline)
return refcount_inc_not_zero(&pipeline->p_ctrl_ref);
}
+static inline void p4tc_pipeline_put_ref(struct p4tc_pipeline *pipeline)
+{
+ refcount_dec(&pipeline->p_ctrl_ref);
+}
+
void p4tc_pipeline_put(struct p4tc_pipeline *pipeline);
struct p4tc_pipeline *
p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
@@ -182,6 +187,8 @@ static inline int p4tc_action_destroy(struct tc_action **acts)
#define P4TC_PERMISSIONS_UNINIT (1 << P4TC_PERM_MAX_BIT)
+#define P4TC_MAX_PARAM_DATA_SIZE 124
+
struct p4tc_table_defact {
struct tc_action **default_acts;
/* Will have two 7 bits blocks containing CRUDXPS (Create, read, update,
@@ -203,8 +210,9 @@ struct p4tc_table {
struct p4tc_template_common common;
struct list_head tbl_acts_list;
struct idr tbl_masks_idr;
- struct idr tbl_prio_idr;
+ struct ida tbl_prio_idr;
struct rhltable tbl_entries;
+ struct p4tc_table_entry *tbl_const_entry;
struct p4tc_table_defact __rcu *tbl_default_hitact;
struct p4tc_table_defact __rcu *tbl_default_missact;
struct p4tc_table_perm __rcu *tbl_permissions;
@@ -220,11 +228,14 @@ struct p4tc_table {
u32 tbl_max_entries;
u32 tbl_max_masks;
u32 tbl_curr_num_masks;
+ /* Accounts for how many entries this table has */
+ atomic_t tbl_nelems;
/* Accounts for how many entities refer to this table. Usually just the
* pipeline it belongs to.
*/
refcount_t tbl_ctrl_ref;
u16 tbl_type;
+ u16 PAD0;
};
extern const struct p4tc_template_ops p4tc_table_ops;
@@ -289,6 +300,86 @@ struct p4tc_table_act {
extern const struct p4tc_template_ops p4tc_act_ops;
+extern const struct rhashtable_params entry_hlt_params;
+
+struct p4tc_table_entry;
+struct p4tc_table_entry_work {
+ struct work_struct work;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table_entry *entry;
+ struct p4tc_table *table;
+ u16 who_deleted;
+ bool send_event;
+};
+
+struct p4tc_table_entry_key {
+ u32 keysz;
+ /* Key start */
+ u32 maskid;
+ unsigned char fa_key[] __aligned(8);
+};
+
+struct p4tc_table_entry_value {
+ u32 prio;
+ int num_acts;
+ struct tc_action **acts;
+ /* Accounts for how many entities are referencing, eg: Data path,
+ * one or more control path and timer.
+ */
+ refcount_t entries_ref;
+ u32 permissions;
+ struct p4tc_table_entry_tm __rcu *tm;
+ struct p4tc_table_entry_work *entry_work;
+ u64 aging_ms;
+ struct hrtimer entry_timer;
+ bool is_dyn;
+};
+
+struct p4tc_table_entry_mask {
+ struct rcu_head rcu;
+ u32 sz;
+ u32 mask_index;
+ /* Accounts for how many entries are using this mask */
+ refcount_t mask_ref;
+ u32 mask_id;
+ unsigned char fa_value[] __aligned(8);
+};
+
+struct p4tc_table_entry {
+ struct rcu_head rcu;
+ struct rhlist_head ht_node;
+ struct p4tc_table_entry_key key;
+ /* fallthrough: key data + value */
+};
+
+#define P4TC_KEYSZ_BYTES(bits) (round_up(BITS_TO_BYTES(bits), 8))
+
+#define ENTRY_KEY_OFFSET (offsetof(struct p4tc_table_entry_key, fa_key))
+
+#define P4TC_ENTRY_VALUE_OFFSET(entry) \
+ (offsetof(struct p4tc_table_entry, key) + ENTRY_KEY_OFFSET \
+ + P4TC_KEYSZ_BYTES((entry)->key.keysz))
+
+static inline void *p4tc_table_entry_value(struct p4tc_table_entry *entry)
+{
+ return entry->key.fa_key + P4TC_KEYSZ_BYTES(entry->key.keysz);
+}
+
+static inline struct p4tc_table_entry_work *
+p4tc_table_entry_work(struct p4tc_table_entry *entry)
+{
+ struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+ return value->entry_work;
+}
+
+extern const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1];
+extern const struct nla_policy p4tc_policy[P4TC_MAX + 1];
+
+struct p4tc_table_entry *
+p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key);
+
static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
struct tc_action *acts[], u32 pipeid,
u32 flags, struct netlink_ext_ack *extack)
@@ -377,6 +468,14 @@ p4tc_table_init_default_acts(struct net *net,
struct list_head *acts_list,
struct netlink_ext_ack *extack);
+static inline void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
+{
+ if (defact) {
+ p4tc_action_destroy(defact->default_acts);
+ kfree(defact);
+ }
+}
+
static inline void
p4tc_table_defacts_acts_copy(struct p4tc_table_defact *defact_copy,
struct p4tc_table_defact *defact_orig)
@@ -391,15 +490,36 @@ p4tc_table_replace_default_acts(struct p4tc_table *table,
struct p4tc_table_perm *
p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
- struct netlink_ext_ack *extack);
+ struct netlink_ext_ack *extack);
void p4tc_table_replace_permissions(struct p4tc_table *table,
struct p4tc_table_perm *tbl_perm,
bool lock_rtnl);
+void p4tc_table_entry_destroy_hash(void *ptr, void *arg);
+
+struct p4tc_table_entry *
+p4tc_table_const_entry_cu(struct net *net, struct nlattr *arg,
+ struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct netlink_ext_ack *extack);
+int p4tc_tbl_entry_crud(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, int cmd,
+ struct netlink_ext_ack *extack);
+int p4tc_tbl_entry_dumpit(struct net *net, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct nlattr *arg, char *p_name);
+int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
+ struct p4tc_table_entry *entry, u32 tbl_id,
+ u16 who_deleted);
struct tcf_p4act *
tcf_p4_get_next_prealloc_act(struct p4tc_act *act);
void tcf_p4_set_init_flags(struct tcf_p4act *p4act);
+static inline bool p4tc_runtime_msg_is_update(struct nlmsghdr *n)
+{
+ return n->nlmsg_type == RTM_P4TC_UPDATE;
+}
+
#define to_pipeline(t) ((struct p4tc_pipeline *)t)
#define to_act(t) ((struct p4tc_act *)t)
#define to_table(t) ((struct p4tc_table *)t)
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 9b9937a94..e87f0c8b9 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -86,6 +86,9 @@ enum {
#define p4tc_ctrl_pub_ok(perm) ((perm) & P4TC_CTRL_PERM_P)
#define p4tc_ctrl_sub_ok(perm) ((perm) & P4TC_CTRL_PERM_S)
+#define p4tc_ctrl_perm_rm_create(perm) \
+ (((perm) & ~P4TC_CTRL_PERM_C))
+
#define p4tc_data_create_ok(perm) ((perm) & P4TC_DATA_PERM_C)
#define p4tc_data_read_ok(perm) ((perm) & P4TC_DATA_PERM_R)
#define p4tc_data_update_ok(perm) ((perm) & P4TC_DATA_PERM_U)
@@ -94,6 +97,9 @@ enum {
#define p4tc_data_pub_ok(perm) ((perm) & P4TC_DATA_PERM_P)
#define p4tc_data_sub_ok(perm) ((perm) & P4TC_DATA_PERM_S)
+#define p4tc_data_perm_rm_create(perm) \
+ (((perm) & ~P4TC_DATA_PERM_C))
+
struct p4tc_table_parm {
__u64 tbl_aging;
__u32 tbl_keysz;
@@ -127,6 +133,15 @@ enum {
#define P4TC_OBJ_MAX (__P4TC_OBJ_MAX - 1)
+/* P4 runtime Object types */
+enum {
+ P4TC_OBJ_RUNTIME_UNSPEC,
+ P4TC_OBJ_RUNTIME_TABLE,
+ __P4TC_OBJ_RUNTIME_MAX,
+};
+
+#define P4TC_OBJ_RUNTIME_MAX (__P4TC_OBJ_RUNTIME_MAX - 1)
+
/* P4 attributes */
enum {
P4TC_UNSPEC,
@@ -214,7 +229,7 @@ enum {
P4TC_TABLE_INFO, /* struct p4tc_table_parm */
P4TC_TABLE_DEFAULT_HIT, /* nested default hit action attributes */
P4TC_TABLE_DEFAULT_MISS, /* nested default miss action attributes */
- P4TC_TABLE_CONST_ENTRY, /* nested const table entry */
+ P4TC_TABLE_CONST_ENTRY, /* nested const table entry*/
P4TC_TABLE_ACTS_LIST, /* nested table actions list */
__P4TC_TABLE_MAX
};
@@ -271,6 +286,57 @@ struct tc_act_dyna {
tc_gen;
};
+struct p4tc_table_entry_tm {
+ __u64 created;
+ __u64 lastused;
+ __u64 firstused;
+ __u16 who_created;
+ __u16 who_updated;
+ __u16 who_deleted;
+ __u16 permissions;
+};
+
+enum {
+ P4TC_ENTRY_TBL_ATTRS_UNSPEC,
+ P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT, /* nested default hit attrs */
+ P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS, /* nested default miss attrs */
+ P4TC_ENTRY_TBL_ATTRS_PERMISSIONS, /* u16 table permissions */
+ __P4TC_ENTRY_TBL_ATTRS,
+};
+
+#define P4TC_ENTRY_TBL_ATTRS_MAX (__P4TC_ENTRY_TBL_ATTRS - 1)
+
+/* Table entry attributes */
+enum {
+ P4TC_ENTRY_UNSPEC,
+ P4TC_ENTRY_TBLNAME, /* string */
+ P4TC_ENTRY_KEY_BLOB, /* Key blob */
+ P4TC_ENTRY_MASK_BLOB, /* Mask blob */
+ P4TC_ENTRY_PRIO, /* u32 */
+ P4TC_ENTRY_ACT, /* nested actions */
+ P4TC_ENTRY_TM, /* entry data path timestamps */
+ P4TC_ENTRY_WHODUNNIT, /* tells who's modifying the entry */
+ P4TC_ENTRY_CREATE_WHODUNNIT, /* tells who created the entry */
+ P4TC_ENTRY_UPDATE_WHODUNNIT, /* tells who updated the entry last */
+ P4TC_ENTRY_DELETE_WHODUNNIT, /* tells who deleted the entry */
+ P4TC_ENTRY_PERMISSIONS, /* entry CRUDXPS permissions */
+ P4TC_ENTRY_TBL_ATTRS, /* nested table attributes */
+ P4TC_ENTRY_DYNAMIC, /* u8 tells if table entry is dynamic */
+ P4TC_ENTRY_AGING, /* u64 table entry aging */
+ P4TC_ENTRY_PAD,
+ __P4TC_ENTRY_MAX
+};
+
+#define P4TC_ENTRY_MAX (__P4TC_ENTRY_MAX - 1)
+
+enum {
+ P4TC_ENTITY_UNSPEC,
+ P4TC_ENTITY_KERNEL,
+ P4TC_ENTITY_TC,
+ P4TC_ENTITY_TIMER,
+ P4TC_ENTITY_MAX
+};
+
#define P4TC_RTA(r) \
((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 4f9ebe3e7..76645560b 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -203,6 +203,15 @@ enum {
RTM_UPDATEP4TEMPLATE,
#define RTM_UPDATEP4TEMPLATE RTM_UPDATEP4TEMPLATE
+ RTM_P4TC_CREATE = 128,
+#define RTM_P4TC_CREATE RTM_P4TC_CREATE
+ RTM_P4TC_DEL,
+#define RTM_P4TC_DEL RTM_P4TC_DEL
+ RTM_P4TC_GET,
+#define RTM_P4TC_GET RTM_P4TC_GET
+ RTM_P4TC_UPDATE,
+#define RTM_P4TC_UPDATE RTM_P4TC_UPDATE
+
__RTM_MAX,
#define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
};
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 7a9c13f86..921909ac4 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
- p4tc_action.o p4tc_table.o
+ p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
+ p4tc_runtime_api.o
diff --git a/net/sched/p4tc/p4tc_runtime_api.c b/net/sched/p4tc/p4tc_runtime_api.c
new file mode 100644
index 000000000..bcb280909
--- /dev/null
+++ b/net/sched/p4tc/p4tc_runtime_api.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_runtime_api.c P4 TC RUNTIME API
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
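+/* Dispatch a P4TC runtime command according to the object type carried in
+ * the p4tcmsg header.
+ */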
+static int tc_ctl_p4_root(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+
+ switch (t->obj) {
+ case P4TC_OBJ_RUNTIME_TABLE: {
+ struct net *net = sock_net(skb->sk);
+ int ret;
+
+ net = maybe_get_net(net);
+ if (!net) {
+ NL_SET_ERR_MSG(extack, "Net namespace is going down");
+ return -EBUSY;
+ }
+
+ ret = p4tc_tbl_entry_crud(net, skb, n, cmd, extack);
+
+ put_net(net);
+
+ return ret;
+ }
+ default:
+ NL_SET_ERR_MSG(extack, "Unknown P4 runtime object type");
+ return -EOPNOTSUPP;
+ }
+}
+
+static int tc_ctl_p4_get(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ return tc_ctl_p4_root(skb, n, RTM_P4TC_GET, extack);
+}
+
+static int tc_ctl_p4_delete(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ if (!netlink_capable(skb, CAP_NET_ADMIN))
+ return -EPERM;
+
+ return tc_ctl_p4_root(skb, n, RTM_P4TC_DEL, extack);
+}
+
+static int tc_ctl_p4_cu(struct sk_buff *skb, struct nlmsghdr *n,
+ struct netlink_ext_ack *extack)
+{
+ int ret;
+
+ if (!netlink_capable(skb, CAP_NET_ADMIN))
+ return -EPERM;
+
+ ret = tc_ctl_p4_root(skb, n, n->nlmsg_type, extack);
+
+ return ret;
+}
+
+static int tc_ctl_p4_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ struct p4tcmsg *t;
+ int ret = 0;
+
+ /* Dump is always called with the nlk->cb_mutex held.
+ * In rtnl this mutex is set to rtnl_lock, which means dumps,
+ * even for table entries, are serialized over the rtnl_lock.
+ *
+ * For table entries, it guarantees the net namespace is alive.
+ * For externs, we don't need to lock the rtnl_lock.
+ */
+ ASSERT_RTNL();
+
+ ret = nlmsg_parse(cb->nlh, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, cb->extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(cb->extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(cb->extack,
+ "Netlink P4TC Runtime attributes missing");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ t = nlmsg_data(cb->nlh);
+
+ switch (t->obj) {
+ case P4TC_OBJ_RUNTIME_TABLE:
+ return p4tc_tbl_entry_dumpit(sock_net(skb->sk), skb, cb,
+ tb[P4TC_ROOT], p_name);
+ default:
+ NL_SET_ERR_MSG_FMT(cb->extack,
+ "Unknown p4 runtime object type %u\n",
+ t->obj);
+ return -ENOENT;
+ }
+}
+
+static int __init p4tc_tbl_init(void)
+{
+ rtnl_register(PF_UNSPEC, RTM_P4TC_CREATE, tc_ctl_p4_cu, NULL,
+ RTNL_FLAG_DOIT_UNLOCKED);
+ rtnl_register(PF_UNSPEC, RTM_P4TC_UPDATE, tc_ctl_p4_cu, NULL,
+ RTNL_FLAG_DOIT_UNLOCKED);
+ rtnl_register(PF_UNSPEC, RTM_P4TC_DEL, tc_ctl_p4_delete, NULL,
+ RTNL_FLAG_DOIT_UNLOCKED);
+ rtnl_register(PF_UNSPEC, RTM_P4TC_GET, tc_ctl_p4_get, tc_ctl_p4_dump,
+ RTNL_FLAG_DOIT_UNLOCKED);
+
+ return 0;
+}
+
+subsys_initcall(p4tc_tbl_init);
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index 291988858..e38e14a84 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -144,6 +144,7 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
parm.tbl_max_masks = table->tbl_max_masks;
parm.tbl_type = table->tbl_type;
parm.tbl_aging = table->tbl_aging;
+ parm.tbl_num_entries = atomic_read(&table->tbl_nelems);
tbl_perm = rcu_dereference_rtnl(table->tbl_permissions);
parm.tbl_permissions = tbl_perm->permissions;
@@ -217,6 +218,16 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
}
nla_nest_end(skb, nested_tbl_acts);
+ if (table->tbl_const_entry) {
+ struct nlattr *const_nest;
+
+ const_nest = nla_nest_start(skb, P4TC_TABLE_CONST_ENTRY);
+ p4tc_tbl_entry_fill(skb, table, table->tbl_const_entry,
+ table->tbl_id, P4TC_ENTITY_UNSPEC);
+ nla_nest_end(skb, const_nest);
+ }
+ table->tbl_const_entry = NULL;
+
if (nla_put(skb, P4TC_TABLE_INFO, sizeof(parm), &parm))
goto out_nlmsg_trim;
nla_nest_end(skb, nest);
@@ -243,14 +254,6 @@ static int p4tc_table_fill_nlmsg(struct net *net, struct sk_buff *skb,
return 0;
}
-static inline void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
-{
- if (defact) {
- p4tc_action_destroy(defact->default_acts);
- kfree(defact);
- }
-}
-
static void p4tc_table_acts_list_destroy(struct list_head *acts_list)
{
struct p4tc_table_act *table_act, *tmp;
@@ -375,8 +378,11 @@ static int _p4tc_table_put(struct net *net, struct nlattr **tb,
p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+ rhltable_free_and_destroy(&table->tbl_entries,
+ p4tc_table_entry_destroy_hash, table);
+
idr_destroy(&table->tbl_masks_idr);
- idr_destroy(&table->tbl_prio_idr);
+ ida_destroy(&table->tbl_prio_idr);
perm = rcu_replace_pointer_rtnl(table->tbl_permissions, NULL);
kfree_rcu(perm, rcu);
@@ -905,6 +911,7 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
struct p4tc_pipeline *pipeline,
struct netlink_ext_ack *extack)
{
+ struct rhashtable_params table_hlt_params = entry_hlt_params;
struct p4tc_table_default_act_params def_params = {0};
struct p4tc_table_perm *tbl_init_perms = NULL;
struct p4tc_table_parm *parm;
@@ -1126,12 +1133,24 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
}
idr_init(&table->tbl_masks_idr);
- idr_init(&table->tbl_prio_idr);
+ ida_init(&table->tbl_prio_idr);
spin_lock_init(&table->tbl_masks_idr_lock);
+ table_hlt_params.max_size = table->tbl_max_entries;
+ if (table->tbl_max_entries > U16_MAX)
+ table_hlt_params.nelem_hint = U16_MAX / 4 * 3;
+ else
+ table_hlt_params.nelem_hint = table->tbl_max_entries / 4 * 3;
+
+ if (rhltable_init(&table->tbl_entries, &table_hlt_params) < 0) {
+ ret = -EINVAL;
+ goto defaultacts_destroy;
+ }
+
pipeline->curr_tables += 1;
table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
+ atomic_set(&table->tbl_nelems, 0);
return table;
@@ -1288,6 +1307,21 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
}
}
+ if (tb[P4TC_TABLE_CONST_ENTRY]) {
+ struct p4tc_table_entry *entry;
+
+ /* Workaround: stash the const entry so the fill path can echo
+ * it back to user space.
+ */
+ entry = p4tc_table_const_entry_cu(net,
+ tb[P4TC_TABLE_CONST_ENTRY],
+ pipeline, table, extack);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry);
+ goto free_perm;
+ }
+
+ table->tbl_const_entry = entry;
+ }
+
p4tc_table_replace_default_acts(table, &def_params, false);
p4tc_table_replace_permissions(table, perm, false);
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
new file mode 100644
index 000000000..cadd3e100
--- /dev/null
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -0,0 +1,2569 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_tbl_entry.c P4 TC TABLE ENTRY API
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+#define SIZEOF_MASKID (sizeof(((struct p4tc_table_entry_key *)0)->maskid))
+
+#define STARTOF_KEY(key) (&((key)->maskid))
+
+/* In this code we avoid locks for creating/updating/deleting table entries by
+ * using a refcount (entries_ref). We also use RCU to avoid locks for reading.
+ * Every time we try to get the entry, we increment and check the refcount to
+ * see whether a delete is happening in parallel.
+ */
+
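+/* Take a reference for a get/read; fails if the refcount is zero, i.e. the
+ * entry is being created, updated or deleted in parallel.
+ */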
+static int p4tc_tbl_entry_get(struct p4tc_table_entry_value *value)
+{
+ return refcount_inc_not_zero(&value->entries_ref);
+}
+
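+/* Claim the entry for deletion/update; only succeeds if nobody else holds
+ * a reference.
+ */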
+static bool p4tc_tbl_entry_put(struct p4tc_table_entry_value *value)
+{
+ return refcount_dec_if_one(&value->entries_ref);
+}
+
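+/* Drop a reference taken with p4tc_tbl_entry_get() without reaching zero */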
+static bool p4tc_tbl_entry_put_ref(struct p4tc_table_entry_value *value)
+{
+ return refcount_dec_not_one(&value->entries_ref);
+}
+
+static u32 p4tc_entry_hash_fn(const void *data, u32 len, u32 seed)
+{
+ const struct p4tc_table_entry_key *key = data;
+ u32 keysz;
+
+ /* The key memory area is always zeroed and aligned to 8 bytes */
+ keysz = round_up(SIZEOF_MASKID + (key->keysz >> 3), 4);
+
+ return jhash2(STARTOF_KEY(key), keysz / sizeof(u32), seed);
+}
+
+static int p4tc_entry_hash_cmp(struct rhashtable_compare_arg *arg,
+ const void *ptr)
+{
+ const struct p4tc_table_entry_key *key = arg->key;
+ const struct p4tc_table_entry *entry = ptr;
+ u32 keysz;
+
+ keysz = SIZEOF_MASKID + (entry->key.keysz >> 3);
+
+ return memcmp(STARTOF_KEY(&entry->key), STARTOF_KEY(key), keysz);
+}
+
+static u32 p4tc_entry_obj_hash_fn(const void *data, u32 len, u32 seed)
+{
+ const struct p4tc_table_entry *entry = data;
+
+ return p4tc_entry_hash_fn(&entry->key, len, seed);
+}
+
+const struct rhashtable_params entry_hlt_params = {
+ .obj_cmpfn = p4tc_entry_hash_cmp,
+ .obj_hashfn = p4tc_entry_obj_hash_fn,
+ .hashfn = p4tc_entry_hash_fn,
+ .head_offset = offsetof(struct p4tc_table_entry, ht_node),
+ .key_offset = offsetof(struct p4tc_table_entry, key),
+ .automatic_shrinking = true,
+};
+
+static struct rhlist_head *
+p4tc_entry_lookup_bucket(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key)
+{
+ return rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+}
+
+static struct p4tc_table_entry *
+__p4tc_entry_lookup_fast(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key)
+ __must_hold(RCU)
+{
+ struct p4tc_table_entry *entry_curr;
+ struct rhlist_head *bucket_list;
+
+ bucket_list = p4tc_entry_lookup_bucket(table, key);
+ if (!bucket_list)
+ return NULL;
+
+ rht_entry(entry_curr, bucket_list, ht_node);
+
+ return entry_curr;
+}
+
+static struct p4tc_table_entry *
+p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key,
+ u32 prio) __must_hold(RCU)
+{
+ struct rhlist_head *tmp, *bucket_list;
+ struct p4tc_table_entry *entry;
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+ return __p4tc_entry_lookup_fast(table, key);
+
+ bucket_list = p4tc_entry_lookup_bucket(table, key);
+ if (!bucket_list)
+ return NULL;
+
+ rhl_for_each_entry_rcu(entry, tmp, bucket_list, ht_node) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry);
+
+ if (value->prio == prio)
+ return entry;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_table_entry *
+__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
+ __must_hold(RCU)
+{
+ struct p4tc_table_entry *entry = NULL;
+ struct rhlist_head *tmp, *bucket_list;
+ struct p4tc_table_entry *entry_curr;
+ u32 smallest_prio = U32_MAX;
+
+ bucket_list =
+ rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+ if (!bucket_list)
+ return NULL;
+
+ rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry_curr);
+ if (value->prio <= smallest_prio) {
+ smallest_prio = value->prio;
+ entry = entry_curr;
+ }
+ }
+
+ return entry;
+}
+
+static void mask_key(const struct p4tc_table_entry_mask *mask, u8 *masked_key,
+ u8 *skb_key)
+{
+ int i;
+
+ for (i = 0; i < BITS_TO_BYTES(mask->sz); i++)
+ masked_key[i] = skb_key[i] & mask->fa_value[i];
+}
+
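+/* Datapath hit: refresh the lastused timestamp and make sure the aging
+ * timer is running for dynamic entries.
+ */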
+static void update_last_used(struct p4tc_table_entry *entry)
+{
+ struct p4tc_table_entry_tm *entry_tm;
+ struct p4tc_table_entry_value *value;
+
+ value = p4tc_table_entry_value(entry);
+ entry_tm = rcu_dereference(value->tm);
+ WRITE_ONCE(entry_tm->lastused, get_jiffies_64());
+
+ if (value->is_dyn && !hrtimer_active(&value->entry_timer))
+ hrtimer_start(&value->entry_timer, ms_to_ktime(1000),
+ HRTIMER_MODE_REL);
+}
+
+static struct p4tc_table_entry *
+__p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key)
+{
+ struct p4tc_table_entry *entry = NULL;
+ u32 smallest_prio = U32_MAX;
+ int i;
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+ return __p4tc_entry_lookup_fast(table, key);
+
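+ /* Ternary/LPM lookup: apply each installed mask to the lookup key
+ * and probe the hash table. LPM masks are ordered longest prefix
+ * first, so the first hit wins; for ternary we keep scanning and
+ * return the match with the lowest priority value.
+ */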
+ for (i = 0; i < table->tbl_curr_num_masks; i++) {
+ u8 __mkey[sizeof(*key) + BITS_TO_BYTES(P4TC_MAX_KEYSZ)];
+ struct p4tc_table_entry_key *mkey = (void *)&__mkey;
+ struct p4tc_table_entry_mask *mask =
+ rcu_dereference(table->tbl_masks_array[i]);
+ struct p4tc_table_entry *entry_curr = NULL;
+
+ mkey->keysz = key->keysz;
+ mkey->maskid = mask->mask_id;
+ mask_key(mask, mkey->fa_key, key->fa_key);
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_LPM) {
+ entry_curr = __p4tc_entry_lookup_fast(table, mkey);
+ if (entry_curr)
+ return entry_curr;
+ } else {
+ entry_curr = __p4tc_entry_lookup(table, mkey);
+
+ if (entry_curr) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry_curr);
+ if (value->prio <= smallest_prio) {
+ smallest_prio = value->prio;
+ entry = entry_curr;
+ }
+ }
+ }
+ }
+
+ return entry;
+}
+
+struct p4tc_table_entry *
+p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key)
+{
+ struct p4tc_table_entry *entry;
+
+ entry = __p4tc_table_entry_lookup_direct(table, key);
+
+ if (entry)
+ update_last_used(entry);
+
+ return entry;
+}
+
+#define p4tc_table_entry_mask_find_byid(table, id) \
+ (idr_find(&(table)->tbl_masks_idr, id))
+
+static void gen_exact_mask(u8 *mask, u32 mask_size)
+{
+ memset(mask, 0xFF, mask_size);
+}
+
+static int p4tca_table_get_entry_keys(struct sk_buff *skb,
+ struct p4tc_table *table,
+ struct p4tc_table_entry *entry)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_table_entry_mask *mask;
+ int ret = -ENOMEM;
+ u32 key_sz_bytes;
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+ u8 mask_value[BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+
+ key_sz_bytes = BITS_TO_BYTES(entry->key.keysz);
+ if (nla_put(skb, P4TC_ENTRY_KEY_BLOB, key_sz_bytes,
+ entry->key.fa_key))
+ goto out_nlmsg_trim;
+
+ gen_exact_mask(mask_value, key_sz_bytes);
+ if (nla_put(skb, P4TC_ENTRY_MASK_BLOB, key_sz_bytes, mask_value))
+ goto out_nlmsg_trim;
+ } else {
+ key_sz_bytes = BITS_TO_BYTES(entry->key.keysz);
+ if (nla_put(skb, P4TC_ENTRY_KEY_BLOB, key_sz_bytes,
+ entry->key.fa_key))
+ goto out_nlmsg_trim;
+
+ mask = p4tc_table_entry_mask_find_byid(table,
+ entry->key.maskid);
+ if (nla_put(skb, P4TC_ENTRY_MASK_BLOB, key_sz_bytes,
+ mask->fa_value))
+ goto out_nlmsg_trim;
+ }
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
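+/* Convert the entry's jiffies timestamps into clock_t deltas for dumping */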
+static void p4tc_table_entry_tm_dump(struct p4tc_table_entry_tm *dtm,
+ struct p4tc_table_entry_tm *stm)
+{
+ unsigned long now = jiffies;
+ u64 last_used;
+
+ dtm->created = stm->created ?
+ jiffies_to_clock_t(now - stm->created) : 0;
+
+ last_used = READ_ONCE(stm->lastused);
+ dtm->lastused = stm->lastused ?
+ jiffies_to_clock_t(now - last_used) : 0;
+ dtm->firstused = stm->firstused ?
+ jiffies_to_clock_t(now - stm->firstused) : 0;
+}
+
+#define P4TC_ENTRY_MAX_IDS (P4TC_PATH_MAX - 1)
+
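+/* Dump a single table entry (key/mask, priority, actions, timestamps,
+ * permissions, whodunnit and aging info) into the netlink message.
+ */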
+int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
+ struct p4tc_table_entry *entry, u32 tbl_id,
+ u16 who_deleted)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry_tm dtm, *tm;
+ struct nlattr *nest, *nest_acts;
+ u32 ids[P4TC_ENTRY_MAX_IDS];
+ int ret = -ENOMEM;
+
+ ids[P4TC_TBLID_IDX - 1] = tbl_id;
+
+ if (nla_put(skb, P4TC_PATH, P4TC_ENTRY_MAX_IDS * sizeof(u32), ids))
+ goto out_nlmsg_trim;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ value = p4tc_table_entry_value(entry);
+
+ if (nla_put_u32(skb, P4TC_ENTRY_PRIO, value->prio))
+ goto out_nlmsg_trim;
+
+ if (p4tca_table_get_entry_keys(skb, table, entry) < 0)
+ goto out_nlmsg_trim;
+
+ if (value->acts) {
+ nest_acts = nla_nest_start(skb, P4TC_ENTRY_ACT);
+ if (tcf_action_dump(skb, value->acts, 0, 0, false) < 0)
+ goto out_nlmsg_trim;
+ nla_nest_end(skb, nest_acts);
+ }
+
+ if (nla_put_u16(skb, P4TC_ENTRY_PERMISSIONS, value->permissions))
+ goto out_nlmsg_trim;
+
+ tm = rcu_dereference_protected(value->tm, 1);
+
+ if (nla_put_u8(skb, P4TC_ENTRY_CREATE_WHODUNNIT, tm->who_created))
+ goto out_nlmsg_trim;
+
+ if (tm->who_updated) {
+ if (nla_put_u8(skb, P4TC_ENTRY_UPDATE_WHODUNNIT,
+ tm->who_updated))
+ goto out_nlmsg_trim;
+ }
+
+ if (who_deleted) {
+ if (nla_put_u8(skb, P4TC_ENTRY_DELETE_WHODUNNIT,
+ who_deleted))
+ goto out_nlmsg_trim;
+ }
+
+ p4tc_table_entry_tm_dump(&dtm, tm);
+ if (nla_put_64bit(skb, P4TC_ENTRY_TM, sizeof(dtm), &dtm,
+ P4TC_ENTRY_PAD))
+ goto out_nlmsg_trim;
+
+ if (value->is_dyn) {
+ if (nla_put_u8(skb, P4TC_ENTRY_DYNAMIC, 1))
+ goto out_nlmsg_trim;
+ }
+
+ if (value->aging_ms) {
+ if (nla_put_u64_64bit(skb, P4TC_ENTRY_AGING, value->aging_ms,
+ P4TC_ENTRY_PAD))
+ goto out_nlmsg_trim;
+ }
+
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static struct netlink_range_validation range_aging = {
+ .min = 1,
+ .max = P4TC_MAX_T_AGING,
+};
+
+static const struct nla_policy p4tc_entry_policy[P4TC_ENTRY_MAX + 1] = {
+ [P4TC_ENTRY_TBLNAME] = { .type = NLA_STRING },
+ [P4TC_ENTRY_KEY_BLOB] = { .type = NLA_BINARY },
+ [P4TC_ENTRY_MASK_BLOB] = { .type = NLA_BINARY },
+ [P4TC_ENTRY_PRIO] = { .type = NLA_U32 },
+ [P4TC_ENTRY_ACT] = { .type = NLA_NESTED },
+ [P4TC_ENTRY_TM] =
+ NLA_POLICY_EXACT_LEN(sizeof(struct p4tc_table_entry_tm)),
+ [P4TC_ENTRY_WHODUNNIT] = { .type = NLA_U8 },
+ [P4TC_ENTRY_CREATE_WHODUNNIT] = { .type = NLA_U8 },
+ [P4TC_ENTRY_UPDATE_WHODUNNIT] = { .type = NLA_U8 },
+ [P4TC_ENTRY_DELETE_WHODUNNIT] = { .type = NLA_U8 },
+ [P4TC_ENTRY_PERMISSIONS] = NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+ [P4TC_ENTRY_TBL_ATTRS] = { .type = NLA_NESTED },
+ [P4TC_ENTRY_DYNAMIC] = NLA_POLICY_RANGE(NLA_U8, 1, 1),
+ [P4TC_ENTRY_AGING] = NLA_POLICY_FULL_RANGE(NLA_U64, &range_aging),
+};
+
+static struct p4tc_table_entry_mask *
+p4tc_table_entry_mask_find_byvalue(struct p4tc_table *table,
+ struct p4tc_table_entry_mask *mask)
+{
+ struct p4tc_table_entry_mask *mask_cur;
+ unsigned long mask_id, tmp;
+
+ idr_for_each_entry_ul(&table->tbl_masks_idr, mask_cur, tmp, mask_id) {
+ if (mask_cur->sz == mask->sz) {
+ u32 mask_sz_bytes = BITS_TO_BYTES(mask->sz);
+ void *curr_mask_value = mask_cur->fa_value;
+ void *mask_value = mask->fa_value;
+
+ if (memcmp(curr_mask_value, mask_value, mask_sz_bytes) == 0)
+ return mask_cur;
+ }
+ }
+
+ return NULL;
+}
+
+static void __p4tc_table_entry_mask_del(struct p4tc_table *table,
+ struct p4tc_table_entry_mask *mask)
+{
+ if (table->tbl_type == P4TC_TABLE_TYPE_TERNARY) {
+ struct p4tc_table_entry_mask __rcu **masks_array;
+ unsigned long *free_masks_bitmap;
+
+ masks_array = table->tbl_masks_array;
+ rcu_assign_pointer(masks_array[mask->mask_index], NULL);
+
+ free_masks_bitmap =
+ rcu_dereference_protected(table->tbl_free_masks_bitmap,
+ 1);
+ bitmap_set(free_masks_bitmap, mask->mask_index, 1);
+ } else if (table->tbl_type == P4TC_TABLE_TYPE_LPM) {
+ struct p4tc_table_entry_mask __rcu **masks_array;
+ int i;
+
+ masks_array = table->tbl_masks_array;
+
+ for (i = mask->mask_index; i < table->tbl_curr_num_masks - 1;
+ i++) {
+ struct p4tc_table_entry_mask *mask_tmp;
+
+ mask_tmp = rcu_dereference_protected(masks_array[i + 1],
+ 1);
+ rcu_assign_pointer(masks_array[i], mask_tmp);
+ }
+
+ rcu_assign_pointer(masks_array[table->tbl_curr_num_masks - 1],
+ NULL);
+ }
+
+ table->tbl_curr_num_masks--;
+}
+
+static void p4tc_table_entry_mask_del(struct p4tc_table *table,
+ struct p4tc_table_entry *entry)
+{
+ struct p4tc_table_entry_mask *mask_found;
+ const u32 mask_id = entry->key.maskid;
+
+ /* Will always be found */
+ mask_found = p4tc_table_entry_mask_find_byid(table, mask_id);
+
+ /* Last reference, can delete */
+ if (refcount_dec_if_one(&mask_found->mask_ref)) {
+ spin_lock_bh(&table->tbl_masks_idr_lock);
+ idr_remove(&table->tbl_masks_idr, mask_found->mask_id);
+ __p4tc_table_entry_mask_del(table, mask_found);
+ spin_unlock_bh(&table->tbl_masks_idr_lock);
+ kfree_rcu(mask_found, rcu);
+ } else {
+ if (!refcount_dec_not_one(&mask_found->mask_ref))
+ pr_warn("Mask was deleted in parallel");
+ }
+}
+
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+static u32 p4tc_fls(u8 *ptr, size_t len)
+{
+ int i;
+
+ for (i = len - 1; i >= 0; i--) {
+ int pos = fls(ptr[i]);
+
+ if (pos)
+ return (i * 8) + pos;
+ }
+
+ return 0;
+}
+#else
+static u32 p4tc_ffs(u8 *ptr, size_t len)
+{
+ int i;
+
+ for (i = 0; i < len; i++) {
+ int pos = ffs(ptr[i]);
+
+ if (pos)
+ return (i * 8) + pos;
+ }
+
+ return 0;
+}
+#endif
+
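+/* Derive a comparable prefix length from a mask blob; used to keep the
+ * LPM masks array ordered from longest to shortest prefix.
+ */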
+static u32 find_lpm_mask(struct p4tc_table *table, u8 *ptr)
+{
+ u32 ret;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ ret = p4tc_fls(ptr, BITS_TO_BYTES(table->tbl_keysz));
+#else
+ ret = p4tc_ffs(ptr, BITS_TO_BYTES(table->tbl_keysz));
+#endif
+ return ret ?: table->tbl_keysz;
+}
+
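+/* Insert a new mask keeping tbl_masks_array ordered from longest to
+ * shortest prefix, so that lookups can stop at the first matching mask.
+ */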
+static int p4tc_table_lpm_mask_insert(struct p4tc_table *table,
+ struct p4tc_table_entry_mask *mask)
+{
+ struct p4tc_table_entry_mask __rcu **masks_array =
+ table->tbl_masks_array;
+ const u32 nmasks = table->tbl_curr_num_masks ?: 1;
+ int pos;
+
+ for (pos = 0; pos < nmasks; pos++) {
+ u32 mask_value = find_lpm_mask(table, mask->fa_value);
+
+ if (table->tbl_masks_array[pos]) {
+ struct p4tc_table_entry_mask *mask_pos;
+ u32 array_mask_value;
+
+ mask_pos = rcu_dereference_protected(masks_array[pos],
+ 1);
+ array_mask_value =
+ find_lpm_mask(table, mask_pos->fa_value);
+
+ if (mask_value > array_mask_value) {
+ /* shift masks to the right (will keep invariant) */
+ u32 tail = nmasks;
+
+ while (tail > pos + 1) {
+ rcu_assign_pointer(masks_array[tail],
+ masks_array[tail - 1]);
+ table->tbl_masks_array[tail] =
+ table->tbl_masks_array[tail - 1];
+ tail--;
+ }
+ rcu_assign_pointer(masks_array[pos + 1],
+ masks_array[pos]);
+ /* assign to pos */
+ break;
+ }
+ } else {
+ /* pos is empty, assign to pos */
+ break;
+ }
+ }
+
+ mask->mask_index = pos;
+ rcu_assign_pointer(masks_array[pos], mask);
+ table->tbl_curr_num_masks++;
+
+ return 0;
+}
+
+static int
+p4tc_table_ternary_mask_insert(struct p4tc_table *table,
+ struct p4tc_table_entry_mask *mask)
+{
+ unsigned long *free_masks_bitmap =
+ rcu_dereference_protected(table->tbl_free_masks_bitmap, 1);
+ unsigned long pos =
+ find_first_bit(free_masks_bitmap, P4TC_MAX_TMASKS);
+ struct p4tc_table_entry_mask __rcu **masks_array =
+ table->tbl_masks_array;
+
+ if (pos == P4TC_MAX_TMASKS)
+ return -ENOSPC;
+
+ mask->mask_index = pos;
+ rcu_assign_pointer(masks_array[pos], mask);
+ bitmap_clear(free_masks_bitmap, pos, 1);
+ table->tbl_curr_num_masks++;
+
+ return 0;
+}
+
+static int p4tc_table_add_mask_array(struct p4tc_table *table,
+ struct p4tc_table_entry_mask *mask)
+{
+ if (table->tbl_max_masks < table->tbl_curr_num_masks + 1)
+ return -ENOSPC;
+
+ switch (table->tbl_type) {
+ case P4TC_TABLE_TYPE_TERNARY:
+ return p4tc_table_ternary_mask_insert(table, mask);
+ case P4TC_TABLE_TYPE_LPM:
+ return p4tc_table_lpm_mask_insert(table, mask);
+ default:
+ return -ENOSPC;
+ }
+}
+
+/* TODO: Ordering optimisation for LPM */
+static struct p4tc_table_entry_mask *
+p4tc_table_entry_mask_add(struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ struct p4tc_table_entry_mask *mask)
+{
+ struct p4tc_table_entry_mask *mask_found;
+ int ret;
+
+ mask_found = p4tc_table_entry_mask_find_byvalue(table, mask);
+ /* Only add mask if it was not already added */
+ if (!mask_found) {
+ struct p4tc_table_entry_mask *nmask;
+ size_t mask_sz_bytes = BITS_TO_BYTES(mask->sz);
+
+ nmask = kzalloc(struct_size(mask_found, fa_value, mask_sz_bytes), GFP_ATOMIC);
+ if (unlikely(!nmask))
+ return ERR_PTR(-ENOMEM);
+
+ memcpy(nmask->fa_value, mask->fa_value, mask_sz_bytes);
+
+ nmask->mask_id = 1;
+ nmask->sz = mask->sz;
+ refcount_set(&nmask->mask_ref, 1);
+
+ spin_lock_bh(&table->tbl_masks_idr_lock);
+ ret = idr_alloc_u32(&table->tbl_masks_idr, nmask,
+ &nmask->mask_id, UINT_MAX, GFP_ATOMIC);
+ if (ret < 0)
+ goto unlock;
+
+ ret = p4tc_table_add_mask_array(table, nmask);
+unlock:
+ spin_unlock_bh(&table->tbl_masks_idr_lock);
+ if (ret < 0) {
+ kfree(nmask);
+ return ERR_PTR(ret);
+ }
+ entry->key.maskid = nmask->mask_id;
+ mask_found = nmask;
+ } else {
+ if (!refcount_inc_not_zero(&mask_found->mask_ref))
+ return ERR_PTR(-EBUSY);
+ entry->key.maskid = mask_found->mask_id;
+ }
+
+ return mask_found;
+}
+
+static int p4tc_tbl_entry_emit_event(struct p4tc_table_entry_work *entry_work,
+ int cmd, gfp_t alloc_flags)
+{
+ struct p4tc_pipeline *pipeline = entry_work->pipeline;
+ struct p4tc_table_entry *entry = entry_work->entry;
+ struct p4tc_table *table = entry_work->table;
+ u16 who_deleted = entry_work->who_deleted;
+ struct net *net = pipeline->net;
+ struct sock *rtnl = net->rtnl;
+ struct nlmsghdr *nlh;
+ struct nlattr *nest;
+ struct sk_buff *skb;
+ struct nlattr *root;
+ struct p4tcmsg *t;
+ int err = -ENOMEM;
+
+ if (!rtnl_has_listeners(net, RTNLGRP_TC))
+ return 0;
+
+ skb = alloc_skb(NLMSG_GOODSIZE, alloc_flags);
+ if (!skb)
+ return err;
+
+ nlh = nlmsg_put(skb, 1, 1, cmd, sizeof(*t), NLM_F_REQUEST);
+ if (!nlh)
+ goto free_skb;
+
+ t = nlmsg_data(nlh);
+ if (!t)
+ goto free_skb;
+
+ t->pipeid = pipeline->common.p_id;
+ t->obj = P4TC_OBJ_RUNTIME_TABLE;
+
+ if (nla_put_string(skb, P4TC_ROOT_PNAME, pipeline->common.name))
+ goto free_skb;
+
+ root = nla_nest_start(skb, P4TC_ROOT);
+ if (!root)
+ goto free_skb;
+
+ nest = nla_nest_start(skb, 1);
+ if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id,
+ who_deleted) < 0)
+ goto free_skb;
+ nla_nest_end(skb, nest);
+
+ nla_nest_end(skb, root);
+
+ nlmsg_end(skb, nlh);
+
+ return nlmsg_notify(rtnl, skb, 0, RTNLGRP_TC, 0, alloc_flags);
+
+free_skb:
+ kfree_skb(skb);
+ return err;
+}
+
+static void __p4tc_table_entry_put(struct p4tc_table_entry *entry)
+{
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry_tm *tm;
+
+ value = p4tc_table_entry_value(entry);
+
+ if (value->acts)
+ p4tc_action_destroy(value->acts);
+
+ kfree(value->entry_work);
+ tm = rcu_dereference_protected(value->tm, 1);
+ kfree(tm);
+
+ kfree(entry);
+}
+
+static void p4tc_table_entry_del_work(struct work_struct *work)
+{
+ struct p4tc_table_entry_work *entry_work =
+ container_of(work, typeof(*entry_work), work);
+ struct p4tc_pipeline *pipeline = entry_work->pipeline;
+ struct p4tc_table_entry *entry = entry_work->entry;
+ struct p4tc_table_entry_value *value;
+
+ value = p4tc_table_entry_value(entry);
+
+ if (entry_work->send_event && p4tc_ctrl_pub_ok(value->permissions))
+ p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_DEL, GFP_KERNEL);
+
+ if (value->is_dyn)
+ hrtimer_cancel(&value->entry_timer);
+
+ put_net(pipeline->net);
+ p4tc_pipeline_put_ref(pipeline);
+
+ __p4tc_table_entry_put(entry);
+}
+
+static void p4tc_table_entry_put(struct p4tc_table_entry *entry, bool deferred)
+{
+ struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+ if (deferred) {
+ struct p4tc_table_entry_work *entry_work = value->entry_work;
+ /* We have to free tc actions
+ * in a sleepable context
+ */
+ struct p4tc_pipeline *pipeline = entry_work->pipeline;
+
+ /* Avoid pipeline del before deferral ends */
+ p4tc_pipeline_get(pipeline);
+ get_net(pipeline->net); /* avoid action cleanup */
+ schedule_work(&entry_work->work);
+ } else {
+ if (value->is_dyn)
+ hrtimer_cancel(&value->entry_timer);
+
+ __p4tc_table_entry_put(entry);
+ }
+}
+
+static void p4tc_table_entry_put_rcu(struct rcu_head *rcu)
+{
+ struct p4tc_table_entry *entry =
+ container_of(rcu, struct p4tc_table_entry, rcu);
+ struct p4tc_table_entry_work *entry_work =
+ p4tc_table_entry_work(entry);
+ struct p4tc_pipeline *pipeline = entry_work->pipeline;
+
+ p4tc_table_entry_put(entry, true);
+
+ p4tc_pipeline_put_ref(pipeline);
+ put_net(pipeline->net);
+}
+
+static void __p4tc_table_entry_destroy(struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ bool remove_from_hash, bool send_event,
+ u16 who_deleted)
+{
+ /* !remove_from_hash and deferred deletion are incompatible
+ * as entries that defer deletion after a GP __must__
+ * be removed from the hash
+ */
+ if (remove_from_hash)
+ rhltable_remove(&table->tbl_entries, &entry->ht_node,
+ entry_hlt_params);
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ p4tc_table_entry_mask_del(table, entry);
+
+ if (remove_from_hash) {
+ struct p4tc_table_entry_work *entry_work =
+ p4tc_table_entry_work(entry);
+
+ entry_work->send_event = send_event;
+ entry_work->who_deleted = who_deleted;
+ /* guarantee net doesn't go down before async task runs */
+ get_net(entry_work->pipeline->net);
+ /* guarantee pipeline isn't deleted before async task runs */
+ p4tc_pipeline_get(entry_work->pipeline);
+ call_rcu(&entry->rcu, p4tc_table_entry_put_rcu);
+ } else {
+ p4tc_table_entry_put(entry, false);
+ }
+}
+
+#define P4TC_TABLE_EXACT_PRIO 64000
+
+static int p4tc_table_entry_exact_prio(void)
+{
+ return P4TC_TABLE_EXACT_PRIO;
+}
+
+static int p4tc_table_entry_alloc_new_prio(struct p4tc_table *table)
+{
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+ return p4tc_table_entry_exact_prio();
+
+ return ida_alloc_min(&table->tbl_prio_idr, 1,
+ GFP_ATOMIC);
+}
+
+static void p4tc_table_entry_free_prio(struct p4tc_table *table, u32 prio)
+{
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ ida_free(&table->tbl_prio_idr, prio);
+}
+
+static int p4tc_table_entry_destroy(struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ bool remove_from_hash,
+ bool send_event, u16 who_deleted)
+{
+ struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+ /* Entry was deleted in parallel */
+ if (!p4tc_tbl_entry_put(value))
+ return -EBUSY;
+
+ p4tc_table_entry_free_prio(table, value->prio);
+
+ __p4tc_table_entry_destroy(table, entry, remove_from_hash, send_event,
+ who_deleted);
+
+ atomic_dec(&table->tbl_nelems);
+
+ return 0;
+}
+
+static void p4tc_table_entry_destroy_noida(struct p4tc_table *table,
+ struct p4tc_table_entry *entry)
+{
+ /* Entry refcount was already decremented */
+ __p4tc_table_entry_destroy(table, entry, true, false, 0);
+}
+
+/* Only deletes entries when called from pipeline put */
+void p4tc_table_entry_destroy_hash(void *ptr, void *arg)
+{
+ struct p4tc_table_entry *entry = ptr;
+ struct p4tc_table *table = arg;
+
+ p4tc_table_entry_destroy(table, entry, false, false,
+ P4TC_ENTITY_TC);
+}
+
+static void p4tc_table_entry_put_table(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table)
+{
+ p4tc_table_put_ref(table);
+ p4tc_pipeline_put_ref(pipeline);
+}
+
+static int p4tc_table_entry_get_table(struct net *net,
+ struct p4tc_pipeline **pipeline,
+ struct p4tc_table **table,
+ struct nlattr **tb,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ /* The following can only race with user driven events
+ * Netns is guaranteed to be alive
+ */
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid, tbl_id;
+ char *tblname;
+ int ret;
+
+ rcu_read_lock();
+
+ pipeid = ids[P4TC_PID_IDX];
+
+ *pipeline = p4tc_pipeline_find_get(net, nl_path_attrs->pname, pipeid,
+ extack);
+ if (IS_ERR(*pipeline)) {
+ ret = PTR_ERR(*pipeline);
+ goto out;
+ }
+
+ if (!pipeline_sealed(*pipeline)) {
+ NL_SET_ERR_MSG(extack,
+ "Need to seal pipeline before issuing runtime command");
+ ret = -EINVAL;
+ goto put;
+ }
+
+ tbl_id = ids[P4TC_TBLID_IDX];
+ tblname = tb[P4TC_ENTRY_TBLNAME] ? nla_data(tb[P4TC_ENTRY_TBLNAME]) : NULL;
+
+ *table = p4tc_table_find_get(*pipeline, tblname, tbl_id, extack);
+ if (IS_ERR(*table)) {
+ ret = PTR_ERR(*table);
+ goto put;
+ }
+
+ rcu_read_unlock();
+
+ return 0;
+
+put:
+ p4tc_pipeline_put_ref(*pipeline);
+
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
+static void
+p4tc_table_entry_assign_key_exact(struct p4tc_table_entry_key *key, u8 *keyblob)
+{
+ memcpy(key->fa_key, keyblob, BITS_TO_BYTES(key->keysz));
+}
+
+static void
+p4tc_table_entry_assign_key_generic(struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_mask *mask,
+ u8 *keyblob, u8 *maskblob)
+{
+ u32 keysz = BITS_TO_BYTES(key->keysz);
+
+ memcpy(key->fa_key, keyblob, keysz);
+ memcpy(mask->fa_value, maskblob, keysz);
+}
+
+static int p4tc_table_entry_extract_key(struct p4tc_table *table,
+ struct nlattr **tb,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_mask *mask,
+ struct netlink_ext_ack *extack)
+{
+ bool is_exact = table->tbl_type == P4TC_TABLE_TYPE_EXACT;
+ void *keyblob, *maskblob;
+ u32 keysz;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ENTRY_KEY_BLOB)) {
+ NL_SET_ERR_MSG(extack, "Must specify key blobs");
+ return -EINVAL;
+ }
+
+ keysz = nla_len(tb[P4TC_ENTRY_KEY_BLOB]);
+ if (BITS_TO_BYTES(key->keysz) != keysz) {
+ NL_SET_ERR_MSG(extack,
+ "Key blob size and table key size differ");
+ return -EINVAL;
+ }
+
+ if (!is_exact) {
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ENTRY_MASK_BLOB)) {
+ NL_SET_ERR_MSG(extack, "Must specify mask blobs");
+ return -EINVAL;
+ }
+
+ if (keysz != nla_len(tb[P4TC_ENTRY_MASK_BLOB])) {
+ NL_SET_ERR_MSG(extack,
+ "Key and mask blob must have the same length");
+ return -EINVAL;
+ }
+ }
+
+ keyblob = nla_data(tb[P4TC_ENTRY_KEY_BLOB]);
+ if (is_exact) {
+ p4tc_table_entry_assign_key_exact(key, keyblob);
+ } else {
+ maskblob = nla_data(tb[P4TC_ENTRY_MASK_BLOB]);
+ p4tc_table_entry_assign_key_generic(key, mask, keyblob,
+ maskblob);
+ }
+
+ return 0;
+}
+
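+/* Pre-mask the key and record the mask id (no-op for exact tables) so that
+ * hashing and comparisons always operate on the masked key.
+ */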
+static void p4tc_table_entry_build_key(struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_mask *mask)
+{
+ int i;
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+ return;
+
+ key->maskid = mask->mask_id;
+
+ for (i = 0; i < BITS_TO_BYTES(key->keysz); i++)
+ key->fa_key[i] &= mask->fa_value[i];
+}
+
+static int ___p4tc_table_entry_del(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ bool from_control)
+__must_hold(RCU)
+{
+ u16 who_deleted = from_control ? P4TC_ENTITY_UNSPEC : P4TC_ENTITY_KERNEL;
+ struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+ if (from_control) {
+ if (!p4tc_ctrl_delete_ok(value->permissions))
+ return -EPERM;
+ } else {
+ if (!p4tc_data_delete_ok(value->permissions))
+ return -EPERM;
+ }
+
+ if (p4tc_table_entry_destroy(table, entry, true, !from_control,
+ who_deleted) < 0)
+ return -EBUSY;
+
+ return 0;
+}
+
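+/* Handle a single table entry GET or DELETE command */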
+static int p4tc_table_entry_gd(struct net *net, struct sk_buff *skb, bool del,
+ u16 *permissions, struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table_entry_mask *mask = NULL, *new_mask;
+ struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+ struct p4tc_table_entry *entry = NULL;
+ struct p4tc_pipeline *pipeline = NULL;
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry_key *key;
+ u32 *ids = nl_path_attrs->ids;
+ bool has_listener = !!skb;
+ struct p4tc_table *table;
+ u16 who_deleted = 0;
+ bool get = !del;
+ u32 keysz_bytes;
+ u32 keysz_bits;
+ u32 prio;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+ extack);
+ if (ret < 0)
+ return ret;
+
+ ret = p4tc_table_entry_get_table(net, &pipeline, &table, tb,
+ nl_path_attrs, extack);
+ if (ret < 0)
+ return ret;
+
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+ prio = p4tc_table_entry_exact_prio();
+ } else {
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_PRIO)) {
+ NL_SET_ERR_MSG(extack, "Must specify table entry priority");
+ ret = -EINVAL;
+ goto table_put;
+ }
+ prio = nla_get_u32(tb[P4TC_ENTRY_PRIO]);
+ }
+
+ if (del && !pipeline_sealed(pipeline)) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to delete table entry in unsealed pipeline");
+ ret = -EINVAL;
+ goto table_put;
+ }
+
+ keysz_bits = table->tbl_keysz;
+ keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+
+ key = kzalloc(struct_size(key, fa_key, keysz_bytes), GFP_KERNEL);
+ if (unlikely(!key)) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate key");
+ ret = -ENOMEM;
+ goto table_put;
+ }
+
+ key->keysz = keysz_bits;
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+ mask = kzalloc(struct_size(mask, fa_value, keysz_bytes),
+ GFP_KERNEL);
+ if (unlikely(!mask)) {
+ NL_SET_ERR_MSG(extack, "Failed to allocate mask");
+ ret = -ENOMEM;
+ goto free_key;
+ }
+ mask->sz = key->keysz;
+ }
+
+ ret = p4tc_table_entry_extract_key(table, tb, key, mask, extack);
+ if (unlikely(ret < 0)) {
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ kfree(mask);
+
+ goto free_key;
+ }
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+ new_mask = p4tc_table_entry_mask_find_byvalue(table, mask);
+ kfree(mask);
+ if (!new_mask) {
+ NL_SET_ERR_MSG(extack, "Unable to find entry mask");
+ ret = -ENOENT;
+ goto free_key;
+ } else {
+ mask = new_mask;
+ }
+ }
+
+ p4tc_table_entry_build_key(table, key, mask);
+
+ rcu_read_lock();
+ entry = p4tc_entry_lookup(table, key, prio);
+ if (!entry) {
+ NL_SET_ERR_MSG(extack, "Unable to find entry");
+ ret = -ENOENT;
+ goto unlock;
+ }
+
+ /* As we can run delete/update in parallel we might get a soon to be
+ * purged entry from the lookup
+ */
+ value = p4tc_table_entry_value(entry);
+ if (get && !p4tc_tbl_entry_get(value)) {
+ NL_SET_ERR_MSG(extack, "Entry deleted in parallel");
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ if (del) {
+ if (tb[P4TC_ENTRY_WHODUNNIT])
+ who_deleted = nla_get_u8(tb[P4TC_ENTRY_WHODUNNIT]);
+ } else {
+ if (!p4tc_ctrl_read_ok(value->permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to read table entry");
+ ret = -EPERM;
+ goto entry_put;
+ }
+
+ if (!p4tc_ctrl_pub_ok(value->permissions)) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to publish read entry");
+ ret = -EPERM;
+ goto entry_put;
+ }
+ }
+
+ if (has_listener) {
+ if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id,
+ who_deleted) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to fill table entry attributes");
+ ret = -EINVAL;
+ goto entry_put;
+ }
+ *permissions = value->permissions;
+ }
+
+ if (del) {
+ ret = ___p4tc_table_entry_del(pipeline, table, entry, true);
+ if (ret < 0) {
+ if (ret == -EBUSY)
+ NL_SET_ERR_MSG(extack,
+ "Entry was deleted in parallel");
+ goto entry_put;
+ }
+
+ if (!has_listener)
+ goto out;
+ }
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+out:
+ ret = 0;
+
+entry_put:
+ if (get)
+ p4tc_tbl_entry_put_ref(value);
+
+unlock:
+ rcu_read_unlock();
+
+free_key:
+ kfree(key);
+
+table_put:
+ p4tc_table_entry_put_table(pipeline, table);
+
+ return ret;
+}
+
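+/* Flush a table: delete every entry whose permissions allow a control plane
+ * delete.
+ */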
+static int p4tc_table_entry_flush(struct net *net, struct sk_buff *skb,
+ struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+ u32 arg_ids[P4TC_PATH_MAX - 1];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table_entry *entry;
+ struct rhashtable_iter iter;
+ bool has_listener = !!skb;
+ struct p4tc_table *table;
+ unsigned char *b;
+ int ret = 0;
+ int i = 0;
+
+ if (arg) {
+ ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg,
+ p4tc_entry_policy, extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = p4tc_table_entry_get_table(net, &pipeline, &table, tb,
+ nl_path_attrs, extack);
+ if (ret < 0)
+ return ret;
+
+ if (has_listener)
+ b = nlmsg_get_pos(skb);
+
+ if (!ids[P4TC_TBLID_IDX])
+ arg_ids[P4TC_TBLID_IDX - 1] = table->tbl_id;
+
+ if (has_listener && nla_put(skb, P4TC_PATH, sizeof(arg_ids), arg_ids)) {
+ ret = -ENOMEM;
+ goto out_nlmsg_trim;
+ }
+
+ /* There is an issue here regarding the stability of walking an
+ * rhashtable. If an insert or a delete happens in parallel, we may see
+ * duplicate entries or skip some valid entries. To solve this we are
+ * going to have an auxiliary list that also stores the entries and will
+ * be used for flushing, instead of walking over the rhashtable.
+ */
+ rhltable_walk_enter(&table->tbl_entries, &iter);
+ do {
+ rhashtable_walk_start(&iter);
+
+ while ((entry = rhashtable_walk_next(&iter)) && !IS_ERR(entry)) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry);
+
+ if (!p4tc_ctrl_delete_ok(value->permissions)) {
+ ret = -EPERM;
+ continue;
+ }
+
+ ret = p4tc_table_entry_destroy(table, entry, true, false,
+ P4TC_ENTITY_UNSPEC);
+ if (ret < 0)
+ continue;
+
+ i++;
+ }
+
+ rhashtable_walk_stop(&iter);
+ } while (entry == ERR_PTR(-EAGAIN));
+ rhashtable_walk_exit(&iter);
+
+ /* If another user creates a table entry in parallel with this flush,
+ * we may not be able to flush all the entries. So the user should
+ * verify after flush to check for this.
+ */
+
+ if (has_listener)
+ nla_put_u32(skb, P4TC_COUNT, i);
+
+ if (ret < 0) {
+ if (i == 0) {
+ NL_SET_ERR_MSG_WEAK(extack,
+ "Unable to flush any entries");
+ goto out_nlmsg_trim;
+ } else {
+ if (!extack->_msg)
+ NL_SET_ERR_MSG_FMT(extack,
+ "Flush only %u table entries",
+ i);
+ }
+ }
+
+ if (has_listener) {
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+ }
+
+ ret = 0;
+ goto table_put;
+
+out_nlmsg_trim:
+ if (has_listener)
+ nlmsg_trim(skb, b);
+
+table_put:
+ p4tc_table_entry_put_table(pipeline, table);
+
+ return ret;
+}
+
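+/* Aging timer callback: delete the entry once it has been idle for
+ * aging_ms, otherwise re-arm the timer.
+ */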
+static enum hrtimer_restart entry_timer_handle(struct hrtimer *timer)
+{
+ struct p4tc_table_entry_value *value =
+ container_of(timer, struct p4tc_table_entry_value, entry_timer);
+ struct p4tc_table_entry_tm *tm;
+ struct p4tc_table_entry *entry;
+ u64 aging_ms = value->aging_ms;
+ struct p4tc_table *table;
+ u64 tdiff, lastused;
+
+ rcu_read_lock();
+ tm = rcu_dereference(value->tm);
+ lastused = tm->lastused;
+ rcu_read_unlock();
+
+ tdiff = jiffies64_to_msecs(get_jiffies_64() - lastused);
+
+ if (tdiff < aging_ms) {
+ hrtimer_forward_now(timer, ms_to_ktime(aging_ms));
+ return HRTIMER_RESTART;
+ }
+
+ entry = value->entry_work->entry;
+ table = value->entry_work->table;
+
+ p4tc_table_entry_destroy(table, entry, true,
+ true, P4TC_ENTITY_TIMER);
+
+ return HRTIMER_NORESTART;
+}
+
+static struct p4tc_table_entry_tm *
+p4tc_table_entry_create_tm(const u16 whodunnit)
+{
+ struct p4tc_table_entry_tm *dtm;
+
+ dtm = kzalloc(sizeof(*dtm), GFP_ATOMIC);
+ if (unlikely(!dtm))
+ return ERR_PTR(-ENOMEM);
+
+ dtm->who_created = whodunnit;
+ dtm->who_deleted = P4TC_ENTITY_UNSPEC;
+ dtm->created = jiffies;
+ dtm->firstused = 0;
+ dtm->lastused = jiffies;
+
+ return dtm;
+}
+
+/* Invoked from both control and data path */
+static int __p4tc_table_entry_create(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ struct p4tc_table_entry_mask *mask,
+ u16 whodunnit, bool from_control)
+__must_hold(RCU)
+{
+ struct p4tc_table_entry_mask *mask_found = NULL;
+ struct p4tc_table_entry_work *entry_work;
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_perm *tbl_perm;
+ struct p4tc_table_entry_tm *dtm;
+ u16 permissions;
+ int ret;
+
+ value = p4tc_table_entry_value(entry);
+ /* We set it to zero on create and update to avoid having the entry
+ * deleted in parallel before we report to user space.
+ */
+ refcount_set(&value->entries_ref, 0);
+
+ tbl_perm = rcu_dereference(table->tbl_permissions);
+ permissions = tbl_perm->permissions;
+ if (from_control) {
+ if (!p4tc_ctrl_create_ok(permissions))
+ return -EPERM;
+ } else {
+ if (!p4tc_data_create_ok(permissions))
+ return -EPERM;
+ }
+
+ /* From data plane we can only create entries on exact match */
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+ mask_found = p4tc_table_entry_mask_add(table, entry, mask);
+ if (IS_ERR(mask_found)) {
+ ret = PTR_ERR(mask_found);
+ goto out;
+ }
+ }
+
+ p4tc_table_entry_build_key(table, &entry->key, mask_found);
+
+ if (p4tc_entry_lookup(table, &entry->key, value->prio)) {
+ ret = -EEXIST;
+ goto rm_masks_idr;
+ }
+
+ dtm = p4tc_table_entry_create_tm(whodunnit);
+ if (IS_ERR(dtm)) {
+ ret = PTR_ERR(dtm);
+ goto rm_masks_idr;
+ }
+
+ rcu_assign_pointer(value->tm, dtm);
+
+ entry_work = kzalloc(sizeof(*entry_work), GFP_ATOMIC);
+ if (unlikely(!entry_work)) {
+ ret = -ENOMEM;
+ goto free_tm;
+ }
+
+ entry_work->pipeline = pipeline;
+ entry_work->table = table;
+ entry_work->entry = entry;
+ value->entry_work = entry_work;
+
+ INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
+
+ if (atomic_inc_return(&table->tbl_nelems) > table->tbl_max_entries) {
+ atomic_dec(&table->tbl_nelems);
+ ret = -ENOSPC;
+ goto free_work;
+ }
+
+ if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
+ entry_hlt_params) < 0) {
+ atomic_dec(&table->tbl_nelems);
+ ret = -EBUSY;
+ goto free_work;
+ }
+
+ if (value->is_dyn) {
+ /* Only use table template aging if user didn't specify one */
+ value->aging_ms = value->aging_ms ?: table->tbl_aging;
+
+ hrtimer_init(&value->entry_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ value->entry_timer.function = &entry_timer_handle;
+ hrtimer_start(&value->entry_timer, ms_to_ktime(value->aging_ms),
+ HRTIMER_MODE_REL);
+ }
+
+ if (!from_control && p4tc_ctrl_pub_ok(value->permissions))
+ p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_CREATE,
+ GFP_ATOMIC);
+
+ return 0;
+
+free_work:
+ kfree(entry_work);
+
+free_tm:
+ kfree(dtm);
+
+rm_masks_idr:
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ p4tc_table_entry_mask_del(table, entry);
+out:
+ return ret;
+}
+
+/* Invoked from both control and data path */
+static int __p4tc_table_entry_update(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry *entry,
+ struct p4tc_table_entry_mask *mask,
+ u16 whodunnit, bool from_control)
+__must_hold(RCU)
+{
+ struct p4tc_table_entry_mask *mask_found = NULL;
+ struct p4tc_table_entry_work *entry_work;
+ struct p4tc_table_entry_value *value_old;
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry *entry_old;
+ struct p4tc_table_entry_tm *tm_old;
+ struct p4tc_table_entry_tm *tm;
+ int ret;
+
+ value = p4tc_table_entry_value(entry);
+ /* We set it to zero on update to avoid having the entry removed from the
+ * rhashtable in parallel before we report to user space.
+ */
+ refcount_set(&value->entries_ref, 0);
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+ mask_found = p4tc_table_entry_mask_add(table, entry, mask);
+ if (IS_ERR(mask_found)) {
+ ret = PTR_ERR(mask_found);
+ goto out;
+ }
+ }
+
+ p4tc_table_entry_build_key(table, &entry->key, mask_found);
+
+ entry_old = p4tc_entry_lookup(table, &entry->key, value->prio);
+ if (!entry_old) {
+ ret = -ENOENT;
+ goto rm_masks_idr;
+ }
+
+ /* In case of parallel update, the thread that arrives here first will
+ * get the right to update.
+ *
+ * In case of a parallel get/update, whoever is second will fail
+ * appropriately.
+ */
+ value_old = p4tc_table_entry_value(entry_old);
+ if (!p4tc_tbl_entry_put(value_old)) {
+ ret = -EAGAIN;
+ goto rm_masks_idr;
+ }
+
+ if (from_control) {
+ if (!p4tc_ctrl_update_ok(value_old->permissions)) {
+ ret = -EPERM;
+ goto set_entries_refcount;
+ }
+ } else {
+ if (!p4tc_data_update_ok(value_old->permissions)) {
+ ret = -EPERM;
+ goto set_entries_refcount;
+ }
+ }
+
+ tm = kzalloc(sizeof(*tm), GFP_ATOMIC);
+ if (unlikely(!tm)) {
+ ret = -ENOMEM;
+ goto set_entries_refcount;
+ }
+
+ tm_old = rcu_dereference_protected(value_old->tm, 1);
+ *tm = *tm_old;
+
+ tm->lastused = jiffies;
+ tm->who_updated = whodunnit;
+
+ if (value->permissions == P4TC_PERMISSIONS_UNINIT)
+ value->permissions = value_old->permissions;
+
+ rcu_assign_pointer(value->tm, tm);
+
+ entry_work = kzalloc(sizeof(*entry_work), GFP_ATOMIC);
+ if (unlikely(!entry_work)) {
+ ret = -ENOMEM;
+ goto free_tm;
+ }
+
+ entry_work->pipeline = pipeline;
+ entry_work->table = table;
+ entry_work->entry = entry;
+ value->entry_work = entry_work;
+ if (!value->is_dyn)
+ value->is_dyn = value_old->is_dyn;
+
+ if (value->is_dyn) {
+ /* Only use old entry value if user didn't specify new one */
+ value->aging_ms = value->aging_ms ?: value_old->aging_ms;
+
+ hrtimer_init(&value->entry_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ value->entry_timer.function = &entry_timer_handle;
+
+ hrtimer_start(&value->entry_timer, ms_to_ktime(value->aging_ms),
+ HRTIMER_MODE_REL);
+ }
+
+ INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
+
+ if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
+ entry_hlt_params) < 0) {
+ ret = -EEXIST;
+ goto free_entry_work;
+ }
+
+ p4tc_table_entry_destroy_noida(table, entry_old);
+
+ if (!from_control && p4tc_ctrl_pub_ok(value->permissions))
+ p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_UPDATE,
+ GFP_ATOMIC);
+
+ return 0;
+
+free_entry_work:
+ kfree(entry_work);
+
+free_tm:
+ kfree(tm);
+
+set_entries_refcount:
+ refcount_set(&value_old->entries_ref, 1);
+
+rm_masks_idr:
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ p4tc_table_entry_mask_del(table, entry);
+
+out:
+ return ret;
+}
+
+static bool p4tc_table_check_entry_act(struct p4tc_table *table,
+ struct tc_action *entry_act)
+{
+ struct p4tc_table_act *table_act;
+
+ list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+ if (table_act->ops->id != entry_act->ops->id)
+ continue;
+
+ if (!(table_act->flags &
+ BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY)))
+ return true;
+ }
+
+ return false;
+}
+
+static const struct nla_policy p4tc_table_attrs_policy[P4TC_ENTRY_TBL_ATTRS_MAX + 1] = {
+ [P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT] = { .type = NLA_NESTED },
+ [P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS] = { .type = NLA_NESTED },
+ [P4TC_ENTRY_TBL_ATTRS_PERMISSIONS] = NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+};
+
+static int
+update_tbl_attrs(struct net *net, struct p4tc_table *table,
+ struct nlattr *table_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_table_default_act_params def_params = {0};
+ struct nlattr *tb[P4TC_ENTRY_TBL_ATTRS_MAX + 1];
+ struct p4tc_table_perm *tbl_perm = NULL;
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_ENTRY_TBL_ATTRS_MAX, table_attrs,
+ p4tc_table_attrs_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (tb[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS]) {
+ u16 permissions;
+
+ if (atomic_read(&table->tbl_nelems) > 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to set table permissions if it already has entries");
+ return -EINVAL;
+ }
+
+ permissions = nla_get_u16(tb[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS]);
+ tbl_perm = p4tc_table_init_permissions(table, permissions,
+ extack);
+ if (IS_ERR(tbl_perm))
+ return PTR_ERR(tbl_perm);
+ }
+
+ def_params.default_hit_attr = tb[P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT];
+ def_params.default_miss_attr = tb[P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS];
+
+ err = p4tc_table_init_default_acts(net, &def_params, table,
+ &table->tbl_acts_list, extack);
+ if (err < 0)
+ goto free_tbl_perm;
+
+ p4tc_table_replace_default_acts(table, &def_params, true);
+ p4tc_table_replace_permissions(table, tbl_perm, true);
+
+ return 0;
+
+free_tbl_perm:
+ kfree(tbl_perm);
+ return err;
+}
+
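+/* Entries inherit the table permissions minus the create bits, since create
+ * permission makes no sense for an already existing entry.
+ */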
+static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
+{
+ return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
+}
+
+#define P4TC_TBL_ENTRY_CU_FLAG_CREATE 0x1
+#define P4TC_TBL_ENTRY_CU_FLAG_UPDATE 0x2
+#define P4TC_TBL_ENTRY_CU_FLAG_SET 0x4
+
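+/* Build a table entry from netlink attributes and create, update or set it
+ * in the table.
+ */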
+static struct p4tc_table_entry *
+__p4tc_table_entry_cu(struct net *net, u8 cu_flags, struct nlattr **tb,
+ struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+ struct netlink_ext_ack *extack)
+{
+ bool replace = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+ bool set = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_SET;
+ u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+ BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+ struct p4tc_table_entry_mask *mask = (void *)&__mask;
+ struct p4tc_table_entry_value *value;
+ u8 whodunnit = P4TC_ENTITY_UNSPEC;
+ struct p4tc_table_entry *entry;
+ u32 keysz_bytes;
+ u32 keysz_bits;
+ u16 tblperm;
+ int ret = 0;
+ u32 entrysz;
+ u32 prio;
+
+ prio = tb[P4TC_ENTRY_PRIO] ? nla_get_u32(tb[P4TC_ENTRY_PRIO]) : 0;
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT && replace) {
+ if (!prio) {
+ NL_SET_ERR_MSG(extack, "Must specify entry priority");
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+ if (prio) {
+ NL_SET_ERR_MSG(extack,
+ "Mustn't specify entry priority for exact");
+ return ERR_PTR(-EINVAL);
+ }
+ prio = p4tc_table_entry_alloc_new_prio(table);
+ } else {
+ if (prio)
+ ret = ida_alloc_range(&table->tbl_prio_idr,
+ prio, prio, GFP_ATOMIC);
+ else
+ ret = p4tc_table_entry_alloc_new_prio(table);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to allocate priority");
+ return ERR_PTR(ret);
+ }
+ prio = ret;
+ }
+ }
+
+ whodunnit = nla_get_u8(tb[P4TC_ENTRY_WHODUNNIT]);
+
+ keysz_bits = table->tbl_keysz;
+ keysz_bytes = P4TC_KEYSZ_BYTES(keysz_bits);
+
+ /* Entry memory layout:
+ * { entry:key __aligned(8):value }
+ */
+ entrysz = sizeof(*entry) + keysz_bytes +
+ sizeof(struct p4tc_table_entry_value);
+
+ entry = kzalloc(entrysz, GFP_KERNEL);
+ if (unlikely(!entry)) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate table entry");
+ ret = -ENOMEM;
+ goto idr_rm;
+ }
+
+ entry->key.keysz = keysz_bits;
+ mask->sz = keysz_bits;
+
+ ret = p4tc_table_entry_extract_key(table, tb, &entry->key, mask, extack);
+ if (ret < 0)
+ goto free_entry;
+
+ value = p4tc_table_entry_value(entry);
+ value->prio = prio;
+
+ rcu_read_lock();
+ tblperm = rcu_dereference(table->tbl_permissions)->permissions;
+ rcu_read_unlock();
+
+ if (tb[P4TC_ENTRY_PERMISSIONS]) {
+ u16 nlperm;
+
+ nlperm = nla_get_u16(tb[P4TC_ENTRY_PERMISSIONS]);
+ if (~tblperm & nlperm) {
+ NL_SET_ERR_MSG(extack,
+ "Trying to set permission bits which aren't allowed by table");
+ ret = -EINVAL;
+ goto free_entry;
+ }
+
+ if (p4tc_ctrl_create_ok(nlperm) || p4tc_data_create_ok(nlperm)) {
+ NL_SET_ERR_MSG(extack,
+ "Create permission for table entry doesn't make sense");
+ ret = -EINVAL;
+ goto free_entry;
+ }
+ value->permissions = nlperm;
+ } else {
+ if (replace)
+ value->permissions = P4TC_PERMISSIONS_UNINIT;
+ else
+ value->permissions =
+ p4tc_table_entry_tbl_permcpy(tblperm);
+ }
+
+ if (tb[P4TC_ENTRY_ACT]) {
+ value->acts = kcalloc(TCA_ACT_MAX_PRIO,
+ sizeof(struct tc_action *), GFP_KERNEL);
+ if (unlikely(!value->acts)) {
+ ret = -ENOMEM;
+ goto free_entry;
+ }
+
+ ret = p4tc_action_init(net, tb[P4TC_ENTRY_ACT], value->acts,
+ table->common.p_id,
+ TCA_ACT_FLAGS_NO_RTNL, extack);
+ if (unlikely(ret < 0)) {
+ kfree(value->acts);
+ value->acts = NULL;
+ goto free_entry;
+ } else if (ret > 1) {
+ NL_SET_ERR_MSG(extack,
+ "Can only have one entry action");
+ ret = -EINVAL;
+ goto free_acts;
+ }
+
+ value->num_acts = ret;
+
+ if (!p4tc_table_check_entry_act(table, value->acts[0])) {
+ ret = -EPERM;
+ NL_SET_ERR_MSG(extack,
+ "Action is not allowed as entry action");
+ goto free_acts;
+ }
+ }
+
+ if (!replace) {
+ if ((!tb[P4TC_ENTRY_AGING] && tb[P4TC_ENTRY_DYNAMIC]) ||
+ (tb[P4TC_ENTRY_AGING] && !tb[P4TC_ENTRY_DYNAMIC])) {
+ NL_SET_ERR_MSG(extack,
+ "Aging may only be set alongside dynamic");
+ ret = -EINVAL;
+ goto free_acts;
+ }
+ }
+
+ if (tb[P4TC_ENTRY_AGING])
+ value->aging_ms = nla_get_u64(tb[P4TC_ENTRY_AGING]);
+
+ if (tb[P4TC_ENTRY_DYNAMIC])
+ value->is_dyn = true;
+
+ rcu_read_lock();
+ if (replace) {
+ ret = __p4tc_table_entry_update(pipeline, table, entry, mask,
+ whodunnit, true);
+ } else {
+ ret = __p4tc_table_entry_create(pipeline, table, entry, mask,
+ whodunnit, true);
+ if (set && ret == -EEXIST)
+ ret = __p4tc_table_entry_update(pipeline, table, entry,
+ mask, whodunnit, true);
+ }
+ rcu_read_unlock();
+ if (ret < 0) {
+ if ((replace || set) && ret == -EAGAIN)
+ NL_SET_ERR_MSG(extack,
+ "Entry was being updated in parallel");
+
+ if (ret == -ENOSPC)
+ NL_SET_ERR_MSG(extack, "Table max entries reached");
+ else
+ NL_SET_ERR_MSG(extack, "Failed to create/update entry");
+
+ goto free_acts;
+ }
+
+ return entry;
+
+free_acts:
+ p4tc_action_destroy(value->acts);
+
+free_entry:
+ kfree(entry);
+
+idr_rm:
+ if (!replace)
+ p4tc_table_entry_free_prio(table, prio);
+
+ return ERR_PTR(ret);
+}
+
+static int p4tc_table_entry_cu(struct net *net, struct sk_buff *skb,
+ u8 cu_flags, u16 *permissions,
+ struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ bool replace = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+ struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+ struct p4tc_table_entry_value *value;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table_entry *entry;
+ u32 *ids = nl_path_attrs->ids;
+ bool has_listener = !!skb;
+ struct p4tc_table *table;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+ extack);
+ if (ret < 0)
+ return ret;
+
+ ret = p4tc_table_entry_get_table(net, &pipeline, &table, tb,
+ nl_path_attrs, extack);
+ if (ret < 0)
+ return ret;
+
+ if (replace && tb[P4TC_ENTRY_TBL_ATTRS]) {
+ /* Table attributes update */
+ ret = update_tbl_attrs(net, table,
+ tb[P4TC_ENTRY_TBL_ATTRS],
+ extack);
+ goto table_put;
+ } else {
+ /* Table entry create or update */
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_WHODUNNIT)) {
+ NL_SET_ERR_MSG(extack,
+ "Must specify whodunnit attribute");
+ ret = -EINVAL;
+ goto table_put;
+ }
+ }
+
+ entry = __p4tc_table_entry_cu(net, cu_flags, tb, pipeline, table,
+ extack);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry);
+ goto table_put;
+ }
+
+ value = p4tc_table_entry_value(entry);
+ if (has_listener) {
+ if (p4tc_ctrl_pub_ok(value->permissions)) {
+ if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id,
+ P4TC_ENTITY_UNSPEC) <= 0)
+ NL_SET_ERR_MSG(extack,
+ "Unable to fill table entry attributes");
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname,
+ pipeline->common.name, PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+ }
+
+ *permissions = value->permissions;
+ }
+
+	/* We set it to zero on create and update to avoid having the entry
+ * deleted in parallel before we report to user space.
+ * We only set it to 1 here, after reporting.
+ */
+ refcount_set(&value->entries_ref, 1);
+
+table_put:
+ p4tc_table_entry_put_table(pipeline, table);
+ return ret;
+}
+
+struct p4tc_table_entry *
+p4tc_table_const_entry_cu(struct net *net,
+ struct nlattr *arg,
+ struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+ u8 cu_flags = P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry *entry;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+ extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_WHODUNNIT)) {
+ NL_SET_ERR_MSG(extack, "Must specify whodunnit attribute");
+ return ERR_PTR(-EINVAL);
+ }
+
+ entry = __p4tc_table_entry_cu(net, cu_flags, tb, pipeline, table,
+ extack);
+ if (IS_ERR(entry))
+ return entry;
+
+ value = p4tc_table_entry_value(entry);
+ refcount_set(&value->entries_ref, 1);
+
+ return entry;
+}
+
+static int p4tc_tbl_entry_get_1(struct net *net, struct sk_buff *skb,
+ struct nlattr *arg, u16 *permissions,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MAX + 1];
+ u32 *arg_ids;
+ int ret = 0;
+
+ ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+ NL_SET_ERR_MSG(extack, "Must specify object path");
+ return -EINVAL;
+ }
+
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PARAMS)) {
+ NL_SET_ERR_MSG(extack, "Must specify parameters");
+ return -EINVAL;
+ }
+
+ arg_ids = nla_data(tb[P4TC_PATH]);
+ memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], arg_ids,
+ nla_len(tb[P4TC_PATH]));
+
+ return p4tc_table_entry_gd(net, skb, false, permissions,
+ tb[P4TC_PARAMS], nl_path_attrs, extack);
+}
+
+static int p4tc_tbl_entry_del_1(struct net *net, struct sk_buff *skb,
+ bool flush, u16 *permissions,
+ struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MAX + 1];
+ u32 *arg_ids;
+ int ret = 0;
+
+ ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+ NL_SET_ERR_MSG(extack, "Must specify object path");
+ return -EINVAL;
+ }
+
+ arg_ids = nla_data(tb[P4TC_PATH]);
+ memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], arg_ids,
+ nla_len(tb[P4TC_PATH]));
+
+ if (flush) {
+ ret = p4tc_table_entry_flush(net, skb, tb[P4TC_PARAMS],
+ nl_path_attrs, extack);
+ } else {
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PARAMS)) {
+ NL_SET_ERR_MSG(extack, "Must specify parameters");
+ return -EINVAL;
+ }
+ ret = p4tc_table_entry_gd(net, skb, true, permissions,
+ tb[P4TC_PARAMS], nl_path_attrs,
+ extack);
+ }
+
+ return ret;
+}
+
+static int p4tc_tbl_entry_cu_1(struct net *net, struct sk_buff *skb,
+ u8 cu_flags, u16 *permissions,
+ struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MAX + 1];
+ u32 *tbl_id;
+ int ret = 0;
+
+ ret = nla_parse_nested(tb, P4TC_MAX, nla, p4tc_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PATH)) {
+ NL_SET_ERR_MSG(extack, "Must specify object path");
+ return -EINVAL;
+ }
+
+ if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PARAMS)) {
+ NL_SET_ERR_MSG(extack, "Must specify object attributes");
+ return -EINVAL;
+ }
+
+ tbl_id = nla_data(tb[P4TC_PATH]);
+ memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], tbl_id,
+ nla_len(tb[P4TC_PATH]));
+
+ return p4tc_table_entry_cu(net, skb, cu_flags, permissions,
+ tb[P4TC_PARAMS], nl_path_attrs, extack);
+}
+
+static int __p4tc_tbl_entry_crud(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, int cmd, char *p_name,
+ struct nlattr *p4tca[],
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct p4tc_path_nlattrs nl_path_attrs = {0};
+ u32 portid = NETLINK_CB(skb).portid;
+ u16 permissions = P4TC_CTRL_PERM_P;
+ u32 ids[P4TC_PATH_MAX] = { 0 };
+ int i, num_pub_permission = 0;
+ int ret = 0, ret_send;
+ struct p4tcmsg *t_new;
+ struct sk_buff *nskb;
+ struct nlmsghdr *nlh;
+ struct nlattr *pn_att;
+ struct nlattr *root;
+
+ nskb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+ if (unlikely(!nskb))
+ return -ENOBUFS;
+
+ nlh = nlmsg_put(nskb, portid, n->nlmsg_seq, cmd, sizeof(*t),
+ n->nlmsg_flags);
+ if (unlikely(!nlh))
+ goto out;
+
+ t_new = nlmsg_data(nlh);
+ t_new->pipeid = t->pipeid;
+ t_new->obj = t->obj;
+ ids[P4TC_PID_IDX] = t_new->pipeid;
+ nl_path_attrs.ids = ids;
+
+ pn_att = nla_reserve(nskb, P4TC_ROOT_PNAME, PIPELINENAMSIZ);
+ if (unlikely(!pn_att)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ nl_path_attrs.pname = nla_data(pn_att);
+ if (!p_name) {
+ /* Filled up by the operation or forced failure */
+ memset(nl_path_attrs.pname, 0, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = false;
+ } else {
+ strscpy(nl_path_attrs.pname, p_name, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = true;
+ }
+
+ root = nla_nest_start(nskb, P4TC_ROOT);
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i]; i++) {
+ struct nlattr *nest = nla_nest_start(nskb, i);
+
+ if (cmd == RTM_P4TC_GET)
+ ret = p4tc_tbl_entry_get_1(net, nskb, p4tca[i],
+ &permissions, &nl_path_attrs,
+ extack);
+ else if (cmd == RTM_P4TC_CREATE ||
+ cmd == RTM_P4TC_UPDATE) {
+ u8 cu_flags;
+
+ if (cmd == RTM_P4TC_UPDATE)
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+ else
+ if (n->nlmsg_flags & NLM_F_REPLACE)
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_SET;
+ else
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+
+ ret = p4tc_tbl_entry_cu_1(net, nskb, cu_flags,
+ &permissions,
+ p4tca[i], &nl_path_attrs,
+ extack);
+ } else if (cmd == RTM_P4TC_DEL) {
+ bool flush = nlh->nlmsg_flags & NLM_F_ROOT;
+
+ ret = p4tc_tbl_entry_del_1(net, nskb, flush,
+ &permissions, p4tca[i],
+ &nl_path_attrs, extack);
+ }
+
+ if (p4tc_ctrl_pub_ok(permissions)) {
+ num_pub_permission++;
+ } else {
+ nla_nest_cancel(nskb, nest);
+ continue;
+ }
+
+ if (ret < 0) {
+ if (i == 1) {
+ goto out;
+ } else {
+ nla_nest_cancel(nskb, nest);
+ break;
+ }
+ }
+ nla_nest_end(nskb, nest);
+ }
+ nla_nest_end(nskb, root);
+
+ if (!t_new->pipeid)
+ t_new->pipeid = ids[P4TC_PID_IDX];
+
+ nlmsg_end(nskb, nlh);
+
+ if (cmd == RTM_P4TC_GET) {
+ ret_send = rtnl_unicast(nskb, net, portid);
+ } else if (num_pub_permission) {
+ ret_send = rtnetlink_send(nskb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+ } else {
+ ret_send = 0;
+ kfree_skb(nskb);
+ }
+
+ return ret_send ? ret_send : ret;
+
+out:
+ kfree_skb(nskb);
+ return ret;
+}
+
+static int __p4tc_tbl_entry_crud_fast(struct net *net, struct nlmsghdr *n,
+ int cmd, char *p_name,
+ struct nlattr *p4tca[],
+ struct netlink_ext_ack *extack)
+{
+ struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ struct p4tc_path_nlattrs nl_path_attrs = {0};
+ u32 ids[P4TC_PATH_MAX] = { 0 };
+ int ret = 0;
+ int i;
+
+ ids[P4TC_PID_IDX] = t->pipeid;
+ nl_path_attrs.ids = ids;
+
+ /* Only read for searching the pipeline */
+ nl_path_attrs.pname = p_name;
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i]; i++) {
+ if (cmd == RTM_P4TC_CREATE ||
+ cmd == RTM_P4TC_UPDATE) {
+ u8 cu_flags;
+
+ if (cmd == RTM_P4TC_UPDATE)
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+ else
+ if (n->nlmsg_flags & NLM_F_REPLACE)
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_SET;
+ else
+ cu_flags = P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+
+ ret = p4tc_tbl_entry_cu_1(net, NULL, cu_flags, NULL,
+ p4tca[i], &nl_path_attrs,
+ extack);
+ } else if (cmd == RTM_P4TC_DEL) {
+ bool flush = n->nlmsg_flags & NLM_F_ROOT;
+
+ ret = p4tc_tbl_entry_del_1(net, NULL, flush, NULL,
+ p4tca[i], &nl_path_attrs,
+ extack);
+ }
+
+ if (ret < 0)
+ goto out;
+ }
+
+out:
+ return ret;
+}
+
+int p4tc_tbl_entry_crud(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, int cmd,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *p4tca[P4TC_MSGBATCH_SIZE + 1];
+ int echo = n->nlmsg_flags & NLM_F_ECHO;
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ char *p_name = NULL;
+ int listeners;
+ int ret = 0;
+
+ ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(extack, "Netlink P4TC table attributes missing");
+ return -EINVAL;
+ }
+
+ ret = nla_parse_nested(p4tca, P4TC_MSGBATCH_SIZE, tb[P4TC_ROOT], NULL,
+ extack);
+ if (ret < 0)
+ return ret;
+
+ if (!p4tca[1]) {
+ NL_SET_ERR_MSG(extack, "No elements in root table array");
+ return -EINVAL;
+ }
+
+ if (tb[P4TC_ROOT_PNAME])
+ p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ listeners = rtnl_has_listeners(net, RTNLGRP_TC);
+
+ if ((echo || listeners) || cmd == RTM_P4TC_GET)
+ ret = __p4tc_tbl_entry_crud(net, skb, n, cmd, p_name, p4tca,
+ extack);
+ else
+ ret = __p4tc_tbl_entry_crud_fast(net, n, cmd, p_name, p4tca,
+ extack);
+ return ret;
+}
+
+static int p4tc_table_entry_dump(struct net *net, struct sk_buff *skb,
+ struct nlattr *arg,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_callback *cb,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+ struct p4tc_dump_ctx *ctx = (void *)cb->ctx;
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_pipeline *pipeline = NULL;
+ struct p4tc_table_entry *entry = NULL;
+ struct p4tc_table *table;
+ int i = 0;
+ int ret;
+
+ if (arg) {
+ ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg,
+ p4tc_entry_policy, extack);
+ if (ret < 0) {
+ kfree(ctx->iter);
+ return ret;
+ }
+ }
+
+ ret = p4tc_table_entry_get_table(net, &pipeline, &table, tb,
+ nl_path_attrs, extack);
+ if (ret < 0) {
+ kfree(ctx->iter);
+ return ret;
+ }
+
+ if (!ctx->iter) {
+ ctx->iter = kzalloc(sizeof(*ctx->iter), GFP_KERNEL);
+ if (!ctx->iter) {
+ ret = -ENOMEM;
+ goto table_put;
+ }
+
+ rhltable_walk_enter(&table->tbl_entries, ctx->iter);
+ }
+
+ /* There is an issue here regarding the stability of walking an
+ * rhashtable. If an insert or a delete happens in parallel, we may see
+ * duplicate entries or skip some valid entries. To solve this we are
+ * going to have an auxiliary list that also stores the entries and will
+	 * be used for dump, instead of walking over the rhashtable.
+ */
+ ret = -ENOMEM;
+ rhashtable_walk_start(ctx->iter);
+ do {
+ for (i = 0; i < P4TC_MSGBATCH_SIZE &&
+ (entry = rhashtable_walk_next(ctx->iter)) &&
+ !IS_ERR(entry); i++) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry);
+ struct nlattr *count;
+
+ if (!p4tc_ctrl_read_ok(value->permissions)) {
+ i--;
+ continue;
+ }
+
+ count = nla_nest_start(skb, i + 1);
+ if (!count) {
+ rhashtable_walk_stop(ctx->iter);
+ goto table_put;
+ }
+
+ ret = p4tc_tbl_entry_fill(skb, table, entry,
+ table->tbl_id,
+ P4TC_ENTITY_UNSPEC);
+ if (ret == 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for table entry");
+ goto walk_done;
+ } else if (ret == -ENOMEM) {
+ ret = 1;
+ nla_nest_cancel(skb, count);
+ rhashtable_walk_stop(ctx->iter);
+ goto table_put;
+ }
+ nla_nest_end(skb, count);
+ }
+ } while (entry == ERR_PTR(-EAGAIN));
+ rhashtable_walk_stop(ctx->iter);
+
+ if (!i) {
+ rhashtable_walk_exit(ctx->iter);
+
+ ret = 0;
+ kfree(ctx->iter);
+
+ goto table_put;
+ }
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!nl_path_attrs->ids[P4TC_PID_IDX])
+ nl_path_attrs->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!nl_path_attrs->ids[P4TC_TBLID_IDX])
+ nl_path_attrs->ids[P4TC_TBLID_IDX] = table->tbl_id;
+
+ ret = skb->len;
+
+ goto table_put;
+
+walk_done:
+ rhashtable_walk_stop(ctx->iter);
+ rhashtable_walk_exit(ctx->iter);
+ kfree(ctx->iter);
+
+ nlmsg_trim(skb, b);
+
+table_put:
+ p4tc_table_entry_put_table(pipeline, table);
+
+ return ret;
+}
+
+int p4tc_tbl_entry_dumpit(struct net *net, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct nlattr *arg, char *p_name)
+{
+ struct p4tc_path_nlattrs nl_path_attrs = {0};
+ struct netlink_ext_ack *extack = cb->extack;
+ u32 portid = NETLINK_CB(cb->skb).portid;
+ const struct nlmsghdr *n = cb->nlh;
+ struct nlattr *tb[P4TC_MAX + 1];
+ u32 ids[P4TC_PATH_MAX] = { 0 };
+ struct p4tcmsg *t_new;
+ struct nlmsghdr *nlh;
+ struct nlattr *pnatt;
+ struct nlattr *root;
+ struct p4tcmsg *t;
+ u32 *arg_ids;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ nlh = nlmsg_put(skb, portid, n->nlmsg_seq, RTM_P4TC_GET, sizeof(*t),
+ n->nlmsg_flags);
+ if (!nlh)
+ return -ENOSPC;
+
+ t = (struct p4tcmsg *)nlmsg_data(n);
+ t_new = nlmsg_data(nlh);
+ t_new->pipeid = t->pipeid;
+ t_new->obj = t->obj;
+
+ if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+ NL_SET_ERR_MSG(extack, "Must specify object path");
+ return -EINVAL;
+ }
+
+ pnatt = nla_reserve(skb, P4TC_ROOT_PNAME, PIPELINENAMSIZ);
+ if (!pnatt)
+ return -ENOMEM;
+
+ ids[P4TC_PID_IDX] = t_new->pipeid;
+ arg_ids = nla_data(tb[P4TC_PATH]);
+ memcpy(&ids[P4TC_TBLID_IDX], arg_ids, nla_len(tb[P4TC_PATH]));
+ nl_path_attrs.ids = ids;
+
+ nl_path_attrs.pname = nla_data(pnatt);
+ if (!p_name) {
+ /* Filled up by the operation or forced failure */
+ memset(nl_path_attrs.pname, 0, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = false;
+ } else {
+ strscpy(nl_path_attrs.pname, p_name, PIPELINENAMSIZ);
+ nl_path_attrs.pname_passed = true;
+ }
+
+ root = nla_nest_start(skb, P4TC_ROOT);
+ ret = p4tc_table_entry_dump(net, skb, tb[P4TC_PARAMS], &nl_path_attrs,
+ cb, extack);
+ if (ret <= 0)
+ goto out;
+ nla_nest_end(skb, root);
+
+ if (nl_path_attrs.pname) {
+ if (nla_put_string(skb, P4TC_ROOT_PNAME, nl_path_attrs.pname)) {
+ ret = -1;
+ goto out;
+ }
+ }
+
+ if (!t_new->pipeid)
+ t_new->pipeid = ids[P4TC_PID_IDX];
+
+ nlmsg_end(skb, nlh);
+
+ return skb->len;
+
+out:
+ nlmsg_cancel(skb, nlh);
+ return ret;
+}
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index e52c06c5e..2064dfaf1 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -27,12 +27,12 @@
#include <net/netlink.h>
#include <net/flow_offload.h>
-static const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
+const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
[P4TC_ROOT] = { .type = NLA_NESTED },
[P4TC_ROOT_PNAME] = { .type = NLA_STRING, .len = PIPELINENAMSIZ },
};
-static const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
+const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
[P4TC_PATH] = { .type = NLA_BINARY,
.len = P4TC_PATH_MAX * sizeof(u32) },
[P4TC_PARAMS] = { .type = NLA_NESTED },
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index e50a1c1ff..da7902404 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -98,6 +98,10 @@ static const struct nlmsg_perm nlmsg_route_perms[] = {
{ RTM_DELP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
{ RTM_GETP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_READ },
{ RTM_UPDATEP4TEMPLATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_P4TC_CREATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_P4TC_DEL, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+ { RTM_P4TC_GET, NETLINK_ROUTE_SOCKET__NLMSG_READ },
+ { RTM_P4TC_UPDATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
};
static const struct nlmsg_perm nlmsg_tcpdiag_perms[] = {
@@ -181,7 +185,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
* structures at the top of this file with the new mappings
* before updating the BUILD_BUG_ON() macro!
*/
- BUILD_BUG_ON(RTM_MAX != (RTM_CREATEP4TEMPLATE + 3));
+ BUILD_BUG_ON(RTM_MAX != (RTM_P4TC_CREATE + 3));
err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
sizeof(nlmsg_route_perms));
break;
--
2.34.1
* [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (11 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 12/15] p4tc: add runtime table entry create, update, get, delete, " Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-17 7:09 ` John Fastabend
` (2 more replies)
2023-11-16 14:59 ` [PATCH net-next v8 14/15] p4tc: add P4 classifier Jamal Hadi Salim
` (2 subsequent siblings)
15 siblings, 3 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
We add an initial set of kfuncs to allow eBPF programs to interact with
the P4TC domain; a usage sketch follows the kfunc list below.
- bpf_p4tc_tbl_read: Used to look up a table entry from a BPF
program installed in TC. To find the table entry we take in an skb, the
pipeline ID, the table ID, a key and a key size.
We use the skb to get the network namespace structure where all the
pipelines are stored. After that we use the pipeline ID and the table
ID to find the table. We then use the key to search for the entry.
We return an entry on success and NULL on failure.
- xdp_p4tc_tbl_read: Used to look up a table entry from a BPF
program installed in XDP. To find the table entry we take in an xdp_md,
the pipeline ID, the table ID, a key and a key size.
We use struct xdp_md to get the network namespace structure where all
the pipelines are stored. After that we use the pipeline ID and the table
ID to find the table. We then use the key to search for the entry.
We return an entry on success and NULL on failure.
- bpf_p4tc_entry_create: Used to create a table entry from a BPF
program installed in TC. To create the table entry we take an skb, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.
- xdp_p4tc_entry_create: Used to create a table entry from a BPF
program installed in XDP. To create the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.
- bpf_p4tc_entry_create_on_miss: Conforms to PNA "add on miss".
It first does a lookup using the passed key and, upon a miss, adds the
entry to the table.
We return 0 on success and a negative errno on failure.
- xdp_p4tc_entry_create_on_miss: Conforms to PNA "add on miss".
It first does a lookup using the passed key and, upon a miss, adds the
entry to the table.
We return 0 on success and a negative errno on failure.
- bpf_p4tc_entry_update: Used to update a table entry from a BPF
program installed in TC. To update the table entry we take an skb, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the updated entry.
We return 0 on success and a negative errno on failure.
- xdp_p4tc_entry_update: Used to update a table entry from a BPF
program installed in XDP. To update the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the updated entry.
We return 0 on success and a negative errno on failure.
- bpf_p4tc_entry_delete: Used to delete a table entry from a BPF
program installed in TC. To delete the table entry we take an skb, the
pipeline ID, the table ID, a key and a key size.
We return 0 on success and a negative errno on failure.
- xdp_p4tc_entry_delete: Used to delete a table entry from a BPF
program installed in XDP. To delete the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and a key size.
We return 0 on success and a negative errno on failure.
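To illustrate the calling convention, here is a minimal sketch (not part
of this patch) of a TC-level eBPF program using two of these kfuncs. All
IDs, the section name and the key struct are assumptions for illustration:
in practice the P4 compiler emits the key layout (mirroring the kernel's
struct p4tc_table_entry_key up to its flexible key member) and the
pipeline/table/action IDs come from the installed template. The xdp_*
variants are used the same way, except that they take a struct xdp_md
context.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
/* Local mirrors of the kernel structs added in this patch; with
 * vmlinux.h these definitions would come from kernel BTF instead.
 */
struct p4tc_table_entry_act_bpf_params {
	__u32 pipeid;
	__u32 tblid;
};
struct p4tc_table_entry_create_bpf_params {
	__u64 aging_ms;
	__u32 pipeid;
	__u32 tblid;
};
struct p4tc_table_entry_act_bpf {
	__u32 act_id;
	__u8 params[124];	/* P4TC_MAX_PARAM_DATA_SIZE */
} __attribute__((packed));
/* Hypothetical key layout: header fields followed by the raw key bits */
struct simple_l3_tbl_key {
	__u32 keysz;		/* filled in by the kfunc */
	__u32 maskid;
	__be32 dst_ip;		/* exact-match key */
} __attribute__((packed));
extern struct p4tc_table_entry_act_bpf *
bpf_p4tc_tbl_read(struct __sk_buff *skb,
		  struct p4tc_table_entry_act_bpf_params *params,
		  void *key, const __u32 key__sz) __ksym;
extern int
bpf_p4tc_entry_create_on_miss(struct __sk_buff *skb,
			      struct p4tc_table_entry_create_bpf_params *params,
			      void *key, const __u32 key__sz,
			      struct p4tc_table_entry_act_bpf *act_bpf) __ksym;
SEC("prog/tc-ingress")
int p4tc_ingress(struct __sk_buff *skb)
{
	struct p4tc_table_entry_act_bpf_params parms = {
		.pipeid = 1,	/* assumption: IDs assigned by the template */
		.tblid = 1,
	};
	struct p4tc_table_entry_create_bpf_params cparms = {
		.aging_ms = 30000,	/* arbitrary example aging */
		.pipeid = 1,
		.tblid = 1,
	};
	struct p4tc_table_entry_act_bpf miss_act = {
		.act_id = 1,	/* assumption: action ID from the template */
	};
	struct simple_l3_tbl_key key = {
		.dst_ip = bpf_htonl(0x0a000001),	/* 10.0.0.1 */
	};
	struct p4tc_table_entry_act_bpf *act;
	act = bpf_p4tc_tbl_read(skb, &parms, &key, sizeof(key));
	if (!act)
		/* Miss and no default miss action: PNA-style add on miss */
		bpf_p4tc_entry_create_on_miss(skb, &cparms, &key,
					      sizeof(key), &miss_act);
	/* On a hit, act->act_id and act->params parameterize the
	 * generated action code.
	 */
	return TC_ACT_OK;
}
char _license[] SEC("license") = "GPL";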
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/linux/bitops.h | 1 +
include/net/p4tc.h | 58 +++++-
include/net/tc_act/p4tc.h | 24 +++
include/uapi/linux/p4tc.h | 2 +
net/sched/p4tc/Makefile | 1 +
net/sched/p4tc/p4tc_action.c | 69 ++++++-
net/sched/p4tc/p4tc_bpf.c | 337 ++++++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_pipeline.c | 47 ++++-
net/sched/p4tc/p4tc_table.c | 8 +
net/sched/p4tc/p4tc_tbl_entry.c | 288 ++++++++++++++++++++++++++-
net/sched/p4tc/p4tc_tmpl_api.c | 2 +
11 files changed, 828 insertions(+), 9 deletions(-)
create mode 100644 net/sched/p4tc/p4tc_bpf.c
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2ba557e06..290c2399a 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -19,6 +19,7 @@
#define BITS_TO_LONGS(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
#define BITS_TO_U64(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u64))
#define BITS_TO_U32(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u32))
+#define BITS_TO_U16(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u16))
#define BITS_TO_BYTES(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(char))
extern unsigned int __sw_hweight8(unsigned int w);
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 24f8b4873..e25eaa4ac 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -91,8 +91,26 @@ struct p4tc_pipeline {
u8 p_state;
};
+#define P4TC_PIPELINE_MAX_ARRAY 32
+
+struct p4tc_tbl_cache_key {
+ u32 pipeid;
+ u32 tblid;
+};
+
+extern const struct rhashtable_params tbl_cache_ht_params;
+
+struct p4tc_table;
+
+int p4tc_tbl_cache_insert(struct net *net, u32 pipeid, struct p4tc_table *table);
+void p4tc_tbl_cache_remove(struct net *net, struct p4tc_table *table);
+struct p4tc_table *p4tc_tbl_cache_lookup(struct net *net, u32 pipeid, u32 tblid);
+
+#define P4TC_TBLS_CACHE_SIZE 32
+
struct p4tc_pipeline_net {
- struct idr pipeline_idr;
+ struct list_head tbls_cache[P4TC_TBLS_CACHE_SIZE];
+ struct idr pipeline_idr;
};
static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
@@ -208,6 +226,7 @@ struct p4tc_table_perm {
struct p4tc_table {
struct p4tc_template_common common;
+ struct list_head tbl_cache_node;
struct list_head tbl_acts_list;
struct idr tbl_masks_idr;
struct ida tbl_prio_idr;
@@ -302,6 +321,17 @@ extern const struct p4tc_template_ops p4tc_act_ops;
extern const struct rhashtable_params entry_hlt_params;
+struct p4tc_table_entry_act_bpf_params {
+ u32 pipeid;
+ u32 tblid;
+};
+
+struct p4tc_table_entry_create_bpf_params {
+ u64 aging_ms;
+ u32 pipeid;
+ u32 tblid;
+};
+
struct p4tc_table_entry;
struct p4tc_table_entry_work {
struct work_struct work;
@@ -352,6 +382,13 @@ struct p4tc_table_entry {
/* fallthrough: key data + value */
};
+struct p4tc_entry_key_bpf {
+ void *key;
+ void *mask;
+ u32 key_sz;
+ u32 mask_sz;
+};
+
#define P4TC_KEYSZ_BYTES(bits) (round_up(BITS_TO_BYTES(bits), 8))
#define ENTRY_KEY_OFFSET (offsetof(struct p4tc_table_entry_key, fa_key))
@@ -380,6 +417,25 @@ struct p4tc_table_entry *
p4tc_table_entry_lookup_direct(struct p4tc_table *table,
struct p4tc_table_entry_key *key);
+struct p4tc_table_entry_act_bpf *
+p4tc_table_entry_create_act_bpf(struct tc_action *action,
+ struct netlink_ext_ack *extack);
+int register_p4tc_tbl_bpf(void);
+int p4tc_table_entry_create_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_act_bpf *act_bpf,
+ u64 aging_ms);
+int p4tc_table_entry_update_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_act_bpf *act_bpf,
+ u64 aging_ms);
+
+int p4tc_table_entry_del_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key);
+
static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
struct tc_action *acts[], u32 pipeid,
u32 flags, struct netlink_ext_ack *extack)
diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
index 6447fe5ce..ca925d112 100644
--- a/include/net/tc_act/p4tc.h
+++ b/include/net/tc_act/p4tc.h
@@ -14,10 +14,23 @@ struct tcf_p4act_params {
u32 tot_params_sz;
};
+#define P4TC_MAX_PARAM_DATA_SIZE 124
+
+struct p4tc_table_entry_act_bpf {
+ u32 act_id;
+ u8 params[P4TC_MAX_PARAM_DATA_SIZE];
+} __packed;
+
+struct p4tc_table_entry_act_bpf_kern {
+ struct rcu_head rcu;
+ struct p4tc_table_entry_act_bpf act_bpf;
+};
+
struct tcf_p4act {
struct tc_action common;
/* Params IDR reference passed during runtime */
struct tcf_p4act_params __rcu *params;
+ struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
u32 p_id;
u32 act_id;
struct list_head node;
@@ -25,4 +38,15 @@ struct tcf_p4act {
#define to_p4act(a) ((struct tcf_p4act *)a)
+static inline struct p4tc_table_entry_act_bpf *
+p4tc_table_entry_act_bpf(struct tc_action *action)
+{
+ struct p4tc_table_entry_act_bpf_kern *act_bpf;
+ struct tcf_p4act *p4act = to_p4act(action);
+
+ act_bpf = rcu_dereference(p4act->act_bpf);
+
+ return &act_bpf->act_bpf;
+}
+
#endif /* __NET_TC_ACT_P4_H */
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index e87f0c8b9..a2a39303f 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -17,6 +17,8 @@ struct p4tcmsg {
#define P4TC_MINTABLES_COUNT 0
#define P4TC_MSGBATCH_SIZE 16
+#define P4TC_ACT_MAX_NUM_PARAMS P4TC_MSGBATCH_SIZE
+
#define P4TC_MAX_KEYSZ 512
#define P4TC_DEFAULT_NUM_PREALLOC 16
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 921909ac4..3fed9a853 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -3,3 +3,4 @@
obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
p4tc_runtime_api.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
index 4912a6a11..aaed03418 100644
--- a/net/sched/p4tc/p4tc_action.c
+++ b/net/sched/p4tc/p4tc_action.c
@@ -282,29 +282,83 @@ static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
tcf_p4_act_params_destroy(params);
}
+static struct p4tc_table_entry_act_bpf_kern *
+p4tc_create_act_bpf(struct tcf_p4act *p4act,
+ struct tcf_p4act_params *act_params,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_act_param *params[P4TC_ACT_MAX_NUM_PARAMS];
+ struct p4tc_table_entry_act_bpf_kern *act_bpf;
+ struct p4tc_act_param *param;
+ unsigned long param_id, tmp;
+ size_t tot_params_sz = 0;
+ u8 *params_cursor;
+ int nparams = 0;
+ int i;
+
+ act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
+ if (!act_bpf)
+ return ERR_PTR(-ENOMEM);
+
+ idr_for_each_entry_ul(&act_params->params_idr, param, tmp, param_id) {
+ const struct p4tc_type *type = param->type;
+
+ if (tot_params_sz > P4TC_MAX_PARAM_DATA_SIZE) {
+ NL_SET_ERR_MSG(extack, "Maximum parameter byte size reached");
+ kfree(act_bpf);
+ return ERR_PTR(-EINVAL);
+ }
+
+ tot_params_sz += BITS_TO_BYTES(type->container_bitsz);
+ params[nparams++] = param;
+ }
+
+ act_bpf->act_bpf.act_id = p4act->act_id;
+ params_cursor = act_bpf->act_bpf.params;
+ for (i = 0; i < nparams; i++) {
+ u32 type_bytesz;
+
+ param = params[i];
+ type_bytesz = BITS_TO_BYTES(param->type->container_bitsz);
+ memcpy(params_cursor, param->value, type_bytesz);
+ params_cursor += type_bytesz;
+ }
+
+ return act_bpf;
+}
+
static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
struct tcf_p4act_params *params,
struct tcf_chain *goto_ch,
struct tc_act_dyna *parm, bool exists,
struct netlink_ext_ack *extack)
{
+ struct p4tc_table_entry_act_bpf_kern *act_bpf = NULL, *act_bpf_old;
struct tcf_p4act_params *params_old;
struct tcf_p4act *p;
p = to_p4act(*a);
+ if (!((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)) {
+ act_bpf = p4tc_create_act_bpf(p, params, extack);
+ if (IS_ERR(act_bpf))
+ return PTR_ERR(act_bpf);
+ }
+
/* sparse is fooled by lock under conditionals.
- * To avoid false positives, we are repeating these two lines in both
-	 * To avoid false positives, we are repeating these three lines in both
* branches of the if-statement
*/
if (exists) {
spin_lock_bh(&p->tcf_lock);
goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
params_old = rcu_replace_pointer(p->params, params, 1);
+ act_bpf_old = rcu_replace_pointer(p->act_bpf, act_bpf, 1);
spin_unlock_bh(&p->tcf_lock);
} else {
goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
params_old = rcu_replace_pointer(p->params, params, 1);
+ act_bpf_old = rcu_replace_pointer(p->act_bpf, act_bpf, 1);
}
if (goto_ch)
@@ -313,6 +367,9 @@ static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
if (params_old)
call_rcu(¶ms_old->rcu, tcf_p4_act_params_destroy_rcu);
+ if (act_bpf_old)
+ kfree_rcu(act_bpf_old, rcu);
+
return 0;
}
@@ -509,6 +566,7 @@ void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
struct tcf_p4act *p4act)
{
+ struct p4tc_table_entry_act_bpf_kern *act_bpf_old;
struct tcf_p4act_params *p4act_params;
struct p4tc_act_param *param;
unsigned long param_id, tmp;
@@ -527,6 +585,10 @@ static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
p4act->common.tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
spin_unlock_bh(&p4act->tcf_lock);
+ act_bpf_old = rcu_replace_pointer(p4act->act_bpf, NULL, 1);
+ if (act_bpf_old)
+ kfree_rcu(act_bpf_old, rcu);
+
spin_lock_bh(&act->list_lock);
list_add_tail(&p4act->node, &act->prealloc_list);
spin_unlock_bh(&act->list_lock);
@@ -1274,16 +1336,21 @@ static int tcf_p4_dyna_walker(struct net *net, struct sk_buff *skb,
static void tcf_p4_dyna_cleanup(struct tc_action *a)
{
struct tc_action_ops *ops = (struct tc_action_ops *)a->ops;
+ struct p4tc_table_entry_act_bpf_kern *act_bpf;
struct tcf_p4act *m = to_p4act(a);
struct tcf_p4act_params *params;
params = rcu_dereference_protected(m->params, 1);
+ act_bpf = rcu_dereference_protected(m->act_bpf, 1);
if (refcount_read(&ops->dyn_ref) > 1)
refcount_dec(&ops->dyn_ref);
if (params)
call_rcu(¶ms->rcu, tcf_p4_act_params_destroy_rcu);
+
+ if (act_bpf)
+ kfree_rcu(act_bpf, rcu);
}
static struct p4tc_act *
diff --git a/net/sched/p4tc/p4tc_bpf.c b/net/sched/p4tc/p4tc_bpf.c
new file mode 100644
index 000000000..fe84c1504
--- /dev/null
+++ b/net/sched/p4tc/p4tc_bpf.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/btf_ids.h>
+#include <linux/net_namespace.h>
+#include <net/p4tc.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/xdp.h>
+
+BTF_ID_LIST(btf_p4tc_ids)
+BTF_ID(struct, p4tc_table_entry_act_bpf)
+BTF_ID(struct, p4tc_table_entry_act_bpf_params)
+BTF_ID(struct, p4tc_table_entry_act_bpf)
+BTF_ID(struct, p4tc_table_entry_create_bpf_params)
+
+static struct p4tc_table_entry_act_bpf dummy_act_bpf = {};
+
+static struct p4tc_table_entry_act_bpf *
+__bpf_p4tc_tbl_read(struct net *caller_net,
+ struct p4tc_table_entry_act_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct p4tc_table_entry_key *entry_key = key;
+ struct p4tc_table_entry_value *value;
+ struct p4tc_table_entry *entry;
+ struct p4tc_table *table;
+ u32 pipeid;
+ u32 tblid;
+
+ if (!params || !key)
+ return NULL;
+
+ if (key__sz <= ENTRY_KEY_OFFSET)
+ return NULL;
+
+ pipeid = params->pipeid;
+ tblid = params->tblid;
+
+ entry_key->keysz = (key__sz - ENTRY_KEY_OFFSET) << 3;
+
+ table = p4tc_tbl_cache_lookup(caller_net, pipeid, tblid);
+ if (!table)
+ return NULL;
+
+ entry = p4tc_table_entry_lookup_direct(table, entry_key);
+ if (!entry) {
+ struct p4tc_table_defact *defact;
+
+ defact = rcu_dereference(table->tbl_default_missact);
+ return defact ?
+ p4tc_table_entry_act_bpf(defact->default_acts[0]) : NULL;
+ }
+
+ value = p4tc_table_entry_value(entry);
+
+ return value->acts ?
+ p4tc_table_entry_act_bpf(value->acts[0]) : &dummy_act_bpf;
+}
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in vmlinux BTF");
+__bpf_kfunc struct p4tc_table_entry_act_bpf *
+bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,
+ struct p4tc_table_entry_act_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *caller_net;
+
+ caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
+}
+
+__bpf_kfunc struct p4tc_table_entry_act_bpf *
+xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
+ struct p4tc_table_entry_act_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *caller_net;
+
+ caller_net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
+}
+
+static int
+__bpf_p4tc_entry_create(struct net *net,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct p4tc_table_entry_key *entry_key = key;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table *table;
+
+ if (!params || !key)
+ return -EINVAL;
+
+ if (key__sz <= ENTRY_KEY_OFFSET)
+ return -EINVAL;
+
+ pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+ if (!pipeline)
+ return -ENOENT;
+
+ table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+ if (!table)
+ return -ENOENT;
+
+ entry_key->keysz = (key__sz - ENTRY_KEY_OFFSET) << 3;
+
+ return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
+ params->aging_ms);
+}
+
+__bpf_kfunc int
+bpf_p4tc_entry_create(struct __sk_buff *skb_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc int
+xdp_p4tc_entry_create(struct xdp_md *xdp_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc int
+bpf_p4tc_entry_create_on_miss(struct __sk_buff *skb_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc int
+xdp_p4tc_entry_create_on_miss(struct xdp_md *xdp_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+static int
+__bpf_p4tc_entry_update(struct net *net,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct p4tc_table_entry_key *entry_key = key;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table *table;
+
+ if (!params || !key)
+ return -EINVAL;
+
+ if (key__sz <= ENTRY_KEY_OFFSET)
+ return -EINVAL;
+
+ pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+ if (!pipeline)
+ return -ENOENT;
+
+ table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+ if (!table)
+ return -ENOENT;
+
+ entry_key->keysz = (key__sz - ENTRY_KEY_OFFSET) << 3;
+
+ return p4tc_table_entry_update_bpf(pipeline, table, entry_key,
+ act_bpf, params->aging_ms);
+}
+
+__bpf_kfunc int
+bpf_p4tc_entry_update(struct __sk_buff *skb_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_entry_update(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc int
+xdp_p4tc_entry_update(struct xdp_md *xdp_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_entry_update(net, params, key, key__sz, act_bpf);
+}
+
+static int
+__bpf_p4tc_entry_delete(struct net *net,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct p4tc_table_entry_key *entry_key = key;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_table *table;
+
+ if (!params || !key)
+ return -EINVAL;
+
+ if (key__sz <= ENTRY_KEY_OFFSET)
+ return -EINVAL;
+
+ pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+ if (!pipeline)
+ return -ENOENT;
+
+ table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+ if (!table)
+ return -ENOENT;
+
+ entry_key->keysz = (key__sz - ENTRY_KEY_OFFSET) << 3;
+
+ return p4tc_table_entry_del_bpf(pipeline, table, entry_key);
+}
+
+__bpf_kfunc int
+bpf_p4tc_entry_delete(struct __sk_buff *skb_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_entry_delete(net, params, key, key__sz);
+}
+
+__bpf_kfunc int
+xdp_p4tc_entry_delete(struct xdp_md *xdp_ctx,
+ struct p4tc_table_entry_create_bpf_params *params,
+ void *key, const u32 key__sz)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_entry_delete(net, params, key, key__sz);
+}
+
+__diag_pop();
+
+BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
+BTF_ID_FLAGS(func, bpf_p4tc_tbl_read, KF_RET_NULL);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_create);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_create_on_miss);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_update);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_delete);
+BTF_SET8_END(p4tc_kfunc_check_tbl_set_skb)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_skb = {
+ .owner = THIS_MODULE,
+ .set = &p4tc_kfunc_check_tbl_set_skb,
+};
+
+BTF_SET8_START(p4tc_kfunc_check_tbl_set_xdp)
+BTF_ID_FLAGS(func, xdp_p4tc_tbl_read, KF_RET_NULL);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_create);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_create_on_miss);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_update);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_delete);
+BTF_SET8_END(p4tc_kfunc_check_tbl_set_xdp)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_xdp = {
+ .owner = THIS_MODULE,
+ .set = &p4tc_kfunc_check_tbl_set_xdp,
+};
+
+int register_p4tc_tbl_bpf(void)
+{
+ int ret;
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT,
+ &p4tc_kfunc_tbl_set_skb);
+ if (ret < 0)
+ return ret;
+
+ /* There is no unregister_btf_kfunc_id_set function */
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+ &p4tc_kfunc_tbl_set_xdp);
+}
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index b589bd9c2..6c00747ac 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -37,6 +37,44 @@ static __net_init int pipeline_init_net(struct net *net)
idr_init(&pipe_net->pipeline_idr);
+ for (int i = 0; i < P4TC_TBLS_CACHE_SIZE; i++)
+ INIT_LIST_HEAD(&pipe_net->tbls_cache[i]);
+
+ return 0;
+}
+
+static size_t p4tc_tbl_cache_hash(u32 pipeid, u32 tblid)
+{
+ return (pipeid + tblid) % P4TC_TBLS_CACHE_SIZE;
+}
+
+struct p4tc_table *p4tc_tbl_cache_lookup(struct net *net, u32 pipeid, u32 tblid)
+{
+ size_t hash = p4tc_tbl_cache_hash(pipeid, tblid);
+ struct p4tc_pipeline_net *pipe_net;
+ struct p4tc_table *pos, *tmp;
+ struct net_generic *ng;
+
+ /* RCU read lock is already being held */
+ ng = rcu_dereference(net->gen);
+ pipe_net = ng->ptr[pipeline_net_id];
+
+ list_for_each_entry_safe(pos, tmp, &pipe_net->tbls_cache[hash],
+ tbl_cache_node) {
+ if (pos->common.p_id == pipeid && pos->tbl_id == tblid)
+ return pos;
+ }
+
+ return NULL;
+}
+
+int p4tc_tbl_cache_insert(struct net *net, u32 pipeid, struct p4tc_table *table)
+{
+ struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+ size_t hash = p4tc_tbl_cache_hash(pipeid, table->tbl_id);
+
+ list_add_tail(&table->tbl_cache_node, &pipe_net->tbls_cache[hash]);
+
return 0;
}
@@ -44,6 +82,11 @@ static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
struct p4tc_template_common *template,
struct netlink_ext_ack *extack);
+void p4tc_tbl_cache_remove(struct net *net, struct p4tc_table *table)
+{
+ list_del(&table->tbl_cache_node);
+}
+
static void __net_exit pipeline_exit_net(struct net *net)
{
struct p4tc_pipeline_net *pipe_net;
@@ -152,8 +195,8 @@ static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
return 0;
}
-static inline int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
- struct netlink_ext_ack *extack)
+static int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
{
int ret;
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index e38e14a84..7d79b01e5 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -380,6 +380,7 @@ static int _p4tc_table_put(struct net *net, struct nlattr **tb,
rhltable_free_and_destroy(&table->tbl_entries,
p4tc_table_entry_destroy_hash, table);
+ p4tc_tbl_cache_remove(net, table);
idr_destroy(&table->tbl_masks_idr);
ida_destroy(&table->tbl_prio_idr);
@@ -1147,6 +1148,10 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
goto defaultacts_destroy;
}
+ ret = p4tc_tbl_cache_insert(net, pipeline->common.p_id, table);
+ if (ret < 0)
+ goto entries_hashtable_destroy;
+
pipeline->curr_tables += 1;
table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
@@ -1154,6 +1159,9 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
return table;
+entries_hashtable_destroy:
+ rhltable_destroy(&table->tbl_entries);
+
defaultacts_destroy:
p4tc_table_defact_destroy(def_params.default_hitact);
p4tc_table_defact_destroy(def_params.default_missact);
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
index cadd3e100..c6953199d 100644
--- a/net/sched/p4tc/p4tc_tbl_entry.c
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -1064,6 +1064,44 @@ __must_hold(RCU)
return 0;
}
+/* Internal function which will be called by the data path */
+static int __p4tc_table_entry_del(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_mask *mask, u32 prio)
+{
+ struct p4tc_table_entry *entry;
+ int ret;
+
+ p4tc_table_entry_build_key(table, key, mask);
+
+ entry = p4tc_entry_lookup(table, key, prio);
+ if (!entry)
+ return -ENOENT;
+
+ ret = ___p4tc_table_entry_del(pipeline, table, entry, false);
+
+ return ret;
+}
+
+int p4tc_table_entry_del_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key)
+{
+ u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+ BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+ const u32 keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+ struct p4tc_table_entry_mask *mask = (void *)&__mask;
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ return -EINVAL;
+
+ if (keysz_bytes != P4TC_KEYSZ_BYTES(key->keysz))
+ return -EINVAL;
+
+ return __p4tc_table_entry_del(pipeline, table, key, mask, 0);
+}
+
static int p4tc_table_entry_gd(struct net *net, struct sk_buff *skb, bool del,
u16 *permissions, struct nlattr *arg,
struct p4tc_path_nlattrs *nl_path_attrs,
@@ -1358,6 +1396,54 @@ static int p4tc_table_entry_flush(struct net *net, struct sk_buff *skb,
return ret;
}
+static int
+p4tc_table_tc_act_from_bpf_act(struct tcf_p4act *p4act,
+ struct p4tc_table_entry_value *value,
+ struct p4tc_table_entry_act_bpf *act_bpf)
+__must_hold(RCU)
+{
+ struct p4tc_table_entry_act_bpf_kern *new_act_bpf;
+ struct tcf_p4act_params *p4act_params;
+ struct p4tc_act_param *param;
+ unsigned long param_id, tmp;
+ u8 *params_cursor;
+ int err;
+
+ p4act_params = rcu_dereference(p4act->params);
+ /* Skip act_id */
+ params_cursor = (u8 *)act_bpf + sizeof(act_bpf->act_id);
+ idr_for_each_entry_ul(&p4act_params->params_idr, param, tmp, param_id) {
+ const struct p4tc_type *type = param->type;
+ const u32 type_bytesz = BITS_TO_BYTES(type->container_bitsz);
+
+ memcpy(param->value, params_cursor, type_bytesz);
+ params_cursor += type_bytesz;
+ }
+
+ new_act_bpf = kzalloc(sizeof(*new_act_bpf), GFP_ATOMIC);
+ if (unlikely(!new_act_bpf))
+ return -ENOMEM;
+
+ value->acts = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
+ GFP_ATOMIC);
+ if (unlikely(!value->acts)) {
+ err = -ENOMEM;
+ goto free_act_bpf;
+ }
+
+ new_act_bpf->act_bpf = *act_bpf;
+
+ rcu_assign_pointer(p4act->act_bpf, new_act_bpf);
+ value->num_acts = 1;
+ value->acts[0] = (struct tc_action *)p4act;
+
+ return 0;
+
+free_act_bpf:
+ kfree(new_act_bpf);
+ return err;
+}
+
static enum hrtimer_restart entry_timer_handle(struct hrtimer *timer)
{
struct p4tc_table_entry_value *value =
@@ -1519,6 +1605,116 @@ __must_hold(RCU)
return ret;
}
+struct p4tc_table_entry_create_state {
+ struct p4tc_act *act;
+ struct tcf_p4act *p4_act;
+ struct p4tc_table_entry *entry;
+ u64 aging_ms;
+ u16 permissions;
+};
+
+static int
+p4tc_table_entry_init_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table, u32 entry_key_sz,
+ struct p4tc_table_entry_act_bpf *act_bpf,
+ struct p4tc_table_entry_create_state *state)
+{
+ const u32 keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+ struct p4tc_table_entry_value *entry_value;
+ const u32 keysz_bits = table->tbl_keysz;
+ struct tcf_p4act *p4_act = NULL;
+ struct p4tc_table_entry *entry;
+ struct p4tc_act *act = NULL;
+ int err = -EINVAL;
+ u32 entrysz;
+
+ if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+ goto out;
+
+ if (keysz_bytes != P4TC_KEYSZ_BYTES(entry_key_sz))
+ goto out;
+
+ if (atomic_read(&table->tbl_nelems) + 1 > table->tbl_max_entries)
+ goto out;
+
+ if (act_bpf) {
+ act = p4tc_action_find_get(pipeline, NULL, act_bpf->act_id,
+ NULL);
+ if (!act) {
+ err = -ENOENT;
+ goto out;
+ }
+ }
+
+ entrysz = sizeof(*entry) + keysz_bytes +
+ sizeof(struct p4tc_table_entry_value);
+
+ entry = kzalloc(entrysz, GFP_ATOMIC);
+ if (unlikely(!entry)) {
+ err = -ENOMEM;
+ goto act_put;
+ }
+ entry->key.keysz = keysz_bits;
+
+ entry_value = p4tc_table_entry_value(entry);
+ entry_value->prio = p4tc_table_entry_exact_prio();
+ entry_value->permissions = state->permissions;
+ entry_value->aging_ms = state->aging_ms;
+
+ if (act) {
+ p4_act = tcf_p4_get_next_prealloc_act(act);
+ if (!p4_act) {
+ err = -ENOENT;
+ goto idr_rm;
+ }
+
+ err = p4tc_table_tc_act_from_bpf_act(p4_act, entry_value, act_bpf);
+ if (err < 0)
+ goto free_prealloc;
+ }
+
+ state->act = act;
+ state->p4_act = p4_act;
+ state->entry = entry;
+
+ return 0;
+
+free_prealloc:
+ if (p4_act)
+ tcf_p4_put_prealloc_act(act, p4_act);
+
+idr_rm:
+ p4tc_table_entry_free_prio(table, entry_value->prio);
+
+ kfree(entry);
+
+act_put:
+ if (act)
+ p4tc_action_put_ref(act);
+out:
+ return err;
+}
+
+static void
+p4tc_table_entry_create_state_put(struct p4tc_table *table,
+ struct p4tc_table_entry_create_state *state)
+{
+ struct p4tc_table_entry_value *value;
+
+ if (state->act)
+ tcf_p4_put_prealloc_act(state->act, state->p4_act);
+
+ value = p4tc_table_entry_value(state->entry);
+ p4tc_table_entry_free_prio(table, value->prio);
+
+ kfree(value->acts);
+
+ kfree(state->entry);
+
+ if (state->act)
+ p4tc_action_put_ref(state->act);
+}
+
/* Invoked from both control and data path */
static int __p4tc_table_entry_update(struct p4tc_pipeline *pipeline,
struct p4tc_table *table,
@@ -1657,6 +1853,93 @@ __must_hold(RCU)
return ret;
}
+static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
+{
+ return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
+}
+
+int p4tc_table_entry_create_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_act_bpf *act_bpf,
+ u64 aging_ms)
+{
+ u16 tblperm = rcu_dereference(table->tbl_permissions)->permissions;
+ u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+ BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+ struct p4tc_table_entry_mask *mask = (void *)&__mask;
+ struct p4tc_table_entry_create_state state = {0};
+ struct p4tc_table_entry_value *value;
+ int err;
+
+ state.aging_ms = aging_ms;
+ state.permissions = p4tc_table_entry_tbl_permcpy(tblperm);
+ err = p4tc_table_entry_init_bpf(pipeline, table, key->keysz,
+ act_bpf, &state);
+ if (err < 0)
+ return err;
+ p4tc_table_entry_assign_key_exact(&state.entry->key, key->fa_key);
+
+ value = p4tc_table_entry_value(state.entry);
+ /* Entry is always dynamic when it comes from the data path */
+ value->is_dyn = true;
+
+ err = __p4tc_table_entry_create(pipeline, table, state.entry, mask,
+ P4TC_ENTITY_KERNEL, false);
+ if (err < 0)
+ goto put_state;
+
+ refcount_set(&value->entries_ref, 1);
+ if (state.p4_act)
+ tcf_p4_set_init_flags(state.p4_act);
+
+ return 0;
+
+put_state:
+ p4tc_table_entry_create_state_put(table, &state);
+
+ return err;
+}
+
+int p4tc_table_entry_update_bpf(struct p4tc_pipeline *pipeline,
+ struct p4tc_table *table,
+ struct p4tc_table_entry_key *key,
+ struct p4tc_table_entry_act_bpf *act_bpf,
+ u64 aging_ms)
+{
+ struct p4tc_table_entry_create_state state = {0};
+ struct p4tc_table_entry_value *value;
+ int err;
+
+ state.aging_ms = aging_ms;
+ state.permissions = P4TC_PERMISSIONS_UNINIT;
+ err = p4tc_table_entry_init_bpf(pipeline, table, key->keysz, act_bpf,
+ &state);
+ if (err < 0)
+ return err;
+
+ p4tc_table_entry_assign_key_exact(&state.entry->key, key->fa_key);
+
+ value = p4tc_table_entry_value(state.entry);
+ value->is_dyn = !!aging_ms;
+ err = __p4tc_table_entry_update(pipeline, table, state.entry, NULL,
+ P4TC_ENTITY_KERNEL, false);
+
+ if (err < 0)
+ goto put_state;
+
+ refcount_set(&value->entries_ref, 1);
+ if (state.p4_act)
+ tcf_p4_set_init_flags(state.p4_act);
+
+ return 0;
+
+put_state:
+ p4tc_table_entry_create_state_put(table, &state);
+
+ return err;
+}
+
static bool p4tc_table_check_entry_act(struct p4tc_table *table,
struct tc_action *entry_act)
{
@@ -1729,11 +2012,6 @@ update_tbl_attrs(struct net *net, struct p4tc_table *table,
return err;
}
-static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
-{
- return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
-}
-
#define P4TC_TBL_ENTRY_CU_FLAG_CREATE 0x1
#define P4TC_TBL_ENTRY_CU_FLAG_UPDATE 0x2
#define P4TC_TBL_ENTRY_CU_FLAG_SET 0x4
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index 2064dfaf1..a3b3b1430 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -599,6 +599,8 @@ static int __init p4tc_template_init(void)
op->init();
}
+ register_p4tc_tbl_bpf();
+
return 0;
}
--
2.34.1
* [PATCH net-next v8 14/15] p4tc: add P4 classifier
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (12 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-17 7:17 ` John Fastabend
2023-11-16 14:59 ` [PATCH net-next v8 15/15] p4tc: Add P4 extern interface Jamal Hadi Salim
2023-11-17 6:27 ` [PATCH net-next v8 00/15] Introducing P4TC John Fastabend
15 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Introduce the P4 tc classifier. A tc filter instantiated on this classifier
is used to bind a P4 pipeline to one or more netdev ports. To use the P4
classifier you must specify a pipeline name that will be associated with
this filter, an s/w parser and a datapath eBPF program. The pipeline must
have already been created via a template.
For example, if we were to add a filter to the ingress of network interface
device $P0 and associate it with the P4 pipeline simple_l3, we'd issue the
following command:
tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
action bpf obj $PARSER.o section prog/tc-parser \
action bpf obj $PROGNAME.o section prog/tc-ingress
$PROGNAME.o and $PARSER.o are compilations of the eBPF programs generated
by the P4 compiler and together are the representation of the P4 program.
Note that the filter understands that $PARSER.o is a parser to be loaded
at the tc level. The datapath program is merely an eBPF action.
Note that we also support a distinct way of loading the parser, as opposed
to making it an action; the above example would then be:
tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
prog type tc obj $PARSER.o ... \
action bpf obj $PROGNAME.o section prog/tc-ingress
We support two ways of loading these initial programs in the pipeline and
differentiate between what gets loaded at tc vs xdp by using the syntax
"prog type tc obj" or "prog type xdp obj".
For XDP:
tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
prog type xdp obj $PARSER.o section parser/xdp \
pinned_link /sys/fs/bpf/mylink \
action bpf obj $PROGNAME.o section prog/tc-ingress
The theory of operations is as follows:
================================1. PARSING================================
The packet first encounters the parser.
The parser is implemented in eBPF, residing at either the TC or XDP
level. The parsed header values are stored in a shared eBPF map.
When the parser runs at the XDP level, we load it into XDP using the tc
filter command and pin it to a file.
=============================2. ACTIONS=============================
In the above example, the P4 program (minus the parser) is encoded in an
action ($PROGNAME.o). It should be noted that classical tc actions
continue to work:
IOW, someone could decide to add a mirred action to mirror all packets
after or before the eBPF action.
tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
prog type tc obj $PARSER.o section parser/tc-ingress \
action bpf obj $PROGNAME.o section prog/tc-ingress \
action mirred egress mirror index 1 dev $P1 \
action bpf obj $ANOTHERPROG.o section mysect/section-1
It should also be noted that it is feasible to split the ingress
datapath, running part of it in XDP first and the rest in TC later (as in
the example above where the parser runs at the XDP level). YMMV.
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/uapi/linux/pkt_cls.h | 18 ++
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/cls_p4.c | 447 +++++++++++++++++++++++++++++++++++
net/sched/p4tc/Makefile | 4 +-
net/sched/p4tc/trace.c | 10 +
net/sched/p4tc/trace.h | 44 ++++
7 files changed, 535 insertions(+), 1 deletion(-)
create mode 100644 net/sched/cls_p4.c
create mode 100644 net/sched/p4tc/trace.c
create mode 100644 net/sched/p4tc/trace.h
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 75bf73742..b70ba4647 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -739,6 +739,24 @@ enum {
#define TCA_MATCHALL_MAX (__TCA_MATCHALL_MAX - 1)
+/* P4 classifier */
+
+enum {
+ TCA_P4_UNSPEC,
+ TCA_P4_CLASSID,
+ TCA_P4_ACT,
+ TCA_P4_PNAME,
+ TCA_P4_PIPEID,
+ TCA_P4_PROG_FD,
+ TCA_P4_PROG_NAME,
+ TCA_P4_PROG_TYPE,
+ TCA_P4_PROG_ID,
+ TCA_P4_PAD,
+ __TCA_P4_MAX,
+};
+
+#define TCA_P4_MAX (__TCA_P4_MAX - 1)
+
/* Extended Matches */
struct tcf_ematch_tree_hdr {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index df6d5e15f..dbfe5ceef 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -565,6 +565,18 @@ config NET_CLS_MATCHALL
To compile this code as a module, choose M here: the module will
be called cls_matchall.
+config NET_CLS_P4
+ tristate "P4 classifier"
+ select NET_CLS
+ select NET_P4_TC
+ help
+ If you say Y here, you will be able to bind a P4 pipeline
+ program to one or more netdev ports. To use this feature, you must
+ first successfully install a P4 template representing the program.
+
+ To compile this code as a module, choose M here: the module will
+ be called cls_p4.
+
config NET_EMATCH
bool "Extended Matches"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 937b8f8a9..15bd59ae3 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_NET_CLS_CGROUP) += cls_cgroup.o
obj-$(CONFIG_NET_CLS_BPF) += cls_bpf.o
obj-$(CONFIG_NET_CLS_FLOWER) += cls_flower.o
obj-$(CONFIG_NET_CLS_MATCHALL) += cls_matchall.o
+obj-$(CONFIG_NET_CLS_P4) += cls_p4.o
obj-$(CONFIG_NET_EMATCH) += ematch.o
obj-$(CONFIG_NET_EMATCH_CMP) += em_cmp.o
obj-$(CONFIG_NET_EMATCH_NBYTE) += em_nbyte.o
diff --git a/net/sched/cls_p4.c b/net/sched/cls_p4.c
new file mode 100644
index 000000000..67991c74d
--- /dev/null
+++ b/net/sched/cls_p4.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/cls_p4.c - P4 Classifier
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+
+#include <net/p4tc.h>
+
+#include "p4tc/trace.h"
+
+#define CLS_P4_PROG_NAME_LEN 256
+
+struct p4tc_bpf_prog {
+ struct bpf_prog *p4_prog;
+ const char *p4_prog_name;
+};
+
+struct cls_p4_head {
+ struct tcf_exts exts;
+ struct tcf_result res;
+ struct rcu_work rwork;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_bpf_prog *prog;
+ u32 handle;
+};
+
+static int p4_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+ struct tcf_result *res)
+{
+ struct cls_p4_head *head = rcu_dereference_bh(tp->root);
+ bool at_ingress = skb_at_tc_ingress(skb);
+
+ if (unlikely(!head)) {
+ pr_err("P4 classifier not found\n");
+ return -1;
+ }
+
+ /* head->prog represents the eBPF program that will be first executed by
+ * the data plane. It may or may not exist. In addition to head->prog,
+ * we'll have another eBPF program that will execute after this one in
+ * the form of a filter action (head->exts).
+ * head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT means this
+ * program executes in the TC P4 filter.
+ * head->prog->p4_prog->type == BPF_PROG_TYPE_XDP means this
+ * program was loaded in XDP.
+ */
+ if (head->prog) {
+ int rc = TC_ACT_PIPE;
+
+ /* If eBPF program is loaded into TC */
+ if (head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT) {
+ if (at_ingress) {
+ /* It is safe to push/pull even if skb_shared() */
+ __skb_push(skb, skb->mac_len);
+ bpf_compute_data_pointers(skb);
+ rc = bpf_prog_run(head->prog->p4_prog,
+ skb);
+ __skb_pull(skb, skb->mac_len);
+ } else {
+ bpf_compute_data_pointers(skb);
+ rc = bpf_prog_run(head->prog->p4_prog,
+ skb);
+ }
+ }
+
+ if (rc != TC_ACT_PIPE)
+ return rc;
+ }
+
+ trace_p4_classify(skb, head->pipeline);
+
+ *res = head->res;
+
+ return tcf_exts_exec(skb, &head->exts, res);
+}
+
+static int p4_init(struct tcf_proto *tp)
+{
+ return 0;
+}
+
+static void p4_bpf_prog_destroy(struct p4tc_bpf_prog *prog)
+{
+ bpf_prog_put(prog->p4_prog);
+ kfree(prog->p4_prog_name);
+ kfree(prog);
+}
+
+static void __p4_destroy(struct cls_p4_head *head)
+{
+ tcf_exts_destroy(&head->exts);
+ tcf_exts_put_net(&head->exts);
+ if (head->prog)
+ p4_bpf_prog_destroy(head->prog);
+ p4tc_pipeline_put(head->pipeline);
+ kfree(head);
+}
+
+static void p4_destroy_work(struct work_struct *work)
+{
+ struct cls_p4_head *head =
+ container_of(to_rcu_work(work), struct cls_p4_head, rwork);
+
+ rtnl_lock();
+ __p4_destroy(head);
+ rtnl_unlock();
+}
+
+static void p4_destroy(struct tcf_proto *tp, bool rtnl_held,
+ struct netlink_ext_ack *extack)
+{
+ struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+ if (!head)
+ return;
+
+ tcf_unbind_filter(tp, &head->res);
+
+ if (tcf_exts_get_net(&head->exts))
+ tcf_queue_work(&head->rwork, p4_destroy_work);
+ else
+ __p4_destroy(head);
+}
+
+static void *p4_get(struct tcf_proto *tp, u32 handle)
+{
+ struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+ if (head && head->handle == handle)
+ return head;
+
+ return NULL;
+}
+
+static const struct nla_policy p4_policy[TCA_P4_MAX + 1] = {
+ [TCA_P4_UNSPEC] = { .type = NLA_UNSPEC },
+ [TCA_P4_CLASSID] = { .type = NLA_U32 },
+ [TCA_P4_ACT] = { .type = NLA_NESTED },
+ [TCA_P4_PNAME] = { .type = NLA_STRING, .len = PIPELINENAMSIZ },
+ [TCA_P4_PIPEID] = { .type = NLA_U32 },
+ [TCA_P4_PROG_FD] = { .type = NLA_U32 },
+ [TCA_P4_PROG_NAME] = { .type = NLA_STRING,
+ .len = CLS_P4_PROG_NAME_LEN },
+ [TCA_P4_PROG_TYPE] = { .type = NLA_U32 },
+};
+
+static int cls_p4_prog_from_efd(struct nlattr **tb,
+ struct p4tc_bpf_prog *prog, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_prog *fp;
+ u32 prog_type;
+ char *name;
+ u32 bpf_fd;
+
+ bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
+ prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
+
+ if (prog_type != BPF_PROG_TYPE_XDP &&
+ prog_type != BPF_PROG_TYPE_SCHED_ACT) {
+ NL_SET_ERR_MSG(extack,
+ "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
+ return -EINVAL;
+ }
+
+ fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
+ if (IS_ERR(fp))
+ return PTR_ERR(fp);
+
+ name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
+ if (!name) {
+ bpf_prog_put(fp);
+ return -ENOMEM;
+ }
+
+ prog->p4_prog_name = name;
+ prog->p4_prog = fp;
+
+ return 0;
+}
+
+static int p4_set_parms(struct net *net, struct tcf_proto *tp,
+ struct cls_p4_head *head, unsigned long base,
+ struct nlattr **tb, struct nlattr *est, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ bool load_bpf_prog = tb[TCA_P4_PROG_NAME] && tb[TCA_P4_PROG_FD] &&
+ tb[TCA_P4_PROG_TYPE];
+ struct p4tc_bpf_prog *prog = NULL;
+ int err;
+
+ err = tcf_exts_validate_ex(net, tp, tb, est, &head->exts, flags, 0,
+ extack);
+ if (err < 0)
+ return err;
+
+ if (load_bpf_prog) {
+ prog = kzalloc(sizeof(*prog), GFP_KERNEL);
+ if (!prog) {
+ err = -ENOMEM;
+ goto exts_destroy;
+ }
+
+ err = cls_p4_prog_from_efd(tb, prog, flags, extack);
+ if (err < 0) {
+ kfree(prog);
+ goto exts_destroy;
+ }
+ }
+
+ if (tb[TCA_P4_CLASSID]) {
+ head->res.classid = nla_get_u32(tb[TCA_P4_CLASSID]);
+ tcf_bind_filter(tp, &head->res, base);
+ }
+
+ if (load_bpf_prog) {
+ if (head->prog) {
+ pr_notice("cls_p4: Substituting old BPF program with id %u with new one with id %u\n",
+ head->prog->p4_prog->aux->id, prog->p4_prog->aux->id);
+ p4_bpf_prog_destroy(head->prog);
+ }
+ head->prog = prog;
+ }
+
+ return 0;
+
+exts_destroy:
+ tcf_exts_destroy(&head->exts);
+ return err;
+}
+
+static int p4_change(struct net *net, struct sk_buff *in_skb,
+ struct tcf_proto *tp, unsigned long base, u32 handle,
+ struct nlattr **tca, void **arg, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ struct cls_p4_head *head = rtnl_dereference(tp->root);
+ struct p4tc_pipeline *pipeline = NULL;
+ struct nlattr *tb[TCA_P4_MAX + 1];
+ struct cls_p4_head *new_cls;
+ char *pname = NULL;
+ u32 pipeid = 0;
+ int err;
+
+ if (!tca[TCA_OPTIONS]) {
+ NL_SET_ERR_MSG(extack, "Must provide pipeline options");
+ return -EINVAL;
+ }
+
+ if (head)
+ return -EEXIST;
+
+ err = nla_parse_nested(tb, TCA_P4_MAX, tca[TCA_OPTIONS], p4_policy,
+ extack);
+ if (err < 0)
+ return err;
+
+ if (tb[TCA_P4_PNAME])
+ pname = nla_data(tb[TCA_P4_PNAME]);
+
+ if (tb[TCA_P4_PIPEID])
+ pipeid = nla_get_u32(tb[TCA_P4_PIPEID]);
+
+ pipeline = p4tc_pipeline_find_get(net, pname, pipeid, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (!pipeline_sealed(pipeline)) {
+ err = -EINVAL;
+ NL_SET_ERR_MSG(extack, "Pipeline must be sealed before use");
+ goto pipeline_put;
+ }
+
+ new_cls = kzalloc(sizeof(*new_cls), GFP_KERNEL);
+ if (!new_cls) {
+ err = -ENOMEM;
+ goto pipeline_put;
+ }
+
+ err = tcf_exts_init(&new_cls->exts, net, TCA_P4_ACT, 0);
+ if (err)
+ goto err_exts_init;
+
+ if (!handle)
+ handle = 1;
+
+ new_cls->handle = handle;
+
+ err = p4_set_parms(net, tp, new_cls, base, tb, tca[TCA_RATE], flags,
+ extack);
+ if (err)
+ goto err_set_parms;
+
+ new_cls->pipeline = pipeline;
+ *arg = head;
+ rcu_assign_pointer(tp->root, new_cls);
+ return 0;
+
+err_set_parms:
+ tcf_exts_destroy(&new_cls->exts);
+err_exts_init:
+ kfree(new_cls);
+pipeline_put:
+ p4tc_pipeline_put(pipeline);
+ return err;
+}
+
+static int p4_delete(struct tcf_proto *tp, void *arg, bool *last,
+ bool rtnl_held, struct netlink_ext_ack *extack)
+{
+ *last = true;
+ return 0;
+}
+
+static void p4_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+ bool rtnl_held)
+{
+ struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+ if (arg->count < arg->skip)
+ goto skip;
+
+ if (!head)
+ return;
+ if (arg->fn(tp, head, arg) < 0)
+ arg->stop = 1;
+skip:
+ arg->count++;
+}
+
+static int p4_prog_dump(struct sk_buff *skb, struct p4tc_bpf_prog *prog)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+
+ if (nla_put_u32(skb, TCA_P4_PROG_ID, prog->p4_prog->aux->id))
+ goto nla_put_failure;
+
+ if (nla_put_string(skb, TCA_P4_PROG_NAME, prog->p4_prog_name))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_P4_PROG_TYPE, prog->p4_prog->type))
+ goto nla_put_failure;
+
+ return 0;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int p4_dump(struct net *net, struct tcf_proto *tp, void *fh,
+ struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
+{
+ struct cls_p4_head *head = fh;
+ struct nlattr *nest;
+
+ if (!head)
+ return skb->len;
+
+ t->tcm_handle = head->handle;
+
+ nest = nla_nest_start(skb, TCA_OPTIONS);
+ if (!nest)
+ goto nla_put_failure;
+
+ if (nla_put_string(skb, TCA_P4_PNAME, head->pipeline->common.name))
+ goto nla_put_failure;
+
+ if (head->res.classid &&
+ nla_put_u32(skb, TCA_P4_CLASSID, head->res.classid))
+ goto nla_put_failure;
+
+ if (head->prog && p4_prog_dump(skb, head->prog))
+ goto nla_put_failure;
+
+ if (tcf_exts_dump(skb, &head->exts))
+ goto nla_put_failure;
+
+ nla_nest_end(skb, nest);
+
+ if (tcf_exts_dump_stats(skb, &head->exts) < 0)
+ goto nla_put_failure;
+
+ return skb->len;
+
+nla_put_failure:
+ nla_nest_cancel(skb, nest);
+ return -1;
+}
+
+static void p4_bind_class(void *fh, u32 classid, unsigned long cl, void *q,
+ unsigned long base)
+{
+ struct cls_p4_head *head = fh;
+
+ if (head && head->res.classid == classid) {
+ if (cl)
+ __tcf_bind_filter(q, &head->res, base);
+ else
+ __tcf_unbind_filter(q, &head->res);
+ }
+}
+
+static struct tcf_proto_ops cls_p4_ops __read_mostly = {
+ .kind = "p4",
+ .classify = p4_classify,
+ .init = p4_init,
+ .destroy = p4_destroy,
+ .get = p4_get,
+ .change = p4_change,
+ .delete = p4_delete,
+ .walk = p4_walk,
+ .dump = p4_dump,
+ .bind_class = p4_bind_class,
+ .owner = THIS_MODULE,
+};
+
+static int __init cls_p4_init(void)
+{
+ return register_tcf_proto_ops(&cls_p4_ops);
+}
+
+static void __exit cls_p4_exit(void)
+{
+ unregister_tcf_proto_ops(&cls_p4_ops);
+}
+
+module_init(cls_p4_init);
+module_exit(cls_p4_exit);
+
+MODULE_AUTHOR("Mojatatu Networks");
+MODULE_DESCRIPTION("P4 Classifier");
+MODULE_LICENSE("GPL");
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 3fed9a853..726902f10 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
+CFLAGS_trace.o := -I$(src)
+
obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
- p4tc_runtime_api.o
+ p4tc_runtime_api.o trace.o
obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/trace.c b/net/sched/p4tc/trace.c
new file mode 100644
index 000000000..683313407
--- /dev/null
+++ b/net/sched/p4tc/trace.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+#include <net/p4tc.h>
+
+#ifndef __CHECKER__
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+EXPORT_TRACEPOINT_SYMBOL_GPL(p4_classify);
+#endif
diff --git a/net/sched/p4tc/trace.h b/net/sched/p4tc/trace.h
new file mode 100644
index 000000000..80abec13b
--- /dev/null
+++ b/net/sched/p4tc/trace.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM p4tc
+
+#if !defined(__P4TC_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define __P4TC_TRACE_H_
+
+#include <linux/tracepoint.h>
+
+struct p4tc_pipeline;
+
+TRACE_EVENT(p4_classify,
+ TP_PROTO(struct sk_buff *skb, struct p4tc_pipeline *pipeline),
+
+ TP_ARGS(skb, pipeline),
+
+ TP_STRUCT__entry(__string(pname, pipeline->common.name)
+ __field(u32, p_id)
+ __field(u32, ifindex)
+ __field(u32, ingress)
+ ),
+
+ TP_fast_assign(__assign_str(pname, pipeline->common.name);
+ __entry->p_id = pipeline->common.p_id;
+ __entry->ifindex = skb->dev->ifindex;
+ __entry->ingress = skb_at_tc_ingress(skb);
+ ),
+
+ TP_printk("dev=%u dir=%s pipeline=%s p_id=%u",
+ __entry->ifindex,
+ __entry->ingress ? "ingress" : "egress",
+ __get_str(pname),
+ __entry->p_id
+ )
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+#include <trace/define_trace.h>
--
2.34.1
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (13 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 14/15] p4tc: add P4 classifier Jamal Hadi Salim
@ 2023-11-16 14:59 ` Jamal Hadi Salim
2023-11-16 16:42 ` Jiri Pirko
2023-11-17 6:27 ` [PATCH net-next v8 00/15] Introducing P4TC John Fastabend
15 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-16 14:59 UTC (permalink / raw)
To: netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
P4 externs are the language's abstraction for extending its
functionality. For example, the function that sends a packet to a
specific port (send_to_port) in P4 PNA is an extern.
Externs can be seen as classes, which have constructors and methods.
Take, for example, the Register extern definition:
extern Register<T> {
Register(@tc_numel bit<32> size);
@tc_md_read T read(@tc_key bit<32> index);
@tc_md_write void write(@tc_key bit<32> index, @tc_data T value);
}
struct tc_ControlPath_Register<T> {
@tc_key bit<32> index;
@tc_data T value;
}
Which can then be instantiated within a P4 program as:
Register<bit<32>>(128) reg1;
Register<bit<16>>(1024) reg2;
This will be abstracted into template commands by the P4C compiler; for
"reg1" the result is as follows:
tc p4template create extern/root/register extid 10 numinstances 2
tc p4template create extern_inst/aP4Proggie/register/reg1 instid 1 \
control_path tc_key index type bit32 tc_data value type bit32 \
numelemens 128 default_value 22
=========================EXTERN RUNTIME COMMANDS=========================
Once we seal the pipeline, the register elements will be assigned the
default value specified on the template as "default_value". After sealing,
we can update a runtime instance element. For example, to update
reg1[2] with the value 33, we issue the following:
tc p4ctrl update aP4proggie/extern/register/reg1 tc_key index 2 \
tc_data value 33
We can also get its value:
tc p4ctrl get aP4proggie/extern/register/reg1 tc_key index 2
Which will yield the following output:
total exts 0
extern order 1:
tc_key index id 1 type bit32 value: 1
tc_data value id 2 type bit32 value: 33
We can also dump all of the elements in this register:
tc p4ctrl get aP4proggie/extern/register/reg1
Note that the only valid runtime operations are get and update.
=========================EXTERN P4 Runtime =========================
The generated eBPF code invokes the externs in the P4TC domain
using the md_read or md_write kfuncs. For example,
if the P4 program had this invocation:
tmp1 = reg1.read(index1);
then the equivalent generated eBPF code is as follows:
param.pipe_id = aP4Proggie_ID;
param.ext_id = EXTERN_REGISTER;
param.inst_id = EXTERN_REGISTER_INSTANCE_ID1;
param.index = index1;
param.param_id = EXTERN_REGISTER_PARAM_ID;
bpf_p4tc_extern_md_read(skb, &res, &param);
tmp1 = *(u32 *)res.out_params;
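If the P4 program instead performed a write, e.g.:
reg1.write(index1, tmp2);
then a plausible sketch of the generated code (the macro names mirror the
read example above and, like placing the new value in in_params, are
assumptions rather than exact compiler output) is:
param.pipe_id = aP4Proggie_ID;
param.ext_id = EXTERN_REGISTER;
param.inst_id = EXTERN_REGISTER_INSTANCE_ID1;
param.index = index1;
param.param_id = EXTERN_REGISTER_PARAM_ID;
/* copy the value to be written into the extern-specific input params */
__builtin_memcpy(param.in_params, &tmp2, sizeof(u32));
bpf_p4tc_extern_md_write(skb, &param);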
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
include/net/p4tc.h | 161 +++
include/net/p4tc_ext_api.h | 199 +++
include/uapi/linux/p4tc.h | 61 +
include/uapi/linux/p4tc_ext.h | 36 +
net/sched/p4tc/Makefile | 2 +-
net/sched/p4tc/p4tc_bpf.c | 79 +-
net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
net/sched/p4tc/p4tc_pipeline.c | 34 +-
net/sched/p4tc/p4tc_runtime_api.c | 10 +-
net/sched/p4tc/p4tc_table.c | 57 +-
net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
net/sched/p4tc/p4tc_tmpl_api.c | 4 +
net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
13 files changed, 5083 insertions(+), 10 deletions(-)
create mode 100644 include/net/p4tc_ext_api.h
create mode 100644 include/uapi/linux/p4tc_ext.h
create mode 100644 net/sched/p4tc/p4tc_ext.c
create mode 100644 net/sched/p4tc/p4tc_tmpl_ext.c
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index e25eaa4ac..b6b825c64 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -11,6 +11,7 @@
#include <linux/rhashtable-types.h>
#include <net/tc_act/p4tc.h>
#include <net/p4tc_types.h>
+#include <linux/bpf.h>
#define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
#define P4TC_DEFAULT_MAX_RULES 1
@@ -21,6 +22,10 @@
#define P4TC_DEFAULT_TMASKS 8
#define P4TC_MAX_T_AGING 864000000
#define P4TC_DEFAULT_T_AGING 30000
+#define P4TC_DEFAULT_NUM_EXT_INSTS 1
+#define P4TC_MAX_NUM_EXT_INSTS 1024
+#define P4TC_DEFAULT_NUM_EXT_INST_ELEMS 1
+#define P4TC_MAX_NUM_EXT_INST_ELEMS 1024
#define P4TC_MAX_PERMISSION (GENMASK(P4TC_PERM_MAX_BIT, 0))
@@ -30,6 +35,8 @@
#define P4TC_TBLID_IDX 1
#define P4TC_AID_IDX 1
#define P4TC_PARSEID_IDX 1
+#define P4TC_TMPL_EXT_IDX 1
+#define P4TC_TMPL_EXT_INST_IDX 2
struct p4tc_dump_ctx {
u32 ids[P4TC_PATH_MAX];
@@ -79,9 +86,14 @@ struct p4tc_pipeline {
struct p4tc_template_common common;
struct idr p_act_idr;
struct idr p_tbl_idr;
+ /* IDR where the externs are stored globally in the root pipeline */
+ struct idr p_ext_idr;
+ /* IDR where the per user pipeline data related to externs is stored */
+ struct idr user_ext_idr;
struct rcu_head rcu;
struct net *net;
u32 num_created_acts;
+ u32 num_created_ext_elems;
/* Accounts for how many entities are referencing this pipeline.
* As for now only P4 filters can refer to pipelines.
*/
@@ -207,6 +219,27 @@ static inline int p4tc_action_destroy(struct tc_action **acts)
#define P4TC_MAX_PARAM_DATA_SIZE 124
+#define P4TC_EXT_FLAGS_UNSPEC 0x0
+#define P4TC_EXT_FLAGS_CONTROL_READ 0x1
+#define P4TC_EXT_FLAGS_CONTROL_WRITE 0x2
+
+struct p4tc_ext_bpf_params {
+ u32 pipe_id;
+ u32 ext_id;
+ u32 inst_id;
+ u32 index;
+ u32 param_id;
+ u32 flags;
+ u8 in_params[128]; /* extern specific params if any */
+};
+
+struct p4tc_ext_bpf_res {
+ u32 ext_id;
+ u32 index;
+ u32 verdict;
+ u8 out_params[128]; /* specific values if any */
+};
+
struct p4tc_table_defact {
struct tc_action **default_acts;
/* Will have two 7 bits blocks containing CRUDXPS (Create, read, update,
@@ -237,6 +270,7 @@ struct p4tc_table {
struct p4tc_table_perm __rcu *tbl_permissions;
struct p4tc_table_entry_mask __rcu **tbl_masks_array;
unsigned long __rcu *tbl_free_masks_bitmap;
+ struct p4tc_extern_inst *tbl_counter;
u64 tbl_aging;
/* Locks the available masks IDR which will be used when adding and
* deleting table entries.
@@ -352,6 +386,7 @@ struct p4tc_table_entry_key {
struct p4tc_table_entry_value {
u32 prio;
int num_acts;
+ struct p4tc_extern_common *counter;
struct tc_action **acts;
/* Accounts for how many entities are referencing, eg: Data path,
* one or more control path and timer.
@@ -576,8 +611,134 @@ static inline bool p4tc_runtime_msg_is_update(struct nlmsghdr *n)
return n->nlmsg_type == RTM_P4TC_UPDATE;
}
+struct p4tc_user_pipeline_extern {
+ struct idr e_inst_idr;
+ struct p4tc_tmpl_extern *tmpl_ext;
+ void (*free)(struct p4tc_user_pipeline_extern *pipe_ext,
+ struct idr *tmpl_exts_idr);
+ u32 ext_id;
+ /* Accounts for how many instances refer to this extern. */
+ refcount_t ext_ref;
+ /* Counts how many instances this extern has */
+ atomic_t curr_insts_num;
+ u32 PAD0;
+ char ext_name[EXTERNNAMSIZ];
+};
+
+struct p4tc_tmpl_extern {
+ struct p4tc_template_common common;
+ struct idr params_idr;
+ const struct p4tc_extern_ops *ops;
+ u32 ext_id;
+ u32 num_params;
+ u32 max_num_insts;
+ /* Accounts for how many pipelines refer to this extern. */
+ refcount_t tmpl_ref;
+ char mod_name[MODULE_NAME_LEN];
+ bool has_exec_method;
+};
+
+struct p4tc_extern_inst {
+ struct p4tc_template_common common;
+ struct p4tc_extern_params *params;
+ const struct p4tc_extern_ops *ops;
+ struct idr control_elems_idr;
+ struct list_head unused_elems;
+ /* Locks the available externs list.
+ * Which will be used by table entries that reference externs (refer to
+ * direct counters and meters in P4).
+ * Note that table entries can be created, update or deleted by both
+ * control and data path. So this list may be modified from both
+ * contexts.
+ */
+ spinlock_t available_list_lock;
+ /* Accounts for how many elements refer to this extern. */
+ refcount_t inst_ref;
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ char *ext_name;
+ /* Counts how many elements this extern has */
+ atomic_t curr_num_elems;
+ u32 max_num_elems;
+ u32 ext_id;
+ u32 ext_inst_id;
+ u32 num_control_params;
+ struct p4tc_extern_params *constr_params;
+ u32 flags;
+ bool tbl_bindable;
+ bool is_scalar;
+};
+
+int p4tc_pipeline_create_extern_net(struct p4tc_tmpl_extern *tmpl_ext);
+int p4tc_pipeline_del_extern_net(struct p4tc_tmpl_extern *tmpl_ext);
+struct p4tc_extern_inst *
+p4tc_ext_inst_find_bynames(struct net *net, struct p4tc_pipeline *pipeline,
+ const char *extname, const char *instname,
+ struct netlink_ext_ack *extack);
+struct p4tc_user_pipeline_extern *
+p4tc_pipe_ext_find_bynames(struct net *net, struct p4tc_pipeline *pipeline,
+ const char *extname, struct netlink_ext_ack *extack);
+struct p4tc_extern_inst *
+p4tc_ext_inst_table_bind(struct p4tc_pipeline *pipeline,
+ struct p4tc_user_pipeline_extern **pipe_ext,
+ const char *ext_inst_path,
+ struct netlink_ext_ack *extack);
+void
+p4tc_ext_inst_table_unbind(struct p4tc_table *table,
+ struct p4tc_user_pipeline_extern *pipe_ext,
+ struct p4tc_extern_inst *inst);
+struct p4tc_extern_inst *
+p4tc_ext_inst_get_byids(struct net *net, struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params);
+struct p4tc_extern_inst *
+p4tc_ext_find_byids(struct p4tc_pipeline *pipeline,
+ const u32 ext_id, const u32 inst_id);
+struct p4tc_extern_inst *
+p4tc_ext_inst_alloc(const struct p4tc_extern_ops *ops, const u32 max_num_elems,
+ bool tbl_bindable, char *ext_name);
+
+int __bpf_p4tc_extern_md_write(struct net *net,
+ struct p4tc_ext_bpf_params *params);
+int __bpf_p4tc_extern_md_read(struct net *net,
+ struct p4tc_ext_bpf_res *res,
+ struct p4tc_ext_bpf_params *params);
+struct p4tc_extern_params *
+p4tc_ext_params_copy(struct p4tc_extern_params *params_orig);
+
+extern const struct p4tc_template_ops p4tc_tmpl_ext_ops;
+extern const struct p4tc_template_ops p4tc_ext_inst_ops;
+
+struct p4tc_extern_param {
+ struct p4tc_extern_param_ops *ops;
+ struct p4tc_extern_param_ops *mod_ops;
+ void *value;
+ struct p4tc_type *type;
+ struct p4tc_type_mask_shift *mask_shift;
+ u32 id;
+ u32 index;
+ u32 bitsz;
+ u32 flags;
+ char name[EXTPARAMNAMSIZ];
+ struct rcu_head rcu;
+};
+
+struct p4tc_extern_param_ops {
+ int (*init_value)(struct net *net,
+ struct p4tc_extern_param *nparam, void *value,
+ struct netlink_ext_ack *extack);
+ /* Only for bit<X> parameter types */
+ void (*default_value)(struct p4tc_extern_param *nparam);
+ int (*dump_value)(struct sk_buff *skb, struct p4tc_extern_param_ops *op,
+ struct p4tc_extern_param *param);
+ void (*free)(struct p4tc_extern_param *param);
+ u32 len;
+ u32 alloc_len;
+};
+
#define to_pipeline(t) ((struct p4tc_pipeline *)t)
#define to_act(t) ((struct p4tc_act *)t)
#define to_table(t) ((struct p4tc_table *)t)
+#define to_extern(t) ((struct p4tc_tmpl_extern *)t)
+#define to_extern_inst(t) ((struct p4tc_extern_inst *)t)
+
#endif
diff --git a/include/net/p4tc_ext_api.h b/include/net/p4tc_ext_api.h
new file mode 100644
index 000000000..fcc16d252
--- /dev/null
+++ b/include/net/p4tc_ext_api.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_P4TC_EXT_API_H
+#define __NET_P4TC_EXT_API_H
+
+/*
+ * Public extern P4TC_EXT API
+ */
+
+#include <uapi/linux/p4tc_ext.h>
+#include <linux/refcount.h>
+#include <net/flow_offload.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/p4tc.h>
+
+struct p4tc_extern_ops;
+
+struct p4tc_extern_params {
+ struct idr params_idr;
+ rwlock_t params_lock;
+ u32 num_params;
+ u32 PAD0;
+};
+
+struct p4tc_extern_common {
+ struct list_head node;
+ struct p4tc_extern_params *params;
+ const struct p4tc_extern_ops *ops;
+ struct p4tc_extern_inst *inst;
+ u32 p4tc_ext_flags;
+ u32 p4tc_ext_key;
+ refcount_t p4tc_ext_refcnt;
+ u32 PAD0;
+};
+
+struct p4tc_extern {
+ struct p4tc_extern_common common;
+ struct idr *elems_idr;
+ struct rcu_head rcu;
+ size_t attrs_size;
+ /* Extern element lock */
+ spinlock_t p4tc_ext_lock;
+};
+
+/* Reserve 16 bits for user-space. See P4TC_EXT_FLAGS_NO_PERCPU_STATS. */
+#define P4TC_EXT_FLAGS_USER_BITS 16
+#define P4TC_EXT_FLAGS_USER_MASK 0xffff
+
+struct p4tc_extern_ops {
+ struct list_head head;
+ size_t size;
+ size_t elem_size;
+ struct module *owner;
+ struct p4tc_tmpl_extern *tmpl_ext;
+ int (*exec)(struct p4tc_extern_common *common, void *priv);
+ int (*construct)(struct p4tc_extern_inst **common,
+ struct p4tc_extern_params *params,
+ struct p4tc_extern_params *constr_params,
+ u32 max_num_elems, bool tbl_bindable,
+ struct netlink_ext_ack *extack);
+ void (*deconstruct)(struct p4tc_extern_inst *common);
+ int (*dump)(struct sk_buff *skb,
+ struct p4tc_extern_inst *common,
+ struct netlink_callback *cb);
+ int (*rctrl)(int cmd, struct p4tc_extern_inst *inst,
+ struct p4tc_extern_common **e,
+ struct p4tc_extern_params *params,
+ void *key_u32, struct netlink_ext_ack *extack);
+ u32 id; /* identifier should match kind */
+ u32 PAD0;
+ char kind[P4TC_EXT_NAMSIZ];
+};
+
+#define P4TC_EXT_P_CREATED 1
+#define P4TC_EXT_P_DELETED 1
+
+int p4tc_register_extern(struct p4tc_extern_ops *ext);
+int p4tc_unregister_extern(struct p4tc_extern_ops *ext);
+
+int p4tc_ctl_extern_dump(struct sk_buff *skb, struct netlink_callback *cb,
+ struct nlattr **tb, const char *pname);
+void p4tc_ext_purge(struct idr *idr);
+void p4tc_ext_inst_purge(struct p4tc_extern_inst *inst);
+
+int p4tc_ctl_extern(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
+ struct netlink_ext_ack *extack);
+struct p4tc_extern_param *
+p4tc_ext_param_find_byanyattr(struct idr *params_idr,
+ struct nlattr *name_attr,
+ const u32 param_id,
+ struct netlink_ext_ack *extack);
+struct p4tc_extern_param *
+p4tc_ext_param_find_byid(struct idr *params_idr, const u32 param_id);
+
+int p4tc_ext_param_value_init(struct net *net,
+ struct p4tc_extern_param *param,
+ struct nlattr **tb, u32 typeid,
+ bool value_required,
+ struct netlink_ext_ack *extack);
+void p4tc_ext_param_value_free_tmpl(struct p4tc_extern_param *param);
+int p4tc_ext_param_value_dump_tmpl(struct sk_buff *skb,
+ struct p4tc_extern_param *param);
+int p4tc_extern_insts_init_elems(struct idr *user_ext_idr);
+int p4tc_extern_inst_init_elems(struct p4tc_extern_inst *inst, u32 num_elems);
+
+int p4tc_unregister_extern(struct p4tc_extern_ops *ext);
+
+struct p4tc_extern_common *p4tc_ext_elem_get(struct p4tc_extern_inst *inst);
+void p4tc_ext_elem_put_list(struct p4tc_extern_inst *inst,
+ struct p4tc_extern_common *e);
+
+int p4tc_ext_elem_dump_1(struct sk_buff *skb, struct p4tc_extern_common *e);
+void p4tc_ext_params_free(struct p4tc_extern_params *params, bool free_vals);
+
+static inline struct p4tc_extern_param *
+p4tc_extern_params_find_byid(struct p4tc_extern_params *params, u32 param_id)
+{
+ return idr_find(&params->params_idr, param_id);
+}
+
+int p4tc_ext_init_defval_params(struct p4tc_extern_inst *inst,
+ struct p4tc_extern_common *common,
+ struct idr *control_params_idr,
+ struct netlink_ext_ack *extack);
+struct p4tc_extern_params *p4tc_extern_params_init(void);
+
+static inline bool p4tc_ext_inst_has_dump(const struct p4tc_extern_inst *inst)
+{
+ const struct p4tc_extern_ops *ops = inst->ops;
+
+ return ops && ops->dump;
+}
+
+static inline bool p4tc_ext_has_rctrl(const struct p4tc_extern_ops *ops)
+{
+ return ops && ops->rctrl;
+}
+
+static inline bool p4tc_ext_has_exec(const struct p4tc_extern_ops *ops)
+{
+ return ops && ops->exec;
+}
+
+static inline bool p4tc_ext_has_construct(const struct p4tc_extern_ops *ops)
+{
+ return ops && ops->construct;
+}
+
+static inline bool
+p4tc_ext_inst_has_construct(const struct p4tc_extern_inst *inst)
+{
+ const struct p4tc_extern_ops *ops = inst->ops;
+
+ return p4tc_ext_has_construct(ops);
+}
+
+static inline bool
+p4tc_ext_inst_has_rctrl(const struct p4tc_extern_inst *inst)
+{
+ const struct p4tc_extern_ops *ops = inst->ops;
+
+ return p4tc_ext_has_rctrl(ops);
+}
+
+static inline bool
+p4tc_ext_inst_has_exec(const struct p4tc_extern_inst *inst)
+{
+ const struct p4tc_extern_ops *ops = inst->ops;
+
+ return p4tc_ext_has_exec(ops);
+}
+
+struct p4tc_extern *
+p4tc_ext_elem_find(struct p4tc_extern_inst *inst,
+ struct p4tc_ext_bpf_params *params);
+
+struct p4tc_extern_common *
+p4tc_ext_common_elem_get(struct sk_buff *skb, struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params);
+struct p4tc_extern_common *
+p4tc_xdp_ext_common_elem_get(struct xdp_buff *ctx,
+ struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params);
+void p4tc_ext_common_elem_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_extern_common *common);
+
+static inline void p4tc_ext_inst_inc_num_elems(struct p4tc_extern_inst *inst)
+{
+ atomic_inc(&inst->curr_num_elems);
+}
+
+static inline void p4tc_ext_inst_dec_num_elems(struct p4tc_extern_inst *inst)
+{
+ atomic_dec(&inst->curr_num_elems);
+}
+
+#endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index a2a39303f..3f19a1e5e 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -18,6 +18,8 @@ struct p4tcmsg {
#define P4TC_MSGBATCH_SIZE 16
#define P4TC_ACT_MAX_NUM_PARAMS P4TC_MSGBATCH_SIZE
+#define EXTPARAMNAMSIZ 256
+#define P4TC_MAX_EXTERN_METHODS 32
#define P4TC_MAX_KEYSZ 512
#define P4TC_DEFAULT_NUM_PREALLOC 16
@@ -27,6 +29,8 @@ struct p4tcmsg {
#define ACTTMPLNAMSIZ TEMPLATENAMSZ
#define ACTPARAMNAMSIZ TEMPLATENAMSZ
#define TABLENAMSIZ TEMPLATENAMSZ
+#define EXTERNNAMSIZ TEMPLATENAMSZ
+#define EXTERNINSTNAMSIZ TEMPLATENAMSZ
#define P4TC_TABLE_FLAGS_KEYSZ (1 << 0)
#define P4TC_TABLE_FLAGS_MAX_ENTRIES (1 << 1)
@@ -119,6 +123,8 @@ enum {
P4TC_ROOT_UNSPEC,
P4TC_ROOT, /* nested messages */
P4TC_ROOT_PNAME, /* string */
+ P4TC_ROOT_COUNT,
+ P4TC_ROOT_FLAGS,
__P4TC_ROOT_MAX,
};
@@ -130,6 +136,8 @@ enum {
P4TC_OBJ_PIPELINE,
P4TC_OBJ_ACT,
P4TC_OBJ_TABLE,
+ P4TC_OBJ_EXT,
+ P4TC_OBJ_EXT_INST,
__P4TC_OBJ_MAX,
};
@@ -139,6 +147,7 @@ enum {
enum {
P4TC_OBJ_RUNTIME_UNSPEC,
P4TC_OBJ_RUNTIME_TABLE,
+ P4TC_OBJ_RUNTIME_EXTERN,
__P4TC_OBJ_RUNTIME_MAX,
};
@@ -233,6 +242,7 @@ enum {
P4TC_TABLE_DEFAULT_MISS, /* nested default miss action attributes */
P4TC_TABLE_CONST_ENTRY, /* nested const table entry*/
P4TC_TABLE_ACTS_LIST, /* nested table actions list */
+ P4TC_TABLE_COUNTER, /* string */
__P4TC_TABLE_MAX
};
@@ -325,6 +335,7 @@ enum {
P4TC_ENTRY_TBL_ATTRS, /* nested table attributes */
P4TC_ENTRY_DYNAMIC, /* u8 tells if table entry is dynamic */
P4TC_ENTRY_AGING, /* u64 table entry aging */
+ P4TC_ENTRY_COUNTER, /* nested extern associated with entry counter */
P4TC_ENTRY_PAD,
__P4TC_ENTRY_MAX
};
@@ -339,6 +350,56 @@ enum {
P4TC_ENTITY_MAX
};
+/* P4 Extern attributes */
+enum {
+ P4TC_TMPL_EXT_UNSPEC,
+ P4TC_TMPL_EXT_NAME, /* string */
+ P4TC_TMPL_EXT_NUM_INSTS, /* u16 */
+ P4TC_TMPL_EXT_HAS_EXEC_METHOD, /* u8 */
+ __P4TC_TMPL_EXT_MAX
+};
+
+#define P4TC_TMPL_EXT_MAX (__P4TC_TMPL_EXT_MAX - 1)
+
+enum {
+ P4TC_TMPL_EXT_INST_UNSPEC,
+ P4TC_TMPL_EXT_INST_EXT_NAME, /* string */
+ P4TC_TMPL_EXT_INST_NAME, /* string */
+ P4TC_TMPL_EXT_INST_NUM_ELEMS, /* u32 */
+ P4TC_TMPL_EXT_INST_CONTROL_PARAMS, /* nested control params */
+ P4TC_TMPL_EXT_INST_TABLE_BINDABLE, /* bool */
+ P4TC_TMPL_EXT_INST_CONSTR_PARAMS, /* nested constructor params */
+ __P4TC_TMPL_EXT_INST_MAX
+};
+
+#define P4TC_TMPL_EXT_INST_MAX (__P4TC_TMPL_EXT_INST_MAX - 1)
+
+/* Extern params attributes */
+enum {
+ P4TC_EXT_PARAMS_VALUE_UNSPEC,
+ P4TC_EXT_PARAMS_VALUE_RAW, /* binary */
+ __P4TC_EXT_PARAMS_VALUE_MAX
+};
+
+#define P4TC_EXT_VALUE_PARAMS_MAX (__P4TC_EXT_PARAMS_VALUE_MAX - 1)
+
+#define P4TC_EXT_PARAMS_FLAG_ISKEY 0x1
+#define P4TC_EXT_PARAMS_FLAG_IS_DATASCALAR 0x2
+
+/* Extern params attributes */
+enum {
+ P4TC_EXT_PARAMS_UNSPEC,
+ P4TC_EXT_PARAMS_NAME, /* string */
+ P4TC_EXT_PARAMS_ID, /* u32 */
+ P4TC_EXT_PARAMS_VALUE, /* bytes */
+ P4TC_EXT_PARAMS_TYPE, /* u32 */
+ P4TC_EXT_PARAMS_BITSZ, /* u16 */
+ P4TC_EXT_PARAMS_FLAGS, /* u8 */
+ __P4TC_EXT_PARAMS_MAX
+};
+
+#define P4TC_EXT_PARAMS_MAX (__P4TC_EXT_PARAMS_MAX - 1)
+
#define P4TC_RTA(r) \
((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
diff --git a/include/uapi/linux/p4tc_ext.h b/include/uapi/linux/p4tc_ext.h
new file mode 100644
index 000000000..7d4ecb5b1
--- /dev/null
+++ b/include/uapi/linux/p4tc_ext.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_P4TC_EXT_H
+#define __LINUX_P4TC_EXT_H
+
+#include <linux/types.h>
+#include <linux/pkt_sched.h>
+
+#define P4TC_EXT_NAMSIZ 64
+
+/* Extern attributes */
+enum {
+ P4TC_EXT_UNSPEC,
+ P4TC_EXT_INST_NAME,
+ P4TC_EXT_KIND,
+ P4TC_EXT_PARAMS,
+ P4TC_EXT_KEY,
+ P4TC_EXT_PAD,
+ P4TC_EXT_FLAGS,
+ __P4TC_EXT_MAX
+};
+
+#define P4TC_EXT_ID_DYN 0x01
+#define P4TC_EXT_ID_MAX 1023
+
+/* See other P4TC_EXT_FLAGS_ * flags in include/net/act_api.h. */
+#define P4TC_EXT_FLAGS_NO_PERCPU_STATS (1 << 0) /* Don't use percpu allocator
+ * for externs stats.
+ */
+#define P4TC_EXT_FLAGS_SKIP_HW (1 << 1) /* don't offload action to HW */
+#define P4TC_EXT_FLAGS_SKIP_SW (1 << 2) /* don't use action in SW */
+
+#define P4TC_EXT_FLAG_LARGE_DUMP_ON (1 << 0)
+
+#define P4TC_EXT_MAX __P4TC_EXT_MAX
+
+#endif
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 726902f10..0e1ca964f 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -4,5 +4,5 @@ CFLAGS_trace.o := -I$(src)
obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
- p4tc_runtime_api.o trace.o
+ p4tc_runtime_api.o trace.o p4tc_ext.o p4tc_tmpl_ext.o
obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/p4tc_bpf.c b/net/sched/p4tc/p4tc_bpf.c
index fe84c1504..71f397961 100644
--- a/net/sched/p4tc/p4tc_bpf.c
+++ b/net/sched/p4tc/p4tc_bpf.c
@@ -13,6 +13,7 @@
#include <linux/btf_ids.h>
#include <linux/net_namespace.h>
#include <net/p4tc.h>
+#include <net/p4tc_ext_api.h>
#include <linux/netdevice.h>
#include <net/sock.h>
#include <net/xdp.h>
@@ -22,6 +23,8 @@ BTF_ID(struct, p4tc_table_entry_act_bpf)
BTF_ID(struct, p4tc_table_entry_act_bpf_params)
BTF_ID(struct, p4tc_table_entry_act_bpf)
BTF_ID(struct, p4tc_table_entry_create_bpf_params)
+BTF_ID(struct, p4tc_ext_bpf_params)
+BTF_ID(struct, p4tc_ext_bpf_res)
static struct p4tc_table_entry_act_bpf dummy_act_bpf = {};
@@ -294,6 +297,50 @@ xdp_p4tc_entry_delete(struct xdp_md *xdp_ctx,
return __bpf_p4tc_entry_delete(net, params, key, key__sz);
}
+__bpf_kfunc int bpf_p4tc_extern_md_read(struct __sk_buff *skb_ctx,
+ struct p4tc_ext_bpf_res *res,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_extern_md_read(net, res, params);
+}
+
+__bpf_kfunc int bpf_p4tc_extern_md_write(struct __sk_buff *skb_ctx,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __bpf_p4tc_extern_md_write(net, params);
+}
+
+__bpf_kfunc int xdp_p4tc_extern_md_read(struct xdp_md *xdp_ctx,
+ struct p4tc_ext_bpf_res *res,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_extern_md_read(net, res, params);
+}
+
+__bpf_kfunc int xdp_p4tc_extern_md_write(struct xdp_md *xdp_ctx,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+ struct net *net = dev_net(ctx->rxq->dev);
+
+ return __bpf_p4tc_extern_md_write(net, params);
+}
+
__diag_pop();
BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
@@ -322,6 +369,26 @@ static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_xdp = {
.set = &p4tc_kfunc_check_tbl_set_xdp,
};
+BTF_SET8_START(p4tc_kfunc_check_ext_set_skb)
+BTF_ID_FLAGS(func, bpf_p4tc_extern_md_write);
+BTF_ID_FLAGS(func, bpf_p4tc_extern_md_read);
+BTF_SET8_END(p4tc_kfunc_check_ext_set_skb)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_ext_set_skb = {
+ .owner = THIS_MODULE,
+ .set = &p4tc_kfunc_check_ext_set_skb,
+};
+
+BTF_SET8_START(p4tc_kfunc_check_ext_set_xdp)
+BTF_ID_FLAGS(func, xdp_p4tc_extern_md_write);
+BTF_ID_FLAGS(func, xdp_p4tc_extern_md_read);
+BTF_SET8_END(p4tc_kfunc_check_ext_set_xdp)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_ext_set_xdp = {
+ .owner = THIS_MODULE,
+ .set = &p4tc_kfunc_check_ext_set_xdp,
+};
+
int register_p4tc_tbl_bpf(void)
{
int ret;
@@ -332,6 +399,16 @@ int register_p4tc_tbl_bpf(void)
return ret;
/* There is no unregister_btf_kfunc_id_set function */
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+ &p4tc_kfunc_tbl_set_xdp);
+ if (ret < 0)
+ return ret;
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT,
+ &p4tc_kfunc_ext_set_skb);
+ if (ret < 0)
+ return ret;
+
return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
- &p4tc_kfunc_tbl_set_xdp);
+ &p4tc_kfunc_ext_set_xdp);
}
diff --git a/net/sched/p4tc/p4tc_ext.c b/net/sched/p4tc/p4tc_ext.c
new file mode 100644
index 000000000..604634c8e
--- /dev/null
+++ b/net/sched/p4tc/p4tc_ext.c
@@ -0,0 +1,2204 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc_ext.c P4 TC EXTERN API
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/p4tc_types.h>
+#include <net/p4tc_ext_api.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+#include <net/tc_wrapper.h>
+#include <uapi/linux/p4tc.h>
+#include <net/xdp.h>
+
+static bool p4tc_ext_param_ops_is_init(struct p4tc_extern_param_ops *ops)
+{
+ struct p4tc_extern_param_ops uninit_ops = {NULL};
+
+ return memcmp(ops, &uninit_ops, sizeof(*ops));
+}
+
+static void p4tc_ext_put_param(struct p4tc_extern_param *param, bool free_val)
+{
+ struct p4tc_extern_param_ops *val_ops;
+
+ if (p4tc_ext_param_ops_is_init(param->ops))
+ val_ops = param->ops;
+ else
+ val_ops = param->mod_ops;
+
+ if (free_val) {
+ if (val_ops && val_ops->free)
+ val_ops->free(param);
+ else
+ kfree(param->value);
+ }
+
+ if (param->mask_shift)
+ p4t_release(param->mask_shift);
+ kfree(param);
+}
+
+static void p4tc_ext_put_many_params(struct idr *params_idr,
+ struct p4tc_extern_param *params[],
+ int params_count)
+{
+ int i;
+
+ for (i = 0; i < params_count; i++)
+ p4tc_ext_put_param(params[i], true);
+}
+
+static void p4tc_ext_insert_param(struct idr *params_idr,
+ struct p4tc_extern_param *param)
+{
+ struct p4tc_extern_param *param_old;
+
+ param_old = idr_replace(params_idr, param, param->id);
+ if (param_old != ERR_PTR(-EBUSY))
+ p4tc_ext_put_param(param_old, true);
+}
+
+static void p4tc_ext_insert_many_params(struct idr *params_idr,
+ struct p4tc_extern_param *params[],
+ int params_count)
+{
+ int i;
+
+ for (i = 0; i < params_count; i++)
+ p4tc_ext_insert_param(params_idr, params[i]);
+}
+
+static void __p4tc_ext_params_free(struct p4tc_extern_params *params,
+ bool free_vals)
+{
+ struct p4tc_extern_param *parm;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(&params->params_idr, parm, tmp, id) {
+ idr_remove(&params->params_idr, id);
+ p4tc_ext_put_param(parm, free_vals);
+ }
+}
+
+void p4tc_ext_params_free(struct p4tc_extern_params *params, bool free_vals)
+{
+ __p4tc_ext_params_free(params, free_vals);
+ idr_destroy(&params->params_idr);
+ kfree(params);
+}
+EXPORT_SYMBOL_GPL(p4tc_ext_params_free);
+
+static void free_p4tc_ext(struct p4tc_extern *p)
+{
+ if (p->common.params)
+ p4tc_ext_params_free(p->common.params, true);
+
+ kfree(p);
+}
+
+static void free_p4tc_ext_rcu(struct rcu_head *rcu)
+{
+ struct p4tc_extern *p;
+
+ p = container_of(rcu, struct p4tc_extern, rcu);
+
+ free_p4tc_ext(p);
+}
+
+static void p4tc_extern_cleanup(struct p4tc_extern *p)
+{
+ free_p4tc_ext_rcu(&p->rcu);
+}
+
+static int __p4tc_extern_put(struct p4tc_extern *p)
+{
+ if (refcount_dec_and_test(&p->common.p4tc_ext_refcnt)) {
+ idr_remove(p->elems_idr, p->common.p4tc_ext_key);
+
+ p4tc_extern_cleanup(p);
+
+ return 1;
+ }
+
+ return 0;
+}
+
+static int __p4tc_ext_idr_release(struct p4tc_extern *p)
+{
+ int ret = 0;
+
+ if (p) {
+ if (__p4tc_extern_put(p))
+ ret = P4TC_EXT_P_DELETED;
+ }
+
+ return ret;
+}
+
+static int p4tc_ext_idr_release(struct p4tc_extern *e)
+{
+ return __p4tc_ext_idr_release(e);
+}
+
+static int p4tc_ext_idr_release_dec_num_elems(struct p4tc_extern *e)
+{
+ struct p4tc_extern_inst *inst = e->common.inst;
+ int ret;
+
+ ret = __p4tc_ext_idr_release(e);
+ if (ret == P4TC_EXT_P_DELETED)
+ p4tc_ext_inst_dec_num_elems(inst);
+
+ return ret;
+}
+
+static size_t p4tc_extern_shared_attrs_size(void)
+{
+ return nla_total_size(0) /* extern number nested */
+ + nla_total_size(EXTERNNAMSIZ) /* P4TC_EXT_KIND */
+ + nla_total_size(EXTERNINSTNAMSIZ) /* P4TC_EXT_INST_NAME */
+ + nla_total_size(sizeof(struct nla_bitfield32)); /* P4TC_EXT_FLAGS */
+}
+
+static size_t p4tc_extern_full_attrs_size(size_t sz)
+{
+ return NLMSG_HDRLEN /* struct nlmsghdr */
+ + sizeof(struct p4tcmsg)
+ + nla_total_size(0) /* P4TC_ROOT nested */
+ + sz;
+}
+
+static int
+generic_dump_ext_param_value(struct sk_buff *skb, struct p4tc_type *type,
+ struct p4tc_extern_param *param)
+{
+ const u32 bytesz = BITS_TO_BYTES(type->container_bitsz);
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *nla_value;
+
+ nla_value = nla_nest_start(skb, P4TC_EXT_PARAMS_VALUE);
+ if (nla_put(skb, P4TC_EXT_PARAMS_VALUE_RAW, bytesz,
+ param->value))
+ goto out_nlmsg_trim;
+ nla_nest_end(skb, nla_value);
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static const struct nla_policy p4tc_extern_params_value_policy[P4TC_EXT_VALUE_PARAMS_MAX + 1] = {
+ [P4TC_EXT_PARAMS_VALUE_RAW] = { .type = NLA_BINARY },
+};
+
+static int dev_init_param_value(struct net *net,
+ struct p4tc_extern_param *nparam,
+ void *value,
+ struct netlink_ext_ack *extack)
+{
+ u32 ifindex, *ifindex_ptr;
+
+ if (value) {
+ ifindex = *((u32 *)value);
+ } else {
+ ifindex = 1;
+ goto param_alloc;
+ }
+
+ rcu_read_lock();
+ if (!dev_get_by_index_rcu(net, ifindex)) {
+ NL_SET_ERR_MSG(extack, "Invalid ifindex");
+ rcu_read_unlock();
+ return -EINVAL;
+ }
+ rcu_read_unlock();
+
+param_alloc:
+ nparam->value = kzalloc(sizeof(ifindex), GFP_KERNEL);
+ if (!nparam->value)
+ return -ENOMEM;
+
+ ifindex_ptr = nparam->value;
+ *ifindex_ptr = ifindex;
+
+ return 0;
+}
+
+static int dev_dump_param_value(struct sk_buff *skb,
+ struct p4tc_extern_param_ops *op,
+ struct p4tc_extern_param *param)
+{
+ struct nlattr *nest;
+ u32 *ifindex;
+ int ret;
+
+ nest = nla_nest_start(skb, P4TC_EXT_PARAMS_VALUE);
+ ifindex = (u32 *)param->value;
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_VALUE_RAW, *ifindex)) {
+ ret = -EINVAL;
+ goto out_nla_cancel;
+ }
+ nla_nest_end(skb, nest);
+
+ return 0;
+
+out_nla_cancel:
+ nla_nest_cancel(skb, nest);
+ return ret;
+}
+
+static void dev_free_param_value(struct p4tc_extern_param *param)
+{
+ kfree(param->value);
+}
+
+static const struct p4tc_extern_param_ops ext_param_ops[P4T_MAX + 1] = {
+ [P4T_DEV] = {
+ .init_value = dev_init_param_value,
+ .dump_value = dev_dump_param_value,
+ .free = dev_free_param_value,
+ },
+};
+
+static int p4tc_extern_elem_dump_param_noval(struct sk_buff *skb,
+ struct p4tc_extern_param *parm)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+
+ if (nla_put_string(skb, P4TC_EXT_PARAMS_NAME,
+ parm->name))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_ID, parm->id))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_TYPE, parm->type->typeid))
+ goto nla_put_failure;
+
+ return skb->len;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int
+p4tc_extern_elem_dump_params(struct sk_buff *skb, struct p4tc_extern_common *e)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_extern_param *parm;
+ struct nlattr *nest_parms;
+ int id;
+
+ nest_parms = nla_nest_start(skb, P4TC_EXT_PARAMS);
+ if (e->params) {
+ int i = 1;
+
+ idr_for_each_entry(&e->params->params_idr, parm, id) {
+ struct p4tc_extern_param_ops *val_ops;
+ struct nlattr *nest_count;
+
+ nest_count = nla_nest_start(skb, i);
+ if (!nest_count)
+ goto nla_put_failure;
+
+ if (p4tc_extern_elem_dump_param_noval(skb, parm) < 0)
+ goto nla_put_failure;
+
+ if (p4tc_ext_param_ops_is_init(parm->ops))
+ val_ops = parm->ops;
+ else
+ val_ops = parm->mod_ops;
+
+ read_lock_bh(&e->params->params_lock);
+ if (val_ops && val_ops->dump_value) {
+ if (val_ops->dump_value(skb, parm->ops, parm) < 0) {
+ read_unlock_bh(&e->params->params_lock);
+ goto nla_put_failure;
+ }
+ } else {
+ if (generic_dump_ext_param_value(skb, parm->type, parm)) {
+ read_unlock_bh(&e->params->params_lock);
+ goto nla_put_failure;
+ }
+ }
+ read_unlock_bh(&e->params->params_lock);
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_FLAGS,
+ parm->flags))
+ goto nla_put_failure;
+
+ nla_nest_end(skb, nest_count);
+ i++;
+ }
+ }
+ nla_nest_end(skb, nest_parms);
+
+ return skb->len;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+int
+p4tc_ext_elem_dump_1(struct sk_buff *skb, struct p4tc_extern_common *e)
+{
+ const char *instname = e->inst->common.name;
+ unsigned char *b = nlmsg_get_pos(skb);
+ const char *kind = e->inst->ext_name;
+ u32 flags = e->p4tc_ext_flags;
+ u32 key = e->p4tc_ext_key;
+ int err;
+
+ if (nla_put_string(skb, P4TC_EXT_KIND, kind))
+ goto nla_put_failure;
+
+ if (nla_put_string(skb, P4TC_EXT_INST_NAME, instname))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, P4TC_EXT_KEY, key))
+ goto nla_put_failure;
+
+ if (flags && nla_put_bitfield32(skb, P4TC_EXT_FLAGS,
+ flags, flags))
+ goto nla_put_failure;
+
+ err = p4tc_extern_elem_dump_params(skb, e);
+ if (err < 0)
+ goto nla_put_failure;
+
+ return skb->len;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+EXPORT_SYMBOL(p4tc_ext_elem_dump_1);
+
+static int p4tc_ext_dump_walker(struct p4tc_extern_inst *inst,
+ struct sk_buff *skb,
+ struct netlink_callback *cb)
+{
+ struct idr *idr = &inst->control_elems_idr;
+ int err = 0, s_i = 0, n_i = 0;
+ u32 ext_flags = cb->args[2];
+ struct p4tc_extern *p;
+ unsigned long id = 1;
+ struct nlattr *nest;
+ unsigned long tmp;
+ int key = -1;
+
+ if (p4tc_ext_inst_has_dump(inst)) {
+ n_i = inst->ops->dump(skb, inst, cb);
+ if (n_i < 0)
+ return n_i;
+ } else {
+ s_i = cb->args[0];
+
+ idr_for_each_entry_ul(idr, p, tmp, id) {
+ key++;
+ if (key < s_i)
+ continue;
+ if (IS_ERR(p))
+ continue;
+
+ nest = nla_nest_start(skb, n_i);
+ if (!nest) {
+ key--;
+ goto nla_put_failure;
+ }
+
+ err = p4tc_ext_elem_dump_1(skb, &p->common);
+ if (err < 0) {
+ key--;
+ nlmsg_trim(skb, nest);
+ goto done;
+ }
+ nla_nest_end(skb, nest);
+ n_i++;
+ if (!(ext_flags & P4TC_EXT_FLAG_LARGE_DUMP_ON) &&
+ n_i >= P4TC_MSGBATCH_SIZE)
+ goto done;
+ }
+ }
+done:
+ if (key >= 0)
+ cb->args[0] = key + 1;
+
+ if (n_i) {
+ if (ext_flags & P4TC_EXT_FLAG_LARGE_DUMP_ON)
+ cb->args[1] = n_i;
+ }
+ return n_i;
+
+nla_put_failure:
+ nla_nest_cancel(skb, nest);
+ goto done;
+}
+
+static void __p4tc_ext_idr_purge(struct p4tc_extern *p)
+{
+ atomic_dec(&p->common.inst->curr_num_elems);
+ p4tc_extern_cleanup(p);
+}
+
+static void p4tc_ext_idr_purge(struct p4tc_extern *p)
+{
+ idr_remove(p->elems_idr, p->common.p4tc_ext_key);
+ __p4tc_ext_idr_purge(p);
+}
+
+/* Called when pipeline is being purged */
+void p4tc_ext_purge(struct idr *idr)
+{
+ struct p4tc_extern *p;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(idr, p, tmp, id) {
+ if (IS_ERR(p))
+ continue;
+ p4tc_ext_idr_purge(p);
+ }
+}
+
+static int p4tc_ext_idr_search(struct p4tc_extern_inst *inst,
+ struct p4tc_extern **e, u32 key)
+{
+ struct idr *elems_idr = &inst->control_elems_idr;
+ struct p4tc_extern *p;
+
+ p = idr_find(elems_idr, key);
+ if (IS_ERR(p))
+ p = NULL;
+
+ if (p) {
+ *e = p;
+ return true;
+ }
+ return false;
+}
+
+static int __p4tc_ext_idr_search(struct p4tc_extern_inst *inst,
+ struct p4tc_extern **e, u32 key)
+{
+ if (p4tc_ext_idr_search(inst, e, key)) {
+ refcount_inc(&((*e)->common.p4tc_ext_refcnt));
+ return true;
+ }
+
+ return false;
+}
+
+static int p4tc_ext_copy(struct p4tc_extern_inst *inst,
+ u32 key, struct p4tc_extern **e,
+ struct p4tc_extern *e_orig,
+ const struct p4tc_extern_ops *ops,
+ u32 flags)
+{
+ const u32 size = (ops && ops->elem_size) ? ops->elem_size : sizeof(**e);
+ struct p4tc_extern *p = kzalloc(size, GFP_KERNEL);
+
+ if (unlikely(!p))
+ return -ENOMEM;
+
+ spin_lock_init(&p->p4tc_ext_lock);
+ p->common.p4tc_ext_key = key;
+ p->common.p4tc_ext_flags = flags;
+ refcount_set(&p->common.p4tc_ext_refcnt,
+ refcount_read(&e_orig->common.p4tc_ext_refcnt));
+
+ p->elems_idr = e_orig->elems_idr;
+ p->common.inst = inst;
+ p->common.ops = ops;
+ *e = p;
+ return 0;
+}
+
+static int p4tc_ext_idr_create(struct p4tc_extern_inst *inst,
+ u32 key, struct p4tc_extern **e,
+ const struct p4tc_extern_ops *ops,
+ u32 flags)
+{
+ struct p4tc_extern *p = kzalloc(sizeof(*p), GFP_KERNEL);
+ u32 max_num_elems = inst->max_num_elems;
+
+ if (unlikely(!p))
+ return -ENOMEM;
+
+ if (atomic_read(&inst->curr_num_elems) == max_num_elems) {
+ kfree(p);
+ return -E2BIG;
+ }
+
+ p4tc_ext_inst_inc_num_elems(inst);
+
+ refcount_set(&p->common.p4tc_ext_refcnt, 1);
+
+ spin_lock_init(&p->p4tc_ext_lock);
+ p->common.p4tc_ext_key = key;
+ p->common.p4tc_ext_flags = flags;
+
+ p->elems_idr = &inst->control_elems_idr;
+ p->common.inst = inst;
+ p->common.ops = ops;
+ *e = p;
+ return 0;
+}
+
+/* Check if extern with specified key exists. If extern is found, increments
+ * its reference, and return 1. Otherwise return -ENOENT.
+ */
+static int p4tc_ext_idr_check_alloc(struct p4tc_extern_inst *inst,
+ u32 key, struct p4tc_extern **e,
+ struct netlink_ext_ack *extack)
+{
+ struct idr *elems_idr = &inst->control_elems_idr;
+ struct p4tc_extern *p;
+
+ p = idr_find(elems_idr, key);
+ if (p) {
+ refcount_inc(&p->common.p4tc_ext_refcnt);
+ *e = p;
+ return 1;
+ }
+
+ NL_SET_ERR_MSG_FMT(extack, "Unable to find element with key %u",
+ key);
+ return -ENOENT;
+}
+
+struct p4tc_extern *
+p4tc_ext_elem_find(struct p4tc_extern_inst *inst,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct p4tc_extern *e;
+
+ e = idr_find(&inst->control_elems_idr, params->index);
+ if (!e)
+ return ERR_PTR(-ENOENT);
+
+ return e;
+}
+EXPORT_SYMBOL(p4tc_ext_elem_find);
+
+#define p4tc_ext_common_elem_find(common, params) \
+ ((struct p4tc_extern_common *)p4tc_ext_elem_find(common, params))
+
+static struct p4tc_extern_common *
+__p4tc_ext_common_elem_get(struct net *net, struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct p4tc_extern_common *ext_common;
+ struct p4tc_extern_inst *inst;
+ int err;
+
+ inst = p4tc_ext_inst_get_byids(net, pipeline, params);
+ if (IS_ERR(inst)) {
+ err = PTR_ERR(inst);
+ goto put_pipe;
+ }
+
+ ext_common = p4tc_ext_common_elem_find(inst, params);
+ if (IS_ERR(ext_common)) {
+ err = PTR_ERR(ext_common);
+ goto put_pipe;
+ }
+
+ if (!refcount_inc_not_zero(&ext_common->p4tc_ext_refcnt)) {
+ err = -EBUSY;
+ goto put_pipe;
+ }
+
+ return ext_common;
+
+put_pipe:
+ p4tc_pipeline_put(*pipeline);
+ return ERR_PTR(err);
+}
+
+/* This function should be paired with p4tc_ext_common_elem_put */
+struct p4tc_extern_common *
+p4tc_ext_common_elem_get(struct sk_buff *skb, struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct net *net;
+
+ net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+ return __p4tc_ext_common_elem_get(net, pipeline, params);
+}
+EXPORT_SYMBOL(p4tc_ext_common_elem_get);
+
+/* This function should be paired with p4tc_ext_common_elem_put */
+struct p4tc_extern_common *
+p4tc_xdp_ext_common_elem_get(struct xdp_buff *ctx,
+ struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct net *net;
+
+ net = dev_net(ctx->rxq->dev);
+
+ return __p4tc_ext_common_elem_get(net, pipeline, params);
+}
+EXPORT_SYMBOL(p4tc_xdp_ext_common_elem_get);
+
+void p4tc_ext_common_elem_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_extern_common *common)
+{
+ refcount_dec(&common->p4tc_ext_refcnt);
+ p4tc_pipeline_put(pipeline);
+}
+EXPORT_SYMBOL(p4tc_ext_common_elem_put);
+
+static bool p4tc_ext_param_is_writable(struct p4tc_extern_param *param)
+{
+ return param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY;
+}
+
+int __bpf_p4tc_extern_md_write(struct net *net,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct p4tc_extern_param *param;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_extern_common *e;
+ struct p4tc_type *type;
+ u8 *params_data;
+ int err = 0;
+
+ if (!params)
+ return -EINVAL;
+
+ params_data = params->in_params;
+
+ e = __p4tc_ext_common_elem_get(net, &pipeline, params);
+ if (IS_ERR(e))
+ return PTR_ERR(e);
+
+ param = idr_find(&e->params->params_idr, params->param_id);
+ if (unlikely(!param)) {
+ err = -EINVAL;
+ goto put_pipe;
+ }
+
+ if (!p4tc_ext_param_is_writable(param)) {
+ err = -EINVAL;
+ goto put_pipe;
+ }
+
+ type = param->type;
+ if (unlikely(!type->ops->host_read)) {
+ err = -EINVAL;
+ goto put_pipe;
+ }
+
+ if (unlikely(!type->ops->host_write)) {
+ err = -EINVAL;
+ goto put_pipe;
+ }
+
+ write_lock_bh(&e->params->params_lock);
+ p4t_copy(param->mask_shift, type, param->value,
+ param->mask_shift, type, params_data);
+ write_unlock_bh(&e->params->params_lock);
+
+put_pipe:
+ p4tc_ext_common_elem_put(pipeline, e);
+
+ return err;
+}
+
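+/* Backing implementation for the extern param read kfunc: look up the
+ * extern element referenced by @params and copy the value of the param with
+ * @params->param_id into @res->out_params under the params read lock.
+ */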
+int __bpf_p4tc_extern_md_read(struct net *net,
+ struct p4tc_ext_bpf_res *res,
+ struct p4tc_ext_bpf_params *params)
+{
+ const struct p4tc_type_ops *ops;
+ struct p4tc_extern_param *param;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_extern_common *e;
+ int err = 0;
+
+ if (!params || !res)
+ return -EINVAL;
+
+ e = __p4tc_ext_common_elem_get(net, &pipeline, params);
+ if (IS_ERR(e))
+ return PTR_ERR(e);
+
+ param = idr_find(&e->params->params_idr, params->param_id);
+ if (unlikely(!param)) {
+ err = -ENOENT;
+ goto refcount_dec;
+ }
+
+ ops = param->type->ops;
+ if (unlikely(!ops->host_read)) {
+ err = -ENOENT;
+ goto refcount_dec;
+ }
+
+ read_lock_bh(&e->params->params_lock);
+ ops->host_read(param->type, param->mask_shift, param->value,
+ res->out_params);
+ read_unlock_bh(&e->params->params_lock);
+
+refcount_dec:
+ p4tc_ext_common_elem_put(pipeline, e);
+
+ return err;
+}
+
+static int p4tc_extern_destroy(struct p4tc_extern *externs[])
+{
+ struct p4tc_extern *e;
+ int ret = 0, i;
+
+ for (i = 0; i < P4TC_MSGBATCH_SIZE && externs[i]; i++) {
+ e = externs[i];
+ externs[i] = NULL;
+ free_p4tc_ext_rcu(&e->rcu);
+ }
+ return ret;
+}
+
+static int p4tc_extern_put(struct p4tc_extern *p)
+{
+ return __p4tc_extern_put(p);
+}
+
+/* Put all externs in this array, skipping NULL entries. */
+static void p4tc_extern_put_many(struct p4tc_extern *externs[])
+{
+ int i;
+
+ for (i = 0; i < P4TC_MSGBATCH_SIZE; i++) {
+ struct p4tc_extern *e = externs[i];
+
+ if (!e)
+ continue;
+ p4tc_extern_put(e);
+ }
+}
+
+static int p4tc_extern_elem_dump(struct sk_buff *skb,
+ struct p4tc_extern *externs[],
+ int ref)
+{
+ struct p4tc_extern *e;
+ int err = -EINVAL, i;
+ struct nlattr *nest;
+
+ for (i = 0; i < P4TC_MSGBATCH_SIZE && externs[i]; i++) {
+ e = externs[i];
+ nest = nla_nest_start_noflag(skb, i + 1);
+ if (!nest)
+ goto nla_put_failure;
+ err = p4tc_ext_elem_dump_1(skb, &e->common);
+ if (err < 0)
+ goto errout;
+ nla_nest_end(skb, nest);
+ }
+
+ return 0;
+
+nla_put_failure:
+ err = -EINVAL;
+errout:
+ nla_nest_cancel(skb, nest);
+ return err;
+}
+
+static void generic_free_param_value(struct p4tc_extern_param *param)
+{
+ kfree(param->value);
+}
+
+static void *generic_parse_param_value(struct p4tc_extern_param *nparam,
+ struct p4tc_type *type,
+ struct nlattr *nla, bool value_required,
+ struct netlink_ext_ack *extack)
+{
+ const u32 alloc_len = BITS_TO_BYTES(type->container_bitsz);
+ struct nlattr *tb_value[P4TC_EXT_VALUE_PARAMS_MAX + 1];
+ void *value;
+ int err;
+
+ if (!nla) {
+ if (value_required) {
+ NL_SET_ERR_MSG(extack, "Must specify param value");
+ return ERR_PTR(-EINVAL);
+ } else {
+ return NULL;
+ }
+ }
+
+ err = nla_parse_nested(tb_value, P4TC_EXT_VALUE_PARAMS_MAX,
+ nla, p4tc_extern_params_value_policy,
+ extack);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ if (!tb_value[P4TC_EXT_PARAMS_VALUE_RAW]) {
+ NL_SET_ERR_MSG(extack, "Must specify raw value attr");
+ return ERR_PTR(-EINVAL);
+ }
+
+ value = nla_data(tb_value[P4TC_EXT_PARAMS_VALUE_RAW]);
+ if (type->ops->validate_p4t) {
+ err = type->ops->validate_p4t(type, value, 0, type->bitsz - 1,
+ extack);
+ if (err < 0)
+ return ERR_PTR(err);
+ }
+
+ if (nla_len(tb_value[P4TC_EXT_PARAMS_VALUE_RAW]) != alloc_len)
+ return ERR_PTR(-EINVAL);
+
+ return value;
+}
+
+static int generic_init_param_value(struct net *net,
+ struct p4tc_extern_param *nparam,
+ struct nlattr **tb,
+ u32 byte_sz, bool value_required,
+ struct netlink_ext_ack *extack)
+{
+ const u32 alloc_len = BITS_TO_BYTES(nparam->type->container_bitsz);
+ struct p4tc_extern_param_ops *ops;
+ void *value;
+
+ if (p4tc_ext_param_ops_is_init(nparam->ops))
+ ops = nparam->ops;
+ else
+ ops = nparam->mod_ops;
+
+ value = generic_parse_param_value(nparam, nparam->type,
+ tb[P4TC_EXT_PARAMS_VALUE],
+ value_required, extack);
+ if (IS_ERR_OR_NULL(value))
+ return PTR_ERR(value);
+
+ if (ops && ops->init_value)
+ return ops->init_value(net, nparam, value, extack);
+
+ nparam->value = kzalloc(alloc_len, GFP_KERNEL);
+ if (!nparam->value)
+ return -ENOMEM;
+
+ memcpy(nparam->value, value, byte_sz);
+
+ return 0;
+}
+
+static const struct nla_policy p4tc_extern_policy[P4TC_EXT_MAX + 1] = {
+ [P4TC_EXT_INST_NAME] = {
+ .type = NLA_STRING,
+ .len = EXTERNINSTNAMSIZ
+ },
+ [P4TC_EXT_KIND] = { .type = NLA_STRING },
+ [P4TC_EXT_PARAMS] = { .type = NLA_NESTED },
+ [P4TC_EXT_KEY] = { .type = NLA_NESTED },
+ [P4TC_EXT_FLAGS] = { .type = NLA_BITFIELD32 },
+};
+
+static const struct nla_policy p4tc_extern_params_policy[P4TC_EXT_PARAMS_MAX + 1] = {
+ [P4TC_EXT_PARAMS_NAME] = { .type = NLA_STRING, .len = EXTPARAMNAMSIZ },
+ [P4TC_EXT_PARAMS_ID] = { .type = NLA_U32 },
+ [P4TC_EXT_PARAMS_VALUE] = { .type = NLA_NESTED },
+ [P4TC_EXT_PARAMS_TYPE] = { .type = NLA_U32 },
+ [P4TC_EXT_PARAMS_BITSZ] = { .type = NLA_U16 },
+ [P4TC_EXT_PARAMS_FLAGS] = { .type = NLA_U8 },
+};
+
+int p4tc_ext_param_value_init(struct net *net,
+ struct p4tc_extern_param *param,
+ struct nlattr **tb, u32 typeid,
+ bool value_required,
+ struct netlink_ext_ack *extack)
+{
+ u32 byte_sz = BITS_TO_BYTES(param->bitsz);
+
+ if (!param->ops) {
+ struct p4tc_extern_param_ops *ops;
+
+ ops = (struct p4tc_extern_param_ops *)&ext_param_ops[typeid];
+ param->ops = ops;
+ }
+
+ return generic_init_param_value(net, param, tb, byte_sz, value_required,
+ extack);
+}
+
+void p4tc_ext_param_value_free_tmpl(struct p4tc_extern_param *param)
+{
+ if (param->ops->free)
+ return param->ops->free(param);
+
+ return generic_free_param_value(param);
+}
+
+int p4tc_ext_param_value_dump_tmpl(struct sk_buff *skb,
+ struct p4tc_extern_param *param)
+{
+ if (param->ops && param->ops->dump_value)
+ return param->ops->dump_value(skb, param->ops, param);
+
+ return generic_dump_ext_param_value(skb, param->type, param);
+}
+
+static struct p4tc_extern_param *
+p4tc_ext_create_param(struct net *net, struct p4tc_extern_params *params,
+ struct idr *control_params_idr,
+ struct nlattr **tb, size_t *attrs_size,
+ bool init_param, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_param *param, *nparam;
+ u32 param_id = 0;
+ int err = 0;
+
+ if (tb[P4TC_EXT_PARAMS_ID])
+ param_id = nla_get_u32(tb[P4TC_EXT_PARAMS_ID]);
+ *attrs_size += nla_total_size(sizeof(u32));
+
+ param = p4tc_ext_param_find_byanyattr(control_params_idr,
+ tb[P4TC_EXT_PARAMS_NAME],
+ param_id, extack);
+ if (IS_ERR(param))
+ return param;
+
+ if (tb[P4TC_EXT_PARAMS_TYPE]) {
+ u32 typeid = nla_get_u32(tb[P4TC_EXT_PARAMS_TYPE]);
+
+ if (param->type->typeid != typeid) {
+ NL_SET_ERR_MSG(extack,
+ "Param type differs from template");
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param type");
+ return ERR_PTR(-EINVAL);
+ }
+ *attrs_size += nla_total_size(sizeof(u32));
+
+ nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+ if (!nparam)
+ return ERR_PTR(-ENOMEM);
+
+ strscpy(nparam->name, param->name, EXTPARAMNAMSIZ);
+ nparam->type = param->type;
+ nparam->bitsz = param->bitsz;
+
+ if (init_param) {
+ err = p4tc_ext_param_value_init(net, nparam, tb,
+ param->type->typeid, true,
+ extack);
+ } else {
+ void *value;
+
+ value = generic_parse_param_value(nparam, nparam->type,
+ tb[P4TC_EXT_PARAMS_VALUE],
+ true, extack);
+ if (IS_ERR(value))
+ err = PTR_ERR(value);
+ else
+ nparam->value = value;
+ }
+
+ if (err < 0)
+ goto free;
+
+ *attrs_size += nla_total_size(BITS_TO_BYTES(param->type->container_bitsz));
+ nparam->id = param->id;
+
+ err = idr_alloc_u32(&params->params_idr, ERR_PTR(-EBUSY), &nparam->id,
+ nparam->id, GFP_KERNEL);
+ if (err < 0)
+ goto free_val;
+
+ return nparam;
+
+free_val:
+ if (param->ops && param->ops->free)
+ param->ops->free(nparam);
+ else
+ generic_free_param_value(nparam);
+
+free:
+ kfree(nparam);
+
+ return ERR_PTR(err);
+}
+
+static struct p4tc_extern_param *
+p4tc_ext_init_param(struct net *net, struct idr *control_params_idr,
+ struct p4tc_extern_params *params, struct nlattr *nla,
+ size_t *attrs_size, bool init_value,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_PARAMS_MAX + 1];
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_EXT_PARAMS_MAX, nla,
+ p4tc_extern_params_policy, extack);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ return p4tc_ext_create_param(net, params, control_params_idr, tb,
+ attrs_size, init_value, extack);
+}
+
+static int p4tc_ext_get_key_param_value(struct nlattr *nla,
+ u32 *key, struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_VALUE_PARAMS_MAX + 1];
+ u32 *value;
+ int err;
+
+ if (!nla) {
+ NL_SET_ERR_MSG(extack, "Must specify key param value");
+ return -EINVAL;
+ }
+
+ err = nla_parse_nested(tb, P4TC_EXT_VALUE_PARAMS_MAX,
+ nla, p4tc_extern_params_value_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[P4TC_EXT_PARAMS_VALUE_RAW]) {
+ NL_SET_ERR_MSG(extack, "Must specify raw value attr");
+ return -EINVAL;
+ }
+
+ if (nla_len(tb[P4TC_EXT_PARAMS_VALUE_RAW]) > sizeof(*key)) {
+ NL_SET_ERR_MSG(extack,
+ "Param value is bigger than 32 bits");
+ return -EINVAL;
+ }
+
+ value = nla_data(tb[P4TC_EXT_PARAMS_VALUE_RAW]);
+
+ *key = *value;
+
+ return 0;
+}
+
+static int p4tc_ext_get_nonscalar_key_param(struct idr *params_idr,
+ struct nlattr *nla, u32 *key,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_PARAMS_MAX + 1];
+ struct p4tc_extern_param *index_param;
+ char *param_name;
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_EXT_PARAMS_MAX, nla,
+ p4tc_extern_params_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[P4TC_EXT_PARAMS_NAME]) {
+ NL_SET_ERR_MSG(extack, "Must specify key param name");
+ return -EINVAL;
+ }
+ param_name = nla_data(tb[P4TC_EXT_PARAMS_NAME]);
+
+ index_param = p4tc_ext_param_find_byanyattr(params_idr,
+ tb[P4TC_EXT_PARAMS_NAME],
+ 0, extack);
+ if (IS_ERR(index_param)) {
+ NL_SET_ERR_MSG(extack, "Key param name not found");
+ return -EINVAL;
+ }
+
+ if (!(index_param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY)) {
+ NL_SET_ERR_MSG_FMT(extack, "%s is not the key param name",
+ param_name);
+ return -EINVAL;
+ }
+
+ err = p4tc_ext_get_key_param_value(tb[P4TC_EXT_PARAMS_VALUE], key,
+ extack);
+ if (err < 0)
+ return err;
+
+ return index_param->id;
+}
+
+static int p4tc_ext_get_key_param_scalar(struct p4tc_extern_inst *inst,
+ struct nlattr *nla, u32 *key,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_PARAMS_MAX + 1];
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_EXT_PARAMS_MAX, nla,
+ p4tc_extern_params_policy, extack);
+ if (err < 0)
+ return err;
+
+ return p4tc_ext_get_key_param_value(tb[P4TC_EXT_PARAMS_VALUE], key,
+ extack);
+}
+
+struct p4tc_extern_params *p4tc_extern_params_init(void)
+{
+ struct p4tc_extern_params *params;
+
+ params = kzalloc(sizeof(*params), GFP_KERNEL);
+ if (!params)
+ return NULL;
+
+ idr_init(&params->params_idr);
+ rwlock_init(&params->params_lock);
+
+ return params;
+}
+
+static int __p4tc_ext_init_params(struct net *net,
+ struct idr *control_params_idr,
+ struct p4tc_extern_params **params,
+ struct nlattr *nla, size_t *attrs_size,
+ bool init_values,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_param *params_backup[P4TC_MSGBATCH_SIZE] = { NULL };
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ int err;
+ int i;
+
+ if (!*params) {
+ *params = p4tc_extern_params_init();
+ if (!*params)
+ return -ENOMEM;
+ }
+
+ err = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+ if (err < 0) {
+ kfree(*params);
+ *params = NULL;
+ return err;
+ }
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ struct p4tc_extern_param *param;
+
+ param = p4tc_ext_init_param(net, control_params_idr, *params,
+ tb[i], attrs_size, init_values,
+ extack);
+ if (IS_ERR(param)) {
+ err = PTR_ERR(param);
+ goto params_del;
+ }
+ params_backup[i - 1] = param;
+ *attrs_size += nla_total_size(0); /* params array element nested */
+ }
+
+ p4tc_ext_insert_many_params(&((*params)->params_idr), params_backup,
+ i - 1);
+ return 0;
+
+params_del:
+ p4tc_ext_put_many_params(&((*params)->params_idr), params_backup,
+ i - 1);
+ kfree(*params);
+ *params = NULL;
+ return err;
+}
+
+#define p4tc_ext_init_params(net, control_params_idr, params, nla, attrs_size, extack) \
+ (__p4tc_ext_init_params(net, control_params_idr, params, \
+ nla, &(attrs_size), true, extack))
+
+#define p4tc_ext_parse_params(net, control_params_idr, params, nla, attrs_size, extack) \
+ (__p4tc_ext_init_params(net, control_params_idr, params, \
+ nla, &(attrs_size), false, extack))
+
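+/* Return a table-bound extern element to its instance's pool: reset each
+ * param to its default (the module's default_value op when present,
+ * otherwise all-zeroes), move the element back to the unused_elems list and
+ * drop the reference taken by p4tc_ext_elem_get().
+ */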
+void p4tc_ext_elem_put_list(struct p4tc_extern_inst *inst,
+ struct p4tc_extern_common *e)
+{
+ struct p4tc_extern_param *param;
+ unsigned long param_id, tmp;
+
+ idr_for_each_entry_ul(&e->params->params_idr, param, tmp, param_id) {
+ const struct p4tc_type *type = param->type;
+ const u32 type_bytesz = BITS_TO_BYTES(type->container_bitsz);
+
+ if (param->mod_ops)
+ param->mod_ops->default_value(param);
+ else
+ memset(param->value, 0, type_bytesz);
+ }
+
+ spin_lock(&inst->available_list_lock);
+ list_add_tail(&e->node, &inst->unused_elems);
+ refcount_dec(&e->p4tc_ext_refcnt);
+ spin_unlock(&inst->available_list_lock);
+}
+
+struct p4tc_extern_common *p4tc_ext_elem_get(struct p4tc_extern_inst *inst)
+{
+ struct p4tc_extern_common *e;
+
+ spin_lock(&inst->available_list_lock);
+ e = list_first_entry_or_null(&inst->unused_elems,
+ struct p4tc_extern_common, node);
+ if (e) {
+ refcount_inc(&e->p4tc_ext_refcnt);
+ list_del_init(&e->node);
+ }
+
+ spin_unlock(&inst->available_list_lock);
+
+ return e;
+}
+
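+/* Commit a batch of updated externs: each new element replaces the element
+ * with the same key in its instance's IDR and the old element is freed after
+ * an RCU grace period; for table-bindable instances the old element is
+ * unlinked from the instance list and the new one added to unused_elems.
+ */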
+static void p4tc_ext_idr_insert_many(struct p4tc_extern *externs[])
+{
+ int i;
+
+ for (i = 0; i < P4TC_MSGBATCH_SIZE; i++) {
+ struct p4tc_extern *e = externs[i];
+ struct p4tc_extern_inst *inst;
+ struct p4tc_extern *old_e;
+
+ if (!e)
+ continue;
+
+ inst = e->common.inst;
+ spin_lock(&inst->available_list_lock);
+ old_e = idr_replace(e->elems_idr, e, e->common.p4tc_ext_key);
+ if (inst->tbl_bindable)
+ list_del(&old_e->common.node);
+ call_rcu(&old_e->rcu, free_p4tc_ext_rcu);
+ if (inst->tbl_bindable)
+ list_add(&e->common.node, &inst->unused_elems);
+ spin_unlock(&inst->available_list_lock);
+ }
+}
+
+static const char *
+p4tc_ext_get_kind(struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_MAX + 1];
+ struct nlattr *kind;
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_EXT_MAX, nla,
+ p4tc_extern_policy, extack);
+ if (err < 0)
+ return ERR_PTR(err);
+ err = -EINVAL;
+ kind = tb[P4TC_EXT_KIND];
+ if (!kind) {
+ NL_SET_ERR_MSG(extack, "TC extern name must be specified");
+ return ERR_PTR(err);
+ }
+
+ return nla_data(kind);
+}
+
+static struct p4tc_extern *
+p4tc_ext_init(struct net *net, struct nlattr *nla,
+ struct p4tc_extern_inst *inst,
+ u32 key, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ struct idr *control_params_idr = &inst->params->params_idr;
+ const struct p4tc_extern_ops *e_o = inst->ops;
+ struct p4tc_extern_params *params = NULL;
+ struct p4tc_extern *e_orig = NULL;
+ size_t attrs_size = 0;
+ struct p4tc_extern *e;
+ int err = 0;
+
+ if (!nla) {
+ NL_SET_ERR_MSG(extack, "Must specify extern params");
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (p4tc_ext_has_rctrl(e_o)) {
+ err = p4tc_ext_parse_params(net, control_params_idr, ¶ms,
+ nla, attrs_size, extack);
+ if (err < 0)
+ goto out;
+
+ err = e_o->rctrl(RTM_P4TC_UPDATE, inst,
+ (struct p4tc_extern_common **)&e, params, &key,
+ extack);
+ p4tc_ext_params_free(params, false);
+ if (err < 0)
+ goto out;
+
+ return e;
+ }
+
+ err = p4tc_ext_idr_check_alloc(inst, key, &e_orig, extack);
+ if (err < 0)
+ goto out;
+
+ err = p4tc_ext_copy(inst, key, &e, e_orig, e_o, flags);
+ if (err < 0)
+ goto out;
+
+ err = p4tc_ext_init_params(net, control_params_idr, ¶ms,
+ nla, &attrs_size, extack);
+ if (err < 0)
+ goto release_idr;
+ attrs_size += nla_total_size(0) + p4tc_extern_shared_attrs_size();
+ e->attrs_size = attrs_size;
+
+ e->common.params = params;
+
+ return e;
+
+release_idr:
+ p4tc_ext_idr_release(e);
+
+out:
+ return ERR_PTR(err);
+}
+
+static struct p4tc_extern_param *find_key_param(struct idr *params_idr)
+{
+ struct p4tc_extern_param *param;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id) {
+ if (param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY)
+ return param;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_extern_param *
+p4tc_ext_init_defval_param(struct p4tc_extern_param *param,
+ struct netlink_ext_ack *extack)
+{
+ const u32 bytesz = BITS_TO_BYTES(param->type->container_bitsz);
+ struct p4tc_extern_param_ops *val_ops;
+ struct p4tc_extern_param *nparam;
+ int err;
+
+ if (p4tc_ext_param_ops_is_init(param->ops))
+ val_ops = param->ops;
+ else
+ val_ops = param->mod_ops;
+
+ nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+ if (!nparam) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ strscpy(nparam->name, param->name, EXTPARAMNAMSIZ);
+ nparam->type = param->type;
+ nparam->id = param->id;
+
+ if (val_ops) {
+ if (param->mod_ops && !val_ops->default_value) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Param %s should have default_value op",
+ param->name);
+ err = -EINVAL;
+ goto free_param;
+ }
+ err = val_ops->init_value(NULL, nparam, param->value,
+ extack);
+ if (err < 0)
+ goto free_param;
+ } else {
+ nparam->value = kzalloc(bytesz, GFP_KERNEL);
+ if (!nparam->value) {
+ err = -ENOMEM;
+ goto free_param;
+ }
+
+ if (param->value)
+ memcpy(nparam->value, param->value, bytesz);
+ }
+ nparam->ops = param->ops;
+ nparam->mod_ops = param->mod_ops;
+
+ return nparam;
+
+free_param:
+ kfree(nparam);
+out:
+ return ERR_PTR(err);
+}
+
+struct p4tc_extern_params *
+p4tc_ext_params_copy(struct p4tc_extern_params *params_orig)
+{
+ struct p4tc_extern_param *nparam = NULL;
+ struct p4tc_extern_params *params_copy;
+ const struct p4tc_extern_param *param;
+ unsigned long tmp, id;
+ int err;
+
+ params_copy = p4tc_extern_params_init();
+ if (!params_copy) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+
+ idr_for_each_entry_ul(&params_orig->params_idr, param, tmp, id) {
+ struct p4tc_type *param_type = param->type;
+ u32 alloc_len = BITS_TO_BYTES(param_type->container_bitsz);
+ struct p4tc_type_mask_shift *mask_shift = NULL;
+
+ nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+ if (!nparam) {
+ err = -ENOMEM;
+ goto free_params;
+ }
+ nparam->ops = param->ops;
+ nparam->mod_ops = param->mod_ops;
+ nparam->type = param->type;
+
+ if (param->value) {
+ nparam->value = kzalloc(alloc_len, GFP_KERNEL);
+ if (!nparam->value) {
+ err = -ENOMEM;
+ goto free_param;
+ }
+ memcpy(nparam->value, param->value, alloc_len);
+ }
+
+ if (param_type->ops && param_type->ops->create_bitops) {
+ const u32 bitsz = param->bitsz;
+
+ mask_shift = param_type->ops->create_bitops(bitsz, 0,
+ bitsz - 1,
+ NULL);
+ if (IS_ERR(mask_shift)) {
+ err = PTR_ERR(mask_shift);
+ goto free_param_value;
+ }
+ nparam->mask_shift = mask_shift;
+ }
+
+ nparam->id = param->id;
+ err = idr_alloc_u32(&params_copy->params_idr, nparam,
+ &nparam->id, nparam->id, GFP_KERNEL);
+ if (err < 0)
+ goto free_mask_shift;
+
+ nparam->index = param->index;
+ nparam->bitsz = param->bitsz;
+ nparam->flags = param->flags;
+ strscpy(nparam->name, param->name, EXTPARAMNAMSIZ);
+ params_copy->num_params++;
+ }
+
+ return params_copy;
+
+free_mask_shift:
+ if (nparam->mask_shift)
+ p4t_release(nparam->mask_shift);
+free_param_value:
+ kfree(nparam->value);
+free_param:
+ kfree(nparam);
+free_params:
+ p4tc_ext_params_free(params_copy, true);
+err_out:
+ return ERR_PTR(err);
+}
+EXPORT_SYMBOL(p4tc_ext_params_copy);
+
+int p4tc_ext_init_defval_params(struct p4tc_extern_inst *inst,
+ struct p4tc_extern_common *common,
+ struct idr *control_params_idr,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_params *params = NULL;
+ struct p4tc_extern_param *param;
+ unsigned long tmp, id;
+ int err;
+
+ params = p4tc_extern_params_init();
+ if (!params)
+ return -ENOMEM;
+
+ idr_for_each_entry_ul(control_params_idr, param, tmp, id) {
+ struct p4tc_extern_param *nparam;
+
+ if (param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY)
+ /* Skip key param */
+ continue;
+
+ nparam = p4tc_ext_init_defval_param(param, extack);
+ if (IS_ERR(nparam)) {
+ err = PTR_ERR(nparam);
+ goto free_params;
+ }
+
+ err = idr_alloc_u32(&params->params_idr, nparam, &nparam->id,
+ nparam->id, GFP_KERNEL);
+ if (err < 0) {
+ kfree(nparam);
+ goto free_params;
+ }
+ params->num_params++;
+ }
+
+ common->params = params;
+ common->inst = inst;
+ common->ops = inst->ops;
+ refcount_set(&common->p4tc_ext_refcnt, 1);
+ if (inst->tbl_bindable)
+ list_add(&common->node, &inst->unused_elems);
+
+ return 0;
+
+free_params:
+ p4tc_ext_params_free(params, true);
+ return err;
+}
+EXPORT_SYMBOL_GPL(p4tc_ext_init_defval_params);
+
+static int p4tc_ext_init_defval(struct p4tc_extern **e,
+ struct p4tc_extern_inst *inst,
+ u32 key, struct netlink_ext_ack *extack)
+{
+ const struct p4tc_extern_ops *e_o = inst->ops;
+ int err;
+
+ if (!inst->is_scalar) {
+ struct p4tc_extern_param *key_param;
+
+ key_param = find_key_param(&inst->params->params_idr);
+ if (!key_param) {
+ NL_SET_ERR_MSG(extack, "Unable to find key param");
+ return -ENOENT;
+ }
+ }
+
+ err = p4tc_ext_idr_create(inst, key, e, e_o, 0);
+ if (err < 0)
+ return err;
+
+ /* We can store it in the IDR right away because we arrive here
+ * holding the rtnl_lock, so this code never runs concurrently.
+ */
+ err = idr_alloc_u32(&inst->control_elems_idr, *e, &key,
+ key, GFP_KERNEL);
+ if (err < 0) {
+ __p4tc_ext_idr_purge(*e);
+ return err;
+ }
+
+ err = p4tc_ext_init_defval_params(inst, &((*e)->common),
+ &inst->params->params_idr, extack);
+ if (err < 0)
+ goto release_idr;
+
+ return 0;
+
+release_idr:
+ p4tc_ext_idr_release_dec_num_elems(*e);
+
+ return err;
+}
+
+static void p4tc_extern_inst_destroy_elems(struct idr *insts_idr)
+{
+ struct p4tc_extern_inst *inst;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(insts_idr, inst, tmp, id) {
+ unsigned long tmp2, elem_id;
+ struct p4tc_extern *e;
+
+ idr_for_each_entry_ul(&inst->control_elems_idr, e,
+ tmp2, elem_id) {
+ p4tc_ext_idr_purge(e);
+ }
+ }
+}
+
+static void p4tc_user_pipe_ext_destroy_elems(struct idr *user_ext_idr)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(user_ext_idr, pipe_ext, tmp, id) {
+ if (p4tc_ext_has_construct(pipe_ext->tmpl_ext->ops))
+ continue;
+
+ p4tc_extern_inst_destroy_elems(&pipe_ext->e_inst_idr);
+ }
+}
+
+int p4tc_extern_inst_init_elems(struct p4tc_extern_inst *inst, u32 num_elems)
+{
+ int err = 0;
+ int i;
+
+ for (i = 0; i < num_elems; i++) {
+ struct p4tc_extern *e = NULL;
+
+ err = p4tc_ext_init_defval(&e, inst, i + 1, NULL);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(p4tc_extern_inst_init_elems);
+
+static int
+__p4tc_extern_insts_init_elems(struct idr *insts_idr)
+{
+ struct p4tc_extern_inst *inst;
+ unsigned long tmp, id;
+ int err = 0;
+
+ idr_for_each_entry_ul(insts_idr, inst, tmp, id) {
+ u32 max_num_elems = inst->max_num_elems;
+
+ err = p4tc_extern_inst_init_elems(inst, max_num_elems);
+ if (err < 0)
+ return err;
+ }
+
+ return 0;
+}
+
+/* Called before sealing the pipeline */
+int p4tc_extern_insts_init_elems(struct idr *user_ext_idr)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ unsigned long tmp, id;
+ int err;
+
+ idr_for_each_entry_ul(user_ext_idr, pipe_ext, tmp, id) {
+ /* We assume the module's construct callback creates the initial
+ * elements by itself, so we only initialise elements at sealing
+ * time when there is no construct callback.
+ */
+ if (p4tc_ext_has_construct(pipe_ext->tmpl_ext->ops))
+ continue;
+
+ err = __p4tc_extern_insts_init_elems(&pipe_ext->e_inst_idr);
+ if (err < 0)
+ goto destroy_ext_inst_elems;
+ }
+
+ return 0;
+
+destroy_ext_inst_elems:
+ p4tc_user_pipe_ext_destroy_elems(user_ext_idr);
+ return err;
+}
+
+static struct p4tc_extern *
+p4tc_extern_init_1(struct p4tc_pipeline *pipeline,
+ struct p4tc_extern_inst *inst,
+ struct nlattr *nla, u32 key, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ return p4tc_ext_init(pipeline->net, nla, inst, key,
+ flags, extack);
+}
+
+static int tce_get_fill(struct sk_buff *skb, struct p4tc_extern *externs[],
+ u32 portid, u32 seq, u16 flags, u32 pipeid, int cmd,
+ int ref, struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlmsghdr *nlh;
+ struct nlattr *nest;
+ struct p4tcmsg *t;
+
+ nlh = nlmsg_put(skb, portid, seq, cmd, sizeof(*t), flags);
+ if (!nlh)
+ goto out_nlmsg_trim;
+ t = nlmsg_data(nlh);
+ t->pipeid = pipeid;
+ t->obj = P4TC_OBJ_RUNTIME_EXTERN;
+
+ nest = nla_nest_start(skb, P4TC_ROOT);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ if (p4tc_extern_elem_dump(skb, externs, ref) < 0)
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, nest);
+
+ nlh->nlmsg_len = (unsigned char *)nlmsg_get_pos(skb) - b;
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int
+p4tc_extern_get_respond(struct net *net, u32 portid, struct nlmsghdr *n,
+ struct p4tc_extern *externs[], u32 pipeid,
+ size_t attr_size, struct netlink_ext_ack *extack)
+{
+ struct sk_buff *skb;
+
+ skb = alloc_skb(attr_size <= NLMSG_GOODSIZE ? NLMSG_GOODSIZE : attr_size,
+ GFP_KERNEL);
+ if (!skb)
+ return -ENOBUFS;
+ if (tce_get_fill(skb, externs, portid, n->nlmsg_seq, 0, pipeid,
+ RTM_P4TC_GET, 1, NULL) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill netlink attributes while adding TC extern");
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ return rtnl_unicast(skb, net, portid);
+}
+
+static struct p4tc_extern *
+p4tc_extern_get_1(struct p4tc_extern_inst *inst,
+ struct nlattr *nla, const char *kind, struct nlmsghdr *n,
+ u32 key, u32 portid, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern *e;
+ int err;
+
+ if (p4tc_ext_inst_has_rctrl(inst)) {
+ err = inst->ops->rctrl(n->nlmsg_type, inst,
+ (struct p4tc_extern_common **)&e,
+ NULL, &key, extack);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ return e;
+ }
+
+ if (__p4tc_ext_idr_search(inst, &e, key) == 0) {
+ err = -ENOENT;
+ NL_SET_ERR_MSG(extack, "TC extern with specified key not found");
+ goto err_out;
+ }
+
+ return e;
+
+err_out:
+ return ERR_PTR(err);
+}
+
+static int
+p4tc_extern_add_notify(struct net *net, struct nlmsghdr *n,
+ struct p4tc_extern *externs[], u32 portid, u32 pipeid,
+ size_t attr_size, struct netlink_ext_ack *extack)
+{
+ struct sk_buff *skb;
+
+ skb = alloc_skb(attr_size <= NLMSG_GOODSIZE ? NLMSG_GOODSIZE : attr_size,
+ GFP_KERNEL);
+ if (!skb)
+ return -ENOBUFS;
+
+ if (tce_get_fill(skb, externs, portid, n->nlmsg_seq, n->nlmsg_flags,
+ pipeid, n->nlmsg_type, 0, extack) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill netlink attributes while adding TC extern");
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+}
+
+static int p4tc_ext_get_key_param(struct p4tc_extern_inst *inst,
+ struct nlattr *nla,
+ struct idr *params_idr, u32 *key,
+ struct netlink_ext_ack *extack)
+{
+ int err = 0;
+
+ if (inst->is_scalar) {
+ if (nla) {
+ err = p4tc_ext_get_key_param_scalar(inst, nla, key,
+ extack);
+ if (err < 0)
+ return err;
+
+ if (*key != 1) {
+ NL_SET_ERR_MSG(extack,
+ "Key of scalar must be 1");
+ return -EINVAL;
+ }
+ } else {
+ *key = 1;
+ }
+ } else {
+ if (nla) {
+ err = p4tc_ext_get_nonscalar_key_param(params_idr, nla,
+ key, extack);
+ if (err < 0)
+ return -EINVAL;
+ }
+
+ if (!nla) {
+ NL_SET_ERR_MSG(extack, "Must specify extern key");
+ return -EINVAL;
+ }
+ }
+
+ return err;
+}
+
+static struct p4tc_extern *
+__p4tc_ctl_extern_1(struct p4tc_pipeline *pipeline,
+ struct nlattr *nla, struct nlmsghdr *n,
+ u32 portid, u32 flags, bool rctrl_allowed,
+ struct netlink_ext_ack *extack)
+{
+ const char *kind = p4tc_ext_get_kind(nla, extack);
+ struct nlattr *tb[P4TC_EXT_MAX + 1];
+ struct p4tc_extern_inst *inst;
+ struct nlattr *params_attr;
+ struct p4tc_extern *e;
+ char *instname;
+ u32 key;
+ int err;
+
+ err = nla_parse_nested(tb, P4TC_EXT_MAX, nla,
+ p4tc_extern_policy, extack);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ if (IS_ERR(kind))
+ return (struct p4tc_extern *)kind;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_EXT_INST_NAME)) {
+ NL_SET_ERR_MSG(extack,
+ "TC extern inst name must be specified");
+ return ERR_PTR(-EINVAL);
+ }
+ instname = nla_data(tb[P4TC_EXT_INST_NAME]);
+
+ err = -EINVAL;
+ inst = p4tc_ext_inst_find_bynames(pipeline->net, pipeline, kind,
+ instname, extack);
+ if (IS_ERR(inst))
+ return (struct p4tc_extern *)inst;
+
+ if (!rctrl_allowed && p4tc_ext_has_rctrl(inst->ops)) {
+ NL_SET_ERR_MSG(extack,
+ "Runtime message may only have one extern with rctrl op");
+ return ERR_PTR(-EINVAL);
+ }
+
+ err = p4tc_ext_get_key_param(inst, tb[P4TC_EXT_KEY],
+ &inst->params->params_idr, &key,
+ extack);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ params_attr = tb[P4TC_EXT_PARAMS];
+
+ switch (n->nlmsg_type) {
+ case RTM_P4TC_CREATE:
+ NL_SET_ERR_MSG(extack,
+ "Create command is not supported");
+ return ERR_PTR(-EOPNOTSUPP);
+ case RTM_P4TC_UPDATE: {
+ struct nla_bitfield32 userflags = { 0, 0 };
+
+ if (tb[P4TC_EXT_FLAGS])
+ userflags = nla_get_bitfield32(tb[P4TC_EXT_FLAGS]);
+
+ flags = userflags.value | flags;
+ e = p4tc_extern_init_1(pipeline, inst, params_attr, key,
+ flags, extack);
+ break;
+ }
+ case RTM_P4TC_DEL:
+ NL_SET_ERR_MSG(extack,
+ "Delete command is not supported");
+ return ERR_PTR(-EOPNOTSUPP);
+ case RTM_P4TC_GET: {
+ e = p4tc_extern_get_1(inst, params_attr, kind, n, key, portid,
+ extack);
+ break;
+ }
+ default:
+ NL_SET_ERR_MSG_FMT(extack, "Unknown extern command #%u",
+ n->nlmsg_type);
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ return e;
+}
+
+static int __p4tc_ctl_extern(struct p4tc_pipeline *pipeline,
+ struct nlattr *nla, struct nlmsghdr *n,
+ u32 portid, u32 flags,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern *externs[P4TC_MSGBATCH_SIZE] = {};
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ bool processed_rctrl_extern = false;
+ struct p4tc_extern *ext;
+ size_t attr_size = 0;
+ bool has_one_element;
+ int i, ret;
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL,
+ extack);
+ if (ret < 0)
+ return ret;
+
+ /* We only allow one element in the batch when the extern has an
+ * rctrl callback.
+ */
+ has_one_element = !tb[2];
+ ext = __p4tc_ctl_extern_1(pipeline, tb[1], n, portid,
+ flags, has_one_element, extack);
+ if (IS_ERR(ext))
+ return PTR_ERR(ext);
+
+ externs[0] = ext;
+ if (p4tc_ext_has_rctrl(ext->common.ops)) {
+ processed_rctrl_extern = true;
+ goto notify;
+ } else {
+ attr_size += ext->attrs_size;
+ }
+
+ for (i = 2; i <= P4TC_MSGBATCH_SIZE && tb[i]; i++) {
+ ext = __p4tc_ctl_extern_1(pipeline, tb[i], n, portid,
+ flags, false, extack);
+ if (IS_ERR(ext)) {
+ ret = PTR_ERR(ext);
+ goto err;
+ }
+
+ attr_size += ext->attrs_size;
+ /* Only externs whose modules don't implement the rctrl
+ * callback are added to the externs array.
+ */
+ externs[i - 1] = ext;
+ }
+
+notify:
+ attr_size = p4tc_extern_full_attrs_size(attr_size);
+
+ if (n->nlmsg_type == RTM_P4TC_UPDATE) {
+ int listeners = rtnl_has_listeners(pipeline->net, RTNLGRP_TC);
+ int echo = n->nlmsg_flags & NLM_F_ECHO;
+
+ if (!processed_rctrl_extern)
+ p4tc_ext_idr_insert_many(externs);
+
+ if (echo || listeners)
+ p4tc_extern_add_notify(pipeline->net, n, externs,
+ portid, pipeline->common.p_id,
+ attr_size, extack);
+ } else if (n->nlmsg_type == RTM_P4TC_GET) {
+ p4tc_extern_get_respond(pipeline->net, portid, n, externs,
+ pipeline->common.p_id, attr_size,
+ extack);
+ }
+
+ return 0;
+
+err:
+ if (n->nlmsg_type == RTM_P4TC_UPDATE)
+ p4tc_extern_destroy(externs);
+ else if (n->nlmsg_type == RTM_P4TC_GET)
+ p4tc_extern_put_many(externs);
+
+ return ret;
+}
+
+static int parse_dump_ext_attrs(struct nlattr *nla,
+ struct nlattr **tb2)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+
+ if (nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL,
+ NULL) < 0)
+ return -EINVAL;
+
+ if (!tb[1])
+ return -EINVAL;
+ if (nla_parse_nested(tb2, P4TC_EXT_MAX, tb[1],
+ p4tc_extern_policy, NULL) < 0)
+ return -EINVAL;
+
+ if (!tb2[P4TC_EXT_KIND])
+ return -EINVAL;
+
+ if (!tb2[P4TC_EXT_INST_NAME])
+ return -EINVAL;
+
+ return 0;
+}
+
+int p4tc_ctl_extern_dump(struct sk_buff *skb, struct netlink_callback *cb,
+ struct nlattr **tb, const char *pname)
+{
+ struct netlink_ext_ack *extack = cb->extack;
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *tb2[P4TC_EXT_MAX + 1];
+ struct net *net = sock_net(skb->sk);
+ struct nlattr *count_attr = NULL;
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_extern_inst *inst;
+ char *kind_str, *instname;
+ struct nla_bitfield32 bf;
+ struct nlmsghdr *nlh;
+ struct nlattr *nest;
+ u32 ext_count = 0;
+ struct p4tcmsg *t;
+ int ret = 0;
+
+ pipeline = p4tc_pipeline_find_byany(net, pname, 0, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (!pipeline_sealed(pipeline)) {
+ NL_SET_ERR_MSG(extack,
+ "Pipeline must be sealed for extern runtime ops");
+ return -EINVAL;
+ }
+
+ ret = parse_dump_ext_attrs(tb[P4TC_ROOT], tb2);
+ if (ret < 0)
+ return ret;
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb2, P4TC_EXT_KIND)) {
+ NL_SET_ERR_MSG(extack,
+ "TC extern kind name must be specified");
+ return -EINVAL;
+ }
+ kind_str = nla_data(tb2[P4TC_EXT_KIND]);
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb2, P4TC_EXT_INST_NAME)) {
+ NL_SET_ERR_MSG(extack,
+ "TC extern inst name must be specified");
+ return -EINVAL;
+ }
+ instname = nla_data(tb2[P4TC_EXT_INST_NAME]);
+
+ inst = p4tc_ext_inst_find_bynames(pipeline->net, pipeline, kind_str,
+ instname, extack);
+ if (IS_ERR(inst))
+ return PTR_ERR(inst);
+
+ cb->args[2] = 0;
+ if (tb[P4TC_ROOT_FLAGS]) {
+ bf = nla_get_bitfield32(tb[P4TC_ROOT_FLAGS]);
+ cb->args[2] = bf.value;
+ }
+
+ nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+ cb->nlh->nlmsg_type, sizeof(*t), 0);
+ if (!nlh)
+ goto err_out;
+
+ t = nlmsg_data(nlh);
+ t->pipeid = pipeline->common.p_id;
+ t->obj = P4TC_OBJ_RUNTIME_EXTERN;
+ count_attr = nla_reserve(skb, P4TC_ROOT_COUNT, sizeof(u32));
+ if (!count_attr)
+ goto err_out;
+
+ nest = nla_nest_start_noflag(skb, P4TC_ROOT);
+ if (!nest)
+ goto err_out;
+
+ ret = p4tc_ext_dump_walker(inst, skb, cb);
+ if (ret < 0)
+ goto err_out;
+
+ if (ret > 0) {
+ nla_nest_end(skb, nest);
+ ret = skb->len;
+ ext_count = cb->args[1];
+ memcpy(nla_data(count_attr), &ext_count, sizeof(u32));
+ cb->args[1] = 0;
+ } else {
+ nlmsg_trim(skb, b);
+ }
+
+ nlh->nlmsg_len = (unsigned char *)nlmsg_get_pos(skb) - b;
+ if (NETLINK_CB(cb->skb).portid && ret)
+ nlh->nlmsg_flags |= NLM_F_MULTI;
+ return skb->len;
+
+err_out:
+ nlmsg_trim(skb, b);
+ return skb->len;
+}
+
+int p4tc_ctl_extern(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_ROOT_MAX + 1];
+ struct net *net = sock_net(skb->sk);
+ u32 portid = NETLINK_CB(skb).portid;
+ struct p4tc_pipeline *pipeline;
+ struct nlattr *root;
+ char *pname = NULL;
+ u32 flags = 0;
+ int ret = 0;
+
+ if (cmd != RTM_P4TC_GET && !netlink_capable(skb, CAP_NET_ADMIN)) {
+ NL_SET_ERR_MSG(extack, "Need CAP_NET_ADMIN to do CRU ops");
+ return -EPERM;
+ }
+
+ ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+ p4tc_root_policy, extack);
+ if (ret < 0)
+ return ret;
+
+ if (tb[P4TC_ROOT_PNAME])
+ pname = nla_data(tb[P4TC_ROOT_PNAME]);
+
+ if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+ NL_SET_ERR_MSG(extack, "Netlink P4TC extern attributes missing");
+ return -EINVAL;
+ }
+
+ root = tb[P4TC_ROOT];
+
+ pipeline = p4tc_pipeline_find_byany(net, pname, 0, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (!pipeline_sealed(pipeline)) {
+ NL_SET_ERR_MSG(extack,
+ "Pipeline must be sealed for extern runtime ops");
+ return -EPERM;
+ }
+
+ return __p4tc_ctl_extern(pipeline, root, n, portid, flags, extack);
+}
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index 6c00747ac..c61115f78 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -27,6 +27,7 @@
#include <net/netlink.h>
#include <net/flow_offload.h>
#include <net/p4tc_types.h>
+#include <net/p4tc_ext_api.h>
static unsigned int pipeline_net_id;
static struct p4tc_pipeline *root_pipeline;
@@ -99,6 +100,7 @@ static void __net_exit pipeline_exit_net(struct net *net)
__p4tc_pipeline_put(pipeline, &pipeline->common, NULL);
}
idr_destroy(&pipe_net->pipeline_idr);
+
rtnl_unlock();
}
@@ -119,6 +121,7 @@ static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
{
idr_destroy(&pipeline->p_act_idr);
idr_destroy(&pipeline->p_tbl_idr);
+ idr_destroy(&pipeline->user_ext_idr);
kfree(pipeline);
}
@@ -141,7 +144,8 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
struct net *net = pipeline->net;
struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
struct net *pipeline_net = maybe_get_net(net);
- unsigned long iter_act_id, tmp;
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ unsigned long iter_act_id, ext_id, tmp;
struct p4tc_table *table;
struct p4tc_act *act;
unsigned long tbl_id;
@@ -152,6 +156,20 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, iter_act_id)
act->common.ops->put(pipeline, &act->common, extack);
+ idr_for_each_entry_ul(&pipeline->user_ext_idr, pipe_ext, tmp, ext_id) {
+ unsigned long tmp_in, inst_id;
+ struct p4tc_extern_inst *inst;
+
+ idr_for_each_entry_ul(&pipe_ext->e_inst_idr, inst, tmp_in,
+ inst_id) {
+ struct p4tc_template_common *common = &inst->common;
+
+ common->ops->put(pipeline, common, extack);
+ }
+
+ pipe_ext->free(pipe_ext, &pipeline->user_ext_idr);
+ }
+
/* If we are on netns cleanup we can't touch the pipeline_idr.
* On pre_exit we will destroy the idr but never call into teardown
* if filters are active which makes pipeline pointers dangle until
@@ -210,9 +228,18 @@ static int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
if (ret < 0)
return ret;
+ ret = p4tc_extern_insts_init_elems(&pipeline->user_ext_idr);
+ if (ret < 0)
+ goto unset_table_state_ready;
+
pipeline->p_state = P4TC_STATE_READY;
return true;
+
+unset_table_state_ready:
+ p4tc_table_put_mask_array(pipeline);
+
+ return ret;
}
struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net, const u32 pipeid)
@@ -308,6 +335,9 @@ static struct p4tc_pipeline *p4tc_pipeline_create(struct net *net,
idr_init(&pipeline->p_tbl_idr);
pipeline->curr_tables = 0;
+ idr_init(&pipeline->p_tbl_idr);
+
+ idr_init(&pipeline->user_ext_idr);
pipeline->num_created_acts = 0;
@@ -645,6 +675,8 @@ static void __p4tc_pipeline_init(void)
strscpy(root_pipeline->common.name, "kernel", PIPELINENAMSIZ);
+ idr_init(&root_pipeline->p_ext_idr);
+
root_pipeline->common.ops =
(struct p4tc_template_ops *)&p4tc_pipeline_ops;
diff --git a/net/sched/p4tc/p4tc_runtime_api.c b/net/sched/p4tc/p4tc_runtime_api.c
index bcb280909..f085d1b2a 100644
--- a/net/sched/p4tc/p4tc_runtime_api.c
+++ b/net/sched/p4tc/p4tc_runtime_api.c
@@ -27,16 +27,17 @@
#include <net/p4tc.h>
#include <net/netlink.h>
#include <net/flow_offload.h>
+#include <net/p4tc_ext_api.h>
static int tc_ctl_p4_root(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
struct netlink_ext_ack *extack)
{
struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+ int ret;
switch (t->obj) {
case P4TC_OBJ_RUNTIME_TABLE: {
struct net *net = sock_net(skb->sk);
- int ret;
net = maybe_get_net(net);
if (!net) {
@@ -50,6 +51,11 @@ static int tc_ctl_p4_root(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
return ret;
}
+ case P4TC_OBJ_RUNTIME_EXTERN:
+ rtnl_lock();
+ ret = p4tc_ctl_extern(skb, n, cmd, extack);
+ rtnl_unlock();
+ return ret;
default:
NL_SET_ERR_MSG(extack, "Unknown P4 runtime object type");
return -EOPNOTSUPP;
@@ -120,6 +126,8 @@ static int tc_ctl_p4_dump(struct sk_buff *skb, struct netlink_callback *cb)
case P4TC_OBJ_RUNTIME_TABLE:
return p4tc_tbl_entry_dumpit(sock_net(skb->sk), skb, cb,
tb[P4TC_ROOT], p_name);
+ case P4TC_OBJ_RUNTIME_EXTERN:
+ return p4tc_ctl_extern_dump(skb, cb, tb, p_name);
default:
NL_SET_ERR_MSG_FMT(cb->extack,
"Unknown p4 runtime object type %u\n",
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index 7d79b01e5..534b972b9 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -114,6 +114,10 @@ static const struct nla_policy p4tc_table_policy[P4TC_TABLE_MAX + 1] = {
[P4TC_TABLE_DEFAULT_MISS] = { .type = NLA_NESTED },
[P4TC_TABLE_ACTS_LIST] = { .type = NLA_NESTED },
[P4TC_TABLE_CONST_ENTRY] = { .type = NLA_NESTED },
+ [P4TC_TABLE_COUNTER] = {
+ .type = NLA_STRING,
+ .len = EXTERNINSTNAMSIZ * 2 + 1
+ },
};
static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
@@ -146,6 +150,12 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
parm.tbl_aging = table->tbl_aging;
parm.tbl_num_entries = atomic_read(&table->tbl_nelems);
+ if (table->tbl_counter) {
+ if (nla_put_string(skb, P4TC_TABLE_COUNTER,
+ table->tbl_counter->common.name) < 0)
+ goto out_nlmsg_trim;
+ }
+
tbl_perm = rcu_dereference_rtnl(table->tbl_permissions);
parm.tbl_permissions = tbl_perm->permissions;
@@ -914,7 +924,9 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
{
struct rhashtable_params table_hlt_params = entry_hlt_params;
struct p4tc_table_default_act_params def_params = {0};
+ struct p4tc_user_pipeline_extern *pipe_ext = NULL;
struct p4tc_table_perm *tbl_init_perms = NULL;
+ struct p4tc_extern_inst *inst = NULL;
struct p4tc_table_parm *parm;
struct p4tc_table *table;
char *tblname;
@@ -1077,13 +1089,25 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
refcount_set(&table->tbl_ctrl_ref, 1);
+ if (tb[P4TC_TABLE_COUNTER]) {
+ const char *ext_inst_path = nla_data(tb[P4TC_TABLE_COUNTER]);
+
+ inst = p4tc_ext_inst_table_bind(pipeline, &pipe_ext,
+ ext_inst_path, extack);
+ if (IS_ERR(inst)) {
+ ret = PTR_ERR(inst);
+ goto free_permissions;
+ }
+ table->tbl_counter = inst;
+ }
+
if (tbl_id) {
table->tbl_id = tbl_id;
ret = idr_alloc_u32(&pipeline->p_tbl_idr, table, &table->tbl_id,
table->tbl_id, GFP_KERNEL);
if (ret < 0) {
NL_SET_ERR_MSG(extack, "Unable to allocate table id");
- goto free_permissions;
+ goto put_inst;
}
} else {
table->tbl_id = 1;
@@ -1091,7 +1115,7 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
UINT_MAX, GFP_KERNEL);
if (ret < 0) {
NL_SET_ERR_MSG(extack, "Unable to allocate table id");
- goto free_permissions;
+ goto put_inst;
}
}
@@ -1169,11 +1193,15 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
idr_rm:
idr_remove(&pipeline->p_tbl_idr, table->tbl_id);
+ p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+
+put_inst:
+ if (inst)
+ p4tc_ext_inst_table_unbind(table, pipe_ext, inst);
+
free_permissions:
kfree(tbl_init_perms);
- p4tc_table_acts_list_destroy(&table->tbl_acts_list);
-
free:
kfree(table);
@@ -1189,7 +1217,9 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
{
u32 tbl_max_masks = 0, tbl_max_entries = 0, tbl_keysz = 0;
struct p4tc_table_default_act_params def_params = {0};
+ struct p4tc_user_pipeline_extern *pipe_ext = NULL;
struct list_head *tbl_acts_list = NULL;
+ struct p4tc_extern_inst *inst = NULL;
struct p4tc_table_perm *perm = NULL;
struct p4tc_table_parm *parm = NULL;
struct p4tc_table *table;
@@ -1315,6 +1345,17 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
}
}
+ if (tb[P4TC_TABLE_COUNTER]) {
+ const char *ext_inst_path = nla_data(tb[P4TC_TABLE_COUNTER]);
+
+ inst = p4tc_ext_inst_table_bind(pipeline, &pipe_ext,
+ ext_inst_path, extack);
+ if (IS_ERR(inst)) {
+ ret = PTR_ERR(inst);
+ goto free_perm;
+ }
+ }
+
if (tb[P4TC_TABLE_CONST_ENTRY]) {
struct p4tc_table_entry *entry;
@@ -1324,7 +1365,7 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
pipeline, table, extack);
if (IS_ERR(entry)) {
ret = PTR_ERR(entry);
- goto free_perm;
+ goto put_inst;
}
table->tbl_const_entry = entry;
@@ -1342,6 +1383,8 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
table->tbl_type = tbl_type;
if (tbl_aging)
table->tbl_aging = tbl_aging;
+ if (inst)
+ table->tbl_counter = inst;
if (tbl_acts_list)
p4tc_table_acts_list_replace(&table->tbl_acts_list,
@@ -1349,6 +1392,10 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
return table;
+put_inst:
+ if (inst)
+ p4tc_ext_inst_table_unbind(table, pipe_ext, inst);
+
free_perm:
kfree(perm);
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
index c6953199d..f90b6c66f 100644
--- a/net/sched/p4tc/p4tc_tbl_entry.c
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -27,6 +27,7 @@
#include <net/p4tc.h>
#include <net/netlink.h>
#include <net/flow_offload.h>
+#include <net/p4tc_ext_api.h>
#define SIZEOF_MASKID (sizeof(((struct p4tc_table_entry_key *)0)->maskid))
@@ -324,6 +325,7 @@ int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
struct p4tc_table_entry_tm dtm, *tm;
struct nlattr *nest, *nest_acts;
u32 ids[P4TC_ENTRY_MAX_IDS];
+ struct nlattr *nest_counter;
int ret = -ENOMEM;
ids[P4TC_TBLID_IDX - 1] = tbl_id;
@@ -385,6 +387,11 @@ int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
P4TC_ENTRY_PAD))
goto out_nlmsg_trim;
}
+ if (value->counter) {
+ nest_counter = nla_nest_start(skb, P4TC_ENTRY_COUNTER);
+ p4tc_ext_elem_dump_1(skb, value->counter);
+ nla_nest_end(skb, nest_counter);
+ }
nla_nest_end(skb, nest);
@@ -1568,11 +1575,20 @@ __must_hold(RCU)
goto free_work;
}
+ if (table->tbl_counter) {
+ value->counter = p4tc_ext_elem_get(table->tbl_counter);
+ if (!value->counter) {
+ atomic_dec(&table->tbl_nelems);
+ ret = -ENOENT;
+ goto free_work;
+ }
+ }
+
if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
entry_hlt_params) < 0) {
atomic_dec(&table->tbl_nelems);
ret = -EBUSY;
- goto free_work;
+ goto put_ext;
}
if (value->is_dyn) {
@@ -1592,6 +1608,10 @@ __must_hold(RCU)
return 0;
+put_ext:
+ if (table->tbl_counter && value->counter)
+ p4tc_ext_elem_put_list(table->tbl_counter, value->counter);
+
free_work:
kfree(entry_work);
@@ -1820,6 +1840,9 @@ __must_hold(RCU)
HRTIMER_MODE_REL);
}
+ if (value_old->counter)
+ value->counter = value_old->counter;
+
INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index a3b3b1430..55a738d1b 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -44,6 +44,8 @@ static bool obj_is_valid(u32 obj)
case P4TC_OBJ_PIPELINE:
case P4TC_OBJ_ACT:
case P4TC_OBJ_TABLE:
+ case P4TC_OBJ_EXT:
+ case P4TC_OBJ_EXT_INST:
return true;
default:
return false;
@@ -54,6 +56,8 @@ static const struct p4tc_template_ops *p4tc_ops[P4TC_OBJ_MAX + 1] = {
[P4TC_OBJ_PIPELINE] = &p4tc_pipeline_ops,
[P4TC_OBJ_ACT] = &p4tc_act_ops,
[P4TC_OBJ_TABLE] = &p4tc_table_ops,
+ [P4TC_OBJ_EXT] = &p4tc_tmpl_ext_ops,
+ [P4TC_OBJ_EXT_INST] = &p4tc_ext_inst_ops,
};
int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
diff --git a/net/sched/p4tc/p4tc_tmpl_ext.c b/net/sched/p4tc/p4tc_tmpl_ext.c
new file mode 100644
index 000000000..ec3efbc68
--- /dev/null
+++ b/net/sched/p4tc/p4tc_tmpl_ext.c
@@ -0,0 +1,2221 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_tmpl_ext.c P4 TC EXTERN TEMPLATE
+ *
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
+ * Victor Nogueira <victor@mojatatu.com>
+ * Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/p4tc_types.h>
+#include <net/sock.h>
+#include <net/p4tc_ext_api.h>
+
+static LIST_HEAD(ext_base);
+static DEFINE_RWLOCK(ext_mod_lock);
+
+static const struct nla_policy tc_extern_inst_policy[P4TC_TMPL_EXT_INST_MAX + 1] = {
+ [P4TC_TMPL_EXT_INST_EXT_NAME] = {
+ .type = NLA_STRING,
+ .len = EXTERNNAMSIZ
+ },
+ [P4TC_TMPL_EXT_INST_NAME] = {
+ .type = NLA_STRING,
+ .len = EXTERNINSTNAMSIZ
+ },
+ [P4TC_TMPL_EXT_INST_NUM_ELEMS] = NLA_POLICY_RANGE(NLA_U32, 1,
+ P4TC_MAX_NUM_EXT_INST_ELEMS),
+ [P4TC_TMPL_EXT_INST_CONTROL_PARAMS] = { .type = NLA_NESTED },
+ [P4TC_TMPL_EXT_INST_TABLE_BINDABLE] = { .type = NLA_U8 },
+ [P4TC_TMPL_EXT_INST_CONSTR_PARAMS] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy tc_extern_policy[P4TC_TMPL_EXT_MAX + 1] = {
+ [P4TC_TMPL_EXT_NAME] = { .type = NLA_STRING, .len = EXTERNNAMSIZ },
+ [P4TC_TMPL_EXT_NUM_INSTS] = NLA_POLICY_RANGE(NLA_U16, 1,
+ P4TC_MAX_NUM_EXT_INSTS),
+ [P4TC_TMPL_EXT_HAS_EXEC_METHOD] = NLA_POLICY_RANGE(NLA_U8, 1, 1),
+};
+
+static const struct nla_policy p4tc_extern_params_policy[P4TC_EXT_PARAMS_MAX + 1] = {
+ [P4TC_EXT_PARAMS_NAME] = { .type = NLA_STRING, .len = EXTPARAMNAMSIZ },
+ [P4TC_EXT_PARAMS_ID] = { .type = NLA_U32 },
+ [P4TC_EXT_PARAMS_VALUE] = { .type = NLA_NESTED },
+ [P4TC_EXT_PARAMS_TYPE] = { .type = NLA_U32 },
+ [P4TC_EXT_PARAMS_BITSZ] = { .type = NLA_U16 },
+ [P4TC_EXT_PARAMS_FLAGS] = { .type = NLA_U8 },
+};
+
+static void p4tc_extern_ops_put(const struct p4tc_extern_ops *ops)
+{
+ if (ops)
+ module_put(ops->owner);
+}
+
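+/* Extern modules must register construct and deconstruct as a pair; if a
+ * module provides rctrl or dump, it must provide all four callbacks
+ * (construct, deconstruct, rctrl and dump).
+ */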
+static bool
+p4tc_extern_mod_callbacks_check(const struct p4tc_extern_ops *ext)
+{
+ if ((ext->construct || ext->deconstruct) && !(ext->rctrl || ext->dump))
+ return (ext->construct && ext->deconstruct);
+
+ if (ext->rctrl || ext->dump)
+ return (ext->construct && ext->deconstruct && ext->rctrl &&
+ ext->dump);
+
+ return true;
+}
+
+static struct p4tc_extern_ops *p4tc_extern_lookup_n(char *kind)
+{
+ struct p4tc_extern_ops *a = NULL;
+
+ read_lock(&ext_mod_lock);
+ list_for_each_entry(a, &ext_base, head) {
+ if (strcmp(kind, a->kind) == 0) {
+ read_unlock(&ext_mod_lock);
+ return a;
+ }
+ }
+ read_unlock(&ext_mod_lock);
+
+ return NULL;
+}
+
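+/* Extern kinds are backed by modules named "ext_<kind>"; e.g. a hypothetical
+ * "Counter" extern would be provided by a module called "ext_Counter", which
+ * p4tc_extern_ops_load() will request_module() if it is not yet registered.
+ */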
+static int
+p4tc_extern_mod_name(char *mod_name, char *kind)
+{
+ int nbytes;
+
+ nbytes = snprintf(mod_name, EXTERNNAMSIZ, "ext_%s", kind);
+ /* Extern name was too long */
+ if (nbytes == EXTERNNAMSIZ)
+ return -E2BIG;
+
+ return 0;
+}
+
+static struct p4tc_extern_ops *p4tc_extern_ops_load(char *kind)
+{
+ struct p4tc_extern_ops *ops = NULL;
+ char mod_name[EXTERNNAMSIZ] = {0};
+ int err;
+
+ if (!kind)
+ return NULL;
+
+ err = p4tc_extern_mod_name(mod_name, kind);
+ if (err < 0)
+ return NULL;
+
+ ops = p4tc_extern_lookup_n(mod_name);
+ if (ops && try_module_get(ops->owner))
+ return ops;
+
+ if (!ops) {
+ rtnl_unlock();
+ request_module(mod_name);
+ rtnl_lock();
+
+ ops = p4tc_extern_lookup_n(mod_name);
+ if (ops) {
+ if (try_module_get(ops->owner))
+ return ops;
+
+ return NULL;
+ }
+ }
+
+ return ops;
+}
+
+static void p4tc_extern_put_param(struct p4tc_extern_param *param)
+{
+ if (param->mask_shift)
+ p4t_release(param->mask_shift);
+ if (param->value)
+ p4tc_ext_param_value_free_tmpl(param);
+ kfree(param);
+}
+
+static void p4tc_extern_put_param_idr(struct idr *params_idr,
+ struct p4tc_extern_param *param)
+{
+ idr_remove(params_idr, param->id);
+ p4tc_extern_put_param(param);
+}
+
+static void
+p4tc_user_pipeline_ext_put_ref(struct p4tc_user_pipeline_extern *pipe_ext)
+{
+ refcount_dec(&pipe_ext->ext_ref);
+}
+
+static void
+p4tc_user_pipeline_ext_free(struct p4tc_user_pipeline_extern *pipe_ext,
+ struct idr *tmpl_exts_idr)
+{
+ idr_remove(tmpl_exts_idr, pipe_ext->ext_id);
+ idr_destroy(&pipe_ext->e_inst_idr);
+ refcount_dec(&pipe_ext->tmpl_ext->tmpl_ref);
+ kfree(pipe_ext);
+}
+
+static void
+p4tc_user_pipeline_ext_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_user_pipeline_extern *pipe_ext,
+ bool release, struct idr *tmpl_exts_idr)
+{
+ if (refcount_dec_and_test(&pipe_ext->ext_ref) && release)
+ p4tc_user_pipeline_ext_free(pipe_ext, tmpl_exts_idr);
+}
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_find_byid(struct p4tc_pipeline *pipeline,
+ const u32 ext_id)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+
+ pipe_ext = idr_find(&pipeline->user_ext_idr, ext_id);
+
+ return pipe_ext;
+}
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_get(struct p4tc_pipeline *pipeline, const u32 ext_id)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+
+ pipe_ext = p4tc_user_pipeline_ext_find_byid(pipeline, ext_id);
+ if (!pipe_ext)
+ return ERR_PTR(-ENOENT);
+
+ refcount_inc(&pipe_ext->ext_ref);
+
+ return pipe_ext;
+}
+
+void p4tc_ext_inst_purge(struct p4tc_extern_inst *inst)
+{
+ p4tc_ext_purge(&inst->control_elems_idr);
+}
+EXPORT_SYMBOL_GPL(p4tc_ext_inst_purge);
+
+static void ___p4tc_ext_inst_put(struct p4tc_extern_inst *inst, bool put_params)
+{
+ if (p4tc_ext_inst_has_construct(inst)) {
+ inst->ops->deconstruct(inst);
+ } else {
+ if (inst->params && put_params)
+ p4tc_ext_params_free(inst->params, true);
+
+ p4tc_ext_inst_purge(inst);
+ kfree(inst);
+ }
+}
+
+static int __p4tc_ext_inst_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_extern_inst *inst, bool teardown,
+ bool release, struct netlink_ext_ack *extack)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext = inst->pipe_ext;
+ const u32 inst_id = inst->ext_inst_id;
+
+ if (!teardown && !refcount_dec_if_one(&inst->inst_ref)) {
+ NL_SET_ERR_MSG(extack,
+ "Can't delete referenced extern instance template");
+ return -EBUSY;
+ }
+
+ ___p4tc_ext_inst_put(inst, true);
+
+ idr_remove(&pipe_ext->e_inst_idr, inst_id);
+
+ p4tc_user_pipeline_ext_put(pipeline, pipe_ext, release,
+ &pipeline->user_ext_idr);
+
+ return 0;
+}
+
+static int _p4tc_tmpl_ext_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_tmpl_extern *ext, bool teardown,
+ struct netlink_ext_ack *extack)
+{
+ if (!teardown && !refcount_dec_if_one(&ext->tmpl_ref)) {
+ NL_SET_ERR_MSG(extack,
+ "Can't delete referenced extern template");
+ return -EBUSY;
+ }
+
+ idr_remove(&pipeline->p_ext_idr, ext->ext_id);
+ p4tc_extern_ops_put(ext->ops);
+
+ kfree(ext);
+
+ return 0;
+}
+
+static int p4tc_tmpl_ext_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_tmpl_extern *ext;
+
+ ext = to_extern(tmpl);
+
+ return _p4tc_tmpl_ext_put(pipeline, ext, true, extack);
+}
+
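+/* Allocate an extern instance. When the extern module sets ops->size, that
+ * many bytes are allocated instead of sizeof(struct p4tc_extern_inst),
+ * allowing the module to embed the instance inside a larger private struct.
+ */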
+struct p4tc_extern_inst *
+p4tc_ext_inst_alloc(const struct p4tc_extern_ops *ops, const u32 max_num_elems,
+ bool tbl_bindable, char *ext_name)
+{
+ struct p4tc_extern_inst *inst;
+ const u32 inst_size = (ops && ops->size) ? ops->size : sizeof(*inst);
+
+ inst = kzalloc(inst_size, GFP_KERNEL);
+ if (!inst)
+ return ERR_PTR(-ENOMEM);
+
+ inst->ops = ops;
+ inst->max_num_elems = max_num_elems;
+ refcount_set(&inst->inst_ref, 1);
+ INIT_LIST_HEAD(&inst->unused_elems);
+ spin_lock_init(&inst->available_list_lock);
+ atomic_set(&inst->curr_num_elems, 0);
+ idr_init(&inst->control_elems_idr);
+ inst->ext_name = ext_name;
+ inst->tbl_bindable = tbl_bindable;
+
+ inst->common.ops = (typeof(inst->common.ops))&p4tc_ext_inst_ops;
+
+ return inst;
+}
+EXPORT_SYMBOL(p4tc_ext_inst_alloc);
+
+static int p4tc_ext_inst_put(struct p4tc_pipeline *pipeline,
+ struct p4tc_template_common *tmpl,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_inst *inst;
+
+ inst = to_extern_inst(tmpl);
+
+ return __p4tc_ext_inst_put(pipeline, inst, true, false, extack);
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_find_byname(struct p4tc_user_pipeline_extern *pipe_ext,
+ const char *instname)
+{
+ struct p4tc_extern_inst *ext_inst;
+ unsigned long tmp, inst_id;
+
+ idr_for_each_entry_ul(&pipe_ext->e_inst_idr, ext_inst, tmp, inst_id) {
+ if (strncmp(ext_inst->common.name, instname,
+ EXTERNINSTNAMSIZ) == 0)
+ return ext_inst;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_find_byid(struct p4tc_user_pipeline_extern *pipe_ext,
+ const u32 inst_id)
+{
+ struct p4tc_extern_inst *ext_inst;
+
+ ext_inst = idr_find(&pipe_ext->e_inst_idr, inst_id);
+
+ return ext_inst;
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_find_byany(struct p4tc_user_pipeline_extern *pipe_ext,
+ const char *instname, u32 instid,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_inst *inst;
+ int err;
+
+ if (instid) {
+ inst = p4tc_ext_inst_find_byid(pipe_ext, instid);
+ if (!inst) {
+ NL_SET_ERR_MSG(extack, "Unable to find instance by id");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (instname) {
+ inst = p4tc_ext_inst_find_byname(pipe_ext, instname);
+ if (!inst) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Instance name not found %s\n",
+ instname);
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify instance name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return inst;
+
+out:
+ return ERR_PTR(err);
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_get(struct p4tc_user_pipeline_extern *pipe_ext,
+ const char *instname, const u32 ext_inst_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_inst *ext_inst;
+
+ ext_inst = p4tc_ext_inst_find_byany(pipe_ext, instname, ext_inst_id,
+ extack);
+ if (IS_ERR(ext_inst))
+ return ext_inst;
+
+ /* Extern instance template was deleted in parallel */
+ if (!refcount_inc_not_zero(&ext_inst->inst_ref))
+ return ERR_PTR(-EBUSY);
+
+ return ext_inst;
+}
+
+static void p4tc_ext_inst_put_ref(struct p4tc_extern_inst *inst)
+{
+ refcount_dec(&inst->inst_ref);
+}
+
+static struct p4tc_tmpl_extern *
+p4tc_tmpl_ext_find_name(struct p4tc_pipeline *pipeline, const char *extern_name)
+{
+ struct p4tc_tmpl_extern *ext;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(&pipeline->p_ext_idr, ext, tmp, id)
+ if (ext->common.name[0] &&
+ strncmp(ext->common.name, extern_name,
+ EXTERNNAMSIZ) == 0)
+ return ext;
+
+ return NULL;
+}
+
+static struct p4tc_tmpl_extern *
+p4tc_tmpl_ext_find_byid(struct p4tc_pipeline *pipeline, const u32 ext_id)
+{
+ return idr_find(&pipeline->p_ext_idr, ext_id);
+}
+
+static struct p4tc_tmpl_extern *
+p4tc_tmpl_ext_find_byany(struct p4tc_pipeline *pipeline,
+ const char *extern_name, u32 ext_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_tmpl_extern *ext;
+ int err;
+
+ if (ext_id) {
+ ext = p4tc_tmpl_ext_find_byid(pipeline, ext_id);
+ if (!ext) {
+ NL_SET_ERR_MSG(extack, "Unable to find ext by id");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (extern_name) {
+ ext = p4tc_tmpl_ext_find_name(pipeline, extern_name);
+ if (!ext) {
+ NL_SET_ERR_MSG(extack,
+ "Extern name not found");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify ext name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return ext;
+
+out:
+ return ERR_PTR(err);
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_find_byanyattr(struct p4tc_user_pipeline_extern *pipe_ext,
+ struct nlattr *name_attr, u32 instid,
+ struct netlink_ext_ack *extack)
+{
+ char *instname = NULL;
+
+ if (name_attr)
+ instname = nla_data(name_attr);
+
+ return p4tc_ext_inst_find_byany(pipe_ext, instname, instid,
+ extack);
+}
+
+static struct p4tc_extern_param *
+p4tc_ext_param_find_byname(struct idr *params_idr, const char *param_name)
+{
+ struct p4tc_extern_param *param;
+ unsigned long tmp, id;
+
+ idr_for_each_entry_ul(params_idr, param, tmp, id) {
+ if (param == ERR_PTR(-EBUSY))
+ continue;
+ if (strncmp(param->name, param_name, EXTPARAMNAMSIZ) == 0)
+ return param;
+ }
+
+ return NULL;
+}
+
+struct p4tc_extern_param *
+p4tc_ext_param_find_byid(struct idr *params_idr, const u32 param_id)
+{
+ return idr_find(params_idr, param_id);
+}
+EXPORT_SYMBOL(p4tc_ext_param_find_byid);
+
+static struct p4tc_extern_param *
+p4tc_ext_param_find_byany(struct idr *params_idr, const char *param_name,
+ const u32 param_id, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_param *param;
+ int err;
+
+ if (param_id) {
+ param = p4tc_ext_param_find_byid(params_idr, param_id);
+ if (!param) {
+ NL_SET_ERR_MSG(extack, "Unable to find param by id");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (param_name) {
+ param = p4tc_ext_param_find_byname(params_idr,
+ param_name);
+ if (!param) {
+ NL_SET_ERR_MSG(extack, "Param name not found");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return param;
+
+out:
+ return ERR_PTR(err);
+}
+
+struct p4tc_extern_param *
+p4tc_ext_param_find_byanyattr(struct idr *params_idr,
+ struct nlattr *name_attr,
+ const u32 param_id,
+ struct netlink_ext_ack *extack)
+{
+ char *param_name = NULL;
+
+ if (name_attr)
+ param_name = nla_data(name_attr);
+
+ return p4tc_ext_param_find_byany(params_idr, param_name, param_id,
+ extack);
+}
+
+static struct p4tc_extern_param *
+p4tc_extern_create_param(struct idr *params_idr, struct nlattr **tb,
+ u32 param_id, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_param *param;
+ u8 *flags = NULL;
+ char *name;
+ int ret;
+
+ if (tb[P4TC_EXT_PARAMS_NAME]) {
+ name = nla_data(tb[P4TC_EXT_PARAMS_NAME]);
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param name");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ param = kzalloc(sizeof(*param), GFP_KERNEL);
+ if (!param) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if ((param_id && p4tc_ext_param_find_byid(params_idr, param_id)) ||
+ p4tc_ext_param_find_byname(params_idr, name)) {
+ NL_SET_ERR_MSG_FMT(extack, "Param already exists %s", name);
+ ret = -EEXIST;
+ goto free;
+ }
+
+ if ((tb[P4TC_EXT_PARAMS_TYPE] && !tb[P4TC_EXT_PARAMS_BITSZ]) ||
+ (!tb[P4TC_EXT_PARAMS_TYPE] && tb[P4TC_EXT_PARAMS_BITSZ])) {
+ NL_SET_ERR_MSG(extack, "Must specify type with bit size");
+ ret = -EINVAL;
+ goto free;
+ }
+
+ if (tb[P4TC_EXT_PARAMS_TYPE]) {
+ struct p4tc_type_mask_shift *mask_shift = NULL;
+ struct p4tc_type *type;
+ u32 typeid;
+ u16 bitsz;
+
+ typeid = nla_get_u32(tb[P4TC_EXT_PARAMS_TYPE]);
+ bitsz = nla_get_u16(tb[P4TC_EXT_PARAMS_BITSZ]);
+
+ type = p4type_find_byid(typeid);
+ if (!type) {
+ NL_SET_ERR_MSG(extack, "Param type is invalid");
+ ret = -EINVAL;
+ goto free;
+ }
+ param->type = type;
+ if (bitsz > param->type->bitsz) {
+ NL_SET_ERR_MSG(extack, "Bit size is bigger than type");
+ ret = -EINVAL;
+ goto free;
+ }
+ if (type->ops->create_bitops) {
+ mask_shift = type->ops->create_bitops(bitsz, 0,
+ bitsz - 1,
+ extack);
+ if (IS_ERR(mask_shift)) {
+ ret = PTR_ERR(mask_shift);
+ goto free;
+ }
+ }
+ param->mask_shift = mask_shift;
+ param->bitsz = bitsz;
+ } else {
+ NL_SET_ERR_MSG(extack, "Must specify param type");
+ ret = -EINVAL;
+ goto free;
+ }
+
+ if (tb[P4TC_EXT_PARAMS_FLAGS]) {
+ flags = nla_data(tb[P4TC_EXT_PARAMS_FLAGS]);
+ param->flags = *flags;
+ }
+
+ if (flags && *flags & P4TC_EXT_PARAMS_FLAG_ISKEY) {
+ switch (param->type->typeid) {
+ case P4T_U8:
+ case P4T_U16:
+ case P4T_U32:
+ break;
+ default: {
+ NL_SET_ERR_MSG(extack,
+ "Key must be an unsigned integer");
+ ret = -EINVAL;
+ goto free_mask_shift;
+ }
+ }
+ }
+
+ if (param_id) {
+ ret = idr_alloc_u32(params_idr, param, ¶m_id,
+ param_id, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+ goto free_mask_shift;
+ }
+ param->id = param_id;
+ } else {
+ param->id = 1;
+
+ ret = idr_alloc_u32(params_idr, param, ¶m->id,
+ UINT_MAX, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+ goto free_mask_shift;
+ }
+ }
+
+ strscpy(param->name, name, EXTPARAMNAMSIZ);
+
+ return param;
+
+free_mask_shift:
+ if (param->mask_shift)
+ p4t_release(param->mask_shift);
+
+free:
+ kfree(param);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_extern_param *
+p4tc_extern_create_param_value(struct net *net, struct idr *params_idr,
+ struct nlattr **tb, u32 param_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_param *param;
+ int err;
+
+ param = p4tc_extern_create_param(params_idr, tb, param_id, extack);
+ if (IS_ERR(param))
+ return param;
+
+ err = p4tc_ext_param_value_init(net, param, tb, param->type->typeid,
+ false, extack);
+ if (err < 0) {
+ p4tc_extern_put_param_idr(params_idr, param);
+ return ERR_PTR(err);
+ }
+
+ return param;
+}
+
+static struct p4tc_extern_param *
+p4tc_extern_init_param_value(struct net *net, struct idr *params_idr,
+ struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_EXT_PARAMS_MAX + 1];
+ u32 param_id = 0;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_EXT_PARAMS_MAX, nla,
+ p4tc_extern_params_policy, extack);
+ if (ret < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (tb[P4TC_EXT_PARAMS_ID])
+ param_id = nla_get_u32(tb[P4TC_EXT_PARAMS_ID]);
+
+ return p4tc_extern_create_param_value(net, params_idr, tb,
+ param_id, extack);
+
+out:
+ return ERR_PTR(ret);
+}
+
+static bool
+p4tc_extern_params_check_flags(struct p4tc_extern_param *param,
+ struct netlink_ext_ack *extack)
+{
+ if (param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY &&
+ param->flags & P4TC_EXT_PARAMS_FLAG_IS_DATASCALAR) {
+ NL_SET_ERR_MSG(extack,
+ "Can't set key and data scalar flags at the same time");
+ return false;
+ }
+
+ return true;
+}
+
+static struct p4tc_extern_params *
+p4tc_extern_init_params_value(struct net *net,
+ struct p4tc_extern_params *params,
+ struct nlattr **tb,
+ bool *is_scalar, bool tbl_bindable,
+ struct netlink_ext_ack *extack)
+{
+ bool has_scalar_param = false;
+ bool has_key_param = false;
+ int ret;
+ int i;
+
+ for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+ struct p4tc_extern_param *param;
+
+ param = p4tc_extern_init_param_value(net, ¶ms->params_idr,
+ tb[i], extack);
+ if (IS_ERR(param)) {
+ ret = PTR_ERR(param);
+ goto params_del;
+ }
+
+ if (!p4tc_extern_params_check_flags(param, extack)) {
+ ret = -EINVAL;
+ goto params_del;
+ }
+
+ if (has_key_param) {
+ if (param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY) {
+ NL_SET_ERR_MSG(extack,
+ "There can't be 2 key params");
+ ret = -EINVAL;
+ goto params_del;
+ }
+ } else {
+ has_key_param = param->flags & P4TC_EXT_PARAMS_FLAG_ISKEY;
+ }
+
+ if (has_scalar_param) {
+ if (!param->flags ||
+ (param->flags & P4TC_EXT_PARAMS_FLAG_IS_DATASCALAR)) {
+ NL_SET_ERR_MSG(extack,
+ "All data parameters must be scalars");
+ ret = -EINVAL;
+ goto params_del;
+ }
+ } else {
+ has_scalar_param = param->flags & P4TC_EXT_PARAMS_FLAG_IS_DATASCALAR;
+ }
+ if (tbl_bindable) {
+ if (!p4tc_is_type_unsigned(param->type->typeid)) {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Extern with %s parameter is unbindable",
+ param->type->name);
+ ret = -EINVAL;
+ goto params_del;
+ }
+ }
+ params->num_params++;
+ }
+ *is_scalar = has_scalar_param;
+
+ return params;
+
+params_del:
+ p4tc_ext_params_free(params, true);
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_extern_params *
+p4tc_extern_create_params_value(struct net *net, struct nlattr *nla,
+ bool *is_scalar, bool tbl_bindable,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ struct p4tc_extern_params *params;
+ int ret;
+
+ params = p4tc_extern_params_init();
+ if (!params) {
+ ret = -ENOMEM;
+ goto err_out;
+ }
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL,
+ extack);
+ if (ret < 0) {
+ ret = -EINVAL;
+ goto params_del;
+ }
+ } else {
+ return params;
+ }
+
+ return p4tc_extern_init_params_value(net, params, tb, is_scalar,
+ tbl_bindable, extack);
+
+params_del:
+ p4tc_ext_params_free(params, true);
+err_out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_extern_params *
+p4tc_extern_update_params_value(struct net *net, struct nlattr *nla,
+ bool *is_scalar, bool tbl_bindable,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+ struct p4tc_extern_params *params;
+ int ret;
+
+ if (nla) {
+ params = p4tc_extern_params_init();
+ if (!params) {
+ ret = -ENOMEM;
+ goto err_out;
+ }
+
+ ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL,
+ extack);
+ if (ret < 0) {
+ ret = -EINVAL;
+ goto params_del;
+ }
+ } else {
+ return NULL;
+ }
+
+ return p4tc_extern_init_params_value(net, params, tb, is_scalar,
+ tbl_bindable, extack);
+
+params_del:
+ p4tc_ext_params_free(params, true);
+err_out:
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_tmpl_extern *
+p4tc_tmpl_ext_find_byanyattr(struct p4tc_pipeline *pipeline,
+ struct nlattr *name_attr, u32 ext_id,
+ struct netlink_ext_ack *extack)
+{
+ char *extern_name = NULL;
+
+ if (name_attr)
+ extern_name = nla_data(name_attr);
+
+ return p4tc_tmpl_ext_find_byany(pipeline, extern_name, ext_id,
+ extack);
+}
+
+int p4tc_register_extern(struct p4tc_extern_ops *ext)
+{
+ if (p4tc_extern_lookup_n(ext->kind))
+ return -EEXIST;
+
+ if (!p4tc_extern_mod_callbacks_check(ext))
+ return -EINVAL;
+
+ write_lock(&ext_mod_lock);
+ list_add_tail(&ext->head, &ext_base);
+ write_unlock(&ext_mod_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL(p4tc_register_extern);
+
+int p4tc_unregister_extern(struct p4tc_extern_ops *ext)
+{
+ struct p4tc_extern_ops *a;
+ int err = -ENOENT;
+
+ write_lock(&ext_mod_lock);
+ list_for_each_entry(a, &ext_base, head) {
+ if (a == ext) {
+ list_del(&ext->head);
+ err = 0;
+ break;
+ }
+ }
+ write_unlock(&ext_mod_lock);
+ return err;
+}
+EXPORT_SYMBOL(p4tc_unregister_extern);
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_find_byname(struct p4tc_pipeline *pipeline,
+ const char *extname)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ unsigned long tmp, ext_id;
+
+ idr_for_each_entry_ul(&pipeline->user_ext_idr, pipe_ext, tmp, ext_id) {
+ if (strncmp(pipe_ext->ext_name, extname, EXTERNNAMSIZ) == 0)
+ return pipe_ext;
+ }
+
+ return NULL;
+}
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_find_byany(struct p4tc_pipeline *pipeline,
+ const char *extname, u32 ext_id,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ int err;
+
+ if (ext_id) {
+ pipe_ext = p4tc_user_pipeline_ext_find_byid(pipeline, ext_id);
+ if (!pipe_ext) {
+ NL_SET_ERR_MSG(extack, "Unable to find extern");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ if (extname) {
+ pipe_ext = p4tc_user_pipeline_ext_find_byname(pipeline,
+ extname);
+ if (!pipe_ext) {
+ NL_SET_ERR_MSG(extack,
+ "Extern name not found");
+ err = -EINVAL;
+ goto out;
+ }
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify extern name or id");
+ err = -EINVAL;
+ goto out;
+ }
+ }
+
+ return pipe_ext;
+
+out:
+ return ERR_PTR(err);
+}
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_find_byanyattr(struct p4tc_pipeline *pipeline,
+ struct nlattr *name_attr, u32 ext_id,
+ struct netlink_ext_ack *extack)
+{
+ char *extname = NULL;
+
+ if (name_attr)
+ extname = nla_data(name_attr);
+
+ return p4tc_user_pipeline_ext_find_byany(pipeline, extname, ext_id,
+ extack);
+}
+
+static bool
+p4tc_user_pipeline_insts_exceeded(struct p4tc_user_pipeline_extern *pipe_ext)
+{
+ const u32 max_num_insts = pipe_ext->tmpl_ext->max_num_insts;
+
+ return atomic_read(&pipe_ext->curr_insts_num) == max_num_insts;
+}
+
+static struct p4tc_user_pipeline_extern *
+p4tc_user_pipeline_ext_find_or_create(struct p4tc_pipeline *pipeline,
+ struct p4tc_tmpl_extern *tmpl_ext,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ int err;
+
+ pipe_ext = p4tc_user_pipeline_ext_get(pipeline, tmpl_ext->ext_id);
+ if (pipe_ext != ERR_PTR(-ENOENT)) {
+ if (p4tc_user_pipeline_insts_exceeded(pipe_ext)) {
+ NL_SET_ERR_MSG(extack,
+ "Maximum number of instances exceeded");
+ p4tc_user_pipeline_ext_put_ref(pipe_ext);
+ return ERR_PTR(-ENOSPC);
+ }
+
+ return pipe_ext;
+ }
+
+ pipe_ext = kzalloc(sizeof(*pipe_ext), GFP_KERNEL);
+ if (!pipe_ext)
+ return ERR_PTR(-ENOMEM);
+
+ pipe_ext->ext_id = tmpl_ext->ext_id;
+ err = idr_alloc_u32(&pipeline->user_ext_idr, pipe_ext,
+ &pipe_ext->ext_id, pipe_ext->ext_id, GFP_KERNEL);
+ if (err < 0)
+ goto free_pipe_ext;
+
+ strscpy(pipe_ext->ext_name, tmpl_ext->common.name, EXTERNNAMSIZ);
+ idr_init(&pipe_ext->e_inst_idr);
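+ /* One reference for pipeline->user_ext_idr, one for the caller */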
+ refcount_set(&pipe_ext->ext_ref, 2);
+ atomic_set(&pipe_ext->curr_insts_num, 0);
+ refcount_inc(&tmpl_ext->tmpl_ref);
+ pipe_ext->tmpl_ext = tmpl_ext;
+ pipe_ext->free = p4tc_user_pipeline_ext_free;
+
+ return pipe_ext;
+
+free_pipe_ext:
+ kfree(pipe_ext);
+ return ERR_PTR(err);
+}
+
+struct p4tc_user_pipeline_extern *
+p4tc_pipe_ext_find_bynames(struct net *net, struct p4tc_pipeline *pipeline,
+ const char *extname, struct netlink_ext_ack *extack)
+{
+ return p4tc_user_pipeline_ext_find_byany(pipeline, extname, 0,
+ extack);
+}
+
+struct p4tc_extern_inst *
+p4tc_ext_inst_find_bynames(struct net *net, struct p4tc_pipeline *pipeline,
+ const char *extname, const char *instname,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ struct p4tc_extern_inst *inst;
+
+ pipe_ext = p4tc_pipe_ext_find_bynames(net, pipeline, extname, extack);
+ if (IS_ERR(pipe_ext))
+ return (void *)pipe_ext;
+
+ inst = p4tc_ext_inst_find_byany(pipe_ext, instname, 0, extack);
+
+ return inst;
+}
+
+static void
+__p4tc_ext_inst_table_unbind(struct p4tc_user_pipeline_extern *pipe_ext,
+ struct p4tc_extern_inst *inst)
+{
+ p4tc_user_pipeline_ext_put_ref(pipe_ext);
+ p4tc_ext_inst_put_ref(inst);
+}
+
+void
+p4tc_ext_inst_table_unbind(struct p4tc_table *table,
+ struct p4tc_user_pipeline_extern *pipe_ext,
+ struct p4tc_extern_inst *inst)
+{
+ table->tbl_counter = NULL;
+ __p4tc_ext_inst_table_unbind(pipe_ext, inst);
+}
+
+struct p4tc_extern_inst *
+p4tc_ext_find_byids(struct p4tc_pipeline *pipeline,
+ const u32 ext_id, const u32 inst_id)
+{
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ struct p4tc_extern_inst *inst;
+ int err;
+
+ pipe_ext = p4tc_user_pipeline_ext_find_byid(pipeline, ext_id);
+ if (!pipe_ext) {
+ err = -ENOENT;
+ goto out;
+ }
+
+ inst = p4tc_ext_inst_find_byid(pipe_ext, inst_id);
+ if (!inst) {
+ err = -EBUSY;
+ goto out;
+ }
+
+ return inst;
+
+out:
+ return ERR_PTR(err);
+}
+
+#define SEPARATOR "/"
+
+struct p4tc_extern_inst *
+p4tc_ext_inst_table_bind(struct p4tc_pipeline *pipeline,
+ struct p4tc_user_pipeline_extern **pipe_ext,
+ const char *ext_inst_path,
+ struct netlink_ext_ack *extack)
+{
+ char *instname_clone, *extname, *instname;
+ struct p4tc_extern_inst *inst;
+ int err;
+
+ instname_clone = instname = kstrdup(ext_inst_path, GFP_KERNEL);
+ if (!instname)
+ return ERR_PTR(-ENOMEM);
+
+ extname = strsep(&instname, SEPARATOR);
+
+ *pipe_ext = p4tc_pipe_ext_find_bynames(pipeline->net, pipeline, extname,
+ extack);
+ if (IS_ERR(*pipe_ext)) {
+ err = PTR_ERR(*pipe_ext);
+ goto free_inst_path;
+ }
+
+ inst = p4tc_ext_inst_get(*pipe_ext, instname, 0, extack);
+ if (IS_ERR(inst)) {
+ err = PTR_ERR(inst);
+ goto free_inst_path;
+ }
+
+ if (!inst->tbl_bindable) {
+ __p4tc_ext_inst_table_unbind(*pipe_ext, inst);
+ NL_SET_ERR_MSG_FMT(extack,
+ "Extern instance %s can't be bound to a table",
+ inst->common.name);
+ err = -EPERM;
+ goto put_inst;
+ }
+
+ kfree(instname_clone);
+
+ return inst;
+
+put_inst:
+ p4tc_ext_inst_put_ref(inst);
+
+free_inst_path:
+ kfree(instname_clone);
+ return ERR_PTR(err);
+}
+
+struct p4tc_extern_inst *
+p4tc_ext_inst_get_byids(struct net *net, struct p4tc_pipeline **pipeline,
+ struct p4tc_ext_bpf_params *params)
+{
+ struct p4tc_extern_inst *inst;
+ int err;
+
+ *pipeline = p4tc_pipeline_find_get(net, NULL, params->pipe_id, NULL);
+ if (IS_ERR(*pipeline))
+ return (struct p4tc_extern_inst *)*pipeline;
+
+ inst = p4tc_ext_find_byids(*pipeline, params->ext_id, params->inst_id);
+ if (IS_ERR(inst)) {
+ err = PTR_ERR(inst);
+ goto put_pipeline;
+ }
+
+ return inst;
+
+put_pipeline:
+ p4tc_pipeline_put(*pipeline);
+
+ return ERR_PTR(err);
+}
+EXPORT_SYMBOL(p4tc_ext_inst_get_byids);
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_update(struct net *net, struct nlmsghdr *n,
+ struct nlattr *nla, struct p4tc_pipeline *pipeline,
+ u32 *ids, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_params *new_params, *new_constr_params;
+ struct p4tc_extern_params *params, *constr_params;
+ struct nlattr *tb[P4TC_TMPL_EXT_INST_MAX + 1];
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ struct p4tc_extern_inst *new_inst = NULL;
+ struct p4tc_pipeline *root_pipeline;
+ struct p4tc_extern_inst *old_inst;
+ bool has_scalar_params = false;
+ struct p4tc_tmpl_extern *ext;
+ u32 ext_id = 0, inst_id = 0;
+ bool tbl_bindable = false;
+ char *inst_name = NULL;
+ u32 max_num_elems = 0;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_INST_MAX, nla,
+ tc_extern_inst_policy, extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ ext_id = ids[P4TC_TMPL_EXT_IDX];
+
+ root_pipeline = p4tc_pipeline_find_byid(net, P4TC_KERNEL_PIPEID);
+
+ ext = p4tc_tmpl_ext_find_byanyattr(root_pipeline,
+ tb[P4TC_TMPL_EXT_INST_EXT_NAME],
+ ext_id, extack);
+ if (IS_ERR(ext))
+ return (struct p4tc_extern_inst *)ext;
+
+ inst_id = ids[P4TC_TMPL_EXT_INST_IDX];
+
+ if (tb[P4TC_TMPL_EXT_INST_NAME])
+ inst_name = nla_data(tb[P4TC_TMPL_EXT_INST_NAME]);
+
+ pipe_ext = p4tc_user_pipeline_ext_find_byid(pipeline, ext->ext_id);
+ if (!pipe_ext) {
+ NL_SET_ERR_MSG(extack, "Unable to find pipeline extern by id");
+ return ERR_PTR(-ENOENT);
+ }
+
+ old_inst = p4tc_ext_inst_find_byanyattr(pipe_ext, tb[P4TC_TMPL_EXT_INST_NAME],
+ inst_id, extack);
+ if (IS_ERR(old_inst)) {
+ NL_SET_ERR_MSG(extack, "Unable to find extern instance by id");
+ return ERR_PTR(-ENOENT);
+ }
+
+ if (tb[P4TC_TMPL_EXT_INST_NUM_ELEMS])
+ max_num_elems = nla_get_u32(tb[P4TC_TMPL_EXT_INST_NUM_ELEMS]);
+
+ if (tb[P4TC_TMPL_EXT_INST_TABLE_BINDABLE])
+ tbl_bindable = true;
+ else
+ tbl_bindable = old_inst->tbl_bindable;
+
+ if (tbl_bindable && !p4tc_ext_has_exec(ext->ops)) {
+ NL_SET_ERR_MSG(extack,
+ "Instance may only be table bindable if module has exec");
+ return ERR_PTR(-EINVAL);
+ }
+
+ new_params = p4tc_extern_update_params_value(net,
+ tb[P4TC_TMPL_EXT_INST_CONTROL_PARAMS],
+ &has_scalar_params, tbl_bindable,
+ extack);
+ if (IS_ERR(new_params))
+ return (struct p4tc_extern_inst *)new_params;
+
+ params = new_params ?: old_inst->params;
+ max_num_elems = max_num_elems ?: old_inst->max_num_elems;
+
+ if (p4tc_ext_inst_has_construct(old_inst)) {
+ struct nlattr *nla_constr_params = tb[P4TC_TMPL_EXT_INST_CONSTR_PARAMS];
+
+ new_constr_params = p4tc_extern_update_params_value(net,
+ nla_constr_params,
+ &has_scalar_params,
+ tbl_bindable,
+ extack);
+ if (IS_ERR(new_constr_params)) {
+ if (new_params)
+ p4tc_ext_params_free(new_params, true);
+
+ return (struct p4tc_extern_inst *)new_constr_params;
+ }
+ constr_params = new_constr_params ?: old_inst->constr_params;
+
+ ret = old_inst->ops->construct(&new_inst, params, constr_params,
+ max_num_elems, tbl_bindable,
+ extack);
+ if (new_params)
+ p4tc_ext_params_free(new_params, true);
+ if (new_constr_params)
+ p4tc_ext_params_free(new_constr_params, true);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ } else {
+ if (tb[P4TC_TMPL_EXT_INST_CONSTR_PARAMS]) {
+ NL_SET_ERR_MSG(extack,
+ "Need construct mod op to pass constructor params");
+ ret = -EINVAL;
+ goto free_control_params;
+ }
+
+ new_inst = p4tc_ext_inst_alloc(ext->ops, max_num_elems,
+ tbl_bindable,
+ pipe_ext->ext_name);
+ if (IS_ERR(new_inst)) {
+ ret = PTR_ERR(new_inst);
+ goto free_control_params;
+ }
+ new_inst->params = params;
+ }
+
+ new_inst->ext_inst_id = old_inst->ext_inst_id;
+ new_inst->is_scalar = has_scalar_params;
+ new_inst->ext_id = ext->ext_id;
+ new_inst->pipe_ext = pipe_ext;
+
+ if (inst_name)
+ strscpy(new_inst->common.name, inst_name, EXTERNINSTNAMSIZ);
+ else
+ strscpy(new_inst->common.name, old_inst->common.name,
+ EXTERNINSTNAMSIZ);
+
+ idr_replace(&pipe_ext->e_inst_idr, new_inst, old_inst->ext_inst_id);
+
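+ /* Free old_inst's params only if they were replaced by new_params */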
+ ___p4tc_ext_inst_put(old_inst, !!new_params);
+
+ return new_inst;
+
+free_control_params:
+ if (new_params)
+ p4tc_ext_params_free(new_params, true);
+
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_extern_inst *
+p4tc_ext_inst_create(struct net *net, struct nlmsghdr *n,
+ struct nlattr *nla, struct p4tc_pipeline *pipeline,
+ u32 *ids, struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_params *constr_params = NULL, *params;
+ struct nlattr *tb[P4TC_TMPL_EXT_INST_MAX + 1];
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ struct p4tc_pipeline *root_pipeline;
+ bool has_scalar_params = false;
+ struct p4tc_extern_inst *inst;
+ struct p4tc_tmpl_extern *ext;
+ u32 ext_id = 0, inst_id = 0;
+ bool tbl_bindable = false;
+ char *inst_name = NULL;
+ u32 max_num_elems;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_INST_MAX, nla,
+ tc_extern_inst_policy, extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ ext_id = ids[P4TC_TMPL_EXT_IDX];
+
+ root_pipeline = p4tc_pipeline_find_byid(net, P4TC_KERNEL_PIPEID);
+
+ ext = p4tc_tmpl_ext_find_byanyattr(root_pipeline,
+ tb[P4TC_TMPL_EXT_INST_EXT_NAME],
+ ext_id, extack);
+ if (IS_ERR(ext))
+ return (struct p4tc_extern_inst *)ext;
+
+ if (tb[P4TC_TMPL_EXT_INST_NAME]) {
+ inst_name = nla_data(tb[P4TC_TMPL_EXT_INST_NAME]);
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify extern name");
+ return ERR_PTR(-EINVAL);
+ }
+
+ inst_id = ids[P4TC_TMPL_EXT_INST_IDX];
+ if (!inst_id) {
+ NL_SET_ERR_MSG(extack, "Must specify extern instance id");
+ return ERR_PTR(-EINVAL);
+ }
+
+ pipe_ext = p4tc_user_pipeline_ext_find_or_create(pipeline, ext, extack);
+ if (IS_ERR(pipe_ext))
+ return (struct p4tc_extern_inst *)pipe_ext;
+
+ if (p4tc_ext_inst_find_byid(pipe_ext, inst_id) ||
+ p4tc_ext_inst_find_byname(pipe_ext, inst_name)) {
+ NL_SET_ERR_MSG(extack,
+ "Extern instance with same name or ID already exists");
+ ret = -EEXIST;
+ goto put_pipe_ext;
+ }
+
+ if (tb[P4TC_TMPL_EXT_INST_NUM_ELEMS])
+ max_num_elems = nla_get_u32(tb[P4TC_TMPL_EXT_INST_NUM_ELEMS]);
+ else
+ max_num_elems = P4TC_DEFAULT_NUM_EXT_INST_ELEMS;
+
+ if (tb[P4TC_TMPL_EXT_INST_TABLE_BINDABLE])
+ tbl_bindable = true;
+
+ if (tbl_bindable && !p4tc_ext_has_exec(ext->ops)) {
+ NL_SET_ERR_MSG(extack,
+ "Instance may only be table bindable if module has exec");
+ return ERR_PTR(-EINVAL);
+ }
+
+ params = p4tc_extern_create_params_value(net,
+ tb[P4TC_TMPL_EXT_INST_CONTROL_PARAMS],
+ &has_scalar_params,
+ tbl_bindable, extack);
+ if (IS_ERR(params))
+ return (struct p4tc_extern_inst *)params;
+
+ if (p4tc_ext_has_construct(ext->ops)) {
+ struct nlattr *nla_constr_params = tb[P4TC_TMPL_EXT_INST_CONSTR_PARAMS];
+
+ constr_params = p4tc_extern_create_params_value(net,
+ nla_constr_params,
+ &has_scalar_params,
+ tbl_bindable, extack);
+ if (IS_ERR(constr_params)) {
+ ret = PTR_ERR(constr_params);
+ goto free_control_params;
+ }
+
+ ret = ext->ops->construct(&inst, params, constr_params,
+ max_num_elems, tbl_bindable, extack);
+ p4tc_ext_params_free(params, true);
+ p4tc_ext_params_free(constr_params, true);
+ if (ret < 0)
+ goto put_pipe_ext;
+ } else {
+ if (tb[P4TC_TMPL_EXT_INST_CONSTR_PARAMS]) {
+ NL_SET_ERR_MSG(extack,
+ "Need construct mod op to pass constructor params");
+ ret = -EINVAL;
+ goto free_control_params;
+ }
+
+ inst = p4tc_ext_inst_alloc(ext->ops, max_num_elems,
+ tbl_bindable, pipe_ext->ext_name);
+ if (IS_ERR(inst)) {
+ ret = PTR_ERR(inst);
+ goto free_control_params;
+ }
+
+ inst->params = params;
+ }
+
+ inst->ext_id = ext->ext_id;
+ inst->ext_inst_id = inst_id;
+ inst->pipe_ext = pipe_ext;
+ inst->is_scalar = has_scalar_params;
+
+ strscpy(inst->common.name, inst_name, EXTERNINSTNAMSIZ);
+
+ ret = idr_alloc_u32(&pipe_ext->e_inst_idr, inst, &inst_id,
+ inst_id, GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to allocate ID for extern instance");
+ goto free_extern;
+ }
+
+ atomic_inc(&pipe_ext->curr_insts_num);
+
+ return inst;
+
+free_extern:
+ if (p4tc_ext_inst_has_construct(inst))
+ inst->ops->deconstruct(inst);
+ else
+ kfree(inst);
+
+free_control_params:
+ if (!p4tc_ext_has_construct(ext->ops) && params)
+ p4tc_ext_params_free(params, true);
+
+put_pipe_ext:
+ p4tc_user_pipeline_ext_put_ref(pipe_ext);
+
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_ext_inst_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ u32 *ids = nl_path_attrs->ids;
+ u32 pipeid = ids[P4TC_PID_IDX];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_extern_inst *inst;
+
+ pipeline = p4tc_pipeline_find_byany_unsealed(net, nl_path_attrs->pname,
+ pipeid, extack);
+ if (IS_ERR(pipeline))
+ return (void *)pipeline;
+
+ switch (n->nlmsg_type) {
+ case RTM_CREATEP4TEMPLATE:
+ inst = p4tc_ext_inst_create(net, n, nla, pipeline, ids,
+ extack);
+ break;
+ case RTM_UPDATEP4TEMPLATE:
+ inst = p4tc_ext_inst_update(net, n, nla, pipeline, ids,
+ extack);
+ break;
+ default:
+ /* Should never happen */
+ NL_SET_ERR_MSG(extack,
+ "Only create and update are supported for extern inst");
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ if (IS_ERR(inst))
+ goto out;
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!ids[P4TC_TMPL_EXT_IDX])
+ ids[P4TC_TMPL_EXT_IDX] = inst->ext_id;
+
+ if (!ids[P4TC_TMPL_EXT_INST_IDX])
+ ids[P4TC_TMPL_EXT_INST_IDX] = inst->ext_inst_id;
+
+out:
+ return (struct p4tc_template_common *)inst;
+}
+
+static struct p4tc_tmpl_extern *
+p4tc_tmpl_ext_create(struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_pipeline *pipeline, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_TMPL_EXT_MAX + 1];
+ struct p4tc_tmpl_extern *ext;
+ char *extern_name = NULL;
+ u32 ext_id = 0;
+ int ret;
+
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_MAX, nla, tc_extern_policy,
+ extack);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ ext_id = ids[P4TC_TMPL_EXT_IDX];
+ if (!ext_id) {
+ NL_SET_ERR_MSG(extack, "Must specify extern id");
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (tb[P4TC_TMPL_EXT_NAME]) {
+ extern_name = nla_data(tb[P4TC_TMPL_EXT_NAME]);
+ } else {
+ NL_SET_ERR_MSG(extack,
+ "Must specify extern name");
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (p4tc_tmpl_ext_find_byid(pipeline, ext_id) ||
+ p4tc_tmpl_ext_find_name(pipeline, extern_name)) {
+ NL_SET_ERR_MSG(extack,
+ "Extern with same id or name was already inserted");
+ return ERR_PTR(-EEXIST);
+ }
+
+ ext = kzalloc(sizeof(*ext), GFP_KERNEL);
+ if (!ext) {
+ NL_SET_ERR_MSG(extack, "Failed to allocate ext");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ if (tb[P4TC_TMPL_EXT_NUM_INSTS]) {
+ u16 *num_insts = nla_data(tb[P4TC_TMPL_EXT_NUM_INSTS]);
+
+ ext->max_num_insts = *num_insts;
+ } else {
+ ext->max_num_insts = P4TC_DEFAULT_NUM_EXT_INSTS;
+ }
+
+ if (tb[P4TC_TMPL_EXT_HAS_EXEC_METHOD])
+ ext->has_exec_method = nla_get_u8(tb[P4TC_TMPL_EXT_HAS_EXEC_METHOD]);
+
+ /* Extern module is not mandatory */
+ if (ext->has_exec_method) {
+ struct p4tc_extern_ops *ops;
+
+ ops = p4tc_extern_ops_load(extern_name);
+ if (!ops) {
+ ret = -ENOENT;
+ goto free_extern;
+ }
+
+ ext->ops = ops;
+ }
+
+ ret = idr_alloc_u32(&pipeline->p_ext_idr, ext, &ext_id, ext_id,
+ GFP_KERNEL);
+ if (ret < 0) {
+ NL_SET_ERR_MSG(extack, "Unable to allocate ID for extern");
+ goto free_extern;
+ }
+
+ ext->ext_id = ext_id;
+
+ strscpy(ext->common.name, extern_name, EXTERNNAMSIZ);
+
+ refcount_set(&ext->tmpl_ref, 1);
+
+ ext->common.p_id = pipeline->common.p_id;
+ ext->common.ops = (struct p4tc_template_ops *)&p4tc_tmpl_ext_ops;
+
+ return ext;
+
+free_extern:
+ p4tc_extern_ops_put(ext->ops);
+ kfree(ext);
+
+ return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_tmpl_ext_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_pipeline *pipeline;
+ u32 *ids = nl_path_attrs->ids;
+ struct p4tc_tmpl_extern *ext;
+
+ if (p4tc_tmpl_msg_is_update(n)) {
+ NL_SET_ERR_MSG(extack, "Extern update not supported");
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ pipeline = p4tc_pipeline_find_byid(net, P4TC_KERNEL_PIPEID);
+
+ ext = p4tc_tmpl_ext_create(n, nla, pipeline, ids, extack);
+
+ return (struct p4tc_template_common *)ext;
+}
+
+static int ext_inst_param_fill_nlmsg(struct sk_buff *skb,
+ struct p4tc_extern_param *param)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+
+ if (nla_put_string(skb, P4TC_EXT_PARAMS_NAME, param->name))
+ goto out_nlmsg_trim;
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_ID, param->id))
+ goto out_nlmsg_trim;
+
+ if (nla_put_u32(skb, P4TC_EXT_PARAMS_TYPE, param->type->typeid))
+ goto out_nlmsg_trim;
+
+ if (param->value && p4tc_ext_param_value_dump_tmpl(skb, param))
+ goto out_nlmsg_trim;
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int ext_inst_params_fill_nlmsg(struct sk_buff *skb,
+ struct p4tc_extern_params *params)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_extern_param *param;
+ struct nlattr *nest_count;
+ unsigned long id, tmp;
+ int i = 1;
+
+ if (!params)
+ return skb->len;
+
+ idr_for_each_entry_ul(¶ms->params_idr, param, tmp, id) {
+ nest_count = nla_nest_start(skb, i);
+ if (!nest_count)
+ goto out_nlmsg_trim;
+
+ if (ext_inst_param_fill_nlmsg(skb, param) < 0)
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, nest_count);
+ i++;
+ }
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int
+__p4tc_ext_inst_fill_nlmsg(struct sk_buff *skb, struct p4tc_extern_inst *inst,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *nest, *parms, *constr_parms;
+ const char *ext_name = inst->ext_name;
+ unsigned char *b = nlmsg_get_pos(skb);
+ /* Extern id + extern instance id */
+ u32 ids[2];
+
+ ids[0] = inst->ext_id;
+ ids[1] = inst->ext_inst_id;
+
+ if (nla_put(skb, P4TC_PATH, sizeof(ids), &ids))
+ goto out_nlmsg_trim;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ if (ext_name[0]) {
+ if (nla_put_string(skb, P4TC_TMPL_EXT_INST_EXT_NAME,
+ ext_name))
+ goto out_nlmsg_trim;
+ }
+
+ if (inst->common.name[0]) {
+ if (nla_put_string(skb, P4TC_TMPL_EXT_INST_NAME,
+ inst->common.name))
+ goto out_nlmsg_trim;
+ }
+
+ if (nla_put_u32(skb, P4TC_TMPL_EXT_INST_NUM_ELEMS,
+ inst->max_num_elems))
+ goto out_nlmsg_trim;
+
+ parms = nla_nest_start(skb, P4TC_TMPL_EXT_INST_CONTROL_PARAMS);
+ if (!parms)
+ goto out_nlmsg_trim;
+
+ if (ext_inst_params_fill_nlmsg(skb, inst->params) < 0)
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, parms);
+
+ constr_parms = nla_nest_start(skb, P4TC_TMPL_EXT_INST_CONSTR_PARAMS);
+ if (!constr_parms)
+ goto out_nlmsg_trim;
+
+ if (ext_inst_params_fill_nlmsg(skb, inst->constr_params) < 0)
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, constr_parms);
+
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int __p4tc_tmpl_ext_fill_nlmsg(struct sk_buff *skb,
+ struct p4tc_tmpl_extern *ext)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct nlattr *nest;
+ /* Extern id */
+ u32 id;
+
+ id = ext->ext_id;
+
+ if (nla_put(skb, P4TC_PATH, sizeof(id), &id))
+ goto out_nlmsg_trim;
+
+ nest = nla_nest_start(skb, P4TC_PARAMS);
+ if (!nest)
+ goto out_nlmsg_trim;
+
+ if (ext->common.name[0]) {
+ if (nla_put_string(skb, P4TC_TMPL_EXT_NAME, ext->common.name))
+ goto out_nlmsg_trim;
+ }
+
+ if (nla_put_u16(skb, P4TC_TMPL_EXT_NUM_INSTS, ext->max_num_insts))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, nest);
+
+ return skb->len;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int p4tc_ext_inst_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_extern_inst *inst = to_extern_inst(template);
+
+ if (__p4tc_ext_inst_fill_nlmsg(skb, inst, extack) < 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for extern instance");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int p4tc_tmpl_ext_fill_nlmsg(struct net *net, struct sk_buff *skb,
+ struct p4tc_template_common *template,
+ struct netlink_ext_ack *extack)
+{
+ struct p4tc_tmpl_extern *ext = to_extern(template);
+
+ if (__p4tc_tmpl_ext_fill_nlmsg(skb, ext) <= 0) {
+ NL_SET_ERR_MSG(extack,
+ "Failed to fill notification attributes for extern");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int p4tc_tmpl_ext_flush(struct sk_buff *skb,
+ struct p4tc_pipeline *pipeline,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_tmpl_extern *ext;
+ unsigned long tmp, ext_id;
+ int ret = 0;
+ u32 path[1];
+ int i = 0;
+
+ path[0] = 0;
+
+ if (idr_is_empty(&pipeline->p_ext_idr)) {
+ NL_SET_ERR_MSG(extack, "There are no externs to flush");
+ goto out_nlmsg_trim;
+ }
+
+ if (nla_put(skb, P4TC_PATH, sizeof(path), path))
+ goto out_nlmsg_trim;
+
+ idr_for_each_entry_ul(&pipeline->p_ext_idr, ext, tmp, ext_id) {
+ if (_p4tc_tmpl_ext_put(pipeline, ext, false, extack) < 0) {
+ ret = -EBUSY;
+ continue;
+ }
+ i++;
+ }
+
+ nla_put_u32(skb, P4TC_COUNT, i);
+
+ if (ret < 0) {
+ if (i == 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to flush any externs");
+ goto out_nlmsg_trim;
+ } else {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Flush only %u externs", i);
+ }
+ }
+
+ return i;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return 0;
+}
+
+static int p4tc_ext_inst_flush(struct sk_buff *skb,
+ struct p4tc_pipeline *pipeline,
+ struct p4tc_user_pipeline_extern *pipe_ext,
+ struct netlink_ext_ack *extack)
+{
+ unsigned char *b = nlmsg_get_pos(skb);
+ struct p4tc_extern_inst *inst;
+ unsigned long tmp, inst_id;
+ int ret = 0;
+ u32 path[2];
+ int i = 0;
+
+ if (idr_is_empty(&pipe_ext->e_inst_idr)) {
+ NL_SET_ERR_MSG(extack, "There are no externs to flush");
+ goto out_nlmsg_trim;
+ }
+
+ path[0] = pipe_ext->ext_id;
+ path[1] = 0;
+
+ if (nla_put(skb, P4TC_PATH, sizeof(path), path))
+ goto out_nlmsg_trim;
+
+ idr_for_each_entry_ul(&pipe_ext->e_inst_idr, inst, tmp, inst_id) {
+ if (__p4tc_ext_inst_put(pipeline, inst, false, false,
+ extack) < 0) {
+ ret = -EBUSY;
+ continue;
+ }
+ i++;
+ }
+
+ /* We don't release pipe_ext in the loop to avoid use-after-free whilst
+ * iterating through e_inst_idr. We free it here only if flush
+ * succeeded, that is, all instances were deleted and thus ext_ref == 1
+ */
+ if (refcount_read(&pipe_ext->ext_ref) == 1)
+ p4tc_user_pipeline_ext_free(pipe_ext, &pipeline->user_ext_idr);
+
+ nla_put_u32(skb, P4TC_COUNT, i);
+
+ if (ret < 0) {
+ if (i == 0) {
+ NL_SET_ERR_MSG(extack,
+ "Unable to flush any externs instance");
+ goto out_nlmsg_trim;
+ } else {
+ NL_SET_ERR_MSG_FMT(extack,
+ "Flushed only %u extern instances", i);
+ }
+ }
+
+ return i;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return 0;
+}
+
+static int p4tc_ext_inst_gd(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_TMPL_EXT_INST_MAX + 1] = {NULL};
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ u32 *ids = nl_path_attrs->ids;
+ u32 inst_id = ids[P4TC_TMPL_EXT_INST_IDX];
+ unsigned char *b = nlmsg_get_pos(skb);
+ u32 ext_id = ids[P4TC_TMPL_EXT_IDX];
+ u32 pipe_id = ids[P4TC_PID_IDX];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_extern_inst *inst;
+ int ret;
+
+ if (n->nlmsg_type == RTM_GETP4TEMPLATE)
+ pipeline = p4tc_pipeline_find_byany(net, nl_path_attrs->pname,
+ pipe_id, extack);
+ else
+ pipeline = p4tc_pipeline_find_byany_unsealed(net,
+ nl_path_attrs->pname,
+ pipe_id, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_MAX, nla,
+ tc_extern_inst_policy, extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ pipe_ext = p4tc_user_pipeline_ext_find_byanyattr(pipeline,
+ tb[P4TC_TMPL_EXT_INST_EXT_NAME],
+ ext_id, extack);
+ if (IS_ERR(pipe_ext))
+ return PTR_ERR(pipe_ext);
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!ids[P4TC_TMPL_EXT_IDX])
+ ids[P4TC_TMPL_EXT_IDX] = pipe_ext->ext_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE && n->nlmsg_flags & NLM_F_ROOT)
+ return p4tc_ext_inst_flush(skb, pipeline, pipe_ext, extack);
+
+ inst = p4tc_ext_inst_find_byanyattr(pipe_ext,
+ tb[P4TC_TMPL_EXT_INST_NAME],
+ inst_id, extack);
+ if (IS_ERR(inst))
+ return PTR_ERR(inst);
+
+ ret = __p4tc_ext_inst_fill_nlmsg(skb, inst, extack);
+ if (ret < 0)
+ return -ENOMEM;
+
+ if (!ids[P4TC_TMPL_EXT_INST_IDX])
+ ids[P4TC_TMPL_EXT_INST_IDX] = inst->ext_inst_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+ ret = __p4tc_ext_inst_put(pipeline, inst, false, true, extack);
+ if (ret < 0)
+ goto out_nlmsg_trim;
+ }
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int p4tc_tmpl_ext_gd(struct net *net, struct sk_buff *skb,
+ struct nlmsghdr *n, struct nlattr *nla,
+ struct p4tc_path_nlattrs *nl_path_attrs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_TMPL_EXT_MAX + 1] = {NULL};
+ unsigned char *b = nlmsg_get_pos(skb);
+ u32 *ids = nl_path_attrs->ids;
+ u32 ext_id = ids[P4TC_TMPL_EXT_IDX];
+ struct p4tc_pipeline *pipeline;
+ struct p4tc_tmpl_extern *ext;
+ int ret;
+
+ pipeline = p4tc_pipeline_find_byid(net, P4TC_KERNEL_PIPEID);
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_MAX, nla,
+ tc_extern_policy, extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (!nl_path_attrs->pname_passed)
+ strscpy(nl_path_attrs->pname, pipeline->common.name,
+ PIPELINENAMSIZ);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE && n->nlmsg_flags & NLM_F_ROOT)
+ return p4tc_tmpl_ext_flush(skb, pipeline, extack);
+
+ ext = p4tc_tmpl_ext_find_byanyattr(pipeline, tb[P4TC_TMPL_EXT_NAME],
+ ext_id, extack);
+ if (IS_ERR(ext))
+ return PTR_ERR(ext);
+
+ ret = __p4tc_tmpl_ext_fill_nlmsg(skb, ext);
+ if (ret < 0)
+ return -ENOMEM;
+
+ if (!ids[P4TC_TMPL_EXT_IDX])
+ ids[P4TC_TMPL_EXT_IDX] = ext->ext_id;
+
+ if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+ ret = _p4tc_tmpl_ext_put(pipeline, ext, false, extack);
+ if (ret < 0)
+ goto out_nlmsg_trim;
+ }
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return ret;
+}
+
+static int p4tc_tmpl_ext_dump_1(struct sk_buff *skb,
+ struct p4tc_template_common *common)
+{
+ struct nlattr *param = nla_nest_start(skb, P4TC_PARAMS);
+ struct p4tc_tmpl_extern *ext = to_extern(common);
+ unsigned char *b = nlmsg_get_pos(skb);
+ u32 path[2];
+
+ if (!param)
+ goto out_nlmsg_trim;
+
+ if (ext->common.name[0] &&
+ nla_put_string(skb, P4TC_TMPL_EXT_NAME, ext->common.name))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, param);
+
+ path[0] = ext->ext_id;
+ if (nla_put(skb, P4TC_PATH, sizeof(path), path))
+ goto out_nlmsg_trim;
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -ENOMEM;
+}
+
+static int p4tc_tmpl_ext_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name, u32 *ids,
+ struct netlink_ext_ack *extack)
+{
+ struct net *net = sock_net(skb->sk);
+ struct p4tc_pipeline *pipeline;
+
+ pipeline = p4tc_pipeline_find_byid(net, P4TC_KERNEL_PIPEID);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!(*p_name))
+ *p_name = pipeline->common.name;
+
+ return p4tc_tmpl_generic_dump(skb, ctx, &pipeline->p_ext_idr,
+ P4TC_TMPL_EXT_IDX, extack);
+}
+
+static int p4tc_ext_inst_dump_1(struct sk_buff *skb,
+ struct p4tc_template_common *common)
+{
+ struct nlattr *param = nla_nest_start(skb, P4TC_PARAMS);
+ struct p4tc_extern_inst *inst = to_extern_inst(common);
+ unsigned char *b = nlmsg_get_pos(skb);
+ u32 path[2];
+
+ if (!param)
+ goto out_nlmsg_trim;
+
+ if (inst->common.name[0] &&
+ nla_put_string(skb, P4TC_TMPL_EXT_NAME, inst->common.name))
+ goto out_nlmsg_trim;
+
+ nla_nest_end(skb, param);
+
+ path[0] = inst->ext_id;
+ path[1] = inst->ext_inst_id;
+ if (nla_put(skb, P4TC_PATH, sizeof(path), path))
+ goto out_nlmsg_trim;
+
+ return 0;
+
+out_nlmsg_trim:
+ nlmsg_trim(skb, b);
+ return -ENOMEM;
+}
+
+static int p4tc_ext_inst_dump(struct sk_buff *skb,
+ struct p4tc_dump_ctx *ctx,
+ struct nlattr *nla, char **p_name,
+ u32 *ids, struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[P4TC_TMPL_EXT_INST_MAX + 1] = {NULL};
+ struct p4tc_user_pipeline_extern *pipe_ext;
+ u32 ext_id = ids[P4TC_TMPL_EXT_IDX];
+ struct net *net = sock_net(skb->sk);
+ struct p4tc_pipeline *pipeline;
+ u32 pipeid = ids[P4TC_PID_IDX];
+ int ret;
+
+ pipeline = p4tc_pipeline_find_byany(net, *p_name,
+ pipeid, extack);
+ if (IS_ERR(pipeline))
+ return PTR_ERR(pipeline);
+
+ if (!ids[P4TC_PID_IDX])
+ ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+ if (!(*p_name))
+ *p_name = pipeline->common.name;
+
+ if (nla) {
+ ret = nla_parse_nested(tb, P4TC_TMPL_EXT_INST_MAX, nla,
+ tc_extern_inst_policy, extack);
+ if (ret < 0)
+ return ret;
+ }
+
+ pipe_ext = p4tc_user_pipeline_ext_find_byanyattr(pipeline,
+ tb[P4TC_TMPL_EXT_INST_EXT_NAME],
+ ext_id, extack);
+ if (IS_ERR(pipe_ext))
+ return PTR_ERR(pipe_ext);
+
+ return p4tc_tmpl_generic_dump(skb, ctx, &pipe_ext->e_inst_idr,
+ P4TC_TMPL_EXT_INST_IDX, extack);
+}
+
+const struct p4tc_template_ops p4tc_ext_inst_ops = {
+ .cu = p4tc_ext_inst_cu,
+ .fill_nlmsg = p4tc_ext_inst_fill_nlmsg,
+ .gd = p4tc_ext_inst_gd,
+ .put = p4tc_ext_inst_put,
+ .dump = p4tc_ext_inst_dump,
+ .dump_1 = p4tc_ext_inst_dump_1,
+};
+
+const struct p4tc_template_ops p4tc_tmpl_ext_ops = {
+ .cu = p4tc_tmpl_ext_cu,
+ .fill_nlmsg = p4tc_tmpl_ext_fill_nlmsg,
+ .gd = p4tc_tmpl_ext_gd,
+ .put = p4tc_tmpl_ext_put,
+ .dump = p4tc_tmpl_ext_dump,
+ .dump_1 = p4tc_tmpl_ext_dump_1,
+};
--
2.34.1
^ permalink raw reply related [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 08/15] p4tc: add P4 data types
2023-11-16 14:59 ` [PATCH net-next v8 08/15] p4tc: add P4 data types Jamal Hadi Salim
@ 2023-11-16 16:03 ` Jiri Pirko
2023-11-17 12:01 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-16 16:03 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Thu, Nov 16, 2023 at 03:59:41PM CET, jhs@mojatatu.com wrote:
[...]
>diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
>new file mode 100644
>index 000000000..8f6f002ae
[...]
>+#define P4T_MAX_BITSZ 128
[...]
>+#define P4T_MAX_STR_SZ 32
[...]
>diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>new file mode 100644
>index 000000000..ba32dba66
>--- /dev/null
>+++ b/include/uapi/linux/p4tc.h
>@@ -0,0 +1,33 @@
>+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>+#ifndef __LINUX_P4TC_H
>+#define __LINUX_P4TC_H
>+
>+#define P4TC_MAX_KEYSZ 512
>+
>+enum {
>+ P4T_UNSPEC,
I wonder, what is the reason for the "P4T"/"P4TC" prefix inconsistency?
In the kernel header, that could be fixed, but in the uapi header this is
forever. Is this just to be aligned with other TC uapi
inconsistencies? :D
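To make the concern concrete, a uniformly prefixed variant could look roughly
like the sketch below; the P4TC_T_* names are hypothetical and shown only to
illustrate the point, they are not what the patch defines:

	/* hypothetical renaming with a single uapi-wide prefix */
	enum {
		P4TC_T_UNSPEC,
		P4TC_T_U8,
		P4TC_T_U16,
		P4TC_T_U32,
		/* ... remaining types follow the same pattern ... */
		__P4TC_T_MAX,
	};
	#define P4TC_T_MAX (__P4TC_T_MAX - 1)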
>+ P4T_U8,
>+ P4T_U16,
>+ P4T_U32,
>+ P4T_U64,
>+ P4T_STRING,
>+ P4T_S8,
>+ P4T_S16,
>+ P4T_S32,
>+ P4T_S64,
>+ P4T_MACADDR,
>+ P4T_IPV4ADDR,
>+ P4T_BE16,
>+ P4T_BE32,
>+ P4T_BE64,
>+ P4T_U128,
>+ P4T_S128,
>+ P4T_BOOL,
>+ P4T_DEV,
>+ P4T_KEY,
>+ __P4T_MAX,
>+};
>+
>+#define P4T_MAX (__P4T_MAX - 1)
>+
>+#endif
[...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-16 14:59 ` [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
@ 2023-11-16 16:11 ` Jiri Pirko
2023-11-17 12:09 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-16 16:11 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
[...]
>diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>index ba32dba66..4d33f44c1 100644
>--- a/include/uapi/linux/p4tc.h
>+++ b/include/uapi/linux/p4tc.h
>@@ -2,8 +2,71 @@
> #ifndef __LINUX_P4TC_H
> #define __LINUX_P4TC_H
>
>+#include <linux/types.h>
>+#include <linux/pkt_sched.h>
>+
>+/* pipeline header */
>+struct p4tcmsg {
>+ __u32 pipeid;
>+ __u32 obj;
>+};
I don't follow. Is there any sane reason to use a header instead of a normal
netlink attribute? Moreover, you extend the existing RT netlink with
a huge amount of p4 things. Isn't this a good time to finally introduce a
generic netlink TC family with a proper yaml spec, with all the benefits it
brings, and implement the p4 tc uapi there? Please?
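For illustration only, carrying those two ids as plain attributes instead of a
fixed family header could look roughly like the sketch below; the P4TC_A_*
attribute names and the helper are hypothetical, not something the patch defines:

	#include <net/netlink.h>

	/* hypothetical attributes that could replace struct p4tcmsg */
	enum {
		P4TC_A_UNSPEC,
		P4TC_A_PIPEID,	/* u32 */
		P4TC_A_OBJ,	/* u32 */
		__P4TC_A_MAX,
	};
	#define P4TC_A_MAX (__P4TC_A_MAX - 1)

	static int p4tc_put_ids(struct sk_buff *skb, u32 pipeid, u32 obj)
	{
		if (nla_put_u32(skb, P4TC_A_PIPEID, pipeid) ||
		    nla_put_u32(skb, P4TC_A_OBJ, obj))
			return -EMSGSIZE;
		return 0;
	}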
>+
>+#define P4TC_MAXPIPELINE_COUNT 32
>+#define P4TC_MAXTABLES_COUNT 32
>+#define P4TC_MINTABLES_COUNT 0
>+#define P4TC_MSGBATCH_SIZE 16
>+
> #define P4TC_MAX_KEYSZ 512
>
>+#define TEMPLATENAMSZ 32
>+#define PIPELINENAMSIZ TEMPLATENAMSZ
ugh. A prefix please?
pw-bot: cr
[...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-16 14:59 ` [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump Jamal Hadi Salim
@ 2023-11-16 16:28 ` Jiri Pirko
2023-11-17 15:11 ` Jamal Hadi Salim
2023-11-17 6:51 ` John Fastabend
1 sibling, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-16 16:28 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Thu, Nov 16, 2023 at 03:59:43PM CET, jhs@mojatatu.com wrote:
[...]
>diff --git a/include/net/act_api.h b/include/net/act_api.h
>index cd5a8e86f..b95a9bc29 100644
>--- a/include/net/act_api.h
>+++ b/include/net/act_api.h
>@@ -70,6 +70,7 @@ struct tc_action {
> #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
> #define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
> #define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
>+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
>
> /* Update lastuse only if needed, to avoid dirtying a cache line.
> * We use a temp variable to avoid fetching jiffies twice.
>diff --git a/include/net/p4tc.h b/include/net/p4tc.h
>index ccb54d842..68b00fa72 100644
>--- a/include/net/p4tc.h
>+++ b/include/net/p4tc.h
>@@ -9,17 +9,23 @@
> #include <linux/refcount.h>
> #include <linux/rhashtable.h>
> #include <linux/rhashtable-types.h>
>+#include <net/tc_act/p4tc.h>
>+#include <net/p4tc_types.h>
>
> #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
> #define P4TC_DEFAULT_MAX_RULES 1
> #define P4TC_PATH_MAX 3
>+#define P4TC_MAX_TENTRIES 33554432
Seeing a define like this one always makes me happier. Where does it come
from? Why not 0x2000000 at least?
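For reference, 33554432 is 2^25, so a self-documenting spelling (purely
illustrative) would be something like:

	/* illustrative only: 33554432 == 0x2000000 == 1 << 25 */
	#define P4TC_MAX_TENTRIES	(1 << 25)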
>
> #define P4TC_KERNEL_PIPEID 0
>
> #define P4TC_PID_IDX 0
>+#define P4TC_AID_IDX 1
>+#define P4TC_PARSEID_IDX 1
>
> struct p4tc_dump_ctx {
> u32 ids[P4TC_PATH_MAX];
>+ struct rhashtable_iter *iter;
> };
>
> struct p4tc_template_common;
>@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
>
> struct p4tc_pipeline {
> struct p4tc_template_common common;
>+ struct idr p_act_idr;
> struct rcu_head rcu;
> struct net *net;
>+ u32 num_created_acts;
> /* Accounts for how many entities are referencing this pipeline.
> * As for now only P4 filters can refer to pipelines.
> */
>@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
> const u32 pipeid,
> struct netlink_ext_ack *extack);
>
>+struct p4tc_act *tcf_p4_find_act(struct net *net,
>+ const struct tc_action_ops *a_o,
>+ struct netlink_ext_ack *extack);
>+void
>+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
>+
> static inline int p4tc_action_destroy(struct tc_action **acts)
> {
>+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
> int ret = 0;
>
> if (acts) {
>- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
>+ int j = 0;
>+ int i;
Move declarations to the beginning of the if body.
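For context, the kernel style being asked for keeps a block's local
declarations grouped together at the top of that block, before any statements.
A generic sketch of the shape; the helper and variables below are hypothetical,
not the elided patch code:

	#include <linux/pkt_cls.h>
	#include <net/act_api.h>

	/* hypothetical helper, only to show the declaration placement */
	static int count_actions(struct tc_action **acts)
	{
		int j = 0;

		if (acts) {
			/* block-local declarations first ... */
			int i;

			/* ... then the statements that use them */
			for (i = 0; i < TCA_ACT_MAX_PRIO && acts[i]; i++)
				j++;
		}

		return j;
	}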
[...]
>diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>index 4d33f44c1..7b89229a7 100644
>--- a/include/uapi/linux/p4tc.h
>+++ b/include/uapi/linux/p4tc.h
>@@ -4,6 +4,7 @@
>
> #include <linux/types.h>
> #include <linux/pkt_sched.h>
>+#include <linux/pkt_cls.h>
>
> /* pipeline header */
> struct p4tcmsg {
>@@ -17,9 +18,12 @@ struct p4tcmsg {
> #define P4TC_MSGBATCH_SIZE 16
>
> #define P4TC_MAX_KEYSZ 512
>+#define P4TC_DEFAULT_NUM_PREALLOC 16
>
> #define TEMPLATENAMSZ 32
> #define PIPELINENAMSIZ TEMPLATENAMSZ
>+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
>+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
Prefix? This is uapi. Could you please be more careful with naming at
least in the uapi area?
[...]
>diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
>new file mode 100644
>index 000000000..19db0772c
>--- /dev/null
>+++ b/net/sched/p4tc/p4tc_action.c
>@@ -0,0 +1,2242 @@
>+// SPDX-License-Identifier: GPL-2.0-or-later
>+/*
>+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
>+ *
>+ * Copyright (c) 2022-2023, Mojatatu Networks
>+ * Copyright (c) 2022-2023, Intel Corporation.
>+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
>+ * Victor Nogueira <victor@mojatatu.com>
>+ * Pedro Tammela <pctammela@mojatatu.com>
>+ */
>+
>+#include <linux/err.h>
>+#include <linux/errno.h>
>+#include <linux/init.h>
>+#include <linux/kernel.h>
>+#include <linux/kmod.h>
>+#include <linux/list.h>
>+#include <linux/module.h>
>+#include <linux/netdevice.h>
>+#include <linux/skbuff.h>
>+#include <linux/slab.h>
>+#include <linux/string.h>
>+#include <linux/types.h>
>+#include <net/flow_offload.h>
>+#include <net/net_namespace.h>
>+#include <net/netlink.h>
>+#include <net/pkt_cls.h>
>+#include <net/p4tc.h>
>+#include <net/sch_generic.h>
>+#include <net/sock.h>
>+#include <net/tc_act/p4tc.h>
>+
>+static LIST_HEAD(dynact_list);
>+
>+#define SEPARATOR "/"
Prefix? Btw, why exactly do you need this? It is used only once.
To quote a few function names in this file:
>+static void set_param_indices(struct idr *params_idr)
>+static void generic_free_param_value(struct p4tc_act_param *param)
>+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
>+static void dev_free_param_value(struct p4tc_act_param *param)
>+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
>+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
>+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
>+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
>+static void p4tc_param_put(struct p4tc_act_param *param)
>+static void free_intermediate_param(struct p4tc_act_param *param)
>+static void free_intermediate_params_list(struct list_head *params_list)
>+static int init_prealloc_params(struct p4tc_act *act,
>+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
>+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
>+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
>+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
>+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
>+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
>+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
>+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
>+static int generic_init_param_value(struct p4tc_act_param *nparam,
>+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
>+tcf_param_find_byany(struct p4tc_act *act,
>+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
>+static int __p4_init_param_type(struct p4tc_act_param *param,
>+static int tcf_p4_act_init_params(struct net *net,
>+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
>+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
>+static int tcf_act_fill_param_type(struct sk_buff *skb,
>+static void tcf_p4_dyna_cleanup(struct tc_action *a)
>+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
>+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
>+static void p4_put_many_params(struct idr *params_idr)
>+static int p4_init_param_type(struct p4tc_act_param *param,
>+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
>+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
>+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
>+static void p4tc_action_net_exit(struct tc_action_net *tn)
>+static void p4_act_params_put(struct p4tc_act *act)
>+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
>+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
>+static void p4tc_params_replace_many(struct p4tc_act *act,
>+ struct idr *params_idr)
>+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
>+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
Is there some secret key to how you name the functions? To me, this looks
completely inconsistent :/
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-16 14:59 ` [PATCH net-next v8 15/15] p4tc: Add P4 extern interface Jamal Hadi Salim
@ 2023-11-16 16:42 ` Jiri Pirko
2023-11-17 12:14 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-16 16:42 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
[...]
> include/net/p4tc.h | 161 +++
> include/net/p4tc_ext_api.h | 199 +++
> include/uapi/linux/p4tc.h | 61 +
> include/uapi/linux/p4tc_ext.h | 36 +
> net/sched/p4tc/Makefile | 2 +-
> net/sched/p4tc/p4tc_bpf.c | 79 +-
> net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
> net/sched/p4tc/p4tc_pipeline.c | 34 +-
> net/sched/p4tc/p4tc_runtime_api.c | 10 +-
> net/sched/p4tc/p4tc_table.c | 57 +-
> net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
> net/sched/p4tc/p4tc_tmpl_api.c | 4 +
> net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
> 13 files changed, 5083 insertions(+), 10 deletions(-)
This is for this patch alone. Now for the whole patchset you have:
30 files changed, 16676 insertions(+), 39 deletions(-)
I understand that you want to fit all the work into 15 patches.
But sorry, patches like this are unreviewable. My suggestion is to split
the patchset into multiple series with smaller patches and allow
people to digest this. I don't believe that anyone can seriously
review a patch with more than 200 lines of changes.
[...]
* RE: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
` (14 preceding siblings ...)
2023-11-16 14:59 ` [PATCH net-next v8 15/15] p4tc: Add P4 extern interface Jamal Hadi Salim
@ 2023-11-17 6:27 ` John Fastabend
2023-11-17 12:49 ` Jamal Hadi Salim
15 siblings, 1 reply; 79+ messages in thread
From: John Fastabend @ 2023-11-17 6:27 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: deb.chatterjee, anjali.singhai, Vipin.Jain, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong,
davem, edumazet, kuba, pabeni, vladbu, horms, daniel, bpf,
khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
Jamal Hadi Salim wrote:
> We are seeking community feedback on P4TC patches.
>
[...]
>
> What is P4?
> -----------
I read the cover letter; here is my high-level takeaway.
The p4c-bpf backend exists and I don't see why we wouldn't use that as a starting
point. At least the cover letter needs to explain why this path is not taken.
From the cover letter there appear to be BPF pieces and non-BPF pieces, but
I don't see any reason not to just land it all in BPF. Support exists, and if
it's missing some smaller things, add them and everyone gets them, versus a niche
P4 backend.
Without hardware support for any of this it's impossible to understand how 'tc'
would work as a hardware offload interface for a P4 device, so we need hardware
support to evaluate. For example, I'm not even sure how you would take a BPF
parser into hardware on most network devices that aren't processor based.
P4 has a P4Runtime; I think most folks would prefer a P4 UI over typing in 'tc'
commands, so arguing that the 'tc' UI is nice is not going to be very compelling.
The best we can say is that it works well enough and we use it.
More commentary below.
>
> The Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
>
> The P4 ecosystem includes an extensive range of deployments, products, projects
> and services, etc[9][10][11][12].
>
> __What is P4TC?__
>
> P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> run independently in different namespaces alongside their appropriate state. The
> implementation builds on top of many years of Linux TC experiences.
> On why P4 - see small treatise here:[4].
>
> There have been many discussions and meetings since about 2015 in regards to
> P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> done!
>
> A lot more of the P4TC motivation is captured at:
> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
>
> **In this patch series we focus on s/w datapath only**.
I don't see the value in adding 16676 lines of code for a s/w-only datapath
of something we can already do with the p4c-ebpf backend, or one of the other
backends already there - namely, take P4 programs and run them on CPUs in Linux.
Also I suspect a pipelined datapath is going to be slower than an O(1)-lookup
datapath, so I'm guessing it's slower than most datapaths we have already.
What do we gain here over existing p4c-ebpf?
>
> __P4TC Workflow__
>
> These patches enable kernel and user space code change _independence_ for any
> new P4 program that describes a new datapath. The workflow is as follows:
>
> 1) A developer writes a P4 program, "myprog"
>
> 2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> a) shell script(s) which form template definitions for the different P4
> objects "myprog" utilizes (tables, externs, actions etc).
This is odd to me. I think packing around shell scripts as a program is not
very usable. Why not just an object file?
> b) the parser and the rest of the datapath are generated
> in eBPF and need to be compiled into binaries.
> c) A json introspection file used for the control plane (by iproute2/tc).
Why split up the eBPF and control plane like this? eBPF has a control plane;
why not just use the existing one?
>
> 3) The developer (or operator) executes the shell script(s) to manifest the
> functional "myprog" into the kernel.
>
> 4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
>
> Example1: parser is an action:
> "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> action bpf obj $PARSER.o section parser/tc-ingress \
> action bpf obj $PROGNAME.o section p4prog/tc"
>
> Example2: parser explicitly bound and rest of dpath as an action:
> "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> prog tc obj $PARSER.o section parser/tc-ingress \
> action bpf obj $PROGNAME.o section p4prog/tc"
>
> Example3: parser is at XDP, rest of dpath as an action:
> "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> prog type xdp obj $PARSER.o section parser/xdp-ingress \
> pinned_link /path/to/xdp-prog-link \
> action bpf obj $PROGNAME.o section p4prog/tc"
>
> Example4: parser+prog at XDP:
> "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> prog type xdp obj $PROGNAME.o section p4prog/xdp \
> pinned_link /path/to/xdp-prog-link"
>
> see individual patches for more examples tc vs xdp etc. Also see section on
> "challenges" (on this cover letter).
>
> Once "myprog" P4 program is instantiated one can start updating table entries
> that are associated with myprog's table named "mytable". Example:
>
> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
As a UI, the above is entirely cryptic to most folks, I bet.
Is the myprog table a BPF map? If so then I don't see any need for this; just
interact with it like a BPF map. I suspect it's some other object, but
I don't see any rationale for that.
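For reference, "interact with it like a BPF map" today looks roughly like the
following; the pin path and the key/value encodings are made up for illustration
and are not from this patchset:

  # hypothetical pin path; key = LPM prefixlen 32 + 10.0.1.2,
  # value = {action id, ifindex} encoded as little-endian u32s
  bpftool map update pinned /sys/fs/bpf/tc/globals/mytable \
          key hex 20 00 00 00 0a 00 01 02 \
          value hex 01 00 00 00 03 00 00 00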
>
> A packet arriving on ingress of any of the ports on block 22 will first be
> exercised via the (eBPF) parser to find the headers pointing to the ip
> destination address.
> The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> myprog's mytable which returns the action params which are then used to execute
> the action in the eBPF datapath (eventually sending out packets to eno1).
> On a table miss, mytable's default miss action is executed.
This chunk looks like a standard BPF program: parse the packet, look up an
action, do the action.
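A minimal sketch of that pattern as a plain tc-bpf program - the map name and
the key/value layouts are placeholders of mine, not anything from the patchset:

  /* parse -> lookup -> act, as a self-contained tc-bpf program */
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/ip.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  struct fwd_val {
          __u32 action;           /* 0 = drop, 1 = redirect */
          __u32 ifindex;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __be32);    /* IPv4 destination address */
          __type(value, struct fwd_val);
  } mytable SEC(".maps");

  SEC("tc")
  int p4_like_datapath(struct __sk_buff *skb)
  {
          void *data = (void *)(long)skb->data;
          void *data_end = (void *)(long)skb->data_end;
          struct ethhdr *eth = data;
          struct iphdr *iph;
          struct fwd_val *val;

          /* 1. parse */
          if (data + sizeof(*eth) + sizeof(*iph) > data_end)
                  return TC_ACT_OK;
          if (eth->h_proto != bpf_htons(ETH_P_IP))
                  return TC_ACT_OK;
          iph = data + sizeof(*eth);

          /* 2. lookup */
          val = bpf_map_lookup_elem(&mytable, &iph->daddr);
          if (!val)
                  return TC_ACT_OK;       /* miss: default action */

          /* 3. act */
          if (val->action == 1)
                  return bpf_redirect(val->ifindex, 0);
          return TC_ACT_SHOT;
  }

  char _license[] SEC("license") = "GPL";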
>
> __Description of Patches__
>
> P4TC is designed to have no impact on the core code for other users
> of TC. IOW, you can compile it out, but even if it is compiled in and you don't use
> it there should be no impact on your performance.
>
> We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> actions that can be created on "the fly" based on the P4 program requirement.
The common pattern in BPF for this is to use a tail call map and populate
it at runtime, and/or just compile your program with the actions. Here
the actions came from the P4 back up at step 1, so there is no reason we can't
just compile them with p4c.
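A sketch of that tail-call pattern, with placeholder names (this is not claiming
to be what p4c would actually generate):

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  /* slot index the compiler and loader would agree on */
  #define ACT_IPV4_FORWARD 0

  /* prog array populated at load/run time with the action programs */
  struct {
          __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
          __uint(max_entries, 16);
          __type(key, __u32);
          __type(value, __u32);
  } actions SEC(".maps");

  SEC("tc")
  int act_ipv4_forward(struct __sk_buff *skb)
  {
          /* compiler-generated action body would live here */
          return TC_ACT_OK;
  }

  SEC("tc")
  int p4_pipeline(struct __sk_buff *skb)
  {
          __u32 act = ACT_IPV4_FORWARD;   /* normally the table-lookup result */

          bpf_tail_call(skb, &actions, act);
          return TC_ACT_OK;               /* reached only if the slot is empty */
  }

  char _license[] SEC("license") = "GPL";

The loader or control plane then fills the "actions" prog array with the action
program fds, so actions can be added or swapped without touching the dispatcher.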
> This patch makes a small incision into act_api which shouldn't affect the
> performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> minimalist enablers for P4TC and have no effect the classical tc action.
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
>
> The core P4TC code implements several P4 objects.
[...]
>
> __Restating Our Requirements__
>
> The initial release made in January/2023 had a "scriptable" datapath (think u32
> classifier and pedit action). In this section we review the scriptable version
> against the current implementation we are pushing upstream which uses eBPF.
>
> Our intention is to target the TC crowd.
> Essentially developers and ops people deploying TC based infra.
> More importantly the original intent for P4TC was to enable _ops folks_ more than
> devs (given code is being generated and doesn't need humans to write it).
I don't follow; humans wrote the P4.
I think the intent should be to enable P4 to run on Linux, ideally efficiently.
If the _ops folks_ are writing P4, great; as long as we give them an efficient
way to run their P4, I don't think they care about what executes it.
>
> With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> meaning from the control plane all the way to the tooling infra, i.e
> iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> etc.
> The main advantage is that we have a singular vendor-neutral interface via the
> kernel using well understood mechanisms based on deployment experience (and
> at least this part doesnt need retraining).
A seamless P4 experience would be great. That looks like a tooling problem
at the p4c backend and frontend. Rather than a bunch of 'tc' glue
I would aim for:
$ p4c-* myprog.p4
$ p4cRun ./myprog
And maybe some options like,
$ p4cRun -i eth0 ./myprog
Then use P4Runtime to interface with the system. If you don't like the
runtime, then that should be brought up in that working group.
>
> 1) Supporting expressibility of the universe set of P4 progs
>
> It is a must to support 100% of all possible P4 programs. In the past the eBPF
> verifier had to be worked around and even then there are cases where we couldnt
> avoid path explosion when branching is involved. Kfunc-ing solves these issues
> for us. Note, there are still challenges running all potential P4 programs at
> the XDP level - the solution to that is to have the compiler generate XDP based
> code only if it possible to map it to that layer.
Give examples and we can fix it.
>
> 2) Support for P4 HW and SW equivalence.
>
> This feature continues to work even in the presence of eBPF as the s/w
> datapath. There are cases of square-hole-round-peg scenarios but
> those are implementation issues we can live with.
But no hw support.
>
> 3) Operational usability
>
> By maintaining the TC control plane (even in presence of eBPF datapath)
> runtime aspects remain unchanged. So for our target audience of folks
> who have deployed tc including offloads - the comfort zone is unchanged.
> There is also the comfort zone of continuing to use the true-and-tried netlink
> interfacing.
The P4 control plane should be P4Runtime.
>
> There is some loss in operational usability because we now have more knobs:
> the extra compilation, loading and syncing of ebpf binaries, etc.
> IOW, I can no longer just ship someone a shell script in an email to
> say go run this and "myprog" will just work.
>
> 4) Operational and development Debuggability
>
> If something goes wrong, the tc craftsperson is now required to have additional
> knowledge of eBPF code and process. This applies to both the operational person
> as well as someone who wrote a driver. We dont believe this is solvable.
>
> 5) Opportunity for rapid prototyping of new ideas
[...]
> 6) Supporting per namespace program
>
> This requirement is still met (by virtue of keeping P4 control objects within the
> TC domain).
BPF can also be network-namespaced; I'm not sure I understand the comment.
>
> __Challenges__
>
> 1) Concept of tc block in XDP is _very tedious_ to implement. It would be nice
> if we can use concept there as well, since we expect P4 to work with many
> ports. It will likely require some core patches to fix this.
>
> 2) Right now we are using "packed" construct to enforce alignment in kfunc data
> exchange; but we're wondering if there is potential to use BTF to understand
> parameters and their offsets and encode this information at the compiler
> level.
>
> 3) At the moment we are creating a static buffer of 128B to retrieve the action
> parameters. If you have a lot of table entries and individual(non-shared)
> action instances with actions that require very little (or no) param space
> a lot of memory is wasted. There may also be cases where 128B may not be
> enough; (likely this is something we can teach the P4C compiler). If we can
> have dynamic pointers instead for kfunc fixed length parameterization then
> this issue is resolvable.
>
> 4) See "Restating Our Requirements" #5.
> We would really appreciate ideas/suggestions, etc.
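To make challenges 2) and 3) concrete, the shape being described is roughly the
following; the struct, macro and helper below are illustrative only, not the
actual P4TC definitions:

  /* A packed, fixed-size blob handed back by a table-lookup kfunc.
   * "packed" pins the field offsets so the compiler-generated BPF code
   * and the kernel agree on the layout without relying on BTF;
   * params[] is the static 128B buffer mentioned in challenge 3).
   */
  #include <linux/types.h>

  #define EXAMPLE_MAX_PARAM_BYTES 128

  struct example_act_bpf {
          __u32 act_id;
          __u8  params[EXAMPLE_MAX_PARAM_BYTES];
  } __attribute__((packed));

  /* Generated code then reads each parameter at a fixed offset, e.g. a
   * port ifindex stored right after a 6-byte mac address parameter:
   */
  static inline __u32 example_param_port(const struct example_act_bpf *a)
  {
          __u32 port;

          __builtin_memcpy(&port, &a->params[6], sizeof(port));
          return port;
  }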
>
> __References__
Thanks,
John
* RE: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-16 14:59 ` [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump Jamal Hadi Salim
2023-11-16 16:28 ` Jiri Pirko
@ 2023-11-17 6:51 ` John Fastabend
1 sibling, 0 replies; 79+ messages in thread
From: John Fastabend @ 2023-11-17 6:51 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Jamal Hadi Salim wrote:
> This commit allows users to create, update, delete, get, flush and dump
> dynamic action kinds based on P4 action definition.
>
> At the moment dynamic actions are tied to P4 programs only and cannot be
> used outside of a P4 program definition.
>
> Visualize the following action in a P4 program:
>
> action ipv4_forward(@tc_type("macaddr") bit<48> dstAddr, @tc_type("dev") bit<8> port)
> {
> // Action code (generated by the compiler)
So this is BPF or what?
> }
>
> The above is an action called ipv4_forward which receives as parameters
> a bit<48> dstAddr (a mac address) and a bit<8> port (something close to
> ifindex).
>
> which is invoked on a P4 table match as such:
>
> table mytable {
> key = {
> hdr.ipv4.dstAddr @tc_type("ipv4"): lpm;
> }
>
> actions = {
> ipv4_forward;
> drop;
> NoAction;
> }
>
> size = 1024;
> }
>
> We don't have an equivalent built in "ipv4_forward" action in TC. So we
> create this action dynamically.
>
> The mechanics of dynamic actions follow the CRUD semantics.
>
> ___DYNAMIC ACTION KIND CREATION___
>
> In this stage we issue the creation command for the dynamic action which
> specifies the action name, its ID, parameters and the parameter types.
> So for the ipv4_forward action, the creation would look something like
> this:
>
> tc p4template create action/aP4proggie/ipv4_forward \
> param dstAddr type macaddr id 1 param port type dev id 2
>
> Note1: Although the P4 program defined dstAddr as type bit48 we use our
> type called macaddr (likewise for port) - see commit on p4 types for
> details.
>
> Note2: All the template commands (tc p4template) are generated by the
> p4c compiler.
>
> Note that in the template creation op we usually just specify the action
> name, the parameters and their respective types. Also see that we specify
> a pipeline name during the template creation command. As an example, the
> above command creates an action template that is bounded to
> pipeline or program named aP4proggie.
>
> Note: in P4, actions are assumed to pre-exist and have an upper bound on the
> number of instances. Typically if you have 1M table entries you want to allocate
> enough action instances to cover the 1M entries. However, this is a big waste
> of memory if the action instances are not in use. So for our case, we allow
> the user to specify a minimal number of action instances in the template and then,
> if more dynamic action instances are needed, they will be added on
> demand as in the current approach to the tc filter-action relationship.
> For example, if one were to create the action ipv4_forward preallocating
> 128 instances, one would issue the following command:
>
> tc p4template create action/aP4proggie/ipv4_forward num_prealloc 128 \
> param dstAddr type macaddr id 1 param port type dev id 2
>
> By default, 16 action instances will be preallocated.
> If the user wishes to have more action instances, they will have to be
> created individually by the control plane using the tc actions command.
> For example:
>
> tc actions add action aP4proggie/ipv4_forward \
> param dstAddr AA:BB:CC:DD:EE:DD param port eth1
>
> Only then they can issue a table entry creation command using this newly
> created action instance.
>
> Note, this does not disqualify a user from binding to an existing action
> instances. For example:
>
> tc p4ctrl create aP4proggie/table/mycontrol/mytable \
> srcAddr 10.10.10.0/24 action ipv4_forward index 1
>
> ___ACTION KIND ACTIVATION___
>
> Once we provided all the necessary information for the new dynamic action,
> we can go to the final stage, which is action activation. In this stage,
> we activate the dynamic action and make it available for instantiation.
> To activate the action template, we issue the following command:
>
> tc p4template update action aP4proggie/ipv4_forward state active
>
> After the above the command, the action is ready to be instantiated.
>
> ___RUNTIME___
>
> This next section deals with the runtime part of action templates, which
> handle action template instantiation and binding.
>
> To instantiate a new action that was created from a template, we use the
> following command:
>
> tc actions add action aP4proggie/ipv4_forward \
> param dstAddr AA:BB:CC:DD:EE:FF param port eth0 index 1
>
> Observe these are the same semantics as what tc already provides today,
> with the caveat that we have a keyword "param" preceding the appropriate
> parameters - as such, specifying the index is optional (the kernel provides
> one when unspecified).
>
> As previously stated, we refer to the action by its "full name"
> (pipeline_name/action_name). Here we are creating an instance of the
> ipv4_forward action specifying as parameter values AA:BB:CC:DD:EE:FF for
> dstAddr and eth0 for port. We can create as many instances for action
> templates as we wish.
>
> To bind the above instantiated action to a table entry, you can use the
> same classical approach used to bind ordinary actions to filters, for
> example:
>
> tc p4ctrl create aP4proggie/table/mycontrol/mytable \
> srcAddr 10.10.10.0/24 action ipv4_forward index 1
>
> The above command will bind our newly instantiated action to a table
> entry which is executed if there's a match.
>
> Of course one could have created the table entry as:
>
> tc p4ctrl create aP4proggie/table/mycontrol/mytable \
> srcAddr 10.10.10.0/24 \
> action ipv4_forward param dstAddr AA:BB:CC:DD:EE:FF param port eth0
>
> Actions from other P4 control blocks (in the same pipeline) might be
> referenced as the action index is global within a pipeline.
>
Where does what the action actually does get defined? It looks like
just a label at this point, but without code.
* RE: [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
@ 2023-11-17 7:09 ` John Fastabend
2023-11-19 9:14 ` kernel test robot
2023-11-20 22:28 ` kernel test robot
2 siblings, 0 replies; 79+ messages in thread
From: John Fastabend @ 2023-11-17 7:09 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Jamal Hadi Salim wrote:
> We add an initial set of kfuncs to allow interactions from eBPF programs
> to the P4TC domain.
>
If you just use BPF maps then you get all the BPF map types for
free and you get the bpf_map_* ops from the BPF side.
I don't see any use for this duplication.
> - bpf_p4tc_tbl_read: Used to lookup a table entry from a BPF
> program installed in TC. To find the table entry we take in an skb, the
> pipeline ID, the table ID, a key and a key size.
> We use the skb to get the network namespace structure where all the
> pipelines are stored. After that we use the pipeline ID and the table
> ID, to find the table. We then use the key to search for the entry.
> We return an entry on success and NULL on failure.
>
> - xdp_p4tc_tbl_read: Used to lookup a table entry from a BPF
> program installed in XDP. To find the table entry we take in an xdp_md,
> the pipeline ID, the table ID, a key and a key size.
> We use struct xdp_md to get the network namespace structure where all
> the pipelines are stored. After that we use the pipeline ID and the table
> ID, to find the table. We then use the key to search for the entry.
> We return an entry on success and NULL on failure.
>
> - bpf_p4tc_entry_create: Used to create a table entry from a BPF
> program installed in TC. To create the table entry we take an skb, the
> pipeline ID, the table ID, a key and its size, and an action which will
> be associated with the new entry.
> We return 0 on success and a negative errno on failure
>
> - xdp_p4tc_entry_create: Used to create a table entry from a BPF
> program installed in XDP. To create the table entry we take an xdp_md, the
> pipeline ID, the table ID, a key and its size, and an action which will
> be associated with the new entry.
> We return 0 on success and a negative errno on failure
>
> - bpf_p4tc_entry_create_on_miss: conforms to PNA "add on miss".
> First does a lookup using the passed key and upon a miss will add the entry
> to the table.
> We return 0 on success and a negative errno on failure
>
> - xdp_p4tc_entry_create_on_miss: conforms to PNA "add on miss".
> First does a lookup using the passed key and upon a miss will add the entry
> to the table.
> We return 0 on success and a negative errno on failure
>
> - bpf_p4tc_entry_update: Used to update a table entry from a BPF
> program installed in TC. To update the table entry we take an skb, the
> pipeline ID, the table ID, a key and its size, and an action which will
> be associated with the new entry.
> We return 0 on success and a negative errno on failure
>
> - xdp_p4tc_entry_update: Used to update a table entry from a BPF
> program installed in XDP. To update the table entry we take an xdp_md, the
> pipeline ID, the table ID, a key and its size, and an action which will
> be associated with the new entry.
> We return 0 on success and a negative errno on failure
>
> - bpf_p4tc_entry_delete: Used to delete a table entry from a BPF
> program installed in TC. To delete the table entry we take an skb, the
> pipeline ID, the table ID, a key and a key size.
> We return 0 on success and a negative errno on failure
>
> - xdp_p4tc_entry_delete: Used to delete a table entry from a BPF
> program installed in XDP. To delete the table entry we take an xdp_md, the
> pipeline ID, the table ID, a key and a key size.
> We return 0 on success and a negative errno on failure
>
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
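From the descriptions above, a caller would look roughly like the sketch below.
The prototype and struct layouts here are reconstructed from the prose, not
copied from the patch, so treat every name and signature as an assumption:

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  /* guessed shapes, for illustration only */
  struct p4tc_table_entry_act_bpf;

  extern struct p4tc_table_entry_act_bpf *
  bpf_p4tc_tbl_read(struct __sk_buff *skb, __u32 pipeline_id, __u32 tbl_id,
                    void *key, __u32 key_sz) __ksym;

  struct mytable_key {
          __be32 dstAddr;
  };

  SEC("tc")
  int lookup_example(struct __sk_buff *skb)
  {
          struct mytable_key key = {
                  .dstAddr = bpf_htonl(0x0a000102),   /* 10.0.1.2 */
          };
          struct p4tc_table_entry_act_bpf *act;

          act = bpf_p4tc_tbl_read(skb, /* pipeline id */ 1, /* table id */ 1,
                                  &key, sizeof(key));
          if (!act)
                  return TC_ACT_OK;       /* miss: default action */

          /* ... execute the action described by the returned entry ... */
          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";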
* RE: [PATCH net-next v8 14/15] p4tc: add P4 classifier
2023-11-16 14:59 ` [PATCH net-next v8 14/15] p4tc: add P4 classifier Jamal Hadi Salim
@ 2023-11-17 7:17 ` John Fastabend
0 siblings, 0 replies; 79+ messages in thread
From: John Fastabend @ 2023-11-17 7:17 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, jiri, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Jamal Hadi Salim wrote:
> Introduce the P4 tc classifier. A tc filter instantiated on this classifier
> is used to bind a P4 pipeline to one or more netdev ports. To use the P4
> classifier you must specify a pipeline name that will be associated with
> this filter, a s/w parser and a datapath eBPF program. The pipeline must have
> already been created via a template.
> For example, if we were to add a filter to ingress of network interface
> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the
> following command:
>
> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> action bpf obj $PARSER.o section prog/tc-parser \
> action bpf obj $PROGNAME.o section prog/tc-ingress
>
> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated
> by the P4 compiler and will be the representation of the P4 program.
> Note that filter understands that $PARSER.o is a parser to be loaded
> at the tc level. The datapath program is merely an eBPF action.
>
> Note we do support a distinct way of loading the parser, as opposed to
> making it an action; the above example would then be:
>
> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> prog type tc obj $PARSER.o ... \
> action bpf obj $PROGNAME.o section prog/tc-ingress
>
> We support two types of loadings of these initial programs in the pipeline
> and differentiate between what gets loaded at tc vs xdp by using syntax of
>
> either "prog type tc obj" or "prog type xdp obj"
>
> For XDP:
>
> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
> prog type xdp obj $PARSER.o section parser/xdp \
> pinned_link /sys/fs/bpf/mylink \
> action bpf obj $PROGNAME.o section prog/tc-ingress
>
> The theory of operations is as follows:
You've lost me here. You have a BPF parser program and some BPF actions.
Why not have the parser program just call the BPF action?
It seems the action is based on LPM or something like a software TCAM. So the
steps in BPF would be:
Parse -> bpf_map_lookup to find the action -> call the action
Just use the normal way to load the BPF program through 'tc' or 'xdp'.
Or it's all about hardware, but I still have no idea how this BPF parser
gets into hardware and how these actions get there. So we can't even
evaluate what that looks like.
>
> ================================1. PARSING================================
>
> The packet first encounters the parser.
> The parser is implemented in ebpf residing either at the TC or XDP
> level. The parsed header values are stored in a shared eBPF map.
> When the parser runs at XDP level, we load it into XDP using tc filter
> command and pin it to a file.
>
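For readers following along, "stored in a shared eBPF map" means something along
these lines; the struct and map below are placeholders, not the compiler's actual
output:

  #include <linux/bpf.h>
  #include <linux/types.h>
  #include <bpf/bpf_helpers.h>

  /* header fields the parser extracts and the datapath program consumes */
  struct parsed_hdrs {
          __u8   ipv4_valid;
          __u8   protocol;
          __be32 saddr;
          __be32 daddr;
  };

  /* one per-cpu scratch slot shared between the parser program (XDP or
   * TC) and the datapath program loaded behind it
   */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct parsed_hdrs);
  } hdrs SEC(".maps");

The parser fills the per-cpu slot via bpf_map_lookup_elem(), and the datapath
program reads the same map to pick up where parsing left off.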
> =============================2. ACTIONS=============================
>
> In the above example, the P4 program (minus the parser) is encoded in an
> action($PROGNAME.o). It should be noted that classical tc actions
> continue to work:
> IOW, someone could decide to add a mirred action to mirror all packets
> after or before the ebpf action.
>
> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> prog type tc obj $PARSER.o section parser/tc-ingress \
> action bpf obj $PROGNAME.o section prog/tc-ingress \
> action mirred egress mirror index 1 dev $P1 \
> action bpf obj $ANOTHERPROG.o section mysect/section-1
>
> It should also be noted that it is feasible to split some of the ingress
> datapath into XDP first and more into TC later (as was shown above for
> example where the parser runs at XDP level). YMMV.
>
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
* Re: [PATCH net-next v8 08/15] p4tc: add P4 data types
2023-11-16 16:03 ` Jiri Pirko
@ 2023-11-17 12:01 ` Jamal Hadi Salim
0 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 12:01 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Thu, Nov 16, 2023 at 11:03 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 16, 2023 at 03:59:41PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
> >diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
> >new file mode 100644
> >index 000000000..8f6f002ae
>
> [...]
>
> >+#define P4T_MAX_BITSZ 128
>
> [...]
>
> >+#define P4T_MAX_STR_SZ 32
>
> [...]
>
>
> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >new file mode 100644
> >index 000000000..ba32dba66
> >--- /dev/null
> >+++ b/include/uapi/linux/p4tc.h
> >@@ -0,0 +1,33 @@
> >+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> >+#ifndef __LINUX_P4TC_H
> >+#define __LINUX_P4TC_H
> >+
> >+#define P4TC_MAX_KEYSZ 512
> >+
> >+enum {
> >+ P4T_UNSPEC,
>
> I wonder, what is the reason for the "P4T"/"P4TC" prefix inconsistency?
> In the kernel header, that could be fixed, but in the uapi header this is
> forever. Is this just to be aligned with other TC uapi
> inconsistencies? :D
>
P4T is for "types" - but ok, we can change it to P4TC_XXX for consistency.
cheers,
jamal
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-16 16:11 ` Jiri Pirko
@ 2023-11-17 12:09 ` Jamal Hadi Salim
2023-11-20 8:18 ` Jiri Pirko
2023-11-20 18:20 ` David Ahern
0 siblings, 2 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 12:09 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
>
> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >index ba32dba66..4d33f44c1 100644
> >--- a/include/uapi/linux/p4tc.h
> >+++ b/include/uapi/linux/p4tc.h
> >@@ -2,8 +2,71 @@
> > #ifndef __LINUX_P4TC_H
> > #define __LINUX_P4TC_H
> >
> >+#include <linux/types.h>
> >+#include <linux/pkt_sched.h>
> >+
> >+/* pipeline header */
> >+struct p4tcmsg {
> >+ __u32 pipeid;
> >+ __u32 obj;
> >+};
>
> I don't follow. Is there any sane reason to use a header instead of a normal
> netlink attribute? Moreover, you extend the existing RT netlink with
> a huge amount of p4 things. Isn't this a good time to finally introduce a
> generic netlink TC family with a proper yaml spec, with all the benefits it
> brings, and implement the p4 tc uapi there? Please?
>
Several reasons:
a) We are similar to current tc messaging, with the subheader being
there for multiplexing.
b) Where does this leave iproute2? +Cc David and Stephen. Do other
generic netlink conversions get contributed back to iproute2?
c) Note: our API is CRUD-ish instead of RPC-based (as in generic netlink),
i.e. you have:
COMMAND <PATH/TO/OBJECT> [optional data], so we can support arbitrary
P4 programs from the control plane.
d) We have spent many hours optimizing the control path to the kernel, so I
am not sure what switching to generic netlink would buy us.
cheers,
jamal
>
> >+
> >+#define P4TC_MAXPIPELINE_COUNT 32
> >+#define P4TC_MAXTABLES_COUNT 32
> >+#define P4TC_MINTABLES_COUNT 0
> >+#define P4TC_MSGBATCH_SIZE 16
> >+
> > #define P4TC_MAX_KEYSZ 512
> >
> >+#define TEMPLATENAMSZ 32
> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
>
> ugh. A prefix please?
>
> pw-bot: cr
>
> [...]
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-16 16:42 ` Jiri Pirko
@ 2023-11-17 12:14 ` Jamal Hadi Salim
2023-11-20 8:22 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 12:14 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Thu, Nov 16, 2023 at 11:42 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
> > include/net/p4tc.h | 161 +++
> > include/net/p4tc_ext_api.h | 199 +++
> > include/uapi/linux/p4tc.h | 61 +
> > include/uapi/linux/p4tc_ext.h | 36 +
> > net/sched/p4tc/Makefile | 2 +-
> > net/sched/p4tc/p4tc_bpf.c | 79 +-
> > net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
> > net/sched/p4tc/p4tc_pipeline.c | 34 +-
> > net/sched/p4tc/p4tc_runtime_api.c | 10 +-
> > net/sched/p4tc/p4tc_table.c | 57 +-
> > net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
> > net/sched/p4tc/p4tc_tmpl_api.c | 4 +
> > net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
> > 13 files changed, 5083 insertions(+), 10 deletions(-)
>
> This is for this patch. Now for the whole patchset you have:
> 30 files changed, 16676 insertions(+), 39 deletions(-)
>
> I understand that you want to fit into 15 patches with all the work.
> But sorry, patches like this are unreviewable. My suggestion is to split
> the patchset into multiple ones including smaller patches and allow
> people to digest this. I don't believe that anyone can seriously stand
> to review a patch with more than 200 lines changes.
This specific patch is not difficult to split into two. I can do that
and send it out minus the first 8 trivial patches - but I am not familiar with
how to do "here's part 1 of the patches" and "here's patchset two".
There's a dependency between them, so it is not clear how patchwork and
reviewers would deal with it. Thoughts?
Note: the code machinery is highly repetitive; for example, if you look
at the table control code you will see very similar patterns to the actions,
etc., i.e. spending time reviewing one will make the rest easy.
cheers,
jamal
> [...]
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-17 6:27 ` [PATCH net-next v8 00/15] Introducing P4TC John Fastabend
@ 2023-11-17 12:49 ` Jamal Hadi Salim
2023-11-17 18:37 ` John Fastabend
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 12:49 UTC (permalink / raw)
To: John Fastabend
Cc: netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
jiri, xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu,
horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > We are seeking community feedback on P4TC patches.
> >
>
> [...]
>
> >
> > What is P4?
> > -----------
>
> I read the cover letter here is my high level takeaway.
>
At least you read the cover letter this time ;->
> P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> point.
Are you familiar with P4 architectures? That code was for PSA (which
is essentially for switches); we are doing PNA (which is more NIC
oriented).
And yes, we used that code as a starting point and made the necessary
changes to conform to PNA. We made it actually work better by
using kfuncs.
> At least the cover letter needs to explain why this path is not taken.
I thought we had a reference to that backend - but will add it for the
next update.
> From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> I don't see any reason not to just land it all in BPF. Support exists and if
> its missing some smaller things add them and everyone gets them vs niche P4
> backend.
Ok, I thought you said you read the cover letter. The reasons are well
stated, primarily that we need to make sure all P4 programs work.
>
> Without hardware support for any of this its impossible to understand how 'tc'
> would work as a hardware offload interface for a p4 device so we need hardware
> support to evaluate. For example I'm not even sure how you would take a BPF
> parser into hardware on most network devices that aren't processor based.
>
P4 has nothing to do with parsers in hardware. Where did you get this
requirement from?
> P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> commands so arguing for 'tc' UI is nice is not going to be very compelling.
> Best we can say is it works well enough and we use it.
The control plane interface is netlink. This part is not negotiable.
You can write whatever you want on top of it (for example, P4Runtime
using netlink as its southbound interface). We feel that tc - a well
understood utility - is one we should make publicly available for the
rest of the world to use. For example, we have Rust code that runs on
top of netlink to do performance testing.
> more commentary below.
>
> >
> > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > domain-specific programming language for specifying data plane behavior.
> >
> > The P4 ecosystem includes an extensive range of deployments, products, projects
> > and services, etc[9][10][11][12].
> >
> > __What is P4TC?__
> >
> > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > run independently in different namespaces alongside their appropriate state. The
> > implementation builds on top of many years of Linux TC experiences.
> > On why P4 - see small treatise here:[4].
> >
> > There have been many discussions and meetings since about 2015 in regards to
> > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > done!
> >
> > A lot more of the P4TC motivation is captured at:
> > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> >
> > **In this patch series we focus on s/w datapath only**.
>
> I don't see the value in adding 16676 lines of code for s/w only datapath
> of something we already can do with p4c-ebpf backend.
Please, please stop this entitlement politics (which I frankly think
you guys have been getting away with for a few years now).
This code does not touch any core code - you guys constantly push code
that touches core code, and it is not unusual that we have to pick up the
pieces afterwards - but now you are going to call me out for the number of
lines of code? Is it ok for you to write lines of code in the kernel
but not me? Judge the technical work, then we can have a meaningful
discussion.
TBH, I am trying very hard to see if I should respond to any more
comments from you. I was very happy with our original scriptable
approach and you came out and banged on the table that you want eBPF.
We spent 10 months of multiple people working on this code to make it
eBPF-friendly and now you want more (actually I am not sure what the
hell you want).
> Or one of the other
> backends already there. Namely take P4 programs and run them on CPUs in Linux.
>
> Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> datapath so I'm guessing its slower than most datapaths we have already.
>
> What do we gain here over existing p4c-ebpf?
>
see above.
> >
> > __P4TC Workflow__
> >
> > These patches enable kernel and user space code change _independence_ for any
> > new P4 program that describes a new datapath. The workflow is as follows:
> >
> > 1) A developer writes a P4 program, "myprog"
> >
> > 2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > a) shell script(s) which form template definitions for the different P4
> > objects "myprog" utilizes (tables, externs, actions etc).
>
> This is odd to me. I think packing around shell scrips as a program is not
> very usable. Why not just an object file.
>
> > b) the parser and the rest of the datapath are generated
> > in eBPF and need to be compiled into binaries.
> > c) A json introspection file used for the control plane (by iproute2/tc).
>
> Why split up the eBPF and control plane like this? eBPF has a control plane
> just use the existing one?
>
The cover letter clearly states that we are using netlink as the
control API. Does eBPF support netlink?
> >
> > 3) The developer (or operator) executes the shell script(s) to manifest the
> > functional "myprog" into the kernel.
> >
> > 4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> > to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> >
> > Example1: parser is an action:
> > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > action bpf obj $PARSER.o section parser/tc-ingress \
> > action bpf obj $PROGNAME.o section p4prog/tc"
> >
> > Example2: parser explicitly bound and rest of dpath as an action:
> > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > prog tc obj $PARSER.o section parser/tc-ingress \
> > action bpf obj $PROGNAME.o section p4prog/tc"
> >
> > Example3: parser is at XDP, rest of dpath as an action:
> > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > prog type xdp obj $PARSER.o section parser/xdp-ingress \
> > pinned_link /path/to/xdp-prog-link \
> > action bpf obj $PROGNAME.o section p4prog/tc"
> >
> > Example4: parser+prog at XDP:
> > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > prog type xdp obj $PROGNAME.o section p4prog/xdp \
> > pinned_link /path/to/xdp-prog-link"
> >
> > see individual patches for more examples tc vs xdp etc. Also see section on
> > "challenges" (on this cover letter).
> >
> > Once "myprog" P4 program is instantiated one can start updating table entries
> > that are associated with myprog's table named "mytable". Example:
> >
> > tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> > action send_to_port param port eno1
>
> As a UI above is entirely cryptic to most folks I bet.
>
But ebpf is not?
> myprog table is a BPF map? If so then I don't see any need for this just
> interact with it like a BPF map. I suspect its some other object, but
> I don't see any ratoinal for that.
All the P4 objects sit in the TC domain. The datapath program is ebpf.
Control is via netlink.
> >
> > A packet arriving on ingress of any of the ports on block 22 will first be
> > exercised via the (eBPF) parser to find the headers pointing to the ip
> > destination address.
> > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > myprog's mytable which returns the action params which are then used to execute
> > the action in the eBPF datapath (eventually sending out packets to eno1).
> > On a table miss, mytable's default miss action is executed.
>
> This chunk looks like standard BPF program. Parse pkt, lookup an action,
> do the action.
>
Yes, the eBPF datapath does the parsing, and then interacts with
the tc world via kfuncs before it (the eBPF datapath) executes the
action.
Note: eBPF did not invent any of that (parse, lookup, action). It
existed in tc for 20 years before eBPF did.
> > __Description of Patches__
> >
> > P4TC is designed to have no impact on the core code for other users
> > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > it there should be no impact on your performance.
> >
> > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > actions that can be created on "the fly" based on the P4 program requirement.
>
> the common pattern in bpf for this is to use a tail call map and populate
> it at runtime and/or just compile your program with the actions. Here
> the actions came from the p4 back up at step 1 so no reason we can't
> just compile them with p4c.
>
> > This patch makes a small incision into act_api which shouldn't affect the
> > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > minimalist enablers for P4TC and have no effect the classical tc action.
> > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> >
> > The core P4TC code implements several P4 objects.
>
> [...]
>
> >
> > __Restating Our Requirements__
> >
> > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > classifier and pedit action). In this section we review the scriptable version
> > against the current implementation we are pushing upstream which uses eBPF.
> >
> > Our intention is to target the TC crowd.
> > Essentially developers and ops people deploying TC based infra.
> > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > devs (given code is being generated and doesn't need humans to write it).
>
> I don't follow. humans wrote the p4.
>
But not the eBPF code; that is compiler-generated. P4 is a higher-level
domain-specific language and eBPF is just one backend (other
s/w variants include DPDK, Rust, C, etc.).
> I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> If the _ops folks are writing P4 great as long as we give them an efficient
> way to run their p4 I don't think they care about what executes it.
>
> >
> > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > meaning from the control plane all the way to the tooling infra, i.e
> > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > etc.
> > The main advantage is that we have a singular vendor-neutral interface via the
> > kernel using well understood mechanisms based on deployment experience (and
> > at least this part doesnt need retraining).
>
> A seemless p4 experience would be great. That looks like a tooling problem
> at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> I would aim for,
>
> $ p4c-* myprog.p4
> $ p4cRun ./myprog
>
> And maybe some options like,
>
> $ p4cRun -i eth0 ./myprog
Armchair lawyering and classical ML bikeshedding.
> Then use the p4runtime to interface with the system. If you don't like the
> runtime then it should be brought up in that working group.
>
> >
> > 1) Supporting expressibility of the universe set of P4 progs
> >
> > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > verifier had to be worked around and even then there are cases where we couldnt
> > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > for us. Note, there are still challenges running all potential P4 programs at
> > the XDP level - the solution to that is to have the compiler generate XDP based
> > code only if it possible to map it to that layer.
>
> Examples and we can fix it.
Right. Let me wait for you to fix something 5 years from now. I would
never have used eBPF at all, but kfuncs are what changed my mind.
> >
> > 2) Support for P4 HW and SW equivalence.
> >
> > This feature continues to work even in the presence of eBPF as the s/w
> > datapath. There are cases of square-hole-round-peg scenarios but
> > those are implementation issues we can live with.
>
> But no hw support.
>
This patchset has nothing to do with offload (did you read the cover
letter?). All the above is saying is that, by virtue of using TC, we have a
path to a proven offload approach.
> >
> > 3) Operational usability
> >
> > By maintaining the TC control plane (even in presence of eBPF datapath)
> > runtime aspects remain unchanged. So for our target audience of folks
> > who have deployed tc including offloads - the comfort zone is unchanged.
> > There is also the comfort zone of continuing to use the true-and-tried netlink
> > interfacing.
>
> The P4 control plane should be P4Runtime.
>
And be my guest and write it on top of netlink.
cheers,
jamal
> >
> > There is some loss in operational usability because we now have more knobs:
> > the extra compilation, loading and syncing of ebpf binaries, etc.
> > IOW, I can no longer just ship someone a shell script in an email to
> > say go run this and "myprog" will just work.
> >
> > 4) Operational and development Debuggability
> >
> > If something goes wrong, the tc craftsperson is now required to have additional
> > knowledge of eBPF code and process. This applies to both the operational person
> > as well as someone who wrote a driver. We dont believe this is solvable.
> >
> > 5) Opportunity for rapid prototyping of new ideas
>
> [...]
>
> > 6) Supporting per namespace program
> >
> > This requirement is still met (by virtue of keeping P4 control objects within the
> > TC domain).
>
> BPF can also be network namespaced I'm not sure I understand comment.
>
> >
> > __Challenges__
> >
> > 1) Concept of tc block in XDP is _very tedious_ to implement. It would be nice
> > if we can use concept there as well, since we expect P4 to work with many
> > ports. It will likely require some core patches to fix this.
> >
> > 2) Right now we are using "packed" construct to enforce alignment in kfunc data
> > exchange; but we're wondering if there is potential to use BTF to understand
> > parameters and their offsets and encode this information at the compiler
> > level.
> >
> > 3) At the moment we are creating a static buffer of 128B to retrieve the action
> > parameters. If you have a lot of table entries and individual(non-shared)
> > action instances with actions that require very little (or no) param space
> > a lot of memory is wasted. There may also be cases where 128B may not be
> > enough; (likely this is something we can teach the P4C compiler). If we can
> > have dynamic pointers instead for kfunc fixed length parameterization then
> > this issue is resolvable.
> >
> > 4) See "Restating Our Requirements" #5.
> > We would really appreciate ideas/suggestions, etc.
> >
> > __References__
>
> Thanks,
> John
* Re: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-16 16:28 ` Jiri Pirko
@ 2023-11-17 15:11 ` Jamal Hadi Salim
2023-11-20 8:19 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 15:11 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Thu, Nov 16, 2023 at 11:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 16, 2023 at 03:59:43PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
>
> >diff --git a/include/net/act_api.h b/include/net/act_api.h
> >index cd5a8e86f..b95a9bc29 100644
> >--- a/include/net/act_api.h
> >+++ b/include/net/act_api.h
> >@@ -70,6 +70,7 @@ struct tc_action {
> > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
> > #define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
> > #define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
> >+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
> >
> > /* Update lastuse only if needed, to avoid dirtying a cache line.
> > * We use a temp variable to avoid fetching jiffies twice.
> >diff --git a/include/net/p4tc.h b/include/net/p4tc.h
> >index ccb54d842..68b00fa72 100644
> >--- a/include/net/p4tc.h
> >+++ b/include/net/p4tc.h
> >@@ -9,17 +9,23 @@
> > #include <linux/refcount.h>
> > #include <linux/rhashtable.h>
> > #include <linux/rhashtable-types.h>
> >+#include <net/tc_act/p4tc.h>
> >+#include <net/p4tc_types.h>
> >
> > #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
> > #define P4TC_DEFAULT_MAX_RULES 1
> > #define P4TC_PATH_MAX 3
> >+#define P4TC_MAX_TENTRIES 33554432
>
> Seeing define like this one always makes me happier. Where does it come
> from? Why not 0x2000000 at least?
I don't recall why we decided to do decimal - will change it.
>
> >
> > #define P4TC_KERNEL_PIPEID 0
> >
> > #define P4TC_PID_IDX 0
> >+#define P4TC_AID_IDX 1
> >+#define P4TC_PARSEID_IDX 1
> >
> > struct p4tc_dump_ctx {
> > u32 ids[P4TC_PATH_MAX];
> >+ struct rhashtable_iter *iter;
> > };
> >
> > struct p4tc_template_common;
> >@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
> >
> > struct p4tc_pipeline {
> > struct p4tc_template_common common;
> >+ struct idr p_act_idr;
> > struct rcu_head rcu;
> > struct net *net;
> >+ u32 num_created_acts;
> > /* Accounts for how many entities are referencing this pipeline.
> > * As for now only P4 filters can refer to pipelines.
> > */
> >@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
> > const u32 pipeid,
> > struct netlink_ext_ack *extack);
> >
> >+struct p4tc_act *tcf_p4_find_act(struct net *net,
> >+ const struct tc_action_ops *a_o,
> >+ struct netlink_ext_ack *extack);
> >+void
> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
> >+
> > static inline int p4tc_action_destroy(struct tc_action **acts)
> > {
> >+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
> > int ret = 0;
> >
> > if (acts) {
> >- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
> >+ int j = 0;
> >+ int i;
>
> Move declarations to the beginning of the if body.
>
Didn't follow - which specific declaration?
> [...]
>
>
> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >index 4d33f44c1..7b89229a7 100644
> >--- a/include/uapi/linux/p4tc.h
> >+++ b/include/uapi/linux/p4tc.h
> >@@ -4,6 +4,7 @@
> >
> > #include <linux/types.h>
> > #include <linux/pkt_sched.h>
> >+#include <linux/pkt_cls.h>
> >
> > /* pipeline header */
> > struct p4tcmsg {
> >@@ -17,9 +18,12 @@ struct p4tcmsg {
> > #define P4TC_MSGBATCH_SIZE 16
> >
> > #define P4TC_MAX_KEYSZ 512
> >+#define P4TC_DEFAULT_NUM_PREALLOC 16
> >
> > #define TEMPLATENAMSZ 32
> > #define PIPELINENAMSIZ TEMPLATENAMSZ
> >+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
> >+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
>
> Prefix? This is uapi. Could you please be more careful with naming at
> least in the uapi area?
Good point.
>
> [...]
>
>
> >diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
> >new file mode 100644
> >index 000000000..19db0772c
> >--- /dev/null
> >+++ b/net/sched/p4tc/p4tc_action.c
> >@@ -0,0 +1,2242 @@
> >+// SPDX-License-Identifier: GPL-2.0-or-later
> >+/*
> >+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
> >+ *
> >+ * Copyright (c) 2022-2023, Mojatatu Networks
> >+ * Copyright (c) 2022-2023, Intel Corporation.
> >+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
> >+ * Victor Nogueira <victor@mojatatu.com>
> >+ * Pedro Tammela <pctammela@mojatatu.com>
> >+ */
> >+
> >+#include <linux/err.h>
> >+#include <linux/errno.h>
> >+#include <linux/init.h>
> >+#include <linux/kernel.h>
> >+#include <linux/kmod.h>
> >+#include <linux/list.h>
> >+#include <linux/module.h>
> >+#include <linux/netdevice.h>
> >+#include <linux/skbuff.h>
> >+#include <linux/slab.h>
> >+#include <linux/string.h>
> >+#include <linux/types.h>
> >+#include <net/flow_offload.h>
> >+#include <net/net_namespace.h>
> >+#include <net/netlink.h>
> >+#include <net/pkt_cls.h>
> >+#include <net/p4tc.h>
> >+#include <net/sch_generic.h>
> >+#include <net/sock.h>
> >+#include <net/tc_act/p4tc.h>
> >+
> >+static LIST_HEAD(dynact_list);
> >+
> >+#define SEPARATOR "/"
>
> Prefix? Btw, why exactly do you need this. It is used only once.
>
We'll get rid of it.
> To quote a few function names in this file:
>
> >+static void set_param_indices(struct idr *params_idr)
> >+static void generic_free_param_value(struct p4tc_act_param *param)
> >+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
> >+static void dev_free_param_value(struct p4tc_act_param *param)
> >+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
> >+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
> >+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
> >+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
> >+static void p4tc_param_put(struct p4tc_act_param *param)
> >+static void free_intermediate_param(struct p4tc_act_param *param)
> >+static void free_intermediate_params_list(struct list_head *params_list)
> >+static int init_prealloc_params(struct p4tc_act *act,
> >+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
> >+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
> >+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
> >+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
> >+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
> >+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
> >+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
> >+static int generic_init_param_value(struct p4tc_act_param *nparam,
> >+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
> >+tcf_param_find_byany(struct p4tc_act *act,
> >+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
> >+static int __p4_init_param_type(struct p4tc_act_param *param,
> >+static int tcf_p4_act_init_params(struct net *net,
> >+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
> >+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
> >+static int tcf_act_fill_param_type(struct sk_buff *skb,
> >+static void tcf_p4_dyna_cleanup(struct tc_action *a)
> >+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
> >+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
> >+static void p4_put_many_params(struct idr *params_idr)
> >+static int p4_init_param_type(struct p4tc_act_param *param,
> >+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
> >+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
> >+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
> >+static void p4tc_action_net_exit(struct tc_action_net *tn)
> >+static void p4_act_params_put(struct p4tc_act *act)
> >+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
> >+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
> >+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
> >+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
> >+static void p4tc_params_replace_many(struct p4tc_act *act,
> >+ struct idr *params_idr)
> >+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
> >+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
>
> Is there some secret key how you name the functions? To me, this looks
> completely inconsistent :/
What would be better? tcf_p4_xxxx?
A lot of the tcf_xxx is there because that convention is used in that file,
but we can change it.
cheers,
jamal
>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-17 12:49 ` Jamal Hadi Salim
@ 2023-11-17 18:37 ` John Fastabend
2023-11-17 20:46 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: John Fastabend @ 2023-11-17 18:37 UTC (permalink / raw)
To: Jamal Hadi Salim, John Fastabend
Cc: netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
jiri, xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu,
horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Jamal Hadi Salim wrote:
> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Jamal Hadi Salim wrote:
> > > We are seeking community feedback on P4TC patches.
> > >
> >
> > [...]
> >
> > >
> > > What is P4?
> > > -----------
> >
> > I read the cover letter here is my high level takeaway.
> >
>
> At least you read the cover letter this time ;->
I read it last time as well. About mid way down I tried to
list the points (1-6) more concisely if folks want to get to the
meat of my argument quickly.
> > P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> > point.
>
> Are you familiar with P4 architectures? That code was for PSA (which
> is essentially for switches) we are doing PNA (which is more nic
> oriented).
Yes. But for folks that are not familiar, PSA is a switch architecture; it
looks roughly like this,
parser -> ingress -> deparser -> pkt replication -> parser
-> egress
-> deparser
-> queueing
The gist is ingress/egress blocks hold your p4 logic (match action
tables usually) to xfrm headers, counters, registers, and so on. You
get one on ingress and one on egress to build your logic up.
And PNA is roughly like this,
ingress -> parser -> control -> deparser -> accelerators -> host | network
Accelerators are externs, more or less defined outside P4. Control has
all your metrics, header transforms, registers, and so on. And the parser,
well, it parses headers. The deparser is something we don't typically think
about much on the sw side, but it serializes the object back into a packet.
That is a rough couple line explanation.
You can also define whatever architecture you like, and there are some
ways to do that. But if you want to be a PSA or PNA you define those
blocks in your P4. The key idea is to have architectures that map
to a large set of different vendor hardware. Clearly sw and FPGAs
can build mostly any architecture needed.
As an editorial comment P4 is very much a hardware centric view of
the world when looking at P4 architectures. SW never needed these
because we mostly have general purpose CPUs.
> And yes, we used that code as a starting point and made the necessary
> changes needed to conform to PNA. We made it actually work better by
> using kfuncs.
Better performance? More P4 DSL program space implemented? The kfuncs
added are equivalent to map ops already in BPF but over 'tc' map types.
Or did I miss some kfuncs?
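(For readers following along, a rough side-by-side of the two lookup
styles; the kfunc call in the comment is only an approximation of what
the series adds, not its exact signature:)

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct act_params { __u32 port; };      /* action parameters, illustrative */

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);                     /* e.g. parsed dstAddr */
        __type(value, struct act_params);
} mytable SEC(".maps");

SEC("tc")
int lookup_example(struct __sk_buff *skb)
{
        __u32 key = 0;                          /* filled in by the parser */
        struct act_params *val;

        /* plain eBPF: hash map lookup keyed on parsed header fields */
        val = bpf_map_lookup_elem(&mytable, &key);
        if (!val)
                return TC_ACT_SHOT;             /* table miss */

        /* p4c-tc routes the same step through a kfunc into a TC-owned
         * table, conceptually:
         *   act = bpf_p4tc_tbl_read(skb, &params, &key, sizeof(key));
         * (name/signature approximated, not copied from the series)
         */
        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";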
The p4c-ebpf backend already supports two models; we could have added
the PNA model to it as well. It's actually simpler than the PSA model
in many ways, at least it has fewer blocks. I think all this infrastructure
here could be unnecessary with updates to p4c-ebpf.
>
> > At least the cover letter needs to explain why this path is not taken.
>
> I thought we had a reference to that backend - but will add it for the
> next update.
>
> > From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> > I don't see any reason not to just land it all in BPF. Support exists and if
> > its missing some smaller things add them and everyone gets them vs niche P4
> > backend.
>
> Ok, i thought you said you read the cover letter. Reasons are well
> stated, primarily that we need to make sure all P4 programs work.
I don't think that is a very strong argument to use/build a p4c-tc
architecture and implementation instead of p4c-ebpf. I can't think
of any reason p4c-ebpf can't support all programs other than perhaps
it's missing a few items. From a design side, though, it should be
capable of any PSA, PNA, and many more architectures you come up
with.
And I'm genuinely curious what is missing so a list would be nice.
The missing block perhaps is a performant software TCAM, but I'm
not fully convinced that software should even bother to try and
duplicate a TCAM. If you need a performant TCAM, buy hw with a TCAM;
emulating one is always going to be slower. Have Intel/AMD/ARM
glue a TCAM to the core if its so useful.
To be clear p4c-tc is only targeting PNA programs not all P4 space.
>
> >
> > Without hardware support for any of this its impossible to understand how 'tc'
> > would work as a hardware offload interface for a p4 device so we need hardware
> > support to evaluate. For example I'm not even sure how you would take a BPF
> > parser into hardware on most network devices that aren't processor based.
> >
>
> P4 has nothing to do with parsers in hardware. Where did you get this
> requirement from?
P4 is/was primarily developed as a DSL to program hardware. We've
never figured out how to do a native Linux P4 controller for hardware.
There are a couple of blockers for that in my opinion. First, no one
has ever opened up the hardware to an OSS solution. Two, it's
never been entirely clear what the big win for enough people would be.
So we get targeted offloads, timestamp, vxlan, tso, ktls, even
heard quic offload yesterday. And its easy enough to just program
the hardware directly from user space.
So yes I think P4 has a lot to do with hardware, its probably
fair to say this p4c-tc thing isn't hardware. But, I think its
very limiting and the value of any p4 implementation in kernel
would be its ability to use hardware.
I'm not even convinced P4 is a good DSL for SW implementations.
I don't think its obvious how hw P4 and sw datapaths integrate
effectively. My opinion is p4c-tc is not moving us forward
here.
>
> > P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> > commands so arguing for 'tc' UI is nice is not going to be very compelling.
> > Best we can say is it works well enough and we use it.
>
>
> The control plane interface is netlink. This part is not negotiable.
> You can write whatever you want on top of it(for example P4runtime
> using netlink as its southbound interface). We feel that tc - a well
Sure we need a low level interface for p4runtime to use and I
agree we don't need all blocks done at once.
> understood utility - is one we should make publicly available for the
> rest of the world to use. For example we have rust code that runs on
> top of netlink to do performance testing.
If updates/lookups from userspace are a performance vector you
care about, I can't see how netlink is more efficient than a
mmapped bpf map. If you have data, share it, but it seems
highly unlikely.
The argument I'm trying to make is netlink vs bpf maps vs
some other goo shouldn't matter to users because we should
build them higher level tooling to interact with the p4
objects. Then it comes down to performance in my opinion.
And if map updates matter I suspect netlink is relatively
slow.
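(For concreteness, the user-space update path being compared against
netlink is one bpf() syscall per entry with libbpf, or a plain memory
store if the map was created with BPF_F_MMAPABLE and mmap()ed; the
pinned-map path below is only an example:)

#include <bpf/bpf.h>

/* update one table entry in a pinned BPF map; pin path is illustrative */
int update_entry(__u32 dst_addr, __u32 port)
{
        int fd = bpf_obj_get("/sys/fs/bpf/mytable");

        if (fd < 0)
                return -1;
        /* single syscall, no netlink message construction or parsing */
        return bpf_map_update_elem(fd, &dst_addr, &port, BPF_ANY);
}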
>
> > more commentary below.
> >
> > >
> > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > domain-specific programming language for specifying data plane behavior.
> > >
> > > The P4 ecosystem includes an extensive range of deployments, products, projects
> > > and services, etc[9][10][11][12].
> > >
> > > __What is P4TC?__
> > >
> > > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > > run independently in different namespaces alongside their appropriate state. The
> > > implementation builds on top of many years of Linux TC experiences.
> > > On why P4 - see small treatise here:[4].
> > >
> > > There have been many discussions and meetings since about 2015 in regards to
> > > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > > done!
> > >
> > > A lot more of the P4TC motivation is captured at:
> > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > >
> > > **In this patch series we focus on s/w datapath only**.
> >
> > I don't see the value in adding 16676 lines of code for s/w only datapath
> > of something we already can do with p4c-ebpf backend.
>
> Please please stop this entitlement politics (which i frankly think
> you guys have been getting away with for a few years now).
I'm allowed to disagree with your architecture and propose what I think
is a betteer way to translate P4 into software.
Its common to argue against adding new code if it duplicates functionality
we already support.
> This code does not touch any core code - you guys constantly push code
> that touches core code and it is not unusual we have to pick up the
> pieces after but now you are going to call me out for the number of
> lines of code? Is it ok for you to write lines of code in the kernel
> but not me? Judge the technical work then we can have a meaningful
> discussion.
I think I'm judging the technical work here. Bullet points.
1. p4c-tc implementation looks like it should be slower,
   in terms of pkts/sec, than a bpf implementation. Meaning
   I suspect pipeline and objects laid out like this will lose
   to a BPF program with a parser and single lookup. The p4c-ebpf
compiler should look to create optimized EBPF code not some
emulated switch topology.
2. p4c-tc control plane looks slower than a directly mmapped bpf
   map. Doing a simple update vs a netlink msg. The argument
that BPF can't do CRUD (which we had offlist) seems incorrect
to me. Correct me if I'm wrong with details about why.
3. I don't see why ebpf cannot support all P4 programs. Because
the DSL compiler side doesn't support the nic architecture
side to me indicates fixing the compiler is the direction
not pushing on the kernel.
4. Working in the BPF framework will benefit more folks than a tc
framework. I just don't see a large user base of P4 software
running on Linux. It doesn't mean we can't have it in linux,
but worth considering. We have lots of niche stuff in the
kernel, but usually the niche thing doesn't have another
more common way to run it.
5. The win for P4 is not the sw implementation. It's about getting
programmable hardware and this doesn't advance that goal
in any meaningful way as far as I can see.
6. By pushing the P4 model so low in the stack of tooling
you lose ability for compiler to do interesting things.
Combining match action tables, converting them to
switch statements or jumps, finding inverse operations
and removing them. I still think there is lots of unexplored
work on compiling P4 that has not been done.
>
> TBH, I am trying very hard to see if i should respond to any more
> comments from you. I was very happy with our original scriptable
> approach and you came out and banged on the table that you want ebpf.
> We spent 10 months of multiple people working on this code to make it
> ebpf friendly and now you want more (actually i am not sure what the
> hell you want).
I've made the above arguments on early versions of the code,
and when we talked, and even offered it in p4 working group.
It shouldn't be surprising I've not changed my opinion.
It's an argument against duplicating existing functionality with
something that is slower and doesn't give us HW P4 support. The
bullets above.
>
> > Or one of the other
> > backends already there. Namely take P4 programs and run them on CPUs in Linux.
> >
> > Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> > datapath so I'm guessing its slower than most datapaths we have already.
> >
> > What do we gain here over existing p4c-ebpf?
> >
>
> see above.
We are talking past each other because here I argue it looks like a slow
datapath and you say 'see above' but what above was I meant to see?
That it doesn't have PNA support? Compared to PSA doing a PNA support
should be straightforward.
I disagree that software should try to emulate hardware too closely.
They are fundamentally different platforms. One has CAMs, TCAMs,
and LPMs and obscure instruction sets to make all this work. The other
is working on a general purpose CPU. I think slamming a hardware
architecture into software with emulated TCAMs and what not,
will be a losing performance proposition. Experience shows you can
either go the SIMD direction and parallelize everything with these instructions
or you reduce the datapath to a single (or minimal set) of lookups.
Find a counter-example.
>
> > >
> > > __P4TC Workflow__
> > >
> > > These patches enable kernel and user space code change _independence_ for any
> > > new P4 program that describes a new datapath. The workflow is as follows:
> > >
> > > 1) A developer writes a P4 program, "myprog"
> > >
> > > 2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > > a) shell script(s) which form template definitions for the different P4
> > > objects "myprog" utilizes (tables, externs, actions etc).
> >
> > This is odd to me. I think packing around shell scrips as a program is not
> > very usable. Why not just an object file.
> >
> > > b) the parser and the rest of the datapath are generated
> > > in eBPF and need to be compiled into binaries.
> > > c) A json introspection file used for the control plane (by iproute2/tc).
> >
> > Why split up the eBPF and control plane like this? eBPF has a control plane
> > just use the existing one?
> >
>
> The cover letter clearly states that we are using netlink as the
> control api. Does eBPF support netlink?
But why? The statement is there but no rationale is given. "People are
used to it" was maybe stated, but my argument is that users of P4 shouldn't
be crafting netlink messages; they need tooling whether it's netlink or BPF
or some new thing. So pick the most efficient tool for the job. Why
is netlink the most efficient option here?
>
> > >
> > > 3) The developer (or operator) executes the shell script(s) to manifest the
> > > functional "myprog" into the kernel.
> > >
> > > 4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> > > to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> > >
> > > Example1: parser is an action:
> > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > action bpf obj $PARSER.o section parser/tc-ingress \
> > > action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > > Example2: parser explicitly bound and rest of dpath as an action:
> > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > prog tc obj $PARSER.o section parser/tc-ingress \
> > > action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > > Example3: parser is at XDP, rest of dpath as an action:
> > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > prog type xdp obj $PARSER.o section parser/xdp-ingress \
> > > pinned_link /path/to/xdp-prog-link \
> > > action bpf obj $PROGNAME.o section p4prog/tc"
> > >
> > > Example4: parser+prog at XDP:
> > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > prog type xdp obj $PROGNAME.o section p4prog/xdp \
> > > pinned_link /path/to/xdp-prog-link"
> > >
> > > see individual patches for more examples tc vs xdp etc. Also see section on
> > > "challenges" (on this cover letter).
> > >
> > > Once "myprog" P4 program is instantiated one can start updating table entries
> > > that are associated with myprog's table named "mytable". Example:
> > >
> > > tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> > > action send_to_port param port eno1
> >
> > As a UI above is entirely cryptic to most folks I bet.
> >
>
> But ebpf is not?
We don't need everything out the gate but my point is that the UI
should be abstracted away from the P4 programmer and operator at
this level. My observation that 'tc' is cryptic was just an off-hand
comment; I don't think it's relevant to the overall argument for or against.
What we should understand is how to map p4runtime, or at least an
operator-friendly UI, onto the semantics.
>
> > myprog table is a BPF map? If so then I don't see any need for this just
> > interact with it like a BPF map. I suspect its some other object, but
> > I don't see any rationale for that.
>
> All the P4 objects sit in the TC domain. The datapath program is ebpf.
> Control is via netlink.
I'm missing something fundamental. What do we gain from this TC domain?
There are some TC maps for LPM and TCAMs; we have LPM already in BPF,
and the TCAM you have could easily be added if you want to. Then the entire
program runs to completion. Surely this is more performant. Throw in
XDP and the redirect never leaves the NIC, no skb, etc.
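(The in-tree LPM support being referred to is BPF_MAP_TYPE_LPM_TRIE; a
minimal IPv4 definition for reference:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct lpm_key {
        __u32 prefixlen;        /* LPM trie keys must start with the prefix length */
        __u32 dst_addr;         /* IPv4 address, network byte order */
};

struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 1024);
        __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
        __type(key, struct lpm_key);
        __type(value, __u32);                   /* e.g. egress port index */
} fib SEC(".maps");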
From the architecture side I don't think we need kernel objects
for pipelines and some P4 notion of match action tables those
can all be mapped into the BPF program. The packet never leaves
XDP. Performance is good on datapath and performance is good
on map update side. It looks like noise to me teaching the kernel
about P4 objects and types. More importantly you are constraining
the optimizations the compiler can make. Perhaps the compiler
wants no map at all and implements it as a switch stmt for
example. Maybe the compiler can find inverse operations and
fastpaths to short circuit. By forcing the model so low in
the stack you remove this ability.
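(I.e. for a small static table nothing stops the compiler from emitting
straight-line code with no map at all; a trivial illustration:)

#include <linux/types.h>

/* a 2-entry exact-match table folded directly into the program text */
static inline int mytable_lookup(__u32 dst)
{
        switch (dst) {
        case 0x0a000102:        /* 10.0.1.2 -> send_to_port(1) */
                return 1;
        case 0x0a000103:        /* 10.0.1.3 -> send_to_port(2) */
                return 2;
        default:                /* default miss action */
                return -1;
        }
}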
>
>
> > >
> > > A packet arriving on ingress of any of the ports on block 22 will first be
> > > exercised via the (eBPF) parser to find the headers pointing to the ip
> > > destination address.
> > > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > > myprog's mytable which returns the action params which are then used to execute
> > > the action in the eBPF datapath (eventually sending out packets to eno1).
> > > On a table miss, mytable's default miss action is executed.
> >
> > This chunk looks like standard BPF program. Parse pkt, lookup an action,
> > do the action.
> >
>
> Yes, the ebpf datapath does the parsing, and then interacts with
> kfuncs to the tc world before it (the ebpf datapath) executes the
> action.
> Note: ebpf did not invent any of that (parse, lookup, action). It has
> existed in tc for 20 years before ebpf existed.
Its not about who invented what. All this goes way back.
My point is the 'tc' world here looks unnecessary. It can be managed
from outside the kernel entirely.
>
> > > __Description of Patches__
> > >
> > > P4TC is designed to have no impact on the core code for other users
> > > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > > it there should be no impact on your performance.
> > >
> > > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > > actions that can be created on "the fly" based on the P4 program requirement.
> >
> > the common pattern in bpf for this is to use a tail call map and populate
> > it at runtime and/or just compile your program with the actions. Here
> > the actions came from the p4 back up at step 1 so no reason we can't
> > just compile them with p4c.
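(For readers unfamiliar with it, the tail-call pattern mentioned above
looks roughly like this; the map layout and index are illustrative, and
user space can repopulate the slots at runtime by updating the prog
array with new program fds:)

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 16);        /* one slot per action program */
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} actions SEC(".maps");

SEC("tc")
int dispatch(struct __sk_buff *skb)
{
        __u32 act_id = 0;               /* chosen by the table lookup */

        bpf_tail_call(skb, &actions, act_id);
        return TC_ACT_SHOT;             /* reached only if the slot is empty */
}

char _license[] SEC("license") = "GPL";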
> >
> > > This patch makes a small incision into act_api which shouldn't affect the
> > > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > > minimalist enablers for P4TC and have no effect the classical tc action.
> > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > >
> > > The core P4TC code implements several P4 objects.
> >
> > [...]
> >
> > >
> > > __Restating Our Requirements__
> > >
> > > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > > classifier and pedit action). In this section we review the scriptable version
> > > against the current implementation we are pushing upstream which uses eBPF.
> > >
> > > Our intention is to target the TC crowd.
> > > Essentially developers and ops people deploying TC based infra.
> > > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > > devs (given code is being generated and doesn't need humans to write it).
> >
> > I don't follow. humans wrote the p4.
> >
>
> But not the ebpf code, that is compiler generated. P4 is a higher
> level Domain specific language and ebpf is just one backend (others
> s/w variants include DPDK, Rust, C, etc)
Yes. I still don't follow. Of course ebpf is just one backend.
>
> > I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> > If the _ops folks are writing P4 great as long as we give them an efficient
> > way to run their p4 I don't think they care about what executes it.
> >
> > >
> > > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > > meaning from the control plane all the way to the tooling infra, i.e
> > > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > > etc.
> > > The main advantage is that we have a singular vendor-neutral interface via the
> > > kernel using well understood mechanisms based on deployment experience (and
> > > at least this part doesnt need retraining).
> >
> > A seemless p4 experience would be great. That looks like a tooling problem
> > at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> > I would aim for,
> >
> > $ p4c-* myprog.p4
> > $ p4cRun ./myprog
> >
> > And maybe some options like,
> >
> > $ p4cRun -i eth0 ./myprog
>
> Armchair lawyering and classical ML bikesheding
It was just an example of what I think the end goal should be.
>
> > Then use the p4runtime to interface with the system. If you don't like the
> > runtime then it should be brought up in that working group.
> >
> > >
> > > 1) Supporting expressibility of the universe set of P4 progs
> > >
> > > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > > verifier had to be worked around and even then there are cases where we couldnt
> > > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > > for us. Note, there are still challenges running all potential P4 programs at
> > > the XDP level - the solution to that is to have the compiler generate XDP based
> > > code only if it possible to map it to that layer.
> >
> > Examples and we can fix it.
>
> Right. Let me wait for you to fix something 5 years from now. I would
> never have used eBPF at all but the kfunc is what changed my mind.
>
> > >
> > > 2) Support for P4 HW and SW equivalence.
> > >
> > > This feature continues to work even in the presence of eBPF as the s/w
> > > datapath. There are cases of square-hole-round-peg scenarios but
> > > those are implementation issues we can live with.
> >
> > But no hw support.
> >
>
> This patcheset has nothing to do with offload (you read the cover
> letter?). All above is saying is that by virtue of using TC we have a
> path to a proven offload approach.
I'm arguing P4 is in a big part about programmable HW. If we merge
P4 into the kernel all the way down to the p4 types and don't
consider how it works with hardware, that is a non-starter for me.
>
>
> > >
> > > 3) Operational usability
> > >
> > > By maintaining the TC control plane (even in presence of eBPF datapath)
> > > runtime aspects remain unchanged. So for our target audience of folks
> > > who have deployed tc including offloads - the comfort zone is unchanged.
> > > There is also the comfort zone of continuing to use the true-and-tried netlink
> > > interfacing.
> >
> > The P4 control plane should be P4Runtime.
> >
>
> And be my guest and write it on top of netlink.
But I would prefer it was a BPF map and gave my reasons above.
>
> cheers,
> jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-17 18:37 ` John Fastabend
@ 2023-11-17 20:46 ` Jamal Hadi Salim
2023-11-20 9:39 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-17 20:46 UTC (permalink / raw)
To: John Fastabend
Cc: netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
jiri, xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu,
horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Jamal Hadi Salim wrote:
> > > > We are seeking community feedback on P4TC patches.
> > > >
> > >
> > > [...]
> > >
> > > >
> > > > What is P4?
> > > > -----------
> > >
> > > I read the cover letter here is my high level takeaway.
> > >
> >
> > At least you read the cover letter this time ;->
>
> I read it last time as well. About mid way down I tried to
> list the points (1-6) more concisely if folks want to get to the
> meat of my argument quickly.
You wrote an essay - i will just jump to your points further down the
text below and try and summarize it...
> > > P4c-bpf backend exists and I don't see why we wouldn't use that as a starting
> > > point.
> >
> > Are you familiar with P4 architectures? That code was for PSA (which
> > is essentially for switches) we are doing PNA (which is more nic
> > oriented).
>
> Yes. But for folks that are not familiar, PSA is a switch architecture; it
> looks roughly like this,
>
> parser -> ingress -> deparser -> pkt replication -> parser
> -> egress
> -> deparser
> -> queueing
>
> The gist is ingress/egress blocks hold your p4 logic (match action
> tables usually) to xfrm headers, counters, registers, and so on. You
> get one on ingress and one on egress to build your logic up.
>
> And PNA is roughly like this,
>
> ingress -> parser -> control -> deparser -> accelerators -> host | network
>
> Accelerators are externs, more or less defined outside P4. Control has
> all your metrics, header transforms, registers, and so on. And the parser,
> well, it parses headers. The deparser is something we don't typically think
> about much on the sw side, but it serializes the object back into a packet.
> That is a rough couple line explanation.
>
> You can also define whatever architecture you like, and there are some
> ways to do that. But if you want to be a PSA or PNA you define those
> blocks in your P4. The key idea is to have architectures that map
> to a large set of different vendor hardware. Clearly sw and FPGAs
> can build mostly any architecture needed.
>
> As an editorial comment P4 is very much a hardware centric view of
> the world when looking at P4 architectures. SW never needed these
> because we mostly have general purpose CPUs.
>
> > And yes, we used that code as a starting point and made the necessary
> > changes needed to conform to PNA. We made it actually work better by
> > using kfuncs.
>
> Better performance? More P4 DSL program space implemented? The kfuncs
> added are equivalent to map ops already in BPF but over 'tc' map types.
> Or did I miss some kfuncs?
>
> The p4c-ebpf backend already supports two models; we could have added
> the PNA model to it as well. It's actually simpler than the PSA model
> in many ways, at least it has fewer blocks. I think all this infrastructure
> here could be unnecessary with updates to p4c-ebpf.
>
> >
> > > At least the cover letter needs to explain why this path is not taken.
> >
> > I thought we had a reference to that backend - but will add it for the
> > next update.
> >
> > > From the cover letter there appears to be bpf pieces and non-bpf pieces, but
> > > I don't see any reason not to just land it all in BPF. Support exists and if
> > > its missing some smaller things add them and everyone gets them vs niche P4
> > > backend.
> >
> > Ok, i thought you said you read the cover letter. Reasons are well
> > stated, primarily that we need to make sure all P4 programs work.
>
> I don't think that is a very strong argument to use/build a p4c-tc
> architecture and implementation instead of p4c-ebpf. I can't think
> of any reason p4c-ebpf can't support all programs other than perhaps
> it's missing a few items. From a design side, though, it should be
> capable of any PSA, PNA, and many more architectures you come up
> with.
>
> And I'm genuinely curious what is missing so a list would be nice.
> The missing block perhaps is a performant software TCAM, but I'm
> not fully convinced that software should even bother to try and
> duplicate a TCAM. If you need a performant TCAM, buy hw with a TCAM;
> emulating one is always going to be slower. Have Intel/AMD/ARM
> glue a TCAM to the core if its so useful.
>
> To be clear p4c-tc is only targeting PNA programs not all P4 space.
>
> >
> > >
> > > Without hardware support for any of this its impossible to understand how 'tc'
> > > would work as a hardware offload interface for a p4 device so we need hardware
> > > support to evaluate. For example I'm not even sure how you would take a BPF
> > > parser into hardware on most network devices that aren't processor based.
> > >
> >
> > P4 has nothing to do with parsers in hardware. Where did you get this
> > requirement from?
>
> P4 is/was primarily developed as a DSL to program hardware. We've
> never figured out how to do a native Linux P4 controller for hardware.
> There are a couple of blockers for that in my opinion. First, no one
> has ever opened up the hardware to an OSS solution. Two, it's
> never been entirely clear what the big win for enough people would be.
> So we get targeted offloads, timestamp, vxlan, tso, ktls, even
> heard quic offload yesterday. And its easy enough to just program
> the hardware directly from user space.
>
> So yes I think P4 has a lot to do with hardware, its probably
> fair to say this p4c-tc thing isn't hardware. But, I think its
> very limiting and the value of any p4 implementation in kernel
> would be its ability to use hardware.
>
> I'm not even convinced P4 is a good DSL for SW implementations.
> I don't think its obvious how hw P4 and sw datapaths integrate
> effectively. My opinion is p4c-tc is not moving us forward
> here.
>
> >
> > > P4 has a P4Runtime I think most folks would prefer a P4 UI vs typing in 'tc'
> > > commands so arguing for 'tc' UI is nice is not going to be very compelling.
> > > Best we can say is it works well enough and we use it.
> >
> >
> > The control plane interface is netlink. This part is not negotiable.
> > You can write whatever you want on top of it(for example P4runtime
> > using netlink as its southbound interface). We feel that tc - a well
>
> Sure we need a low level interface for p4runtime to use and I
> agree we don't need all blocks done at once.
>
> > understood utility - is one we should make publicly available for the
> > rest of the world to use. For example we have rust code that runs on
> > top of netlink to do performance testing.
>
> If updates/lookups from userspace are a performance vector you
> care about, I can't see how netlink is more efficient than a
> mmapped bpf map. If you have data, share it, but it seems
> highly unlikely.
>
> The argument I'm trying to make is netlink vs bpf maps vs
> some other goo shouldn't matter to users because we should
> build them higher level tooling to interact with the p4
> objects. Then it comes down to performance in my opinion.
> And if map updates matter I suspect netlink is relatively
> slow.
>
> >
> > > more commentary below.
> > >
> > > >
> > > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > > domain-specific programming language for specifying data plane behavior.
> > > >
> > > > The P4 ecosystem includes an extensive range of deployments, products, projects
> > > > and services, etc[9][10][11][12].
> > > >
> > > > __What is P4TC?__
> > > >
> > > > P4TC is a net-namespace aware implementation, meaning multiple P4 programs can
> > > > run independently in different namespaces alongside their appropriate state. The
> > > > implementation builds on top of many years of Linux TC experiences.
> > > > On why P4 - see small treatise here:[4].
> > > >
> > > > There have been many discussions and meetings since about 2015 in regards to
> > > > P4 over TC[2] and we are finally proving the naysayers that we do get stuff
> > > > done!
> > > >
> > > > A lot more of the P4TC motivation is captured at:
> > > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > > >
> > > > **In this patch series we focus on s/w datapath only**.
> > >
> > > I don't see the value in adding 16676 lines of code for s/w only datapath
> > > of something we already can do with p4c-ebpf backend.
> >
> > Please please stop this entitlement politics (which i frankly think
> > you guys have been getting away with for a few years now).
>
> I'm allowed to disagree with your architecture and propose what I think
> is a betteer way to translate P4 into software.
>
> Its common to argue against adding new code if it duplicates functionality
> we already support.
>
> > This code does not touch any core code - you guys constantly push code
> > that touches core code and it is not unusual we have to pick up the
> > pieces after but now you are going to call me out for the number of
> > lines of code? Is it ok for you to write lines of code in the kernel
> > but not me? Judge the technical work then we can have a meaningful
> > discussion.
>
> I think I'm judging the technical work here. Bullet points.
>
> 1. p4c-tc implementation looks like it should be slower,
>    in terms of pkts/sec, than a bpf implementation. Meaning
>    I suspect pipeline and objects laid out like this will lose
>    to a BPF program with a parser and single lookup. The p4c-ebpf
> compiler should look to create optimized EBPF code not some
> emulated switch topology.
>
The parser is ebpf based. The other objects which require control
plane interaction are not - those interact via netlink.
We published perf data a while back - presented at the P4 workshop
back in April (was in the cover letter)
https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
But do note: the correct abstraction is the first priority.
Optimization is something we can teach the compiler over time. But
even with the minimalist code generation you can see that our approach
always beats ebpf in LPM and ternary. The other ones I am pretty sure
we can optimize over time.
Your view of "single lookup" is true for simple programs but if you
have 10 tables trying to model a 5G function then it doesnt make sense
(and i think the data we published was clear that you gain no
advantage using ebpf - as a matter of fact there was no perf
difference between XDP and tc in such cases).
> 2. p4c-tc control plane looks slower than a directly mmapped bpf
>    map. Doing a simple update vs a netlink msg. The argument
> that BPF can't do CRUD (which we had offlist) seems incorrect
> to me. Correct me if I'm wrong with details about why.
>
So let me see....
you want me to replace netlink and all its features and rewrite it
using the ebpf system calls? Congestion control, event handling,
arbitrary message crafting, etc and the years of work that went into
netlink? NO to the HELL.
I should note that there was an interesting talk at netdevconf 0x17
where the speaker showed the challenges of dealing with ebpf on "day
two" - slides or videos are not up yet, but link is:
https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
The point the speaker was making is that it's always easy to whip up an ebpf
program that can slice and dice packets and maybe even flash LEDs, but
the real work and challenge is in the control plane. I agree with the
speaker based on my experiences. This discussion of replacing netlink
with ebpf system calls is absolutely a non-starter. Let's just end the
discussion and agree to disagree if you are going to keep insisting on
that.
> 3. I don't see why ebpf cannot support all P4 programs. Because
> the DSL compiler side doesn't support the nic architecture
> side to me indicates fixing the compiler is the direction
> not pushing on the kernel.
>
Wrestling with the verifier, different versions of toolchains, etc.
This is not just a problem we are facing; just about everyone out there
that tries to do something serious with ebpf eventually hits these
issues. Kfuncs really opened the door for us (I think they improved
the usability of ebpf by probably orders of magnitude). Without kfuncs I
would not have even considered ebpf - and as I said, I was fine with the
u32 and pedit approach we had.
> 4. Working in the BPF framework will benefit more folks than a tc
> framework. I just don't see a large user base of P4 software
> running on Linux. It doesn't mean we can't have it in linux,
> but worth considering. We have lots of niche stuff in the
> kernel, but usually the niche thing doesn't have another
> more common way to run it.
>
To each their itch - that's what open source is about. This is our
itch. You dont have to like it nor use it. There are a lot of things
i dont like in the kernel and would never use. Saying you dont see a
"large user base of P4 software on Linux" is handwaving at best. Under
what metric do you reach such a conclusion? The fact that i can
describe something in a _simple_ high level language like P4 and get
low level ebpf for free is of great value. I dont need to go and look
for an ebpf expert to hand code things for me.
> 5. The win for P4 is not the sw implementation. It's about getting
> programmable hardware and this doesn't advance that goal
> in any meaningful way as far as I can see.
And all the s/w incarnations of P4 out there would disagree with you.
The fact that P4 has use in h/w doesnt disqualify it from being useful
in s/w.
> 6. By pushing the P4 model so low in the stack of tooling
> you lose ability for compiler to do interesting things.
> Combining match action tables, converting them to
> switch statements or jumps, finding inverse operations
> and removing them. I still think there is lots of unexplored
> work on compiling P4 that has not been done.
>
And that can be done over time unless you are saying it is impossible.
ebpf != P4, they are two different levels of expression. eBPF is just
a tool to get us there and nothing more.
cheers,
jamal
> >
> > TBH, I am trying very hard to see if i should respond to any more
> > comments from you. I was very happy with our original scriptable
> > approach and you came out and banged on the table that you want ebpf.
> > We spent 10 months of multiple people working on this code to make it
> > ebpf friendly and now you want more (actually i am not sure what the
> > hell you want).
>
> I've made the above arguments on early versions of the code,
> and when we talked, and even offered it in p4 working group.
> It shouldn't be surprising I've not changed my opinion.
>
> It's an argument against duplicating existing functionality with
> something that is slower and doesn't give us HW P4 support. The
> bullets above.
>
>
> >
> > > Or one of the other
> > > backends already there. Namely take P4 programs and run them on CPUs in Linux.
> > >
> > > Also I suspect a pipelined datapath is going to be slower than a O(1) lookup
> > > datapath so I'm guessing its slower than most datapaths we have already.
> > >
> > > What do we gain here over existing p4c-ebpf?
> > >
> >
> > see above.
>
> We are talking past each other because here I argue it looks like a slow
> datapath and you say 'see above' but what above was I meant to see?
> That it doesn't have PNA support? Compared to PSA doing a PNA support
> should be straightforward.
>
> I disagree that software should try to emulate hardware too closely.
> They are fundamentally different platforms. One has CAMs, TCAMs,
> and LPMs and obscure instruction sets to make all this work. The other
> is working on a general purpose CPU. I think slamming a hardware
> architecture into software with emulated TCAMs and what not,
> will be a losing performance proposition. Experience shows you can
> either go the SIMD direction and parallelize everything with these instructions
> or you reduce the datapath to a single (or minimal set) of lookups.
> Find a counter-example.
>
> >
> > > >
> > > > __P4TC Workflow__
> > > >
> > > > These patches enable kernel and user space code change _independence_ for any
> > > > new P4 program that describes a new datapath. The workflow is as follows:
> > > >
> > > > 1) A developer writes a P4 program, "myprog"
> > > >
> > > > 2) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > > > a) shell script(s) which form template definitions for the different P4
> > > > objects "myprog" utilizes (tables, externs, actions etc).
> > >
> > > This is odd to me. I think packing around shell scrips as a program is not
> > > very usable. Why not just an object file.
> > >
> > > > b) the parser and the rest of the datapath are generated
> > > > in eBPF and need to be compiled into binaries.
> > > > c) A json introspection file used for the control plane (by iproute2/tc).
> > >
> > > Why split up the eBPF and control plane like this? eBPF has a control plane
> > > just use the existing one?
> > >
> >
> > The cover letter clearly states that we are using netlink as the
> > control api. Does eBPF support netlink?
>
> But why? The statement is there but no rationale is given. "People are
> used to it" was maybe stated, but my argument is that users of P4 shouldn't
> be crafting netlink messages; they need tooling whether it's netlink or BPF
> or some new thing. So pick the most efficient tool for the job. Why
> is netlink the most efficient option here?
>
> >
> > > >
> > > > 3) The developer (or operator) executes the shell script(s) to manifest the
> > > > functional "myprog" into the kernel.
> > > >
> > > > 4) The developer (or operator) instantiates "myprog" via the tc P4 filter
> > > > to ingress/egress (depending on P4 arch) of one or more netdevs/ports.
> > > >
> > > > Example1: parser is an action:
> > > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > > action bpf obj $PARSER.o section parser/tc-ingress \
> > > > action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > > Example2: parser explicitly bound and rest of dpath as an action:
> > > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > > prog tc obj $PARSER.o section parser/tc-ingress \
> > > > action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > > Example3: parser is at XDP, rest of dpath as an action:
> > > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > > prog type xdp obj $PARSER.o section parser/xdp-ingress \
> > > > pinned_link /path/to/xdp-prog-link \
> > > > action bpf obj $PROGNAME.o section p4prog/tc"
> > > >
> > > > Example4: parser+prog at XDP:
> > > > "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > > > prog type xdp obj $PROGNAME.o section p4prog/xdp \
> > > > pinned_link /path/to/xdp-prog-link"
> > > >
> > > > see individual patches for more examples tc vs xdp etc. Also see section on
> > > > "challenges" (on this cover letter).
> > > >
> > > > Once "myprog" P4 program is instantiated one can start updating table entries
> > > > that are associated with myprog's table named "mytable". Example:
> > > >
> > > > tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> > > > action send_to_port param port eno1
> > >
> > > As a UI above is entirely cryptic to most folks I bet.
> > >
> >
> > But ebpf is not?
>
> We don't need everything out the gate but my point is that the UI
> should be abstracted away from the P4 programmer and operator at
> this level. My observation that 'tc' is cryptic was just an off-hand
> comment; I don't think it's relevant to the overall argument for or against.
> What we should understand is how to map p4runtime, or at least an
> operator-friendly UI, onto the semantics.
>
> >
> > > myprog table is a BPF map? If so then I don't see any need for this just
> > > interact with it like a BPF map. I suspect its some other object, but
> > > I don't see any rationale for that.
> >
> > All the P4 objects sit in the TC domain. The datapath program is ebpf.
> > Control is via netlink.
>
> I'm missing something fundamental. What do we gain from this TC domain?
> There are some TC maps for LPM and TCAMs; we have LPM already in BPF,
> and the TCAM you have could easily be added if you want to. Then the entire
> program runs to completion. Surely this is more performant. Throw in
> XDP and the redirect never leaves the NIC, no skb, etc.
>
> From the architecture side I don't think we need kernel objects
> for pipelines and some P4 notion of match action tables those
> can all be mapped into the BPF program. The packet never leaves
> XDP. Performance is good on datapath and performance is good
> on map update side. It looks like noise to me teaching the kernel
> about P4 objects and types. More importantly you are constraining
> the optimizations the compiler can make. Perhaps the compiler
> wants no map at all and implements it as a switch stmt for
> example. Maybe the compiler can find inverse operations and
> fastpaths to short circuit. By forcing the model so low in
> the stack you remove this ability.
>
> >
> >
> > > >
> > > > A packet arriving on ingress of any of the ports on block 22 will first be
> > > > exercised via the (eBPF) parser to find the headers pointing to the ip
> > > > destination address.
> > > > The remainder eBPF datapath uses the result dstAddr as a key to do a lookup in
> > > > myprog's mytable which returns the action params which are then used to execute
> > > > the action in the eBPF datapath (eventually sending out packets to eno1).
> > > > On a table miss, mytable's default miss action is executed.
> > >
> > > This chunk looks like standard BPF program. Parse pkt, lookup an action,
> > > do the action.
> > >
> >
> > Yes, the ebpf datapath does the parsing, and then interacts with
> > kfuncs to the tc world before it (the ebpf datapath) executes the
> > action.
> > Note: ebpf did not invent any of that (parse, lookup, action). It has
> > existed in tc for 20 years before ebpf existed.
>
> Its not about who invented what. All this goes way back.
>
> My point is the 'tc' world here looks unnecessary. It can be managed
> from outside the kernel entirely.
>
> >
> > > > __Description of Patches__
> > > >
> > > > P4TC is designed to have no impact on the core code for other users
> > > > of TC. IOW, you can compile it out but even if it compiled in and you dont use
> > > > it there should be no impact on your performance.
> > > >
> > > > We do make core kernel changes. Patch #1 adds infrastructure for "dynamic"
> > > > actions that can be created on "the fly" based on the P4 program requirement.
> > >
> > > the common pattern in bpf for this is to use a tail call map and populate
> > > it at runtime and/or just compile your program with the actions. Here
> > > the actions came from the p4 back up at step 1 so no reason we can't
> > > just compile them with p4c.
> > >
> > > > This patch makes a small incision into act_api which shouldn't affect the
> > > > performance (or functionality) of the existing actions. Patches 2-4,6-7 are
> > > > minimalist enablers for P4TC and have no effect the classical tc action.
> > > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > > >
> > > > The core P4TC code implements several P4 objects.
> > >
> > > [...]
> > >
> > > >
> > > > __Restating Our Requirements__
> > > >
> > > > The initial release made in January/2023 had a "scriptable" datapath (think u32
> > > > classifier and pedit action). In this section we review the scriptable version
> > > > against the current implementation we are pushing upstream which uses eBPF.
> > > >
> > > > Our intention is to target the TC crowd.
> > > > Essentially developers and ops people deploying TC based infra.
> > > > More importantly the original intent for P4TC was to enable _ops folks_ more than
> > > > devs (given code is being generated and doesn't need humans to write it).
> > >
> > > I don't follow. humans wrote the p4.
> > >
> >
> > But not the ebpf code, that is compiler generated. P4 is a higher
> > level Domain specific language and ebpf is just one backend (others
> > s/w variants include DPDK, Rust, C, etc)
>
> Yes. I still don't follow. Of course ebpf is just one backend.
>
> >
> > > I think the intent should be to enable P4 to run on Linux. Ideally efficiently.
> > > If the _ops folks are writing P4 great as long as we give them an efficient
> > > way to run their p4 I don't think they care about what executes it.
> > >
> > > >
> > > > With TC, we get whole "familiar" package of match-action pipeline abstraction++,
> > > > meaning from the control plane all the way to the tooling infra, i.e
> > > > iproute2/tc cli, netlink infra(request/resp, event subscribe/multicast-publish,
> > > > congestion control etc), s/w and h/w symbiosis, the autonomous kernel control,
> > > > etc.
> > > > The main advantage is that we have a singular vendor-neutral interface via the
> > > > kernel using well understood mechanisms based on deployment experience (and
> > > > at least this part doesnt need retraining).
> > >
> > > A seemless p4 experience would be great. That looks like a tooling problem
> > > at the p4c-backend and p4c-frontend problem. Rather than a bunch of 'tc' glue
> > > I would aim for,
> > >
> > > $ p4c-* myprog.p4
> > > $ p4cRun ./myprog
> > >
> > > And maybe some options like,
> > >
> > > $ p4cRun -i eth0 ./myprog
> >
> > Armchair lawyering and classical ML bikesheding
>
> It was just an example of what I think the end goal should be.
>
> >
> > > Then use the p4runtime to interface with the system. If you don't like the
> > > runtime then it should be brought up in that working group.
> > >
> > > >
> > > > 1) Supporting expressibility of the universe set of P4 progs
> > > >
> > > > It is a must to support 100% of all possible P4 programs. In the past the eBPF
> > > > verifier had to be worked around and even then there are cases where we couldnt
> > > > avoid path explosion when branching is involved. Kfunc-ing solves these issues
> > > > for us. Note, there are still challenges running all potential P4 programs at
> > > > the XDP level - the solution to that is to have the compiler generate XDP based
> > > > code only if it possible to map it to that layer.
> > >
> > > Examples and we can fix it.
> >
> > Right. Let me wait for you to fix something 5 years from now. I would
> > never have used eBPF at all but the kfunc is what changed my mind.
> >
> > > >
> > > > 2) Support for P4 HW and SW equivalence.
> > > >
> > > > This feature continues to work even in the presence of eBPF as the s/w
> > > > datapath. There are cases of square-hole-round-peg scenarios but
> > > > those are implementation issues we can live with.
> > >
> > > But no hw support.
> > >
> >
> > This patcheset has nothing to do with offload (you read the cover
> > letter?). All above is saying is that by virtue of using TC we have a
> > path to a proven offload approach.
>
> I'm arguing P4 is in a big part about programmable HW. If we merge
> P4 into the kernel all the way down to the p4 types and don't
> consider how it works with hardware, that is a non-starter for me.
>
> >
> >
> > > >
> > > > 3) Operational usability
> > > >
> > > > By maintaining the TC control plane (even in presence of eBPF datapath)
> > > > runtime aspects remain unchanged. So for our target audience of folks
> > > > who have deployed tc including offloads - the comfort zone is unchanged.
> > > > There is also the comfort zone of continuing to use the true-and-tried netlink
> > > > interfacing.
> > >
> > > The P4 control plane should be P4Runtime.
> > >
> >
> > And be my guest and write it on top of netlink.
>
> But I would prefer it was a BPF map and gave my reasons above.
>
> >
> > cheers,
> > jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
2023-11-17 7:09 ` John Fastabend
@ 2023-11-19 9:14 ` kernel test robot
2023-11-20 22:28 ` kernel test robot
2 siblings, 0 replies; 79+ messages in thread
From: kernel test robot @ 2023-11-19 9:14 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: oe-kbuild-all, deb.chatterjee, anjali.singhai, namrata.limaye,
tom, mleitner, Mahesh.Shirshyad, tomasz.osinski, jiri,
xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
daniel, bpf, khalidm, toke, mattyk
Hi Jamal,
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Jamal-Hadi-Salim/net-sched-act_api-Introduce-dynamic-actions-list/20231116-230427
base: net-next/main
patch link: https://lore.kernel.org/r/20231116145948.203001-14-jhs%40mojatatu.com
patch subject: [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
config: um-randconfig-r132-20231119 (https://download.01.org/0day-ci/archive/20231119/202311191752.mYGymxfv-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231119/202311191752.mYGymxfv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311191752.mYGymxfv-lkp@intel.com/
All errors (new ones prefixed by >>):
/usr/bin/ld: net/sched/p4tc/p4tc_tmpl_api.o: in function `p4tc_template_init':
>> p4tc_tmpl_api.c:(.init.text+0x71): undefined reference to `register_p4tc_tbl_bpf'
collect2: error: ld returned 1 exit status
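
(This class of randconfig link failure is usually resolved by giving the symbol a
no-op fallback when the file defining it is not built. A sketch of that pattern
follows; the Kconfig guard and the signature of register_p4tc_tbl_bpf() are
assumptions made purely for illustration, not the actual fix.)

    /* in the header that declares the symbol */
    #if IS_ENABLED(CONFIG_BPF_SYSCALL)          /* assumed guard */
    int register_p4tc_tbl_bpf(void);
    #else
    static inline int register_p4tc_tbl_bpf(void)
    {
            return 0;                           /* nothing to register */
    }
    #endif
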
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-17 12:09 ` Jamal Hadi Salim
@ 2023-11-20 8:18 ` Jiri Pirko
2023-11-20 12:48 ` Jamal Hadi Salim
2023-11-20 18:20 ` David Ahern
1 sibling, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 8:18 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
Fri, Nov 17, 2023 at 01:09:45PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
>>
>> [...]
>>
>>
>> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>> >index ba32dba66..4d33f44c1 100644
>> >--- a/include/uapi/linux/p4tc.h
>> >+++ b/include/uapi/linux/p4tc.h
>> >@@ -2,8 +2,71 @@
>> > #ifndef __LINUX_P4TC_H
>> > #define __LINUX_P4TC_H
>> >
>> >+#include <linux/types.h>
>> >+#include <linux/pkt_sched.h>
>> >+
>> >+/* pipeline header */
>> >+struct p4tcmsg {
>> >+ __u32 pipeid;
>> >+ __u32 obj;
>> >+};
>>
>> I don't follow. Is there any sane reason to use a header instead of a normal
>> netlink attribute? Moreover, you extend the existing RT netlink with
>> a huge amount of p4 things. Isn't this a good time to finally introduce
>> a generic netlink TC family with a proper yaml spec, with all the benefits it
>> brings, and implement the p4 tc uapi there? Please?
>>
>
>Several reasons:
>a) We are similar to current tc messaging with the subheader being
>there for multiplexing.
Yeah, you don't need to carry a 20-year-old burden in a newly introduced
interface. That's my point.
>b) Where does this leave iproute2? +Cc David and Stephen. Do other
>generic netlink conversions get contributed back to iproute2?
There is no conversion afaik, only extensions. And they have to be,
otherwise the user would not be able to use the newly introduced
features.
>c) note: Our API is CRUD-ish instead of RPC-based (per generic netlink),
>i.e. you have:
> COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
>P4 programs from the control plane.
I'm pretty sure you can achieve the same over genetlink.
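
To make that concrete, a CRUD-style command set over genetlink could be sketched
roughly as below. Every name here is invented for illustration and nothing in it
is proposed uapi.

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <net/genetlink.h>

    enum {
            P4TC_EX_CMD_UNSPEC,
            P4TC_EX_CMD_CREATE,     /* C */
            P4TC_EX_CMD_GET,        /* R */
            P4TC_EX_CMD_UPDATE,     /* U */
            P4TC_EX_CMD_DELETE,     /* D */
    };

    /* placeholder handler; a real one would walk info->attrs[] */
    static int p4tc_ex_doit(struct sk_buff *skb, struct genl_info *info)
    {
            return 0;
    }

    static const struct genl_small_ops p4tc_ex_ops[] = {
            { .cmd = P4TC_EX_CMD_CREATE, .doit = p4tc_ex_doit,
              .flags = GENL_ADMIN_PERM },
            { .cmd = P4TC_EX_CMD_GET,    .doit = p4tc_ex_doit },
            /* UPDATE/DELETE elided */
    };

    static struct genl_family p4tc_ex_family = {
            .name         = "p4tc-example",
            .version      = 1,
            .maxattr      = 2,      /* e.g. pipeid and object-type attributes */
            .small_ops    = p4tc_ex_ops,
            .n_small_ops  = ARRAY_SIZE(p4tc_ex_ops),
            .module       = THIS_MODULE,
    };
    /* registered once with genl_register_family(&p4tc_ex_family) */
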
>d) we have spent many hours optimizing the control path to the kernel, so I
>am not sure what it would buy us to switch to generic netlink.
All the benefits of ynl yaml tooling, at least.
>
>cheers,
>jamal
>
>>
>> >+
>> >+#define P4TC_MAXPIPELINE_COUNT 32
>> >+#define P4TC_MAXTABLES_COUNT 32
>> >+#define P4TC_MINTABLES_COUNT 0
>> >+#define P4TC_MSGBATCH_SIZE 16
>> >+
>> > #define P4TC_MAX_KEYSZ 512
>> >
>> >+#define TEMPLATENAMSZ 32
>> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
>>
>> ugh. A prefix please?
>>
>> pw-bot: cr
>>
>> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-17 15:11 ` Jamal Hadi Salim
@ 2023-11-20 8:19 ` Jiri Pirko
2023-11-20 13:45 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 8:19 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Fri, Nov 17, 2023 at 04:11:29PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 16, 2023 at 11:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 16, 2023 at 03:59:43PM CET, jhs@mojatatu.com wrote:
>>
>> [...]
>>
>>
>> >diff --git a/include/net/act_api.h b/include/net/act_api.h
>> >index cd5a8e86f..b95a9bc29 100644
>> >--- a/include/net/act_api.h
>> >+++ b/include/net/act_api.h
>> >@@ -70,6 +70,7 @@ struct tc_action {
>> > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
>> > #define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
>> > #define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
>> >+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
>> >
>> > /* Update lastuse only if needed, to avoid dirtying a cache line.
>> > * We use a temp variable to avoid fetching jiffies twice.
>> >diff --git a/include/net/p4tc.h b/include/net/p4tc.h
>> >index ccb54d842..68b00fa72 100644
>> >--- a/include/net/p4tc.h
>> >+++ b/include/net/p4tc.h
>> >@@ -9,17 +9,23 @@
>> > #include <linux/refcount.h>
>> > #include <linux/rhashtable.h>
>> > #include <linux/rhashtable-types.h>
>> >+#include <net/tc_act/p4tc.h>
>> >+#include <net/p4tc_types.h>
>> >
>> > #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
>> > #define P4TC_DEFAULT_MAX_RULES 1
>> > #define P4TC_PATH_MAX 3
>> >+#define P4TC_MAX_TENTRIES 33554432
>>
>> Seeing define like this one always makes me happier. Where does it come
>> from? Why not 0x2000000 at least?
>
>I don't recall why we decided to do decimal - will change it.
>
>>
>> >
>> > #define P4TC_KERNEL_PIPEID 0
>> >
>> > #define P4TC_PID_IDX 0
>> >+#define P4TC_AID_IDX 1
>> >+#define P4TC_PARSEID_IDX 1
>> >
>> > struct p4tc_dump_ctx {
>> > u32 ids[P4TC_PATH_MAX];
>> >+ struct rhashtable_iter *iter;
>> > };
>> >
>> > struct p4tc_template_common;
>> >@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
>> >
>> > struct p4tc_pipeline {
>> > struct p4tc_template_common common;
>> >+ struct idr p_act_idr;
>> > struct rcu_head rcu;
>> > struct net *net;
>> >+ u32 num_created_acts;
>> > /* Accounts for how many entities are referencing this pipeline.
>> > * As for now only P4 filters can refer to pipelines.
>> > */
>> >@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
>> > const u32 pipeid,
>> > struct netlink_ext_ack *extack);
>> >
>> >+struct p4tc_act *tcf_p4_find_act(struct net *net,
>> >+ const struct tc_action_ops *a_o,
>> >+ struct netlink_ext_ack *extack);
>> >+void
>> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
>> >+
>> > static inline int p4tc_action_destroy(struct tc_action **acts)
>> > {
>> >+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
>> > int ret = 0;
>> >
>> > if (acts) {
>> >- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
>> >+ int j = 0;
>> >+ int i;
>>
>> Move declarations to the beginning of the if body.
>>
>
>Didn't follow - which specific declaration?
It should look like this:
int j = 0;
int i;
ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
>
>> [...]
>>
>>
>> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>> >index 4d33f44c1..7b89229a7 100644
>> >--- a/include/uapi/linux/p4tc.h
>> >+++ b/include/uapi/linux/p4tc.h
>> >@@ -4,6 +4,7 @@
>> >
>> > #include <linux/types.h>
>> > #include <linux/pkt_sched.h>
>> >+#include <linux/pkt_cls.h>
>> >
>> > /* pipeline header */
>> > struct p4tcmsg {
>> >@@ -17,9 +18,12 @@ struct p4tcmsg {
>> > #define P4TC_MSGBATCH_SIZE 16
>> >
>> > #define P4TC_MAX_KEYSZ 512
>> >+#define P4TC_DEFAULT_NUM_PREALLOC 16
>> >
>> > #define TEMPLATENAMSZ 32
>> > #define PIPELINENAMSIZ TEMPLATENAMSZ
>> >+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
>> >+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
>>
>> Prefix? This is uapi. Could you please be more careful with naming at
>> least in the uapi area?
>
>Good point.
>
>>
>> [...]
>>
>>
>> >diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
>> >new file mode 100644
>> >index 000000000..19db0772c
>> >--- /dev/null
>> >+++ b/net/sched/p4tc/p4tc_action.c
>> >@@ -0,0 +1,2242 @@
>> >+// SPDX-License-Identifier: GPL-2.0-or-later
>> >+/*
>> >+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
>> >+ *
>> >+ * Copyright (c) 2022-2023, Mojatatu Networks
>> >+ * Copyright (c) 2022-2023, Intel Corporation.
>> >+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
>> >+ * Victor Nogueira <victor@mojatatu.com>
>> >+ * Pedro Tammela <pctammela@mojatatu.com>
>> >+ */
>> >+
>> >+#include <linux/err.h>
>> >+#include <linux/errno.h>
>> >+#include <linux/init.h>
>> >+#include <linux/kernel.h>
>> >+#include <linux/kmod.h>
>> >+#include <linux/list.h>
>> >+#include <linux/module.h>
>> >+#include <linux/netdevice.h>
>> >+#include <linux/skbuff.h>
>> >+#include <linux/slab.h>
>> >+#include <linux/string.h>
>> >+#include <linux/types.h>
>> >+#include <net/flow_offload.h>
>> >+#include <net/net_namespace.h>
>> >+#include <net/netlink.h>
>> >+#include <net/pkt_cls.h>
>> >+#include <net/p4tc.h>
>> >+#include <net/sch_generic.h>
>> >+#include <net/sock.h>
>> >+#include <net/tc_act/p4tc.h>
>> >+
>> >+static LIST_HEAD(dynact_list);
>> >+
>> >+#define SEPARATOR "/"
>>
>> Prefix? Btw, why exactly do you need this. It is used only once.
>>
>
>We'll get rid of it.
>
>> To quote a few function names in this file:
>>
>> >+static void set_param_indices(struct idr *params_idr)
>> >+static void generic_free_param_value(struct p4tc_act_param *param)
>> >+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
>> >+static void dev_free_param_value(struct p4tc_act_param *param)
>> >+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
>> >+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
>> >+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
>> >+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
>> >+static void p4tc_param_put(struct p4tc_act_param *param)
>> >+static void free_intermediate_param(struct p4tc_act_param *param)
>> >+static void free_intermediate_params_list(struct list_head *params_list)
>> >+static int init_prealloc_params(struct p4tc_act *act,
>> >+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
>> >+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
>> >+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
>> >+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
>> >+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
>> >+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
>> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
>> >+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
>> >+static int generic_init_param_value(struct p4tc_act_param *nparam,
>> >+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
>> >+tcf_param_find_byany(struct p4tc_act *act,
>> >+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
>> >+static int __p4_init_param_type(struct p4tc_act_param *param,
>> >+static int tcf_p4_act_init_params(struct net *net,
>> >+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
>> >+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
>> >+static int tcf_act_fill_param_type(struct sk_buff *skb,
>> >+static void tcf_p4_dyna_cleanup(struct tc_action *a)
>> >+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
>> >+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
>> >+static void p4_put_many_params(struct idr *params_idr)
>> >+static int p4_init_param_type(struct p4tc_act_param *param,
>> >+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
>> >+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
>> >+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
>> >+static void p4tc_action_net_exit(struct tc_action_net *tn)
>> >+static void p4_act_params_put(struct p4tc_act *act)
>> >+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
>> >+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>> >+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>> >+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
>> >+static void p4tc_params_replace_many(struct p4tc_act *act,
>> >+ struct idr *params_idr)
>> >+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
>> >+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
>>
>> Is there some secret key how you name the functions? To me, this looks
>> completely inconsistent :/
>
>What would be better? tcf_p4_xxxx?
Idk, up to you, just please maintain some basic naming consistency and
prefixes.
>A lot of the tcf_xxx is because that convention is used in that file
>but we can change it.
>
>cheers,
>jamal
>>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-17 12:14 ` Jamal Hadi Salim
@ 2023-11-20 8:22 ` Jiri Pirko
2023-11-20 14:02 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 8:22 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Fri, Nov 17, 2023 at 01:14:43PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 16, 2023 at 11:42 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
>>
>> [...]
>>
>> > include/net/p4tc.h | 161 +++
>> > include/net/p4tc_ext_api.h | 199 +++
>> > include/uapi/linux/p4tc.h | 61 +
>> > include/uapi/linux/p4tc_ext.h | 36 +
>> > net/sched/p4tc/Makefile | 2 +-
>> > net/sched/p4tc/p4tc_bpf.c | 79 +-
>> > net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
>> > net/sched/p4tc/p4tc_pipeline.c | 34 +-
>> > net/sched/p4tc/p4tc_runtime_api.c | 10 +-
>> > net/sched/p4tc/p4tc_table.c | 57 +-
>> > net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
>> > net/sched/p4tc/p4tc_tmpl_api.c | 4 +
>> > net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
>> > 13 files changed, 5083 insertions(+), 10 deletions(-)
>>
>> This is for this patch. Now for the whole patchset you have:
>> 30 files changed, 16676 insertions(+), 39 deletions(-)
>>
>> I understand that you want to fit into 15 patches with all the work.
>> But sorry, patches like this are unreviewable. My suggestion is to split
>> the patchset into multiple ones including smaller patches and allow
>> people to digest this. I don't believe that anyone can seriously stand
>> to review a patch with more than 200 lines of changes.
>
>This specific patch is not difficult to split into two. I can do that
>and send them out minus the first 8 trivial patches - but I am not familiar with
>how to do "here's part 1 of the patches" and "here's patchset two".
Split into multiple patchsets and send one by one. No need to have all
in at once.
>There's a dependency between them so it's not clear how patchwork and
What dependency? It should compile. Introduce some basic functionality
first and extend it incrementally with other patchsets. The usual way.
>reviewers would deal with it. Thoughts?
>
>Note: The code machinery is really repeatable; for example if you look
>at the tables control you will see very similar patterns to actions
>etc., i.e. spending time to review one will make it easy for the rest.
>
>cheers,
>jamal
>
>> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-17 20:46 ` Jamal Hadi Salim
@ 2023-11-20 9:39 ` Jiri Pirko
2023-11-20 14:23 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 9:39 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>
>> Jamal Hadi Salim wrote:
>> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>> > >
>> > > Jamal Hadi Salim wrote:
[...]
>>
>> I think I'm judging the technical work here. Bullet points.
>>
>> 1. The p4c-tc implementation looks like it should be slower
>> in terms of pkts/sec than a bpf implementation. Meaning
>> I suspect a pipeline and objects laid out like this will lose
>> to a BPF program with a parser and a single lookup. The p4c-ebpf
>> compiler should look to create optimized eBPF code, not some
>> emulated switch topology.
>>
>
>The parser is ebpf based. The other objects which require control
>plane interaction are not - those interact via netlink.
>We published perf data a while back - presented at the P4 workshop
>back in April (was in the cover letter)
>https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>But do note: the correct abstraction is the first priority.
>Optimization is something we can teach the compiler over time. But
>even with the minimalist code generation you can see that our approach
>always beats ebpf in LPM and ternary. The other ones I am pretty sure
Any idea why? Perhaps the existing eBPF maps are not that suitable for
these kinds of lookups? I mean, in theory eBPF should always be faster.
>we can optimize over time.
>Your view of "single lookup" is true for simple programs but if you
>have 10 tables trying to model a 5G function then it doesn't make sense
>(and I think the data we published was clear that you gain no
>advantage using ebpf - as a matter of fact there was no perf
>difference between XDP and tc in such cases).
>
>> 2. The p4c-tc control plane looks slower than a directly mmapped bpf
>> map. Doing a simple update vs a netlink msg. The argument
>> that BPF can't do CRUD (which we had offlist) seems incorrect
>> to me. Correct me if I'm wrong with details about why.
>>
>
>So let me see....
>you want me to replace netlink and all its features and rewrite it
>using the ebpf system calls? Congestion control, event handling,
>arbitrary message crafting, etc and the years of work that went into
>netlink? NO to the HELL.
Wait, I don't think John suggests anything like that. He just suggests
to have the tables as eBPF maps. Honestly, I don't understand the
fixation on netlink. Its socket messaging, memcpies, processing
overhead, etc can't keep up with mmaped memory access at scale. Measure
that and I bet you'll get drastically different results.
I mean, netlink is good for a lot of things, but that does not mean it is a
universal answer to userspace<->kernel data passing.
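
As an illustration of what is meant by mmapped access, a generic libbpf sketch
follows; it is not tied to any p4tc code and the value layout is made up.

    #include <sys/mman.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    struct entry { __u32 action; __u32 hit_cnt; };      /* illustrative layout */

    static struct entry *map_table(int *fd_out, __u32 max_entries)
    {
            LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
            int fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "ex_tbl", sizeof(__u32),
                                    sizeof(struct entry), max_entries, &opts);
            struct entry *tbl;

            if (fd < 0)
                    return NULL;
            tbl = mmap(NULL, (size_t)max_entries * sizeof(struct entry),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (tbl == MAP_FAILED)
                    return NULL;
            *fd_out = fd;
            /* tbl[i].action = ...; updates are plain stores, no syscall per write */
            return tbl;
    }
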
>I should note that there was an interesting talk at netdevconf 0x17
>where the speaker showed the challenges of dealing with ebpf on "day
>two" - slides or videos are not up yet, but link is:
>https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>The point the speaker was making is that it's always easy to whip up an ebpf
>program that can slice and dice packets and maybe even flash LEDs, but
>the real work and challenge is in the control plane. I agree with the
>speaker based on my experiences. This discussion of replacing netlink
>with ebpf system calls is absolutely a non-starter. Let's just end the
>discussion and agree to disagree if you are going to keep insisting on
>that.
[...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-20 8:18 ` Jiri Pirko
@ 2023-11-20 12:48 ` Jamal Hadi Salim
2023-11-20 13:16 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 12:48 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
On Mon, Nov 20, 2023 at 3:18 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 17, 2023 at 01:09:45PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
> >>
> >> [...]
> >>
> >>
> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >> >index ba32dba66..4d33f44c1 100644
> >> >--- a/include/uapi/linux/p4tc.h
> >> >+++ b/include/uapi/linux/p4tc.h
> >> >@@ -2,8 +2,71 @@
> >> > #ifndef __LINUX_P4TC_H
> >> > #define __LINUX_P4TC_H
> >> >
> >> >+#include <linux/types.h>
> >> >+#include <linux/pkt_sched.h>
> >> >+
> >> >+/* pipeline header */
> >> >+struct p4tcmsg {
> >> >+ __u32 pipeid;
> >> >+ __u32 obj;
> >> >+};
> >>
> >> I don't follow. Is there any sane reason to use header instead of normal
> >> netlink attribute? Moveover, you extend the existing RT netlink with
> >> a huge amout of p4 things. Isn't this the good time to finally introduce
> >> generic netlink TC family with proper yaml spec with all the benefits it
> >> brings and implement p4 tc uapi there? Please?
> >>
> >
> >Several reasons:
> >a) We are similar to current tc messaging with the subheader being
> >there for multiplexing.
>
> Yeah, you don't need to carry 20year old burden in newly introduced
> interface. That's my point.
Having a demux sub-header is a 20-year-old burden? I didn't follow.
>
> >b) Where does this leave iproute2? +Cc David and Stephen. Do other
> >generic netlink conversions get contributed back to iproute2?
>
> There is no conversion afaik, only extensions. And they has to be,
> otherwise the user would not be able to use the newly introduced
> features.
The big question is whether the collective who use iproute2 still get to
use the same tooling or whether they now have to go and learn some new
tooling. I understand the value of the new approach but is it a
revolution or an evolution? We opted to put things in iproute2 instead,
for example, because that is widely available (and used).
>
> >c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
> >based. i.e you have:
> > COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
> >P4 programs from the control plane.
>
> I'm pretty sure you can achieve the same over genetlink.
>
I think you are right.
>
> >d) we have spent many hours optimizing the control to the kernel so i
> >am not sure what it would buy us to switch to generic netlink..
>
> All the benefits of ynl yaml tooling, at least.
>
Did you pay close attention to what we have? The user space code is
written once into iproute2 and subsequent to that there is no
recompilation of any iproute2 code. The compiler generates a json
file specific to a P4 program which is then introspected by the
iproute2 code.
cheers,
jamal
>
> >
> >cheers,
> >jamal
> >
> >>
> >> >+
> >> >+#define P4TC_MAXPIPELINE_COUNT 32
> >> >+#define P4TC_MAXTABLES_COUNT 32
> >> >+#define P4TC_MINTABLES_COUNT 0
> >> >+#define P4TC_MSGBATCH_SIZE 16
> >> >+
> >> > #define P4TC_MAX_KEYSZ 512
> >> >
> >> >+#define TEMPLATENAMSZ 32
> >> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
> >>
> >> ugh. A prefix please?
> >>
> >> pw-bot: cr
> >>
> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-20 12:48 ` Jamal Hadi Salim
@ 2023-11-20 13:16 ` Jiri Pirko
2023-11-20 15:30 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 13:16 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
Mon, Nov 20, 2023 at 01:48:14PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 3:18 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Fri, Nov 17, 2023 at 01:09:45PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> [...]
>> >>
>> >>
>> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>> >> >index ba32dba66..4d33f44c1 100644
>> >> >--- a/include/uapi/linux/p4tc.h
>> >> >+++ b/include/uapi/linux/p4tc.h
>> >> >@@ -2,8 +2,71 @@
>> >> > #ifndef __LINUX_P4TC_H
>> >> > #define __LINUX_P4TC_H
>> >> >
>> >> >+#include <linux/types.h>
>> >> >+#include <linux/pkt_sched.h>
>> >> >+
>> >> >+/* pipeline header */
>> >> >+struct p4tcmsg {
>> >> >+ __u32 pipeid;
>> >> >+ __u32 obj;
>> >> >+};
>> >>
>> >> I don't follow. Is there any sane reason to use header instead of normal
>> >> netlink attribute? Moveover, you extend the existing RT netlink with
>> >> a huge amout of p4 things. Isn't this the good time to finally introduce
>> >> generic netlink TC family with proper yaml spec with all the benefits it
>> >> brings and implement p4 tc uapi there? Please?
>> >>
>> >
>> >Several reasons:
>> >a) We are similar to current tc messaging with the subheader being
>> >there for multiplexing.
>>
>> Yeah, you don't need to carry 20year old burden in newly introduced
>> interface. That's my point.
>
>Having a demux sub header is 20 year old burden? I didnt follow.
You don't need the header, that's my point.
>
>>
>> >b) Where does this leave iproute2? +Cc David and Stephen. Do other
>> >generic netlink conversions get contributed back to iproute2?
>>
>> There is no conversion afaik, only extensions. And they has to be,
>> otherwise the user would not be able to use the newly introduced
>> features.
>
>The big question is does the collective who use iproute2 still get to
>use the same tooling or now they have to go and learn some new
>tooling. I understand the value of the new approach but is it a
>revolution or an evolution? We opted to put thing in iproute2 instead
>for example because that is widely available (and used).
I don't see why the iproute2 user-facing interface would be any different
depending on whether you use RTnetlink or genetlink as the backend channel...
>
>>
>> >c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
>> >based. i.e you have:
>> > COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
>> >P4 programs from the control plane.
>>
>> I'm pretty sure you can achieve the same over genetlink.
>>
>
>I think you are right.
>
>>
>> >d) we have spent many hours optimizing the control to the kernel so i
>> >am not sure what it would buy us to switch to generic netlink..
>>
>> All the benefits of ynl yaml tooling, at least.
>>
>
>Did you pay close attention to what we have? The user space code is
>written once into iproute2 and subsequent to that there is no
>recompilation of any iproute2 code. The compiler generates a json
>file specific to a P4 program which is then introspected by the
>iproute2 code.
Right, but in real life, netlink is used directly by many apps. I don't
see why this is any different.
Plus, the very best part of yaml from a user perspective, as I see it, is that
you just need the kernel-git yaml file and you can submit all commands.
No userspace implementation needed.
>
>
>cheers,
>jamal
>
>>
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >+
>> >> >+#define P4TC_MAXPIPELINE_COUNT 32
>> >> >+#define P4TC_MAXTABLES_COUNT 32
>> >> >+#define P4TC_MINTABLES_COUNT 0
>> >> >+#define P4TC_MSGBATCH_SIZE 16
>> >> >+
>> >> > #define P4TC_MAX_KEYSZ 512
>> >> >
>> >> >+#define TEMPLATENAMSZ 32
>> >> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
>> >>
>> >> ugh. A prefix please?
>> >>
>> >> pw-bot: cr
>> >>
>> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-20 8:19 ` Jiri Pirko
@ 2023-11-20 13:45 ` Jamal Hadi Salim
2023-11-20 16:25 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 13:45 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Mon, Nov 20, 2023 at 3:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 17, 2023 at 04:11:29PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 16, 2023 at 11:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 16, 2023 at 03:59:43PM CET, jhs@mojatatu.com wrote:
> >>
> >> [...]
> >>
> >>
> >> >diff --git a/include/net/act_api.h b/include/net/act_api.h
> >> >index cd5a8e86f..b95a9bc29 100644
> >> >--- a/include/net/act_api.h
> >> >+++ b/include/net/act_api.h
> >> >@@ -70,6 +70,7 @@ struct tc_action {
> >> > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
> >> > #define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
> >> > #define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
> >> >+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
> >> >
> >> > /* Update lastuse only if needed, to avoid dirtying a cache line.
> >> > * We use a temp variable to avoid fetching jiffies twice.
> >> >diff --git a/include/net/p4tc.h b/include/net/p4tc.h
> >> >index ccb54d842..68b00fa72 100644
> >> >--- a/include/net/p4tc.h
> >> >+++ b/include/net/p4tc.h
> >> >@@ -9,17 +9,23 @@
> >> > #include <linux/refcount.h>
> >> > #include <linux/rhashtable.h>
> >> > #include <linux/rhashtable-types.h>
> >> >+#include <net/tc_act/p4tc.h>
> >> >+#include <net/p4tc_types.h>
> >> >
> >> > #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
> >> > #define P4TC_DEFAULT_MAX_RULES 1
> >> > #define P4TC_PATH_MAX 3
> >> >+#define P4TC_MAX_TENTRIES 33554432
> >>
> >> Seeing define like this one always makes me happier. Where does it come
> >> from? Why not 0x2000000 at least?
> >
> >I dont recall why we decided to do decimal - will change it.
> >
> >>
> >> >
> >> > #define P4TC_KERNEL_PIPEID 0
> >> >
> >> > #define P4TC_PID_IDX 0
> >> >+#define P4TC_AID_IDX 1
> >> >+#define P4TC_PARSEID_IDX 1
> >> >
> >> > struct p4tc_dump_ctx {
> >> > u32 ids[P4TC_PATH_MAX];
> >> >+ struct rhashtable_iter *iter;
> >> > };
> >> >
> >> > struct p4tc_template_common;
> >> >@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
> >> >
> >> > struct p4tc_pipeline {
> >> > struct p4tc_template_common common;
> >> >+ struct idr p_act_idr;
> >> > struct rcu_head rcu;
> >> > struct net *net;
> >> >+ u32 num_created_acts;
> >> > /* Accounts for how many entities are referencing this pipeline.
> >> > * As for now only P4 filters can refer to pipelines.
> >> > */
> >> >@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
> >> > const u32 pipeid,
> >> > struct netlink_ext_ack *extack);
> >> >
> >> >+struct p4tc_act *tcf_p4_find_act(struct net *net,
> >> >+ const struct tc_action_ops *a_o,
> >> >+ struct netlink_ext_ack *extack);
> >> >+void
> >> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
> >> >+
> >> > static inline int p4tc_action_destroy(struct tc_action **acts)
> >> > {
> >> >+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
> >> > int ret = 0;
> >> >
> >> > if (acts) {
> >> >- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
> >> >+ int j = 0;
> >> >+ int i;
> >>
> >> Move declarations to the beginning of the if body.
> >>
> >
> >Didnt follow - which specific declaration?
>
> It should look like this:
>
> int j = 0;
> int i;
>
> ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
Huh? It does look like that already ;-> Note there's a "-" on that line.
>
> >
> >> [...]
> >>
> >>
> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >> >index 4d33f44c1..7b89229a7 100644
> >> >--- a/include/uapi/linux/p4tc.h
> >> >+++ b/include/uapi/linux/p4tc.h
> >> >@@ -4,6 +4,7 @@
> >> >
> >> > #include <linux/types.h>
> >> > #include <linux/pkt_sched.h>
> >> >+#include <linux/pkt_cls.h>
> >> >
> >> > /* pipeline header */
> >> > struct p4tcmsg {
> >> >@@ -17,9 +18,12 @@ struct p4tcmsg {
> >> > #define P4TC_MSGBATCH_SIZE 16
> >> >
> >> > #define P4TC_MAX_KEYSZ 512
> >> >+#define P4TC_DEFAULT_NUM_PREALLOC 16
> >> >
> >> > #define TEMPLATENAMSZ 32
> >> > #define PIPELINENAMSIZ TEMPLATENAMSZ
> >> >+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
> >> >+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
> >>
> >> Prefix? This is uapi. Could you please be more careful with naming at
> >> least in the uapi area?
> >
> >Good point.
> >
> >>
> >> [...]
> >>
> >>
> >> >diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
> >> >new file mode 100644
> >> >index 000000000..19db0772c
> >> >--- /dev/null
> >> >+++ b/net/sched/p4tc/p4tc_action.c
> >> >@@ -0,0 +1,2242 @@
> >> >+// SPDX-License-Identifier: GPL-2.0-or-later
> >> >+/*
> >> >+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
> >> >+ *
> >> >+ * Copyright (c) 2022-2023, Mojatatu Networks
> >> >+ * Copyright (c) 2022-2023, Intel Corporation.
> >> >+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
> >> >+ * Victor Nogueira <victor@mojatatu.com>
> >> >+ * Pedro Tammela <pctammela@mojatatu.com>
> >> >+ */
> >> >+
> >> >+#include <linux/err.h>
> >> >+#include <linux/errno.h>
> >> >+#include <linux/init.h>
> >> >+#include <linux/kernel.h>
> >> >+#include <linux/kmod.h>
> >> >+#include <linux/list.h>
> >> >+#include <linux/module.h>
> >> >+#include <linux/netdevice.h>
> >> >+#include <linux/skbuff.h>
> >> >+#include <linux/slab.h>
> >> >+#include <linux/string.h>
> >> >+#include <linux/types.h>
> >> >+#include <net/flow_offload.h>
> >> >+#include <net/net_namespace.h>
> >> >+#include <net/netlink.h>
> >> >+#include <net/pkt_cls.h>
> >> >+#include <net/p4tc.h>
> >> >+#include <net/sch_generic.h>
> >> >+#include <net/sock.h>
> >> >+#include <net/tc_act/p4tc.h>
> >> >+
> >> >+static LIST_HEAD(dynact_list);
> >> >+
> >> >+#define SEPARATOR "/"
> >>
> >> Prefix? Btw, why exactly do you need this. It is used only once.
> >>
> >
> >We'll get rid of it.
> >
> >> To quote a few function names in this file:
> >>
> >> >+static void set_param_indices(struct idr *params_idr)
> >> >+static void generic_free_param_value(struct p4tc_act_param *param)
> >> >+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
> >> >+static void dev_free_param_value(struct p4tc_act_param *param)
> >> >+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
> >> >+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
> >> >+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
> >> >+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
> >> >+static void p4tc_param_put(struct p4tc_act_param *param)
> >> >+static void free_intermediate_param(struct p4tc_act_param *param)
> >> >+static void free_intermediate_params_list(struct list_head *params_list)
> >> >+static int init_prealloc_params(struct p4tc_act *act,
> >> >+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
> >> >+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
> >> >+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
> >> >+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
> >> >+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
> >> >+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
> >> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
> >> >+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
> >> >+static int generic_init_param_value(struct p4tc_act_param *nparam,
> >> >+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
> >> >+tcf_param_find_byany(struct p4tc_act *act,
> >> >+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
> >> >+static int __p4_init_param_type(struct p4tc_act_param *param,
> >> >+static int tcf_p4_act_init_params(struct net *net,
> >> >+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
> >> >+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
> >> >+static int tcf_act_fill_param_type(struct sk_buff *skb,
> >> >+static void tcf_p4_dyna_cleanup(struct tc_action *a)
> >> >+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
> >> >+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
> >> >+static void p4_put_many_params(struct idr *params_idr)
> >> >+static int p4_init_param_type(struct p4tc_act_param *param,
> >> >+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
> >> >+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
> >> >+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
> >> >+static void p4tc_action_net_exit(struct tc_action_net *tn)
> >> >+static void p4_act_params_put(struct p4tc_act *act)
> >> >+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
> >> >+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
> >> >+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
> >> >+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
> >> >+static void p4tc_params_replace_many(struct p4tc_act *act,
> >> >+ struct idr *params_idr)
> >> >+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
> >> >+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
> >>
> >> Is there some secret key how you name the functions? To me, this looks
> >> completely inconsistent :/
> >
> >What would be better? tcf_p4_xxxx?
>
> Idk, up to you, just please maintain some basic naming consistency and
> prefixes.
>
We'll come up with something.
cheers,
jamal
>
> >A lot of the tcf_xxx is because that convention is used in that file
> >but we can change it.
> >
> >cheers,
> >jamal
> >>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-20 8:22 ` Jiri Pirko
@ 2023-11-20 14:02 ` Jamal Hadi Salim
2023-11-20 16:27 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 14:02 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Mon, Nov 20, 2023 at 3:22 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 17, 2023 at 01:14:43PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 16, 2023 at 11:42 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
> >>
> >> [...]
> >>
> >> > include/net/p4tc.h | 161 +++
> >> > include/net/p4tc_ext_api.h | 199 +++
> >> > include/uapi/linux/p4tc.h | 61 +
> >> > include/uapi/linux/p4tc_ext.h | 36 +
> >> > net/sched/p4tc/Makefile | 2 +-
> >> > net/sched/p4tc/p4tc_bpf.c | 79 +-
> >> > net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
> >> > net/sched/p4tc/p4tc_pipeline.c | 34 +-
> >> > net/sched/p4tc/p4tc_runtime_api.c | 10 +-
> >> > net/sched/p4tc/p4tc_table.c | 57 +-
> >> > net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
> >> > net/sched/p4tc/p4tc_tmpl_api.c | 4 +
> >> > net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
> >> > 13 files changed, 5083 insertions(+), 10 deletions(-)
> >>
> >> This is for this patch. Now for the whole patchset you have:
> >> 30 files changed, 16676 insertions(+), 39 deletions(-)
> >>
> >> I understand that you want to fit into 15 patches with all the work.
> >> But sorry, patches like this are unreviewable. My suggestion is to split
> >> the patchset into multiple ones including smaller patches and allow
> >> people to digest this. I don't believe that anyone can seriously stand
> >> to review a patch with more than 200 lines changes.
> >
> >This specific patch is not difficult to split into two. I can do that
> >and send out minus the first 8 trivial patches - but not familiar with
> >how to do "here's part 1 of the patches" and "here's patchset two".
>
> Split into multiple patchsets and send one by one. No need to have all
> in at once.
>
>
> >There's dependency between them so not clear how patchwork and
>
> What dependency. It should compile. Introduce some basic functionality
> first and extend it incrementally with other patchsets. The usual way.
>
Sorry, still not following:
Let's say I split the current patchset 1 with patches 1-8 (which are
trivial and have been reviewed) and then make the rest into patchset 2
with a new set 1-8. I don't see how patchset 2 compiles unless it has
access to code from patchset 1. Unless patchset 1 is merged I don't see
how this works with patchwork or reviewers. Am I missing something?
cheers,
jamal
>
> >reviewers would deal with it. Thoughts?
> >
> >Note: The code machinery is really repeatable; for example if you look
> >at the tables control you will see very similar patterns to actions
> >etc. i.e spending time to review one will make it easy for the rest.
> >
> >cheers,
> >jamal
> >
> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 9:39 ` Jiri Pirko
@ 2023-11-20 14:23 ` Jamal Hadi Salim
2023-11-20 18:10 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 14:23 UTC (permalink / raw)
To: Jiri Pirko
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >>
> >> Jamal Hadi Salim wrote:
> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >> > >
> >> > > Jamal Hadi Salim wrote:
>
> [...]
>
>
> >>
> >> I think I'm judging the technical work here. Bullet points.
> >>
> >> 1. p4c-tc implementation looks like it should be slower than a
> >> in terms of pkts/sec than a bpf implementation. Meaning
> >> I suspect pipeline and objects laid out like this will lose
> >> to a BPF program with an parser and single lookup. The p4c-ebpf
> >> compiler should look to create optimized EBPF code not some
> >> emulated switch topology.
> >>
> >
> >The parser is ebpf based. The other objects which require control
> >plane interaction are not - those interact via netlink.
> >We published perf data a while back - presented at the P4 workshop
> >back in April (was in the cover letter)
> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >But do note: the correct abstraction is the first priority.
> >Optimization is something we can teach the compiler over time. But
> >even with the minimalist code generation you can see that our approach
> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
>
> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> this kinds of lookups? I mean in theory, eBPF should be always faster.
We didn't look closely; however, that is not the point - the point is
that the perf difference, if there is one, is not big, with the big win being
proper P4 abstraction. For LPM, for sure our algorithmic approach is
better. For ternary, the compute intensity in looping is better done in
C. And for exact matches I believe that ebpf uses better hashing.
Again, that is not the point we were trying to validate in those experiments.
On your point of "maps are not that suitable", P4 tables tend to have
very specific attributes (for example associated meters, counters,
default hit and miss actions, etc).
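
For context, the stock eBPF LPM map being compared against is declared on the
BPF program side roughly as follows (a generic libbpf-style example, not p4c-tc
compiler output; the map and key names are made up):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct lpm_v4_key {
            __u32 prefixlen;        /* LPM keys must start with the prefix length */
            __u32 addr;             /* IPv4 address, network byte order */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_LPM_TRIE);
            __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
            __uint(max_entries, 1024);
            __type(key, struct lpm_v4_key);
            __type(value, __u32);                   /* e.g. an action id */
    } lpm_tbl SEC(".maps");

    /* datapath lookup: __u32 *act = bpf_map_lookup_elem(&lpm_tbl, &key); */
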
> >we can optimize over time.
> >Your view of "single lookup" is true for simple programs but if you
> >have 10 tables trying to model a 5G function then it doesnt make sense
> >(and i think the data we published was clear that you gain no
> >advantage using ebpf - as a matter of fact there was no perf
> >difference between XDP and tc in such cases).
> >
> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >> map. Doing a simple update vs a netlink msg. The argument
> >> that BPF can't do CRUD (which we had offlist) seems incorrect
> >> to me. Correct me if I'm wrong with details about why.
> >>
> >
> >So let me see....
> >you want me to replace netlink and all its features and rewrite it
> >using the ebpf system calls? Congestion control, event handling,
> >arbitrary message crafting, etc and the years of work that went into
> >netlink? NO to the HELL.
>
> Wait, I don't think John suggests anything like that. He just suggests
> to have the tables as eBPF maps.
What's the difference? Unless maps can do netlink.
> Honestly, I don't understand the
> fixation on netlink. Its socket messaging, memcpies, processing
> overhead, etc can't keep up with mmaped memory access at scale. Measure
> that and I bet you'll get drastically different results.
>
> I mean, netlink is good for a lot of things, but does not mean it is an
> universal answer to userspace<->kernel data passing.
Here's a small sample of our requirements that are satisfied by
netlink for P4 object hierarchy[1]:
1. Msg construction/parsing
2. Multi-user request/response messaging
3. Multi-user event subscribe/publish messaging
I don't think I need to provide an explanation of the differences here
vis-a-vis what ebpf system calls provide vs what netlink provides and
how netlink is a clear fit. If it is not clear I can give a more detailed
breakdown. And of course there's more, but the above is a good sample.
The part that is taken for granted is the control plane code and
interaction, which is an extremely important detail. The P4 abstraction
requires hierarchies with different compiler-generated encoded path
ids, etc. This ID mapping gets exacerbated by having multitudes of P4
programs which have different requirements. Netlink is a natural fit
for this P4 abstraction. Not to mention the netlink/tc path (and in
particular the ID mapping) provides a conduit for offload when that is
needed.
eBPF is just a tool - and the objects are intended to be generic - and
I don't see how any of this could be achieved without retooling to make
it more specific to P4.
cheers,
jamal
>
> >I should note: that there was an interesting talk at netdevconf 0x17
> >where the speaker showed the challenges of dealing with ebpf on "day
> >two" - slides or videos are not up yet, but link is:
> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >The point the speaker was making is it's always easy to whip an ebpf
> >program that can slice and dice packets and maybe even flush LEDs but
> >the real work and challenge is in the control plane. I agree with the
> >speaker based on my experiences. This discussion of replacing netlink
> >with ebpf system calls is absolutely a non-starter. Let's just end the
> >discussion and agree to disagree if you are going to keep insisting on
> >that.
>
>
> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-20 13:16 ` Jiri Pirko
@ 2023-11-20 15:30 ` Jamal Hadi Salim
2023-11-20 16:25 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 15:30 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
On Mon, Nov 20, 2023 at 8:16 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 01:48:14PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 3:18 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Fri, Nov 17, 2023 at 01:09:45PM CET, jhs@mojatatu.com wrote:
> >> >On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
> >> >>
> >> >> [...]
> >> >>
> >> >>
> >> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >> >> >index ba32dba66..4d33f44c1 100644
> >> >> >--- a/include/uapi/linux/p4tc.h
> >> >> >+++ b/include/uapi/linux/p4tc.h
> >> >> >@@ -2,8 +2,71 @@
> >> >> > #ifndef __LINUX_P4TC_H
> >> >> > #define __LINUX_P4TC_H
> >> >> >
> >> >> >+#include <linux/types.h>
> >> >> >+#include <linux/pkt_sched.h>
> >> >> >+
> >> >> >+/* pipeline header */
> >> >> >+struct p4tcmsg {
> >> >> >+ __u32 pipeid;
> >> >> >+ __u32 obj;
> >> >> >+};
> >> >>
> >> >> I don't follow. Is there any sane reason to use header instead of normal
> >> >> netlink attribute? Moveover, you extend the existing RT netlink with
> >> >> a huge amout of p4 things. Isn't this the good time to finally introduce
> >> >> generic netlink TC family with proper yaml spec with all the benefits it
> >> >> brings and implement p4 tc uapi there? Please?
> >> >>
> >> >
> >> >Several reasons:
> >> >a) We are similar to current tc messaging with the subheader being
> >> >there for multiplexing.
> >>
> >> Yeah, you don't need to carry 20year old burden in newly introduced
> >> interface. That's my point.
> >
> >Having a demux sub header is 20 year old burden? I didnt follow.
>
> You don't need the header, that's my point.
>
Let me see if I understand you:
We have multiple object types per pipeline - this info is _omnipresent
and it is never going to change_.
Your view is: have a hierarchy of attributes and put this subheader in
probably one attribute at the root.
You parse the root, you find the obj and pipeid, and then you use that
to parse the rest of the per-object specific
attributes?
I don't know if a hierarchical attribute layout gives you any advantage
over the subheader approach - unless we figure out a way to annotate
attributes as "optional" vs "must be present". I agree that getting
the validation for free is a bonus.
> >
> >>
> >> >b) Where does this leave iproute2? +Cc David and Stephen. Do other
> >> >generic netlink conversions get contributed back to iproute2?
> >>
> >> There is no conversion afaik, only extensions. And they has to be,
> >> otherwise the user would not be able to use the newly introduced
> >> features.
> >
> >The big question is does the collective who use iproute2 still get to
> >use the same tooling or now they have to go and learn some new
> >tooling. I understand the value of the new approach but is it a
> >revolution or an evolution? We opted to put thing in iproute2 instead
> >for example because that is widely available (and used).
>
> I don't see why iproute2 user facing interface would be any different
> depending on if you user RTnetlink or genetlink as backend channel...
>
iproute2 supports plenty of genetlink already.
We need to find a way to have the best of both worlds.
>
> >
> >>
> >> >c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
> >> >based. i.e you have:
> >> > COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
> >> >P4 programs from the control plane.
> >>
> >> I'm pretty sure you can achieve the same over genetlink.
> >>
> >
> >I think you are right.
> >
> >>
> >> >d) we have spent many hours optimizing the control to the kernel so i
> >> >am not sure what it would buy us to switch to generic netlink..
> >>
> >> All the benefits of ynl yaml tooling, at least.
> >>
> >
> >Did you pay close attention to what we have? The user space code is
> >written once into iproute2 and subsequent to that there is no
> >recompilation of any iproute2 code. The compiler generates a json
> >file specific to a P4 program which is then introspected by the
> >iproute2 code.
>
> Right, but in real life, netlink is used directly by many apps. I don't
> see why this is any different.
>
Not sure if you were referring to what I said about the json file or
something else. The main value is not just kernel independence but
also iproute2 independence, i.e. no need to compile any code.
> Plus, the very best part of yaml from user perpective I see is,
> you just need the kernel-git yaml file and you can submit all commands.
> No userspace implementation needed.
Two different tacks: I can see this as being developer friendly (and
we are more trying to be operator friendly).
I need to take a closer look. Sounds like it should be polyglot
friendly as well. If I am not mistaken you still have to compile code
as a result of generation from the yaml?
cheers,
jamal
>
> >
> >
> >cheers,
> >jamal
> >
> >>
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >+
> >> >> >+#define P4TC_MAXPIPELINE_COUNT 32
> >> >> >+#define P4TC_MAXTABLES_COUNT 32
> >> >> >+#define P4TC_MINTABLES_COUNT 0
> >> >> >+#define P4TC_MSGBATCH_SIZE 16
> >> >> >+
> >> >> > #define P4TC_MAX_KEYSZ 512
> >> >> >
> >> >> >+#define TEMPLATENAMSZ 32
> >> >> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
> >> >>
> >> >> ugh. A prefix please?
> >> >>
> >> >> pw-bot: cr
> >> >>
> >> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-20 15:30 ` Jamal Hadi Salim
@ 2023-11-20 16:25 ` Jiri Pirko
0 siblings, 0 replies; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 16:25 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
Mon, Nov 20, 2023 at 04:30:11PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 8:16 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Mon, Nov 20, 2023 at 01:48:14PM CET, jhs@mojatatu.com wrote:
>> >On Mon, Nov 20, 2023 at 3:18 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Fri, Nov 17, 2023 at 01:09:45PM CET, jhs@mojatatu.com wrote:
>> >> >On Thu, Nov 16, 2023 at 11:11 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Thu, Nov 16, 2023 at 03:59:42PM CET, jhs@mojatatu.com wrote:
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >>
>> >> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>> >> >> >index ba32dba66..4d33f44c1 100644
>> >> >> >--- a/include/uapi/linux/p4tc.h
>> >> >> >+++ b/include/uapi/linux/p4tc.h
>> >> >> >@@ -2,8 +2,71 @@
>> >> >> > #ifndef __LINUX_P4TC_H
>> >> >> > #define __LINUX_P4TC_H
>> >> >> >
>> >> >> >+#include <linux/types.h>
>> >> >> >+#include <linux/pkt_sched.h>
>> >> >> >+
>> >> >> >+/* pipeline header */
>> >> >> >+struct p4tcmsg {
>> >> >> >+ __u32 pipeid;
>> >> >> >+ __u32 obj;
>> >> >> >+};
>> >> >>
>> >> >> I don't follow. Is there any sane reason to use header instead of normal
>> >> >> netlink attribute? Moveover, you extend the existing RT netlink with
>> >> >> a huge amout of p4 things. Isn't this the good time to finally introduce
>> >> >> generic netlink TC family with proper yaml spec with all the benefits it
>> >> >> brings and implement p4 tc uapi there? Please?
>> >> >>
>> >> >
>> >> >Several reasons:
>> >> >a) We are similar to current tc messaging with the subheader being
>> >> >there for multiplexing.
>> >>
>> >> Yeah, you don't need to carry 20year old burden in newly introduced
>> >> interface. That's my point.
>> >
>> >Having a demux sub header is 20 year old burden? I didnt follow.
>>
>> You don't need the header, that's my point.
>>
>
>Let me see if i understand you:
>We have multiple object types per pipeline - this info is _omni
>present and it is never going to change_.
>Your view is, have a hierarchy of attributes and put this subheader in
>probably one attribute at the root.
That or use genetlink to have per-cmd attributes.
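
For concreteness, a sketch of the attribute-based layout being suggested; the
attribute names are invented for illustration and are not proposed uapi.

    #include <net/netlink.h>

    enum {
            P4TC_EX_A_UNSPEC,
            P4TC_EX_A_PIPEID,       /* u32, would replace p4tcmsg.pipeid */
            P4TC_EX_A_OBJ,          /* u32, would replace p4tcmsg.obj    */
            __P4TC_EX_A_MAX,
    };
    #define P4TC_EX_A_MAX (__P4TC_EX_A_MAX - 1)

    static const struct nla_policy p4tc_ex_policy[P4TC_EX_A_MAX + 1] = {
            [P4TC_EX_A_PIPEID] = { .type = NLA_U32 },
            [P4TC_EX_A_OBJ]    = { .type = NLA_U32 },
    };
    /* with genetlink, the core validates these per command against the policy,
     * so no fixed sub-header has to be parsed by hand */
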
>You parse the root, you find the obj and pipeid and then you use that
>to parse the rest of the per-object specific
>attributes?
>
>I dont know if a hierarchical attribute layout gives you any advantage
>over the subheader approach - unless we figure a way to annotate
>attributes as "optional" vs "must be present". I agree that getting
>the validation for free is a bonus ..
>
>
>> >
>> >>
>> >> >b) Where does this leave iproute2? +Cc David and Stephen. Do other
>> >> >generic netlink conversions get contributed back to iproute2?
>> >>
>> >> There is no conversion afaik, only extensions. And they has to be,
>> >> otherwise the user would not be able to use the newly introduced
>> >> features.
>> >
>> >The big question is does the collective who use iproute2 still get to
>> >use the same tooling or now they have to go and learn some new
>> >tooling. I understand the value of the new approach but is it a
>> >revolution or an evolution? We opted to put thing in iproute2 instead
>> >for example because that is widely available (and used).
>>
>> I don't see why the iproute2 user-facing interface would be any different
>> depending on whether you use RT netlink or genetlink as the backend channel...
>>
>
>iproute2 supports plenty of genetlink already.
>We need to find a way to have the best of both worlds.
>
>>
>> >
>> >>
>> >> >c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
>> >> >based. i.e you have:
>> >> > COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
>> >> >P4 programs from the control plane.
>> >>
>> >> I'm pretty sure you can achieve the same over genetlink.
>> >>
>> >
>> >I think you are right.
>> >
>> >>
>> >> >d) we have spent many hours optimizing the control to the kernel so i
>> >> >am not sure what it would buy us to switch to generic netlink..
>> >>
>> >> All the benefits of ynl yaml tooling, at least.
>> >>
>> >
>> >Did you pay close attention to what we have? The user space code is
>> >written once into iproute2 and subsequent to that there is no
>> >recompilation of any iproute2 code. The compiler generates a json
>> >file specific to a P4 program which is then introspected by the
>> >iproute2 code.
>>
>> Right, but in real life, netlink is used directly by many apps. I don't
>> see why this is any different.
>>
>
>Not sure if you were referring to what I said about the json file or
>something else. The main value is not just kernel independence but
>also iproute2 independence, i.e. no need to compile any code.
>
>> Plus, the very best part of yaml from a user perspective, as I see it, is
>> that you just need the kernel-git yaml file and you can submit all commands.
>> No userspace implementation needed.
>
>Two different tacks: I can see this as being developer friendly (and
>we are more trying to be operator friendly).
>I need to take a closer look. Sounds like it should be polyglot
>friendly as well. If I am not mistaken, you still have to compile code
>as a result of generation from the yaml?
Nope, you can run ynl.py and let it parse the yaml on the fly.
>
>cheers,
>jamal
>
>>
>> >
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >>
>> >> >> >+
>> >> >> >+#define P4TC_MAXPIPELINE_COUNT 32
>> >> >> >+#define P4TC_MAXTABLES_COUNT 32
>> >> >> >+#define P4TC_MINTABLES_COUNT 0
>> >> >> >+#define P4TC_MSGBATCH_SIZE 16
>> >> >> >+
>> >> >> > #define P4TC_MAX_KEYSZ 512
>> >> >> >
>> >> >> >+#define TEMPLATENAMSZ 32
>> >> >> >+#define PIPELINENAMSIZ TEMPLATENAMSZ
>> >> >>
>> >> >> ugh. A prefix please?
>> >> >>
>> >> >> pw-bot: cr
>> >> >>
>> >> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump
2023-11-20 13:45 ` Jamal Hadi Salim
@ 2023-11-20 16:25 ` Jiri Pirko
0 siblings, 0 replies; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 16:25 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Mon, Nov 20, 2023 at 02:45:24PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 3:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Fri, Nov 17, 2023 at 04:11:29PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 16, 2023 at 11:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 16, 2023 at 03:59:43PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> [...]
>> >>
>> >>
>> >> >diff --git a/include/net/act_api.h b/include/net/act_api.h
>> >> >index cd5a8e86f..b95a9bc29 100644
>> >> >--- a/include/net/act_api.h
>> >> >+++ b/include/net/act_api.h
>> >> >@@ -70,6 +70,7 @@ struct tc_action {
>> >> > #define TCA_ACT_FLAGS_AT_INGRESS (1U << (TCA_ACT_FLAGS_USER_BITS + 4))
>> >> > #define TCA_ACT_FLAGS_PREALLOC (1U << (TCA_ACT_FLAGS_USER_BITS + 5))
>> >> > #define TCA_ACT_FLAGS_UNREFERENCED (1U << (TCA_ACT_FLAGS_USER_BITS + 6))
>> >> >+#define TCA_ACT_FLAGS_FROM_P4TC (1U << (TCA_ACT_FLAGS_USER_BITS + 7))
>> >> >
>> >> > /* Update lastuse only if needed, to avoid dirtying a cache line.
>> >> > * We use a temp variable to avoid fetching jiffies twice.
>> >> >diff --git a/include/net/p4tc.h b/include/net/p4tc.h
>> >> >index ccb54d842..68b00fa72 100644
>> >> >--- a/include/net/p4tc.h
>> >> >+++ b/include/net/p4tc.h
>> >> >@@ -9,17 +9,23 @@
>> >> > #include <linux/refcount.h>
>> >> > #include <linux/rhashtable.h>
>> >> > #include <linux/rhashtable-types.h>
>> >> >+#include <net/tc_act/p4tc.h>
>> >> >+#include <net/p4tc_types.h>
>> >> >
>> >> > #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
>> >> > #define P4TC_DEFAULT_MAX_RULES 1
>> >> > #define P4TC_PATH_MAX 3
>> >> >+#define P4TC_MAX_TENTRIES 33554432
>> >>
>> >> Seeing define like this one always makes me happier. Where does it come
>> >> from? Why not 0x2000000 at least?
>> >
>> >I dont recall why we decided to do decimal - will change it.
>> >
>> >>
>> >> >
>> >> > #define P4TC_KERNEL_PIPEID 0
>> >> >
>> >> > #define P4TC_PID_IDX 0
>> >> >+#define P4TC_AID_IDX 1
>> >> >+#define P4TC_PARSEID_IDX 1
>> >> >
>> >> > struct p4tc_dump_ctx {
>> >> > u32 ids[P4TC_PATH_MAX];
>> >> >+ struct rhashtable_iter *iter;
>> >> > };
>> >> >
>> >> > struct p4tc_template_common;
>> >> >@@ -63,8 +69,10 @@ extern const struct p4tc_template_ops p4tc_pipeline_ops;
>> >> >
>> >> > struct p4tc_pipeline {
>> >> > struct p4tc_template_common common;
>> >> >+ struct idr p_act_idr;
>> >> > struct rcu_head rcu;
>> >> > struct net *net;
>> >> >+ u32 num_created_acts;
>> >> > /* Accounts for how many entities are referencing this pipeline.
>> >> > * As for now only P4 filters can refer to pipelines.
>> >> > */
>> >> >@@ -109,18 +117,157 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
>> >> > const u32 pipeid,
>> >> > struct netlink_ext_ack *extack);
>> >> >
>> >> >+struct p4tc_act *tcf_p4_find_act(struct net *net,
>> >> >+ const struct tc_action_ops *a_o,
>> >> >+ struct netlink_ext_ack *extack);
>> >> >+void
>> >> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4_act);
>> >> >+
>> >> > static inline int p4tc_action_destroy(struct tc_action **acts)
>> >> > {
>> >> >+ struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
>> >> > int ret = 0;
>> >> >
>> >> > if (acts) {
>> >> >- ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
>> >> >+ int j = 0;
>> >> >+ int i;
>> >>
>> >> Move declarations to the beginning of the if body.
>> >>
>> >
>> >Didnt follow - which specific declaration?
>>
>> It should look like this:
>>
>> int j = 0;
>> int i;
>>
>> ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
>
>Huh? It does look like that already ;-> Note there's a "-" on that line.
I'm blind. Sorry.
>
>>
>> >
>> >> [...]
>> >>
>> >>
>> >> >diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>> >> >index 4d33f44c1..7b89229a7 100644
>> >> >--- a/include/uapi/linux/p4tc.h
>> >> >+++ b/include/uapi/linux/p4tc.h
>> >> >@@ -4,6 +4,7 @@
>> >> >
>> >> > #include <linux/types.h>
>> >> > #include <linux/pkt_sched.h>
>> >> >+#include <linux/pkt_cls.h>
>> >> >
>> >> > /* pipeline header */
>> >> > struct p4tcmsg {
>> >> >@@ -17,9 +18,12 @@ struct p4tcmsg {
>> >> > #define P4TC_MSGBATCH_SIZE 16
>> >> >
>> >> > #define P4TC_MAX_KEYSZ 512
>> >> >+#define P4TC_DEFAULT_NUM_PREALLOC 16
>> >> >
>> >> > #define TEMPLATENAMSZ 32
>> >> > #define PIPELINENAMSIZ TEMPLATENAMSZ
>> >> >+#define ACTTMPLNAMSIZ TEMPLATENAMSZ
>> >> >+#define ACTPARAMNAMSIZ TEMPLATENAMSZ
>> >>
>> >> Prefix? This is uapi. Could you please be more careful with naming at
>> >> least in the uapi area?
>> >
>> >Good point.
>> >
>> >>
>> >> [...]
>> >>
>> >>
>> >> >diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
>> >> >new file mode 100644
>> >> >index 000000000..19db0772c
>> >> >--- /dev/null
>> >> >+++ b/net/sched/p4tc/p4tc_action.c
>> >> >@@ -0,0 +1,2242 @@
>> >> >+// SPDX-License-Identifier: GPL-2.0-or-later
>> >> >+/*
>> >> >+ * net/sched/p4tc_action.c P4 TC ACTION TEMPLATES
>> >> >+ *
>> >> >+ * Copyright (c) 2022-2023, Mojatatu Networks
>> >> >+ * Copyright (c) 2022-2023, Intel Corporation.
>> >> >+ * Authors: Jamal Hadi Salim <jhs@mojatatu.com>
>> >> >+ * Victor Nogueira <victor@mojatatu.com>
>> >> >+ * Pedro Tammela <pctammela@mojatatu.com>
>> >> >+ */
>> >> >+
>> >> >+#include <linux/err.h>
>> >> >+#include <linux/errno.h>
>> >> >+#include <linux/init.h>
>> >> >+#include <linux/kernel.h>
>> >> >+#include <linux/kmod.h>
>> >> >+#include <linux/list.h>
>> >> >+#include <linux/module.h>
>> >> >+#include <linux/netdevice.h>
>> >> >+#include <linux/skbuff.h>
>> >> >+#include <linux/slab.h>
>> >> >+#include <linux/string.h>
>> >> >+#include <linux/types.h>
>> >> >+#include <net/flow_offload.h>
>> >> >+#include <net/net_namespace.h>
>> >> >+#include <net/netlink.h>
>> >> >+#include <net/pkt_cls.h>
>> >> >+#include <net/p4tc.h>
>> >> >+#include <net/sch_generic.h>
>> >> >+#include <net/sock.h>
>> >> >+#include <net/tc_act/p4tc.h>
>> >> >+
>> >> >+static LIST_HEAD(dynact_list);
>> >> >+
>> >> >+#define SEPARATOR "/"
>> >>
>> >> Prefix? Btw, why exactly do you need this. It is used only once.
>> >>
>> >
>> >We'll get rid of it.
>> >
>> >> To quote a few function names in this file:
>> >>
>> >> >+static void set_param_indices(struct idr *params_idr)
>> >> >+static void generic_free_param_value(struct p4tc_act_param *param)
>> >> >+static int dev_init_param_value(struct net *net, struct p4tc_act_param_ops *op,
>> >> >+static void dev_free_param_value(struct p4tc_act_param *param)
>> >> >+static void tcf_p4_act_params_destroy_rcu(struct rcu_head *head)
>> >> >+static int __tcf_p4_dyna_init_set(struct p4tc_act *act, struct tc_action **a,
>> >> >+static int tcf_p4_dyna_template_init(struct net *net, struct tc_action **a,
>> >> >+init_prealloc_param(struct p4tc_act *act, struct idr *params_idr,
>> >> >+static void p4tc_param_put(struct p4tc_act_param *param)
>> >> >+static void free_intermediate_param(struct p4tc_act_param *param)
>> >> >+static void free_intermediate_params_list(struct list_head *params_list)
>> >> >+static int init_prealloc_params(struct p4tc_act *act,
>> >> >+struct p4tc_act *p4tc_action_find_byid(struct p4tc_pipeline *pipeline,
>> >> >+static void tcf_p4_prealloc_list_add(struct p4tc_act *act_tmpl,
>> >> >+static int tcf_p4_prealloc_acts(struct net *net, struct p4tc_act *act,
>> >> >+tcf_p4_get_next_prealloc_act(struct p4tc_act *act)
>> >> >+void tcf_p4_set_init_flags(struct tcf_p4act *p4act)
>> >> >+static void __tcf_p4_put_prealloc_act(struct p4tc_act *act,
>> >> >+tcf_p4_put_prealloc_act(struct p4tc_act *act, struct tcf_p4act *p4act)
>> >> >+static int generic_dump_param_value(struct sk_buff *skb, struct p4tc_type *type,
>> >> >+static int generic_init_param_value(struct p4tc_act_param *nparam,
>> >> >+static struct p4tc_act_param *param_find_byname(struct idr *params_idr,
>> >> >+tcf_param_find_byany(struct p4tc_act *act,
>> >> >+tcf_param_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
>> >> >+static int __p4_init_param_type(struct p4tc_act_param *param,
>> >> >+static int tcf_p4_act_init_params(struct net *net,
>> >> >+static struct p4tc_act *p4tc_action_find_byname(const char *act_name,
>> >> >+static int tcf_p4_dyna_init(struct net *net, struct nlattr *nla,
>> >> >+static int tcf_act_fill_param_type(struct sk_buff *skb,
>> >> >+static void tcf_p4_dyna_cleanup(struct tc_action *a)
>> >> >+struct p4tc_act *p4tc_action_find_get(struct p4tc_pipeline *pipeline,
>> >> >+p4tc_action_find_byanyattr(struct nlattr *act_name_attr, const u32 a_id,
>> >> >+static void p4_put_many_params(struct idr *params_idr)
>> >> >+static int p4_init_param_type(struct p4tc_act_param *param,
>> >> >+static struct p4tc_act_param *p4_create_param(struct p4tc_act *act,
>> >> >+static struct p4tc_act_param *p4_update_param(struct p4tc_act *act,
>> >> >+static struct p4tc_act_param *p4_act_init_param(struct p4tc_act *act,
>> >> >+static void p4tc_action_net_exit(struct tc_action_net *tn)
>> >> >+static void p4_act_params_put(struct p4tc_act *act)
>> >> >+static int __tcf_act_put(struct net *net, struct p4tc_pipeline *pipeline,
>> >> >+static int _tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>> >> >+static int tcf_act_fill_nlmsg(struct net *net, struct sk_buff *skb,
>> >> >+static int tcf_act_flush(struct sk_buff *skb, struct net *net,
>> >> >+static void p4tc_params_replace_many(struct p4tc_act *act,
>> >> >+ struct idr *params_idr)
>> >> >+static struct p4tc_act *tcf_act_create(struct net *net, struct nlattr **tb,
>> >> >+tcf_act_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
>> >>
>> >> Is there some secret key how you name the functions? To me, this looks
>> >> completely inconsistent :/
>> >
>> >What would be better? tcf_p4_xxxx?
>>
>> Idk, up to you, just please maintain some basic naming consistency and
>> prefixes.
>>
>
>We'll come up with something.
>
>cheers,
>jamal
>
>>
>> >A lot of the tcf_xxx is because that convention is used in that file
>> >but we can change it.
>> >
>> >cheers,
>> >jamal
>> >>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-20 14:02 ` Jamal Hadi Salim
@ 2023-11-20 16:27 ` Jiri Pirko
2023-11-20 19:00 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 16:27 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
Mon, Nov 20, 2023 at 03:02:49PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 3:22 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Fri, Nov 17, 2023 at 01:14:43PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 16, 2023 at 11:42 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> [...]
>> >>
>> >> > include/net/p4tc.h | 161 +++
>> >> > include/net/p4tc_ext_api.h | 199 +++
>> >> > include/uapi/linux/p4tc.h | 61 +
>> >> > include/uapi/linux/p4tc_ext.h | 36 +
>> >> > net/sched/p4tc/Makefile | 2 +-
>> >> > net/sched/p4tc/p4tc_bpf.c | 79 +-
>> >> > net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
>> >> > net/sched/p4tc/p4tc_pipeline.c | 34 +-
>> >> > net/sched/p4tc/p4tc_runtime_api.c | 10 +-
>> >> > net/sched/p4tc/p4tc_table.c | 57 +-
>> >> > net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
>> >> > net/sched/p4tc/p4tc_tmpl_api.c | 4 +
>> >> > net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
>> >> > 13 files changed, 5083 insertions(+), 10 deletions(-)
>> >>
>> >> This is for this patch. Now for the whole patchset you have:
>> >> 30 files changed, 16676 insertions(+), 39 deletions(-)
>> >>
>> >> I understand that you want to fit into 15 patches with all the work.
>> >> But sorry, patches like this are unreviewable. My suggestion is to split
>> >> the patchset into multiple ones including smaller patches and allow
>> >> people to digest this. I don't believe that anyone can seriously stand
>> >> to review a patch with more than 200 lines changes.
>> >
>> >This specific patch is not difficult to split into two. I can do that
>> >and send out minus the first 8 trivial patches - but not familiar with
>> >how to do "here's part 1 of the patches" and "here's patchset two".
>>
>> Split into multiple patchsets and send one by one. No need to have all
>> in at once.
>>
>>
>> >There's dependency between them so not clear how patchwork and
>>
>> What dependency. It should compile. Introduce some basic functionality
>> first and extend it incrementally with other patchsets. The usual way.
>>
>
>Sorry, still not following:
>Let's say I split the current patchset 1 with patches 1-8 (which are
>trivial and have been reviewed) and then make the rest into patchset 2
>with a new set 1-8. I don't see how patchset 2 compiles unless it has
>access to code from patchset 1. Unless patchset 1 is merged I don't see
>how this works with patchwork or reviewers. Am I missing something?
Why would it not work? Describe your motivation and plans and submit
part of the work, then the rest later on. No problem.
>
>cheers,
>jamal
>
>>
>> >reviewers would deal with it. Thoughts?
>> >
>> >Note: The code machinery is really repeatable; for example if you look
>> >at the tables control you will see very similar patterns to actions
>> >etc. i.e spending time to review one will make it easy for the rest.
>> >
>> >cheers,
>> >jamal
>> >
>> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 14:23 ` Jamal Hadi Salim
@ 2023-11-20 18:10 ` Jiri Pirko
2023-11-20 19:56 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-20 18:10 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>> >>
>> >> Jamal Hadi Salim wrote:
>> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>> >> > >
>> >> > > Jamal Hadi Salim wrote:
>>
>> [...]
>>
>>
>> >>
>> >> I think I'm judging the technical work here. Bullet points.
>> >>
>> >> 1. p4c-tc implementation looks like it should be slower than a
>> >> in terms of pkts/sec than a bpf implementation. Meaning
>> >> I suspect pipeline and objects laid out like this will lose
>> >> to a BPF program with an parser and single lookup. The p4c-ebpf
>> >> compiler should look to create optimized EBPF code not some
>> >> emulated switch topology.
>> >>
>> >
>> >The parser is ebpf based. The other objects which require control
>> >plane interaction are not - those interact via netlink.
>> >We published perf data a while back - presented at the P4 workshop
>> >back in April (was in the cover letter)
>> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> >But do note: the correct abstraction is the first priority.
>> >Optimization is something we can teach the compiler over time. But
>> >even with the minimalist code generation you can see that our approach
>> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
>>
>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
>> this kinds of lookups? I mean in theory, eBPF should be always faster.
>
>We didn't look closely; however, that is not the point - the point is
>that the perf difference, if there is one, is not big, with the big win being
>proper P4 abstraction. For LPM, our algorithmic approach is certainly
>better. For ternary, the compute intensity in looping is better done in
>C. And for exact matches I believe that eBPF uses better hashing.
>Again, that is not the point we were trying to validate in those experiments...
>
>On your point of "maps are not that suitable": P4 tables tend to have
>very specific attributes (for example, associated meters, counters,
>default hit and miss actions, etc).
>
>> >we can optimize over time.
>> >Your view of "single lookup" is true for simple programs but if you
>> >have 10 tables trying to model a 5G function then it doesnt make sense
>> >(and i think the data we published was clear that you gain no
>> >advantage using ebpf - as a matter of fact there was no perf
>> >difference between XDP and tc in such cases).
>> >
>> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
>> >> map. Doing a simple update vs a netlink msg. The argument
>> >> that BPF can't do CRUD (which we had offlist) seems incorrect
>> >> to me. Correct me if I'm wrong with details about why.
>> >>
>> >
>> >So let me see....
>> >you want me to replace netlink and all its features and rewrite it
>> >using the ebpf system calls? Congestion control, event handling,
>> >arbitrary message crafting, etc and the years of work that went into
>> >netlink? NO to the HELL.
>>
>> Wait, I don't think John suggests anything like that. He just suggests
>> to have the tables as eBPF maps.
>
>What's the difference? Unless maps can do netlink.
>
>> Honestly, I don't understand the
>> fixation on netlink. Its socket messaging, memcpies, processing
>> overhead, etc can't keep up with mmaped memory access at scale. Measure
>> that and I bet you'll get drastically different results.
>>
>> I mean, netlink is good for a lot of things, but does not mean it is an
>> universal answer to userspace<->kernel data passing.
>
>Here's a small sample of our requirements that are satisfied by
>netlink for P4 object hierarchy[1]:
>1. Msg construction/parsing
>2. Multi-user request/response messaging
What is actually the use case for having multiple users program a P4 pipeline
in parallel?
>3. Multi-user event subscribe/publish messaging
Same here. What is the use case for multiple users receiving P4 events?
>
>I don't think I need to provide an explanation of the differences here
>vis-a-vis what eBPF system calls provide vs what netlink provides and
>how netlink is a clear fit. If it is not clear I can give more of a
It is not :/
>breakdown. And of course there's more, but the above is a good sample.
>
>The part that is taken for granted is the control plane code and
>interaction, which is an extremely important detail. P4 abstraction
>requires hierarchies with different compiler-generated encoded path
>IDs, etc. This ID mapping gets exacerbated by having multitudes of P4
Why does the actual eBPF map not serve the same purpose as an ID?
ID:mapping 1:1?
>programs which have different requirements. Netlink is a natural fit
>for this P4 abstraction. Not to mention the netlink/tc path (and in
>particular the ID mapping) provides a conduit for offload when that is
>needed.
>eBPF is just a tool - and the objects are intended to be generic - and
>I don't see how any of this could be achieved without retooling to make
>it more specific to P4.
>
>cheers,
>jamal
>
>
>
>>
>> >I should note: that there was an interesting talk at netdevconf 0x17
>> >where the speaker showed the challenges of dealing with ebpf on "day
>> >two" - slides or videos are not up yet, but link is:
>> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>> >The point the speaker was making is it's always easy to whip an ebpf
>> >program that can slice and dice packets and maybe even flush LEDs but
>> >the real work and challenge is in the control plane. I agree with the
>> >speaker based on my experiences. This discussion of replacing netlink
>> >with ebpf system calls is absolutely a non-starter. Let's just end the
>> >discussion and agree to disagree if you are going to keep insisting on
>> >that.
>>
>>
>> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-17 12:09 ` Jamal Hadi Salim
2023-11-20 8:18 ` Jiri Pirko
@ 2023-11-20 18:20 ` David Ahern
2023-11-20 20:12 ` Jamal Hadi Salim
1 sibling, 1 reply; 79+ messages in thread
From: David Ahern @ 2023-11-20 18:20 UTC (permalink / raw)
To: Jamal Hadi Salim, Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk, David Ahern, Stephen Hemminger
On 11/17/23 4:09 AM, Jamal Hadi Salim wrote:
>>> diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
>>> index ba32dba66..4d33f44c1 100644
>>> --- a/include/uapi/linux/p4tc.h
>>> +++ b/include/uapi/linux/p4tc.h
>>> @@ -2,8 +2,71 @@
>>> #ifndef __LINUX_P4TC_H
>>> #define __LINUX_P4TC_H
>>>
>>> +#include <linux/types.h>
>>> +#include <linux/pkt_sched.h>
>>> +
>>> +/* pipeline header */
>>> +struct p4tcmsg {
>>> + __u32 pipeid;
>>> + __u32 obj;
>>> +};
>>
>> I don't follow. Is there any sane reason to use header instead of normal
>> netlink attribute? Moveover, you extend the existing RT netlink with
>> a huge amout of p4 things. Isn't this the good time to finally introduce
>> generic netlink TC family with proper yaml spec with all the benefits it
>> brings and implement p4 tc uapi there? Please?
>>
There is precedent (new netdev APIs) for moving new infra to genl, but it
is not clear to me whether extending existing functionality should fall into
that required conversion.
>
> Several reasons:
> a) We are similar to current tc messaging with the subheader being
> there for multiplexing.
> b) Where does this leave iproute2? +Cc David and Stephen. Do other
> generic netlink conversions get contributed back to iproute2?
> c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
> based. i.e you have:
> COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
> P4 programs from the control plane.
> d) we have spent many hours optimizing the control to the kernel so i
> am not sure what it would buy us to switch to generic netlink..
>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 15/15] p4tc: Add P4 extern interface
2023-11-20 16:27 ` Jiri Pirko
@ 2023-11-20 19:00 ` Jamal Hadi Salim
0 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 19:00 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, daniel, bpf, khalidm, toke,
mattyk
On Mon, Nov 20, 2023 at 11:27 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 03:02:49PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 3:22 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Fri, Nov 17, 2023 at 01:14:43PM CET, jhs@mojatatu.com wrote:
> >> >On Thu, Nov 16, 2023 at 11:42 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Thu, Nov 16, 2023 at 03:59:48PM CET, jhs@mojatatu.com wrote:
> >> >>
> >> >> [...]
> >> >>
> >> >> > include/net/p4tc.h | 161 +++
> >> >> > include/net/p4tc_ext_api.h | 199 +++
> >> >> > include/uapi/linux/p4tc.h | 61 +
> >> >> > include/uapi/linux/p4tc_ext.h | 36 +
> >> >> > net/sched/p4tc/Makefile | 2 +-
> >> >> > net/sched/p4tc/p4tc_bpf.c | 79 +-
> >> >> > net/sched/p4tc/p4tc_ext.c | 2204 ++++++++++++++++++++++++++++
> >> >> > net/sched/p4tc/p4tc_pipeline.c | 34 +-
> >> >> > net/sched/p4tc/p4tc_runtime_api.c | 10 +-
> >> >> > net/sched/p4tc/p4tc_table.c | 57 +-
> >> >> > net/sched/p4tc/p4tc_tbl_entry.c | 25 +-
> >> >> > net/sched/p4tc/p4tc_tmpl_api.c | 4 +
> >> >> > net/sched/p4tc/p4tc_tmpl_ext.c | 2221 +++++++++++++++++++++++++++++
> >> >> > 13 files changed, 5083 insertions(+), 10 deletions(-)
> >> >>
> >> >> This is for this patch. Now for the whole patchset you have:
> >> >> 30 files changed, 16676 insertions(+), 39 deletions(-)
> >> >>
> >> >> I understand that you want to fit into 15 patches with all the work.
> >> >> But sorry, patches like this are unreviewable. My suggestion is to split
> >> >> the patchset into multiple ones including smaller patches and allow
> >> >> people to digest this. I don't believe that anyone can seriously stand
> >> >> to review a patch with more than 200 lines changes.
> >> >
> >> >This specific patch is not difficult to split into two. I can do that
> >> >and send out minus the first 8 trivial patches - but not familiar with
> >> >how to do "here's part 1 of the patches" and "here's patchset two".
> >>
> >> Split into multiple patchsets and send one by one. No need to have all
> >> in at once.
> >>
> >>
> >> >There's dependency between them so not clear how patchwork and
> >>
> >> What dependency. It should compile. Introduce some basic functionality
> >> first and extend it incrementally with other patchsets. The usual way.
> >>
> >
> >Sorry, still not following:
> >Lets say i split the current patchset 1 with patch 1-8 (which are
> >trivial and have been reviewed) then make the rest into patchset 2
> >with a new set 1-8. I dont see how patchset 2 compiles unless it has
> >access to code from patchset 1. Unless patchset 1 is merged i dont see
> >how this works with patchwork or reviewers. Am i missing something?
>
> Why it would not work. Describe your motivation and plans and submit
> part of the work, the rest later on. No problem.
Sorry, still not following ;->
So push the trivial patches 1-8 for merge - then push the rest in small increments?
cheers,
jamal
>
> >
> >cheers,
> >jamal
> >
> >>
> >> >reviewers would deal with it. Thoughts?
> >> >
> >> >Note: The code machinery is really repeatable; for example if you look
> >> >at the tables control you will see very similar patterns to actions
> >> >etc. i.e spending time to review one will make it easy for the rest.
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 18:10 ` Jiri Pirko
@ 2023-11-20 19:56 ` Jamal Hadi Salim
2023-11-20 20:41 ` John Fastabend
2023-11-20 21:48 ` Daniel Borkmann
0 siblings, 2 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 19:56 UTC (permalink / raw)
To: Jiri Pirko
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >> >>
> >> >> Jamal Hadi Salim wrote:
> >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >> >> > >
> >> >> > > Jamal Hadi Salim wrote:
> >>
> >> [...]
> >>
> >>
> >> >>
> >> >> I think I'm judging the technical work here. Bullet points.
> >> >>
> >> >> 1. p4c-tc implementation looks like it should be slower than a
> >> >> in terms of pkts/sec than a bpf implementation. Meaning
> >> >> I suspect pipeline and objects laid out like this will lose
> >> >> to a BPF program with an parser and single lookup. The p4c-ebpf
> >> >> compiler should look to create optimized EBPF code not some
> >> >> emulated switch topology.
> >> >>
> >> >
> >> >The parser is ebpf based. The other objects which require control
> >> >plane interaction are not - those interact via netlink.
> >> >We published perf data a while back - presented at the P4 workshop
> >> >back in April (was in the cover letter)
> >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >> >But do note: the correct abstraction is the first priority.
> >> >Optimization is something we can teach the compiler over time. But
> >> >even with the minimalist code generation you can see that our approach
> >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> >>
> >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> >
> >We didnt look closely; however, that is not the point - the point is
> >the perf difference if there is one, is not big with the big win being
> >proper P4 abstraction. For LPM for sure our algorithmic approach is
> >better. For ternary the compute intensity in looping is better done in
> >C. And for exact i believe that ebpf uses better hashing.
> >Again, that is not the point we were trying to validate in those experiments..
> >
> >On your point of "maps are not that suitable" P4 tables tend to have
> >very specific attributes (examples associated meters, counters,
> >default hit and miss actions, etc).
> >
> >> >we can optimize over time.
> >> >Your view of "single lookup" is true for simple programs but if you
> >> >have 10 tables trying to model a 5G function then it doesnt make sense
> >> >(and i think the data we published was clear that you gain no
> >> >advantage using ebpf - as a matter of fact there was no perf
> >> >difference between XDP and tc in such cases).
> >> >
> >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >> >> map. Doing a simple update vs a netlink msg. The argument
> >> >> that BPF can't do CRUD (which we had offlist) seems incorrect
> >> >> to me. Correct me if I'm wrong with details about why.
> >> >>
> >> >
> >> >So let me see....
> >> >you want me to replace netlink and all its features and rewrite it
> >> >using the ebpf system calls? Congestion control, event handling,
> >> >arbitrary message crafting, etc and the years of work that went into
> >> >netlink? NO to the HELL.
> >>
> >> Wait, I don't think John suggests anything like that. He just suggests
> >> to have the tables as eBPF maps.
> >
> >What's the difference? Unless maps can do netlink.
> >
> >> Honestly, I don't understand the
> >> fixation on netlink. Its socket messaging, memcpies, processing
> >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> >> that and I bet you'll get drastically different results.
> >>
> >> I mean, netlink is good for a lot of things, but does not mean it is an
> >> universal answer to userspace<->kernel data passing.
> >
> >Here's a small sample of our requirements that are satisfied by
> >netlink for P4 object hierarchy[1]:
> >1. Msg construction/parsing
> >2. Multi-user request/response messaging
>
> What is actually a usecase for having multiple users program p4 pipeline
> in parallel?
First of all, this is Linux - multiple users are a way of life, and you
shouldn't have to ask that question unless you are trying to be
Socratic. Meaning multiple control plane apps can be allowed to
program different parts and even different tables - think of a multi-tier
pipeline.
> >3. Multi-user event subscribe/publish messaging
>
> Same here. What is the usecase for multiple users receiving p4 events?
Same thing.
Note: events are really not part of P4, but we added them for
flexibility - and, as you well know, they are useful.
>
> >
> >I dont think i need to provide an explanation on the differences here
> >visavis what ebpf system calls provide vs what netlink provides and
> >how netlink is a clear fit. If it is not clear i can give more
>
> It is not :/
I thought it was obvious for someone like you, but fine - here goes for those three:

1. Msg construction/parsing: A lot of infra for sending attributes
back and forth is already built into netlink. I would have to create
mine from scratch for eBPF. This will include not just the
construction/parsing but all the detailed attribute content policy
validations (even in the presence of hierarchies) that come with it.
And let's not forget the state transform between kernel and user space.

2. Multi-user request/response messaging
If you can write all the code for #1 above then this should work fine for eBPF.

3. Event publish/subscribe
You would have to create mechanisms for eBPF which are either non-trivial
or incomplete. Example 1: you can put surgeries in the eBPF
code to look at map manipulations and then interface that to some event
management scheme which checks for subscribed users. Example 2: It may
also be feasible to create your own map for subscription vs something
like a perf ring for event publication (something I have done in the
past), but that is also limited in many ways. A small subscription
sketch follows below.
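As a minimal sketch of the subscribe side of #3, here is what any number of
control apps can do today against an rtnetlink multicast group; RTNLGRP_TC is
used only as a stand-in here, a P4TC-specific group would work the same way:

#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(void)
{
        struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
        int fd, grp = RTNLGRP_TC;
        char buf[8192];

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        if (fd < 0)
                return 1;
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return 1;
        /* Any number of processes can join the same group; the kernel
         * delivers a copy of every published event to each of them.
         */
        if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
                       &grp, sizeof(grp)) < 0)
                return 1;

        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);

                if (len <= 0)
                        break;
                printf("got %zd bytes of event data\n", len);
        }
        close(fd);
        return 0;
}

The publish side in the kernel is then just an nlmsg_multicast() (or
genlmsg_multicast()) to that group.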
>
> >breakdown. And of course there's more but above is a good sample.
> >
> >The part that is taken for granted is the control plane code and
> >interaction which is an extremely important detail. P4 Abstraction
> >requires hierarchies with different compiler generated encoded path
> >ids etc. This ID mapping gets exacerbated by having multitudes of P4
>
> Why the actual eBFP mapping does not serve the same purpose as ID?
> ID:mapping 1 :1
Identifying an object requires hierarchical IDs: a pipeline/program ID,
a table ID, a table entry identifier, an action identifier, and an ID
for each individual action parameter, etc. These same IDs would be what
hardware recognizes as well (in the case of offload). Given the dynamic
nature of these IDs, it is essentially up to the compiler to define
them. These hierarchies are much easier to validate in netlink (a rough
sketch of such nested validation follows below).
We don't want to be constrained to a generic infra like eBPF for these
objects. Again, eBPF is a means to an end (and not the goal here!).
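Concretely, one level of such a hierarchy expressed and validated with nested
netlink attributes could look like the following; the attribute names are made
up for illustration and are not the ones used by the patchset:

#include <net/netlink.h>

enum {                                  /* hypothetical entry-level attributes */
        P4_A_ENTRY_UNSPEC,
        P4_A_ENTRY_TBL_ID,              /* u32 table ID */
        P4_A_ENTRY_KEY,                 /* binary key blob */
        P4_A_ENTRY_ACT,                 /* nested: action ID + per-param IDs */
        __P4_A_ENTRY_MAX,
};
#define P4_A_ENTRY_MAX (__P4_A_ENTRY_MAX - 1)

static const struct nla_policy p4_entry_policy[P4_A_ENTRY_MAX + 1] = {
        [P4_A_ENTRY_TBL_ID]     = { .type = NLA_U32 },
        [P4_A_ENTRY_KEY]        = { .type = NLA_BINARY, .len = 512 / 8 },
        [P4_A_ENTRY_ACT]        = { .type = NLA_NESTED },
};

enum {                                  /* hypothetical root attributes */
        P4_A_ROOT_UNSPEC,
        P4_A_ROOT_PIPEID,               /* u32 pipeline/program ID */
        P4_A_ROOT_ENTRY,                /* nested table entry, policy above */
        __P4_A_ROOT_MAX,
};
#define P4_A_ROOT_MAX (__P4_A_ROOT_MAX - 1)

static const struct nla_policy p4_root_policy[P4_A_ROOT_MAX + 1] = {
        [P4_A_ROOT_PIPEID]      = { .type = NLA_U32 },
        [P4_A_ROOT_ENTRY]       = NLA_POLICY_NESTED(p4_entry_policy),
};

The deeper action/parameter levels nest the same way, and a malformed
hierarchy is rejected before any handler code runs.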
cheers,
jamal
>
>
> >programs which have different requirements. Netlink is a natural fit
> >for this P4 abstraction. Not to mention the netlink/tc path (and in
> >particular the ID mapping) provides a conduit for offload when that is
> >needed.
> >eBPF is just a tool - and the objects are intended to be generic - and
> >i dont see how any of this could be achieved without retooling to make
> >it more specific to P4.
> >
> >cheers,
> >jamal
> >
> >
> >
> >>
> >> >I should note: that there was an interesting talk at netdevconf 0x17
> >> >where the speaker showed the challenges of dealing with ebpf on "day
> >> >two" - slides or videos are not up yet, but link is:
> >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >> >The point the speaker was making is it's always easy to whip an ebpf
> >> >program that can slice and dice packets and maybe even flush LEDs but
> >> >the real work and challenge is in the control plane. I agree with the
> >> >speaker based on my experiences. This discussion of replacing netlink
> >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> >> >discussion and agree to disagree if you are going to keep insisting on
> >> >that.
> >>
> >>
> >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete
2023-11-20 18:20 ` David Ahern
@ 2023-11-20 20:12 ` Jamal Hadi Salim
0 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 20:12 UTC (permalink / raw)
To: David Ahern
Cc: Jiri Pirko, netdev, deb.chatterjee, anjali.singhai,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
daniel, bpf, khalidm, toke, mattyk, David Ahern,
Stephen Hemminger
On Mon, Nov 20, 2023 at 1:20 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 11/17/23 4:09 AM, Jamal Hadi Salim wrote:
> >>> diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
> >>> index ba32dba66..4d33f44c1 100644
> >>> --- a/include/uapi/linux/p4tc.h
> >>> +++ b/include/uapi/linux/p4tc.h
> >>> @@ -2,8 +2,71 @@
> >>> #ifndef __LINUX_P4TC_H
> >>> #define __LINUX_P4TC_H
> >>>
> >>> +#include <linux/types.h>
> >>> +#include <linux/pkt_sched.h>
> >>> +
> >>> +/* pipeline header */
> >>> +struct p4tcmsg {
> >>> + __u32 pipeid;
> >>> + __u32 obj;
> >>> +};
> >>
> >> I don't follow. Is there any sane reason to use header instead of normal
> >> netlink attribute? Moveover, you extend the existing RT netlink with
> >> a huge amout of p4 things. Isn't this the good time to finally introduce
> >> generic netlink TC family with proper yaml spec with all the benefits it
> >> brings and implement p4 tc uapi there? Please?
> >>
>
> There is precedence (new netdev APIs) to move new infra to genl, but it
> is not clear to me if extending existing functionality should fall into
> that required conversion.
>
The big question is: how does the genl (by which I assume you mean the
ynl stuff) fit back into iproute2?
The yaml files approach is a great deal of help for maintenance IMO (a
lot of repetitive code gone). But do we leave the rest of the masses
out? What is the motivation for pushing anything to be shared? And if
the answer is to convert everything onwards into genl, then where is
the central location to grab that code from? Is it still iproute2, or
the kernel? Etc.
cheers,
jamal
> >
> > Several reasons:
> > a) We are similar to current tc messaging with the subheader being
> > there for multiplexing.
> > b) Where does this leave iproute2? +Cc David and Stephen. Do other
> > generic netlink conversions get contributed back to iproute2?
> > c) note: Our API is CRUD-ish instead of RPC(per generic netlink)
> > based. i.e you have:
> > COMMAND <PATH/TO/OBJECT> [optional data] so we can support arbitrary
> > P4 programs from the control plane.
> > d) we have spent many hours optimizing the control to the kernel so i
> > am not sure what it would buy us to switch to generic netlink..
> >
>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 19:56 ` Jamal Hadi Salim
@ 2023-11-20 20:41 ` John Fastabend
2023-11-20 22:13 ` Jamal Hadi Salim
2023-11-20 21:48 ` Daniel Borkmann
1 sibling, 1 reply; 79+ messages in thread
From: John Fastabend @ 2023-11-20 20:41 UTC (permalink / raw)
To: Jamal Hadi Salim, Jiri Pirko
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, daniel, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Jamal Hadi Salim wrote:
> On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >
> > Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> > >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> > >>
> > >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> > >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >> >>
> > >> >> Jamal Hadi Salim wrote:
> > >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Jamal Hadi Salim wrote:
> > >>
> > >> [...]
> > >>
> > >>
> > >> >>
> > >> >> I think I'm judging the technical work here. Bullet points.
> > >> >>
> > >> >> 1. p4c-tc implementation looks like it should be slower than a
> > >> >> in terms of pkts/sec than a bpf implementation. Meaning
> > >> >> I suspect pipeline and objects laid out like this will lose
> > >> >> to a BPF program with an parser and single lookup. The p4c-ebpf
> > >> >> compiler should look to create optimized EBPF code not some
> > >> >> emulated switch topology.
> > >> >>
> > >> >
> > >> >The parser is ebpf based. The other objects which require control
> > >> >plane interaction are not - those interact via netlink.
> > >> >We published perf data a while back - presented at the P4 workshop
> > >> >back in April (was in the cover letter)
> > >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > >> >But do note: the correct abstraction is the first priority.
> > >> >Optimization is something we can teach the compiler over time. But
> > >> >even with the minimalist code generation you can see that our approach
> > >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> > >>
> > >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> > >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> > >
> > >We didnt look closely; however, that is not the point - the point is
> > >the perf difference if there is one, is not big with the big win being
> > >proper P4 abstraction. For LPM for sure our algorithmic approach is
> > >better. For ternary the compute intensity in looping is better done in
> > >C. And for exact i believe that ebpf uses better hashing.
> > >Again, that is not the point we were trying to validate in those experiments..
If you compared your implementation to the BPF lpm_trie it's a bit
misleading. The data structure is an rhashtable vs a trie doing LPM.
Also, I can't see how __p4tc_table_entry_lookup() is going to scale.
That looks like a bucket per key? If so, that won't scale well with
1000s of entries and lots of duplicate masks. I did a quick scan
of the code, but it would be nice to detail the algorithm in the commit
msg so we can dissect it.
This doesn't look like what we would want for an LPM, unless
I've taken this out of context.
+static struct p4tc_table_entry *
+__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
+ __must_hold(RCU)
+{
+ struct p4tc_table_entry *entry = NULL;
+ struct rhlist_head *tmp, *bucket_list;
+ struct p4tc_table_entry *entry_curr;
+ u32 smallest_prio = U32_MAX;
+
+ bucket_list =
+ rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+ if (!bucket_list)
+ return NULL;
+
+ rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
+ struct p4tc_table_entry_value *value =
+ p4tc_table_entry_value(entry_curr);
+ if (value->prio <= smallest_prio) {
+ smallest_prio = value->prio;
+ entry = entry_curr;
+ }
+ }
+
+ return entry;
+}
Also, I don't see why 'better done in C' matters; the TCAM data structure
can be written in C and used as a BPF map. At least that is how we would
normally approach it from the BPF side.
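For context, that pattern already exists for LPM: the trie is implemented in C
in the kernel and exposed as a map type. A minimal tc-BPF sketch against
BPF_MAP_TYPE_LPM_TRIE, assuming an IPv4 key and invented names:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct lpm_v4_key {
        __u32 prefixlen;        /* required first field for LPM trie keys */
        __u32 addr;             /* IPv4 address, network byte order */
};

struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 1024);
        __type(key, struct lpm_v4_key);
        __type(value, __u32);                   /* e.g. next hop / action index */
        __uint(map_flags, BPF_F_NO_PREALLOC);   /* mandatory for LPM tries */
} fib_like SEC(".maps");

SEC("tc")
int p4_lpm_demo(struct __sk_buff *skb)
{
        /* Hardcoded key for brevity; a real program parses the packet. */
        struct lpm_v4_key key = {
                .prefixlen = 32,
                .addr = bpf_htonl(0x0a000001),  /* 10.0.0.1 */
        };
        __u32 *val = bpf_map_lookup_elem(&fib_like, &key);

        if (!val)
                return TC_ACT_SHOT;     /* miss: default action, drop here */
        return TC_ACT_OK;               /* hit: *val would pick the action */
}

char _license[] SEC("license") = "GPL";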
> > >
> > >On your point of "maps are not that suitable" P4 tables tend to have
> > >very specific attributes (examples associated meters, counters,
> > >default hit and miss actions, etc).
The typical way we handle this from BPF is to either use entry 0
for stats, annotations, etc., or create a blob of memory (another map,
variables, a global struct, ...) and stash the info there. If we care
about performance we make those per-CPU and deal with it in
userland.
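A short sketch of that stats pattern (per-CPU counters living beside the
lookup map; names invented, same headers assumed as in the sketch above):

struct tbl_stats {
        __u64 hits;
        __u64 misses;
};

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);                 /* one stats slot per table */
        __type(key, __u32);
        __type(value, struct tbl_stats);
} tbl0_stats SEC(".maps");

static __always_inline void count_lookup(int hit)
{
        __u32 zero = 0;
        struct tbl_stats *st = bpf_map_lookup_elem(&tbl0_stats, &zero);

        if (!st)
                return;
        if (hit)
                st->hits++;     /* per-CPU, so no atomics needed here */
        else
                st->misses++;
}

Userspace then sums the per-CPU values when it reads the counters.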
> > >
> > >> >we can optimize over time.
> > >> >Your view of "single lookup" is true for simple programs but if you
> > >> >have 10 tables trying to model a 5G function then it doesnt make sense
> > >> >(and i think the data we published was clear that you gain no
> > >> >advantage using ebpf - as a matter of fact there was no perf
> > >> >difference between XDP and tc in such cases).
> > >> >
> > >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> > >> >> map. Doing a simple update vs a netlink msg. The argument
> > >> >> that BPF can't do CRUD (which we had offlist) seems incorrect
> > >> >> to me. Correct me if I'm wrong with details about why.
> > >> >>
> > >> >
> > >> >So let me see....
> > >> >you want me to replace netlink and all its features and rewrite it
> > >> >using the ebpf system calls? Congestion control, event handling,
> > >> >arbitrary message crafting, etc and the years of work that went into
> > >> >netlink? NO to the HELL.
> > >>
> > >> Wait, I don't think John suggests anything like that. He just suggests
> > >> to have the tables as eBPF maps.
> > >
> > >What's the difference? Unless maps can do netlink.
> > >
I'm going to argue that map update time matters and we should use the fastest
updates possible. If it complicates the userspace side some, I would prefer
that to slow updates. I don't think you can get much faster than an
mmapped block of memory. Even syscall updates are probably faster than
netlink msgs.
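A minimal userspace sketch of the mmap path (BPF_F_MMAPABLE array map; the
value layout is invented for illustration):

#include <bpf/bpf.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct entry_val {              /* made-up table entry value layout */
        uint32_t action_id;
        uint32_t port;
};

int main(void)
{
        const uint32_t max_entries = 1024;
        LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
        struct entry_val *vals;
        size_t len;
        int fd;

        fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "p4_tbl0", sizeof(uint32_t),
                            sizeof(struct entry_val), max_entries, &opts);
        if (fd < 0)
                return 1;

        len = (size_t)max_entries * sizeof(struct entry_val);
        vals = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (vals == MAP_FAILED)
                return 1;

        /* "update" entry 7 with plain stores - no netlink, no syscall */
        vals[7].action_id = 2;
        vals[7].port = 5;

        munmap(vals, len);
        close(fd);
        return 0;
}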
> > >> Honestly, I don't understand the
> > >> fixation on netlink. Its socket messaging, memcpies, processing
> > >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> > >> that and I bet you'll get drastically different results.
> > >>
> > >> I mean, netlink is good for a lot of things, but does not mean it is an
> > >> universal answer to userspace<->kernel data passing.
> > >
> > >Here's a small sample of our requirements that are satisfied by
> > >netlink for P4 object hierarchy[1]:
> > >1. Msg construction/parsing
> > >2. Multi-user request/response messaging
> >
> > What is actually a usecase for having multiple users program p4 pipeline
> > in parallel?
>
> First of all - this is Linux, multiple users is a way of life, you
> shouldnt have to ask that question unless you are trying to be
> socratic. Meaning multiple control plane apps can be allowed to
> program different parts and even different tables - think multi-tier
> pipeline.
Linux has always been opinionated and rejects code all the time because
it's not the "right" way. I've been on the "reject your stuff" side before.
Partitioning ownership of the pipeline is different from multiple
users of the same elements. On the BPF side (to show it's doable) this is
done by pinning maps to files and giving that file to different
programs. The DDOS thing can own the DDOS map and the router can own
its routing tables. BPF handles this mostly using the filesystem.
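A sketch of that split using bpffs pinning (the paths are invented):

#include <bpf/bpf.h>

/* The owner of the DDOS piece pins its table once: */
int pin_ddos_map(int ddos_map_fd)
{
        return bpf_obj_pin(ddos_map_fd, "/sys/fs/bpf/p4/ddos_tbl");
}

/* A separate control app, running later and possibly as a different
 * user, picks up only the map it is allowed to see on the bpffs path:
 */
int open_ddos_map(void)
{
        return bpf_obj_get("/sys/fs/bpf/p4/ddos_tbl");
}

Normal file permissions on the pinned paths then decide which control app may
touch which table.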
>
> > >3. Multi-user event subscribe/publish messaging
> >
> > Same here. What is the usecase for multiple users receiving p4 events?
>
> Same thing.
> Note: Events are really not part of P4 but we added them for
> flexibility - and as you well know they are useful.
Per the above, I wouldn't sacrifice update perf for this. Also, it's doable
from userspace if you need to. Another thing I've come to dislike a bit
is teaching the kernel a specific DSL. P4 is my favorite, but still,
going so far as to encode a specific P4 spec into the kernel seems
unnecessary. Also, will we now have to have kernel X supporting P4.16 and
kernel X+N supporting P4.18? It seems like a pain.
>
> >
> > >
> > >I dont think i need to provide an explanation on the differences here
> > >visavis what ebpf system calls provide vs what netlink provides and
> > >how netlink is a clear fit. If it is not clear i can give more
> >
> > It is not :/
>
> I thought it was obvious for someone like you, but fine - here goes for those 3:
>
> 1. Msg construction/parsing: A lot of infra for sending attributes
> back and forth is already built into netlink. I would have to create
> mine from scratch for ebpf. This will include not just the
> construction/parsing but all the detailed attribute content policy
> validations(even in the presence of hierarchies) that comes with it.
> And not to forget the state transform between kernel and user space.
But the series here does that as well; you could probably reuse that on
top of BPF. We have lots of libraries for dealing with eBPF to help.
I don't see anything problematic here for BPF.
>
> 2. Multi-user request/response messaging
> If you can write all the code for #1 above then this should work fine for ebpf
>
> 3. Event publish subscribe
> You would have to create mechanisms for ebpf which either are non
> trivial or non complete: Example 1: you can put surgeries in the ebpf
> code to look at map manipulations and then interface it to some event
> management scheme which checks for subscribed users. Example 2: It may
> also be feasible to create your own map for subscription vs something
> like perf ring for event publication(something i have done in the
> past), but that is also limited in many ways.
I would just push them out over a single perf ring and build the
subscription on top of gRPC (pick your protocol of choice).
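A sketch of the kernel-to-user half, with the newer BPF ring buffer standing in
for a perf ring; the event layout is invented:

/* BPF side; assumes <linux/bpf.h> and <bpf/bpf_helpers.h> */
struct p4_evt {                 /* invented event layout */
        __u32 pipeid;
        __u32 tbl_id;
        __u32 entry_id;
        __u8  op;               /* create/update/delete */
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);   /* 1 MiB of event buffer */
} p4_events SEC(".maps");

static __always_inline void p4_notify(__u32 pipe, __u32 tbl,
                                      __u32 entry, __u8 op)
{
        struct p4_evt *e = bpf_ringbuf_reserve(&p4_events, sizeof(*e), 0);

        if (!e)
                return;         /* buffer full: event dropped */
        e->pipeid = pipe;
        e->tbl_id = tbl;
        e->entry_id = entry;
        e->op = op;
        bpf_ringbuf_submit(e, 0);
}

Userspace drains it with ring_buffer__new()/ring_buffer__poll() and can fan
the events out to as many subscribers as it likes over gRPC or anything else.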
>
> >
> > >breakdown. And of course there's more but above is a good sample.
> > >
> > >The part that is taken for granted is the control plane code and
> > >interaction which is an extremely important detail. P4 Abstraction
> > >requires hierarchies with different compiler generated encoded path
> > >ids etc. This ID mapping gets exacerbated by having multitudes of P4
> >
> > Why the actual eBFP mapping does not serve the same purpose as ID?
> > ID:mapping 1 :1
>
> An identification of an object requires hierarchical IDs: A
> pipeline/program ID, A table id, a table entry Identification, an
> action identification and for each individual action content
> parameter, an ID etc. These same IDs would be what hardware would
> recognize as well (in case of offload). Given the dynamic nature of
> these IDs it is essentially up to the compiler to define them. These
> hierarchies are much easier to validate in netlink.
I'm on board for offloads, but this series says no offloads, and we
have no one with hardware in Linux for offloads yet. If we get a
series with a P4 driver and a NIC I can get my hands on, then we have
an entirely different conversation.
None of the above is a problem in eBPF. It's just mapping IDs around.
>
> We dont want to be constrained to a generic infra like eBPF for these
> objects. Again eBPF is a means to an end (and not the goal here!).
I don't see any constraints from eBPF above, just a list of things
that you would of course have to code up. And all of that
already exists in other projects.
>
> cheers,
> jamal
> >
> >
> > >programs which have different requirements. Netlink is a natural fit
> > >for this P4 abstraction. Not to mention the netlink/tc path (and in
> > >particular the ID mapping) provides a conduit for offload when that is
> > >needed.
> > >eBPF is just a tool - and the objects are intended to be generic - and
> > >i dont see how any of this could be achieved without retooling to make
> > >it more specific to P4.
> > >
> > >cheers,
> > >jamal
> > >
> > >
> > >
> > >>
> > >> >I should note: that there was an interesting talk at netdevconf 0x17
> > >> >where the speaker showed the challenges of dealing with ebpf on "day
> > >> >two" - slides or videos are not up yet, but link is:
> > >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> > >> >The point the speaker was making is it's always easy to whip an ebpf
> > >> >program that can slice and dice packets and maybe even flush LEDs but
> > >> >the real work and challenge is in the control plane. I agree with the
> > >> >speaker based on my experiences. This discussion of replacing netlink
> > >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> > >> >discussion and agree to disagree if you are going to keep insisting on
> > >> >that.
> > >>
> > >>
> > >> [...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 19:56 ` Jamal Hadi Salim
2023-11-20 20:41 ` John Fastabend
@ 2023-11-20 21:48 ` Daniel Borkmann
2023-11-20 22:56 ` Jamal Hadi Salim
1 sibling, 1 reply; 79+ messages in thread
From: Daniel Borkmann @ 2023-11-20 21:48 UTC (permalink / raw)
To: Jamal Hadi Salim, Jiri Pirko
Cc: John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, tom, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, kuba, pabeni,
vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>>> On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
>>>>> On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>> Jamal Hadi Salim wrote:
>>>>>>> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>>>> Jamal Hadi Salim wrote:
>>>>
>>>> [...]
>>>>
>>>>>> I think I'm judging the technical work here. Bullet points.
>>>>>>
>>>>>> 1. p4c-tc implementation looks like it should be slower than a
>>>>>> in terms of pkts/sec than a bpf implementation. Meaning
>>>>>> I suspect pipeline and objects laid out like this will lose
>>>>>> to a BPF program with an parser and single lookup. The p4c-ebpf
>>>>>> compiler should look to create optimized EBPF code not some
>>>>>> emulated switch topology.
>>>>>
>>>>> The parser is ebpf based. The other objects which require control
>>>>> plane interaction are not - those interact via netlink.
>>>>> We published perf data a while back - presented at the P4 workshop
>>>>> back in April (was in the cover letter)
>>>>> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>>>>> But do note: the correct abstraction is the first priority.
>>>>> Optimization is something we can teach the compiler over time. But
>>>>> even with the minimalist code generation you can see that our approach
>>>>> always beats ebpf in LPM and ternary. The other ones I am pretty sure
>>>>
>>>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
>>>> this kinds of lookups? I mean in theory, eBPF should be always faster.
>>>
>>> We didnt look closely; however, that is not the point - the point is
>>> the perf difference if there is one, is not big with the big win being
>>> proper P4 abstraction. For LPM for sure our algorithmic approach is
>>> better. For ternary the compute intensity in looping is better done in
>>> C. And for exact i believe that ebpf uses better hashing.
>>> Again, that is not the point we were trying to validate in those experiments..
>>>
>>> On your point of "maps are not that suitable" P4 tables tend to have
>>> very specific attributes (examples associated meters, counters,
>>> default hit and miss actions, etc).
>>>
>>>>> we can optimize over time.
>>>>> Your view of "single lookup" is true for simple programs but if you
>>>>> have 10 tables trying to model a 5G function then it doesnt make sense
>>>>> (and i think the data we published was clear that you gain no
>>>>> advantage using ebpf - as a matter of fact there was no perf
>>>>> difference between XDP and tc in such cases).
>>>>>
>>>>>> 2. p4c-tc control plan looks slower than a directly mmaped bpf
>>>>>> map. Doing a simple update vs a netlink msg. The argument
>>>>>> that BPF can't do CRUD (which we had offlist) seems incorrect
>>>>>> to me. Correct me if I'm wrong with details about why.
>>>>>
>>>>> So let me see....
>>>>> you want me to replace netlink and all its features and rewrite it
>>>>> using the ebpf system calls? Congestion control, event handling,
>>>>> arbitrary message crafting, etc and the years of work that went into
>>>>> netlink? NO to the HELL.
>>>>
>>>> Wait, I don't think John suggests anything like that. He just suggests
>>>> to have the tables as eBPF maps.
>>>
>>> What's the difference? Unless maps can do netlink.
>>>
>>>> Honestly, I don't understand the
>>>> fixation on netlink. Its socket messaging, memcpies, processing
>>>> overhead, etc can't keep up with mmaped memory access at scale. Measure
>>>> that and I bet you'll get drastically different results.
>>>>
>>>> I mean, netlink is good for a lot of things, but does not mean it is an
>>>> universal answer to userspace<->kernel data passing.
>>>
>>> Here's a small sample of our requirements that are satisfied by
>>> netlink for P4 object hierarchy[1]:
>>> 1. Msg construction/parsing
>>> 2. Multi-user request/response messaging
>>
>> What is actually a usecase for having multiple users program p4 pipeline
>> in parallel?
>
> First of all - this is Linux, multiple users is a way of life, you
> shouldnt have to ask that question unless you are trying to be
> socratic. Meaning multiple control plane apps can be allowed to
> program different parts and even different tables - think multi-tier
> pipeline.
>
>>> 3. Multi-user event subscribe/publish messaging
>>
>> Same here. What is the usecase for multiple users receiving p4 events?
>
> Same thing.
> Note: Events are really not part of P4 but we added them for
> flexibility - and as you well know they are useful.
>
>>> I dont think i need to provide an explanation on the differences here
>>> visavis what ebpf system calls provide vs what netlink provides and
>>> how netlink is a clear fit. If it is not clear i can give more
>>
>> It is not :/
>
> I thought it was obvious for someone like you, but fine - here goes for those 3:
>
> 1. Msg construction/parsing: A lot of infra for sending attributes
> back and forth is already built into netlink. I would have to create
> mine from scratch for ebpf. This will include not just the
> construction/parsing but all the detailed attribute content policy
> validations(even in the presence of hierarchies) that comes with it.
> And not to forget the state transform between kernel and user space.
>
> 2. Multi-user request/response messaging
> If you can write all the code for #1 above then this should work fine for ebpf
>
> 3. Event publish subscribe
> You would have to create mechanisms for ebpf which either are non
> trivial or non complete: Example 1: you can put surgeries in the ebpf
> code to look at map manipulations and then interface it to some event
> management scheme which checks for subscribed users. Example 2: It may
> also be feasible to create your own map for subscription vs something
> like perf ring for event publication(something i have done in the
> past), but that is also limited in many ways.
I still don't think this answers all the questions on why the netlink
shim layer. The kfuncs are essentially available to all of tc BPF and
I don't think there was a discussion on why they cannot be made generic
in a way that they could benefit all tc/XDP BPF users. With patch 14
you are more or less copying what already exists with {cls,act}_bpf,
except that you also allow XDP loading from tc(?). We do have existing
interfaces for XDP program management.
tc BPF and XDP already have widely used infrastructure and can be developed
against libbpf or other user space libraries for a user space control plane.
With 'control plane' you refer here to the tc / netlink shim you've built,
but looking at the tc command line examples, this doesn't really provide a
good user experience (you call it p4 but people load bpf obj files). If the
expectation is that an operator should run tc commands, then it's not
a nice experience for either p4 or BPF folks. From a BPF PoV, we moved over
to bpf_mprog and plan to also extend this for XDP to have a common look and
feel wrt networking for developers. Why can't this be reused?
I don't quite follow why most of this could not be implemented entirely in
user space, without this detour, with you providing a developer
library which could then be integrated into a p4 runtime/frontend? This
way users never interface with ebpf parts nor tc given they also shouldn't
have to - it's an implementation detail. This is what John was also pointing
out earlier.
If you need a notifications/subscribe mechanism for map updates, then this
could be extended - the same way BPF internals got extended along with the
sched_ext work, making the core pieces more useful also outside of the latter.
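
As a rough illustration (not code from this series), a userspace consumer of
such notifications could look like the sketch below, assuming the datapath
publishes table-update events into a pinned BPF ring buffer; the pin path and
event format are made up:

/* sketch only: consume table-update events from a pinned BPF ring buffer */
#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t len)
{
        /* decode the (hypothetical) event record, fan it out to subscribers */
        printf("table update event, %zu bytes\n", len);
        return 0;
}

int main(void)
{
        int map_fd = bpf_obj_get("/sys/fs/bpf/p4/tbl_events"); /* assumed pin path */
        struct ring_buffer *rb;

        if (map_fd < 0)
                return 1;
        rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
        if (!rb)
                return 1;
        while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
                ;
        ring_buffer__free(rb);
        return 0;
}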
The slides linked below are not public, so it's hard to see what is really
meant here, but I have also never seen an email from the speaker on the BPF
mailing list providing concrete feedback(?). People do build control planes
around BPF in the wild, and I'm not sure where you take 'flush LEDs' from; to
me this all sounds rather hand-wavy and like trying to brute-force the fixation
on netlink you went with, which is raising questions. I don't think there was
objection to going with eBPF but rather to all this infra for the former for
a SW-only extension.
[...]
>>>>> I should note: that there was an interesting talk at netdevconf 0x17
>>>>> where the speaker showed the challenges of dealing with ebpf on "day
>>>>> two" - slides or videos are not up yet, but link is:
>>>>> https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
>>>>> The point the speaker was making is it's always easy to whip an ebpf
>>>>> program that can slice and dice packets and maybe even flush LEDs but
>>>>> the real work and challenge is in the control plane. I agree with the
>>>>> speaker based on my experiences. This discussion of replacing netlink
>>>>> with ebpf system calls is absolutely a non-starter. Let's just end the
>>>>> discussion and agree to disagree if you are going to keep insisting on
>>>>> that.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 20:41 ` John Fastabend
@ 2023-11-20 22:13 ` Jamal Hadi Salim
0 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 22:13 UTC (permalink / raw)
To: John Fastabend
Cc: Jiri Pirko, netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
daniel, bpf, khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
On Mon, Nov 20, 2023 at 3:41 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> > >
> > > Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> > > >On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> > > >>
> > > >> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> > > >> >On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >> >>
> > > >> >> Jamal Hadi Salim wrote:
> > > >> >> > On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >> >> > >
> > > >> >> > > Jamal Hadi Salim wrote:
> > > >>
> > > >> [...]
> > > >>
> > > >>
> > > >> >>
> > > >> >> I think I'm judging the technical work here. Bullet points.
> > > >> >>
> > > >> >> 1. p4c-tc implementation looks like it should be slower than a
> > > >> >> in terms of pkts/sec than a bpf implementation. Meaning
> > > >> >> I suspect pipeline and objects laid out like this will lose
> > > >> >> to a BPF program with an parser and single lookup. The p4c-ebpf
> > > >> >> compiler should look to create optimized EBPF code not some
> > > >> >> emulated switch topology.
> > > >> >>
> > > >> >
> > > >> >The parser is ebpf based. The other objects which require control
> > > >> >plane interaction are not - those interact via netlink.
> > > >> >We published perf data a while back - presented at the P4 workshop
> > > >> >back in April (was in the cover letter)
> > > >> >https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > > >> >But do note: the correct abstraction is the first priority.
> > > >> >Optimization is something we can teach the compiler over time. But
> > > >> >even with the minimalist code generation you can see that our approach
> > > >> >always beats ebpf in LPM and ternary. The other ones I am pretty sure
> > > >>
> > > >> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> > > >> this kinds of lookups? I mean in theory, eBPF should be always faster.
> > > >
> > > >We didnt look closely; however, that is not the point - the point is
> > > >the perf difference if there is one, is not big with the big win being
> > > >proper P4 abstraction. For LPM for sure our algorithmic approach is
> > > >better. For ternary the compute intensity in looping is better done in
> > > >C. And for exact i believe that ebpf uses better hashing.
> > > >Again, that is not the point we were trying to validate in those experiments..
>
> If you compared your implementation to the bpf lpm_trie its a bit
> misleading. The data structure is a rhashtable vs a Trie doing LPM.
>
> Also I can't see how __p4tc_table_entry_lookup() is going to scale?
> That looks like a bucket per key? If so that wont scale well with
> 1000's of entries and lots of duplicate masks.
I think you are misreading the code - there are no duplicate masks;
iiuc, by scale you mean lookup performance, and the numbers we got
show very different results (the more entries and masks, the better
the numbers we showed).
Again - i dont want to make this a topic; whether we beat you or you
beat us on numbers is not relevant to begin with.
>I did a quick scan
> of the code, but it would be nice to detail the algorithm in the commit
> msg so we can dissect it.
>
> This doesn't look what we would want though for an LPM unless
> I've dropped this out of context.
>
> +static struct p4tc_table_entry *
> +__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
> + __must_hold(RCU)
> +{
> + struct p4tc_table_entry *entry = NULL;
> + struct rhlist_head *tmp, *bucket_list;
> + struct p4tc_table_entry *entry_curr;
> + u32 smallest_prio = U32_MAX;
> +
> + bucket_list =
> + rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
> + if (!bucket_list)
> + return NULL;
> +
> + rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
> + struct p4tc_table_entry_value *value =
> + p4tc_table_entry_value(entry_curr);
> + if (value->prio <= smallest_prio) {
> + smallest_prio = value->prio;
> + entry = entry_curr;
> + }
> + }
> +
> + return entry;
> +}
You are quoting the ternary (not LPM) matching code. It iterates all
entries (we could only do ~190 when we tested in plain ebpf, that's why
our test was restricted to that number).
> Also I don't know what 'better done in C' matters the TCAM data structure
> can be written in C and used as a BPF map. At least that is how we would
> normally approach it from BPF side.
See the code you quoted - you have to loop and pick the best of N
matches, where N could be arbitrarily large.
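
To make the cost concrete, a ternary lookup in flat code (whether in plain C
or in a bounded eBPF loop) ends up looking roughly like this sketch - the
types are illustrative, not the series' code:

#include <stdint.h>

struct tern_entry {
        uint32_t val;
        uint32_t mask;
        uint32_t prio;      /* lower value = more preferred, as in the code above */
};

static int tern_lookup(const struct tern_entry *e, int n, uint32_t key)
{
        uint32_t best_prio = UINT32_MAX;
        int best = -1;

        for (int i = 0; i < n; i++) {   /* in eBPF this loop must be bounded */
                if ((key & e[i].mask) == e[i].val && e[i].prio <= best_prio) {
                        best_prio = e[i].prio;
                        best = i;
                }
        }
        return best;                    /* index of the best match, or -1 */
}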
> > > >
> > > >On your point of "maps are not that suitable" P4 tables tend to have
> > > >very specific attributes (examples associated meters, counters,
> > > >default hit and miss actions, etc).
>
> The typical way we handle this from BPF is to either use the 0 entry
> for stats, annotations, etc. or create a blob of memory (another map,
> variables, global struct, ...) and stash the info there. If we care
> about performance we make those per cpu and deal with it in user
> land.
>
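For reference, the pattern John describes is roughly the following sketch in
BPF C (map and function names are made up, not the series' code):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* per-table stats kept in a per-CPU array, aggregated later in userspace */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} mytable_hits SEC(".maps");

static __always_inline void account_hit(void)
{
        __u32 k = 0;
        __u64 *cnt = bpf_map_lookup_elem(&mytable_hits, &k);

        if (cnt)
                __sync_fetch_and_add(cnt, 1);
}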
This goes back to the abstraction overhead in user space being high. The whole
point is to minimize all that.
> > > >
> > > >> >we can optimize over time.
> > > >> >Your view of "single lookup" is true for simple programs but if you
> > > >> >have 10 tables trying to model a 5G function then it doesnt make sense
> > > >> >(and i think the data we published was clear that you gain no
> > > >> >advantage using ebpf - as a matter of fact there was no perf
> > > >> >difference between XDP and tc in such cases).
> > > >> >
> > > >> >> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> > > >> >> map. Doing a simple update vs a netlink msg. The argument
> > > >> >> that BPF can't do CRUD (which we had offlist) seems incorrect
> > > >> >> to me. Correct me if I'm wrong with details about why.
> > > >> >>
> > > >> >
> > > >> >So let me see....
> > > >> >you want me to replace netlink and all its features and rewrite it
> > > >> >using the ebpf system calls? Congestion control, event handling,
> > > >> >arbitrary message crafting, etc and the years of work that went into
> > > >> >netlink? NO to the HELL.
> > > >>
> > > >> Wait, I don't think John suggests anything like that. He just suggests
> > > >> to have the tables as eBPF maps.
> > > >
> > > >What's the difference? Unless maps can do netlink.
> > > >
>
> I'm going to argue map update time matters and we should use the fastest
> updates possible. If it complicates user space side some I would prefer
> that to slow updates. I don't think you can get much faster than a
> mmaped block of memory. Or even syscall updates are probably faster than
> netlink msgs.
So let's put this to rest:
It's about the P4 abstraction first (as i mentioned earlier) - i am
sure mmaping would be faster, but that is secondary - correct
abstraction first.
I am ok with some level of abstraction wrangling (for example, match-action
in P4 to match-value in ebpf) but there is a limit.
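
For concreteness, the mmap-style update path John refers to is roughly this
sketch (libbpf; all names are illustrative, and it assumes a BPF_F_MMAPABLE
array map shared with the datapath program):

#include <sys/mman.h>
#include <bpf/bpf.h>

struct entry { __u64 match; __u64 action; };    /* made-up table layout */

int main(void)
{
        LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
        int fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "mytable",
                                sizeof(__u32), sizeof(struct entry), 1024, &opts);
        struct entry *tbl;

        if (fd < 0)
                return 1;
        tbl = mmap(NULL, 1024 * sizeof(struct entry), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (tbl == MAP_FAILED)
                return 1;
        tbl[42].match  = 0xa000102;     /* plain store, no syscall per update */
        tbl[42].action = 1;
        return 0;
}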
> > > >> Honestly, I don't understand the
> > > >> fixation on netlink. Its socket messaging, memcpies, processing
> > > >> overhead, etc can't keep up with mmaped memory access at scale. Measure
> > > >> that and I bet you'll get drastically different results.
> > > >>
> > > >> I mean, netlink is good for a lot of things, but does not mean it is an
> > > >> universal answer to userspace<->kernel data passing.
> > > >
> > > >Here's a small sample of our requirements that are satisfied by
> > > >netlink for P4 object hierarchy[1]:
> > > >1. Msg construction/parsing
> > > >2. Multi-user request/response messaging
> > >
> > > What is actually a usecase for having multiple users program p4 pipeline
> > > in parallel?
> >
> > First of all - this is Linux, multiple users is a way of life, you
> > shouldnt have to ask that question unless you are trying to be
> > socratic. Meaning multiple control plane apps can be allowed to
> > program different parts and even different tables - think multi-tier
> > pipeline.
>
> Linux has always been opinionated and rejects code all the time because
> it's not the "right" way. I've been on the reject-your-stuff side before.
>
> Partitioning ownership of the pipeline is different from multiple
> users of the same elements. From the BPF side (to show it's doable) this is
> done by pinning maps to files and giving that file to different
> programs. The DDOS thing can own the DDOS map and the router can own
> its router tables. BPF handles this using the file system, mostly.
>
And with tc it just fits right in without any of those tricks...
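
For completeness, the pinning split John mentions is roughly the following
sketch (libbpf; the paths are illustrative):

#include <bpf/bpf.h>

/* owner process pins its table: */
int pin_table(int map_fd)
{
        return bpf_obj_pin(map_fd, "/sys/fs/bpf/p4/ddos_table");
}

/* a different control app, which only ever sees this one table: */
int open_table(void)
{
        return bpf_obj_get("/sys/fs/bpf/p4/ddos_table");
}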
> >
> > > >3. Multi-user event subscribe/publish messaging
> > >
> > > Same here. What is the usecase for multiple users receiving p4 events?
> >
> > Same thing.
> > Note: Events are really not part of P4 but we added them for
> > flexibility - and as you well know they are useful.
>
> Per above I wouldn't sacrifice update perf for this. Also it's doable
> from userspace if you need to. Another thing I've come to dislike a bit
> is teaching the kernel a specific DSL. P4 is my favorite, but still,
> going so far as to encode a specific P4 spec into the kernel seems
> unnecessary. Also, will we now have to have kernel X supporting P4.16 and
> kernel X+N supporting P4.18? It seems like a pain.
>
I believe you are misunderstanding; let me explain. While our focus is on PNA,
there is no change in the kernel infra (upstream code) for PNA or PSA
or PXXXX. The compiler may end up generating different code depending
on the architecture selected on the compiler command line. The control
constructs are very static with their hierarchy IDs.
In regards to what you prophesize above about the language going from
P4.16 to P4.18 - i dont mean to be rude, but: kettle look at pot
much?;-> i.e. what happens when the eBPF ISA gets extended? The safety
feature we have for P4TC is externs - most of these will be
implemented as self-fulfilling kfuncs.
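
For illustration only, an extern backed by a kfunc would be registered
roughly like the sketch below (kernel C; the extern and all names are made
up, not the series' code):

#include <linux/module.h>
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>

/* hypothetical extern: bump a counter instance identified by counter_id */
__bpf_kfunc u32 p4tc_extern_counter_bump(u32 counter_id)
{
        /* ...locate and bump the counter object... */
        return 0;
}

BTF_SET8_START(p4tc_extern_kfuncs)
BTF_ID_FLAGS(func, p4tc_extern_counter_bump)
BTF_SET8_END(p4tc_extern_kfuncs)

static const struct btf_kfunc_id_set p4tc_extern_kfunc_set = {
        .owner = THIS_MODULE,
        .set   = &p4tc_extern_kfuncs,
};

static int __init p4tc_extern_init(void)
{
        /* make the kfunc callable from tc BPF programs */
        return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS,
                                         &p4tc_extern_kfunc_set);
}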
> >
> > >
> > > >
> > > >I dont think i need to provide an explanation on the differences here
> > > >visavis what ebpf system calls provide vs what netlink provides and
> > > >how netlink is a clear fit. If it is not clear i can give more
> > >
> > > It is not :/
> >
> > I thought it was obvious for someone like you, but fine - here goes for those 3:
> >
> > 1. Msg construction/parsing: A lot of infra for sending attributes
> > back and forth is already built into netlink. I would have to create
> > mine from scratch for ebpf. This will include not just the
> > construction/parsing but all the detailed attribute content policy
> > validations(even in the presence of hierarchies) that comes with it.
> > And not to forget the state transform between kernel and user space.
>
> But the series here does that as well; you could probably reuse that on
> top of BPF. We have lots of libraries to deal with ebpf to help.
> I don't see anything problematic here for BPF.
Which library does all of this (the netlink features) for eBPF and has
something matching it in the kernel? We did try to write our own but
it was a huge waste of time.
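
For a sense of what netlink gives the kernel side essentially for free,
attribute policy and validation look roughly like this sketch (the attribute
set is illustrative, not the series' uAPI):

#include <net/netlink.h>

enum {                          /* hypothetical attribute set */
        P4_ENTRY_UNSPEC,
        P4_ENTRY_TABLE_ID,
        P4_ENTRY_PRIO,
        P4_ENTRY_KEY_BLOB,
        P4_ENTRY_ACT_NAME,
        __P4_ENTRY_MAX,
};
#define P4_ENTRY_MAX (__P4_ENTRY_MAX - 1)

static const struct nla_policy p4_entry_policy[P4_ENTRY_MAX + 1] = {
        [P4_ENTRY_TABLE_ID] = { .type = NLA_U32 },
        [P4_ENTRY_PRIO]     = { .type = NLA_U32 },
        [P4_ENTRY_KEY_BLOB] = { .type = NLA_BINARY, .len = 128 },
        [P4_ENTRY_ACT_NAME] = { .type = NLA_NUL_STRING, .len = 64 },
};

static int parse_entry(struct nlattr *attr, struct netlink_ext_ack *extack)
{
        struct nlattr *tb[P4_ENTRY_MAX + 1];
        int err;

        err = nla_parse_nested(tb, P4_ENTRY_MAX, attr, p4_entry_policy, extack);
        if (err)
                return err;     /* malformed/oversized attributes rejected here */
        /* ...pick values out of tb[] and act on them... */
        return 0;
}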
> >
> > 2. Multi-user request/response messaging
> > If you can write all the code for #1 above then this should work fine for ebpf
> >
> > 3. Event publish subscribe
> > You would have to create mechanisms for ebpf which either are non
> > trivial or non complete: Example 1: you can put surgeries in the ebpf
> > code to look at map manipulations and then interface it to some event
> > management scheme which checks for subscribed users. Example 2: It may
> > also be feasible to create your own map for subscription vs something
> > like perf ring for event publication(something i have done in the
> > past), but that is also limited in many ways.
>
> I would just push them out over a single perf ring and build the
> subscription on top of GRPC (pick your protocol of choice).
>
Why - just so i could use ebpf? I've never understood that single-user
mode perf ring thing.
> >
> > >
> > > >breakdown. And of course there's more but above is a good sample.
> > > >
> > > >The part that is taken for granted is the control plane code and
> > > >interaction which is an extremely important detail. P4 Abstraction
> > > >requires hierarchies with different compiler generated encoded path
> > > >ids etc. This ID mapping gets exacerbated by having multitudes of P4
> > >
> > > Why the actual eBFP mapping does not serve the same purpose as ID?
> > > ID:mapping 1 :1
> >
> > An identification of an object requires hierarchical IDs: A
> > pipeline/program ID, A table id, a table entry Identification, an
> > action identification and for each individual action content
> > parameter, an ID etc. These same IDs would be what hardware would
> > recognize as well (in case of offload). Given the dynamic nature of
> > these IDs it is essentially up to the compiler to define them. These
> > hierarchies are much easier to validate in netlink.
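
As a sketch of the hierarchical identification being described (field names
are illustrative, not the series' uAPI):

#include <linux/types.h>

struct p4_obj_id {
        __u32 pipeline_id;      /* compiler-assigned, identifies the P4 program */
        __u32 table_id;         /* table within that pipeline */
        __u32 entry_id;         /* entry within the table */
        __u32 action_id;        /* action bound to the entry */
        __u32 param_id;         /* parameter within that action */
};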
>
> I'm on board for offloads, but this series says no offloads and we
> have no one with hardware in Linux for offloads yet. If we had a
> series with a P4 driver and a NIC I could get my hands on, then we would
> have an entirely different conversation.
The argument made is that P4 s/w stands on its own merit regardless of
the presence of hardware offload (there are s/w P4 implementations based
on DPDK and Rust that I believe are used in production). As an example,
the DASH project quoted in the cover letter uses P4 as a datapath
specification language. The datapath is then verified to be working in
s/w. So let's not argue that there is no merit to a s/w P4 version
without h/w offload.
I do have a NIC (Intel e2000) that does P4 offloads but i am afraid I
can't give it to you. Folks who are doing offloads will present
drivers when they are ready, and when/if those patches show up there will
be extensions to deal with ndo_tc. But i know you are already on the
p4tc mailing list and are quite aware of these developments - I
am not sure I understand your motivation for bringing this up a few
times now. I read it as some sort of insinuation that there is some
secret vendor hardware that is going to benefit from all this secret
trojan we are doing here. Again, P4 s/w stands on its own.
> None of this above is a problem in eBPF. It's just mapping ids around.
>
> >
> > We dont want to be constrained to a generic infra like eBPF for these
> > objects. Again eBPF is a means to an end (and not the goal here!).
>
> I don't see any constraints from eBPF above, just a list of things
> that of course you would have to code up. But none of that is anything
> that doesn't already exist in other projects.
>
And we can agree to disagree.
cheers,
jamal
> >
> > cheers,
> > jamal
> > >
> > >
> > > >programs which have different requirements. Netlink is a natural fit
> > > >for this P4 abstraction. Not to mention the netlink/tc path (and in
> > > >particular the ID mapping) provides a conduit for offload when that is
> > > >needed.
> > > >eBPF is just a tool - and the objects are intended to be generic - and
> > > >i dont see how any of this could be achieved without retooling to make
> > > >it more specific to P4.
> > > >
> > > >cheers,
> > > >jamal
> > > >
> > > >
> > > >
> > > >>
> > > >> >I should note: that there was an interesting talk at netdevconf 0x17
> > > >> >where the speaker showed the challenges of dealing with ebpf on "day
> > > >> >two" - slides or videos are not up yet, but link is:
> > > >> >https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> > > >> >The point the speaker was making is it's always easy to whip an ebpf
> > > >> >program that can slice and dice packets and maybe even flush LEDs but
> > > >> >the real work and challenge is in the control plane. I agree with the
> > > >> >speaker based on my experiences. This discussion of replacing netlink
> > > >> >with ebpf system calls is absolutely a non-starter. Let's just end the
> > > >> >discussion and agree to disagree if you are going to keep insisting on
> > > >> >that.
> > > >>
> > > >>
> > > >> [...]
>
>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
2023-11-17 7:09 ` John Fastabend
2023-11-19 9:14 ` kernel test robot
@ 2023-11-20 22:28 ` kernel test robot
2 siblings, 0 replies; 79+ messages in thread
From: kernel test robot @ 2023-11-20 22:28 UTC (permalink / raw)
To: Jamal Hadi Salim, netdev
Cc: oe-kbuild-all, deb.chatterjee, anjali.singhai, namrata.limaye,
tom, mleitner, Mahesh.Shirshyad, tomasz.osinski, jiri,
xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
daniel, bpf, khalidm, toke, mattyk
Hi Jamal,
kernel test robot noticed the following build errors:
[auto build test ERROR on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Jamal-Hadi-Salim/net-sched-act_api-Introduce-dynamic-actions-list/20231116-230427
base: net-next/main
patch link: https://lore.kernel.org/r/20231116145948.203001-14-jhs%40mojatatu.com
patch subject: [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs
config: i386-randconfig-r113-20231120 (https://download.01.org/0day-ci/archive/20231121/202311210628.2LXSAYiy-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231121/202311210628.2LXSAYiy-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311210628.2LXSAYiy-lkp@intel.com/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: register_p4tc_tbl_bpf
>>> referenced by p4tc_tmpl_api.c:602 (net/sched/p4tc/p4tc_tmpl_api.c:602)
>>> net/sched/p4tc/p4tc_tmpl_api.o:(p4tc_template_init) in archive vmlinux.a
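The usual shape of a fix for this class of randconfig link failure is a
config-gated declaration with a stub; a sketch, with a hypothetical Kconfig
symbol name:

#if IS_ENABLED(CONFIG_NET_P4TC_KFUNCS)          /* hypothetical option */
int register_p4tc_tbl_bpf(void);
#else
static inline int register_p4tc_tbl_bpf(void)
{
        return 0;
}
#endif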
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 21:48 ` Daniel Borkmann
@ 2023-11-20 22:56 ` Jamal Hadi Salim
2023-11-21 13:06 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-20 22:56 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Jiri Pirko, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >>> On Mon, Nov 20, 2023 at 4:39 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>>> Fri, Nov 17, 2023 at 09:46:11PM CET, jhs@mojatatu.com wrote:
> >>>>> On Fri, Nov 17, 2023 at 1:37 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>>>> Jamal Hadi Salim wrote:
> >>>>>>> On Fri, Nov 17, 2023 at 1:27 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >>>>>>>> Jamal Hadi Salim wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>>> I think I'm judging the technical work here. Bullet points.
> >>>>>>
> >>>>>> 1. p4c-tc implementation looks like it should be slower than a
> >>>>>> in terms of pkts/sec than a bpf implementation. Meaning
> >>>>>> I suspect pipeline and objects laid out like this will lose
> >>>>>> to a BPF program with an parser and single lookup. The p4c-ebpf
> >>>>>> compiler should look to create optimized EBPF code not some
> >>>>>> emulated switch topology.
> >>>>>
> >>>>> The parser is ebpf based. The other objects which require control
> >>>>> plane interaction are not - those interact via netlink.
> >>>>> We published perf data a while back - presented at the P4 workshop
> >>>>> back in April (was in the cover letter)
> >>>>> https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >>>>> But do note: the correct abstraction is the first priority.
> >>>>> Optimization is something we can teach the compiler over time. But
> >>>>> even with the minimalist code generation you can see that our approach
> >>>>> always beats ebpf in LPM and ternary. The other ones I am pretty sure
> >>>>
> >>>> Any idea why? Perhaps the existing eBPF maps are not that suitable for
> >>>> this kinds of lookups? I mean in theory, eBPF should be always faster.
> >>>
> >>> We didnt look closely; however, that is not the point - the point is
> >>> the perf difference if there is one, is not big with the big win being
> >>> proper P4 abstraction. For LPM for sure our algorithmic approach is
> >>> better. For ternary the compute intensity in looping is better done in
> >>> C. And for exact i believe that ebpf uses better hashing.
> >>> Again, that is not the point we were trying to validate in those experiments..
> >>>
> >>> On your point of "maps are not that suitable" P4 tables tend to have
> >>> very specific attributes (examples associated meters, counters,
> >>> default hit and miss actions, etc).
> >>>
> >>>>> we can optimize over time.
> >>>>> Your view of "single lookup" is true for simple programs but if you
> >>>>> have 10 tables trying to model a 5G function then it doesnt make sense
> >>>>> (and i think the data we published was clear that you gain no
> >>>>> advantage using ebpf - as a matter of fact there was no perf
> >>>>> difference between XDP and tc in such cases).
> >>>>>
> >>>>>> 2. p4c-tc control plan looks slower than a directly mmaped bpf
> >>>>>> map. Doing a simple update vs a netlink msg. The argument
> >>>>>> that BPF can't do CRUD (which we had offlist) seems incorrect
> >>>>>> to me. Correct me if I'm wrong with details about why.
> >>>>>
> >>>>> So let me see....
> >>>>> you want me to replace netlink and all its features and rewrite it
> >>>>> using the ebpf system calls? Congestion control, event handling,
> >>>>> arbitrary message crafting, etc and the years of work that went into
> >>>>> netlink? NO to the HELL.
> >>>>
> >>>> Wait, I don't think John suggests anything like that. He just suggests
> >>>> to have the tables as eBPF maps.
> >>>
> >>> What's the difference? Unless maps can do netlink.
> >>>
> >>>> Honestly, I don't understand the
> >>>> fixation on netlink. Its socket messaging, memcpies, processing
> >>>> overhead, etc can't keep up with mmaped memory access at scale. Measure
> >>>> that and I bet you'll get drastically different results.
> >>>>
> >>>> I mean, netlink is good for a lot of things, but does not mean it is an
> >>>> universal answer to userspace<->kernel data passing.
> >>>
> >>> Here's a small sample of our requirements that are satisfied by
> >>> netlink for P4 object hierarchy[1]:
> >>> 1. Msg construction/parsing
> >>> 2. Multi-user request/response messaging
> >>
> >> What is actually a usecase for having multiple users program p4 pipeline
> >> in parallel?
> >
> > First of all - this is Linux, multiple users is a way of life, you
> > shouldnt have to ask that question unless you are trying to be
> > socratic. Meaning multiple control plane apps can be allowed to
> > program different parts and even different tables - think multi-tier
> > pipeline.
> >
> >>> 3. Multi-user event subscribe/publish messaging
> >>
> >> Same here. What is the usecase for multiple users receiving p4 events?
> >
> > Same thing.
> > Note: Events are really not part of P4 but we added them for
> > flexibility - and as you well know they are useful.
> >
> >>> I dont think i need to provide an explanation on the differences here
> >>> visavis what ebpf system calls provide vs what netlink provides and
> >>> how netlink is a clear fit. If it is not clear i can give more
> >>
> >> It is not :/
> >
> > I thought it was obvious for someone like you, but fine - here goes for those 3:
> >
> > 1. Msg construction/parsing: A lot of infra for sending attributes
> > back and forth is already built into netlink. I would have to create
> > mine from scratch for ebpf. This will include not just the
> > construction/parsing but all the detailed attribute content policy
> > validations(even in the presence of hierarchies) that comes with it.
> > And not to forget the state transform between kernel and user space.
> >
> > 2. Multi-user request/response messaging
> > If you can write all the code for #1 above then this should work fine for ebpf
> >
> > 3. Event publish subscribe
> > You would have to create mechanisms for ebpf which either are non
> > trivial or non complete: Example 1: you can put surgeries in the ebpf
> > code to look at map manipulations and then interface it to some event
> > management scheme which checks for subscribed users. Example 2: It may
> > also be feasible to create your own map for subscription vs something
> > like perf ring for event publication(something i have done in the
> > past), but that is also limited in many ways.
>
> I still don't think this answers all the questions on why the netlink
> shim layer. The kfuncs are essentially available to all of tc BPF and
> I don't think there was a discussion why they cannot be done generic
> in a way that they could benefit all tc/XDP BPF users. With the patch
> 14 you are more or less copying what is existing with {cls,act}_bpf
> just that you also allow XDP loading from tc(?). We do have existing
> interfaces for XDP program management.
>
I am not sure i followed - but we are open to suggestions to improve
operational usability.
> tc BPF and XDP already have widely used infrastructure and can be developed
> against libbpf or other user space libraries for a user space control plane.
> With 'control plane' you refer here to the tc / netlink shim you've built,
> but looking at the tc command line examples, this doesn't really provide a
> good user experience (you call it p4 but people load bpf obj files). If the
> expectation is that an operator should run tc commands, then neither it's
> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> to bpf_mprog and plan to also extend this for XDP to have a common look and
> feel wrt networking for developers. Why can't this be reused?
Loading the filter, which loads the program, is considered pipeline
instantiation - consider it "provisioning" more than "control",
which runs at runtime. "control" is purely netlink based. The iproute2
code we use links libbpf, for example for the filter. If we can achieve
the same with bpf_mprog then sure - we just dont want to lose
functionality though. Off the top of my head, some sample space:
- we could have multiple pipelines with different priorities (which tc
provides to us) - and each pipeline may have its own logic with many
tables etc (and the choice to iterate to the next one is essentially
encoded in the tc action codes)
- we use tc block to map groups of ports (which i dont think bpf has
internal access to)
In regards to usability: no, i dont expect someone doing things at
scale to use the command line tc. The APIs are via netlink. But the tc cli
is a must for the rest of the masses per our traditions. Also i really
didnt even want to use ebpf at all, for operator experience reasons -
it requires a compilation of the code and an extra loading step compared to
what our original u32/pedit code offered.
> I don't quite follow why not most of this could be implemented entirely in
> user space without the detour of this and you would provide a developer
> library which could then be integrated into a p4 runtime/frontend? This
> way users never interface with ebpf parts nor tc given they also shouldn't
> have to - it's an implementation detail. This is what John was also pointing
> out earlier.
>
Netlink is the API. We will provide a library for object manipulation
which abstracts away the need to know netlink. Someone who for their
own reasons wants to use p4runtime or TDI could write on top of this.
I would not design a kernel interface just to meet p4runtime (we
already have TDI, which came later and does things differently). So i
expect us to support both of those. And if i were to do something on
SDN that was more robust i would write my own that still uses these
netlink interfaces.
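Illustratively, such a library might expose calls along these lines
(hypothetical names, just to show the layering over netlink):

/* hypothetical library surface - names are illustrative only */
struct p4tc_session;

struct p4tc_session *p4tc_open(const char *pipeline);   /* netlink socket setup */
int p4tc_table_entry_add(struct p4tc_session *s,
                         const char *table,             /* e.g. "mytable" */
                         const char *key,               /* e.g. "10.0.1.2/32" */
                         const char *action,            /* e.g. "send_to_port" */
                         const char *params);           /* e.g. "port eno1" */
int p4tc_subscribe_events(struct p4tc_session *s,
                          void (*cb)(const void *ev, void *arg), void *arg);
void p4tc_close(struct p4tc_session *s);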
> If you need notifications/subscribe mechanism for map updates, then this
> could be extended.. same way like BPF internals got extended along with the
> sched_ext work, making the core pieces more useful also outside of the latter.
>
Why? I already have this working great right now with netlink.
> The link to below slides are not public, so it's hard to see what is really
> meant here, but I have also never seen an email from the speaker on the BPF
> mailing list providing concrete feedback(?). People do build control planes
> around BPF in the wild, I'm not sure where you take 'flush LEDs' from, to
> me this all sounds rather hand-wavy and trying to brute-force the fixation
> on netlink you went with that is raising questions. I don't think there was
> objection on going with eBPF but rather all this infra for the former for
> a SW-only extension.
There are a handful of people who are holding up the release of the
slides (will go and chase them after this).
BTW, our experience in regards to usability of the eBPF control plane is
the same as Ivan's. I was listening to the talk and just nodding along.
You focused too much on the datapath and did a good job there, but i am
afraid not so much on usability of the control path. My view is: to
create a back and forth with the kernel for something as complex as we
have, using the ebpf system calls vs netlink, you would need to spend a
lot more developer resources in the ebpf case. If you want to call
what i have "the fixation on netlink", maybe you are fixated on the ebpf
syscall?;->
cheers,
jamal
> [...]
> >>>>> I should note: that there was an interesting talk at netdevconf 0x17
> >>>>> where the speaker showed the challenges of dealing with ebpf on "day
> >>>>> two" - slides or videos are not up yet, but link is:
> >>>>> https://netdevconf.info/0x17/sessions/talk/is-scaling-ebpf-easy-yet-a-small-step-to-one-server-but-giant-leap-to-distributed-network.html
> >>>>> The point the speaker was making is it's always easy to whip an ebpf
> >>>>> program that can slice and dice packets and maybe even flush LEDs but
> >>>>> the real work and challenge is in the control plane. I agree with the
> >>>>> speaker based on my experiences. This discussion of replacing netlink
> >>>>> with ebpf system calls is absolutely a non-starter. Let's just end the
> >>>>> discussion and agree to disagree if you are going to keep insisting on
> >>>>> that.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-20 22:56 ` Jamal Hadi Salim
@ 2023-11-21 13:06 ` Jiri Pirko
2023-11-21 13:47 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-21 13:06 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
[...]
>
>> tc BPF and XDP already have widely used infrastructure and can be developed
>> against libbpf or other user space libraries for a user space control plane.
>> With 'control plane' you refer here to the tc / netlink shim you've built,
>> but looking at the tc command line examples, this doesn't really provide a
>> good user experience (you call it p4 but people load bpf obj files). If the
>> expectation is that an operator should run tc commands, then neither it's
>> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> feel wrt networking for developers. Why can't this be reused?
>
>The filter loading which loads the program is considered pipeline
>instantiation - consider it as "provisioning" more than "control"
>which runs at runtime. "control" is purely netlink based. The iproute2
>code we use links libbpf for example for the filter. If we can achieve
>the same with bpf_mprog then sure - we just dont want to loose
>functionality though. off top of my head, some sample space:
>- we could have multiple pipelines with different priorities (which tc
>provides to us) - and each pipeline may have its own logic with many
>tables etc (and the choice to iterate the next one is essentially
>encoded in the tc action codes)
>- we use tc block to map groups of ports (which i dont think bpf has
>internal access of)
>
>In regards to usability: no i dont expect someone doing things at
>scale to use command line tc. The APIs are via netlink. But the tc cli
>is must for the rest of the masses per our traditions. Also i really
I don't follow. You repeatedly mention "the must of the traditional tc
cli", but what of the existing traditional cli do you actually use for p4tc?
If I look at the examples, pretty much everything looks new to me.
Example:
tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
action send_to_port param port eno1
This is just TC/RTnetlink used as a channel to pass new things over. If
that is the case, what's traditional here?
>didnt even want to use ebpf at all for operator experience reasons -
>it requires a compilation of the code and an extra loading compared to
>what our original u32/pedit code offered.
>
>> I don't quite follow why not most of this could be implemented entirely in
>> user space without the detour of this and you would provide a developer
>> library which could then be integrated into a p4 runtime/frontend? This
>> way users never interface with ebpf parts nor tc given they also shouldn't
>> have to - it's an implementation detail. This is what John was also pointing
>> out earlier.
>>
>
>Netlink is the API. We will provide a library for object manipulation
>which abstracts away the need to know netlink. Someone who for their
>own reasons wants to use p4runtime or TDI could write on top of this.
>I would not design a kernel interface to just meet p4runtime (we
>already have TDI which came later which does things differently). So i
>expect us to support both those two. And if i was to do something on
>SDN that was more robust i would write my own that still uses these
>netlink interfaces.
Actually, what Daniel says about the p4 library used as a backend to a p4
frontend is pretty much aligned with what I claimed on the p4 calls a couple
of times. If you have this p4 userspace tooling, it is easy for offloads to
replace the backend with a vendor-specific library, which allows p4 offload
suitable for all vendors (your plan of p4tc offload does not work well
for our hw, as we repeatedly claimed).
As I also said on the p4 call a couple of times, I don't see the kernel
as the correct place to do the p4 abstractions. Why don't you do it in
userspace and give vendors the possibility to have p4 backends with compilers,
runtime optimizations etc in userspace, talking to the HW in the
vendor-suitable way too? Then the SW implementation could easily be eBPF,
and the main reason (I believe) why you need to have this in TC
(offload) is then void.
The "everyone wants to use TC/netlink" claim does not seem correct
to me. Why not have one Linux p4 solution that fits everyone's needs?
[...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-21 13:06 ` Jiri Pirko
@ 2023-11-21 13:47 ` Jamal Hadi Salim
2023-11-21 14:19 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-21 13:47 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>
> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>
> [...]
>
> >
> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> against libbpf or other user space libraries for a user space control plane.
> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> but looking at the tc command line examples, this doesn't really provide a
> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> expectation is that an operator should run tc commands, then neither it's
> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> feel wrt networking for developers. Why can't this be reused?
> >
> >The filter loading which loads the program is considered pipeline
> >instantiation - consider it as "provisioning" more than "control"
> >which runs at runtime. "control" is purely netlink based. The iproute2
> >code we use links libbpf for example for the filter. If we can achieve
> >the same with bpf_mprog then sure - we just dont want to loose
> >functionality though. off top of my head, some sample space:
> >- we could have multiple pipelines with different priorities (which tc
> >provides to us) - and each pipeline may have its own logic with many
> >tables etc (and the choice to iterate the next one is essentially
> >encoded in the tc action codes)
> >- we use tc block to map groups of ports (which i dont think bpf has
> >internal access of)
> >
> >In regards to usability: no i dont expect someone doing things at
> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >is must for the rest of the masses per our traditions. Also i really
>
> I don't follow. You repeatedly mention "the must of the traditional tc
> cli", but what of the existing traditional cli you use for p4tc?
> If I look at the examples, pretty much everything looks new to me.
> Example:
>
> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
>
> This is just TC/RTnetlink used as a channel to pass new things over. If
> that is the case, what's traditional here?
>
What is not traditional about it?
>
> >didnt even want to use ebpf at all for operator experience reasons -
> >it requires a compilation of the code and an extra loading compared to
> >what our original u32/pedit code offered.
> >
> >> I don't quite follow why not most of this could be implemented entirely in
> >> user space without the detour of this and you would provide a developer
> >> library which could then be integrated into a p4 runtime/frontend? This
> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> have to - it's an implementation detail. This is what John was also pointing
> >> out earlier.
> >>
> >
> >Netlink is the API. We will provide a library for object manipulation
> >which abstracts away the need to know netlink. Someone who for their
> >own reasons wants to use p4runtime or TDI could write on top of this.
> >I would not design a kernel interface to just meet p4runtime (we
> >already have TDI which came later which does things differently). So i
> >expect us to support both those two. And if i was to do something on
> >SDN that was more robust i would write my own that still uses these
> >netlink interfaces.
>
> Actually, what Daniel says about the p4 library used as a backend to p4
> frontend is pretty much aligned what I claimed on the p4 calls couple of
> times. If you have this p4 userspace tooling, it is easy for offloads to
> replace the backed by vendor-specific library which allows p4 offload
> suitable for all vendors (your plan of p4tc offload does not work well
> for our hw, as we repeatedly claimed).
>
That's you - NVIDIA. You have chosen a path away from the kernel
towards DOCA. I understand NVIDIA's frustration with dealing with
the upstream process (which has been cited to me as a good reason for
DOCA) but please dont impose these values and your politics on other
vendors (Intel, AMD for example) who are more than willing to invest
in making the kernel interfaces the path forward. Your choice.
Nobody is stopping you from offering your customers proprietary
solutions which include a specific ebpf approach alongside DOCA. We
believe that a singular interface regardless of the vendor is the
right way forward. IMHO, this siloing, which unfortunately is also aided
by eBPF being a double-edged sword, is not good for the community.
> As I also said on the p4 call couple of times, I don't see the kernel
> as the correct place to do the p4 abstractions. Why don't you do it in
> userspace and give vendors possiblity to have p4 backends with compilers,
> runtime optimizations etc in userspace, talking to the HW in the
> vendor-suitable way too. Then the SW implementation could be easily eBPF
> and the main reason (I believe) why you need to have this is TC
> (offload) is then void.
>
> The "everyone wants to use TC/netlink" claim does not seem correct
> to me. Why not to have one Linux p4 solution that fits everyones needs?
You mean more fitting to the DOCA world? No, because i am a kernel-first
person and kernel interfaces are good for everyone.
cheers,
jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-21 13:47 ` Jamal Hadi Salim
@ 2023-11-21 14:19 ` Jiri Pirko
2023-11-21 15:21 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-21 14:19 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >>
>> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>>
>> [...]
>>
>> >
>> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> against libbpf or other user space libraries for a user space control plane.
>> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> but looking at the tc command line examples, this doesn't really provide a
>> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> expectation is that an operator should run tc commands, then neither it's
>> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> feel wrt networking for developers. Why can't this be reused?
>> >
>> >The filter loading which loads the program is considered pipeline
>> >instantiation - consider it as "provisioning" more than "control"
>> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >code we use links libbpf for example for the filter. If we can achieve
>> >the same with bpf_mprog then sure - we just dont want to loose
>> >functionality though. off top of my head, some sample space:
>> >- we could have multiple pipelines with different priorities (which tc
>> >provides to us) - and each pipeline may have its own logic with many
>> >tables etc (and the choice to iterate the next one is essentially
>> >encoded in the tc action codes)
>> >- we use tc block to map groups of ports (which i dont think bpf has
>> >internal access of)
>> >
>> >In regards to usability: no i dont expect someone doing things at
>> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >is must for the rest of the masses per our traditions. Also i really
>>
>> I don't follow. You repeatedly mention "the must of the traditional tc
>> cli", but what of the existing traditional cli you use for p4tc?
>> If I look at the examples, pretty much everything looks new to me.
>> Example:
>>
>> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> action send_to_port param port eno1
>>
>> This is just TC/RTnetlink used as a channel to pass new things over. If
>> that is the case, what's traditional here?
>>
>
>
>What is not traditional about it?
Okay, so in that case, the following example, communicating with a
userspace daemon using an imaginary "p4ctrl" app, is equally traditional:
$ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
action send_to_port param port eno1
>
>>
>> >didnt even want to use ebpf at all for operator experience reasons -
>> >it requires a compilation of the code and an extra loading compared to
>> >what our original u32/pedit code offered.
>> >
>> >> I don't quite follow why not most of this could be implemented entirely in
>> >> user space without the detour of this and you would provide a developer
>> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> have to - it's an implementation detail. This is what John was also pointing
>> >> out earlier.
>> >>
>> >
>> >Netlink is the API. We will provide a library for object manipulation
>> >which abstracts away the need to know netlink. Someone who for their
>> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >I would not design a kernel interface to just meet p4runtime (we
>> >already have TDI which came later which does things differently). So i
>> >expect us to support both those two. And if i was to do something on
>> >SDN that was more robust i would write my own that still uses these
>> >netlink interfaces.
>>
>> Actually, what Daniel says about the p4 library used as a backend to p4
>> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> times. If you have this p4 userspace tooling, it is easy for offloads to
>> replace the backed by vendor-specific library which allows p4 offload
>> suitable for all vendors (your plan of p4tc offload does not work well
>> for our hw, as we repeatedly claimed).
>>
>
>That's you - NVIDIA. You have chosen a path away from the kernel
>towards DOCA. I understand NVIDIA's frustration with dealing with
>upstream process (which has been cited to me as a good reason for
>DOCA) but please dont impose these values and your politics on other
>vendors(Intel, AMD for example) who are more than willing to invest
>into making the kernel interfaces the path forward. Your choice.
No, you are missing the point. This has nothing to do with DOCA. This
has to do with the simple limitation of your offload assuming there are
no runtime changes in the compiled pipeline. For Intel, maybe there
aren't, and it's a good fit for them. All I'm saying is that it is not a
good fit for everyone.
>Nobody is stopping you from offering your customers proprietary
>solutions which include a specific ebpf approach alongside DOCA. We
>believe that a singular interface regardless of the vendor is the
>right way forward. IMHO, this siloing that unfortunately is also added
>by eBPF being a double edged sword is not good for the community.
>
>> As I also said on the p4 call couple of times, I don't see the kernel
>> as the correct place to do the p4 abstractions. Why don't you do it in
>> userspace and give vendors possiblity to have p4 backends with compilers,
>> runtime optimizations etc in userspace, talking to the HW in the
>> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> and the main reason (I believe) why you need to have this is TC
>> (offload) is then void.
>>
>> The "everyone wants to use TC/netlink" claim does not seem correct
>> to me. Why not to have one Linux p4 solution that fits everyones needs?
>
>You mean more fitting to the DOCA world? no, because iam a kernel
Again, this has 0 relation to DOCA.
>first person and kernel interfaces are good for everyone.
Yeah, not really. The kernel is not always the right answer. Your/Intel
plan to handle the offload by:
1) abusing devlink to flash the p4 binary
2) parsing the binary in the kernel to match the table IDs of rules
coming from the p4tc ndo_setup_tc
3) abusing devlink to flash the p4 binary for tc-flower
4) parsing the binary in the kernel to match the table IDs of rules
coming from the tc-flower ndo_setup_tc
is really something that makes me a little bit nauseous.
If you don't have a feasible plan to do the offload, p4tc does not make
sense to me, to be honest.
>
>cheers,
>jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-21 14:19 ` Jiri Pirko
@ 2023-11-21 15:21 ` Jamal Hadi Salim
2023-11-22 9:25 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-21 15:21 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >>
> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >>
> >> [...]
> >>
> >> >
> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >
> >> >The filter loading which loads the program is considered pipeline
> >> >instantiation - consider it as "provisioning" more than "control"
> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >functionality though. off top of my head, some sample space:
> >> >- we could have multiple pipelines with different priorities (which tc
> >> >provides to us) - and each pipeline may have its own logic with many
> >> >tables etc (and the choice to iterate the next one is essentially
> >> >encoded in the tc action codes)
> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >internal access of)
> >> >
> >> >In regards to usability: no i dont expect someone doing things at
> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >is must for the rest of the masses per our traditions. Also i really
> >>
> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> cli", but what of the existing traditional cli you use for p4tc?
> >> If I look at the examples, pretty much everything looks new to me.
> >> Example:
> >>
> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> action send_to_port param port eno1
> >>
> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> that is the case, what's traditional here?
> >>
> >
> >
> >What is not traditional about it?
>
> Okay, so in that case, the following example communitating with
> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
Huh? That's just an application - classical tc, which is part of
iproute2, sending to the kernel, no different than "tc flower ...".
Where do you get the "userspace" daemon part from? Yes, you can write a
daemon, but it will use the same APIs as tc.
>
> >
> >>
> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >it requires a compilation of the code and an extra loading compared to
> >> >what our original u32/pedit code offered.
> >> >
> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> user space without the detour of this and you would provide a developer
> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> out earlier.
> >> >>
> >> >
> >> >Netlink is the API. We will provide a library for object manipulation
> >> >which abstracts away the need to know netlink. Someone who for their
> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >already have TDI which came later which does things differently). So i
> >> >expect us to support both those two. And if i was to do something on
> >> >SDN that was more robust i would write my own that still uses these
> >> >netlink interfaces.
> >>
> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> replace the backed by vendor-specific library which allows p4 offload
> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> for our hw, as we repeatedly claimed).
> >>
> >
> >That's you - NVIDIA. You have chosen a path away from the kernel
> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >upstream process (which has been cited to me as a good reason for
> >DOCA) but please dont impose these values and your politics on other
> >vendors(Intel, AMD for example) who are more than willing to invest
> >into making the kernel interfaces the path forward. Your choice.
>
> No, you are missing the point. This has nothing to do with DOCA.
Right Jiri ;->
> This
> has to do with the simple limitation of your offload assuming there are
> no runtime changes in the compiled pipeline. For Intel, maybe they
> aren't, and it's a good fit for them. All I say is, that it is not the
> good fit for everyone.
a) It is not part of the P4 spec to dynamically make changes to the
datapath pipeline after it is created, and we are discussing a P4
implementation, not an extension that would add more value. b) We are
more than happy to add extensions in the future to accommodate such
features, but first the _P4 spec_ must be met. c) We had longer
discussions with Matty, Khalid and the Rice folks who wrote a paper on
that topic (which you probably didn't attend), and everything that
needs to be done for those optimizations can be done from user space
today.
The conclusion is: for what you need to do (which I don't believe is a
limitation in your hardware but rather a design decision on your part),
run your user space daemon, do the optimizations, and update the
datapath. Everybody is happy.
>
> >Nobody is stopping you from offering your customers proprietary
> >solutions which include a specific ebpf approach alongside DOCA. We
> >believe that a singular interface regardless of the vendor is the
> >right way forward. IMHO, this siloing that unfortunately is also added
> >by eBPF being a double edged sword is not good for the community.
> >
> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> runtime optimizations etc in userspace, talking to the HW in the
> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> and the main reason (I believe) why you need to have this is TC
> >> (offload) is then void.
> >>
> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >
> >You mean more fitting to the DOCA world? no, because iam a kernel
>
> Again, this has 0 relation to DOCA.
>
>
> >first person and kernel interfaces are good for everyone.
>
> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> plan to handle the offload by:
> 1) abuse devlink to flash p4 binary
> 2) parse the binary in kernel to match to the table ids of rules coming
> from p4tc ndo_setup_tc
> 3) abuse devlink to flash p4 binary for tc-flower
> 4) parse the binary in kernel to match to the table ids of rules coming
> from tc-flower ndo_setup_tc
> is really something that is making me a little bit nauseous.
>
> If you don't have a feasible plan to do the offload, p4tc does not make
> sense to me to be honest.
You mean if there's no plan to match your (NVIDIA?) point of view.
For #1 - how is this different from DDP? Wasn't that your suggestion to
begin with? For #2, nobody is proposing to do anything of the sort; the
ndo is passed IDs for the objects and their associated contents. For
#3+#4, the tc-flower thing has nothing to do with P4TC - that was just
some random proposal someone made to see if they could ride on top of
P4TC.
Besides this, nobody really has to satisfy your point of view - like I
said earlier, feel free to provide proprietary solutions. From a
consumer perspective I would not want to deal with 4 different
vendors with 4 different proprietary approaches. The kernel is the
unifying part. You seemed happier with tc flower, just not with the
kernel process - which is ironically the same thing we are going
through here ;->
cheers,
jamal
>
> >
> >cheers,
> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-21 15:21 ` Jamal Hadi Salim
@ 2023-11-22 9:25 ` Jiri Pirko
2023-11-22 15:14 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-22 9:25 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >>
>> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> [...]
>> >>
>> >> >
>> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >
>> >> >The filter loading which loads the program is considered pipeline
>> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >functionality though. off top of my head, some sample space:
>> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >encoded in the tc action codes)
>> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >internal access of)
>> >> >
>> >> >In regards to usability: no i dont expect someone doing things at
>> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >is must for the rest of the masses per our traditions. Also i really
>> >>
>> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> If I look at the examples, pretty much everything looks new to me.
>> >> Example:
>> >>
>> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> action send_to_port param port eno1
>> >>
>> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> that is the case, what's traditional here?
>> >>
>> >
>> >
>> >What is not traditional about it?
>>
>> Okay, so in that case, the following example communitating with
>> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> action send_to_port param port eno1
>
>Huh? Thats just an application - classical tc which part of iproute2
>that is sending to the kernel, no different than "tc flower.."
>Where do you get the "userspace" daemon part? Yes, you can write a
>daemon but it will use the same APIs as tc.
Okay, so which part is the "tradition"?
>
>>
>> >
>> >>
>> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >it requires a compilation of the code and an extra loading compared to
>> >> >what our original u32/pedit code offered.
>> >> >
>> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> user space without the detour of this and you would provide a developer
>> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> out earlier.
>> >> >>
>> >> >
>> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >already have TDI which came later which does things differently). So i
>> >> >expect us to support both those two. And if i was to do something on
>> >> >SDN that was more robust i would write my own that still uses these
>> >> >netlink interfaces.
>> >>
>> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> replace the backed by vendor-specific library which allows p4 offload
>> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> for our hw, as we repeatedly claimed).
>> >>
>> >
>> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >upstream process (which has been cited to me as a good reason for
>> >DOCA) but please dont impose these values and your politics on other
>> >vendors(Intel, AMD for example) who are more than willing to invest
>> >into making the kernel interfaces the path forward. Your choice.
>>
>> No, you are missing the point. This has nothing to do with DOCA.
>
>Right Jiri ;->
>
>> This
>> has to do with the simple limitation of your offload assuming there are
>> no runtime changes in the compiled pipeline. For Intel, maybe they
>> aren't, and it's a good fit for them. All I say is, that it is not the
>> good fit for everyone.
>
> a) it is not part of the P4 spec to dynamically make changes to the
>datapath pipeline after it is create and we are discussing a P4
Isn't this up to the implementation? I mean, from the p4 perspective,
everything is static. Hw might need to reshuffle the pipeline internally
during rule insertion/removal in order to optimize the layout.
>implementation not an extension that would add more value b) We are
>more than happy to add extensions in the future to accomodate for
>features but first _P4 spec_ must be met c) we had longer discussions
>with Matty, Khalid and the Rice folks who wrote a paper on that topic
>which you probably didnt attend and everything that needs to be done
>can be from user space today for all those optimizations.
>
>Conclusion is: For what you need to do (which i dont believe is a
>limitation in your hardware rather a design decision on your part) run
>your user space daemon, do optimizations and update the datapath.
>Everybody is happy.
Should the userspace daemon listen over netlink for inserted rules to
be offloaded?
>
>>
>> >Nobody is stopping you from offering your customers proprietary
>> >solutions which include a specific ebpf approach alongside DOCA. We
>> >believe that a singular interface regardless of the vendor is the
>> >right way forward. IMHO, this siloing that unfortunately is also added
>> >by eBPF being a double edged sword is not good for the community.
>> >
>> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> and the main reason (I believe) why you need to have this is TC
>> >> (offload) is then void.
>> >>
>> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >
>> >You mean more fitting to the DOCA world? no, because iam a kernel
>>
>> Again, this has 0 relation to DOCA.
>>
>>
>> >first person and kernel interfaces are good for everyone.
>>
>> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> plan to handle the offload by:
>> 1) abuse devlink to flash p4 binary
>> 2) parse the binary in kernel to match to the table ids of rules coming
>> from p4tc ndo_setup_tc
>> 3) abuse devlink to flash p4 binary for tc-flower
>> 4) parse the binary in kernel to match to the table ids of rules coming
>> from tc-flower ndo_setup_tc
>> is really something that is making me a little bit nauseous.
>>
>> If you don't have a feasible plan to do the offload, p4tc does not make
>> sense to me to be honest.
>
>You mean if there's no plan to match your (NVIDIA?) point of view.
>For #1 - how's this different from DDP? Wasnt that your suggestion to
I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
opposed to from day 1.
>begin with? For #2 Nobody is proposing to do anything of the sort. The
>ndo is passed IDs for the objects and associated contents. For #3+#4
During offload, you need to parse the blob in the driver to be able to
match the IDs with the blob entities. That was presented by you/Intel
in the past, IIRC.
>tc flower thing has nothing to do with P4TC that was just some random
>proposal someone made seeing if they could ride on top of P4TC.
Yeah, it's not yet merged and already mentally used for abuse. I love
that :)
>
>Besides this nobody really has to satisfy your point of view - like i
>said earlier feel free to provide proprietary solutions. From a
>consumer perspective I would not want to deal with 4 different
>vendors with 4 different proprietary approaches. The kernel is the
>unifying part. You seemed happier with tc flower just not with the
Yeah, that is my point: why can't the unifying part be a userspace
daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
I just don't see the kernel as a good fit for the abstraction here,
given the fact that the vendor compilers do not run in the kernel.
That breaks your model.
>kernel process - which is ironically the same thing we are going
>through here ;->
>
>cheers,
>jamal
>
>>
>> >
>> >cheers,
>> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-22 9:25 ` Jiri Pirko
@ 2023-11-22 15:14 ` Jamal Hadi Salim
2023-11-22 18:31 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-22 15:14 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >>
> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >>
> >> >> [...]
> >> >>
> >> >> >
> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >
> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >functionality though. off top of my head, some sample space:
> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >encoded in the tc action codes)
> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >internal access of)
> >> >> >
> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >>
> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> Example:
> >> >>
> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> action send_to_port param port eno1
> >> >>
> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> that is the case, what's traditional here?
> >> >>
> >> >
> >> >
> >> >What is not traditional about it?
> >>
> >> Okay, so in that case, the following example communitating with
> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> action send_to_port param port eno1
> >
> >Huh? Thats just an application - classical tc which part of iproute2
> >that is sending to the kernel, no different than "tc flower.."
> >Where do you get the "userspace" daemon part? Yes, you can write a
> >daemon but it will use the same APIs as tc.
>
> Okay, so which part is the "tradition"?
>
It provides tooling via the tc cli that _everyone_ in the tc world is
familiar with - the same syntax as other tc extensions use, the same
expectations (e.g. events, request/responses, familiar commands for
dumping, flushing, etc). Basically, someone familiar with tc will pick
this up and operate it very quickly and will have an easier time
debugging it.
There are caveats - as there will be with any new classifier - but those
are within reason.
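As a purely illustrative sketch of that workflow (only "create" appears
in the examples in this thread; the other verbs and the use of the
generic tc event monitor below are assumptions about how the tc-style
interaction would look, not commands taken from the patches):
$ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
    action send_to_port param port eno1   # add an entry, as above
$ tc p4ctrl get myprog/table/mytable      # assumed: dump entries, tc-style
$ tc p4ctrl delete myprog/table/mytable dstAddr 10.0.1.2/32   # assumed
$ tc monitor                              # standard tc netlink event stream
The familiarity argument is about the shape of these commands and the
netlink request/response/event model behind them, not the specific verbs.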
> >>
> >> >
> >> >>
> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >what our original u32/pedit code offered.
> >> >> >
> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> out earlier.
> >> >> >>
> >> >> >
> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >already have TDI which came later which does things differently). So i
> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >netlink interfaces.
> >> >>
> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> for our hw, as we repeatedly claimed).
> >> >>
> >> >
> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >upstream process (which has been cited to me as a good reason for
> >> >DOCA) but please dont impose these values and your politics on other
> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >into making the kernel interfaces the path forward. Your choice.
> >>
> >> No, you are missing the point. This has nothing to do with DOCA.
> >
> >Right Jiri ;->
> >
> >> This
> >> has to do with the simple limitation of your offload assuming there are
> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> good fit for everyone.
> >
> > a) it is not part of the P4 spec to dynamically make changes to the
> >datapath pipeline after it is create and we are discussing a P4
>
> Isn't this up to the implementation? I mean from the p4 perspective,
> everything is static. Hw might need to reshuffle the pipeline internally
> during rule insertion/remove in order to optimize the layout.
>
But do note: the focus here is on P4 (hence the name P4TC).
> >implementation not an extension that would add more value b) We are
> >more than happy to add extensions in the future to accomodate for
> >features but first _P4 spec_ must be met c) we had longer discussions
> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >which you probably didnt attend and everything that needs to be done
> >can be from user space today for all those optimizations.
> >
> >Conclusion is: For what you need to do (which i dont believe is a
> >limitation in your hardware rather a design decision on your part) run
> >your user space daemon, do optimizations and update the datapath.
> >Everybody is happy.
>
> Should the userspace daemon listen on inserted rules to be offloade
> over netlink?
>
I mean, you could if you wanted to, given that this is just traditional
netlink which emits events (with some filtering once we integrate the
filter approach). But why?
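(Illustration only: whether table-entry notifications would appear in
the generic tc event stream or need their own monitor subcommand is an
assumption here, not something settled in this thread. A daemon would
follow them the same way other tc events are followed today, e.g.:
$ tc monitor     # print netlink notifications as tc objects change
and react to the entries it cares about.)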
> >
> >>
> >> >Nobody is stopping you from offering your customers proprietary
> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >believe that a singular interface regardless of the vendor is the
> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >by eBPF being a double edged sword is not good for the community.
> >> >
> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> (offload) is then void.
> >> >>
> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >
> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >>
> >> Again, this has 0 relation to DOCA.
> >>
> >>
> >> >first person and kernel interfaces are good for everyone.
> >>
> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> plan to handle the offload by:
> >> 1) abuse devlink to flash p4 binary
> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> from p4tc ndo_setup_tc
> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> from tc-flower ndo_setup_tc
> >> is really something that is making me a little bit nauseous.
> >>
> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> sense to me to be honest.
> >
> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>
> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> opposed to from day 1.
>
>
Oh well - it is in the kernel and it works fine tbh.
> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >ndo is passed IDs for the objects and associated contents. For #3+#4
>
> During offload, you need to parse the blob in driver to be able to match
> the ids with blob entities. That was presented by you/Intel in the past
> IIRC.
>
You are correct - in the case of offload, the netlink IDs will have to
be validated against what the hardware can accept, but the devlink
flash use, I believe, came from you as a compromise.
>
> >tc flower thing has nothing to do with P4TC that was just some random
> >proposal someone made seeing if they could ride on top of P4TC.
>
> Yeah, it's not yet merged and already mentally used for abuse. I love
> that :)
>
> >
> >Besides this nobody really has to satisfy your point of view - like i
> >said earlier feel free to provide proprietary solutions. From a
> >consumer perspective I would not want to deal with 4 different
> >vendors with 4 different proprietary approaches. The kernel is the
> >unifying part. You seemed happier with tc flower just not with the
>
> Yeah, that is my point, why the unifying part can't be a userspace
> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>
> I just don't see the kernel as a good fit for abstraction here,
> given the fact that the vendor compilers does not run in kernel.
> That is breaking your model.
>
Jiri - we want to support P4, first. Like you said, the P4 pipeline,
once installed, is static.
P4 doesn't allow dynamic update of the pipeline. For example, once you
say "here are my 14 tables and their associated actions and here's how
the pipeline's main control (how to iterate the tables etc) is going
to be" and you instantiate/activate that pipeline, you don't go
back 5 minutes later and say "sorry, please introduce table 15, which
I want you to walk to after you visit table 3 if metadata foo is 5" or
"shoot, let's change that table 5 to be exact-match instead of LPM".
It's not anywhere in the spec.
That doesn't mean it is not a useful thing to have - but it is an
invention that has _nothing to do with the P4 spec_; so saying a P4
implementation must support it is a bit out of scope, and there are
vendors with hardware supporting P4 today that don't need any of this.
In my opinion that is a feature that could be added later out of
necessity (there is some good niche value in being able to add some
"dynamism" to any pipeline) and influence the P4 standards on why it
is needed.
It should be doable today in a brute-force way (this is just one
suggestion that came to me when Rice University/Nvidia presented [1]);
I am sure there are other approaches and the idea is by no means
proven. The rough sequence (a sketch with plain tc commands follows
below):
1) User space creates/compiles/adds/activates your program that has 14
tables at tc prio X chain Y
2) a) 5 minutes later user space decides it wants to change the
pipeline and add table 15, visited after table 3 when metadata foo=5
b) your compiler in user space compiles a brand new program which
satisfies #2a (how this program was authored is out of scope of this
discussion)
c) user space adds the new program at tc prio X+1 chain Y or another chain Z
d) user space deletes tc prio X chain Y (and makes sure your packets'
entry point is whatever #c is)
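A minimal sketch of that swap using only generic tc filter syntax; the
P4TC-specific program-attachment arguments are elided as "..." since
the exact syntax is not part of this thread, and the dev/prio/chain
values are made up for illustration:
$ tc filter add dev eth0 ingress prio 10 chain 0 ...  # step 1: the 14-table program
  (user space compiler later produces the replacement program, #2a/#2b)
$ tc filter add dev eth0 ingress prio 11 chain 0 ...  # step 2c: new program at prio X+1
$ tc filter del dev eth0 ingress prio 10 chain 0      # step 2d: retire the old pipeline
The point is that the "update" is a whole-program add followed by a
delete, not an in-place mutation of the running pipeline.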
cheers,
jamal
[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>
> >kernel process - which is ironically the same thing we are going
> >through here ;->
> >
> >cheers,
> >jamal
> >
> >>
> >> >
> >> >cheers,
> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-22 15:14 ` Jamal Hadi Salim
@ 2023-11-22 18:31 ` Jiri Pirko
2023-11-22 18:50 ` John Fastabend
2023-11-22 19:35 ` Jamal Hadi Salim
0 siblings, 2 replies; 79+ messages in thread
From: Jiri Pirko @ 2023-11-22 18:31 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >>
>> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >>
>> >> >> [...]
>> >> >>
>> >> >> >
>> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >
>> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >encoded in the tc action codes)
>> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >internal access of)
>> >> >> >
>> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >>
>> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> Example:
>> >> >>
>> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> action send_to_port param port eno1
>> >> >>
>> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> that is the case, what's traditional here?
>> >> >>
>> >> >
>> >> >
>> >> >What is not traditional about it?
>> >>
>> >> Okay, so in that case, the following example communitating with
>> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> action send_to_port param port eno1
>> >
>> >Huh? Thats just an application - classical tc which part of iproute2
>> >that is sending to the kernel, no different than "tc flower.."
>> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >daemon but it will use the same APIs as tc.
>>
>> Okay, so which part is the "tradition"?
>>
>
>Provides tooling via tc cli that _everyone_ in the tc world is
>familiar with - which uses the same syntax as other tc extensions do,
>same expectations (eg events, request responses, familiar commands for
>dumping, flushing etc). Basically someone familiar with tc will pick
>this up and operate it very quickly and would have an easier time
>debugging it.
>There are caveats - as will be with all new classifiers - but those
>are within reason.
Okay, so syntax-familiarity-wise, what's the difference between the
following 2 approaches:
$ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
action send_to_port param port eno1
$ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
action send_to_port param port eno1
?
>
>> >>
>> >> >
>> >> >>
>> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >what our original u32/pedit code offered.
>> >> >> >
>> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> out earlier.
>> >> >> >>
>> >> >> >
>> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >netlink interfaces.
>> >> >>
>> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> for our hw, as we repeatedly claimed).
>> >> >>
>> >> >
>> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >upstream process (which has been cited to me as a good reason for
>> >> >DOCA) but please dont impose these values and your politics on other
>> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >into making the kernel interfaces the path forward. Your choice.
>> >>
>> >> No, you are missing the point. This has nothing to do with DOCA.
>> >
>> >Right Jiri ;->
>> >
>> >> This
>> >> has to do with the simple limitation of your offload assuming there are
>> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> good fit for everyone.
>> >
>> > a) it is not part of the P4 spec to dynamically make changes to the
>> >datapath pipeline after it is create and we are discussing a P4
>>
>> Isn't this up to the implementation? I mean from the p4 perspective,
>> everything is static. Hw might need to reshuffle the pipeline internally
>> during rule insertion/remove in order to optimize the layout.
>>
>
>But do note: the focus here is on P4 (hence the name P4TC).
>
>> >implementation not an extension that would add more value b) We are
>> >more than happy to add extensions in the future to accomodate for
>> >features but first _P4 spec_ must be met c) we had longer discussions
>> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >which you probably didnt attend and everything that needs to be done
>> >can be from user space today for all those optimizations.
>> >
>> >Conclusion is: For what you need to do (which i dont believe is a
>> >limitation in your hardware rather a design decision on your part) run
>> >your user space daemon, do optimizations and update the datapath.
>> >Everybody is happy.
>>
>> Should the userspace daemon listen on inserted rules to be offloade
>> over netlink?
>>
>
>I mean you could if you wanted to given this is just traditional
>netlink which emits events (with some filtering when we integrate the
>filter approach). But why?
Nevermind.
>
>> >
>> >>
>> >> >Nobody is stopping you from offering your customers proprietary
>> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >believe that a singular interface regardless of the vendor is the
>> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >by eBPF being a double edged sword is not good for the community.
>> >> >
>> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> (offload) is then void.
>> >> >>
>> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >
>> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >>
>> >> Again, this has 0 relation to DOCA.
>> >>
>> >>
>> >> >first person and kernel interfaces are good for everyone.
>> >>
>> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> plan to handle the offload by:
>> >> 1) abuse devlink to flash p4 binary
>> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> from p4tc ndo_setup_tc
>> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> from tc-flower ndo_setup_tc
>> >> is really something that is making me a little bit nauseous.
>> >>
>> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> sense to me to be honest.
>> >
>> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>>
>> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> opposed to from day 1.
>>
>>
>
>Oh well - it is in the kernel and it works fine tbh.
>
>> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >ndo is passed IDs for the objects and associated contents. For #3+#4
>>
>> During offload, you need to parse the blob in driver to be able to match
>> the ids with blob entities. That was presented by you/Intel in the past
>> IIRC.
>>
>
>You are correct - in case of offload the netlink IDs will have to be
>authenticated against what the hardware can accept, but the devlink
>flash use i believe was from you as a compromise.
Definitely not. I'm against devlink abuse for this from day 1.
>
>>
>> >tc flower thing has nothing to do with P4TC that was just some random
>> >proposal someone made seeing if they could ride on top of P4TC.
>>
>> Yeah, it's not yet merged and already mentally used for abuse. I love
>> that :)
>>
>> >
>> >Besides this nobody really has to satisfy your point of view - like i
>> >said earlier feel free to provide proprietary solutions. From a
>> >consumer perspective I would not want to deal with 4 different
>> >vendors with 4 different proprietary approaches. The kernel is the
>> >unifying part. You seemed happier with tc flower just not with the
>>
>> Yeah, that is my point, why the unifying part can't be a userspace
>> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>>
>> I just don't see the kernel as a good fit for abstraction here,
>> given the fact that the vendor compilers does not run in kernel.
>> That is breaking your model.
>>
>
>Jiri - we want to support P4, first. Like you said the P4 pipeline,
>once installed is static.
>P4 doesnt allow dynamic update of the pipeline. For example, once you
>say "here are my 14 tables and their associated actions and here's how
>the pipeline main control (on how to iterate the tables etc) is going
>to be" and after you instantiate/activate that pipeline, you dont go
>back 5 minutes later and say "sorry, please introduce table 15, which
>i want you to walk to after you visit table 3 if metadata foo is 5" or
>"shoot, let's change that table 5 to be exact instead of LPM". It's
>not anywhere in the spec.
>That doesnt mean it is not useful thing to have - but it is an
>invention that has _nothing to do with the P4 spec_; so saying a P4
>implementation must support it is a bit out of scope and there are
>vendors with hardware who support P4 today that dont need any of this.
I'm not talking about the spec. I'm talking about the offload
implementation, the offload compiler, the offload runtime manager. You
don't have those in the kernel. That is the issue. The runtime manager
is the one that decides and reshuffles the hw internals. Again, this
has nothing to do with the p4 frontend. This is offload implementation.
And that is why I believe your p4 kernel implementation is unoffloadable.
And if it is unoffloadable, do we really need it? IDK.
>In my opinion that is a feature that could be added later out of
>necessity (there is some good niche value in being able to add some
>"dynamicism" to any pipeline) and influence the P4 standards on why it
>is needed.
>It should be doable today in a brute force way (this is just one
>suggestion that came to me when Rice University/Nvidia presented[1]);
>i am sure there are other approaches and the idea is by no means
>proven.
>
>1) User space Creates/compiles/Adds/activate your program that has 14
>tables at tc prio X chain Y
>2) a) 5 minutes later user space decides it wants to change and add
>table 3 after table 15, visited when metadata foo=5
> b) your compiler in user space compiles a brand new program which
>satisfies #2a (how this program was authored is out of scope of
>discussion)
> c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> d) user space delete tc prio X chain Y (and make sure your packets
>entry point is whatever #c is)
I never suggested anything like what you describe. I'm not sure why you
think so.
>
>cheers,
>jamal
>
>[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>
>>
>> >kernel process - which is ironically the same thing we are going
>> >through here ;->
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >
>> >> >cheers,
>> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-22 18:31 ` Jiri Pirko
@ 2023-11-22 18:50 ` John Fastabend
2023-11-22 19:35 ` Jamal Hadi Salim
1 sibling, 0 replies; 79+ messages in thread
From: John Fastabend @ 2023-11-22 18:50 UTC (permalink / raw)
To: Jiri Pirko, Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Jiri Pirko wrote:
> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >>
> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >>
> >> >> >> [...]
> >> >> >>
> >> >> >> >
> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >
> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >encoded in the tc action codes)
> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >internal access of)
> >> >> >> >
> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >>
> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> Example:
> >> >> >>
> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> action send_to_port param port eno1
> >> >> >>
> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> that is the case, what's traditional here?
> >> >> >>
> >> >> >
> >> >> >
> >> >> >What is not traditional about it?
> >> >>
> >> >> Okay, so in that case, the following example communitating with
> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> action send_to_port param port eno1
> >> >
> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >that is sending to the kernel, no different than "tc flower.."
> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >daemon but it will use the same APIs as tc.
> >>
> >> Okay, so which part is the "tradition"?
> >>
> >
> >Provides tooling via tc cli that _everyone_ in the tc world is
> >familiar with - which uses the same syntax as other tc extensions do,
> >same expectations (eg events, request responses, familiar commands for
> >dumping, flushing etc). Basically someone familiar with tc will pick
> >this up and operate it very quickly and would have an easier time
> >debugging it.
> >There are caveats - as will be with all new classifiers - but those
> >are within reason.
>
> Okay, so syntax familiarity wise, what's the difference between
> following 2 approaches:
> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
> ?
>
>
> >
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >
> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> out earlier.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >netlink interfaces.
> >> >> >>
> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >>
> >> >> >
> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >>
> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >
> >> >Right Jiri ;->
> >> >
> >> >> This
> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> good fit for everyone.
> >> >
> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >datapath pipeline after it is create and we are discussing a P4
> >>
> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> during rule insertion/remove in order to optimize the layout.
> >>
> >
> >But do note: the focus here is on P4 (hence the name P4TC).
> >
> >> >implementation not an extension that would add more value b) We are
> >> >more than happy to add extensions in the future to accomodate for
> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >which you probably didnt attend and everything that needs to be done
> >> >can be from user space today for all those optimizations.
> >> >
> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >limitation in your hardware rather a design decision on your part) run
> >> >your user space daemon, do optimizations and update the datapath.
> >> >Everybody is happy.
> >>
> >> Should the userspace daemon listen on inserted rules to be offloade
> >> over netlink?
> >>
> >
> >I mean you could if you wanted to given this is just traditional
> >netlink which emits events (with some filtering when we integrate the
> >filter approach). But why?
>
> Nevermind.
>
>
> >
> >> >
> >> >>
> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >
> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> (offload) is then void.
> >> >> >>
> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >
> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >>
> >> >> Again, this has 0 relation to DOCA.
> >> >>
> >> >>
> >> >> >first person and kernel interfaces are good for everyone.
> >> >>
> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> plan to handle the offload by:
> >> >> 1) abuse devlink to flash p4 binary
> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> from p4tc ndo_setup_tc
> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> from tc-flower ndo_setup_tc
> >> >> is really something that is making me a little bit nauseous.
> >> >>
> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> sense to me to be honest.
> >> >
> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >>
> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> opposed to from day 1.
> >>
> >>
> >
> >Oh well - it is in the kernel and it works fine tbh.
> >
> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >>
> >> During offload, you need to parse the blob in driver to be able to match
> >> the ids with blob entities. That was presented by you/Intel in the past
> >> IIRC.
> >>
> >
> >You are correct - in case of offload the netlink IDs will have to be
> >authenticated against what the hardware can accept, but the devlink
> >flash use i believe was from you as a compromise.
>
> Definitelly not. I'm against devlink abuse for this from day 1.
>
>
> >
> >>
> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >proposal someone made seeing if they could ride on top of P4TC.
> >>
> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> that :)
> >>
> >> >
> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >said earlier feel free to provide proprietary solutions. From a
> >> >consumer perspective I would not want to deal with 4 different
> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >unifying part. You seemed happier with tc flower just not with the
> >>
> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >>
> >> I just don't see the kernel as a good fit for abstraction here,
> >> given the fact that the vendor compilers does not run in kernel.
> >> That is breaking your model.
> >>
> >
> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >once installed is static.
> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >say "here are my 14 tables and their associated actions and here's how
> >the pipeline main control (on how to iterate the tables etc) is going
> >to be" and after you instantiate/activate that pipeline, you dont go
> >back 5 minutes later and say "sorry, please introduce table 15, which
> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >not anywhere in the spec.
> >That doesnt mean it is not useful thing to have - but it is an
> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >implementation must support it is a bit out of scope and there are
> >vendors with hardware who support P4 today that dont need any of this.
>
> I'm not talking about the spec. I'm talking about the offload
> implemetation, the offload compiler the offload runtime manager. You
> don't have those in kernel. That is the issue. The runtime manager is
> the one to decide and reshuffle the hw internals. Again, this has
> nothing to do with p4 frontend. This is offload implementation.
>
> And that is why I believe your p4 kernel implementation is unoffloadable.
> And if it is unoffloadable, do we really need it? IDK.
And my point is that we already have a way to do a software P4 implementation
in BPF, so I don't see the need for yet another mechanism. And if
P4TC is not amenable to hardware offload, then what happens when we do
need a P4 hardware offload? Do we write another P4TC-HW?
My opinion is that if we push a DSL into the kernel (which I also sort of
think is not a great idea), then it should at least support offloads
from the start. I know P4 can be used for software, but really its
main value here is hardware offload. Even the name PNA, Portable NIC
Architecture, implies it is about NICs. If it were software-focused,
portability wouldn't matter: a general-purpose CPU can emulate
almost any architecture you would like.
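For context, a minimal sketch of the existing BPF route referred to above,
assuming p4c's eBPF backend is used for the software datapath (the tool
invocations and flags below are illustrative assumptions, not taken from this
thread):

# compile the P4 program to C via p4c's eBPF backend (assumed invocation)
$ p4c-ebpf myprog.p4 -o myprog.c
# compile the generated C into a BPF object
$ clang -O2 -target bpf -c myprog.c -o myprog.o
# attach it at TC ingress; table entries are then managed through BPF maps
$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 ingress bpf da obj myprog.o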
>
>
> >In my opinion that is a feature that could be added later out of
> >necessity (there is some good niche value in being able to add some
> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >is needed.
> >It should be doable today in a brute force way (this is just one
> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >i am sure there are other approaches and the idea is by no means
> >proven.
> >
> >1) User space Creates/compiles/Adds/activate your program that has 14
> >tables at tc prio X chain Y
> >2) a) 5 minutes later user space decides it wants to change and add
> >table 3 after table 15, visited when metadata foo=5
> > b) your compiler in user space compiles a brand new program which
> >satisfies #2a (how this program was authored is out of scope of
> >discussion)
> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> > d) user space delete tc prio X chain Y (and make sure your packets
> >entry point is whatever #c is)
>
> I never suggested anything like what you describe. I'm not sure why you
> think so.
>
>
> >
> >cheers,
> >jamal
> >
> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >
> >>
> >> >kernel process - which is ironically the same thing we are going
> >> >through here ;->
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >
> >> >> >cheers,
> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-22 18:31 ` Jiri Pirko
2023-11-22 18:50 ` John Fastabend
@ 2023-11-22 19:35 ` Jamal Hadi Salim
2023-11-23 6:36 ` Jiri Pirko
1 sibling, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-22 19:35 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >>
> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >>
> >> >> >> [...]
> >> >> >>
> >> >> >> >
> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >
> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >encoded in the tc action codes)
> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >internal access of)
> >> >> >> >
> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >>
> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> Example:
> >> >> >>
> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> action send_to_port param port eno1
> >> >> >>
> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> that is the case, what's traditional here?
> >> >> >>
> >> >> >
> >> >> >
> >> >> >What is not traditional about it?
> >> >>
> >> >> Okay, so in that case, the following example communitating with
> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> action send_to_port param port eno1
> >> >
> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >that is sending to the kernel, no different than "tc flower.."
> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >daemon but it will use the same APIs as tc.
> >>
> >> Okay, so which part is the "tradition"?
> >>
> >
> >Provides tooling via tc cli that _everyone_ in the tc world is
> >familiar with - which uses the same syntax as other tc extensions do,
> >same expectations (eg events, request responses, familiar commands for
> >dumping, flushing etc). Basically someone familiar with tc will pick
> >this up and operate it very quickly and would have an easier time
> >debugging it.
> >There are caveats - as will be with all new classifiers - but those
> >are within reason.
>
> Okay, so syntax familiarity wise, what's the difference between
> following 2 approaches:
> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> action send_to_port param port eno1
> ?
>
>
> >
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >
> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> out earlier.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >netlink interfaces.
> >> >> >>
> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >>
> >> >> >
> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >>
> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >
> >> >Right Jiri ;->
> >> >
> >> >> This
> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> good fit for everyone.
> >> >
> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >datapath pipeline after it is create and we are discussing a P4
> >>
> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> during rule insertion/remove in order to optimize the layout.
> >>
> >
> >But do note: the focus here is on P4 (hence the name P4TC).
> >
> >> >implementation not an extension that would add more value b) We are
> >> >more than happy to add extensions in the future to accomodate for
> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >which you probably didnt attend and everything that needs to be done
> >> >can be from user space today for all those optimizations.
> >> >
> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >limitation in your hardware rather a design decision on your part) run
> >> >your user space daemon, do optimizations and update the datapath.
> >> >Everybody is happy.
> >>
> >> Should the userspace daemon listen on inserted rules to be offloade
> >> over netlink?
> >>
> >
> >I mean you could if you wanted to given this is just traditional
> >netlink which emits events (with some filtering when we integrate the
> >filter approach). But why?
>
> Nevermind.
>
>
> >
> >> >
> >> >>
> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >
> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> (offload) is then void.
> >> >> >>
> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >
> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >>
> >> >> Again, this has 0 relation to DOCA.
> >> >>
> >> >>
> >> >> >first person and kernel interfaces are good for everyone.
> >> >>
> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> plan to handle the offload by:
> >> >> 1) abuse devlink to flash p4 binary
> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> from p4tc ndo_setup_tc
> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> from tc-flower ndo_setup_tc
> >> >> is really something that is making me a little bit nauseous.
> >> >>
> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> sense to me to be honest.
> >> >
> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >>
> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> opposed to from day 1.
> >>
> >>
> >
> >Oh well - it is in the kernel and it works fine tbh.
> >
> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >>
> >> During offload, you need to parse the blob in driver to be able to match
> >> the ids with blob entities. That was presented by you/Intel in the past
> >> IIRC.
> >>
> >
> >You are correct - in case of offload the netlink IDs will have to be
> >authenticated against what the hardware can accept, but the devlink
> >flash use i believe was from you as a compromise.
>
> Definitelly not. I'm against devlink abuse for this from day 1.
>
>
> >
> >>
> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >proposal someone made seeing if they could ride on top of P4TC.
> >>
> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> that :)
> >>
> >> >
> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >said earlier feel free to provide proprietary solutions. From a
> >> >consumer perspective I would not want to deal with 4 different
> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >unifying part. You seemed happier with tc flower just not with the
> >>
> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >>
> >> I just don't see the kernel as a good fit for abstraction here,
> >> given the fact that the vendor compilers does not run in kernel.
> >> That is breaking your model.
> >>
> >
> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >once installed is static.
> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >say "here are my 14 tables and their associated actions and here's how
> >the pipeline main control (on how to iterate the tables etc) is going
> >to be" and after you instantiate/activate that pipeline, you dont go
> >back 5 minutes later and say "sorry, please introduce table 15, which
> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >not anywhere in the spec.
> >That doesnt mean it is not useful thing to have - but it is an
> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >implementation must support it is a bit out of scope and there are
> >vendors with hardware who support P4 today that dont need any of this.
>
> I'm not talking about the spec. I'm talking about the offload
> implemetation, the offload compiler the offload runtime manager. You
> don't have those in kernel. That is the issue. The runtime manager is
> the one to decide and reshuffle the hw internals. Again, this has
> nothing to do with p4 frontend. This is offload implementation.
>
> And that is why I believe your p4 kernel implementation is unoffloadable.
> And if it is unoffloadable, do we really need it? IDK.
>
Say what?
It's not offloadable in your hardware, you mean? Because I have beside
me here an Intel E2000 which offloads just fine (and the AMD folks
seem fine too).
If your view is that all these runtime optimizations amount to a
compiler in the kernel/driver, that is, well, your view. In my
view (and others have said this to you already) the P4C compiler is
responsible for resource optimizations. The hardware supports P4; you
give it constraints and it knows what to do. At runtime, anything a
driver needs to do for resource optimization (re-sorting, reshuffling,
etc.) is not a P4 problem - sorry if you have issues in your
architecture approach.
> >In my opinion that is a feature that could be added later out of
> >necessity (there is some good niche value in being able to add some
> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >is needed.
> >It should be doable today in a brute force way (this is just one
> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >i am sure there are other approaches and the idea is by no means
> >proven.
> >
> >1) User space Creates/compiles/Adds/activate your program that has 14
> >tables at tc prio X chain Y
> >2) a) 5 minutes later user space decides it wants to change and add
> >table 3 after table 15, visited when metadata foo=5
> > b) your compiler in user space compiles a brand new program which
> >satisfies #2a (how this program was authored is out of scope of
> >discussion)
> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> > d) user space delete tc prio X chain Y (and make sure your packets
> >entry point is whatever #c is)
>
> I never suggested anything like what you describe. I'm not sure why you
> think so.
It's the same class of problems - the paper I pointed to (coauthored
by Matty and others) has runtime resource optimizations which are
tantamount to changing the nature of the pipeline. We may need to
profile in the kernel, but all those optimizations can be derived in
user space using the approach I described.
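To make that brute-force swap concrete, a minimal sketch using plain tc-bpf
filters follows (the device, priorities and object names are hypothetical, and
the real P4TC pipeline-instantiation syntax differs; this only illustrates the
prio X -> prio X+1 replacement from steps 1) and 2c)/2d) quoted above):

# 1) install the original program at prio X (10), chain Y (0)
$ tc filter add dev eth0 ingress prio 10 chain 0 bpf da obj myprog_v1.o
# 2c) add the recompiled program (with the new table) at prio X+1
$ tc filter add dev eth0 ingress prio 11 chain 0 bpf da obj myprog_v2.o
# 2d) delete the old program; prio 11 now becomes the entry point
$ tc filter del dev eth0 ingress prio 10 chain 0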
cheers,
jamal
> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >
> >>
> >> >kernel process - which is ironically the same thing we are going
> >> >through here ;->
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >
> >> >> >cheers,
> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-22 19:35 ` Jamal Hadi Salim
@ 2023-11-23 6:36 ` Jiri Pirko
2023-11-23 13:22 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 6:36 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >>
>> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >>
>> >> >> >> [...]
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >
>> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >encoded in the tc action codes)
>> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >internal access of)
>> >> >> >> >
>> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >>
>> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> Example:
>> >> >> >>
>> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> action send_to_port param port eno1
>> >> >> >>
>> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> that is the case, what's traditional here?
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >What is not traditional about it?
>> >> >>
>> >> >> Okay, so in that case, the following example communitating with
>> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> action send_to_port param port eno1
>> >> >
>> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >daemon but it will use the same APIs as tc.
>> >>
>> >> Okay, so which part is the "tradition"?
>> >>
>> >
>> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >familiar with - which uses the same syntax as other tc extensions do,
>> >same expectations (eg events, request responses, familiar commands for
>> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >this up and operate it very quickly and would have an easier time
>> >debugging it.
>> >There are caveats - as will be with all new classifiers - but those
>> >are within reason.
>>
>> Okay, so syntax familiarity wise, what's the difference between
>> following 2 approaches:
>> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> action send_to_port param port eno1
>> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> action send_to_port param port eno1
>> ?
>>
>>
>> >
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >
>> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> out earlier.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >netlink interfaces.
>> >> >> >>
>> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >>
>> >> >> >
>> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >>
>> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >
>> >> >Right Jiri ;->
>> >> >
>> >> >> This
>> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> good fit for everyone.
>> >> >
>> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >datapath pipeline after it is create and we are discussing a P4
>> >>
>> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> during rule insertion/remove in order to optimize the layout.
>> >>
>> >
>> >But do note: the focus here is on P4 (hence the name P4TC).
>> >
>> >> >implementation not an extension that would add more value b) We are
>> >> >more than happy to add extensions in the future to accomodate for
>> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >which you probably didnt attend and everything that needs to be done
>> >> >can be from user space today for all those optimizations.
>> >> >
>> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >limitation in your hardware rather a design decision on your part) run
>> >> >your user space daemon, do optimizations and update the datapath.
>> >> >Everybody is happy.
>> >>
>> >> Should the userspace daemon listen on inserted rules to be offloade
>> >> over netlink?
>> >>
>> >
>> >I mean you could if you wanted to given this is just traditional
>> >netlink which emits events (with some filtering when we integrate the
>> >filter approach). But why?
>>
>> Nevermind.
>>
>>
>> >
>> >> >
>> >> >>
>> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >
>> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> (offload) is then void.
>> >> >> >>
>> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >
>> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >>
>> >> >> Again, this has 0 relation to DOCA.
>> >> >>
>> >> >>
>> >> >> >first person and kernel interfaces are good for everyone.
>> >> >>
>> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> plan to handle the offload by:
>> >> >> 1) abuse devlink to flash p4 binary
>> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> from p4tc ndo_setup_tc
>> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> from tc-flower ndo_setup_tc
>> >> >> is really something that is making me a little bit nauseous.
>> >> >>
>> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> sense to me to be honest.
>> >> >
>> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >>
>> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> opposed to from day 1.
>> >>
>> >>
>> >
>> >Oh well - it is in the kernel and it works fine tbh.
>> >
>> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >>
>> >> During offload, you need to parse the blob in driver to be able to match
>> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> IIRC.
>> >>
>> >
>> >You are correct - in case of offload the netlink IDs will have to be
>> >authenticated against what the hardware can accept, but the devlink
>> >flash use i believe was from you as a compromise.
>>
>> Definitelly not. I'm against devlink abuse for this from day 1.
>>
>>
>> >
>> >>
>> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >>
>> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> that :)
>> >>
>> >> >
>> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >consumer perspective I would not want to deal with 4 different
>> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >unifying part. You seemed happier with tc flower just not with the
>> >>
>> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >>
>> >> I just don't see the kernel as a good fit for abstraction here,
>> >> given the fact that the vendor compilers does not run in kernel.
>> >> That is breaking your model.
>> >>
>> >
>> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >once installed is static.
>> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >say "here are my 14 tables and their associated actions and here's how
>> >the pipeline main control (on how to iterate the tables etc) is going
>> >to be" and after you instantiate/activate that pipeline, you dont go
>> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >not anywhere in the spec.
>> >That doesnt mean it is not useful thing to have - but it is an
>> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >implementation must support it is a bit out of scope and there are
>> >vendors with hardware who support P4 today that dont need any of this.
>>
>> I'm not talking about the spec. I'm talking about the offload
>> implemetation, the offload compiler the offload runtime manager. You
>> don't have those in kernel. That is the issue. The runtime manager is
>> the one to decide and reshuffle the hw internals. Again, this has
>> nothing to do with p4 frontend. This is offload implementation.
>>
>> And that is why I believe your p4 kernel implementation is unoffloadable.
>> And if it is unoffloadable, do we really need it? IDK.
>>
>
>Say what?
>It's not offloadable in your hardware, you mean? Because i have beside
>me here an intel e2000 which offloads just fine (and the AMD folks
>seem fine too).
Will Intel and AMD have a compiler in the kernel, so that no blob transfer
and no parsing of it in the kernel would be needed? No.
>If your view is that all these runtime optimization surmount to a
>compiler in the kernel/driver that is your, well, your view. In my
>view (and others have said this to you already) the P4C compiler is
>responsible for resource optimizations. The hardware supports P4, you
>give it constraints and it knows what to do. At runtime, anything a
>driver needs to do for resource optimization (resorting, reshuffling
>etc), that is not a P4 problem - sorry if you have issues in your
>architecture approach.
Sure, it is the offload implementation's problem. And for those, you need
to use userspace components. And that is the problem. This discussion
leads nowhere; I don't know how differently I should describe this.
>
>> >In my opinion that is a feature that could be added later out of
>> >necessity (there is some good niche value in being able to add some
>> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >is needed.
>> >It should be doable today in a brute force way (this is just one
>> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >i am sure there are other approaches and the idea is by no means
>> >proven.
>> >
>> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >tables at tc prio X chain Y
>> >2) a) 5 minutes later user space decides it wants to change and add
>> >table 3 after table 15, visited when metadata foo=5
>> > b) your compiler in user space compiles a brand new program which
>> >satisfies #2a (how this program was authored is out of scope of
>> >discussion)
>> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> > d) user space delete tc prio X chain Y (and make sure your packets
>> >entry point is whatever #c is)
>>
>> I never suggested anything like what you describe. I'm not sure why you
>> think so.
>
>It's the same class of problems - the paper i pointed to (coauthored
>by Matty and others) has runtime resource optimizations which are
>tantamount to changing the nature of the pipeline. We may need to
>profile in the kernel but all those optimizations can be derived in
>user space using the approach I described.
>
>cheers,
>jamal
>
>
>> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >
>> >>
>> >> >kernel process - which is ironically the same thing we are going
>> >> >through here ;->
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >>
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 6:36 ` Jiri Pirko
@ 2023-11-23 13:22 ` Jamal Hadi Salim
2023-11-23 13:34 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-23 13:22 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >>
> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >>
> >> >> >> >> [...]
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >
> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >internal access of)
> >> >> >> >> >
> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >>
> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> Example:
> >> >> >> >>
> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> action send_to_port param port eno1
> >> >> >> >>
> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >What is not traditional about it?
> >> >> >>
> >> >> >> Okay, so in that case, the following example communitating with
> >> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> action send_to_port param port eno1
> >> >> >
> >> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >daemon but it will use the same APIs as tc.
> >> >>
> >> >> Okay, so which part is the "tradition"?
> >> >>
> >> >
> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >same expectations (eg events, request responses, familiar commands for
> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >this up and operate it very quickly and would have an easier time
> >> >debugging it.
> >> >There are caveats - as will be with all new classifiers - but those
> >> >are within reason.
> >>
> >> Okay, so syntax familiarity wise, what's the difference between
> >> following 2 approaches:
> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> action send_to_port param port eno1
> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> action send_to_port param port eno1
> >> ?
> >>
> >>
> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >
> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> out earlier.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >netlink interfaces.
> >> >> >> >>
> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >>
> >> >> >> >
> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >>
> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >
> >> >> >Right Jiri ;->
> >> >> >
> >> >> >> This
> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> good fit for everyone.
> >> >> >
> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >datapath pipeline after it is create and we are discussing a P4
> >> >>
> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> during rule insertion/remove in order to optimize the layout.
> >> >>
> >> >
> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >
> >> >> >implementation not an extension that would add more value b) We are
> >> >> >more than happy to add extensions in the future to accomodate for
> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >can be from user space today for all those optimizations.
> >> >> >
> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >Everybody is happy.
> >> >>
> >> >> Should the userspace daemon listen on inserted rules to be offloaded
> >> >> over netlink?
> >> >>
> >> >
> >> >I mean you could if you wanted to given this is just traditional
> >> >netlink which emits events (with some filtering when we integrate the
> >> >filter approach). But why?
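
(For the record, nothing P4TC-specific is needed to watch those events -
they are ordinary rtnetlink notifications. A rough sketch in Python of a
user space listener; the constants and header layout below are plain
rtnetlink, nothing introduced by this patchset, and attribute decoding
is left out:)

  import socket
  import struct

  RTMGRP_TC = 0x8   # rtnetlink multicast group carrying qdisc/filter/action events

  sk = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
  sk.bind((0, RTMGRP_TC))   # pid 0: kernel assigns one; subscribe to the TC group

  while True:
      data = sk.recv(65535)
      off = 0
      # one datagram may carry several netlink messages; walk the 16-byte headers
      while off + 16 <= len(data):
          nlen, ntype, nflags, nseq, npid = struct.unpack_from("=IHHII", data, off)
          if nlen < 16:
              break
          print("tc event: nlmsg_type=%d len=%d" % (ntype, nlen))
          off += (nlen + 3) & ~3   # NLMSG_ALIGN
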
> >>
> >> Nevermind.
> >>
> >>
> >> >
> >> >> >
> >> >> >>
> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >
> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> >> (offload) is then void.
> >> >> >> >>
> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >> >
> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >> >>
> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >>
> >> >> >>
> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >>
> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> >> plan to handle the offload by:
> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> from p4tc ndo_setup_tc
> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> from tc-flower ndo_setup_tc
> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >>
> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> sense to me to be honest.
> >> >> >
> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >>
> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> opposed to from day 1.
> >> >>
> >> >>
> >> >
> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >
> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >>
> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> IIRC.
> >> >>
> >> >
> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >authenticated against what the hardware can accept, but the devlink
> >> >flash use i believe was from you as a compromise.
> >>
> >> Definitely not. I'm against devlink abuse for this from day 1.
> >>
> >>
> >> >
> >> >>
> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >>
> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> that :)
> >> >>
> >> >> >
> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >consumer perspective I would not want to deal with 4 different
> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >>
> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >>
> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> given the fact that the vendor compilers do not run in kernel.
> >> >> That is breaking your model.
> >> >>
> >> >
> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >once installed is static.
> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >say "here are my 14 tables and their associated actions and here's how
> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >not anywhere in the spec.
> >> >That doesnt mean it is not useful thing to have - but it is an
> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >implementation must support it is a bit out of scope and there are
> >> >vendors with hardware who support P4 today that dont need any of this.
> >>
> >> I'm not talking about the spec. I'm talking about the offload
> >> implementation, the offload compiler, the offload runtime manager. You
> >> don't have those in kernel. That is the issue. The runtime manager is
> >> the one to decide and reshuffle the hw internals. Again, this has
> >> nothing to do with p4 frontend. This is offload implementation.
> >>
> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> And if it is unoffloadable, do we really need it? IDK.
> >>
> >
> >Say what?
> >It's not offloadable in your hardware, you mean? Because i have beside
> >me here an intel e2000 which offloads just fine (and the AMD folks
> >seem fine too).
>
> Will Intel and AMD have compiler in kernel, so no blob transfer and
> parsing it in kernel would not be needed? No.
By that definition anything that parses anything is a compiler.
>
> >If your view is that all these runtime optimizations amount to a
> >compiler in the kernel/driver that is your, well, your view. In my
> >view (and others have said this to you already) the P4C compiler is
> >responsible for resource optimizations. The hardware supports P4, you
> >give it constraints and it knows what to do. At runtime, anything a
> >driver needs to do for resource optimization (resorting, reshuffling
> >etc), that is not a P4 problem - sorry if you have issues in your
> >architecture approach.
>
> Sure, it is the offload implementation problem. And for them, you need
> to use userspace components. And that is the problem. This discussion
> leads nowhere, I don't know how differently I should describe this.
Jiri - that's your view based on whatever design you have in your
mind. This has nothing to do with P4.
So let me repeat again:
1) A vendor's backend for P4 when it compiles ensures that resource
constraints are taken care of.
2) The same program can run in s/w.
3) It makes *ZERO* sense to mix vendor-specific constraint
optimization (what you described as resorting, reshuffling etc) as part
of P4TC or P4. Absolutely nothing to do with either. Write a
background task, specific to you, if you feel you need to move things
around at runtime.
We agree on one thing at least: This discussion is going nowhere.
cheers,
jamal
> >
> >> >In my opinion that is a feature that could be added later out of
> >> >necessity (there is some good niche value in being able to add some
> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >is needed.
> >> >It should be doable today in a brute force way (this is just one
> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >i am sure there are other approaches and the idea is by no means
> >> >proven.
> >> >
> >> >1) User space Creates/compiles/Adds/activate your program that has 14
> >> >tables at tc prio X chain Y
> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >table 3 after table 15, visited when metadata foo=5
> >> > b) your compiler in user space compiles a brand new program which
> >> >satisfies #2a (how this program was authored is out of scope of
> >> >discussion)
> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> > d) user space delete tc prio X chain Y (and make sure your packets
> >> >entry point is whatever #c is)
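
(To make #2c/#2d above concrete: the exact filter command for loading a
P4TC pipeline is not shown in this thread, so the sketch below uses
generic tc-bpf filters as a stand-in for "the program"; the only point
being illustrated is the add-at-prio-X+1-then-delete-prio-X swap, and
everything else - device, object file, section name - is a placeholder:)

  import subprocess

  DEV = "eth0"                      # placeholder device
  OLD_PRIO, NEW_PRIO = "10", "11"   # prio X and prio X+1

  def tc(*args):
      subprocess.run(["tc", *args], check=True)

  # 2b/2c) install the freshly compiled program next to the old one at prio X+1
  #        (tc-bpf here stands in for whatever loads the real pipeline)
  tc("filter", "add", "dev", DEV, "ingress", "prio", NEW_PRIO,
     "bpf", "da", "obj", "pipeline_v2.o", "sec", "p4prog")

  # 2d) remove the old program at prio X so traffic only hits the new one
  tc("filter", "del", "dev", DEV, "ingress", "prio", OLD_PRIO)
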
> >>
> >> I never suggested anything like what you describe. I'm not sure why you
> >> think so.
> >
> >It's the same class of problems - the paper i pointed to (coauthored
> >by Matty and others) has runtime resource optimizations which are
> >tantamount to changing the nature of the pipeline. We may need to
> >profile in the kernel but all those optimizations can be derived in
> >user space using the approach I described.
> >
> >cheers,
> >jamal
> >
> >
> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >
> >> >>
> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >through here ;->
> >> >> >
> >> >> >cheers,
> >> >> >jamal
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >cheers,
> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 13:22 ` Jamal Hadi Salim
@ 2023-11-23 13:34 ` Jiri Pirko
2023-11-23 13:45 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 13:34 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >>
>> >> >> >> >> [...]
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >
>> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >internal access of)
>> >> >> >> >> >
>> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >>
>> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> Example:
>> >> >> >> >>
>> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >>
>> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >What is not traditional about it?
>> >> >> >>
>> >> >> >> Okay, so in that case, the following example communicating with
>> >> >> >> a userspace daemon using an imaginary "p4ctrl" app is equally traditional:
>> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> action send_to_port param port eno1
>> >> >> >
>> >> >> >Huh? That's just an application - classical tc, part of iproute2,
>> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >daemon but it will use the same APIs as tc.
>> >> >>
>> >> >> Okay, so which part is the "tradition"?
>> >> >>
>> >> >
>> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >same expectations (eg events, request responses, familiar commands for
>> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >this up and operate it very quickly and would have an easier time
>> >> >debugging it.
>> >> >There are caveats - as will be with all new classifiers - but those
>> >> >are within reason.
>> >>
>> >> Okay, so syntax familiarity wise, what's the difference between
>> >> following 2 approaches:
>> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> action send_to_port param port eno1
>> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> action send_to_port param port eno1
>> >> ?
>> >>
>> >>
>> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >
>> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> out earlier.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >netlink interfaces.
>> >> >> >> >>
>> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
>> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> replace the backend with a vendor-specific library, which allows p4 offload
>> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >>
>> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >
>> >> >> >Right Jiri ;->
>> >> >> >
>> >> >> >> This
>> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> good fit for everyone.
>> >> >> >
>> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >datapath pipeline after it is created and we are discussing a P4
>> >> >>
>> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >>
>> >> >
>> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >
>> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >more than happy to add extensions in the future to accommodate for
>> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >can be from user space today for all those optimizations.
>> >> >> >
>> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >Everybody is happy.
>> >> >>
>> >> >> Should the userspace daemon listen on inserted rules to be offloaded
>> >> >> over netlink?
>> >> >>
>> >> >
>> >> >I mean you could if you wanted to given this is just traditional
>> >> >netlink which emits events (with some filtering when we integrate the
>> >> >filter approach). But why?
>> >>
>> >> Nevermind.
>> >>
>> >>
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >
>> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
>> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> >> (offload) is then void.
>> >> >> >> >>
>> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >> >
>> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >> >>
>> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >>
>> >> >> >>
>> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >>
>> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> plan to handle the offload by:
>> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> from p4tc ndo_setup_tc
>> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> from tc-flower ndo_setup_tc
>> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >>
>> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> sense to me to be honest.
>> >> >> >
>> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >>
>> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> opposed to from day 1.
>> >> >>
>> >> >>
>> >> >
>> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >
>> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >>
>> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> IIRC.
>> >> >>
>> >> >
>> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >authenticated against what the hardware can accept, but the devlink
>> >> >flash use i believe was from you as a compromise.
>> >>
>> >> Definitely not. I'm against devlink abuse for this from day 1.
>> >>
>> >>
>> >> >
>> >> >>
>> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >>
>> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> that :)
>> >> >>
>> >> >> >
>> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >consumer perspective I would not want to deal with 4 different
>> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >>
>> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >>
>> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> given the fact that the vendor compilers do not run in kernel.
>> >> >> That is breaking your model.
>> >> >>
>> >> >
>> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >once installed is static.
>> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >not anywhere in the spec.
>> >> >That doesnt mean it is not useful thing to have - but it is an
>> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >implementation must support it is a bit out of scope and there are
>> >> >vendors with hardware who support P4 today that dont need any of this.
>> >>
>> >> I'm not talking about the spec. I'm talking about the offload
>> >> implementation, the offload compiler, the offload runtime manager. You
>> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> nothing to do with p4 frontend. This is offload implementation.
>> >>
>> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> And if it is unoffloadable, do we really need it? IDK.
>> >>
>> >
>> >Say what?
>> >It's not offloadable in your hardware, you mean? Because i have beside
>> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >seem fine too).
>>
>> Will Intel and AMD have compiler in kernel, so no blob transfer and
>> parsing it in kernel would not be needed? No.
>
>By that definition anything that parses anything is a compiler.
>
>>
>> >If your view is that all these runtime optimizations amount to a
>> >compiler in the kernel/driver that is your, well, your view. In my
>> >view (and others have said this to you already) the P4C compiler is
>> >responsible for resource optimizations. The hardware supports P4, you
>> >give it constraints and it knows what to do. At runtime, anything a
>> >driver needs to do for resource optimization (resorting, reshuffling
>> >etc), that is not a P4 problem - sorry if you have issues in your
>> >architecture approach.
>>
>> Sure, it is the offload implementation problem. And for them, you need
>> to use userspace components. And that is the problem. This discussion
>> leads nowhere, I don't know how differently I should describe this.
>
>Jiri - that's your view based on whatever design you have in your
>mind. This has nothing to do with P4.
>So let me repeat again:
>1) A vendor's backend for P4 when it compiles ensures that resource
>constraints are taken care of.
>2) The same program can run in s/w.
>3) It makes *ZERO* sense to mix vendor-specific constraint
>optimization (what you described as resorting, reshuffling etc) as part
>of P4TC or P4. Absolutely nothing to do with either. Write a
I never suggested for it to be part of P4TC or P4. I don't know why you
think so.
>background task, specific to you, if you feel you need to move things
>around at runtime.
Yeah, that background task is in userspace.
>
>We agree on one thing at least: This discussion is going nowhere.
Correct.
>
>cheers,
>jamal
>
>
>
>> >
>> >> >In my opinion that is a feature that could be added later out of
>> >> >necessity (there is some good niche value in being able to add some
>> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >is needed.
>> >> >It should be doable today in a brute force way (this is just one
>> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >i am sure there are other approaches and the idea is by no means
>> >> >proven.
>> >> >
>> >> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >> >tables at tc prio X chain Y
>> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >table 3 after table 15, visited when metadata foo=5
>> >> > b) your compiler in user space compiles a brand new program which
>> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >discussion)
>> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> > d) user space delete tc prio X chain Y (and make sure your packets
>> >> >entry point is whatever #c is)
>> >>
>> >> I never suggested anything like what you describe. I'm not sure why you
>> >> think so.
>> >
>> >It's the same class of problems - the paper i pointed to (coauthored
>> >by Matty and others) has runtime resource optimizations which are
>> >tantamount to changing the nature of the pipeline. We may need to
>> >profile in the kernel but all those optimizations can be derived in
>> >user space using the approach I described.
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >
>> >> >>
>> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >through here ;->
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 13:34 ` Jiri Pirko
@ 2023-11-23 13:45 ` Jamal Hadi Salim
2023-11-23 14:07 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-23 13:45 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >>
> >> >> >> >> >> [...]
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >> >
> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
> >> >> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >> >internal access of)
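
(For anyone not familiar with shared blocks: that is the existing tc
block mechanism, nothing added by this series - attach one ingress
block to a group of ports and every filter added to the block then
applies to all of them. Device names and the block index below are
made up.)

  tc qdisc add dev eth0 ingress_block 22 ingress
  tc qdisc add dev eth1 ingress_block 22 ingress
  tc filter add block 22 protocol ip prio 10 flower ip_proto tcp action drop
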
> >> >> >> >> >> >
> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >> >>
> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> >> Example:
> >> >> >> >> >>
> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> >>
> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >What is not traditional about it?
> >> >> >> >>
> >> >> >> >> Okay, so in that case, the following example communicating with
> >> >> >> >> a userspace daemon using an imaginary "p4ctrl" app is equally traditional:
> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> action send_to_port param port eno1
> >> >> >> >
> >> >> >> >Huh? That's just an application - classical tc, part of iproute2,
> >> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >> >daemon but it will use the same APIs as tc.
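
(Concretely, such a daemon or library is just a thin layer over the
same channel. A throwaway sketch that wraps the p4ctrl syntax quoted
earlier in this thread - a real library would speak netlink directly
instead of exec'ing tc, but the objects and verbs stay the same:)

  import subprocess

  def add_table_entry(prog, table, match, action, params):
      # builds exactly the command shown earlier:
      #   tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
      #      action send_to_port param port eno1
      cmd = ["tc", "p4ctrl", "create", "%s/table/%s" % (prog, table)]
      for k, v in match.items():
          cmd += [k, v]
      cmd += ["action", action]
      for k, v in params.items():
          cmd += ["param", k, v]
      subprocess.run(cmd, check=True)

  add_table_entry("myprog", "mytable",
                  {"dstAddr": "10.0.1.2/32"},
                  "send_to_port", {"port": "eno1"})
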
> >> >> >>
> >> >> >> Okay, so which part is the "tradition"?
> >> >> >>
> >> >> >
> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >> >same expectations (eg events, request responses, familiar commands for
> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >> >this up and operate it very quickly and would have an easier time
> >> >> >debugging it.
> >> >> >There are caveats - as will be with all new classifiers - but those
> >> >> >are within reason.
> >> >>
> >> >> Okay, so syntax familiarity wise, what's the difference between
> >> >> following 2 approaches:
> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> action send_to_port param port eno1
> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> action send_to_port param port eno1
> >> >> ?
> >> >>
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >> >
> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> >> out earlier.
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >> >netlink interfaces.
> >> >> >> >> >>
> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> >> frontend is pretty much aligned with what I claimed on the p4 calls a couple of
> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> >> replace the backend with a vendor-specific library, which allows p4 offload
> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >> >>
> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >> >
> >> >> >> >Right Jiri ;->
> >> >> >> >
> >> >> >> >> This
> >> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> >> good fit for everyone.
> >> >> >> >
> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >> >datapath pipeline after it is created and we are discussing a P4
> >> >> >>
> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> >> during rule insertion/remove in order to optimize the layout.
> >> >> >>
> >> >> >
> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >> >
> >> >> >> >implementation not an extension that would add more value b) We are
> >> >> >> >more than happy to add extensions in the future to accommodate for
> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >> >can be from user space today for all those optimizations.
> >> >> >> >
> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >> >Everybody is happy.
> >> >> >>
> >> >> >> Should the userspace daemon listen on inserted rules to be offloaded
> >> >> >> over netlink?
> >> >> >>
> >> >> >
> >> >> >I mean you could if you wanted to given this is just traditional
> >> >> >netlink which emits events (with some filtering when we integrate the
> >> >> >filter approach). But why?
> >> >>
> >> >> Nevermind.
> >> >>
> >> >>
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >> >
> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> >> >> (offload) is then void.
> >> >> >> >> >>
> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >> >> >
> >> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >> >> >>
> >> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >> >>
> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> >> >> plan to handle the offload by:
> >> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> from p4tc ndo_setup_tc
> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> from tc-flower ndo_setup_tc
> >> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >> >>
> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> >> sense to me to be honest.
> >> >> >> >
> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >> >>
> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> >> opposed to from day 1.
> >> >> >>
> >> >> >>
> >> >> >
> >> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >> >
> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >> >>
> >> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> >> IIRC.
> >> >> >>
> >> >> >
> >> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >> >authenticated against what the hardware can accept, but the devlink
> >> >> >flash use i believe was from you as a compromise.
> >> >>
> >> >> Definitely not. I'm against devlink abuse for this from day 1.
> >> >>
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >> >>
> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> >> that :)
> >> >> >>
> >> >> >> >
> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >> >consumer perspective I would not want to deal with 4 different
> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >> >>
> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >> >>
> >> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> >> given the fact that the vendor compilers do not run in kernel.
> >> >> >> That is breaking your model.
> >> >> >>
> >> >> >
> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >> >once installed is static.
> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >> >say "here are my 14 tables and their associated actions and here's how
> >> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >> >not anywhere in the spec.
> >> >> >That doesnt mean it is not useful thing to have - but it is an
> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >> >implementation must support it is a bit out of scope and there are
> >> >> >vendors with hardware who support P4 today that dont need any of this.
> >> >>
> >> >> I'm not talking about the spec. I'm talking about the offload
> >> >> implementation, the offload compiler, the offload runtime manager. You
> >> >> don't have those in kernel. That is the issue. The runtime manager is
> >> >> the one to decide and reshuffle the hw internals. Again, this has
> >> >> nothing to do with p4 frontend. This is offload implementation.
> >> >>
> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> >> And if it is unoffloadable, do we really need it? IDK.
> >> >>
> >> >
> >> >Say what?
> >> >It's not offloadable in your hardware, you mean? Because i have beside
> >> >me here an intel e2000 which offloads just fine (and the AMD folks
> >> >seem fine too).
> >>
> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
> >> parsing it in kernel would not be needed? No.
> >
> >By that definition anything that parses anything is a compiler.
> >
> >>
> >> >If your view is that all these runtime optimizations amount to a
> >> >compiler in the kernel/driver that is your, well, your view. In my
> >> >view (and others have said this to you already) the P4C compiler is
> >> >responsible for resource optimizations. The hardware supports P4, you
> >> >give it constraints and it knows what to do. At runtime, anything a
> >> >driver needs to do for resource optimization (resorting, reshuffling
> >> >etc), that is not a P4 problem - sorry if you have issues in your
> >> >architecture approach.
> >>
> >> Sure, it is the offload implementation problem. And for them, you need
> >> to use userspace components. And that is the problem. This discussion
> >> leads nowhere, I don't know how differently I should describe this.
> >
> >Jiri - that's your view based on whatever design you have in your
> >mind. This has nothing to do with P4.
> >So let me repeat again:
> >1) A vendor's backend for P4 when it compiles ensures that resource
> >constraints are taken care of.
> >2) The same program can run in s/w.
> >3) It makes *ZERO* sense to mix vendor-specific constraint
> >optimization (what you described as resorting, reshuffling etc) as part
> >of P4TC or P4. Absolutely nothing to do with either. Write a
>
> I never suggested for it to be part of P4TC or P4. I don't know why you
> think so.
I guess because this discussion is about P4/P4TC? I may have misread
what you are saying then because I saw the "P4TC must be in
userspace" mantra tied to this specific optimization requirement.
>
> >background task, specific to you, if you feel you need to move things
> >around at runtime.
>
> Yeah, that background task is in userspace.
>
I don't have a horse in this race.
cheers,
jamal
>
> >
> >We agree on one thing at least: This discussion is going nowhere.
>
> Correct.
>
> >
> >cheers,
> >jamal
> >
> >
> >
> >> >
> >> >> >In my opinion that is a feature that could be added later out of
> >> >> >necessity (there is some good niche value in being able to add some
> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >> >is needed.
> >> >> >It should be doable today in a brute force way (this is just one
> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >> >i am sure there are other approaches and the idea is by no means
> >> >> >proven.
> >> >> >
> >> >> >1) User space Creates/compiles/Adds/activate your program that has 14
> >> >> >tables at tc prio X chain Y
> >> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >> >table 3 after table 15, visited when metadata foo=5
> >> >> > b) your compiler in user space compiles a brand new program which
> >> >> >satisfies #2a (how this program was authored is out of scope of
> >> >> >discussion)
> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >> > d) user space delete tc prio X chain Y (and make sure your packets
> >> >> >entry point is whatever #c is)
> >> >>
> >> >> I never suggested anything like what you describe. I'm not sure why you
> >> >> think so.
> >> >
> >> >It's the same class of problems - the paper i pointed to (coauthored
> >> >by Matty and others) has runtime resource optimizations which are
> >> >tantamount to changing the nature of the pipeline. We may need to
> >> >profile in the kernel but all those optimizations can be derived in
> >> >user space using the approach I described.
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >
> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >> >
> >> >> >>
> >> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >> >through here ;->
> >> >> >> >
> >> >> >> >cheers,
> >> >> >> >jamal
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >cheers,
> >> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 13:45 ` Jamal Hadi Salim
@ 2023-11-23 14:07 ` Jiri Pirko
2023-11-23 14:28 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 14:07 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> [...]
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >
>> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >
>> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >>
>> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> Example:
>> >> >> >> >> >>
>> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> >>
>> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >What is not traditional about it?
>> >> >> >> >>
>> >> >> >> >> Okay, so in that case, the following example communicating with
>> >> >> >> >> a userspace daemon using an imaginary "p4ctrl" app is equally traditional:
>> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >
>> >> >> >> >Huh? That's just an application - classical tc, part of iproute2,
>> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >>
>> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >>
>> >> >> >
>> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >debugging it.
>> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >are within reason.
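As a concrete illustration of "same expectations": runtime interaction
follows the usual tc verb-and-event pattern. Only the create line below is
taken from the example quoted in this thread; the get/delete and
monitoring spellings are assumed for illustration:

  tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
     action send_to_port param port eno1
  tc p4ctrl get myprog/table/mytable dstAddr 10.0.1.2/32
  tc p4ctrl delete myprog/table/mytable dstAddr 10.0.1.2/32
  # dumping or flushing a whole table works on the table path itself
  tc p4ctrl get myprog/table/mytable
  tc p4ctrl delete myprog/table/mytable
  # table events arrive as netlink notifications, tc-monitor style
  tc monitor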
>> >> >>
>> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> following 2 approaches:
>> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> action send_to_port param port eno1
>> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> action send_to_port param port eno1
>> >> >> ?
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >>
>> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >> >>
>> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >
>> >> >> >> >Right Jiri ;->
>> >> >> >> >
>> >> >> >> >> This
>> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> >> good fit for everyone.
>> >> >> >> >
>> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >datapath pipeline after it is create and we are discussing a P4
>> >> >> >>
>> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >> >>
>> >> >> >
>> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >
>> >> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >> >more than happy to add extensions in the future to accomodate for
>> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >> >can be from user space today for all those optimizations.
>> >> >> >> >
>> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >> >Everybody is happy.
>> >> >> >>
>> >> >> >> Should the userspace daemon listen on inserted rules to be offloade
>> >> >> >> over netlink?
>> >> >> >>
>> >> >> >
>> >> >> >I mean you could if you wanted to given this is just traditional
>> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >filter approach). But why?
>> >> >>
>> >> >> Nevermind.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >> >
>> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >>
>> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >> >> >
>> >> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >> >> >>
>> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >> >>
>> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> from p4tc ndo_setup_tc
>> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> from tc-flower ndo_setup_tc
>> >> >> >> >> is really something that is making me a little bit nauseous.
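For context, the "flash" step objected to in #1 and #3 is the generic
devlink firmware-flash interface, i.e. something along these lines, with
a made-up device and file name:

  devlink dev flash pci/0000:03:00.0 file myprog_pipeline.bin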
>> >> >> >> >>
>> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> sense to me to be honest.
>> >> >> >> >
>> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >> >>
>> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> opposed to from day 1.
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >
>> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >>
>> >> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> >> IIRC.
>> >> >> >>
>> >> >> >
>> >> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >flash use i believe was from you as a compromise.
>> >> >>
>> >> >> Definitelly not. I'm against devlink abuse for this from day 1.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >> >>
>> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> >> that :)
>> >> >> >>
>> >> >> >> >
>> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >> >consumer perspective I would not want to deal with 4 different
>> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >> >>
>> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >>
>> >> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> >> given the fact that the vendor compilers does not run in kernel.
>> >> >> >> That is breaking your model.
>> >> >> >>
>> >> >> >
>> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >> >once installed is static.
>> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >not anywhere in the spec.
>> >> >> >That doesnt mean it is not useful thing to have - but it is an
>> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >implementation must support it is a bit out of scope and there are
>> >> >> >vendors with hardware who support P4 today that dont need any of this.
>> >> >>
>> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> implemetation, the offload compiler the offload runtime manager. You
>> >> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> >> nothing to do with p4 frontend. This is offload implementation.
>> >> >>
>> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >>
>> >> >
>> >> >Say what?
>> >> >It's not offloadable in your hardware, you mean? Because i have beside
>> >> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >> >seem fine too).
>> >>
>> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
>> >> parsing it in kernel wound not be needed? No.
>> >
>> >By that definition anything that parses anything is a compiler.
>> >
>> >>
>> >> >If your view is that all these runtime optimization surmount to a
>> >> >compiler in the kernel/driver that is your, well, your view. In my
>> >> >view (and others have said this to you already) the P4C compiler is
>> >> >responsible for resource optimizations. The hardware supports P4, you
>> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >etc), that is not a P4 problem - sorry if you have issues in your
>> >> >architecture approach.
>> >>
>> >> Sure, it is the offload implementation problem. And for them, you need
>> >> to use userspace components. And that is the problem. This discussion
>> >> leads nowhere, I don't know how differently I should describe this.
>> >
>> >Jiri's - that's your view based on whatever design you have in your
>> >mind. This has nothing to do with P4.
>> >So let me repeat again:
>> >1) A vendor's backend for P4 when it compiles ensures that resource
>> >constraints are taken care of.
>> >2) The same program can run in s/w.
>> >3) It makes *ZERO* sense to mix vendor specific constraint
>> >optimization(what you described as resorting, reshuffling etc) as part
>> >of P4TC or P4. Absolutely nothing to do with either. Write a
>>
>> I never suggested for it to be part of P4tc of P4. I don't know why you
>> think so.
>
>I guess because this discussion is about P4/P4TC? I may have misread
>what you are saying then because I saw the "P4TC must be in
>userspace" mantra tied to this specific optimization requirement.
Yeah, and again, my point is, this is unoffloadable. Do we still
need it in kernel?
>
>>
>> >background task, specific to you, if you feel you need to move things
>> >around at runtime.
>>
>> Yeah, that backgroud task is in userspace.
>>
>
>I don't have a horse in this race.
>
>cheers,
>jamal
>
>>
>> >
>> >We agree on one thing at least: This discussion is going nowhere.
>>
>> Correct.
>>
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >
>> >> >
>> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >> >is needed.
>> >> >> >It should be doable today in a brute force way (this is just one
>> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >> >i am sure there are other approaches and the idea is by no means
>> >> >> >proven.
>> >> >> >
>> >> >> >1) User space Creates/compiles/Adds/activate your program that has 14
>> >> >> >tables at tc prio X chain Y
>> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >table 3 after table 15, visited when metadata foo=5
>> >> >> > b) your compiler in user space compiles a brand new program which
>> >> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >> >discussion)
>> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> > d) user space delete tc prio X chain Y (and make sure your packets
>> >> >> >entry point is whatever #c is)
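Spelled out with plain tc commands, steps (c) and (d) above are just a
filter add at a new priority followed by a delete of the old one; the
device name, priorities and p4 classifier arguments below are made up for
illustration:

  # (c) instantiate the new program, now with table 15, alongside the old
  tc filter add dev eth0 ingress prio 11 p4 pname myprog_v2 \
     action bpf obj myprog_v2.o
  # (d) remove the old pipeline at prio 10 once the new one is active
  tc filter del dev eth0 ingress prio 10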
>> >> >>
>> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> think so.
>> >> >
>> >> >It's the same class of problems - the paper i pointed to (coauthored
>> >> >by Matty and others) has runtime resource optimizations which are
>> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >profile in the kernel but all those optimizations can be derived in
>> >> >user space using the approach I described.
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >
>> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >
>> >> >> >>
>> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >through here ;->
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >cheers,
>> >> >> >> >> >jamal
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 14:07 ` Jiri Pirko
@ 2023-11-23 14:28 ` Jamal Hadi Salim
2023-11-23 15:27 ` Jiri Pirko
0 siblings, 1 reply; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-23 14:28 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> [...]
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
> >> >> >> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >> >> >internal access of)
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >> >> >>
> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> >> >> Example:
> >> >> >> >> >> >>
> >> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> >> >>
> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >What is not traditional about it?
> >> >> >> >> >>
> >> >> >> >> >> Okay, so in that case, the following example communitating with
> >> >> >> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
> >> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> >
> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >> >> >daemon but it will use the same APIs as tc.
> >> >> >> >>
> >> >> >> >> Okay, so which part is the "tradition"?
> >> >> >> >>
> >> >> >> >
> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >> >> >same expectations (eg events, request responses, familiar commands for
> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >> >> >this up and operate it very quickly and would have an easier time
> >> >> >> >debugging it.
> >> >> >> >There are caveats - as will be with all new classifiers - but those
> >> >> >> >are within reason.
> >> >> >>
> >> >> >> Okay, so syntax familiarity wise, what's the difference between
> >> >> >> following 2 approaches:
> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> action send_to_port param port eno1
> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> action send_to_port param port eno1
> >> >> >> ?
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> >> >> out earlier.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >> >> >netlink interfaces.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >> >> >>
> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >> >> >
> >> >> >> >> >Right Jiri ;->
> >> >> >> >> >
> >> >> >> >> >> This
> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> >> >> good fit for everyone.
> >> >> >> >> >
> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >> >> >datapath pipeline after it is create and we are discussing a P4
> >> >> >> >>
> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> >> >> during rule insertion/remove in order to optimize the layout.
> >> >> >> >>
> >> >> >> >
> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >> >> >
> >> >> >> >> >implementation not an extension that would add more value b) We are
> >> >> >> >> >more than happy to add extensions in the future to accomodate for
> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >> >> >can be from user space today for all those optimizations.
> >> >> >> >> >
> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >> >> >Everybody is happy.
> >> >> >> >>
> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloade
> >> >> >> >> over netlink?
> >> >> >> >>
> >> >> >> >
> >> >> >> >I mean you could if you wanted to given this is just traditional
> >> >> >> >netlink which emits events (with some filtering when we integrate the
> >> >> >> >filter approach). But why?
> >> >> >>
> >> >> >> Nevermind.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >> >> >
> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
> >> >> >> >> >> >> (offload) is then void.
> >> >> >> >> >> >>
> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >> >> >> >
> >> >> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
> >> >> >> >> >>
> >> >> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >> >> >>
> >> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> >> >> >> plan to handle the offload by:
> >> >> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >> from p4tc ndo_setup_tc
> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >> from tc-flower ndo_setup_tc
> >> >> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >> >> >>
> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> >> >> sense to me to be honest.
> >> >> >> >> >
> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >> >> >>
> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> >> >> opposed to from day 1.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >> >> >
> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >> >> >>
> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> >> >> IIRC.
> >> >> >> >>
> >> >> >> >
> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >> >> >authenticated against what the hardware can accept, but the devlink
> >> >> >> >flash use i believe was from you as a compromise.
> >> >> >>
> >> >> >> Definitelly not. I'm against devlink abuse for this from day 1.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >> >> >>
> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> >> >> that :)
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >> >> >consumer perspective I would not want to deal with 4 different
> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >> >> >>
> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >> >> >>
> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> >> >> given the fact that the vendor compilers does not run in kernel.
> >> >> >> >> That is breaking your model.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >> >> >once installed is static.
> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >> >> >say "here are my 14 tables and their associated actions and here's how
> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >> >> >not anywhere in the spec.
> >> >> >> >That doesnt mean it is not useful thing to have - but it is an
> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >> >> >implementation must support it is a bit out of scope and there are
> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
> >> >> >>
> >> >> >> I'm not talking about the spec. I'm talking about the offload
> >> >> >> implemetation, the offload compiler the offload runtime manager. You
> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
> >> >> >> nothing to do with p4 frontend. This is offload implementation.
> >> >> >>
> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> >> >> And if it is unoffloadable, do we really need it? IDK.
> >> >> >>
> >> >> >
> >> >> >Say what?
> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
> >> >> >seem fine too).
> >> >>
> >> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
> >> >> parsing it in kernel wound not be needed? No.
> >> >
> >> >By that definition anything that parses anything is a compiler.
> >> >
> >> >>
> >> >> >If your view is that all these runtime optimization surmount to a
> >> >> >compiler in the kernel/driver that is your, well, your view. In my
> >> >> >view (and others have said this to you already) the P4C compiler is
> >> >> >responsible for resource optimizations. The hardware supports P4, you
> >> >> >give it constraints and it knows what to do. At runtime, anything a
> >> >> >driver needs to do for resource optimization (resorting, reshuffling
> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
> >> >> >architecture approach.
> >> >>
> >> >> Sure, it is the offload implementation problem. And for them, you need
> >> >> to use userspace components. And that is the problem. This discussion
> >> >> leads nowhere, I don't know how differently I should describe this.
> >> >
> >> >Jiri's - that's your view based on whatever design you have in your
> >> >mind. This has nothing to do with P4.
> >> >So let me repeat again:
> >> >1) A vendor's backend for P4 when it compiles ensures that resource
> >> >constraints are taken care of.
> >> >2) The same program can run in s/w.
> >> >3) It makes *ZERO* sense to mix vendor specific constraint
> >> >optimization(what you described as resorting, reshuffling etc) as part
> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
> >>
> >> I never suggested for it to be part of P4tc of P4. I don't know why you
> >> think so.
> >
> >I guess because this discussion is about P4/P4TC? I may have misread
> >what you are saying then because I saw the "P4TC must be in
> >userspace" mantra tied to this specific optimization requirement.
>
> Yeah, and again, my point is, this is unoffloadable.
Here we go again with this weird claim. I guess we need to give an
award to the other vendors for doing the "impossible"?
>Do we still need it in kernel?
Didn't you just say it has nothing to do with P4TC?
You "It can't be offloaded".
Me "it can be offloaded, other vendors are doing it and it has nothing
to do with P4 or P4TC and here's why..."
You "i didn't say it has anything to do with P4 or P4TC"
Me "ok, i misunderstood - i thought you said P4 can't be offloaded via
P4TC and has to be done in user space"
You "It can't be offloaded"
Circular non-ending discussion.
Then there's John
John "ebpf, ebpf, ebpf"
Me "we gave you ebpf"
John "but you are not using ebpf system call"
Me " but it doesnt make sense for the following reasons..."
John "but someone has already implemented ebpf.."
Me "yes, but here's how ..."
John "ebpf, ebpf, ebpf"
Another circular non-ending discussion.
Let's just end this electron-wasting lawyering discussion.
cheers,
jamal
Bizarre. Unoffloadable, according to you.
>
> >
> >>
> >> >background task, specific to you, if you feel you need to move things
> >> >around at runtime.
> >>
> >> Yeah, that backgroud task is in userspace.
> >>
> >
> >I don't have a horse in this race.
> >
> >cheers,
> >jamal
> >
> >>
> >> >
> >> >We agree on one thing at least: This discussion is going nowhere.
> >>
> >> Correct.
> >>
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >
> >> >
> >> >> >
> >> >> >> >In my opinion that is a feature that could be added later out of
> >> >> >> >necessity (there is some good niche value in being able to add some
> >> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >> >> >is needed.
> >> >> >> >It should be doable today in a brute force way (this is just one
> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >> >> >i am sure there are other approaches and the idea is by no means
> >> >> >> >proven.
> >> >> >> >
> >> >> >> >1) User space Creates/compiles/Adds/activate your program that has 14
> >> >> >> >tables at tc prio X chain Y
> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >> >> >table 3 after table 15, visited when metadata foo=5
> >> >> >> > b) your compiler in user space compiles a brand new program which
> >> >> >> >satisfies #2a (how this program was authored is out of scope of
> >> >> >> >discussion)
> >> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >> >> > d) user space delete tc prio X chain Y (and make sure your packets
> >> >> >> >entry point is whatever #c is)
> >> >> >>
> >> >> >> I never suggested anything like what you describe. I'm not sure why you
> >> >> >> think so.
> >> >> >
> >> >> >It's the same class of problems - the paper i pointed to (coauthored
> >> >> >by Matty and others) has runtime resource optimizations which are
> >> >> >tantamount to changing the nature of the pipeline. We may need to
> >> >> >profile in the kernel but all those optimizations can be derived in
> >> >> >user space using the approach I described.
> >> >> >
> >> >> >cheers,
> >> >> >jamal
> >> >> >
> >> >> >
> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >> >> >through here ;->
> >> >> >> >> >
> >> >> >> >> >cheers,
> >> >> >> >> >jamal
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >cheers,
> >> >> >> >> >> >jamal
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 14:28 ` Jamal Hadi Salim
@ 2023-11-23 15:27 ` Jiri Pirko
2023-11-23 16:30 ` Jamal Hadi Salim
0 siblings, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 15:27 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Thu, Nov 23, 2023 at 03:28:07PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> [...]
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to loose
>> >> >> >> >> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> >> Example:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >What is not traditional about it?
>> >> >> >> >> >>
>> >> >> >> >> >> Okay, so in that case, the following example communitating with
>> >> >> >> >> >> userspace deamon using imaginary "p4ctrl" app is equally traditional:
>> >> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> >
>> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >> >>
>> >> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >> >debugging it.
>> >> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >> >are within reason.
>> >> >> >>
>> >> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> >> following 2 approaches:
>> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> action send_to_port param port eno1
>> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> action send_to_port param port eno1
>> >> >> >> ?
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> >> replace the backed by vendor-specific library which allows p4 offload
>> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >> >> >>
>> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >> >
>> >> >> >> >> >Right Jiri ;->
>> >> >> >> >> >
>> >> >> >> >> >> This
>> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> >> >> good fit for everyone.
>> >> >> >> >> >
>> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >> >datapath pipeline after it is create and we are discussing a P4
>> >> >> >> >>
>> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> >> during rule insertion/remove in order to optimize the layout.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >> >
>> >> >> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >> >> >more than happy to add extensions in the future to accomodate for
>> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >> >> >can be from user space today for all those optimizations.
>> >> >> >> >> >
>> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >> >> >Everybody is happy.
>> >> >> >> >>
>> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloade
>> >> >> >> >> over netlink?
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >I mean you could if you wanted to given this is just traditional
>> >> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >> >filter approach). But why?
>> >> >> >>
>> >> >> >> Nevermind.
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> >> userspace and give vendors possiblity to have p4 backends with compilers,
>> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> >> >> and the main reason (I believe) why you need to have this is TC
>> >> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >> >> >> >
>> >> >> >> >> >> >You mean more fitting to the DOCA world? no, because iam a kernel
>> >> >> >> >> >>
>> >> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> from p4tc ndo_setup_tc
>> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> from tc-flower ndo_setup_tc
>> >> >> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >> >> >>
>> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> >> sense to me to be honest.
>> >> >> >> >> >
>> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >> >> >>
>> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> >> opposed to from day 1.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >> >
>> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >> >>
>> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> >> >> IIRC.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >> >flash use i believe was from you as a compromise.
>> >> >> >>
>> >> >> >> Definitelly not. I'm against devlink abuse for this from day 1.
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >> >> >>
>> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> >> >> that :)
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >> >> >consumer perspective I would not want to deal with 4 different
>> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >> >> >>
>> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >> >>
>> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> >> >> given the fact that the vendor compilers does not run in kernel.
>> >> >> >> >> That is breaking your model.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >> >> >once installed is static.
>> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >> >not anywhere in the spec.
>> >> >> >> >That doesnt mean it is not useful thing to have - but it is an
>> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >> >implementation must support it is a bit out of scope and there are
>> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
>> >> >> >>
>> >> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> >> implementation, the offload compiler, the offload runtime manager. You
>> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> >> >> nothing to do with p4 frontend. This is offload implementation.
>> >> >> >>
>> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >> >>
>> >> >> >
>> >> >> >Say what?
>> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
>> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >> >> >seem fine too).
>> >> >>
>> >> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
>> >> >> parsing it in kernel would not be needed? No.
>> >> >
>> >> >By that definition anything that parses anything is a compiler.
>> >> >
>> >> >>
>> >> >> >If your view is that all these runtime optimizations amount to a
>> >> >> >compiler in the kernel/driver that is your, well, your view. In my
>> >> >> >view (and others have said this to you already) the P4C compiler is
>> >> >> >responsible for resource optimizations. The hardware supports P4, you
>> >> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
>> >> >> >architecture approach.
>> >> >>
>> >> >> Sure, it is the offload implementation problem. And for them, you need
>> >> >> to use userspace components. And that is the problem. This discussion
>> >> >> leads nowhere, I don't know how differently I should describe this.
>> >> >
>> >> >Jiri - that's your view based on whatever design you have in your
>> >> >mind. This has nothing to do with P4.
>> >> >So let me repeat again:
>> >> >1) A vendor's backend for P4 when it compiles ensures that resource
>> >> >constraints are taken care of.
>> >> >2) The same program can run in s/w.
>> >> >3) It makes *ZERO* sense to mix vendor specific constraint
>> >> >optimization(what you described as resorting, reshuffling etc) as part
>> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
>> >>
>> >> I never suggested for it to be part of P4TC or P4. I don't know why you
>> >> think so.
>> >
>> >I guess because this discussion is about P4/P4TC? I may have misread
>> >what you are saying then because I saw the "P4TC must be in
>> >userspace" mantra tied to this specific optimization requirement.
>>
>> Yeah, and again, my point is, this is unoffloadable.
>
>Here we go again with this weird claim. I guess we need to give an
>award to the other vendors for doing the "impossible"?
Having the compiler in the kernel would be awesome: a clear offload
from kernel to device.
That's not the case. Trampolines, binary blob parsing in the kernel doing
the match with tc structures in drivers, abuse of devlink flash,
tc-flower offload using this facility. All this was already seriously
discussed before p4tc was even merged. Great, love that.
>
>>Do we still need it in kernel?
>
>Didnt you just say it has nothing to do with P4TC?
>
>You "It cant be offloaded".
>Me "it can be offloaded, other vendors are doing it and it has nothing
>to do with P4 or P4TC and here's why..."
>You " i didnt say it has anything to do with P4 or P4TC"
>Me "ok i misunderstood i thought you said P4 cant be offloaded via
>P4TC and has to be done in user space"
>You "It cant be offloaded"
Let me do my own misinterpretation please.
>
>Circular non-ending discussion.
>
>Then there's John
>John "ebpf, ebpf, ebpf"
>Me "we gave you ebpf"
>John "but you are not using ebpf system call"
>Me " but it doesnt make sense for the following reasons..."
>John "but someone has already implemented ebpf.."
>Me "yes, but here's how ..."
>John "ebpf, ebpf, ebpf"
>
>Another circular non-ending discussion.
>
>Let's just end this electron-wasting lawyering discussion.
>
>cheers,
>jamal
>
>
>
>
>
>
>Bizarre. Unoffloadable according to you.
>
>>
>> >
>> >>
>> >> >background task, specific to you, if you feel you need to move things
>> >> >around at runtime.
>> >>
>> >> Yeah, that background task is in userspace.
>> >>
>> >
>> >I don't have a horse in this race.
>> >
>> >cheers,
>> >jamal
>> >
>> >>
>> >> >
>> >> >We agree on one thing at least: This discussion is going nowhere.
>> >>
>> >> Correct.
>> >>
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >
>> >> >
>> >> >> >
>> >> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
>> >> >> >> >is needed.
>> >> >> >> >It should be doable today in a brute force way (this is just one
>> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >> >> >i am sure there are other approaches and the idea is by no means
>> >> >> >> >proven.
>> >> >> >> >
>> >> >> >> >1) User space creates/compiles/adds/activates your program that has 14
>> >> >> >> >tables at tc prio X chain Y
>> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >> >table 15 after table 3, visited when metadata foo=5
>> >> >> >> > b) your compiler in user space compiles a brand new program which
>> >> >> >> >satisfies #2a (how this program was authored is out of scope of
>> >> >> >> >discussion)
>> >> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> >> > d) user space deletes tc prio X chain Y (and make sure your packets
>> >> >> >> >entry point is whatever #c is)
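A rough sketch of steps c) and d) above, using nothing more than ordinary tc
filter chain/prio juggling; the device name, priorities, chain number and the
bpf object/section below are made up, and the stock tc bpf classifier merely
stands in for the actual P4TC program-load command, which takes extra
arguments:

  # c) attach the recompiled pipeline at a higher priority on the same hook
  tc filter add dev eth0 ingress chain 0 prio 11 bpf da obj myprog_v2.o sec classifier
  # d) once the new program is serving traffic, remove the old one
  tc filter del dev eth0 ingress chain 0 prio 10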
>> >> >> >>
>> >> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> >> think so.
>> >> >> >
>> >> >> >It's the same class of problems - the paper i pointed to (coauthored
>> >> >> >by Matty and others) has runtime resource optimizations which are
>> >> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >> >profile in the kernel but all those optimizations can be derived in
>> >> >> >user space using the approach I described.
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >
>> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >> >through here ;->
>> >> >> >> >> >
>> >> >> >> >> >cheers,
>> >> >> >> >> >jamal
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 15:27 ` Jiri Pirko
@ 2023-11-23 16:30 ` Jamal Hadi Salim
2023-11-23 17:53 ` Edward Cree
2023-11-23 18:04 ` Jiri Pirko
0 siblings, 2 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-23 16:30 UTC (permalink / raw)
To: Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On Thu, Nov 23, 2023 at 10:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 03:28:07PM CET, jhs@mojatatu.com wrote:
> >On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
> >> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >>
> >> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
> >> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >>
> >> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
> >> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >>
> >> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
> >> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
> >> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> [...]
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
> >> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
> >> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
> >> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
> >> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
> >> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
> >> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
> >> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
> >> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
> >> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
> >> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
> >> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
> >> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
> >> >> >> >> >> >> >> >functionality though. off top of my head, some sample space:
> >> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
> >> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
> >> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
> >> >> >> >> >> >> >> >encoded in the tc action codes)
> >> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
> >> >> >> >> >> >> >> >internal access of)
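As a concrete reference for the tc block point, several ports can be bound to
one shared block and a rule added once for the whole group (interface names
and the block index are illustrative):

  tc qdisc add dev eth0 ingress_block 22 ingress
  tc qdisc add dev eth1 ingress_block 22 ingress
  # one rule, seen by every port bound to block 22
  tc filter add block 22 protocol ip flower dst_ip 10.0.1.2 action drop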
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
> >> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
> >> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
> >> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
> >> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
> >> >> >> >> >> >> >> Example:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
> >> >> >> >> >> >> >> that is the case, what's traditional here?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >What is not traditional about it?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Okay, so in that case, the following example communicating with a
> >> >> >> >> >> >> userspace daemon using an imaginary "p4ctrl" app is equally traditional:
> >> >> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> >> >
> >> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
> >> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
> >> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
> >> >> >> >> >> >daemon but it will use the same APIs as tc.
> >> >> >> >> >>
> >> >> >> >> >> Okay, so which part is the "tradition"?
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
> >> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
> >> >> >> >> >same expectations (eg events, request responses, familiar commands for
> >> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
> >> >> >> >> >this up and operate it very quickly and would have an easier time
> >> >> >> >> >debugging it.
> >> >> >> >> >There are caveats - as will be with all new classifiers - but those
> >> >> >> >> >are within reason.
> >> >> >> >>
> >> >> >> >> Okay, so syntax familiarity wise, what's the difference between
> >> >> >> >> following 2 approaches:
> >> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
> >> >> >> >> action send_to_port param port eno1
> >> >> >> >> ?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
> >> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
> >> >> >> >> >> >> >> >what our original u32/pedit code offered.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
> >> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
> >> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
> >> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
> >> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
> >> >> >> >> >> >> >> >> out earlier.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
> >> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
> >> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
> >> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
> >> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
> >> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
> >> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
> >> >> >> >> >> >> >> >netlink interfaces.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
> >> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
> >> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
> >> >> >> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
> >> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
> >> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
> >> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
> >> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
> >> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
> >> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
> >> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
> >> >> >> >> >> >>
> >> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
> >> >> >> >> >> >
> >> >> >> >> >> >Right Jiri ;->
> >> >> >> >> >> >
> >> >> >> >> >> >> This
> >> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
> >> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
> >> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
> >> >> >> >> >> >> good fit for everyone.
> >> >> >> >> >> >
> >> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
> >> >> >> >> >> >datapath pipeline after it is created and we are discussing a P4
> >> >> >> >> >>
> >> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
> >> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
> >> >> >> >> >> during rule insertion/removal in order to optimize the layout.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
> >> >> >> >> >
> >> >> >> >> >> >implementation not an extension that would add more value b) We are
> >> >> >> >> >> >more than happy to add extensions in the future to accommodate for
> >> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
> >> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
> >> >> >> >> >> >which you probably didnt attend and everything that needs to be done
> >> >> >> >> >> >can be from user space today for all those optimizations.
> >> >> >> >> >> >
> >> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
> >> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
> >> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
> >> >> >> >> >> >Everybody is happy.
> >> >> >> >> >>
> >> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloaded
> >> >> >> >> >> over netlink?
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >I mean you could if you wanted to given this is just traditional
> >> >> >> >> >netlink which emits events (with some filtering when we integrate the
> >> >> >> >> >filter approach). But why?
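For reference, those notifications can already be watched from userspace with
the stock tooling, e.g.:

  # prints tc object add/change/delete events as they arrive
  tc monitor

assuming the P4TC objects reuse the usual rtnetlink notification groups as
described above; a daemon would subscribe to the same groups programmatically
rather than shelling out.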
> >> >> >> >>
> >> >> >> >> Nevermind.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
> >> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
> >> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
> >> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
> >> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
> >> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
> >> >> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
> >> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
> >> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
> >> >> >> >> >> >> >> and the main reason (I believe) why you need to have this in TC
> >> >> >> >> >> >> >> (offload) is then void.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
> >> >> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >You mean more fitting to the DOCA world? no, because i am a kernel
> >> >> >> >> >> >>
> >> >> >> >> >> >> Again, this has 0 relation to DOCA.
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
> >> >> >> >> >> >> plan to handle the offload by:
> >> >> >> >> >> >> 1) abuse devlink to flash p4 binary
> >> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >> >> from p4tc ndo_setup_tc
> >> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
> >> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
> >> >> >> >> >> >> from tc-flower ndo_setup_tc
> >> >> >> >> >> >> is really something that is making me a little bit nauseous.
> >> >> >> >> >> >>
> >> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
> >> >> >> >> >> >> sense to me to be honest.
> >> >> >> >> >> >
> >> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
> >> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
> >> >> >> >> >>
> >> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
> >> >> >> >> >> opposed to from day 1.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
> >> >> >> >> >
> >> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
> >> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
> >> >> >> >> >>
> >> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
> >> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
> >> >> >> >> >> IIRC.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
> >> >> >> >> >authenticated against what the hardware can accept, but the devlink
> >> >> >> >> >flash use i believe was from you as a compromise.
> >> >> >> >>
> >> >> >> >> Definitely not. I'm against devlink abuse for this from day 1.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
> >> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
> >> >> >> >> >>
> >> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
> >> >> >> >> >> that :)
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
> >> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
> >> >> >> >> >> >consumer perspective I would not want to deal with 4 different
> >> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
> >> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
> >> >> >> >> >>
> >> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
> >> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
> >> >> >> >> >>
> >> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
> >> >> >> >> >> given the fact that the vendor compilers do not run in the kernel.
> >> >> >> >> >> That is breaking your model.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
> >> >> >> >> >once installed is static.
> >> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
> >> >> >> >> >say "here are my 14 tables and their associated actions and here's how
> >> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
> >> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
> >> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
> >> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
> >> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
> >> >> >> >> >not anywhere in the spec.
> >> >> >> >> >That doesnt mean it is not a useful thing to have - but it is an
> >> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
> >> >> >> >> >implementation must support it is a bit out of scope and there are
> >> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
> >> >> >> >>
> >> >> >> >> I'm not talking about the spec. I'm talking about the offload
> >> >> >> >> implementation, the offload compiler, the offload runtime manager. You
> >> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
> >> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
> >> >> >> >> nothing to do with p4 frontend. This is offload implementation.
> >> >> >> >>
> >> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
> >> >> >> >> And if it is unoffloadable, do we really need it? IDK.
> >> >> >> >>
> >> >> >> >
> >> >> >> >Say what?
> >> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
> >> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
> >> >> >> >seem fine too).
> >> >> >>
> >> >> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
> >> >> >> parsing it in kernel would not be needed? No.
> >> >> >
> >> >> >By that definition anything that parses anything is a compiler.
> >> >> >
> >> >> >>
> >> >> >> >If your view is that all these runtime optimizations amount to a
> >> >> >> >compiler in the kernel/driver that is your, well, your view. In my
> >> >> >> >view (and others have said this to you already) the P4C compiler is
> >> >> >> >responsible for resource optimizations. The hardware supports P4, you
> >> >> >> >give it constraints and it knows what to do. At runtime, anything a
> >> >> >> >driver needs to do for resource optimization (resorting, reshuffling
> >> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
> >> >> >> >architecture approach.
> >> >> >>
> >> >> >> Sure, it is the offload implementation problem. And for them, you need
> >> >> >> to use userspace components. And that is the problem. This discussion
> >> >> >> leads nowhere, I don't know how differently I should describe this.
> >> >> >
> >> >> >Jiri - that's your view based on whatever design you have in your
> >> >> >mind. This has nothing to do with P4.
> >> >> >So let me repeat again:
> >> >> >1) A vendor's backend for P4 when it compiles ensures that resource
> >> >> >constraints are taken care of.
> >> >> >2) The same program can run in s/w.
> >> >> >3) It makes *ZERO* sense to mix vendor specific constraint
> >> >> >optimization(what you described as resorting, reshuffling etc) as part
> >> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
> >> >>
> >> >> I never suggested for it to be part of P4TC or P4. I don't know why you
> >> >> think so.
> >> >
> >> >I guess because this discussion is about P4/P4TC? I may have misread
> >> >what you are saying then because I saw the "P4TC must be in
> >> >userspace" mantra tied to this specific optimization requirement.
> >>
> >> Yeah, and again, my point is, this is unoffloadable.
> >
> >Here we go again with this weird claim. I guess we need to give an
> >award to the other vendors for doing the "impossible"?
>
> Having the compiler in the kernel would be awesome: a clear offload
> from kernel to device.
>
> That's not the case. Trampolines, binary blob parsing in the kernel doing
> the match with tc structures in drivers, abuse of devlink flash,
> tc-flower offload using this facility. All this was already seriously
> discussed before p4tc was even merged. Great, love that.
>
I was hoping not to say anything but my fingers couldnt help themselves:
So "unoffloadable" means there is a binary blob and this doesnt work
per your design idea of how it should work?
Not that it cant be implemented (clearly it has been implemented), it
is just not how _you_ would implement it? All along I thought this was
an issue with your hardware.
I know that when someone says devlink your answer is N.O - but that is
a different topic.
cheers,
jamal
>
> >
> >>Do we still need it in kernel?
> >
> >Didnt you just say it has nothing to do with P4TC?
> >
> >You "It cant be offloaded".
> >Me "it can be offloaded, other vendors are doing it and it has nothing
> >to do with P4 or P4TC and here's why..."
> >You " i didnt say it has anything to do with P4 or P4TC"
> >Me "ok i misunderstood i thought you said P4 cant be offloaded via
> >P4TC and has to be done in user space"
> >You "It cant be offloaded"
>
> Let me do my own misinterpretation please.
>
>
>
> >
> >Circular non-ending discussion.
> >
> >Then there's John
> >John "ebpf, ebpf, ebpf"
> >Me "we gave you ebpf"
> >John "but you are not using ebpf system call"
> >Me " but it doesnt make sense for the following reasons..."
> >John "but someone has already implemented ebpf.."
> >Me "yes, but here's how ..."
> >John "ebpf, ebpf, ebpf"
> >
> >Another circular non-ending discussion.
> >
> >Let's just end this electron-wasting lawyering discussion.
> >
> >cheers,
> >jamal
> >
> >
> >
> >
> >
> >
> >Bizarre. Unoffloadable according to you.
> >
> >>
> >> >
> >> >>
> >> >> >background task, specific to you, if you feel you need to move things
> >> >> >around at runtime.
> >> >>
> >> >> Yeah, that background task is in userspace.
> >> >>
> >> >
> >> >I don't have a horse in this race.
> >> >
> >> >cheers,
> >> >jamal
> >> >
> >> >>
> >> >> >
> >> >> >We agree on one thing at least: This discussion is going nowhere.
> >> >>
> >> >> Correct.
> >> >>
> >> >> >
> >> >> >cheers,
> >> >> >jamal
> >> >> >
> >> >> >
> >> >> >
> >> >> >> >
> >> >> >> >> >In my opinion that is a feature that could be added later out of
> >> >> >> >> >necessity (there is some good niche value in being able to add some
> >> >> >> >> >"dynamicism" to any pipeline) and influence the P4 standards on why it
> >> >> >> >> >is needed.
> >> >> >> >> >It should be doable today in a brute force way (this is just one
> >> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
> >> >> >> >> >i am sure there are other approaches and the idea is by no means
> >> >> >> >> >proven.
> >> >> >> >> >
> >> >> >> >> >1) User space creates/compiles/adds/activates your program that has 14
> >> >> >> >> >tables at tc prio X chain Y
> >> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
> >> >> >> >> >table 15 after table 3, visited when metadata foo=5
> >> >> >> >> > b) your compiler in user space compiles a brand new program which
> >> >> >> >> >satisfies #2a (how this program was authored is out of scope of
> >> >> >> >> >discussion)
> >> >> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
> >> >> >> >> > d) user space deletes tc prio X chain Y (and make sure your packets
> >> >> >> >> >entry point is whatever #c is)
> >> >> >> >>
> >> >> >> >> I never suggested anything like what you describe. I'm not sure why you
> >> >> >> >> think so.
> >> >> >> >
> >> >> >> >It's the same class of problems - the paper i pointed to (coauthored
> >> >> >> >by Matty and others) has runtime resource optimizations which are
> >> >> >> >tantamount to changing the nature of the pipeline. We may need to
> >> >> >> >profile in the kernel but all those optimizations can be derived in
> >> >> >> >user space using the approach I described.
> >> >> >> >
> >> >> >> >cheers,
> >> >> >> >jamal
> >> >> >> >
> >> >> >> >
> >> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> >kernel process - which is ironically the same thing we are going
> >> >> >> >> >> >through here ;->
> >> >> >> >> >> >
> >> >> >> >> >> >cheers,
> >> >> >> >> >> >jamal
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >cheers,
> >> >> >> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 16:30 ` Jamal Hadi Salim
@ 2023-11-23 17:53 ` Edward Cree
2023-11-23 18:09 ` Jiri Pirko
2023-11-23 18:53 ` Jakub Kicinski
2023-11-23 18:04 ` Jiri Pirko
1 sibling, 2 replies; 79+ messages in thread
From: Edward Cree @ 2023-11-23 17:53 UTC (permalink / raw)
To: Jamal Hadi Salim, Jiri Pirko
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
On 23/11/2023 16:30, Jamal Hadi Salim wrote:
> I was hoping not to say anything but my fingers couldnt help themselves:
> So "unoffloadable" means there is a binary blob and this doesnt work
> per your design idea of how it should work?
> Not that it cant be implemented (clearly it has been implemented), it
> is just not how _you_ would implement it? All along I thought this was
> an issue with your hardware.
The kernel doesn't like to trust offload blobs from a userspace compiler,
because it has no way to be sure that what comes out of the compiler
matches the rules/tables/whatever it has in the SW datapath.
It's also a support nightmare because it's basically like each user
compiling their own device firmware. At least normally with device
firmware the driver side is talking to something with narrow/fixed
semantics and went through upstream review, even if the firmware side is
still a black box.
Just to prove I'm not playing favourites: this is *also* a problem with
eBPF offloads like Nanotubes, and I'm not convinced we have a viable
solution yet.
The only way I can see to handle it is something analogous to proof-
carrying code, where the kernel (driver, since the blob is likely to be
wholly vendor-specific) can inspect the binary blob and verify somehow
that (assuming the HW behaves according to its datasheet) it implements
the same thing that exists in SW.
Or simplify the hardware design enough that the compiler can be small
and tight enough to live in-kernel, but that's often impossible.
-ed
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 16:30 ` Jamal Hadi Salim
2023-11-23 17:53 ` Edward Cree
@ 2023-11-23 18:04 ` Jiri Pirko
1 sibling, 0 replies; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 18:04 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Daniel Borkmann, John Fastabend, netdev, deb.chatterjee,
anjali.singhai, Vipin.Jain, namrata.limaye, tom, mleitner,
Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem, edumazet,
kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk, dan.daly,
chris.sommers, john.andy.fingerhut
Thu, Nov 23, 2023 at 05:30:58PM CET, jhs@mojatatu.com wrote:
>On Thu, Nov 23, 2023 at 10:28 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Thu, Nov 23, 2023 at 03:28:07PM CET, jhs@mojatatu.com wrote:
>> >On Thu, Nov 23, 2023 at 9:07 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >>
>> >> Thu, Nov 23, 2023 at 02:45:50PM CET, jhs@mojatatu.com wrote:
>> >> >On Thu, Nov 23, 2023 at 8:34 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >>
>> >> >> Thu, Nov 23, 2023 at 02:22:11PM CET, jhs@mojatatu.com wrote:
>> >> >> >On Thu, Nov 23, 2023 at 1:36 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >>
>> >> >> >> Wed, Nov 22, 2023 at 08:35:21PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >On Wed, Nov 22, 2023 at 1:31 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >>
>> >> >> >> >> Wed, Nov 22, 2023 at 04:14:02PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >On Wed, Nov 22, 2023 at 4:25 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Tue, Nov 21, 2023 at 04:21:44PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >On Tue, Nov 21, 2023 at 9:19 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Tue, Nov 21, 2023 at 02:47:40PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >On Tue, Nov 21, 2023 at 8:06 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 11:56:50PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >> >On Mon, Nov 20, 2023 at 4:49 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >> On 11/20/23 8:56 PM, Jamal Hadi Salim wrote:
>> >> >> >> >> >> >> >> >> > On Mon, Nov 20, 2023 at 1:10 PM Jiri Pirko <jiri@resnulli.us> wrote:
>> >> >> >> >> >> >> >> >> >> Mon, Nov 20, 2023 at 03:23:59PM CET, jhs@mojatatu.com wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> [...]
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> tc BPF and XDP already have widely used infrastructure and can be developed
>> >> >> >> >> >> >> >> >> against libbpf or other user space libraries for a user space control plane.
>> >> >> >> >> >> >> >> >> With 'control plane' you refer here to the tc / netlink shim you've built,
>> >> >> >> >> >> >> >> >> but looking at the tc command line examples, this doesn't really provide a
>> >> >> >> >> >> >> >> >> good user experience (you call it p4 but people load bpf obj files). If the
>> >> >> >> >> >> >> >> >> expectation is that an operator should run tc commands, then neither it's
>> >> >> >> >> >> >> >> >> a nice experience for p4 nor for BPF folks. From a BPF PoV, we moved over
>> >> >> >> >> >> >> >> >> to bpf_mprog and plan to also extend this for XDP to have a common look and
>> >> >> >> >> >> >> >> >> feel wrt networking for developers. Why can't this be reused?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >The filter loading which loads the program is considered pipeline
>> >> >> >> >> >> >> >> >instantiation - consider it as "provisioning" more than "control"
>> >> >> >> >> >> >> >> >which runs at runtime. "control" is purely netlink based. The iproute2
>> >> >> >> >> >> >> >> >code we use links libbpf for example for the filter. If we can achieve
>> >> >> >> >> >> >> >> >the same with bpf_mprog then sure - we just dont want to lose
>> >> >> >> >> >> >> >> >functionality though. off top of my head, some sample space:
>> >> >> >> >> >> >> >> >- we could have multiple pipelines with different priorities (which tc
>> >> >> >> >> >> >> >> >provides to us) - and each pipeline may have its own logic with many
>> >> >> >> >> >> >> >> >tables etc (and the choice to iterate the next one is essentially
>> >> >> >> >> >> >> >> >encoded in the tc action codes)
>> >> >> >> >> >> >> >> >- we use tc block to map groups of ports (which i dont think bpf has
>> >> >> >> >> >> >> >> >internal access of)
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >In regards to usability: no i dont expect someone doing things at
>> >> >> >> >> >> >> >> >scale to use command line tc. The APIs are via netlink. But the tc cli
>> >> >> >> >> >> >> >> >is must for the rest of the masses per our traditions. Also i really
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I don't follow. You repeatedly mention "the must of the traditional tc
>> >> >> >> >> >> >> >> cli", but what of the existing traditional cli you use for p4tc?
>> >> >> >> >> >> >> >> If I look at the examples, pretty much everything looks new to me.
>> >> >> >> >> >> >> >> Example:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> This is just TC/RTnetlink used as a channel to pass new things over. If
>> >> >> >> >> >> >> >> that is the case, what's traditional here?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >What is not traditional about it?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Okay, so in that case, the following example communicating with a
>> >> >> >> >> >> >> userspace daemon using an imaginary "p4ctrl" app is equally traditional:
>> >> >> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> >> >
>> >> >> >> >> >> >Huh? Thats just an application - classical tc which part of iproute2
>> >> >> >> >> >> >that is sending to the kernel, no different than "tc flower.."
>> >> >> >> >> >> >Where do you get the "userspace" daemon part? Yes, you can write a
>> >> >> >> >> >> >daemon but it will use the same APIs as tc.
>> >> >> >> >> >>
>> >> >> >> >> >> Okay, so which part is the "tradition"?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Provides tooling via tc cli that _everyone_ in the tc world is
>> >> >> >> >> >familiar with - which uses the same syntax as other tc extensions do,
>> >> >> >> >> >same expectations (eg events, request responses, familiar commands for
>> >> >> >> >> >dumping, flushing etc). Basically someone familiar with tc will pick
>> >> >> >> >> >this up and operate it very quickly and would have an easier time
>> >> >> >> >> >debugging it.
>> >> >> >> >> >There are caveats - as will be with all new classifiers - but those
>> >> >> >> >> >are within reason.
>> >> >> >> >>
>> >> >> >> >> Okay, so syntax familiarity wise, what's the difference between
>> >> >> >> >> following 2 approaches:
>> >> >> >> >> $ tc p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> $ p4ctrl create myprog/table/mytable dstAddr 10.0.1.2/32 \
>> >> >> >> >> action send_to_port param port eno1
>> >> >> >> >> ?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >didnt even want to use ebpf at all for operator experience reasons -
>> >> >> >> >> >> >> >> >it requires a compilation of the code and an extra loading compared to
>> >> >> >> >> >> >> >> >what our original u32/pedit code offered.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >> I don't quite follow why not most of this could be implemented entirely in
>> >> >> >> >> >> >> >> >> user space without the detour of this and you would provide a developer
>> >> >> >> >> >> >> >> >> library which could then be integrated into a p4 runtime/frontend? This
>> >> >> >> >> >> >> >> >> way users never interface with ebpf parts nor tc given they also shouldn't
>> >> >> >> >> >> >> >> >> have to - it's an implementation detail. This is what John was also pointing
>> >> >> >> >> >> >> >> >> out earlier.
>> >> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >Netlink is the API. We will provide a library for object manipulation
>> >> >> >> >> >> >> >> >which abstracts away the need to know netlink. Someone who for their
>> >> >> >> >> >> >> >> >own reasons wants to use p4runtime or TDI could write on top of this.
>> >> >> >> >> >> >> >> >I would not design a kernel interface to just meet p4runtime (we
>> >> >> >> >> >> >> >> >already have TDI which came later which does things differently). So i
>> >> >> >> >> >> >> >> >expect us to support both those two. And if i was to do something on
>> >> >> >> >> >> >> >> >SDN that was more robust i would write my own that still uses these
>> >> >> >> >> >> >> >> >netlink interfaces.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Actually, what Daniel says about the p4 library used as a backend to p4
>> >> >> >> >> >> >> >> frontend is pretty much aligned what I claimed on the p4 calls couple of
>> >> >> >> >> >> >> >> times. If you have this p4 userspace tooling, it is easy for offloads to
>> >> >> >> >> >> >> >> replace the backend with a vendor-specific library which allows p4 offload
>> >> >> >> >> >> >> >> suitable for all vendors (your plan of p4tc offload does not work well
>> >> >> >> >> >> >> >> for our hw, as we repeatedly claimed).
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >That's you - NVIDIA. You have chosen a path away from the kernel
>> >> >> >> >> >> >> >towards DOCA. I understand NVIDIA's frustration with dealing with
>> >> >> >> >> >> >> >upstream process (which has been cited to me as a good reason for
>> >> >> >> >> >> >> >DOCA) but please dont impose these values and your politics on other
>> >> >> >> >> >> >> >vendors(Intel, AMD for example) who are more than willing to invest
>> >> >> >> >> >> >> >into making the kernel interfaces the path forward. Your choice.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> No, you are missing the point. This has nothing to do with DOCA.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Right Jiri ;->
>> >> >> >> >> >> >
>> >> >> >> >> >> >> This
>> >> >> >> >> >> >> has to do with the simple limitation of your offload assuming there are
>> >> >> >> >> >> >> no runtime changes in the compiled pipeline. For Intel, maybe they
>> >> >> >> >> >> >> aren't, and it's a good fit for them. All I say is, that it is not the
>> >> >> >> >> >> >> good fit for everyone.
>> >> >> >> >> >> >
>> >> >> >> >> >> > a) it is not part of the P4 spec to dynamically make changes to the
>> >> >> >> >> >> >datapath pipeline after it is created and we are discussing a P4
>> >> >> >> >> >>
>> >> >> >> >> >> Isn't this up to the implementation? I mean from the p4 perspective,
>> >> >> >> >> >> everything is static. Hw might need to reshuffle the pipeline internally
>> >> >> >> >> >> during rule insertion/removal in order to optimize the layout.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >But do note: the focus here is on P4 (hence the name P4TC).
>> >> >> >> >> >
>> >> >> >> >> >> >implementation not an extension that would add more value b) We are
>> >> >> >> >> >> >more than happy to add extensions in the future to accommodate for
>> >> >> >> >> >> >features but first _P4 spec_ must be met c) we had longer discussions
>> >> >> >> >> >> >with Matty, Khalid and the Rice folks who wrote a paper on that topic
>> >> >> >> >> >> >which you probably didnt attend and everything that needs to be done
>> >> >> >> >> >> >can be from user space today for all those optimizations.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Conclusion is: For what you need to do (which i dont believe is a
>> >> >> >> >> >> >limitation in your hardware rather a design decision on your part) run
>> >> >> >> >> >> >your user space daemon, do optimizations and update the datapath.
>> >> >> >> >> >> >Everybody is happy.
>> >> >> >> >> >>
>> >> >> >> >> >> Should the userspace daemon listen on inserted rules to be offloaded
>> >> >> >> >> >> over netlink?
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >I mean you could if you wanted to given this is just traditional
>> >> >> >> >> >netlink which emits events (with some filtering when we integrate the
>> >> >> >> >> >filter approach). But why?
>> >> >> >> >>
>> >> >> >> >> Nevermind.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >Nobody is stopping you from offering your customers proprietary
>> >> >> >> >> >> >> >solutions which include a specific ebpf approach alongside DOCA. We
>> >> >> >> >> >> >> >believe that a singular interface regardless of the vendor is the
>> >> >> >> >> >> >> >right way forward. IMHO, this siloing that unfortunately is also added
>> >> >> >> >> >> >> >by eBPF being a double edged sword is not good for the community.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> As I also said on the p4 call couple of times, I don't see the kernel
>> >> >> >> >> >> >> >> as the correct place to do the p4 abstractions. Why don't you do it in
>> >> >> >> >> >> >> >> userspace and give vendors the possibility to have p4 backends with compilers,
>> >> >> >> >> >> >> >> runtime optimizations etc in userspace, talking to the HW in the
>> >> >> >> >> >> >> >> vendor-suitable way too. Then the SW implementation could be easily eBPF
>> >> >> >> >> >> >> >> and the main reason (I believe) why you need to have this in TC
>> >> >> >> >> >> >> >> (offload) is then void.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> The "everyone wants to use TC/netlink" claim does not seem correct
>> >> >> >> >> >> >> >> to me. Why not to have one Linux p4 solution that fits everyones needs?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >You mean more fitting to the DOCA world? no, because i am a kernel
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Again, this has 0 relation to DOCA.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >first person and kernel interfaces are good for everyone.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Yeah, not really. Not always the kernel is the right answer. Your/Intel
>> >> >> >> >> >> >> plan to handle the offload by:
>> >> >> >> >> >> >> 1) abuse devlink to flash p4 binary
>> >> >> >> >> >> >> 2) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> >> from p4tc ndo_setup_tc
>> >> >> >> >> >> >> 3) abuse devlink to flash p4 binary for tc-flower
>> >> >> >> >> >> >> 4) parse the binary in kernel to match to the table ids of rules coming
>> >> >> >> >> >> >> from tc-flower ndo_setup_tc
>> >> >> >> >> >> >> is really something that is making me a little bit nauseous.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> If you don't have a feasible plan to do the offload, p4tc does not make
>> >> >> >> >> >> >> sense to me to be honest.
>> >> >> >> >> >> >
>> >> >> >> >> >> >You mean if there's no plan to match your (NVIDIA?) point of view.
>> >> >> >> >> >> >For #1 - how's this different from DDP? Wasnt that your suggestion to
>> >> >> >> >> >>
>> >> >> >> >> >> I doubt that. Any flashing-blob-parsing-in-kernel is something I'm
>> >> >> >> >> >> opposed to from day 1.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Oh well - it is in the kernel and it works fine tbh.
>> >> >> >> >> >
>> >> >> >> >> >> >begin with? For #2 Nobody is proposing to do anything of the sort. The
>> >> >> >> >> >> >ndo is passed IDs for the objects and associated contents. For #3+#4
>> >> >> >> >> >>
>> >> >> >> >> >> During offload, you need to parse the blob in driver to be able to match
>> >> >> >> >> >> the ids with blob entities. That was presented by you/Intel in the past
>> >> >> >> >> >> IIRC.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >You are correct - in case of offload the netlink IDs will have to be
>> >> >> >> >> >authenticated against what the hardware can accept, but the devlink
>> >> >> >> >> >flash use i believe was from you as a compromise.
>> >> >> >> >>
>> >> >> >> >> Definitely not. I'm against devlink abuse for this from day 1.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >tc flower thing has nothing to do with P4TC that was just some random
>> >> >> >> >> >> >proposal someone made seeing if they could ride on top of P4TC.
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, it's not yet merged and already mentally used for abuse. I love
>> >> >> >> >> >> that :)
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> >Besides this nobody really has to satisfy your point of view - like i
>> >> >> >> >> >> >said earlier feel free to provide proprietary solutions. From a
>> >> >> >> >> >> >consumer perspective I would not want to deal with 4 different
>> >> >> >> >> >> >vendors with 4 different proprietary approaches. The kernel is the
>> >> >> >> >> >> >unifying part. You seemed happier with tc flower just not with the
>> >> >> >> >> >>
>> >> >> >> >> >> Yeah, that is my point, why the unifying part can't be a userspace
>> >> >> >> >> >> daemon/library with multiple backends (p4tc, bpf, vendorX, vendorY, ..)?
>> >> >> >> >> >>
>> >> >> >> >> >> I just don't see the kernel as a good fit for abstraction here,
>> >> >> >> >> >> given the fact that the vendor compilers do not run in the kernel.
>> >> >> >> >> >> That is breaking your model.
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >Jiri - we want to support P4, first. Like you said the P4 pipeline,
>> >> >> >> >> >once installed is static.
>> >> >> >> >> >P4 doesnt allow dynamic update of the pipeline. For example, once you
>> >> >> >> >> >say "here are my 14 tables and their associated actions and here's how
>> >> >> >> >> >the pipeline main control (on how to iterate the tables etc) is going
>> >> >> >> >> >to be" and after you instantiate/activate that pipeline, you dont go
>> >> >> >> >> >back 5 minutes later and say "sorry, please introduce table 15, which
>> >> >> >> >> >i want you to walk to after you visit table 3 if metadata foo is 5" or
>> >> >> >> >> >"shoot, let's change that table 5 to be exact instead of LPM". It's
>> >> >> >> >> >not anywhere in the spec.
>> >> >> >> >> >That doesnt mean it is not a useful thing to have - but it is an
>> >> >> >> >> >invention that has _nothing to do with the P4 spec_; so saying a P4
>> >> >> >> >> >implementation must support it is a bit out of scope and there are
>> >> >> >> >> >vendors with hardware who support P4 today that dont need any of this.
>> >> >> >> >>
>> >> >> >> >> I'm not talking about the spec. I'm talking about the offload
>> >> >> >> >> implementation, the offload compiler, the offload runtime manager. You
>> >> >> >> >> don't have those in kernel. That is the issue. The runtime manager is
>> >> >> >> >> the one to decide and reshuffle the hw internals. Again, this has
>> >> >> >> >> nothing to do with p4 frontend. This is offload implementation.
>> >> >> >> >>
>> >> >> >> >> And that is why I believe your p4 kernel implementation is unoffloadable.
>> >> >> >> >> And if it is unoffloadable, do we really need it? IDK.
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> >Say what?
>> >> >> >> >It's not offloadable in your hardware, you mean? Because i have beside
>> >> >> >> >me here an intel e2000 which offloads just fine (and the AMD folks
>> >> >> >> >seem fine too).
>> >> >> >>
>> >> >> >> Will Intel and AMD have compiler in kernel, so no blob transfer and
>> >> >> >> parsing it in kernel would not be needed? No.
>> >> >> >
>> >> >> >By that definition anything that parses anything is a compiler.
>> >> >> >
>> >> >> >>
>> >> >> >> >If your view is that all these runtime optimizations amount to a
>> >> >> >> >compiler in the kernel/driver that is your, well, your view. In my
>> >> >> >> >view (and others have said this to you already) the P4C compiler is
>> >> >> >> >responsible for resource optimizations. The hardware supports P4, you
>> >> >> >> >give it constraints and it knows what to do. At runtime, anything a
>> >> >> >> >driver needs to do for resource optimization (resorting, reshuffling
>> >> >> >> >etc), that is not a P4 problem - sorry if you have issues in your
>> >> >> >> >architecture approach.
>> >> >> >>
>> >> >> >> Sure, it is the offload implementation problem. And for them, you need
>> >> >> >> to use userspace components. And that is the problem. This discussion
>> >> >> >> leads nowhere, I don't know how differently I should describe this.
>> >> >> >
>> >> >> >Jiri - that's your view based on whatever design you have in your
>> >> >> >mind. This has nothing to do with P4.
>> >> >> >So let me repeat again:
>> >> >> >1) A vendor's backend for P4 when it compiles ensures that resource
>> >> >> >constraints are taken care of.
>> >> >> >2) The same program can run in s/w.
>> >> >> >3) It makes *ZERO* sense to mix vendor specific constraint
>> >> >> >optimization(what you described as resorting, reshuffling etc) as part
>> >> >> >of P4TC or P4. Absolutely nothing to do with either. Write a
>> >> >>
>> >> >> I never suggested for it to be part of P4TC or P4. I don't know why you
>> >> >> think so.
>> >> >
>> >> >I guess because this discussion is about P4/P4TC? I may have misread
>> >> >what you are saying then because I saw the "P4TC must be in
>> >> >userspace" mantra tied to this specific optimization requirement.
>> >>
>> >> Yeah, and again, my point is, this is unoffloadable.
>> >
>> >Here we go again with this weird claim. I guess we need to give an
>> >award to the other vendors for doing the "impossible"?
>>
>> Having the compiler in the kernel - that would be awesome. A clear offload
>> from kernel to device.
>>
>> That's not the case. Trampolines, binary blob parsing in the kernel doing
>> the match with tc structures in drivers, abuse of devlink flash,
>> tc-flower offload using this facility. All this was already seriously
>> discussed before p4tc is even merged. Great, love that.
>>
>
>I was hoping not to say anything but my fingers couldnt help themselves:
>So "unoffloadable" means there is a binary blob and this doesnt work
>per your design idea of how it should work?
Not going to repeat myself.
>Not that it cant be implemented (clearly it has been implemented), it
>is just not how _you_ would implement it? All along I thought this was
>an issue with your hardware.
The subset of issues is present even with the Intel approach. I thought that
was clear; apparently it's not. I'm done.
>I know that when someone says devlink your answer is N.O - but that is
>a different topic.
That's probably because a lot of the time people tend to abuse it.
>
>cheers,
>jamal
>
>>
>> >
>> >>Do we still need it in kernel?
>> >
>> >Didnt you just say it has nothing to do with P4TC?
>> >
>> >You "It cant be offloaded".
>> >Me "it can be offloaded, other vendors are doing it and it has nothing
>> >to do with P4 or P4TC and here's why..."
>> >You " i didnt say it has anything to do with P4 or P4TC"
>> >Me "ok i misunderstood i thought you said P4 cant be offloaded via
>> >P4TC and has to be done in user space"
>> >You "It cant be offloaded"
>>
>> Let me do my own misinterpretation please.
>>
>>
>>
>> >
>> >Circular non-ending discussion.
>> >
>> >Then there's John
>> >John "ebpf, ebpf, ebpf"
>> >Me "we gave you ebpf"
>> >John "but you are not using ebpf system call"
>> >Me " but it doesnt make sense for the following reasons..."
>> >John "but someone has already implemented ebpf.."
>> >Me "yes, but here's how ..."
>> >John "ebpf, ebpf, ebpf"
>> >
>> >Another circular non-ending discussion.
>> >
>> >Let's just end this electron-wasting lawyering discussion.
>> >
>> >cheers,
>> >jamal
>> >
>> >
>> >
>> >
>> >
>> >
>> >Bizarre. Unoffloadable according to you.
>> >
>> >>
>> >> >
>> >> >>
>> >> >> >background task, specific to you, if you feel you need to move things
>> >> >> >around at runtime.
>> >> >>
>> >> >> Yeah, that background task is in userspace.
>> >> >>
>> >> >
>> >> >I don't have a horse in this race.
>> >> >
>> >> >cheers,
>> >> >jamal
>> >> >
>> >> >>
>> >> >> >
>> >> >> >We agree on one thing at least: This discussion is going nowhere.
>> >> >>
>> >> >> Correct.
>> >> >>
>> >> >> >
>> >> >> >cheers,
>> >> >> >jamal
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> >> >In my opinion that is a feature that could be added later out of
>> >> >> >> >> >necessity (there is some good niche value in being able to add some
>> >> >> >> >> >"dynamism" to any pipeline) and influence the P4 standards on why it
>> >> >> >> >> >is needed.
>> >> >> >> >> >It should be doable today in a brute-force way (this is just one
>> >> >> >> >> >suggestion that came to me when Rice University/Nvidia presented[1]);
>> >> >> >> >> >I am sure there are other approaches and the idea is by no means
>> >> >> >> >> >proven.
>> >> >> >> >> >
>> >> >> >> >> >1) User space creates/compiles/adds/activates your program that has 14
>> >> >> >> >> >tables at tc prio X chain Y
>> >> >> >> >> >2) a) 5 minutes later user space decides it wants to change and add
>> >> >> >> >> >table 3 after table 15, visited when metadata foo=5
>> >> >> >> >> > b) your compiler in user space compiles a brand new program which
>> >> >> >> >> >satisfies #2a (how this program was authored is out of scope of this
>> >> >> >> >> >discussion)
>> >> >> >> >> > c) user space adds the new program at tc prio X+1 chain Y or another chain Z
>> >> >> >> >> > d) user space deletes tc prio X chain Y (and makes sure your packets'
>> >> >> >> >> >entry point is whatever #c is)
>> >> >> >> >>
>> >> >> >> >> I never suggested anything like what you describe. I'm not sure why you
>> >> >> >> >> think so.
>> >> >> >> >
>> >> >> >> >It's the same class of problems - the paper I pointed to (co-authored
>> >> >> >> >by Matty and others) has runtime resource optimizations which are
>> >> >> >> >tantamount to changing the nature of the pipeline. We may need to
>> >> >> >> >profile in the kernel, but all those optimizations can be derived in
>> >> >> >> >user space using the approach I described.
>> >> >> >> >
>> >> >> >> >cheers,
>> >> >> >> >jamal
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >> >[1] https://www.cs.rice.edu/~eugeneng/papers/SIGCOMM23-Pipeleon.pdf
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> >kernel process - which is ironically the same thing we are going
>> >> >> >> >> >> >through here ;->
>> >> >> >> >> >> >
>> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >jamal
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >cheers,
>> >> >> >> >> >> >> >jamal
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 17:53 ` Edward Cree
@ 2023-11-23 18:09 ` Jiri Pirko
2023-11-23 18:58 ` Jamal Hadi Salim
2023-11-23 18:53 ` Jakub Kicinski
1 sibling, 1 reply; 79+ messages in thread
From: Jiri Pirko @ 2023-11-23 18:09 UTC (permalink / raw)
To: Edward Cree
Cc: Jamal Hadi Salim, Daniel Borkmann, John Fastabend, netdev,
deb.chatterjee, anjali.singhai, Vipin.Jain, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk,
dan.daly, chris.sommers, john.andy.fingerhut
Thu, Nov 23, 2023 at 06:53:42PM CET, ecree.xilinx@gmail.com wrote:
>On 23/11/2023 16:30, Jamal Hadi Salim wrote:
>> I was hoping not to say anything but my fingers couldnt help themselves:
>> So "unoffloadable" means there is a binary blob and this doesnt work
>> per your design idea of how it should work?
>> Not that it cant be implemented (clearly it has been implemented), it
>> is just not how _you_ would implement it? All along I thought this was
>> an issue with your hardware.
>
>The kernel doesn't like to trust offload blobs from a userspace compiler,
> because it has no way to be sure that what comes out of the compiler
> matches the rules/tables/whatever it has in the SW datapath.
>It's also a support nightmare because it's basically like each user
> compiling their own device firmware. At least normally with device
> firmware the driver side is talking to something with narrow/fixed
> semantics and went through upstream review, even if the firmware side is
> still a black box.
>Just to prove I'm not playing favourites: this is *also* a problem with
> eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> solution yet.
Just for the record, I'm not aware of anyone suggesting p4 eBPF offload
in this thread.
>
>The only way I can see to handle it is something analogous to proof-
> carrying code, where the kernel (driver, since the blob is likely to be
> wholly vendor-specific) can inspect the binary blob and verify somehow
> that (assuming the HW behaves according to its datasheet) it implements
> the same thing that exists in SW.
>Or simplify the hardware design enough that the compiler can be small
> and tight enough to live in-kernel, but that's often impossible.
Yeah, that would solve the offloading problem. From what I'm hearing
from multiple sides, not going to happen.
>
>-ed
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 17:53 ` Edward Cree
2023-11-23 18:09 ` Jiri Pirko
@ 2023-11-23 18:53 ` Jakub Kicinski
2023-11-23 19:42 ` Tom Herbert
2023-11-24 10:39 ` Jiri Pirko
1 sibling, 2 replies; 79+ messages in thread
From: Jakub Kicinski @ 2023-11-23 18:53 UTC (permalink / raw)
To: Edward Cree
Cc: Jamal Hadi Salim, Jiri Pirko, Daniel Borkmann, John Fastabend,
netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
xiyou.wangcong, davem, edumazet, pabeni, vladbu, horms, bpf,
khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
> The kernel doesn't like to trust offload blobs from a userspace compiler,
> because it has no way to be sure that what comes out of the compiler
> matches the rules/tables/whatever it has in the SW datapath.
> It's also a support nightmare because it's basically like each user
> compiling their own device firmware.
Practically speaking every high speed NIC runs a huge binary blob of FW.
First, let's acknowledge that as reality.
Second, there is no equivalent for arbitrary packet parsing in the
kernel proper. Offload means take something from the host and put it
on the device. If there's nothing in the kernel, we can't consider
the new functionality an offload.
I understand that "we offload SW functionality" is our general policy,
but we should remember why this policy is in place, and not
automatically jump to the conclusion.
> At least normally with device firmware the driver side is talking to
> something with narrow/fixed semantics and went through upstream
> review, even if the firmware side is still a black box.
We should be building things which are useful and open (as in
extensible by people "from the street"). With that in mind, to me,
a more practical approach would be to try to figure out a common
and rigid FW interface for expressing the parsing graph.
But that's an interface going from the binary blob to the kernel.
> Just to prove I'm not playing favourites: this is *also* a problem with
> eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> solution yet.
BPF offloads are actual offloads. Config/state is in the kernel,
you need to pop it out to user space, then prove that it's what
user intended.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 18:09 ` Jiri Pirko
@ 2023-11-23 18:58 ` Jamal Hadi Salim
0 siblings, 0 replies; 79+ messages in thread
From: Jamal Hadi Salim @ 2023-11-23 18:58 UTC (permalink / raw)
To: Jiri Pirko
Cc: Edward Cree, Daniel Borkmann, John Fastabend, netdev,
deb.chatterjee, anjali.singhai, Vipin.Jain, namrata.limaye, tom,
mleitner, Mahesh.Shirshyad, tomasz.osinski, xiyou.wangcong, davem,
edumazet, kuba, pabeni, vladbu, horms, bpf, khalidm, toke, mattyk,
dan.daly, chris.sommers, john.andy.fingerhut
On Thu, Nov 23, 2023 at 1:09 PM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Thu, Nov 23, 2023 at 06:53:42PM CET, ecree.xilinx@gmail.com wrote:
> >On 23/11/2023 16:30, Jamal Hadi Salim wrote:
> >> I was hoping not to say anything but my fingers couldnt help themselves:
> >> So "unoffloadable" means there is a binary blob and this doesnt work
> >> per your design idea of how it should work?
> >> Not that it cant be implemented (clearly it has been implemented), it
> >> is just not how _you_ would implement it? All along I thought this was
> >> an issue with your hardware.
> >
> >The kernel doesn't like to trust offload blobs from a userspace compiler,
> > because it has no way to be sure that what comes out of the compiler
> > matches the rules/tables/whatever it has in the SW datapath.
> >It's also a support nightmare because it's basically like each user
> > compiling their own device firmware. At least normally with device
> > firmware the driver side is talking to something with narrow/fixed
> > semantics and went through upstream review, even if the firmware side is
> > still a black box.
> >Just to prove I'm not playing favourites: this is *also* a problem with
> > eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> > solution yet.
>
> Just for the record, I'm not aware of anyone suggesting p4 eBPF offload
> in this thread.
>
>
> >
> >The only way I can see to handle it is something analogous to proof-
> > carrying code, where the kernel (driver, since the blob is likely to be
> > wholly vendor-specific) can inspect the binary blob and verify somehow
> > that (assuming the HW behaves according to its datasheet) it implements
> > the same thing that exists in SW.
> >Or simplify the hardware design enough that the compiler can be small
> > and tight enough to live in-kernel, but that's often impossible.
>
> Yeah, that would solve the offloading problem. From what I'm hearing
> from multiple sides, not going to happen.
This is a topic that has been discussed many times. The idea Tom is
describing has been the basis of the discussion, i.e., some form of
signature that is tied to the binary as well as the s/w side of things
when you do offload. I am not an attestation expert - but isnt that
sufficient?
cheers,
jamal
> >
> >-ed
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 18:53 ` Jakub Kicinski
@ 2023-11-23 19:42 ` Tom Herbert
2023-11-24 10:39 ` Jiri Pirko
1 sibling, 0 replies; 79+ messages in thread
From: Tom Herbert @ 2023-11-23 19:42 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Edward Cree, Jamal Hadi Salim, Jiri Pirko, Daniel Borkmann,
John Fastabend, netdev, deb.chatterjee, anjali.singhai,
Vipin.Jain, namrata.limaye, mleitner, Mahesh.Shirshyad,
tomasz.osinski, xiyou.wangcong, davem, edumazet, pabeni, vladbu,
horms, bpf, khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
On Thu, Nov 23, 2023 at 10:53 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
> > The kernel doesn't like to trust offload blobs from a userspace compiler,
> > because it has no way to be sure that what comes out of the compiler
> > matches the rules/tables/whatever it has in the SW datapath.
> > It's also a support nightmare because it's basically like each user
> > compiling their own device firmware.
>
Hi Jakub,
> Practically speaking every high speed NIC runs a huge binary blob of FW.
> First, let's acknowledge that as reality.
>
Yes. But we're also seeing a trend for programmable NICs. It's an
interesting question as to how the kernel can leverage that
programmability for the benefit of the user.
> Second, there is no equivalent for arbitrary packet parsing in the
> kernel proper. Offload means take something from the host and put it
> on the device. If there's nothing in the kernel, we can't consider
> the new functionality an offload.
That's completely true; however, I believe that eBPF has expanded our
definition of "what's in the kernel". For instance, we can do
arbitrary parsing in an XDP/eBPF program (in fact, it's still on my
list of things to do to rip out the flow dissector C code and replace it
with eBPF).
(https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf,
https://www.youtube.com/watch?v=zVnmVDSEoXc&list=PLrninrcyMo3L-hsJv23hFyDGRaeBY1EJO)
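As a rough sketch of what that looks like (a minimal, hypothetical example of XDP parsing, not the flow-dissector replacement referenced above; the program name is made up), an XDP program can walk arbitrary headers as long as it performs the bounds checks the verifier demands:

/* Minimal sketch: parse Ethernet and IPv4 in XDP. Assumes clang -O2
 * -target bpf and libbpf headers; names here are illustrative only. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_parse_sketch(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > data_end)
                return XDP_PASS;                /* truncated frame */
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;                /* only IPv4 in this sketch */

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
                return XDP_PASS;

        /* A real parser would keep walking the parse graph (L4, tunnels, ...)
         * and export what it found as metadata; this one just passes. */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";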
>
> I understand that "we offload SW functionality" is our general policy,
> but we should remember why this policy is in place, and not
> automatically jump to the conclusion.
>
> > At least normally with device firmware the driver side is talking to
> > something with narrow/fixed semantics and went through upstream
> > review, even if the firmware side is still a black box.
>
> We should be building things which are useful and open (as in
> extensible by people "from the street"). With that in mind, to me,
> a more practical approach would be to try to figure out a common
> and rigid FW interface for expressing the parsing graph.
Parse graphs are best represented by a declarative representation, not
an imperative one. This is a main reason why I want to replace the flow
dissector: a parser written in imperative C code is difficult to
maintain, as evidenced by the myriad of bugs in that code (particularly
when people added support for uncommon protocols). P4 got this part
right; however, I don't believe we need to boil the ocean by
programming the kernel in a new language. A better alternative is to
define an IR for this purpose. We do that in Common Parser Language
(CPL), which is a .json schema to describe parse graphs. With an IR we
can compile into arbitrary backends including P4, eBPF, C, and even
custom assembly instructions for parsing (arbitrary front-end languages
are facilitated as well).
(https://netdevconf.info/0x16/papers/11/High%20Performance%20Programmable%20Parsers.pdf)
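To make the declarative point concrete, here is a toy C illustration (this is not CPL, and every name in it is made up): the parse graph is plain data that a generic walker can interpret, or that a compiler backend could lower to eBPF, C or P4, so adding a protocol means adding a table entry rather than more imperative code:

/* Toy illustration: a parse graph expressed as data instead of code. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct parse_edge {
        uint32_t key;            /* value of the "next protocol" field */
        int next_node;           /* index into nodes[] */
};

struct parse_node {
        const char *name;
        size_t hdr_len;          /* fixed header length for this node */
        size_t next_field_off;   /* offset of the next-protocol field */
        size_t next_field_len;   /* its length in bytes (1 or 2 here) */
        const struct parse_edge *edges;
        int n_edges;
};

static const struct parse_edge eth_edges[] = {
        { 0x0800, 1 },           /* EtherType IPv4 -> ipv4 node */
};
static const struct parse_node nodes[] = {
        { "ethernet", 14, 12, 2, eth_edges, 1 },
        { "ipv4",     20,  9, 1, NULL,      0 },  /* leaf in this toy */
};

/* Walk the graph over a raw buffer; returns the last node reached. */
static int walk(const uint8_t *pkt, size_t len)
{
        int node = 0;
        for (;;) {
                const struct parse_node *n = &nodes[node];
                if (len < n->hdr_len)
                        return -1;               /* truncated */
                if (!n->n_edges)
                        return node;             /* leaf: done */
                uint32_t key = pkt[n->next_field_off];
                if (n->next_field_len == 2)
                        key = (key << 8) | pkt[n->next_field_off + 1];
                int next = -1;
                for (int i = 0; i < n->n_edges; i++)
                        if (n->edges[i].key == key)
                                next = n->edges[i].next_node;
                if (next < 0)
                        return node;             /* no edge: stop here */
                pkt += n->hdr_len;
                len -= n->hdr_len;
                node = next;
        }
}

int main(void)
{
        /* 14-byte Ethernet header with EtherType 0x0800, then 20 zero bytes. */
        uint8_t pkt[34] = { 0 };
        pkt[12] = 0x08; pkt[13] = 0x00;
        int r = walk(pkt, sizeof(pkt));
        printf("stopped at node: %s\n", r >= 0 ? nodes[r].name : "truncated");
        return 0;
}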
>
> But that's an interface going from the binary blob to the kernel.
>
> > Just to prove I'm not playing favourites: this is *also* a problem with
> > eBPF offloads like Nanotubes, and I'm not convinced we have a viable
> > solution yet.
>
> BPF offloads are actual offloads. Config/state is in the kernel,
> you need to pop it out to user space, then prove that it's what
> user intended.
It seems like offloading eBPF bytecode and running a VM on the offload
device is pretty much considered a non-starter. But what if we could
offload the _functionality_ of an eBPF program with confidence that
the functionality _exactly_ matches that of the eBPF program running
in the kernel? I believe that could be beneficial.
For instance, we all know that LRO never gained traction. The reason
is that each vendor does it however they want and no one can match
the exact functionality that SW GRO provides. It's not an offload of
kernel SW, so it's not viable. But suppose we wrote GRO in some
program that could be compiled into both eBPF and a device binary. Using
something like the hash technique I described, it seems like we could
properly do a kernel offload of GRO where the offload functionality
matches the software in the kernel.
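One possible reading of that hash idea, sketched in C purely for illustration (none of this is an existing kernel interface, and all names are invented): every artifact records a digest of the shared source it was compiled from, and the driver accepts the device binary only if its recorded digest matches the one carried by the eBPF program the kernel is actually running:

/* Sketch only: FNV-1a stands in for a real cryptographic hash or
 * signature scheme; the structs and functions are hypothetical. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static uint64_t fnv1a(const uint8_t *buf, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < len; i++) {
                h ^= buf[i];
                h *= 0x100000001b3ULL;
        }
        return h;
}

/* Toolchain side: every artifact records the digest of the source it
 * was compiled from. */
struct artifact {
        const char *kind;        /* "ebpf" or "device" */
        uint64_t src_digest;     /* digest of the shared source/IR */
};

static struct artifact build(const char *kind, const uint8_t *src, size_t len)
{
        struct artifact a = { kind, fnv1a(src, len) };
        return a;
}

/* Driver side: only offload if the device image was built from the same
 * source as the eBPF program currently loaded in the kernel. */
static int try_offload(const struct artifact *loaded_bpf,
                       const struct artifact *dev_blob)
{
        if (dev_blob->src_digest != loaded_bpf->src_digest)
                return -1;       /* built from something else */
        /* ... proceed with the offload here ... */
        return 0;
}

int main(void)
{
        const uint8_t gro_src[] = "gro-pipeline-source-v1";
        const uint8_t other[]   = "patched-source";

        struct artifact bpf_prog = build("ebpf",   gro_src, sizeof(gro_src));
        struct artifact dev_ok   = build("device", gro_src, sizeof(gro_src));
        struct artifact dev_bad  = build("device", other,   sizeof(other));

        printf("same source:      %d\n", try_offload(&bpf_prog, &dev_ok));
        printf("different source: %d\n", try_offload(&bpf_prog, &dev_bad));
        return 0;
}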
Tom
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH net-next v8 00/15] Introducing P4TC
2023-11-23 18:53 ` Jakub Kicinski
2023-11-23 19:42 ` Tom Herbert
@ 2023-11-24 10:39 ` Jiri Pirko
1 sibling, 0 replies; 79+ messages in thread
From: Jiri Pirko @ 2023-11-24 10:39 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Edward Cree, Jamal Hadi Salim, Daniel Borkmann, John Fastabend,
netdev, deb.chatterjee, anjali.singhai, Vipin.Jain,
namrata.limaye, tom, mleitner, Mahesh.Shirshyad, tomasz.osinski,
xiyou.wangcong, davem, edumazet, pabeni, vladbu, horms, bpf,
khalidm, toke, mattyk, dan.daly, chris.sommers,
john.andy.fingerhut
Thu, Nov 23, 2023 at 07:53:05PM CET, kuba@kernel.org wrote:
>On Thu, 23 Nov 2023 17:53:42 +0000 Edward Cree wrote:
>> The kernel doesn't like to trust offload blobs from a userspace compiler,
>> because it has no way to be sure that what comes out of the compiler
>> matches the rules/tables/whatever it has in the SW datapath.
>> It's also a support nightmare because it's basically like each user
>> compiling their own device firmware.
>
>Practically speaking every high speed NIC runs a huge binary blob of FW.
>First, let's acknowledge that as reality.
True, but I believe we need to differentiate between:
1) a vendor-created, versioned, signed binary fw blob
2) a user-compiled, on-demand blob
I look at 2) as "a configuration" of some sort.
>
>Second, there is no equivalent for arbitrary packet parsing in the
>kernel proper. Offload means take something from the host and put it
>on the device. If there's nothing in the kernel, we can't consider
>the new functionality an offload.
>
>I understand that "we offload SW functionality" is our general policy,
>but we should remember why this policy is in place, and not
>automatically jump to the conclusion.
It is in place so that there is a well-defined SW definition of what the
device offloads.
>
>> At least normally with device firmware the driver side is talking to
>> something with narrow/fixed semantics and went through upstream
>> review, even if the firmware side is still a black box.
>
>We should be building things which are useful and open (as in
>extensible by people "from the street"). With that in mind, to me,
>a more practical approach would be to try to figure out a common
>and rigid FW interface for expressing the parsing graph.
Hmm, could you elaborate a bit more on this one please?
>
>But that's an interface going from the binary blob to the kernel.
>
>> Just to prove I'm not playing favourites: this is *also* a problem with
>> eBPF offloads like Nanotubes, and I'm not convinced we have a viable
>> solution yet.
>
>BPF offloads are actual offloads. Config/state is in the kernel,
>you need to pop it out to user space, then prove that it's what
>user intended.
^ permalink raw reply [flat|nested] 79+ messages in thread
end of thread, other threads: [~2023-11-24 10:39 UTC | newest]
Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-16 14:59 [PATCH net-next v8 00/15] Introducing P4TC Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 01/15] net: sched: act_api: Introduce dynamic actions list Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 03/15] net/sched: act_api: Update tc_action_ops to account for dynamic actions Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 05/15] net: sched: act_api: Add support for preallocated dynamic action instances Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 06/15] net: introduce rcu_replace_pointer_rtnl Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 07/15] rtnl: add helper to check if group has listeners Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 08/15] p4tc: add P4 data types Jamal Hadi Salim
2023-11-16 16:03 ` Jiri Pirko
2023-11-17 12:01 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 09/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
2023-11-16 16:11 ` Jiri Pirko
2023-11-17 12:09 ` Jamal Hadi Salim
2023-11-20 8:18 ` Jiri Pirko
2023-11-20 12:48 ` Jamal Hadi Salim
2023-11-20 13:16 ` Jiri Pirko
2023-11-20 15:30 ` Jamal Hadi Salim
2023-11-20 16:25 ` Jiri Pirko
2023-11-20 18:20 ` David Ahern
2023-11-20 20:12 ` Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 10/15] p4tc: add action template create, update, delete, get, flush and dump Jamal Hadi Salim
2023-11-16 16:28 ` Jiri Pirko
2023-11-17 15:11 ` Jamal Hadi Salim
2023-11-20 8:19 ` Jiri Pirko
2023-11-20 13:45 ` Jamal Hadi Salim
2023-11-20 16:25 ` Jiri Pirko
2023-11-17 6:51 ` John Fastabend
2023-11-16 14:59 ` [PATCH net-next v8 11/15] p4tc: add template table " Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 12/15] p4tc: add runtime table entry create, update, get, delete, " Jamal Hadi Salim
2023-11-16 14:59 ` [PATCH net-next v8 13/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
2023-11-17 7:09 ` John Fastabend
2023-11-19 9:14 ` kernel test robot
2023-11-20 22:28 ` kernel test robot
2023-11-16 14:59 ` [PATCH net-next v8 14/15] p4tc: add P4 classifier Jamal Hadi Salim
2023-11-17 7:17 ` John Fastabend
2023-11-16 14:59 ` [PATCH net-next v8 15/15] p4tc: Add P4 extern interface Jamal Hadi Salim
2023-11-16 16:42 ` Jiri Pirko
2023-11-17 12:14 ` Jamal Hadi Salim
2023-11-20 8:22 ` Jiri Pirko
2023-11-20 14:02 ` Jamal Hadi Salim
2023-11-20 16:27 ` Jiri Pirko
2023-11-20 19:00 ` Jamal Hadi Salim
2023-11-17 6:27 ` [PATCH net-next v8 00/15] Introducing P4TC John Fastabend
2023-11-17 12:49 ` Jamal Hadi Salim
2023-11-17 18:37 ` John Fastabend
2023-11-17 20:46 ` Jamal Hadi Salim
2023-11-20 9:39 ` Jiri Pirko
2023-11-20 14:23 ` Jamal Hadi Salim
2023-11-20 18:10 ` Jiri Pirko
2023-11-20 19:56 ` Jamal Hadi Salim
2023-11-20 20:41 ` John Fastabend
2023-11-20 22:13 ` Jamal Hadi Salim
2023-11-20 21:48 ` Daniel Borkmann
2023-11-20 22:56 ` Jamal Hadi Salim
2023-11-21 13:06 ` Jiri Pirko
2023-11-21 13:47 ` Jamal Hadi Salim
2023-11-21 14:19 ` Jiri Pirko
2023-11-21 15:21 ` Jamal Hadi Salim
2023-11-22 9:25 ` Jiri Pirko
2023-11-22 15:14 ` Jamal Hadi Salim
2023-11-22 18:31 ` Jiri Pirko
2023-11-22 18:50 ` John Fastabend
2023-11-22 19:35 ` Jamal Hadi Salim
2023-11-23 6:36 ` Jiri Pirko
2023-11-23 13:22 ` Jamal Hadi Salim
2023-11-23 13:34 ` Jiri Pirko
2023-11-23 13:45 ` Jamal Hadi Salim
2023-11-23 14:07 ` Jiri Pirko
2023-11-23 14:28 ` Jamal Hadi Salim
2023-11-23 15:27 ` Jiri Pirko
2023-11-23 16:30 ` Jamal Hadi Salim
2023-11-23 17:53 ` Edward Cree
2023-11-23 18:09 ` Jiri Pirko
2023-11-23 18:58 ` Jamal Hadi Salim
2023-11-23 18:53 ` Jakub Kicinski
2023-11-23 19:42 ` Tom Herbert
2023-11-24 10:39 ` Jiri Pirko
2023-11-23 18:04 ` Jiri Pirko