netdev.vger.kernel.org archive mirror
* [PATCH net-next v12 00/15] Introducing P4TC (series 1)
@ 2024-02-25 16:54 Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list Jamal Hadi Salim
                   ` (16 more replies)
  0 siblings, 17 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

This is the first patchset of two. In this patchset we are submitting 15 patches
which cover the minimal viable P4 PNA architecture.

__Description of these Patches__

Patch #1 adds infrastructure for per-netns P4 actions that can be created on
an as-needed basis per the P4 program's requirements. This patch makes a small
incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
effect on the classical tc actions (for example, patch #2 just increases the size
of the action names from 16->64B).
Patch 5 adds infrastructure support for preallocation of dynamic actions.

The core P4TC code implements several P4 objects.
1) Patch #6 introduces P4 data types which are consumed by the rest of the code
2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
   for P4 pipelines.
4) Patch #9 introduces the action templates and associated CRUD commands.
5) Patch #10 introduces the action runtime infrastructure.
6) Patch #11 introduces the concept of P4 table templates and associated
   CRUD commands for tables.
7) Patch #12 introduces runtime table entry infra and associated CU commands.
8) Patch #13 introduces runtime table entry infra and associated RD commands.
9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
10) Patch #15 introduces the TC classifier P4 used at runtime.

Daniel, please look again at patch #15.

There are a few more patches (5) not in this patchset that deal with test
cases, etc.

What is P4?
-----------

Programming Protocol-independent Packet Processors (P4) is an open-source,
domain-specific programming language for specifying data plane behavior.

The current P4 landscape includes an extensive range of deployments, products,
projects, services, etc.[9][12]. Two major NIC vendors, Intel[10] and AMD[11],
currently offer P4-native NICs. P4 is currently curated by the Linux
Foundation[9].

On why P4 - see small treatise here:[4].

What is P4TC?
-------------

P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
and its associated objects and state are attached to a kernel _netns_ structure.
IOW, two programs, whether in different netns' or within the same netns, have no
visibility into each other's objects (unlike, for example, TC actions whose kinds
are "global" in nature, or eBPF maps vis-a-vis bpftool).

P4TC builds on top of many years of Linux TC experience: a netlink control
path interface coupled with a software datapath that has an equivalent
offloadable hardware datapath. In this patch series we are focusing only on the
s/w datapath. The s/w and h/w path equivalence that TC provides is relevant
for a primary use case of P4, where some (currently) large consumers of NICs
provide vendors their datapath specs in P4. In such a case one could generate
the specified datapaths in s/w and test/validate the requirements before
hardware acquisition (example: [12]).

Unlike other approaches such as TC Flower, which require kernel and user space
changes when new datapath objects like packet headers are introduced, P4TC, with
these patches, provides _kernel and user space code change independence_.
Meaning:
A P4 program describes headers, parsers, etc alongside the datapath processing;
the compiler uses the P4 program as input and generates several artifacts which
are then loaded into the kernel to manifest the intended datapath. In addition
to the generated datapath, control path constructs are generated. The process is
described further below in "P4TC Workflow".

There have been many discussions and meetings within the community since
about 2015 regarding P4 over TC[2] and we are finally proving to the
naysayers that we do get stuff done!

A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

__P4TC Architecture__

The current architecture was described at netdevconf 0x17[14] and if you prefer
academic conference papers, a short paper is available here[15].

There are 4 parts:

1) A Template CRUD provisioning API for manifesting a P4 program and its
associated objects in the kernel. The template provisioning API uses netlink.
See patch in part 2.

2) A Runtime CRUD+ API which is used for controlling the different runtime
behaviors of the P4 objects. The runtime API uses netlink. See notes further
down and the patch descriptions later.

3) P4 objects and their control interfaces: tables, actions, externs, etc.
Any object that requires control plane interaction resides in the TC domain
and is subject to the CRUD runtime API. The intended goal is to make use of the
tc skip_sw/skip_hw semantics to target P4 program objects in either s/w or h/w.

4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
by a compiler based on the P4 spec. When accessing any P4 object that requires
control plane interfaces, the eBPF code accesses the P4TC side from #3 above
using kfuncs.

The generated eBPF code is derived from [13] with enhancements and fixes to meet
our requirements.

__P4TC Workflow__

The development and instantiation workflow for P4TC is as follows:

  A) A developer writes a P4 program, "myprog"

  B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:

     a) A shell script which forms the template definitions for the different P4
     objects "myprog" utilizes (tables, externs, actions, etc). See #1 above.

     b) The parser and the rest of the datapath are generated as eBPF and need
     to be compiled into binaries. At the moment the parser and the main control
     block are generated as separate eBPF programs, but this could change in
     the future (without affecting any kernel code). See #4 above.

     c) A json introspection file used for the control plane (by iproute2/tc).

  C) At this point the artifacts from #1,#4 could be handed to an operator
     (the operator could be the same person as the developer from #A, #B).

     i) For the eBPF part, the operator is either handed an eBPF binary or
     source which they compile at this point into a binary.
     The operator executes the shell script(s) to manifest the functional
     "myprog" into the kernel.

     ii) The operator instantiates "myprog" pipeline via the tc P4 filter
     to ingress/egress (depending on P4 arch) of one or more netdevs/ports
     (illustrated below as "block 22").

     Example instantiation where the parser is a separate action:
       "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
        action bpf obj $PARSER.o section p4tc/parse \
        action bpf obj $PROGNAME.o section p4tc/main"

See the individual patches for more examples (tc vs XDP, etc). Also see the
section on "challenges" (further below in this cover letter).

Once "myprog" P4 program is instantiated one can start performing operations
on table entries and/or actions at runtime as described below.

__P4TC Runtime Control Path__

The control interface builds on past tc experience and tries to get things
right from the beginning (for example, filtering is made generic rather than
depending on existing object TLVs); the code is also written in such a way
that it is mostly lockless.

The P4TC control interface, using netlink, provides what we call a CRUDPS
abstraction, which stands for: Create, Read(get), Update, Delete, Subscribe,
Publish. From a high-level PoV, the following describes a conformant high-level
API (both at the netlink data model and code level):

	Create(</path/to/object>, DATA+)
	Read(</path/to/object>, [optional filter])
	Update(</path/to/object>, DATA+)
	Delete(</path/to/object>, [optional filter])
	Subscribe(</path/to/object>, [optional filter])

Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
to a table then a "Delete" implies "flush" and a "Read" implies dump, but if
it points to an entry (by specifying a key) then "Delete" implies deleting
that entry and "Read" implies reading that single entry. It should be noted that
both "Delete" and "Read" take an optional filter parameter. The filter can
define further refinements to what the control plane wants read or deleted.
"Subscribe" uses built-in netlink event management. It, as well, takes a filter
which can further refine what events get generated to the control plane (taken
out of this patchset, to be re-added with consideration of [16]).

Let's show some runtime samples:

..create an entry, if we match ip address 10.0.1.2 send packet out eno1
  tc p4ctrl create myprog/table/mytable \
   dstAddr 10.0.1.2/32 action send_to_port param port eno1

..Batch create entries
  tc p4ctrl create myprog/table/mytable \
  entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
  entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
  entry dstAddr 10.0.2.2/32  action send_to_port param port eno2

..Get an entry (note "read" is used interchangeably with "get", which is a common
		semantic in tc):
  tc p4ctrl read myprog/table/mytable \
   dstAddr 10.0.2.2/32

..dump mytable
  tc p4ctrl read myprog/table/mytable

..dump mytable for all entries whose key fits within 10.1.0.0/16
  tc p4ctrl read myprog/table/mytable \
  filter key/myprog/mytable/dstAddr = 10.1.0.0/16

..dump all mytable entries which have an action send_to_port with param "eno1"
  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1"

The filter expression is powerful, f.e you could say:

  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1" && \
         key/myprog/mytable/dstAddr = 10.1.0.0/16

It also works on built-in metadata; for example, the following dumps
entries from mytable that have seen activity in the last 10 secs:
  tc p4ctrl get myprog/table/mytable \
  filter msecs_since < 10000

Delete follows the same syntax as get/read, so for the sake of brevity we won't
show more examples beyond how to flush mytable:

  tc p4ctrl delete myprog/table/mytable

Mystery question: How do we achieve iproute2-kernel independence and
how does "tc p4ctrl" as a cli know how to program the kernel given an
arbitrary command line as shown above? Answer(s): It queries the
compiler-generated json file from "P4TC Workflow" #B.c above. The json file has
enough details to figure out that we have a program called "myprog" which has a
table "mytable" with a key named "dstAddr" which happens to be of type IPv4
address prefix. The json file also provides details to show that the table
"mytable" supports an action called "send_to_port" which accepts a parameter
"port" of type netdev (see the types patch for all supported P4 data types).
All P4 components have names, IDs, and types - so this makes it very easy to map
into netlink.
Once user space tc/p4ctrl validates the human command input, it creates
standard binary netlink structures (TLVs etc) which are sent to the kernel.
See the runtime table entry patch for more details.

__P4TC Datapath__

The P4TC s/w datapath execution is generated as eBPF. Any objects that require
control interfacing reside in the "P4TC domain" and are controlled via netlink
as described above. Per packet execution and state and even objects that do not
require control interfacing (like the P4 parser) are generated as eBPF.

A packet arriving on s/w ingress of any of the ports on block 22 will first be
exercised via the (generated eBPF) parser component to extract the headers (e.g.
the IP destination address labelled "dstAddr" above).
The datapath then proceeds to use "dstAddr", table ID and pipeline ID
as a key to do a lookup in myprog's "mytable" which returns the action params
which are then used to execute the action in the eBPF datapath (eventually
sending out packets to eno1).
On a table miss, mytable's default miss action (not described) is executed.

__Testing__

Speaking of testing - we have 200-300 tdc test cases (which will be in the
second patchset).
These tests are run on our CICD system on pull requests and after commits are
approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
input) including:
checkpatch, sparse, smatch, coccinelle, 32-bit and 64-bit builds tested on
X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
CICD to catch performance regressions (currently only on the control path, but
in the future for the datapath as well).
Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
memory sanitizer but recently added support for the concurrency sanitizer.
Before main releases we ensure each patch will compile on its own, to help with
git bisect, and run the xmas tree tool. We eventually run the code through
coverity.

In addition we are working on enabling a tool that will take a P4 program, run
it through the compiler, and generate permutations of traffic patterns via
symbolic execution that will test both positive and negative datapath code
paths. The test generator tool integration is still work in progress.
Also: We have other code that tests parallelization etc which we are trying to
find a fit for in the kernel tree's testing infra.


__References__

[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main
[13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
[14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
[15]https://dl.acm.org/doi/10.1145/3630047.3630193
[16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
[17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
[17.b]man tc-u32
[18]man tc-pedit
[19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
[20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
[20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html

--------
HISTORY
--------

Changes in Version 12
----------------------

0) Introduce back 15 patches (v11 had 5)

1) From discussions with Daniel:
   i) Remove the XDP program association altogether. No refcounting, nothing.
   ii) Remove prog type tc - everything is now an ebpf tc action.

2) s/PAD0/__pad0/g. Thanks to Marcelo.

3) Add extack to specify how many entries (N of M) specified in a batch for
   any of requested Create/Update/Delete succeeded. Prior to this it would
   only tell us the batch failed to complete without giving us details of
   which of M failed. Added as a debug aid.

Changes in Version 11
----------------------
1) Split the series into two. Original patches 1-5 in this patchset. The rest
   will go out after this is merged.

2) Change any references of IFNAMSIZ in the action code when referencing the
   action name size to ACTNAMSIZ. Thanks to Marcelo.

Changes in Version 10
----------------------
1) A couple of patches from the earlier version were clean enough to submit,
   so we did. This gave us room to split the two largest patches each into
   two. Even though the split is not git-bisectable and some of it didn't
   make much sense (e.g. splitting create and update into one patch and delete
   and get into another) we made sure each of the split patches compiled
   independently. The idea is to reduce the number of lines of code to review
   and when we get sufficient reviews we will put the splits together again.
   See patches #12 and #13 as well as patches #7 and #8.

2) Add more context in patch 0. Please READ!

3) Added dump/delete filters back to the code - we had taken them out in the
   earlier patches to reduce the amount of code for review - but in retrospect
   we feel they are important enough to push earlier rather than later.


Changes In version 9
---------------------

1) Remove the largest patch (externs) to ease review.

2) Break up action patches into two to ease review bringing down the patches
   that need more scrutiny to 8 (the first 7 are almost trivial).

3) Fix up the prefix naming convention to p4tc_xxx for uapi and p4a_xxx for
   actions to provide consistency (Jiri).

4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
   by making them static. TBH, not sure if this is the right solution
   but it makes sparse happy and hopefully someone will comment.

Changes In Version 8
---------------------

1) Fix all the patchwork warnings and improve our ci to catch them in the future

2) Reduce the number of patches to basic max(15)  to ease review.

Changes In Version 7
-------------------------

0) First time removing the RFC tag!

1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!),
that using bpf links was sufficient to protect us from someone replacing or
deleting an eBPF program after it has been bound to a netdev.

2) Add some reviewed-bys from Vlad.

3) Small bug fixes from v6 based on testing for ebpf.

4) Added the counter extern as a sample extern. Illustrating this example because
   it is slightly complex since it is possible to invoke it directly from
   the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
   It is not exactly the most efficient implementation (a reasonable counter impl
   should be per-cpu).

Changes In RFC Version 6
-------------------------

1) Completed integration from the scriptable view to eBPF. Completed the
   externs integration.

2) Small bug fixes from v5 based on testing.

Changes In RFC Version 5
-------------------------

1) More integration from scriptable view to eBPF. Small bug fixes from last
   integration.

2) More streamlining support of externs via kfunc (create-on-miss, etc)

3) eBPF linking for XDP.

There is more eBPF integration/streamlining coming (we are getting close to
conversion from scriptable domain).

Changes In RFC Version 4
-------------------------

1) More integration from scriptable to eBPF. Small bug fixes.

2) More streamlining support of externs via kfunc (one additional kfunc).

3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.

There is more eBPF integration coming. One thing we looked at but is not in this
patchset but should be in the next is use of eBPF link in our loading (see
"challenge #1" further below).

Changes In RFC Version 3
-------------------------

These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that are used in V1 and 2 but no longer
used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:

1) Feedback we got in V2 was to try to stick to one of the two modes. In this
version we take one more step and go with mode 2 only, whereas V2 had 2 modes.

2) The P4 Register extern is no longer standalone. Instead, as part of integrating
into eBPF we introduce another kfunc which encapsulates Register as part of the
extern interface.

3) We have improved our CICD to include tools pointed to us by Simon. See
   "Testing" further below. Thanks to Simon for that and other issues he caught.
   Simon, we discussed on issue [7] but decided to keep that log since we think
   it is useful.

4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
   re-discuss though; see: [5], [6].

5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.

6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
   guaranteed that either A or B must exist; however, let's make smatch happy.
   Thanks to Simon and Dan Carpenter.

Changes In RFC Version 2
-------------------------

Version 2 is the initial integration of the eBPF datapath.
We took into consideration suggestions provided to use eBPF and put effort into
analyzing eBPF as datapath which involved extensive testing.
We implemented 6 approaches with eBPF and ran performance analysis and presented
our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
you account for XDP or TC separately).

Conclusions from the exercise: We lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying the eBPF path please
scroll down to "Restating Our Requirements" and "Challenges".

This patch set presented two modes.
mode1: the parser is entirely based on eBPF - whereas the rest of the
SW datapath stays as _scriptable_ as in Version 1.
mode2: All of the kernel s/w datapath (including parser) is in eBPF.

The key ingredient for eBPF, that we did not have access to in the past, is
kfunc (it made a big difference for us to reconsider eBPF).

In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).


Jamal Hadi Salim (15):
  net: sched: act_api: Introduce P4 actions list
  net/sched: act_api: increase action kind string length
  net/sched: act_api: Update tc_action_ops to account for P4 actions
  net/sched: act_api: add struct p4tc_action_ops as a parameter to
    lookup callback
  net: sched: act_api: Add support for preallocated P4 action instances
  p4tc: add P4 data types
  p4tc: add template API
  p4tc: add template pipeline create, get, update, delete
  p4tc: add template action create, update, delete, get, flush and dump
  p4tc: add runtime action support
  p4tc: add template table create, update, delete, get, flush and dump
  p4tc: add runtime table entry create and update
  p4tc: add runtime table entry get, delete, flush and dump
  p4tc: add set of P4TC table kfuncs
  p4tc: add P4 classifier

 include/linux/bitops.h            |    1 +
 include/net/act_api.h             |   23 +-
 include/net/p4tc.h                |  675 +++++++
 include/net/p4tc_types.h          |   91 +
 include/net/tc_act/p4tc.h         |   78 +
 include/uapi/linux/p4tc.h         |  442 +++++
 include/uapi/linux/pkt_cls.h      |   15 +
 include/uapi/linux/rtnetlink.h    |   18 +
 include/uapi/linux/tc_act/tc_p4.h |   11 +
 net/sched/Kconfig                 |   23 +
 net/sched/Makefile                |    3 +
 net/sched/act_api.c               |  192 +-
 net/sched/cls_api.c               |    2 +-
 net/sched/cls_p4.c                |  305 +++
 net/sched/p4tc/Makefile           |    8 +
 net/sched/p4tc/p4tc_action.c      | 2397 +++++++++++++++++++++++
 net/sched/p4tc/p4tc_bpf.c         |  342 ++++
 net/sched/p4tc/p4tc_filter.c      |  870 ++++++++
 net/sched/p4tc/p4tc_pipeline.c    |  700 +++++++
 net/sched/p4tc/p4tc_runtime_api.c |  145 ++
 net/sched/p4tc/p4tc_table.c       | 1834 +++++++++++++++++
 net/sched/p4tc/p4tc_tbl_entry.c   | 3047 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c    |  440 +++++
 net/sched/p4tc/p4tc_types.c       | 1407 +++++++++++++
 net/sched/p4tc/trace.c            |   10 +
 net/sched/p4tc/trace.h            |   44 +
 security/selinux/nlmsgtab.c       |   10 +-
 27 files changed, 13097 insertions(+), 36 deletions(-)
 create mode 100644 include/net/p4tc.h
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 include/uapi/linux/tc_act/tc_p4.h
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_action.c
 create mode 100644 net/sched/p4tc/p4tc_bpf.c
 create mode 100644 net/sched/p4tc/p4tc_filter.c
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c
 create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
 create mode 100644 net/sched/p4tc/p4tc_table.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
 create mode 100644 net/sched/p4tc/p4tc_types.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

-- 
2.34.1



* [PATCH net-next v12  01/15] net: sched: act_api: Introduce P4 actions list
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-29 15:05   ` Paolo Abeni
  2024-02-25 16:54 ` [PATCH net-next v12 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

In P4 we need to generate new actions "on the fly" based on the
specified P4 action definition. P4 action kinds, like the pipeline
they are attached to, must be per net namespace, as opposed to native
action kinds, which are global. For that reason, we chose to create a
separate structure to store P4 actions.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h |   8 ++-
 net/sched/act_api.c   | 123 +++++++++++++++++++++++++++++++++++++-----
 net/sched/cls_api.c   |   2 +-
 3 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 77ee0c657..f22be14bb 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
 
 struct tc_action_ops {
 	struct list_head head;
+	struct list_head p4_head;
 	char    kind[IFNAMSIZ];
 	enum tca_id  id; /* identifier should match kind */
 	unsigned int	net_id;
@@ -199,10 +200,12 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
 int tcf_idr_release(struct tc_action *a, bool bind);
 
 int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
+int tcf_register_p4_action(struct net *net, struct tc_action_ops *act);
 int tcf_unregister_action(struct tc_action_ops *a,
 			  struct pernet_operations *ops);
 #define NET_ACT_ALIAS_PREFIX "net-act-"
 #define MODULE_ALIAS_NET_ACT(kind)	MODULE_ALIAS(NET_ACT_ALIAS_PREFIX kind)
+void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act);
 int tcf_action_destroy(struct tc_action *actions[], int bind);
 int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
 		    int nr_actions, struct tcf_result *res);
@@ -210,8 +213,9 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 		    struct nlattr *est,
 		    struct tc_action *actions[], int init_res[], size_t *attr_size,
 		    u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
-struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
-					 struct netlink_ext_ack *extack);
+struct tc_action_ops *
+tc_action_load_ops(struct net *net, struct nlattr *nla,
+		   u32 flags, struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 				    struct nlattr *nla, struct nlattr *est,
 				    struct tc_action_ops *a_o, int *init_res,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9ee622fb1..23ef394f2 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
 	kfree(cookie);
 }
 
+static unsigned int p4_act_net_id;
+
+struct tcf_p4_act_net {
+	struct list_head act_base;
+	rwlock_t act_mod_lock;
+};
+
+static __net_init int tcf_p4_act_base_init_net(struct net *net)
+{
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
+
+	INIT_LIST_HEAD(&p4_base_net->act_base);
+	rwlock_init(&p4_base_net->act_mod_lock);
+
+	return 0;
+}
+
+static void __net_exit tcf_p4_act_base_exit_net(struct net *net)
+{
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
+	struct tc_action_ops *ops, *tmp;
+
+	list_for_each_entry_safe(ops, tmp, &p4_base_net->act_base, p4_head) {
+		list_del(&ops->p4_head);
+	}
+}
+
+static struct pernet_operations tcf_p4_act_base_net_ops = {
+	.init = tcf_p4_act_base_init_net,
+	.exit = tcf_p4_act_base_exit_net,
+	.id = &p4_act_net_id,
+	.size = sizeof(struct tcf_p4_act_net),
+};
+
 static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
 				  struct tc_cookie *new_cookie)
 {
@@ -962,6 +996,48 @@ static void tcf_pernet_del_id_list(unsigned int id)
 	mutex_unlock(&act_id_mutex);
 }
 
+static struct tc_action_ops *tc_lookup_p4_action(struct net *net, char *kind)
+{
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
+	struct tc_action_ops *a, *res = NULL;
+
+	read_lock(&p4_base_net->act_mod_lock);
+	list_for_each_entry(a, &p4_base_net->act_base, p4_head) {
+		if (strcmp(kind, a->kind) == 0) {
+			if (try_module_get(a->owner))
+				res = a;
+			break;
+		}
+	}
+	read_unlock(&p4_base_net->act_mod_lock);
+
+	return res;
+}
+
+void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act)
+{
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
+
+	write_lock(&p4_base_net->act_mod_lock);
+	list_del(&act->p4_head);
+	write_unlock(&p4_base_net->act_mod_lock);
+}
+EXPORT_SYMBOL(tcf_unregister_p4_action);
+
+int tcf_register_p4_action(struct net *net, struct tc_action_ops *act)
+{
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
+
+	if (tc_lookup_p4_action(net, act->kind))
+		return -EEXIST;
+
+	write_lock(&p4_base_net->act_mod_lock);
+	list_add(&act->p4_head, &p4_base_net->act_base);
+	write_unlock(&p4_base_net->act_mod_lock);
+
+	return 0;
+}
+
 int tcf_register_action(struct tc_action_ops *act,
 			struct pernet_operations *ops)
 {
@@ -1032,7 +1108,7 @@ int tcf_unregister_action(struct tc_action_ops *act,
 EXPORT_SYMBOL(tcf_unregister_action);
 
 /* lookup by name */
-static struct tc_action_ops *tc_lookup_action_n(char *kind)
+static struct tc_action_ops *tc_lookup_action_n(struct net *net, char *kind)
 {
 	struct tc_action_ops *a, *res = NULL;
 
@@ -1040,31 +1116,48 @@ static struct tc_action_ops *tc_lookup_action_n(char *kind)
 		read_lock(&act_mod_lock);
 		list_for_each_entry(a, &act_base, head) {
 			if (strcmp(kind, a->kind) == 0) {
-				if (try_module_get(a->owner))
-					res = a;
-				break;
+				if (try_module_get(a->owner)) {
+					read_unlock(&act_mod_lock);
+					return a;
+				}
 			}
 		}
 		read_unlock(&act_mod_lock);
+
+		return tc_lookup_p4_action(net, kind);
 	}
+
 	return res;
 }
 
 /* lookup by nlattr */
-static struct tc_action_ops *tc_lookup_action(struct nlattr *kind)
+static struct tc_action_ops *tc_lookup_action(struct net *net,
+					      struct nlattr *kind)
 {
+	struct tcf_p4_act_net *p4_base_net = net_generic(net, p4_act_net_id);
 	struct tc_action_ops *a, *res = NULL;
 
 	if (kind) {
 		read_lock(&act_mod_lock);
 		list_for_each_entry(a, &act_base, head) {
+			if (nla_strcmp(kind, a->kind) == 0) {
+				if (try_module_get(a->owner)) {
+					read_unlock(&act_mod_lock);
+					return a;
+				}
+			}
+		}
+		read_unlock(&act_mod_lock);
+
+		read_lock(&p4_base_net->act_mod_lock);
+		list_for_each_entry(a, &p4_base_net->act_base, p4_head) {
 			if (nla_strcmp(kind, a->kind) == 0) {
 				if (try_module_get(a->owner))
 					res = a;
 				break;
 			}
 		}
-		read_unlock(&act_mod_lock);
+		read_unlock(&p4_base_net->act_mod_lock);
 	}
 	return res;
 }
@@ -1324,8 +1417,9 @@ void tcf_idr_insert_many(struct tc_action *actions[], int init_res[])
 	}
 }
 
-struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
-					 struct netlink_ext_ack *extack)
+struct tc_action_ops *
+tc_action_load_ops(struct net *net, struct nlattr *nla,
+		   u32 flags, struct netlink_ext_ack *extack)
 {
 	bool police = flags & TCA_ACT_FLAGS_POLICE;
 	struct nlattr *tb[TCA_ACT_MAX + 1];
@@ -1356,7 +1450,7 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
 		}
 	}
 
-	a_o = tc_lookup_action_n(act_name);
+	a_o = tc_lookup_action_n(net, act_name);
 	if (a_o == NULL) {
 #ifdef CONFIG_MODULES
 		bool rtnl_held = !(flags & TCA_ACT_FLAGS_NO_RTNL);
@@ -1367,7 +1461,7 @@ struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
 		if (rtnl_held)
 			rtnl_lock();
 
-		a_o = tc_lookup_action_n(act_name);
+		a_o = tc_lookup_action_n(net, act_name);
 
 		/* We dropped the RTNL semaphore in order to
 		 * perform the module load.  So, even if we
@@ -1477,7 +1571,7 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
 	for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
 		struct tc_action_ops *a_o;
 
-		a_o = tc_action_load_ops(tb[i], flags, extack);
+		a_o = tc_action_load_ops(net, tb[i], flags, extack);
 		if (IS_ERR(a_o)) {
 			err = PTR_ERR(a_o);
 			goto err_mod;
@@ -1683,7 +1777,7 @@ static struct tc_action *tcf_action_get_1(struct net *net, struct nlattr *nla,
 	index = nla_get_u32(tb[TCA_ACT_INDEX]);
 
 	err = -EINVAL;
-	ops = tc_lookup_action(tb[TCA_ACT_KIND]);
+	ops = tc_lookup_action(net, tb[TCA_ACT_KIND]);
 	if (!ops) { /* could happen in batch of actions */
 		NL_SET_ERR_MSG(extack, "Specified TC action kind not found");
 		goto err_out;
@@ -1731,7 +1825,7 @@ static int tca_action_flush(struct net *net, struct nlattr *nla,
 
 	err = -EINVAL;
 	kind = tb[TCA_ACT_KIND];
-	ops = tc_lookup_action(kind);
+	ops = tc_lookup_action(net, kind);
 	if (!ops) { /*some idjot trying to flush unknown action */
 		NL_SET_ERR_MSG(extack, "Cannot flush unknown TC action");
 		goto err_out;
@@ -2184,7 +2278,7 @@ static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
 		return 0;
 	}
 
-	a_o = tc_lookup_action(kind);
+	a_o = tc_lookup_action(net, kind);
 	if (a_o == NULL)
 		return 0;
 
@@ -2251,6 +2345,7 @@ static int __init tc_action_init(void)
 	rtnl_register(PF_UNSPEC, RTM_GETACTION, tc_ctl_action, tc_dump_action,
 		      0);
 
+	register_pernet_subsys(&tcf_p4_act_base_net_ops);
 	return 0;
 }
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ca5676b26..142f49a2c 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3330,7 +3330,7 @@ int tcf_exts_validate_ex(struct net *net, struct tcf_proto *tp, struct nlattr **
 			struct tc_action_ops *a_o;
 
 			flags |= TCA_ACT_FLAGS_POLICE | TCA_ACT_FLAGS_BIND;
-			a_o = tc_action_load_ops(tb[exts->police], flags,
+			a_o = tc_action_load_ops(net, tb[exts->police], flags,
 						 extack);
 			if (IS_ERR(a_o))
 				return PTR_ERR(a_o);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH net-next v12  02/15] net/sched: act_api: increase action kind string length
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions Jamal Hadi Salim
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

Increase action kind string length from IFNAMSIZ to 64

The new P4 actions, created via templates, will have longer names of the
format "pipeline_name/act_name". IFNAMSIZ is currently 16, which is most
of the time undersized for the above format.
So, to conform to this new format, we increase the maximum name length
and change its definition from IFNAMSIZ to ACTNAMSIZ to account for the
extra string (the pipeline name) and the '/' character.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h        | 2 +-
 include/uapi/linux/pkt_cls.h | 1 +
 net/sched/act_api.c          | 8 ++++----
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index f22be14bb..c839ff57c 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -106,7 +106,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
 struct tc_action_ops {
 	struct list_head head;
 	struct list_head p4_head;
-	char    kind[IFNAMSIZ];
+	char    kind[ACTNAMSIZ];
 	enum tca_id  id; /* identifier should match kind */
 	unsigned int	net_id;
 	size_t	size;
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index ea277039f..dd313a727 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -6,6 +6,7 @@
 #include <linux/pkt_sched.h>
 
 #define TC_COOKIE_MAX_SIZE 16
+#define ACTNAMSIZ 64
 
 /* Action attributes */
 enum {
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 23ef394f2..ce10d2c6e 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -476,7 +476,7 @@ static size_t tcf_action_shared_attrs_size(const struct tc_action *act)
 	rcu_read_unlock();
 
 	return  nla_total_size(0) /* action number nested */
-		+ nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */
+		+ nla_total_size(ACTNAMSIZ) /* TCA_ACT_KIND */
 		+ cookie_len /* TCA_ACT_COOKIE */
 		+ nla_total_size(sizeof(struct nla_bitfield32)) /* TCA_ACT_HW_STATS */
 		+ nla_total_size(0) /* TCA_ACT_STATS nested */
@@ -1424,7 +1424,7 @@ tc_action_load_ops(struct net *net, struct nlattr *nla,
 	bool police = flags & TCA_ACT_FLAGS_POLICE;
 	struct nlattr *tb[TCA_ACT_MAX + 1];
 	struct tc_action_ops *a_o;
-	char act_name[IFNAMSIZ];
+	char act_name[ACTNAMSIZ];
 	struct nlattr *kind;
 	int err;
 
@@ -1439,12 +1439,12 @@ tc_action_load_ops(struct net *net, struct nlattr *nla,
 			NL_SET_ERR_MSG(extack, "TC action kind must be specified");
 			return ERR_PTR(err);
 		}
-		if (nla_strscpy(act_name, kind, IFNAMSIZ) < 0) {
+		if (nla_strscpy(act_name, kind, ACTNAMSIZ) < 0) {
 			NL_SET_ERR_MSG(extack, "TC action name too long");
 			return ERR_PTR(err);
 		}
 	} else {
-		if (strscpy(act_name, "police", IFNAMSIZ) < 0) {
+		if (strscpy(act_name, "police", ACTNAMSIZ) < 0) {
 			NL_SET_ERR_MSG(extack, "TC action name too long");
 			return ERR_PTR(-EINVAL);
 		}
-- 
2.34.1



* [PATCH net-next v12  03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-29 16:19   ` Paolo Abeni
  2024-02-25 16:54 ` [PATCH net-next v12 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

The initialisation of P4TC action instances requires access to a struct
p4tc_act (which appears in later patches) to help us retrieve
information like the P4 action parameters etc. In order to retrieve
struct p4tc_act we need the pipeline name or id and the action name or id.
Also recall that P4TC action IDs are net namespace specific and not
global like standard tc action IDs.
The init callback's parameters in tc_action_ops had no way of
supplying us that information. To solve this issue, we create a
new tc_action_ops callback (init_ops) that provides us with the
tc_action_ops struct, which then provides us with the pipeline and action
name. In addition we add a new refcount to struct tc_action_ops called
p4_ref, which accounts for how many action instances we have of a specific
action.
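
A minimal userspace model of the resulting dispatch (hypothetical
`fake_action_ops` structure and callbacks, not the kernel's; the real
dispatch is in tcf_action_init_1()): classical actions keep ->init,
while P4 template actions supply ->init_ops, which additionally receives
the ops pointer so the callback can recover the pipeline/action name
from ops->kind.

```c
#include <stddef.h>

struct fake_action_ops {
	const char *kind;
	int (*init)(void);                                  /* classical */
	int (*init_ops)(const struct fake_action_ops *ops); /* P4 */
};

static int classic_init(void)
{
	return 1;
}

static int p4_init_ops(const struct fake_action_ops *ops)
{
	/* a real implementation would parse "pipeline/act" from ops->kind */
	return ops->kind[0] ? 2 : -1;
}

static const struct fake_action_ops classic_ops = {
	"gact", classic_init, NULL
};

static const struct fake_action_ops p4_ops = {
	"myprog/send_nh", NULL, p4_init_ops
};

/* Registration guarantees at least one of the two callbacks exists,
 * so the dispatch only has to pick whichever is set */
static int init_action(const struct fake_action_ops *ops)
{
	if (ops->init)
		return ops->init();
	return ops->init_ops(ops);
}
```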

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h |  6 ++++++
 net/sched/act_api.c   | 14 +++++++++++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index c839ff57c..69be5ed83 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -109,6 +109,7 @@ struct tc_action_ops {
 	char    kind[ACTNAMSIZ];
 	enum tca_id  id; /* identifier should match kind */
 	unsigned int	net_id;
+	refcount_t p4_ref;
 	size_t	size;
 	struct module		*owner;
 	int     (*act)(struct sk_buff *, const struct tc_action *,
@@ -120,6 +121,11 @@ struct tc_action_ops {
 			struct nlattr *est, struct tc_action **act,
 			struct tcf_proto *tp,
 			u32 flags, struct netlink_ext_ack *extack);
+	/* This should be merged with the original init action */
+	int     (*init_ops)(struct net *net, struct nlattr *nla,
+			    struct nlattr *est, struct tc_action **act,
+			   struct tcf_proto *tp, struct tc_action_ops *ops,
+			   u32 flags, struct netlink_ext_ack *extack);
 	int     (*walk)(struct net *, struct sk_buff *,
 			struct netlink_callback *, int,
 			const struct tc_action_ops *,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index ce10d2c6e..3d1fb8da1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1044,7 +1044,7 @@ int tcf_register_action(struct tc_action_ops *act,
 	struct tc_action_ops *a;
 	int ret;
 
-	if (!act->act || !act->dump || !act->init)
+	if (!act->act || !act->dump || (!act->init && !act->init_ops))
 		return -EINVAL;
 
 	/* We have to register pernet ops before making the action ops visible,
@@ -1517,8 +1517,16 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 			}
 		}
 
-		err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, tp,
-				userflags.value | flags, extack);
+		/* When we arrive here we guarantee that a_o->init or
+		 * a_o->init_ops exist.
+		 */
+		if (a_o->init)
+			err = a_o->init(net, tb[TCA_ACT_OPTIONS], est, &a, tp,
+					userflags.value | flags, extack);
+		else
+			err = a_o->init_ops(net, tb[TCA_ACT_OPTIONS], est, &a,
+					    tp, a_o, userflags.value | flags,
+					    extack);
 	} else {
 		err = a_o->init(net, nla, est, &a, tp, userflags.value | flags,
 				extack);
-- 
2.34.1



* [PATCH net-next v12  04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (2 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 05/15] net: sched: act_api: Add support for preallocated P4 action instances Jamal Hadi Salim
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

For P4 actions, we require information from struct tc_action_ops,
specifically the action kind, to locate the P4 action information
for the lookup operation.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h | 3 ++-
 net/sched/act_api.c   | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 69be5ed83..49f471c58 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -116,7 +116,8 @@ struct tc_action_ops {
 		       struct tcf_result *); /* called under RCU BH lock*/
 	int     (*dump)(struct sk_buff *, struct tc_action *, int, int);
 	void	(*cleanup)(struct tc_action *);
-	int     (*lookup)(struct net *net, struct tc_action **a, u32 index);
+	int     (*lookup)(struct net *net, const struct tc_action_ops *ops,
+			  struct tc_action **a, u32 index);
 	int     (*init)(struct net *net, struct nlattr *nla,
 			struct nlattr *est, struct tc_action **act,
 			struct tcf_proto *tp,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3d1fb8da1..835ead746 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -726,7 +726,7 @@ static int __tcf_idr_search(struct net *net,
 	struct tc_action_net *tn = net_generic(net, ops->net_id);
 
 	if (unlikely(ops->lookup))
-		return ops->lookup(net, a, index);
+		return ops->lookup(net, ops, a, index);
 
 	return tcf_idr_search(tn, a, index);
 }
-- 
2.34.1



* [PATCH net-next v12  05/15] net: sched: act_api: Add support for preallocated P4 action instances
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (3 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 06/15] p4tc: add P4 data types Jamal Hadi Salim
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

In P4, actions are assumed to pre-exist and to have an upper bound on the
number of instances. Typically, if you have a table defined with 1M
entries you want to allocate enough action instances to cover the 1M
entries. However, this is a big waste of memory if the action instances
are not in use. So for our case, we allow the user to specify a minimal
amount of action instances in the template; if more P4 action instances
are needed, they will then be added on demand, as in the current tc
filter-action relationship.

Add the necessary code to preallocate actions instances for P4
actions.

We add 2 new action flags:
- TCA_ACT_FLAGS_PREALLOC: Indicates the action instance is a P4 action
  and was preallocated for future use during the templating phase of P4TC
- TCA_ACT_FLAGS_UNREFERENCED: Indicates the action instance was
  preallocated and is currently not being referenced by any other object,
  which means it won't show up in an action instance dump.

Once an action instance is created we don't free it when the last table
entry referring to it is deleted.
Instead we add it to the pool/cache of action instances for that specific
action kind, i.e. it counts as if it were preallocated.
Preallocated actions can't be deleted by the tc actions runtime commands,
and a dump or a get will only show preallocated action instances which are
being used (i.e. TCA_ACT_FLAGS_UNREFERENCED == false).

The preallocated actions will be deleted once the pipeline is deleted
(which will purge the P4 action kind and its instances).
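
The runtime semantics of the two flags can be sketched in a small
userspace model (flag values and helper names here are illustrative
stand-ins, not the kernel's):

```c
#include <stdbool.h>

#define FLAG_PREALLOC     (1u << 0) /* stands in for TCA_ACT_FLAGS_PREALLOC */
#define FLAG_UNREFERENCED (1u << 1) /* stands in for TCA_ACT_FLAGS_UNREFERENCED */

/* A dump or get only shows preallocated instances that are in use */
static bool visible_in_dump(unsigned int flags)
{
	return !(flags & FLAG_UNREFERENCED);
}

/* Runtime delete refuses preallocated instances; only deleting the
 * pipeline purges them */
static int try_runtime_delete(unsigned int flags)
{
	if (flags & FLAG_PREALLOC)
		return -22; /* -EINVAL */
	return 0;
}
```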

For example, if we were to create a P4 action that preallocates 128
elements and dumped:

$ tc -j p4template get action/myprog/send_nh | jq .

We'd see the following:

[
  {
    "obj": "action template",
    "pname": "myprog",
    "pipeid": 1
  },
  {
    "templates": [
      {
        "aname": "myprog/send_nh",
        "actid": 1,
        "params": [
          {
            "name": "port",
            "type": "dev",
            "id": 1
          }
        ],
        "prealloc": 128
      }
    ]
  }
]

If we try to dump the P4 action instances, we won't see any:

$ tc -j actions ls action myprog/send_nh | jq .

[]

However, if we create a table entry which references this action kind:

$ tc p4ctrl create myprog/table/cb/FDB \
   dstAddr d2:96:91:5d:02:86 action myprog/send_nh \
   param port type dev dummy0

Dumping the action instance will now show this one instance which is
associated with the table entry:

$ tc -j actions ls action myprog/send_nh | jq .

[
  {
    "total acts": 1
  },
  {
    "actions": [
      {
        "order": 0,
        "kind": "myprog/send_nh",
        "index": 1,
        "ref": 1,
        "bind": 1,
        "params": [
          {
            "name": "port",
            "type": "dev",
            "value": "dummy0",
            "id": 1
          }
        ],
        "not_in_hw": true
      }
    ]
  }
]

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h |  3 +++
 net/sched/act_api.c   | 45 +++++++++++++++++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 49f471c58..d35870fbf 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -68,6 +68,8 @@ struct tc_action {
 #define TCA_ACT_FLAGS_REPLACE	(1U << (TCA_ACT_FLAGS_USER_BITS + 2))
 #define TCA_ACT_FLAGS_NO_RTNL	(1U << (TCA_ACT_FLAGS_USER_BITS + 3))
 #define TCA_ACT_FLAGS_AT_INGRESS	(1U << (TCA_ACT_FLAGS_USER_BITS + 4))
+#define TCA_ACT_FLAGS_PREALLOC	(1U << (TCA_ACT_FLAGS_USER_BITS + 5))
+#define TCA_ACT_FLAGS_UNREFERENCED	(1U << (TCA_ACT_FLAGS_USER_BITS + 6))
 
 /* Update lastuse only if needed, to avoid dirtying a cache line.
  * We use a temp variable to avoid fetching jiffies twice.
@@ -201,6 +203,7 @@ int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 index,
 			      const struct tc_action_ops *ops, int bind,
 			      u32 flags);
 void tcf_idr_insert_many(struct tc_action *actions[], int init_res[]);
+void tcf_idr_insert_n(struct tc_action *actions[], const u32 n);
 void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
 int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
 			struct tc_action **a, int bind);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 835ead746..418e44235 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -560,6 +560,8 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 			continue;
 		if (IS_ERR(p))
 			continue;
+		if (p->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+			continue;
 
 		if (jiffy_since &&
 		    time_after(jiffy_since,
@@ -640,6 +642,9 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
 	idr_for_each_entry_ul(idr, p, tmp, id) {
 		if (IS_ERR(p))
 			continue;
+		if (p->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)
+			continue;
+
 		ret = tcf_idr_release_unsafe(p);
 		if (ret == ACT_P_DELETED)
 			module_put(ops->owner);
@@ -1398,25 +1403,40 @@ static const struct nla_policy tcf_action_policy[TCA_ACT_MAX + 1] = {
 	[TCA_ACT_HW_STATS]	= NLA_POLICY_BITFIELD32(TCA_ACT_HW_STATS_ANY),
 };
 
+static void tcf_idr_insert_1(struct tc_action *a)
+{
+	struct tcf_idrinfo *idrinfo;
+
+	idrinfo = a->idrinfo;
+	mutex_lock(&idrinfo->lock);
+	/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc if
+	 * it is just created, otherwise this is just a nop.
+	 */
+	idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+	mutex_unlock(&idrinfo->lock);
+}
+
 void tcf_idr_insert_many(struct tc_action *actions[], int init_res[])
 {
 	struct tc_action *a;
 	int i;
 
 	tcf_act_for_each_action(i, a, actions) {
-		struct tcf_idrinfo *idrinfo;
-
 		if (init_res[i] == ACT_P_BOUND)
 			continue;
 
-		idrinfo = a->idrinfo;
-		mutex_lock(&idrinfo->lock);
-		/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
-		idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
-		mutex_unlock(&idrinfo->lock);
+		tcf_idr_insert_1(a);
 	}
 }
 
+void tcf_idr_insert_n(struct tc_action *actions[], const u32 n)
+{
+	int i;
+
+	for (i = 0; i < n; i++)
+		tcf_idr_insert_1(actions[i]);
+}
+
 struct tc_action_ops *
 tc_action_load_ops(struct net *net, struct nlattr *nla,
 		   u32 flags, struct netlink_ext_ack *extack)
@@ -2092,8 +2112,17 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
 			ret = PTR_ERR(act);
 			goto err;
 		}
-		attr_size += tcf_action_fill_size(act);
 		actions[i - 1] = act;
+
+		if (event == RTM_DELACTION &&
+		    act->tcfa_flags & TCA_ACT_FLAGS_PREALLOC) {
+			ret = -EINVAL;
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Unable to delete preallocated action %s",
+					   act->ops->kind);
+			goto err;
+		}
+		attr_size += tcf_action_fill_size(act);
 	}
 
 	attr_size = tcf_action_full_attrs_size(attr_size);
-- 
2.34.1



* [PATCH net-next v12  06/15] p4tc: add P4 data types
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (4 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 05/15] net: sched: act_api: Add support for preallocated P4 action instances Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-29 15:09   ` Paolo Abeni
  2024-02-25 16:54 ` [PATCH net-next v12 07/15] p4tc: add template API Jamal Hadi Salim
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

Introduce an abstraction that represents P4 data types.
This also introduces the Kconfig and Makefile which later patches use.
Numeric types can be defined as little, host or big endian. The abstraction
also supports defining:

a) bitstrings using P4 annotations that look like "bit<X>" where X
   is the number of bits defined in a type

b) bitslices such that one can define in P4 as bit<8>[0-3] and
   bit<16>[4-9]. A 4-bit slice from bits 0-3 and a 6-bit slice from bits
   4-9 respectively.

c) specialized types like dev (which stands for a netdev), key, etc

Each type has a bitsize, a name (for debugging purposes), an ID and
methods/ops. The P4 types will be used by externs, dynamic actions, packet
headers and other parts of P4TC.

Each type has four ops:

- validate_p4t: Which validates if a given value of a specific type
  meets valid boundary conditions.

- create_bitops: Which, given a bitsize, bitstart and bitend allocates and
  returns a mask and a shift value. For example, if we have type
  bit<8>[3-3] meaning bitstart = 3 and bitend = 3, we'll create a mask
  which would only give us the fourth bit of a bit8 value, that is, 0x08.
  Since we are interested in the fourth bit, the bit shift value will be 3.
  This is also useful if an "irregular" bitsize is used, for example,
  bit24. In that case bitstart = 0 and bitend = 23. Shift will be 0 and
  the mask will be 0xFFFFFF00 if the machine is big endian.

- host_read : Which reads the value of a given type and transforms it to
  host order (if needed)

- host_write : Which writes a provided host order value and transforms it
  to the type's native order (if needed)
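
For the bit<8>[3-3] example above, the mask/shift arithmetic can be
sketched in userspace (hypothetical helpers for an 8-bit container only;
the kernel code generalizes this across container sizes and endianness):

```c
#include <stdint.h>

/* Model of create_bitops for an 8-bit container: build the mask and
 * shift for a [bitstart, bitend] slice */
static uint8_t make_mask8(unsigned int bitstart, unsigned int bitend)
{
	unsigned int width = bitend - bitstart + 1;

	return (uint8_t)(((1u << width) - 1) << bitstart);
}

/* Model of host_read: extract the slice value in host order */
static uint8_t host_read8(uint8_t val, uint8_t mask, unsigned int shift)
{
	return (uint8_t)((val & mask) >> shift);
}
```

With bitstart = bitend = 3 this yields mask 0x08 and, with shift 3,
reads back just the fourth bit, matching the description above.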

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc_types.h    |   91 +++
 include/uapi/linux/p4tc.h   |   33 +
 net/sched/Kconfig           |   11 +
 net/sched/Makefile          |    2 +
 net/sched/p4tc/Makefile     |    3 +
 net/sched/p4tc/p4tc_types.c | 1407 +++++++++++++++++++++++++++++++++++
 6 files changed, 1547 insertions(+)
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_types.c

diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
new file mode 100644
index 000000000..af9f51fc1
--- /dev/null
+++ b/include/net/p4tc_types.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_P4TYPES_H
+#define __NET_P4TYPES_H
+
+#include <linux/netlink.h>
+#include <linux/pkt_cls.h>
+#include <linux/types.h>
+
+#include <uapi/linux/p4tc.h>
+
+#define P4TC_T_MAX_BITSZ 128
+
+struct p4tc_type_mask_shift {
+	void *mask;
+	u8 shift;
+};
+
+struct p4tc_type;
+struct p4tc_type_ops {
+	int (*validate_p4t)(struct p4tc_type *container, void *value,
+			    u16 startbit, u16 endbit,
+			    struct netlink_ext_ack *extack);
+	struct p4tc_type_mask_shift *(*create_bitops)(u16 bitsz, u16 bitstart,
+						      u16 bitend,
+						      struct netlink_ext_ack *extack);
+	void (*host_read)(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval);
+	void (*host_write)(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval);
+	void (*print)(struct net *net, struct p4tc_type *container,
+		      const char *prefix, void *val);
+};
+
+#define P4TC_T_MAX_STR_SZ 32
+struct p4tc_type {
+	char name[P4TC_T_MAX_STR_SZ];
+	const struct p4tc_type_ops *ops;
+	size_t container_bitsz;
+	size_t bitsz;
+	int typeid;
+};
+
+struct p4tc_type *p4type_find_byid(int id);
+bool p4tc_is_type_unsigned_he(int typeid);
+bool p4tc_is_type_numeric(int typeid);
+
+void p4t_copy(struct p4tc_type_mask_shift *dst_mask_shift,
+	      struct p4tc_type *dst_t, void *dstv,
+	      struct p4tc_type_mask_shift *src_mask_shift,
+	      struct p4tc_type *src_t, void *srcv);
+int p4t_cmp(struct p4tc_type_mask_shift *dst_mask_shift,
+	    struct p4tc_type *dst_t, void *dstv,
+	    struct p4tc_type_mask_shift *src_mask_shift,
+	    struct p4tc_type *src_t, void *srcv);
+void p4t_release(struct p4tc_type_mask_shift *mask_shift);
+
+int p4tc_register_types(void);
+void p4tc_unregister_types(void);
+
+#ifdef CONFIG_RETPOLINE
+void __p4tc_type_host_read(const struct p4tc_type_ops *ops,
+			   struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval);
+void __p4tc_type_host_write(const struct p4tc_type_ops *ops,
+			    struct p4tc_type *container,
+			    struct p4tc_type_mask_shift *mask_shift, void *sval,
+			    void *dval);
+#else
+static inline void
+__p4tc_type_host_read(const struct p4tc_type_ops *ops,
+		      struct p4tc_type *container,
+		      struct p4tc_type_mask_shift *mask_shift,
+		      void *sval, void *dval)
+{
+	return ops->host_read(container, mask_shift, sval, dval);
+}
+
+static inline void
+__p4tc_type_host_write(const struct p4tc_type_ops *ops,
+		       struct p4tc_type *container,
+		       struct p4tc_type_mask_shift *mask_shift,
+		       void *sval, void *dval)
+{
+	return ops->host_write(container, mask_shift, sval, dval);
+}
+#endif
+
+#endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
new file mode 100644
index 000000000..0133947c5
--- /dev/null
+++ b/include/uapi/linux/p4tc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_P4TC_H
+#define __LINUX_P4TC_H
+
+#define P4TC_MAX_KEYSZ 512
+
+enum {
+	P4TC_T_UNSPEC,
+	P4TC_T_U8,
+	P4TC_T_U16,
+	P4TC_T_U32,
+	P4TC_T_U64,
+	P4TC_T_STRING,
+	P4TC_T_S8,
+	P4TC_T_S16,
+	P4TC_T_S32,
+	P4TC_T_S64,
+	P4TC_T_MACADDR,
+	P4TC_T_IPV4ADDR,
+	P4TC_T_BE16,
+	P4TC_T_BE32,
+	P4TC_T_BE64,
+	P4TC_T_U128,
+	P4TC_T_S128,
+	P4TC_T_BOOL,
+	P4TC_T_DEV,
+	P4TC_T_KEY,
+	__P4TC_T_MAX,
+};
+
+#define P4TC_T_MAX (__P4TC_T_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12..5dbae579b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -675,6 +675,17 @@ config NET_EMATCH_IPT
 	  To compile this code as a module, choose M here: the
 	  module will be called em_ipt.
 
+config NET_P4TC
+	bool "P4TC support"
+	select NET_CLS_ACT
+	help
+	  Say Y here if you want to use P4 features on top of TC.
+	  P4 is an open source, domain-specific programming language for
+	  specifying data plane behavior. By enabling P4TC you will be able to
+	  write a P4 program, use a P4 compiler that supports P4TC backend to
+	  generate all needed artifacts, which when loaded allow you to
+	  introduce a new kernel datapath that can be controlled via TC.
+
 config NET_CLS_ACT
 	bool "Actions"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca..581f9dd69 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -81,3 +81,5 @@ obj-$(CONFIG_NET_EMATCH_TEXT)	+= em_text.o
 obj-$(CONFIG_NET_EMATCH_CANID)	+= em_canid.o
 obj-$(CONFIG_NET_EMATCH_IPSET)	+= em_ipset.o
 obj-$(CONFIG_NET_EMATCH_IPT)	+= em_ipt.o
+
+obj-$(CONFIG_NET_P4TC)		+= p4tc/
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
new file mode 100644
index 000000000..dd1358c9e
--- /dev/null
+++ b/net/sched/p4tc/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := p4tc_types.o
diff --git a/net/sched/p4tc/p4tc_types.c b/net/sched/p4tc/p4tc_types.c
new file mode 100644
index 000000000..67561a292
--- /dev/null
+++ b/net/sched/p4tc/p4tc_types.c
@@ -0,0 +1,1407 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_types.c -  P4 datatypes
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+#include <net/act_api.h>
+#include <net/p4tc_types.h>
+#include <linux/etherdevice.h>
+
+static DEFINE_IDR(p4tc_types_idr);
+
+static void p4tc_types_put(void)
+{
+	unsigned long tmp, typeid;
+	struct p4tc_type *type;
+
+	idr_for_each_entry_ul(&p4tc_types_idr, type, tmp, typeid) {
+		idr_remove(&p4tc_types_idr, typeid);
+		kfree(type);
+	}
+}
+
+struct p4tc_type *p4type_find_byid(int typeid)
+{
+	return idr_find(&p4tc_types_idr, typeid);
+}
+
+static struct p4tc_type *p4type_find_byname(const char *name)
+{
+	unsigned long tmp, typeid;
+	struct p4tc_type *type;
+
+	idr_for_each_entry_ul(&p4tc_types_idr, type, tmp, typeid) {
+		if (!strncmp(type->name, name, P4TC_T_MAX_STR_SZ))
+			return type;
+	}
+
+	return NULL;
+}
+
+static bool p4tc_is_type_unsigned_be(int typeid)
+{
+	switch (typeid) {
+	case P4TC_T_BE16:
+	case P4TC_T_BE32:
+	case P4TC_T_BE64:
+		return true;
+	default:
+		return false;
+	}
+}
+
+bool p4tc_is_type_unsigned_he(int typeid)
+{
+	switch (typeid) {
+	case P4TC_T_U8:
+	case P4TC_T_U16:
+	case P4TC_T_U32:
+	case P4TC_T_U64:
+	case P4TC_T_U128:
+	case P4TC_T_BOOL:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool p4tc_is_type_unsigned(int typeid)
+{
+	return p4tc_is_type_unsigned_he(typeid) ||
+		p4tc_is_type_unsigned_be(typeid);
+}
+
+static bool p4tc_is_type_signed(int typeid)
+{
+	switch (typeid) {
+	case P4TC_T_S8:
+	case P4TC_T_S16:
+	case P4TC_T_S32:
+	case P4TC_T_S64:
+	case P4TC_T_S128:
+		return true;
+	default:
+		return false;
+	}
+}
+
+bool p4tc_is_type_numeric(int typeid)
+{
+	return p4tc_is_type_unsigned(typeid) ||
+		p4tc_is_type_signed(typeid);
+}
+
+void p4t_copy(struct p4tc_type_mask_shift *dst_mask_shift,
+	      struct p4tc_type *dst_t, void *dstv,
+	      struct p4tc_type_mask_shift *src_mask_shift,
+	      struct p4tc_type *src_t, void *srcv)
+{
+	u64 readval[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+	const struct p4tc_type_ops *srco, *dsto;
+
+	dsto = dst_t->ops;
+	srco = src_t->ops;
+
+	__p4tc_type_host_read(srco, src_t, src_mask_shift, srcv,
+			      &readval);
+	__p4tc_type_host_write(dsto, dst_t, dst_mask_shift, &readval,
+			       dstv);
+}
+
+int p4t_cmp(struct p4tc_type_mask_shift *dst_mask_shift,
+	    struct p4tc_type *dst_t, void *dstv,
+	    struct p4tc_type_mask_shift *src_mask_shift,
+	    struct p4tc_type *src_t, void *srcv)
+{
+	u64 a[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+	u64 b[BITS_TO_U64(P4TC_MAX_KEYSZ)] = {0};
+	const struct p4tc_type_ops *srco, *dsto;
+
+	dsto = dst_t->ops;
+	srco = src_t->ops;
+
+	__p4tc_type_host_read(dsto, dst_t, dst_mask_shift, dstv, a);
+	__p4tc_type_host_read(srco, src_t, src_mask_shift, srcv, b);
+
+	return memcmp(a, b, sizeof(a));
+}
+
+void p4t_release(struct p4tc_type_mask_shift *mask_shift)
+{
+	kfree(mask_shift->mask);
+	kfree(mask_shift);
+}
+
+static int p4t_validate_bitpos(u16 bitstart, u16 bitend, u16 maxbitstart,
+			       u16 maxbitend, struct netlink_ext_ack *extack)
+{
+	if (bitstart > maxbitstart) {
+		NL_SET_ERR_MSG_MOD(extack, "bitstart too high");
+		return -EINVAL;
+	}
+
+	if (bitend > maxbitend) {
+		NL_SET_ERR_MSG_MOD(extack, "bitend too high");
+		return -EINVAL;
+	}
+
+	if (bitstart > bitend) {
+		NL_SET_ERR_MSG_MOD(extack, "bitstart > bitend");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int p4t_u32_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	u32 container_maxsz = U32_MAX;
+	u32 *val = value;
+	size_t maxval;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 31, 31, extack);
+	if (ret < 0)
+		return ret;
+
+	maxval = GENMASK(bitend, 0);
+	if (val && (*val > container_maxsz || *val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "U32 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u32_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+	       struct netlink_ext_ack *extack)
+{
+	struct p4tc_type_mask_shift *mask_shift;
+	u32 mask = GENMASK(bitend, bitstart);
+	u32 *cmask;
+
+	mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+	if (!mask_shift)
+		return ERR_PTR(-ENOMEM);
+
+	cmask = kzalloc(sizeof(u32), GFP_KERNEL);
+	if (!cmask) {
+		kfree(mask_shift);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	*cmask = mask;
+
+	mask_shift->mask = cmask;
+	mask_shift->shift = bitstart;
+
+	return mask_shift;
+}
+
+static void p4t_u32_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u32 maskedst = 0;
+	u32 *dst = dval;
+	u32 *src = sval;
+	u8 shift = 0;
+
+	if (mask_shift) {
+		u32 *dmask = mask_shift->mask;
+
+		maskedst = *dst & ~*dmask;
+		shift = mask_shift->shift;
+	}
+
+	*dst = maskedst | (*src << shift);
+}
+
+static void p4t_u32_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	u32 *v = val;
+
+	pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u32_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u32 *dst = dval;
+	u32 *src = sval;
+
+	if (mask_shift) {
+		u32 *smask = mask_shift->mask;
+		u8 shift = mask_shift->shift;
+
+		*dst = (*src & *smask) >> shift;
+	} else {
+		*dst = *src;
+	}
+}
+
+static int p4t_s32_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	s32 minsz = S32_MIN, maxsz = S32_MAX;
+	s32 *val = value;
+
+	if (val && (*val > maxsz || *val < minsz)) {
+		NL_SET_ERR_MSG_MOD(extack, "S32 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_s32_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s32 *dst = dval;
+	s32 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s32_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s32 *dst = dval;
+	s32 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s32_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	s32 *v = val;
+
+	pr_info("%s %d\n", prefix, *v);
+}
+
+static int p4t_s64_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	s64 minsz = S64_MIN, maxsz = S64_MAX;
+	s64 *val = value;
+
+	if (val && (*val > maxsz || *val < minsz)) {
+		NL_SET_ERR_MSG_MOD(extack, "S64 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_s64_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s64 *dst = dval;
+	s64 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s64_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s64 *dst = dval;
+	s64 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s64_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	s64 *v = val;
+
+	pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static int p4t_be32_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	size_t container_maxsz = U32_MAX;
+	__be32 *val_u32 = value;
+	__u32 val = 0;
+	size_t maxval;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 31, 31, extack);
+	if (ret < 0)
+		return ret;
+
+	if (value)
+		val = be32_to_cpu(*val_u32);
+
+	maxval = GENMASK(bitend, 0);
+	if (val && (val > container_maxsz || val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "BE32 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_be32_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be32 *src = sval;
+	u32 *dst = dval;
+
+	*dst = be32_to_cpu(*src);
+}
+
+static void p4t_be32_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be32 *dst = dval;
+	u32 *src = sval;
+
+	*dst = cpu_to_be32(*src);
+}
+
+static void p4t_be32_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	__be32 *v = val;
+
+	pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_be64_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be64 *src = sval;
+	u64 *dst = dval;
+
+	*dst = be64_to_cpu(*src);
+}
+
+static void p4t_be64_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be64 *dst = dval;
+	u64 *src = sval;
+
+	*dst = cpu_to_be64(*src);
+}
+
+static void p4t_be64_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	__be64 *v = val;
+
+	pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static int p4t_u16_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	u16 container_maxsz = U16_MAX;
+	u16 *val = value;
+	u16 maxval;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 15, 15, extack);
+	if (ret < 0)
+		return ret;
+
+	maxval = GENMASK(bitend, 0);
+	if (val && (*val > container_maxsz || *val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "U16 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u16_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+	       struct netlink_ext_ack *extack)
+{
+	struct p4tc_type_mask_shift *mask_shift;
+	u16 mask = GENMASK(bitend, bitstart);
+	u16 *cmask;
+
+	mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+	if (!mask_shift)
+		return ERR_PTR(-ENOMEM);
+
+	cmask = kzalloc(sizeof(u16), GFP_KERNEL);
+	if (!cmask) {
+		kfree(mask_shift);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	*cmask = mask;
+
+	mask_shift->mask = cmask;
+	mask_shift->shift = bitstart;
+
+	return mask_shift;
+}
+
+static void p4t_u16_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u16 maskedst = 0;
+	u16 *dst = dval;
+	u16 *src = sval;
+	u8 shift = 0;
+
+	if (mask_shift) {
+		u16 *dmask = mask_shift->mask;
+
+		maskedst = *dst & ~*dmask;
+		shift = mask_shift->shift;
+	}
+
+	*dst = maskedst | (*src << shift);
+}
+
+static void p4t_u16_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	u16 *v = val;
+
+	pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u16_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u16 *dst = dval;
+	u16 *src = sval;
+
+	if (mask_shift) {
+		u16 *smask = mask_shift->mask;
+		u8 shift = mask_shift->shift;
+
+		*dst = (*src & *smask) >> shift;
+	} else {
+		*dst = *src;
+	}
+}
+
+static int p4t_s16_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	s16 minsz = S16_MIN, maxsz = S16_MAX;
+	s16 *val = value;
+
+	if (val && (*val > maxsz || *val < minsz)) {
+		NL_SET_ERR_MSG_MOD(extack, "S16 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_s16_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s16 *dst = dval;
+	s16 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s16_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	s16 *dst = dval;
+	s16 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s16_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	s16 *v = val;
+
+	pr_info("%s %d\n", prefix, *v);
+}
+
+static int p4t_be16_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	u16 container_maxsz = U16_MAX;
+	__be16 *val_u16 = value;
+	size_t maxval;
+	u16 val = 0;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 15, 15, extack);
+	if (ret < 0)
+		return ret;
+
+	if (value)
+		val = be16_to_cpu(*val_u16);
+
+	maxval = GENMASK(bitend, 0);
+	if (val && (val > container_maxsz || val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "BE16 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_be16_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be16 *src = sval;
+	u16 *dst = dval;
+
+	*dst = be16_to_cpu(*src);
+}
+
+static void p4t_be16_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	__be16 *dst = dval;
+	u16 *src = sval;
+
+	*dst = cpu_to_be16(*src);
+}
+
+static void p4t_be16_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	__be16 *v = val;
+
+	pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static int p4t_u8_validate(struct p4tc_type *container, void *value,
+			   u16 bitstart, u16 bitend,
+			   struct netlink_ext_ack *extack)
+{
+	size_t container_maxsz = U8_MAX;
+	u8 *val = value;
+	u8 maxval;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 7, 7, extack);
+	if (ret < 0)
+		return ret;
+
+	maxval = GENMASK(bitend, 0);
+	if (val && (*val > container_maxsz || *val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "U8 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u8_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+	      struct netlink_ext_ack *extack)
+{
+	struct p4tc_type_mask_shift *mask_shift;
+	u8 mask = GENMASK(bitend, bitstart);
+	u8 *cmask;
+
+	mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+	if (!mask_shift)
+		return ERR_PTR(-ENOMEM);
+
+	cmask = kzalloc(sizeof(u8), GFP_KERNEL);
+	if (!cmask) {
+		kfree(mask_shift);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	*cmask = mask;
+
+	mask_shift->mask = cmask;
+	mask_shift->shift = bitstart;
+
+	return mask_shift;
+}
+
+static void p4t_u8_write(struct p4tc_type *container,
+			 struct p4tc_type_mask_shift *mask_shift, void *sval,
+			 void *dval)
+{
+	u8 maskedst = 0;
+	u8 *dst = dval;
+	u8 *src = sval;
+	u8 shift = 0;
+
+	if (mask_shift) {
+		u8 *dmask = (u8 *)mask_shift->mask;
+
+		maskedst = *dst & ~*dmask;
+		shift = mask_shift->shift;
+	}
+
+	*dst = maskedst | (*src << shift);
+}
+
+static void p4t_u8_print(struct net *net, struct p4tc_type *container,
+			 const char *prefix, void *val)
+{
+	u8 *v = val;
+
+	pr_info("%s 0x%x\n", prefix, *v);
+}
+
+static void p4t_u8_hread(struct p4tc_type *container,
+			 struct p4tc_type_mask_shift *mask_shift, void *sval,
+			 void *dval)
+{
+	u8 *dst = dval;
+	u8 *src = sval;
+
+	if (mask_shift) {
+		u8 *smask = mask_shift->mask;
+		u8 shift = mask_shift->shift;
+
+		*dst = (*src & *smask) >> shift;
+	} else {
+		*dst = *src;
+	}
+}
+
+static int p4t_s8_validate(struct p4tc_type *container, void *value,
+			   u16 bitstart, u16 bitend,
+			   struct netlink_ext_ack *extack)
+{
+	s8 minsz = S8_MIN, maxsz = S8_MAX;
+	s8 *val = value;
+
+	if (val && (*val > maxsz || *val < minsz)) {
+		NL_SET_ERR_MSG_MOD(extack, "S8 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_s8_hread(struct p4tc_type *container,
+			 struct p4tc_type_mask_shift *mask_shift, void *sval,
+			 void *dval)
+{
+	s8 *dst = dval;
+	s8 *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_s8_print(struct net *net, struct p4tc_type *container,
+			 const char *prefix, void *val)
+{
+	s8 *v = val;
+
+	pr_info("%s %d\n", prefix, *v);
+}
+
+static int p4t_u64_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	u64 container_maxsz = U64_MAX;
+	u64 *val = value;
+	u64 maxval;
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 63, 63, extack);
+	if (ret < 0)
+		return ret;
+
+	maxval = GENMASK_ULL(bitend, 0);
+	if (val && (*val > container_maxsz || *val > maxval)) {
+		NL_SET_ERR_MSG_MOD(extack, "U64 value out of range");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct p4tc_type_mask_shift *
+p4t_u64_bitops(u16 bitsiz, u16 bitstart, u16 bitend,
+	       struct netlink_ext_ack *extack)
+{
+	struct p4tc_type_mask_shift *mask_shift;
+	u64 mask = GENMASK_ULL(bitend, bitstart);
+	u64 *cmask;
+
+	mask_shift = kzalloc(sizeof(*mask_shift), GFP_KERNEL);
+	if (!mask_shift)
+		return ERR_PTR(-ENOMEM);
+
+	cmask = kzalloc(sizeof(u64), GFP_KERNEL);
+	if (!cmask) {
+		kfree(mask_shift);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	*cmask = mask;
+
+	mask_shift->mask = cmask;
+	mask_shift->shift = bitstart;
+
+	return mask_shift;
+}
+
+static void p4t_u64_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u64 maskedst = 0;
+	u64 *dst = dval;
+	u64 *src = sval;
+	u8 shift = 0;
+
+	if (mask_shift) {
+		u64 *dmask = (u64 *)mask_shift->mask;
+
+		maskedst = *dst & ~*dmask;
+		shift = mask_shift->shift;
+	}
+
+	*dst = maskedst | (*src << shift);
+}
+
+static void p4t_u64_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	u64 *v = val;
+
+	pr_info("%s 0x%llx\n", prefix, *v);
+}
+
+static void p4t_u64_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u64 *dst = dval;
+	u64 *src = sval;
+
+	if (mask_shift) {
+		u64 *smask = mask_shift->mask;
+		u8 shift = mask_shift->shift;
+
+		*dst = (*src & *smask) >> shift;
+	} else {
+		*dst = *src;
+	}
+}
+
+/* As of now, we are not allowing bitops for u128 */
+static int p4t_u128_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	if (bitstart != 0 || bitend != 127) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Only valid bit type larger than bit64 is bit128");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_u128_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_u128_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_u128_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	u64 *v = val;
+
+	pr_info("%s[0-63] %16llx\n", prefix, v[0]);
+	pr_info("%s[64-127] %16llx\n", prefix, v[1]);
+}
+
+static int p4t_s128_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	if (bitstart != 0 || bitend != 127) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Only valid int type larger than int64 is int128");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_s128_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_s128_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	memcpy(dval, sval, sizeof(__u64) * 2);
+}
+
+static void p4t_s128_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	u64 *v = val;
+
+	pr_info("%s[0-63] %16llx\n", prefix, v[0]);
+	pr_info("%s[64-127] %16llx\n", prefix, v[1]);
+}
+
+static int p4t_string_validate(struct p4tc_type *container, void *value,
+			       u16 bitstart, u16 bitend,
+			       struct netlink_ext_ack *extack)
+{
+	if (bitstart != 0 || bitend >= P4TC_T_MAX_STR_SZ) {
+		NL_SET_ERR_MSG_FMT_MOD(extack,
+				       "String size must be less than %u",
+				       P4TC_T_MAX_STR_SZ);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_string_hread(struct p4tc_type *container,
+			     struct p4tc_type_mask_shift *mask_shift,
+			     void *sval, void *dval)
+{
+	strscpy(dval, sval, P4TC_T_MAX_STR_SZ);
+}
+
+static void p4t_string_write(struct p4tc_type *container,
+			     struct p4tc_type_mask_shift *mask_shift,
+			     void *sval, void *dval)
+{
+	strscpy(dval, sval, P4TC_T_MAX_STR_SZ);
+}
+
+static void p4t_string_print(struct net *net, struct p4tc_type *container,
+			     const char *prefix, void *val)
+{
+	char *v = val;
+
+	pr_info("%s\n", v);
+}
+
+static int p4t_ipv4_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	/* Not allowing bit-slices for now */
+	if (bitstart != 0 || bitend != 31) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid bitstart or bitend");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_ipv4_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	u32 *v32h = val;
+	__be32 v32;
+	u8 *v;
+
+	v32 = cpu_to_be32(*v32h);
+	v = (u8 *)&v32;
+
+	pr_info("%s %u.%u.%u.%u\n", prefix, v[0], v[1], v[2], v[3]);
+}
+
+static int p4t_mac_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	if (bitstart != 0 || bitend != 47) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid bitstart or bitend");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_mac_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	u8 *v = val;
+
+	pr_info("%s %02x:%02x:%02x:%02x:%02x:%02x\n", prefix, v[0], v[1], v[2],
+		v[3], v[4], v[5]);
+}
+
+static int p4t_dev_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	if (bitstart != 0 || bitend != 31) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid start or endbit values");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4t_dev_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u32 *src = sval;
+	u32 *dst = dval;
+
+	*dst = *src;
+}
+
+static void p4t_dev_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	u32 *src = sval;
+	u32 *dst = dval;
+
+	*dst = *src;
+}
+
+static void p4t_dev_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	const u32 *ifindex = val;
+	struct net_device *dev;
+
+	dev = dev_get_by_index_rcu(net, *ifindex);
+
+	pr_info("%s %s\n", prefix, dev ? dev->name : "<unknown>");
+}
+
+static void p4t_key_hread(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	memcpy(dval, sval, BITS_TO_BYTES(container->bitsz));
+}
+
+static void p4t_key_write(struct p4tc_type *container,
+			  struct p4tc_type_mask_shift *mask_shift, void *sval,
+			  void *dval)
+{
+	memcpy(dval, sval, BITS_TO_BYTES(container->bitsz));
+}
+
+static void p4t_key_print(struct net *net, struct p4tc_type *container,
+			  const char *prefix, void *val)
+{
+	u16 bitstart = 0, bitend = 63;
+	u64 *v = val;
+	int i;
+
+	for (i = 0; i < BITS_TO_U64(container->bitsz); i++) {
+		pr_info("%s[%u-%u] %16llx\n", prefix, bitstart, bitend, v[i]);
+		bitstart += 64;
+		bitend += 64;
+	}
+}
+
+static int p4t_key_validate(struct p4tc_type *container, void *value,
+			    u16 bitstart, u16 bitend,
+			    struct netlink_ext_ack *extack)
+{
+	if (p4t_validate_bitpos(bitstart, bitend, 0, P4TC_MAX_KEYSZ, extack))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int p4t_bool_validate(struct p4tc_type *container, void *value,
+			     u16 bitstart, u16 bitend,
+			     struct netlink_ext_ack *extack)
+{
+	int ret;
+
+	ret = p4t_validate_bitpos(bitstart, bitend, 7, 7, extack);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+static void p4t_bool_hread(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	bool *dst = dval;
+	bool *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_bool_write(struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	bool *dst = dval;
+	bool *src = sval;
+
+	*dst = *src;
+}
+
+static void p4t_bool_print(struct net *net, struct p4tc_type *container,
+			   const char *prefix, void *val)
+{
+	bool *v = val;
+
+	pr_info("%s %s\n", prefix, *v ? "true" : "false");
+}
+
+static const struct p4tc_type_ops u8_ops = {
+	.validate_p4t = p4t_u8_validate,
+	.create_bitops = p4t_u8_bitops,
+	.host_read = p4t_u8_hread,
+	.host_write = p4t_u8_write,
+	.print = p4t_u8_print,
+};
+
+static const struct p4tc_type_ops u16_ops = {
+	.validate_p4t = p4t_u16_validate,
+	.create_bitops = p4t_u16_bitops,
+	.host_read = p4t_u16_hread,
+	.host_write = p4t_u16_write,
+	.print = p4t_u16_print,
+};
+
+static const struct p4tc_type_ops u32_ops = {
+	.validate_p4t = p4t_u32_validate,
+	.create_bitops = p4t_u32_bitops,
+	.host_read = p4t_u32_hread,
+	.host_write = p4t_u32_write,
+	.print = p4t_u32_print,
+};
+
+static const struct p4tc_type_ops u64_ops = {
+	.validate_p4t = p4t_u64_validate,
+	.create_bitops = p4t_u64_bitops,
+	.host_read = p4t_u64_hread,
+	.host_write = p4t_u64_write,
+	.print = p4t_u64_print,
+};
+
+static const struct p4tc_type_ops u128_ops = {
+	.validate_p4t = p4t_u128_validate,
+	.host_read = p4t_u128_hread,
+	.host_write = p4t_u128_write,
+	.print = p4t_u128_print,
+};
+
+static const struct p4tc_type_ops s8_ops = {
+	.validate_p4t = p4t_s8_validate,
+	.host_read = p4t_s8_hread,
+	.print = p4t_s8_print,
+};
+
+static const struct p4tc_type_ops s16_ops = {
+	.validate_p4t = p4t_s16_validate,
+	.host_read = p4t_s16_hread,
+	.host_write = p4t_s16_write,
+	.print = p4t_s16_print,
+};
+
+static const struct p4tc_type_ops s32_ops = {
+	.validate_p4t = p4t_s32_validate,
+	.host_read = p4t_s32_hread,
+	.host_write = p4t_s32_write,
+	.print = p4t_s32_print,
+};
+
+static const struct p4tc_type_ops s64_ops = {
+	.validate_p4t = p4t_s64_validate,
+	.host_read = p4t_s64_hread,
+	.host_write = p4t_s64_write,
+	.print = p4t_s64_print,
+};
+
+static const struct p4tc_type_ops s128_ops = {
+	.validate_p4t = p4t_s128_validate,
+	.host_read = p4t_s128_hread,
+	.host_write = p4t_s128_write,
+	.print = p4t_s128_print,
+};
+
+static const struct p4tc_type_ops be16_ops = {
+	.validate_p4t = p4t_be16_validate,
+	.create_bitops = p4t_u16_bitops,
+	.host_read = p4t_be16_hread,
+	.host_write = p4t_be16_write,
+	.print = p4t_be16_print,
+};
+
+static const struct p4tc_type_ops be32_ops = {
+	.validate_p4t = p4t_be32_validate,
+	.create_bitops = p4t_u32_bitops,
+	.host_read = p4t_be32_hread,
+	.host_write = p4t_be32_write,
+	.print = p4t_be32_print,
+};
+
+static const struct p4tc_type_ops be64_ops = {
+	.validate_p4t = p4t_u64_validate,
+	.host_read = p4t_be64_hread,
+	.host_write = p4t_be64_write,
+	.print = p4t_be64_print,
+};
+
+static const struct p4tc_type_ops string_ops = {
+	.validate_p4t = p4t_string_validate,
+	.host_read = p4t_string_hread,
+	.host_write = p4t_string_write,
+	.print = p4t_string_print,
+};
+
+static const struct p4tc_type_ops mac_ops = {
+	.validate_p4t = p4t_mac_validate,
+	.create_bitops = p4t_u64_bitops,
+	.host_read = p4t_u64_hread,
+	.host_write = p4t_u64_write,
+	.print = p4t_mac_print,
+};
+
+static const struct p4tc_type_ops ipv4_ops = {
+	.validate_p4t = p4t_ipv4_validate,
+	.host_read = p4t_be32_hread,
+	.host_write = p4t_be32_write,
+	.print = p4t_ipv4_print,
+};
+
+static const struct p4tc_type_ops bool_ops = {
+	.validate_p4t = p4t_bool_validate,
+	.host_read = p4t_bool_hread,
+	.host_write = p4t_bool_write,
+	.print = p4t_bool_print,
+};
+
+static const struct p4tc_type_ops dev_ops = {
+	.validate_p4t = p4t_dev_validate,
+	.host_read = p4t_dev_hread,
+	.host_write = p4t_dev_write,
+	.print = p4t_dev_print,
+};
+
+static const struct p4tc_type_ops key_ops = {
+	.validate_p4t = p4t_key_validate,
+	.host_read = p4t_key_hread,
+	.host_write = p4t_key_write,
+	.print = p4t_key_print,
+};
+
+#ifdef CONFIG_RETPOLINE
+void __p4tc_type_host_read(const struct p4tc_type_ops *ops,
+			   struct p4tc_type *container,
+			   struct p4tc_type_mask_shift *mask_shift, void *sval,
+			   void *dval)
+{
+	#define HREAD(cops) \
+	do { \
+		if (ops == &(cops)) \
+			return (cops).host_read(container, mask_shift, sval, \
+						dval); \
+	} while (0)
+
+	HREAD(u8_ops);
+	HREAD(u16_ops);
+	HREAD(u32_ops);
+	HREAD(u64_ops);
+	HREAD(u128_ops);
+	HREAD(s8_ops);
+	HREAD(s16_ops);
+	HREAD(s32_ops);
+	HREAD(be16_ops);
+	HREAD(be32_ops);
+	HREAD(mac_ops);
+	HREAD(ipv4_ops);
+	HREAD(bool_ops);
+	HREAD(dev_ops);
+	HREAD(key_ops);
+
+	return ops->host_read(container, mask_shift, sval, dval);
+}
+
+void __p4tc_type_host_write(const struct p4tc_type_ops *ops,
+			    struct p4tc_type *container,
+			    struct p4tc_type_mask_shift *mask_shift, void *sval,
+			    void *dval)
+{
+	#define HWRITE(cops) \
+	do { \
+		if (ops == &(cops)) \
+			return (cops).host_write(container, mask_shift, sval, \
+						 dval); \
+	} while (0)
+
+	HWRITE(u8_ops);
+	HWRITE(u16_ops);
+	HWRITE(u32_ops);
+	HWRITE(u64_ops);
+	HWRITE(u128_ops);
+	HWRITE(s16_ops);
+	HWRITE(s32_ops);
+	HWRITE(be16_ops);
+	HWRITE(be32_ops);
+	HWRITE(mac_ops);
+	HWRITE(ipv4_ops);
+	HWRITE(bool_ops);
+	HWRITE(dev_ops);
+	HWRITE(key_ops);
+
+	return ops->host_write(container, mask_shift, sval, dval);
+}
+#endif
+
+static int ___p4tc_register_type(int typeid, size_t bitsz,
+				 size_t container_bitsz,
+				 const char *t_name,
+				 const struct p4tc_type_ops *ops)
+{
+	struct p4tc_type *type;
+	int err;
+
+	if (typeid > P4TC_T_MAX)
+		return -EINVAL;
+
+	if (p4type_find_byid(typeid) || p4type_find_byname(t_name))
+		return -EEXIST;
+
+	if (bitsz > P4TC_T_MAX_BITSZ)
+		return -E2BIG;
+
+	if (container_bitsz > P4TC_T_MAX_BITSZ)
+		return -E2BIG;
+
+	type = kzalloc(sizeof(*type), GFP_ATOMIC);
+	if (!type)
+		return -ENOMEM;
+
+	err = idr_alloc_u32(&p4tc_types_idr, type, &typeid, typeid, GFP_ATOMIC);
+	if (err < 0) {
+		kfree(type);
+		return err;
+	}
+
+	strscpy(type->name, t_name, P4TC_T_MAX_STR_SZ);
+	type->typeid = typeid;
+	type->bitsz = bitsz;
+	type->container_bitsz = container_bitsz;
+	type->ops = ops;
+
+	return 0;
+}
+
+static int __p4tc_register_type(int typeid, size_t bitsz,
+				size_t container_bitsz,
+				const char *t_name,
+				const struct p4tc_type_ops *ops)
+{
+	if (___p4tc_register_type(typeid, bitsz, container_bitsz, t_name, ops) <
+	    0) {
+		pr_err("Unable to allocate p4 type %s\n", t_name);
+		p4tc_types_put();
+		return -1;
+	}
+
+	return 0;
+}
+
+#define p4tc_register_type(...)                            \
+	do {                                               \
+		if (__p4tc_register_type(__VA_ARGS__) < 0) \
+			return -1;                         \
+	} while (0)
+
+int p4tc_register_types(void)
+{
+	p4tc_register_type(P4TC_T_U8, 8, 8, "u8", &u8_ops);
+	p4tc_register_type(P4TC_T_U16, 16, 16, "u16", &u16_ops);
+	p4tc_register_type(P4TC_T_U32, 32, 32, "u32", &u32_ops);
+	p4tc_register_type(P4TC_T_U64, 64, 64, "u64", &u64_ops);
+	p4tc_register_type(P4TC_T_U128, 128, 128, "u128", &u128_ops);
+	p4tc_register_type(P4TC_T_S8, 8, 8, "s8", &s8_ops);
+	p4tc_register_type(P4TC_T_BE16, 16, 16, "be16", &be16_ops);
+	p4tc_register_type(P4TC_T_BE32, 32, 32, "be32", &be32_ops);
+	p4tc_register_type(P4TC_T_BE64, 64, 64, "be64", &be64_ops);
+	p4tc_register_type(P4TC_T_S16, 16, 16, "s16", &s16_ops);
+	p4tc_register_type(P4TC_T_S32, 32, 32, "s32", &s32_ops);
+	p4tc_register_type(P4TC_T_S64, 64, 64, "s64", &s64_ops);
+	p4tc_register_type(P4TC_T_S128, 128, 128, "s128", &s128_ops);
+	p4tc_register_type(P4TC_T_STRING, P4TC_T_MAX_STR_SZ * 4,
+			   P4TC_T_MAX_STR_SZ * 4, "string", &string_ops);
+	p4tc_register_type(P4TC_T_MACADDR, 48, 64, "mac", &mac_ops);
+	p4tc_register_type(P4TC_T_IPV4ADDR, 32, 32, "ipv4", &ipv4_ops);
+	p4tc_register_type(P4TC_T_BOOL, 32, 32, "bool", &bool_ops);
+	p4tc_register_type(P4TC_T_DEV, 32, 32, "dev", &dev_ops);
+	p4tc_register_type(P4TC_T_KEY, P4TC_MAX_KEYSZ, P4TC_MAX_KEYSZ, "key",
+			   &key_ops);
+
+	return 0;
+}
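
The registration path above can be modeled in userspace. The sketch below mirrors the checks in ___p4tc_register_type() (range check, duplicate check, then store under the fixed typeid), with a plain array standing in for the kernel IDR; all names here are hypothetical:

```c
#include <assert.h>
#include <string.h>

#define T_MAX 8
#define T_NAMSZ 16

struct ptype {
	int used;
	char name[T_NAMSZ];
	size_t bitsz;
};

static struct ptype types[T_MAX + 1];

/* Mirrors the checks in ___p4tc_register_type(): reject out-of-range
 * ids (-EINVAL) and duplicates (-EEXIST), then record the type under
 * its fixed typeid.
 */
static int register_type(int typeid, size_t bitsz, const char *name)
{
	if (typeid < 0 || typeid > T_MAX)
		return -1;	/* -EINVAL */
	if (types[typeid].used)
		return -2;	/* -EEXIST */

	types[typeid].used = 1;
	types[typeid].bitsz = bitsz;
	strncpy(types[typeid].name, name, T_NAMSZ - 1);
	return 0;
}
```

Registering at a caller-chosen id (rather than a kernel-allocated one) is what keeps the uapi P4TC_T_* numbering stable.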
+
+void p4tc_unregister_types(void)
+{
+	p4tc_types_put();
+}
-- 
2.34.1



* [PATCH net-next v12  07/15] p4tc: add template API
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (5 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 06/15] p4tc: add P4 data types Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 08/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

Add the p4tc template API that will serve as infrastructure for all future
template objects (in this set: pipeline, table and action; more later).

This commit is not functional by itself; it needs the subsequent patch to
be of any use, and was split out only to ease review. If something were to
break and git bisect lands on this patch, the result will not be helpful.
In the next release we plan to merge the two back together.

The template API infrastructure follows the CRUD (Create, Read/get, Update,
and Delete) commands.

To issue a p4template create command, the user follows the grammar below:

tc p4template create objtype/[objpath] [objid] objparams

To show a more concrete example, to create a new pipeline (pipelines
come in the next commit), the user would issue the following command:

tc p4template create pipeline/aP4proggie pipeid 1 numtables 1 ...

Note that the user may specify an optional ID for the object ("pipeid 1"
above); if none is specified, the kernel will assign one.

The command for update is analogous:

tc p4template update objtype/[objpath] [objid] objparams

Note that the user may refer to the object by name (in the objpath)
or directly by ID.

Delete is also analogous:

tc p4template delete objtype/[objpath] [objid]

As is get:

tc p4template get objtype/[objpath] [objid]

One can also dump or flush template objects. This will be better
exposed in the object-specific commits in this patchset.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h             |  53 +++++
 include/uapi/linux/p4tc.h      |  42 ++++
 include/uapi/linux/rtnetlink.h |   9 +
 net/sched/p4tc/Makefile        |   2 +-
 net/sched/p4tc/p4tc_tmpl_api.c | 366 +++++++++++++++++++++++++++++++++
 security/selinux/nlmsgtab.c    |   6 +-
 6 files changed, 476 insertions(+), 2 deletions(-)
 create mode 100644 include/net/p4tc.h
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
new file mode 100644
index 000000000..e55d7b0b6
--- /dev/null
+++ b/include/net/p4tc.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_P4TC_H
+#define __NET_P4TC_H
+
+#include <uapi/linux/p4tc.h>
+#include <linux/workqueue.h>
+#include <net/sch_generic.h>
+#include <net/net_namespace.h>
+#include <linux/refcount.h>
+#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
+
+#define P4TC_PATH_MAX 3
+
+struct p4tc_dump_ctx {
+	u32 ids[P4TC_PATH_MAX];
+};
+
+struct p4tc_template_common;
+
+struct p4tc_template_ops {
+	struct p4tc_template_common *(*cu)(struct net *net, struct nlmsghdr *n,
+					   struct nlattr *nla,
+					   struct netlink_ext_ack *extack);
+	int (*put)(struct p4tc_template_common *tmpl,
+		   struct netlink_ext_ack *extack);
+	int (*gd)(struct net *net, struct sk_buff *skb, struct nlmsghdr *n,
+		  struct nlattr *nla, struct netlink_ext_ack *extack);
+	int (*fill_nlmsg)(struct net *net, struct sk_buff *skb,
+			  struct p4tc_template_common *tmpl,
+			  struct netlink_ext_ack *extack);
+	int (*dump)(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+		    struct nlattr *nla, u32 *ids,
+		    struct netlink_ext_ack *extack);
+	int (*dump_1)(struct sk_buff *skb, struct p4tc_template_common *common);
+	u32 obj_id;
+};
+
+struct p4tc_template_common {
+	char                     name[P4TC_TMPL_NAMSZ];
+	struct p4tc_template_ops *ops;
+};
+
+static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
+{
+	return n->nlmsg_type == RTM_UPDATEP4TEMPLATE;
+}
+
+int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+			   struct idr *idr, int idx,
+			   struct netlink_ext_ack *extack);
+
+#endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 0133947c5..22ba1c05a 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -2,8 +2,47 @@
 #ifndef __LINUX_P4TC_H
 #define __LINUX_P4TC_H
 
+#include <linux/types.h>
+#include <linux/pkt_sched.h>
+
+/* pipeline header */
+struct p4tcmsg {
+	__u32 obj;
+};
+
+#define P4TC_MSGBATCH_SIZE 16
+
 #define P4TC_MAX_KEYSZ 512
 
+#define P4TC_TMPL_NAMSZ 32
+
+/* Root attributes */
+enum {
+	P4TC_ROOT_UNSPEC,
+	P4TC_ROOT, /* nested messages */
+	__P4TC_ROOT_MAX,
+};
+
+#define P4TC_ROOT_MAX (__P4TC_ROOT_MAX - 1)
+
+/* P4 Object types */
+enum {
+	P4TC_OBJ_UNSPEC,
+	__P4TC_OBJ_MAX,
+};
+
+#define P4TC_OBJ_MAX (__P4TC_OBJ_MAX - 1)
+
+/* P4 attributes */
+enum {
+	P4TC_UNSPEC,
+	P4TC_PATH,
+	P4TC_PARAMS,
+	__P4TC_MAX,
+};
+
+#define P4TC_MAX (__P4TC_MAX - 1)
+
 enum {
 	P4TC_T_UNSPEC,
 	P4TC_T_U8,
@@ -30,4 +69,7 @@ enum {
 
 #define P4TC_T_MAX (__P4TC_T_MAX - 1)
 
+#define P4TC_RTA(r) \
+	((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
+
 #endif
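
P4TC_RTA() computes the address of the first attribute by skipping the aligned struct p4tcmsg header at the start of the netlink payload. A userspace sketch of that pointer arithmetic, with hypothetical names and a simplified alignment macro:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified NLMSG_ALIGN and struct p4tcmsg stand-ins. */
#define MSG_ALIGNTO 4
#define MSG_ALIGN(len) (((len) + MSG_ALIGNTO - 1) & ~(MSG_ALIGNTO - 1))

struct p4tcmsg_model { unsigned int obj; };

/* Mirrors P4TC_RTA(): the first attribute sits right after the
 * aligned family header at the start of the netlink payload.
 */
static void *first_attr(void *payload)
{
	return (char *)payload + MSG_ALIGN(sizeof(struct p4tcmsg_model));
}

static int first_attr_offset(void)
{
	char buf[16];	/* pretend netlink payload; only its address is used */

	return (int)((char *)first_attr(buf) - buf);
}
```

With a 4-byte header the first attribute starts at offset 4; the alignment macro matters once the header grows to a size that is not already a multiple of the alignment.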
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 3b687d20c..4f9ebe3e7 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -194,6 +194,15 @@ enum {
 	RTM_GETTUNNEL,
 #define RTM_GETTUNNEL	RTM_GETTUNNEL
 
+	RTM_CREATEP4TEMPLATE = 124,
+#define RTM_CREATEP4TEMPLATE	RTM_CREATEP4TEMPLATE
+	RTM_DELP4TEMPLATE,
+#define RTM_DELP4TEMPLATE	RTM_DELP4TEMPLATE
+	RTM_GETP4TEMPLATE,
+#define RTM_GETP4TEMPLATE	RTM_GETP4TEMPLATE
+	RTM_UPDATEP4TEMPLATE,
+#define RTM_UPDATEP4TEMPLATE	RTM_UPDATEP4TEMPLATE
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index dd1358c9e..e28dfc6eb 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y := p4tc_types.o
+obj-y := p4tc_types.o p4tc_tmpl_api.o
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
new file mode 100644
index 000000000..0569f3f1c
--- /dev/null
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_tmpl_api.c	P4 TC TEMPLATE API
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+static const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
+	[P4TC_ROOT] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
+	[P4TC_PATH] = { .type = NLA_BINARY,
+			.len = P4TC_PATH_MAX * sizeof(u32) },
+	[P4TC_PARAMS] = { .type = NLA_NESTED },
+};
+
+static const struct p4tc_template_ops *p4tc_ops[P4TC_OBJ_MAX + 1] = {};
+
+static bool obj_is_valid(u32 obj_id)
+{
+	if (obj_id > P4TC_OBJ_MAX)
+		return false;
+
+	return !!p4tc_ops[obj_id];
+}
+
+int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+			   struct idr *idr, int idx,
+			   struct netlink_ext_ack *extack)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_template_common *common;
+	unsigned long id = 0;
+	unsigned long tmp;
+	int i = 0;
+
+	id = ctx->ids[idx];
+
+	idr_for_each_entry_continue_ul(idr, common, tmp, id) {
+		struct nlattr *count;
+		int ret;
+
+		if (i == P4TC_MSGBATCH_SIZE)
+			break;
+
+		count = nla_nest_start(skb, i + 1);
+		if (!count)
+			goto out_nlmsg_trim;
+		ret = common->ops->dump_1(skb, common);
+		if (ret < 0) {
+			goto out_nlmsg_trim;
+		} else if (ret) {
+			nla_nest_cancel(skb, count);
+			continue;
+		}
+		nla_nest_end(skb, count);
+
+		i++;
+	}
+
+	if (i == 0) {
+		if (!ctx->ids[idx])
+			NL_SET_ERR_MSG(extack,
+				       "There are no pipeline components");
+		return 0;
+	}
+
+	ctx->ids[idx] = id;
+
+	return skb->len;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -ENOMEM;
+}
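
The batched-dump pattern above (resume from a cursor stored in the dump context, emit at most P4TC_MSGBATCH_SIZE entries per callback) can be sketched in userspace as follows; the names are hypothetical stand-ins:

```c
#include <assert.h>

#define BATCH_SIZE 4	/* stands in for P4TC_MSGBATCH_SIZE */
#define NITEMS 10	/* pretend templates with ids 0..9 */

struct dump_ctx { unsigned long next_id; };

/* Mirrors p4tc_tmpl_generic_dump(): emit at most BATCH_SIZE entries,
 * resuming from the cursor saved in the context, and leave the cursor
 * at the next unseen entry for the following dump call.
 */
static int generic_dump(struct dump_ctx *ctx, int *out)
{
	unsigned long id;
	int i = 0;

	for (id = ctx->next_id; id < NITEMS && i < BATCH_SIZE; id++)
		out[i++] = (int)id;

	ctx->next_id = id;
	return i;	/* entries emitted; 0 means the dump is done */
}

/* Drive dumps to completion, as the netlink core would across
 * successive dump callbacks.
 */
static int dump_all(void)
{
	struct dump_ctx ctx = { 0 };
	int out[BATCH_SIZE];
	int total = 0, n;

	while ((n = generic_dump(&ctx, out)) > 0)
		total += n;
	return total;
}
```

Persisting the cursor in cb->ctx is what lets the kernel resume a dump that did not fit into a single skb.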
+
+static int p4tc_template_put(struct net *net,
+			     struct p4tc_template_common *common,
+			     struct netlink_ext_ack *extack)
+{
+	/* Every created template is bound to a pipeline */
+	return common->ops->put(common, extack);
+}
+
+static int tc_ctl_p4_tmpl_1_send(struct sk_buff *skb, struct net *net,
+				 struct nlmsghdr *n, u32 portid)
+{
+	if (n->nlmsg_type == RTM_GETP4TEMPLATE)
+		return rtnl_unicast(skb, net, portid);
+
+	return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
+			      n->nlmsg_flags & NLM_F_ECHO);
+}
+
+static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
+			    struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+	struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+	struct net *net = sock_net(skb->sk);
+	u32 portid = NETLINK_CB(skb).portid;
+	struct p4tc_template_common *tmpl;
+	struct p4tc_template_ops *obj_op;
+	struct nlattr *tb[P4TC_MAX + 1];
+	struct p4tcmsg *t_new;
+	struct nlmsghdr *nlh;
+	struct sk_buff *nskb;
+	struct nlattr *root;
+	int ret;
+
+	/* All checks will fail at this point because obj_is_valid will return
+	 * false. The next patch will make this functional
+	 */
+	if (!obj_is_valid(t->obj)) {
+		NL_SET_ERR_MSG(extack, "Invalid object type");
+		return -EINVAL;
+	}
+
+	ret = nla_parse_nested(tb, P4TC_MAX, nla, p4tc_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	nskb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!nskb)
+		return -ENOMEM;
+
+	nlh = nlmsg_put(nskb, portid, n->nlmsg_seq, n->nlmsg_type,
+			sizeof(*t), n->nlmsg_flags);
+	if (!nlh) {
+		ret = -ENOMEM;
+		goto free_skb;
+	}
+
+	t_new = nlmsg_data(nlh);
+	if (!t_new) {
+		NL_SET_ERR_MSG(extack, "Message header is missing");
+		ret = -EINVAL;
+		goto free_skb;
+	}
+	t_new->obj = t->obj;
+
+	root = nla_nest_start(nskb, P4TC_ROOT);
+	if (!root) {
+		ret = -ENOMEM;
+		goto free_skb;
+	}
+
+	obj_op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+	switch (n->nlmsg_type) {
+	case RTM_CREATEP4TEMPLATE:
+	case RTM_UPDATEP4TEMPLATE:
+		if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PARAMS)) {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify object attributes");
+			ret = -EINVAL;
+			goto free_skb;
+		}
+		tmpl = obj_op->cu(net, n, tb[P4TC_PARAMS], extack);
+		if (IS_ERR(tmpl)) {
+			ret = PTR_ERR(tmpl);
+			goto free_skb;
+		}
+
+		ret = obj_op->fill_nlmsg(net, nskb, tmpl, extack);
+		if (ret < 0) {
+			p4tc_template_put(net, tmpl, extack);
+			goto free_skb;
+		}
+		break;
+	case RTM_DELP4TEMPLATE:
+	case RTM_GETP4TEMPLATE:
+		ret = obj_op->gd(net, nskb, n, tb[P4TC_PARAMS], extack);
+		if (ret < 0)
+			goto free_skb;
+		break;
+	default:
+		ret = -EINVAL;
+		goto free_skb;
+	}
+
+	nlmsg_end(nskb, nlh);
+
+	return tc_ctl_p4_tmpl_1_send(nskb, net, nlh, portid);
+
+free_skb:
+	kfree_skb(nskb);
+
+	return ret;
+}
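
The message-type dispatch in tc_ctl_p4_tmpl_1() above routes create/update to the cu() template op and delete/get to the gd() op. A minimal userspace sketch of that routing, with hypothetical message-type constants standing in for the RTM_*P4TEMPLATE values:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the RTM_*P4TEMPLATE message types. */
enum { MSG_CREATE, MSG_UPDATE, MSG_DEL, MSG_GET, MSG_BOGUS };

/* Mirrors the switch in tc_ctl_p4_tmpl_1(): create and update are
 * routed to the cu() template op, delete and get to gd(), and any
 * other message type is rejected.
 */
static const char *dispatch(int msg_type)
{
	switch (msg_type) {
	case MSG_CREATE:
	case MSG_UPDATE:
		return "cu";
	case MSG_DEL:
	case MSG_GET:
		return "gd";
	default:
		return "einval";
	}
}
```

Folding create/update into one op (and delete/get into another) keeps each per-object ops structure down to two entry points plus the dump/fill helpers.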
+
+static int tc_ctl_p4_tmpl_get(struct sk_buff *skb, struct nlmsghdr *n,
+			      struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	int ret;
+
+	ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(extack,
+			       "Netlink P4TC template attributes missing");
+		return -EINVAL;
+	}
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+}
+
+static int tc_ctl_p4_tmpl_delete(struct sk_buff *skb, struct nlmsghdr *n,
+				 struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	int ret;
+
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(extack,
+			       "Netlink P4TC template attributes missing");
+		return -EINVAL;
+	}
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+}
+
+static int tc_ctl_p4_tmpl_cu(struct sk_buff *skb, struct nlmsghdr *n,
+			     struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	int ret = 0;
+
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(extack,
+			       "Netlink P4TC template attributes missing");
+		return -EINVAL;
+	}
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+}
+
+static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
+				 struct netlink_callback *cb)
+{
+	struct p4tc_dump_ctx *ctx = (void *)cb->ctx;
+	struct netlink_ext_ack *extack = cb->extack;
+	u32 portid = NETLINK_CB(cb->skb).portid;
+	const struct nlmsghdr *n = cb->nlh;
+	struct p4tc_template_ops *obj_op;
+	struct nlattr *tb[P4TC_MAX + 1];
+	u32 ids[P4TC_PATH_MAX] = {};
+	struct p4tcmsg *t_new;
+	struct nlmsghdr *nlh;
+	struct nlattr *root;
+	struct p4tcmsg *t;
+	int ret;
+
+	ret = nla_parse_nested_deprecated(tb, P4TC_MAX, arg, p4tc_policy,
+					  extack);
+	if (ret < 0)
+		return ret;
+
+	t = (struct p4tcmsg *)nlmsg_data(n);
+	/* All checks will fail at this point because obj_is_valid will return
+	 * false. The next patch will make this functional
+	 */
+	if (!obj_is_valid(t->obj)) {
+		NL_SET_ERR_MSG(extack, "Invalid object type");
+		return -EINVAL;
+	}
+
+	nlh = nlmsg_put(skb, portid, n->nlmsg_seq, n->nlmsg_type,
+			sizeof(*t), n->nlmsg_flags);
+	if (!nlh)
+		return -ENOSPC;
+
+	t_new = nlmsg_data(nlh);
+	t_new->obj = t->obj;
+
+	root = nla_nest_start(skb, P4TC_ROOT);
+	if (!root) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	obj_op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+	ret = obj_op->dump(skb, ctx, tb[P4TC_PARAMS], ids, extack);
+	if (ret <= 0)
+		goto out;
+	nla_nest_end(skb, root);
+
+	nlmsg_end(skb, nlh);
+
+	return ret;
+
+out:
+	nlmsg_cancel(skb, nlh);
+	return ret;
+}
+
+static int tc_ctl_p4_tmpl_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	int ret;
+
+	ret = nlmsg_parse(cb->nlh, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, cb->extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(cb->extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(cb->extack,
+			       "Netlink P4TC template attributes missing");
+		return -EINVAL;
+	}
+
+	return tc_ctl_p4_tmpl_dump_1(skb, tb[P4TC_ROOT], cb);
+}
+
+static int __init p4tc_template_init(void)
+{
+	rtnl_register(PF_UNSPEC, RTM_CREATEP4TEMPLATE, tc_ctl_p4_tmpl_cu, NULL,
+		      0);
+	rtnl_register(PF_UNSPEC, RTM_UPDATEP4TEMPLATE, tc_ctl_p4_tmpl_cu, NULL,
+		      0);
+	rtnl_register(PF_UNSPEC, RTM_DELP4TEMPLATE, tc_ctl_p4_tmpl_delete, NULL,
+		      0);
+	rtnl_register(PF_UNSPEC, RTM_GETP4TEMPLATE, tc_ctl_p4_tmpl_get,
+		      tc_ctl_p4_tmpl_dump, 0);
+	return 0;
+}
+
+subsys_initcall(p4tc_template_init);
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 8ff670cf1..e50a1c1ff 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -94,6 +94,10 @@ static const struct nlmsg_perm nlmsg_route_perms[] = {
 	{ RTM_NEWTUNNEL,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_DELTUNNEL,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_GETTUNNEL,	NETLINK_ROUTE_SOCKET__NLMSG_READ  },
+	{ RTM_CREATEP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_DELP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_READ },
+	{ RTM_UPDATEP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 };
 
 static const struct nlmsg_perm nlmsg_tcpdiag_perms[] = {
@@ -177,7 +181,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
 		 * structures at the top of this file with the new mappings
 		 * before updating the BUILD_BUG_ON() macro!
 		 */
-		BUILD_BUG_ON(RTM_MAX != (RTM_NEWTUNNEL + 3));
+		BUILD_BUG_ON(RTM_MAX != (RTM_CREATEP4TEMPLATE + 3));
 		err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
 				 sizeof(nlmsg_route_perms));
 		break;
-- 
2.34.1



* [PATCH net-next v12  08/15] p4tc: add template pipeline create, get, update, delete
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (6 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 07/15] p4tc: add template API Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 09/15] p4tc: add template action create, update, delete, get, flush and dump Jamal Hadi Salim
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

__Introducing P4 TC Pipeline__

This commit introduces P4 TC pipelines, which emulate the semantics of a
P4 program/pipeline using the TC infrastructure.

This patch relies on the previous one to be functional. They were split
to ease review.

One can refer to P4 programs/pipelines using their names or their
specific pipeline ids (pipeid)

P4 template CRUD (Create, Read/get, Update and Delete) commands apply on a
pipeline.

As an example, to create a P4 program/pipeline named aP4proggie with a
single table in its pipeline, one would use the following command from user
space tc (as generated by the compiler):

tc p4template create pipeline/aP4proggie numtables 1 pipeid 1

Note that, in the above command, numtables is set to 1; the default
is 0 because it is feasible to have a P4 program with no tables at all.

Note: if no pipeid is specified, the kernel will issue one. To see which
pipeline ID was issued, one would add the -echo option, and the response back
from the kernel will contain the details:

tc -echo p4template create pipeline/aP4proggie numtables 1

To Read pipeline aP4proggie attributes, one would retrieve those details as
follows:

tc p4template get pipeline/[aP4proggie] [pipeid 1]

Note that in the above command one may specify pipeline ID, name or
both.

To Update aP4proggie pipeline from 1 to 10 tables, one would use the
following command:

tc p4template update pipeline/[aP4proggie] [pipeid 1] numtables 10

Note that, in the above command, one could use the P4 program/pipeline
name, id or both to specify which P4 program/pipeline to update.

To Delete a P4 program/pipeline named aP4proggie
with a pipeid of 1, one would use the following command:

tc p4template del pipeline/[aP4proggie] [pipeid 1]

Note that, in the above command, one could use the P4 program/pipeline
name, ID or both to specify which P4 program/pipeline to delete.

If one wished to dump all the created P4 programs/pipelines, one would
use the following command:

tc p4template get pipeline/

__Pipeline Lifetime__

After _Create_ is issued, one can Read/get, Update and Delete pipeline objects;
however the pipeline can only be put to use after it is "sealed".
To seal a pipeline, one would issue the following command:

tc p4template update pipeline/aP4proggie state ready

After a pipeline is sealed it can be instantiated via the TC P4 classifier.
For example:

tc filter add $DEV ingress protocol any prio 6 p4 pname aP4proggie \
    action bpf obj $PARSER.o section p4tc/parse \
    action bpf obj $PROGNAME.o section p4tc/main

This instantiates aP4proggie on the ingress of $DEV. One could also attach it
to a block of ports (for example, tc block 22) as follows:

tc filter add block 22 ingress protocol all prio 6 p4 pname aP4proggie \
    action bpf obj $PARSER.o section p4tc/parse \
    action bpf obj $PROGNAME.o section p4tc/main

We can add a table entry after the pipeline is sealed
(even before instantiating it). For example:

tc p4ctrl create aP4proggie/table/cb/aP4table \
      dstAddr 10.10.10.0/24 srcAddr 192.168.0.0/16 prio 16 \
      action drop

Once the pipeline is instantiated on a device or block it cannot be deleted.
It becomes Read-only from the control plane/user space.
The pipeline can be deleted once no users are left, i.e. after all
instances have been destroyed (all instantiated filters are deleted).
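
The put semantics described above can be modeled minimally: every holder calls put, and whichever call drops the reference count to zero performs the teardown. A userspace sketch under those assumptions (all names are hypothetical):

```c
#include <assert.h>

/* Hypothetical model of the pipeline reference count. */
struct pipeline { int refs; int destroyed; };

/* Whichever put call drops the count to zero tears the pipeline
 * down; earlier calls report it as still referenced (-EBUSY in the
 * kernel).
 */
static int pipeline_put(struct pipeline *p)
{
	if (--p->refs > 0)
		return -1;
	p->destroyed = 1;
	return 0;
}

/* Two holders (e.g. the template control path and one filter): only
 * the second put destroys the pipeline.
 */
static int demo(void)
{
	struct pipeline p = { .refs = 2, .destroyed = 0 };

	if (pipeline_put(&p) != -1 || p.destroyed)
		return 0;	/* first put must not destroy */
	if (pipeline_put(&p) != 0 || !p.destroyed)
		return 0;	/* second put must destroy */
	return 1;
}
```

This is why a pipeline instantiated by filters survives netns template cleanup: the filters still hold references, and the last filter's put does the teardown.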

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h             |  78 +++-
 include/uapi/linux/p4tc.h      |  24 ++
 net/sched/p4tc/Makefile        |   2 +-
 net/sched/p4tc/p4tc_pipeline.c | 636 +++++++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c | 104 ++++--
 5 files changed, 820 insertions(+), 24 deletions(-)
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index e55d7b0b6..2cb5d06c3 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -10,27 +10,43 @@
 #include <linux/rhashtable.h>
 #include <linux/rhashtable-types.h>
 
+#define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
+#define P4TC_DEFAULT_MAX_RULES 1
 #define P4TC_PATH_MAX 3
 
+#define P4TC_KERNEL_PIPEID 0
+
+#define P4TC_PID_IDX 0
+
 struct p4tc_dump_ctx {
 	u32 ids[P4TC_PATH_MAX];
 };
 
 struct p4tc_template_common;
 
+struct p4tc_path_nlattrs {
+	char                     *pname;
+	u32                      *ids;
+	bool                     pname_passed;
+};
+
+struct p4tc_pipeline;
 struct p4tc_template_ops {
 	struct p4tc_template_common *(*cu)(struct net *net, struct nlmsghdr *n,
 					   struct nlattr *nla,
+					   struct p4tc_path_nlattrs *nl_pname,
 					   struct netlink_ext_ack *extack);
-	int (*put)(struct p4tc_template_common *tmpl,
+	int (*put)(struct p4tc_pipeline *pipeline,
+		   struct p4tc_template_common *tmpl,
 		   struct netlink_ext_ack *extack);
 	int (*gd)(struct net *net, struct sk_buff *skb, struct nlmsghdr *n,
-		  struct nlattr *nla, struct netlink_ext_ack *extack);
+		  struct nlattr *nla, struct p4tc_path_nlattrs *nl_pname,
+		  struct netlink_ext_ack *extack);
 	int (*fill_nlmsg)(struct net *net, struct sk_buff *skb,
 			  struct p4tc_template_common *tmpl,
 			  struct netlink_ext_ack *extack);
 	int (*dump)(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
-		    struct nlattr *nla, u32 *ids,
+		    struct nlattr *nla, char **p_name, u32 *ids,
 		    struct netlink_ext_ack *extack);
 	int (*dump_1)(struct sk_buff *skb, struct p4tc_template_common *common);
 	u32 obj_id;
@@ -39,6 +55,25 @@ struct p4tc_template_ops {
 struct p4tc_template_common {
 	char                     name[P4TC_TMPL_NAMSZ];
 	struct p4tc_template_ops *ops;
+	u32                      p_id;
+	u32                      __pad0;
+};
+
+struct p4tc_pipeline {
+	struct p4tc_template_common common;
+	struct rcu_head             rcu;
+	struct net                  *net;
+	/* Accounts for how many entities are referencing this pipeline.
+	 * As for now only P4 filters can refer to pipelines.
+	 */
+	refcount_t                  p_ctrl_ref;
+	u16                         num_tables;
+	u16                         curr_tables;
+	u8                          p_state;
+};
+
+struct p4tc_pipeline_net {
+	struct idr pipeline_idr;
 };
 
 static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
@@ -46,8 +81,45 @@ static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
 	return n->nlmsg_type == RTM_UPDATEP4TEMPLATE;
 }
 
+int p4tc_tmpl_register_ops(const struct p4tc_template_ops *tmpl_ops);
+
 int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
 			   struct idr *idr, int idx,
 			   struct netlink_ext_ack *extack);
 
+struct p4tc_pipeline *p4tc_pipeline_find_byany(struct net *net,
+					       const char *p_name,
+					       const u32 pipeid,
+					       struct netlink_ext_ack *extack);
+struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net,
+					      const u32 pipeid);
+struct p4tc_pipeline *
+p4tc_pipeline_find_get(struct net *net, const char *p_name,
+		       const u32 pipeid, struct netlink_ext_ack *extack);
+
+static inline bool p4tc_pipeline_get(struct p4tc_pipeline *pipeline)
+{
+	return refcount_inc_not_zero(&pipeline->p_ctrl_ref);
+}
+
+void p4tc_pipeline_put(struct p4tc_pipeline *pipeline);
+struct p4tc_pipeline *
+p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
+				  const u32 pipeid,
+				  struct netlink_ext_ack *extack);
+
+static inline int p4tc_action_destroy(struct tc_action **acts)
+{
+	int ret = 0;
+
+	if (acts) {
+		ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
+		kfree(acts);
+	}
+
+	return ret;
+}
+
+#define to_pipeline(t) ((struct p4tc_pipeline *)t)
+
 #endif
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 22ba1c05a..8d8ffcb9e 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -7,19 +7,25 @@
 
 /* pipeline header */
 struct p4tcmsg {
+	__u32 pipeid;
 	__u32 obj;
 };
 
+#define P4TC_MAXPIPELINE_COUNT 32
+#define P4TC_MAXTABLES_COUNT 32
+#define P4TC_MINTABLES_COUNT 0
 #define P4TC_MSGBATCH_SIZE 16
 
 #define P4TC_MAX_KEYSZ 512
 
 #define P4TC_TMPL_NAMSZ 32
+#define P4TC_PIPELINE_NAMSIZ P4TC_TMPL_NAMSZ
 
 /* Root attributes */
 enum {
 	P4TC_ROOT_UNSPEC,
 	P4TC_ROOT, /* nested messages */
+	P4TC_ROOT_PNAME, /* string - mandatory for pipeline create */
 	__P4TC_ROOT_MAX,
 };
 
@@ -28,6 +34,7 @@ enum {
 /* P4 Object types */
 enum {
 	P4TC_OBJ_UNSPEC,
+	P4TC_OBJ_PIPELINE,
 	__P4TC_OBJ_MAX,
 };
 
@@ -43,6 +50,23 @@ enum {
 
 #define P4TC_MAX (__P4TC_MAX - 1)
 
+/* PIPELINE attributes */
+enum {
+	P4TC_PIPELINE_UNSPEC,
+	P4TC_PIPELINE_NUMTABLES, /* u16 */
+	P4TC_PIPELINE_STATE, /* u8 */
+	P4TC_PIPELINE_NAME, /* string only used for pipeline dump */
+	__P4TC_PIPELINE_MAX
+};
+
+#define P4TC_PIPELINE_MAX (__P4TC_PIPELINE_MAX - 1)
+
+/* PIPELINE states */
+enum {
+	P4TC_STATE_NOT_READY,
+	P4TC_STATE_READY,
+};
+
 enum {
 	P4TC_T_UNSPEC,
 	P4TC_T_U8,
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index e28dfc6eb..0881a7563 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y := p4tc_types.o p4tc_tmpl_api.o
+obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
new file mode 100644
index 000000000..936ec777a
--- /dev/null
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -0,0 +1,636 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_pipeline.c	P4 TC PIPELINE
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+#include <net/p4tc_types.h>
+
+static unsigned int pipeline_net_id;
+static struct p4tc_pipeline *root_pipeline;
+
+static __net_init int pipeline_init_net(struct net *net)
+{
+	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+
+	idr_init(&pipe_net->pipeline_idr);
+
+	return 0;
+}
+
+static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
+			       struct p4tc_template_common *template,
+			       struct netlink_ext_ack *extack);
+
+static void __net_exit pipeline_exit_net(struct net *net)
+{
+	struct p4tc_pipeline_net *pipe_net;
+	struct p4tc_pipeline *pipeline;
+	unsigned long pipeid, tmp;
+
+	rtnl_lock();
+	pipe_net = net_generic(net, pipeline_net_id);
+	idr_for_each_entry_ul(&pipe_net->pipeline_idr, pipeline, tmp, pipeid) {
+		__p4tc_pipeline_put(pipeline, &pipeline->common, NULL);
+	}
+	idr_destroy(&pipe_net->pipeline_idr);
+	rtnl_unlock();
+}
+
+static struct pernet_operations pipeline_net_ops = {
+	.init = pipeline_init_net,
+	.pre_exit = pipeline_exit_net,
+	.id = &pipeline_net_id,
+	.size = sizeof(struct p4tc_pipeline_net),
+};
+
+static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
+	[P4TC_PIPELINE_NUMTABLES] =
+		NLA_POLICY_RANGE(NLA_U16, P4TC_MINTABLES_COUNT,
+				 P4TC_MAXTABLES_COUNT),
+	[P4TC_PIPELINE_STATE] = { .type = NLA_U8 },
+};
+
+static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
+{
+	kfree(pipeline);
+}
+
+static void p4tc_pipeline_destroy_rcu(struct rcu_head *head)
+{
+	struct p4tc_pipeline *pipeline;
+	struct net *net;
+
+	pipeline = container_of(head, struct p4tc_pipeline, rcu);
+
+	net = pipeline->net;
+	p4tc_pipeline_destroy(pipeline);
+	put_net(net);
+}
+
+static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
+				   struct netlink_ext_ack *extack)
+{
+	struct net *net = pipeline->net;
+	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+	struct net *pipeline_net = maybe_get_net(net);
+
+	/* If we are on netns cleanup we can't touch the pipeline_idr.
+	 * On pre_exit we will destroy the idr but never call into teardown
+	 * if filters are active which makes pipeline pointers dangle until
+	 * the filters ultimately destroy them.
+	 */
+	if (pipeline_net) {
+		idr_remove(&pipe_net->pipeline_idr, pipeline->common.p_id);
+		call_rcu(&pipeline->rcu, p4tc_pipeline_destroy_rcu);
+	} else {
+		p4tc_pipeline_destroy(pipeline);
+	}
+}
+
+static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
+			       struct p4tc_template_common *template,
+			       struct netlink_ext_ack *extack)
+{
+	/* The lifetime of the pipeline can be terminated in two cases:
+	 * - netns cleanup (system driven)
+	 * - pipeline delete (user driven)
+	 *
+	 * When the pipeline is referenced by one or more p4 classifiers we need
+	 * to make sure the pipeline and its components are alive while the
+	 * classifier is still visible by the datapath.
+	 * In the netns cleanup, we cannot destroy the pipeline in our netns
+	 * exit callback as the netdevs and filters are still visible in the
+	 * datapath. In such case, it's the filter's job to destroy the
+	 * pipeline.
+	 *
+	 * To accommodate such a scenario, whichever put call reaches '0' first
+	 * will destroy the pipeline and its components.
+	 *
+	 * On netns cleanup we guarantee no table entries operations are in
+	 * flight.
+	 */
+	if (!refcount_dec_and_test(&pipeline->p_ctrl_ref)) {
+		NL_SET_ERR_MSG(extack, "Can't delete referenced pipeline");
+		return -EBUSY;
+	}
+
+	p4tc_pipeline_teardown(pipeline, extack);
+
+	return 0;
+}
+
+static int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
+					struct netlink_ext_ack *extack)
+{
+	if (pipeline->curr_tables != pipeline->num_tables) {
+		NL_SET_ERR_MSG(extack,
+			       "Must have all tables defined to update state to ready");
+		return -EINVAL;
+	}
+
+	pipeline->p_state = P4TC_STATE_READY;
+	return 0;
+}
+
+static bool p4tc_pipeline_sealed(struct p4tc_pipeline *pipeline)
+{
+	return pipeline->p_state == P4TC_STATE_READY;
+}
+
+struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net, const u32 pipeid)
+{
+	struct p4tc_pipeline_net *pipe_net;
+
+	if (pipeid == P4TC_KERNEL_PIPEID)
+		return root_pipeline;
+
+	pipe_net = net_generic(net, pipeline_net_id);
+
+	return idr_find(&pipe_net->pipeline_idr, pipeid);
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_find_byid);
+
+static struct p4tc_pipeline *p4tc_pipeline_find_byname(struct net *net,
+						       const char *name)
+{
+	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+	struct p4tc_pipeline *pipeline;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(&pipe_net->pipeline_idr, pipeline, tmp, id) {
+		/* Don't show kernel pipeline */
+		if (id == P4TC_KERNEL_PIPEID)
+			continue;
+		if (strncmp(pipeline->common.name, name,
+			    P4TC_PIPELINE_NAMSIZ) == 0)
+			return pipeline;
+	}
+
+	return NULL;
+}
+
+static const struct p4tc_template_ops p4tc_pipeline_ops;
+
+static struct p4tc_pipeline *
+p4tc_pipeline_create(struct net *net, struct nlmsghdr *n,
+		     struct nlattr *nla, const char *p_name,
+		     u32 pipeid, struct netlink_ext_ack *extack)
+{
+	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+	struct nlattr *tb[P4TC_PIPELINE_MAX + 1];
+	struct p4tc_pipeline *pipeline;
+	int ret = 0;
+
+	ret = nla_parse_nested(tb, P4TC_PIPELINE_MAX, nla, tc_pipeline_policy,
+			       extack);
+
+	if (ret < 0)
+		goto out;
+
+	pipeline = p4tc_pipeline_find_byany(net, p_name, pipeid, NULL);
+	if (pipeid != P4TC_KERNEL_PIPEID && !IS_ERR(pipeline)) {
+		NL_SET_ERR_MSG(extack, "Pipeline exists");
+		ret = -EEXIST;
+		goto out;
+	}
+
+	pipeline = kzalloc(sizeof(*pipeline), GFP_KERNEL);
+	if (unlikely(!pipeline))
+		return ERR_PTR(-ENOMEM);
+
+	if (!p_name || p_name[0] == '\0') {
+		NL_SET_ERR_MSG(extack, "Must specify pipeline name");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	strscpy(pipeline->common.name, p_name, P4TC_PIPELINE_NAMSIZ);
+
+	if (pipeid) {
+		ret = idr_alloc_u32(&pipe_net->pipeline_idr, pipeline, &pipeid,
+				    pipeid, GFP_KERNEL);
+	} else {
+		pipeid = 1;
+		ret = idr_alloc_u32(&pipe_net->pipeline_idr, pipeline, &pipeid,
+				    UINT_MAX, GFP_KERNEL);
+	}
+
+	if (ret < 0) {
+		NL_SET_ERR_MSG(extack, "Unable to allocate pipeline id");
+		goto idr_rm;
+	}
+
+	pipeline->common.p_id = pipeid;
+
+	if (tb[P4TC_PIPELINE_NUMTABLES])
+		pipeline->num_tables =
+			nla_get_u16(tb[P4TC_PIPELINE_NUMTABLES]);
+	else
+		pipeline->num_tables = P4TC_DEFAULT_NUM_TABLES;
+
+	pipeline->p_state = P4TC_STATE_NOT_READY;
+
+	pipeline->net = net;
+
+	refcount_set(&pipeline->p_ctrl_ref, 1);
+
+	pipeline->common.ops = (struct p4tc_template_ops *)&p4tc_pipeline_ops;
+
+	return pipeline;
+
+idr_rm:
+	idr_remove(&pipe_net->pipeline_idr, pipeid);
+
+err:
+	kfree(pipeline);
+
+out:
+	return ERR_PTR(ret);
+}
+
+struct p4tc_pipeline *p4tc_pipeline_find_byany(struct net *net,
+					       const char *p_name,
+					       const u32 pipeid,
+					       struct netlink_ext_ack *extack)
+{
+	struct p4tc_pipeline *pipeline = NULL;
+
+	if (pipeid) {
+		pipeline = p4tc_pipeline_find_byid(net, pipeid);
+		if (!pipeline) {
+			NL_SET_ERR_MSG(extack, "Unable to find pipeline by id");
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		if (p_name) {
+			pipeline = p4tc_pipeline_find_byname(net, p_name);
+			if (!pipeline) {
+				NL_SET_ERR_MSG(extack,
+					       "Pipeline name not found");
+				return ERR_PTR(-EINVAL);
+			}
+		} else {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify pipeline name or id");
+			return ERR_PTR(-EINVAL);
+		}
+	}
+
+	return pipeline;
+}
+
+struct p4tc_pipeline *
+p4tc_pipeline_find_get(struct net *net, const char *p_name,
+		       const u32 pipeid, struct netlink_ext_ack *extack)
+{
+	struct p4tc_pipeline *pipeline =
+		p4tc_pipeline_find_byany(net, p_name, pipeid, extack);
+
+	if (IS_ERR(pipeline))
+		return pipeline;
+
+	if (!p4tc_pipeline_get(pipeline)) {
+		NL_SET_ERR_MSG(extack, "Pipeline is stale");
+		return ERR_PTR(-EINVAL);
+	}
+
+	return pipeline;
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_find_get);
+
+void p4tc_pipeline_put(struct p4tc_pipeline *pipeline)
+{
+	__p4tc_pipeline_put(pipeline, &pipeline->common, NULL);
+}
+EXPORT_SYMBOL_GPL(p4tc_pipeline_put);
+
+struct p4tc_pipeline *
+p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
+				  const u32 pipeid,
+				  struct netlink_ext_ack *extack)
+{
+	struct p4tc_pipeline *pipeline =
+		p4tc_pipeline_find_byany(net, p_name, pipeid, extack);
+	if (IS_ERR(pipeline))
+		return pipeline;
+
+	if (p4tc_pipeline_sealed(pipeline)) {
+		NL_SET_ERR_MSG(extack, "Pipeline is sealed");
+		return ERR_PTR(-EINVAL);
+	}
+
+	return pipeline;
+}
+
+static struct p4tc_pipeline *
+p4tc_pipeline_update(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+		     const char *p_name, const u32 pipeid,
+		     struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_PIPELINE_MAX + 1];
+	struct p4tc_pipeline *pipeline;
+	u16 num_tables = 0;
+	int ret = 0;
+
+	ret = nla_parse_nested(tb, P4TC_PIPELINE_MAX, nla, tc_pipeline_policy,
+			       extack);
+
+	if (ret < 0)
+		goto out;
+
+	pipeline =
+		p4tc_pipeline_find_byany_unsealed(net, p_name, pipeid, extack);
+	if (IS_ERR(pipeline))
+		return pipeline;
+
+	if (tb[P4TC_PIPELINE_NUMTABLES])
+		num_tables = nla_get_u16(tb[P4TC_PIPELINE_NUMTABLES]);
+
+	if (tb[P4TC_PIPELINE_STATE]) {
+		ret = pipeline_try_set_state_ready(pipeline, extack);
+		if (ret < 0)
+			goto out;
+	}
+
+	if (num_tables)
+		pipeline->num_tables = num_tables;
+
+	return pipeline;
+
+out:
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_pipeline_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+		 struct p4tc_path_nlattrs *nl_path_attrs,
+		 struct netlink_ext_ack *extack)
+{
+	u32 *ids = nl_path_attrs->ids;
+	u32 pipeid = ids[P4TC_PID_IDX];
+	struct p4tc_pipeline *pipeline;
+
+	switch (n->nlmsg_type) {
+	case RTM_CREATEP4TEMPLATE:
+		pipeline = p4tc_pipeline_create(net, n, nla,
+						nl_path_attrs->pname,
+						pipeid, extack);
+		break;
+	case RTM_UPDATEP4TEMPLATE:
+		pipeline = p4tc_pipeline_update(net, n, nla,
+						nl_path_attrs->pname,
+						pipeid, extack);
+		break;
+	default:
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (IS_ERR(pipeline))
+		goto out;
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+out:
+	return (struct p4tc_template_common *)pipeline;
+}
+
+static int _p4tc_pipeline_fill_nlmsg(struct sk_buff *skb,
+				     const struct p4tc_pipeline *pipeline)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct nlattr *nest;
+
+	nest = nla_nest_start(skb, P4TC_PARAMS);
+	if (!nest)
+		goto out_nlmsg_trim;
+	if (nla_put_u16(skb, P4TC_PIPELINE_NUMTABLES, pipeline->num_tables))
+		goto out_nlmsg_trim;
+	if (nla_put_u8(skb, P4TC_PIPELINE_STATE, pipeline->p_state))
+		goto out_nlmsg_trim;
+
+	nla_nest_end(skb, nest);
+
+	return skb->len;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4tc_pipeline_fill_nlmsg(struct net *net, struct sk_buff *skb,
+				    struct p4tc_template_common *template,
+				    struct netlink_ext_ack *extack)
+{
+	const struct p4tc_pipeline *pipeline = to_pipeline(template);
+
+	if (_p4tc_pipeline_fill_nlmsg(skb, pipeline) <= 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Failed to fill notification attributes for pipeline");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int p4tc_pipeline_del_one(struct p4tc_pipeline *pipeline,
+				 struct netlink_ext_ack *extack)
+{
+	/* A user-driven pipeline put doesn't transfer the lifetime
+	 * of the pipeline to other ref holders. In the case of unlocked
+	 * table entries, it must never tear down the pipeline, so we
+	 * need to do an atomic transition here.
+	 *
+	 * A system-driven put will serialize with rtnl_lock and
+	 * table entries are guaranteed to not be in flight.
+	 */
+	if (!refcount_dec_if_one(&pipeline->p_ctrl_ref)) {
+		NL_SET_ERR_MSG(extack, "Pipeline in use");
+		return -EAGAIN;
+	}
+
+	p4tc_pipeline_teardown(pipeline, extack);
+
+	return 0;
+}
+
+static int p4tc_pipeline_gd(struct net *net, struct sk_buff *skb,
+			    struct nlmsghdr *n, struct nlattr *nla,
+			    struct p4tc_path_nlattrs *nl_path_attrs,
+			    struct netlink_ext_ack *extack)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_template_common *tmpl;
+	struct p4tc_pipeline *pipeline;
+	u32 *ids = nl_path_attrs->ids;
+	u32 pipeid = ids[P4TC_PID_IDX];
+	int ret = 0;
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE &&
+	    (n->nlmsg_flags & NLM_F_ROOT)) {
+		NL_SET_ERR_MSG(extack, "Pipeline flush not supported");
+		return -EOPNOTSUPP;
+	}
+
+	pipeline = p4tc_pipeline_find_byany(net, nl_path_attrs->pname, pipeid,
+					    extack);
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	tmpl = (struct p4tc_template_common *)pipeline;
+	if (p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack) < 0)
+		return -1;
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+		ret = p4tc_pipeline_del_one(pipeline, extack);
+		if (ret < 0)
+			goto out_nlmsg_trim;
+	}
+
+	return ret;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static int p4tc_pipeline_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+			      struct nlattr *nla, char **p_name, u32 *ids,
+			      struct netlink_ext_ack *extack)
+{
+	struct net *net = sock_net(skb->sk);
+	struct p4tc_pipeline_net *pipe_net;
+
+	pipe_net = net_generic(net, pipeline_net_id);
+
+	return p4tc_tmpl_generic_dump(skb, ctx, &pipe_net->pipeline_idr,
+				      P4TC_PID_IDX, extack);
+}
+
+static int p4tc_pipeline_dump_1(struct sk_buff *skb,
+				struct p4tc_template_common *common)
+{
+	struct p4tc_pipeline *pipeline = to_pipeline(common);
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct nlattr *param;
+
+	/* Don't show kernel pipeline in dump */
+	if (pipeline->common.p_id == P4TC_KERNEL_PIPEID)
+		return 1;
+
+	param = nla_nest_start(skb, P4TC_PARAMS);
+	if (!param)
+		goto out_nlmsg_trim;
+	if (nla_put_string(skb, P4TC_PIPELINE_NAME, pipeline->common.name))
+		goto out_nlmsg_trim;
+
+	nla_nest_end(skb, param);
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -ENOMEM;
+}
+
+static int register_pipeline_pernet(void)
+{
+	return register_pernet_subsys(&pipeline_net_ops);
+}
+
+static const struct p4tc_template_ops p4tc_pipeline_ops = {
+	.cu = p4tc_pipeline_cu,
+	.fill_nlmsg = p4tc_pipeline_fill_nlmsg,
+	.gd = p4tc_pipeline_gd,
+	.put = __p4tc_pipeline_put,
+	.dump = p4tc_pipeline_dump,
+	.dump_1 = p4tc_pipeline_dump_1,
+	.obj_id = P4TC_OBJ_PIPELINE,
+};
+
+static int __p4tc_pipeline_init(void)
+{
+	int pipeid = P4TC_KERNEL_PIPEID;
+
+	root_pipeline = kzalloc(sizeof(*root_pipeline), GFP_ATOMIC);
+	if (unlikely(!root_pipeline)) {
+		pr_err("Unable to register kernel pipeline\n");
+		return -ENOMEM;
+	}
+
+	strscpy(root_pipeline->common.name, "kernel", P4TC_PIPELINE_NAMSIZ);
+
+	root_pipeline->common.ops =
+		(struct p4tc_template_ops *)&p4tc_pipeline_ops;
+
+	root_pipeline->common.p_id = pipeid;
+
+	root_pipeline->p_state = P4TC_STATE_READY;
+
+	return 0;
+}
+
+static int __init p4tc_pipeline_init(void)
+{
+	int err;
+
+	err = register_pipeline_pernet();
+	if (err < 0) {
+		pr_err("Failed to register per net pipeline IDR\n");
+		return err;
+	}
+
+	err = p4tc_register_types();
+	if (err < 0) {
+		pr_err("Failed to register P4 types\n");
+		goto unregister_pipeline_pernet;
+	}
+
+	err = __p4tc_pipeline_init();
+	if (err < 0)
+		goto unregister_types;
+
+	p4tc_tmpl_register_ops(&p4tc_pipeline_ops);
+
+	return 0;
+
+unregister_types:
+	p4tc_unregister_types();
+
+unregister_pipeline_pernet:
+	unregister_pernet_subsys(&pipeline_net_ops);
+	return err;
+}
+
+subsys_initcall(p4tc_pipeline_init);
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index 0569f3f1c..bb973071a 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -29,6 +29,7 @@
 
 static const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
 	[P4TC_ROOT] = { .type = NLA_NESTED },
+	[P4TC_ROOT_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ },
 };
 
 static const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
@@ -47,6 +48,16 @@ static bool obj_is_valid(u32 obj_id)
 	return !!p4tc_ops[obj_id];
 }
 
+int p4tc_tmpl_register_ops(const struct p4tc_template_ops *tmpl_ops)
+{
+	if (tmpl_ops->obj_id > P4TC_OBJ_MAX)
+		return -EINVAL;
+
+	p4tc_ops[tmpl_ops->obj_id] = tmpl_ops;
+
+	return 0;
+}
+
 int p4tc_tmpl_generic_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
 			   struct idr *idr, int idx,
 			   struct netlink_ext_ack *extack)
@@ -102,7 +113,9 @@ static int p4tc_template_put(struct net *net,
 			     struct netlink_ext_ack *extack)
 {
 	/* Every created template is bound to a pipeline */
-	return common->ops->put(common, extack);
+	struct p4tc_pipeline *pipeline =
+		p4tc_pipeline_find_byid(net, common->p_id);
+	return common->ops->put(pipeline, common, extack);
 }
 
 static int tc_ctl_p4_tmpl_1_send(struct sk_buff *skb, struct net *net,
@@ -116,23 +129,24 @@ static int tc_ctl_p4_tmpl_1_send(struct sk_buff *skb, struct net *net,
 }
 
 static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
-			    struct nlattr *nla, struct netlink_ext_ack *extack)
+			    struct nlattr *nla, const char *p_name,
+			    struct netlink_ext_ack *extack)
 {
 	struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+	struct p4tc_path_nlattrs nl_path_attrs = {0};
 	struct net *net = sock_net(skb->sk);
 	u32 portid = NETLINK_CB(skb).portid;
 	struct p4tc_template_common *tmpl;
 	struct p4tc_template_ops *obj_op;
 	struct nlattr *tb[P4TC_MAX + 1];
+	u32 ids[P4TC_PATH_MAX] = {};
 	struct p4tcmsg *t_new;
+	struct nlattr *pnatt;
 	struct nlmsghdr *nlh;
 	struct sk_buff *nskb;
 	struct nlattr *root;
 	int ret;
 
-	/* All checks will fail at this point because obj_is_valid will return
-	 * false. The next patch will make this functional
-	 */
 	if (!obj_is_valid(t->obj)) {
 		NL_SET_ERR_MSG(extack, "Invalid object type");
 		return -EINVAL;
@@ -142,6 +156,10 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 	if (ret < 0)
 		return ret;
 
+	ids[P4TC_PID_IDX] = t->pipeid;
+
+	nl_path_attrs.ids = ids;
+
 	nskb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
 	if (!nskb)
 		return -ENOMEM;
@@ -154,12 +172,24 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 	t_new = nlmsg_data(nlh);
-	if (!t_new) {
-		NL_SET_ERR_MSG(extack, "Message header is missing");
-		ret = -EINVAL;
+	t_new->pipeid = t->pipeid;
+	t_new->obj = t->obj;
+
+	pnatt = nla_reserve(nskb, P4TC_ROOT_PNAME, P4TC_PIPELINE_NAMSIZ);
+	if (!pnatt) {
+		ret = -ENOMEM;
 		goto free_skb;
 	}
-	t_new->obj = t->obj;
+
+	nl_path_attrs.pname = nla_data(pnatt);
+	if (!p_name) {
+		/* Filled up by the operation or forced failure */
+		memset(nl_path_attrs.pname, 0, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = false;
+	} else {
+		strscpy(nl_path_attrs.pname, p_name, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = true;
+	}
 
 	root = nla_nest_start(nskb, P4TC_ROOT);
 	if (!root) {
@@ -168,6 +198,7 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 	obj_op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
+
 	switch (n->nlmsg_type) {
 	case RTM_CREATEP4TEMPLATE:
 	case RTM_UPDATEP4TEMPLATE:
@@ -177,7 +208,8 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 			ret = -EINVAL;
 			goto free_skb;
 		}
-		tmpl = obj_op->cu(net, n, tb[P4TC_PARAMS], extack);
+		tmpl = obj_op->cu(net, n, tb[P4TC_PARAMS], &nl_path_attrs,
+				  extack);
 		if (IS_ERR(tmpl)) {
 			ret = PTR_ERR(tmpl);
 			goto free_skb;
@@ -191,7 +223,8 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 		break;
 	case RTM_DELP4TEMPLATE:
 	case RTM_GETP4TEMPLATE:
-		ret = obj_op->gd(net, nskb, n, tb[P4TC_PARAMS], extack);
+		ret = obj_op->gd(net, nskb, n, tb[P4TC_PARAMS], &nl_path_attrs,
+				 extack);
 		if (ret < 0)
 			goto free_skb;
 		break;
@@ -200,6 +233,11 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 		goto free_skb;
 	}
 
+	if (!t->pipeid)
+		t_new->pipeid = ids[P4TC_PID_IDX];
+
+	nla_nest_end(nskb, root);
+
 	nlmsg_end(nskb, nlh);
 
 	return tc_ctl_p4_tmpl_1_send(nskb, net, nlh, portid);
@@ -214,6 +252,7 @@ static int tc_ctl_p4_tmpl_get(struct sk_buff *skb, struct nlmsghdr *n,
 			      struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
 	int ret;
 
 	ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
@@ -227,13 +266,17 @@ static int tc_ctl_p4_tmpl_get(struct sk_buff *skb, struct nlmsghdr *n,
 		return -EINVAL;
 	}
 
-	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], p_name, extack);
 }
 
 static int tc_ctl_p4_tmpl_delete(struct sk_buff *skb, struct nlmsghdr *n,
 				 struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
 	int ret;
 
 	if (!netlink_capable(skb, CAP_NET_ADMIN))
@@ -250,13 +293,17 @@ static int tc_ctl_p4_tmpl_delete(struct sk_buff *skb, struct nlmsghdr *n,
 		return -EINVAL;
 	}
 
-	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], p_name, extack);
 }
 
 static int tc_ctl_p4_tmpl_cu(struct sk_buff *skb, struct nlmsghdr *n,
 			     struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
 	int ret = 0;
 
 	if (!netlink_capable(skb, CAP_NET_ADMIN))
@@ -273,11 +320,14 @@ static int tc_ctl_p4_tmpl_cu(struct sk_buff *skb, struct nlmsghdr *n,
 		return -EINVAL;
 	}
 
-	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], extack);
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	return tc_ctl_p4_tmpl_1(skb, n, tb[P4TC_ROOT], p_name, extack);
 }
 
 static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
-				 struct netlink_callback *cb)
+				 char *p_name, struct netlink_callback *cb)
 {
 	struct p4tc_dump_ctx *ctx = (void *)cb->ctx;
 	struct netlink_ext_ack *extack = cb->extack;
@@ -298,9 +348,6 @@ static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
 		return ret;
 
 	t = (struct p4tcmsg *)nlmsg_data(n);
-	/* All checks will fail at this point because obj_is_valid will return
-	 * false. The next patch will make this functional
-	 */
 	if (!obj_is_valid(t->obj)) {
 		NL_SET_ERR_MSG(extack, "Invalid object type");
 		return -EINVAL;
@@ -312,16 +359,29 @@ static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
 		return -ENOSPC;
 
 	t_new = nlmsg_data(nlh);
+	t_new->pipeid = t->pipeid;
 	t_new->obj = t->obj;
 
 	root = nla_nest_start(skb, P4TC_ROOT);
 
+	ids[P4TC_PID_IDX] = t->pipeid;
+
 	obj_op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
-	ret = obj_op->dump(skb, ctx, tb[P4TC_PARAMS], ids, extack);
+	ret = obj_op->dump(skb, ctx, tb[P4TC_PARAMS], &p_name, ids, extack);
 	if (ret <= 0)
 		goto out;
 	nla_nest_end(skb, root);
 
+	if (p_name) {
+		if (nla_put_string(skb, P4TC_ROOT_PNAME, p_name)) {
+			ret = -1;
+			goto out;
+		}
+	}
+
+	if (!t_new->pipeid)
+		t_new->pipeid = ids[P4TC_PID_IDX];
+
 	nlmsg_end(skb, nlh);
 
 	return ret;
@@ -334,6 +394,7 @@ static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
 static int tc_ctl_p4_tmpl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
 	int ret;
 
 	ret = nlmsg_parse(cb->nlh, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
@@ -347,7 +408,10 @@ static int tc_ctl_p4_tmpl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		return -EINVAL;
 	}
 
-	return tc_ctl_p4_tmpl_dump_1(skb, tb[P4TC_ROOT], cb);
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	return tc_ctl_p4_tmpl_dump_1(skb, tb[P4TC_ROOT], p_name, cb);
 }
 
 static int __init p4tc_template_init(void)
-- 
2.34.1



* [PATCH net-next v12  09/15] p4tc: add template action create, update, delete, get, flush and dump
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (7 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 08/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 10/15] p4tc: add runtime action support Jamal Hadi Salim
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

This commit allows the control plane to create, update, delete, get, flush
and dump action templates based on P4 action definitions.

Visualize the following action in a P4 program named aP4proggie:

action send_nh(@tc_type("macaddr") bit<48> dstAddr, @tc_type("dev") bit<8> port)
{
    hdr.ethernet.dstAddr = dstAddr;
    send_to_port(port);
}

The above defines an action called send_nh which takes two parameters:
a bit<48> dstAddr (a MAC address) and a bit<8> port (roughly analogous
to an ifindex).

The action is applied on a P4 table match as follows:

table mytable {
        key = {
            hdr.ipv4.dstAddr @tc_type("ipv4"): lpm;
        }

        actions = {
            send_nh;
            drop;
            NoAction;
        }

        size = 1024;
}

The mechanics of actions follow the CRUD semantics which are illustrated
in the text below.

___P4 ACTION KIND CREATION___

In this stage we create the p4 action kind by specifying the action name,
its ID, its parameters and the parameter types.
So for the send_nh action, the creation would look something like
this:

tc p4template create action/aP4proggie/send_nh \
  param dstAddr type macaddr id 1 param port type dev id 2

All the template commands (tc p4template) are generated by the
p4c compiler (but of course could be hand coded by humans).

Also note that an action name has to specify the program name since
P4 actions are unique to a program.  As an example, the
above command creates an action template that is bound to the
pipeline/program named "aP4proggie".

Note 2: In P4, actions are assumed to pre-exist and to have an upper bound
on their number of instances. Typically, if you have a max of 1024 "mytable"
table entries you want to allocate enough action instances to cover the 1024
entries. However, this is a big waste of memory when table occupancy is
low. We pick a middle ground by providing pre-allocation control via the
"num_prealloc" attribute.
The compiler-generated template does not specify it, in which case we
preallocate 16 instances by default. The user can override this value by
editing the generated text, for example to change the number to 128 as such:

tc p4template create action/aP4proggie/send_nh num_prealloc 128 \
  param dstAddr type macaddr id 1 param port type dev id 2

When all preallocated action instances are exhausted (i.e. in use by table
entries), the behavior switches to the classical tc action approach: for
every table entry created without an "index" attribute, a new action
instance is dynamically allocated. Once an instance is created it is added
to the pool and never freed.

Note that current tc action behavior is maintained:

a) If the user wishes to preallocate more action instances later at runtime
to take advantage of a faster table entry creation (by avoiding dynamic
allocation at table entry creation time), they will have to individually
create actions via the control plane using the classical "tc actions"
command.
For example:

tc actions add action aP4proggie/send_nh \
param dstAddr AA:BB:CC:DD:EE:DD param port eth1

The action is added to the pool of action aP4proggie/send_nh instances and
any table entry creation will grab it. The parameters specified above will
be replaced when the table entry is created.

b) Sharing of action instances works the same way i.e you could autobind
to any action instance in a table entry creation by specifying the action
"index".

___ACTION KIND ACTIVATION___

Once we provided all the necessary information for the new p4 action,
we can go to the final stage: action activation. In this template stage,
we activate the p4 action and make it available for instantiation.
To activate the action template, we issue the following command:

tc p4template update action/aP4proggie/send_nh state active

___OTHER CONTROL COMMANDS___

The lifetime of the p4 action is tied to its pipeline
(see earlier patches). As with all pipeline components, write operations to
action templates (i.e action kinds), such as create, update and delete, can only
be executed if the pipeline is not sealed. Read/get can be issued even after the
pipeline is sealed.

If, after we are done with our action template we want to delete it, we
could issue the following command:

tc p4template del action/aP4proggie/send_nh

Note: If any instances were created for this action (as illustrated
earlier), then the action cannot be deleted until all of its instances
are deleted first.

If we had created more action templates and wanted to flush all of the
action templates from pipeline aP4proggie, one would use the following
command:

tc p4template del action/aP4proggie/

After creating or updating a p4 action, if one wishes to verify that
the p4 action was created correctly, one would use the following sample
command:

tc p4template get action/aP4proggie/send_nh

The above command will display the relevant data for the action,
such as parameter names, types, etc.

If one wanted to check which action templates were associated to a specific
pipeline, one could use the following command:

tc p4template get action/aP4proggie/

Note that this command will only display the name of these action
templates. To verify their specific details, one should use the get
command, which was previously described.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/act_api.h             |    1 +
 include/net/p4tc.h                |   78 +-
 include/net/tc_act/p4tc.h         |   27 +
 include/uapi/linux/p4tc.h         |   50 ++
 include/uapi/linux/tc_act/tc_p4.h |   11 +
 net/sched/p4tc/Makefile           |    3 +-
 net/sched/p4tc/p4tc_action.c      | 1109 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_pipeline.c    |   16 +-
 net/sched/p4tc/p4tc_tmpl_api.c    |   10 +
 9 files changed, 1294 insertions(+), 11 deletions(-)
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/tc_act/tc_p4.h
 create mode 100644 net/sched/p4tc/p4tc_action.c

diff --git a/include/net/act_api.h b/include/net/act_api.h
index d35870fbf..c554776ff 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -70,6 +70,7 @@ struct tc_action {
 #define TCA_ACT_FLAGS_AT_INGRESS	(1U << (TCA_ACT_FLAGS_USER_BITS + 4))
 #define TCA_ACT_FLAGS_PREALLOC	(1U << (TCA_ACT_FLAGS_USER_BITS + 5))
 #define TCA_ACT_FLAGS_UNREFERENCED	(1U << (TCA_ACT_FLAGS_USER_BITS + 6))
+#define TCA_ACT_FLAGS_FROM_P4TC	(1U << (TCA_ACT_FLAGS_USER_BITS + 7))
 
 /* Update lastuse only if needed, to avoid dirtying a cache line.
  * We use a temp variable to avoid fetching jiffies twice.
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 2cb5d06c3..33cc8bb13 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -9,17 +9,23 @@
 #include <linux/refcount.h>
 #include <linux/rhashtable.h>
 #include <linux/rhashtable-types.h>
+#include <net/tc_act/p4tc.h>
+#include <net/p4tc_types.h>
 
 #define P4TC_DEFAULT_NUM_TABLES P4TC_MINTABLES_COUNT
 #define P4TC_DEFAULT_MAX_RULES 1
 #define P4TC_PATH_MAX 3
+#define P4TC_MAX_TENTRIES 0x2000000
 
 #define P4TC_KERNEL_PIPEID 0
 
 #define P4TC_PID_IDX 0
+#define P4TC_AID_IDX 1
+#define P4TC_PARSEID_IDX 1
 
 struct p4tc_dump_ctx {
 	u32 ids[P4TC_PATH_MAX];
+	struct rhashtable_iter *iter;
 };
 
 struct p4tc_template_common;
@@ -61,8 +67,10 @@ struct p4tc_template_common {
 
 struct p4tc_pipeline {
 	struct p4tc_template_common common;
+	struct idr                  p_act_idr;
 	struct rcu_head             rcu;
 	struct net                  *net;
+	u32                         num_created_acts;
 	/* Accounts for how many entities are referencing this pipeline.
 	 * As for now only P4 filters can refer to pipelines.
 	 */
@@ -108,18 +116,72 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
 				  const u32 pipeid,
 				  struct netlink_ext_ack *extack);
 
-static inline int p4tc_action_destroy(struct tc_action **acts)
-{
-	int ret = 0;
+struct p4tc_act_param {
+	struct list_head head;
+	struct rcu_head	rcu;
+	void            *value;
+	void            *mask;
+	struct p4tc_type *type;
+	u32             id;
+	u32             index;
+	u16             bitend;
+	u8              flags;
+	u8              __pad0;
+	char            name[P4TC_ACT_PARAM_NAMSIZ];
+};
 
-	if (acts) {
-		ret = tcf_action_destroy(acts, TCA_ACT_UNBIND);
-		kfree(acts);
-	}
+struct p4tc_act_param_ops {
+	int (*init_value)(struct net *net, struct p4tc_act_param_ops *op,
+			  struct p4tc_act_param *nparam, struct nlattr **tb,
+			  struct netlink_ext_ack *extack);
+	int (*dump_value)(struct sk_buff *skb, struct p4tc_act_param_ops *op,
+			  struct p4tc_act_param *param);
+	void (*free)(struct p4tc_act_param *param);
+	u32 len;
+	u32 alloc_len;
+};
 
-	return ret;
+struct p4tc_act {
+	struct p4tc_template_common common;
+	struct tc_action_ops        ops;
+	struct tc_action_net        *tn;
+	struct p4tc_pipeline        *pipeline;
+	struct idr                  params_idr;
+	struct tcf_exts             exts;
+	struct list_head            head;
+	struct list_head            prealloc_list;
+	/* Locks the preallocated actions list.
+	 * The list will be used whenever a table entry with an action or a
+	 * table default action gets created, updated or deleted. Note that
+	 * table entries may be added by both control and data path, so the
+	 * list can be modified from both contexts.
+	 */
+	spinlock_t                  list_lock;
+	u32                         a_id;
+	u32                         num_params;
+	u32                         num_prealloc_acts;
+	/* Accounts for how many entities refer to this action. Usually just the
+	 * pipeline it belongs to.
+	 */
+	refcount_t                  a_ref;
+	bool                        active;
+	u32                         num_runt_params;
+	char                        fullname[ACTNAMSIZ];
+};
+
+struct p4tc_act *p4a_tmpl_get(struct p4tc_pipeline *pipeline,
+			      const char *act_name, const u32 a_id,
+			      struct netlink_ext_ack *extack);
+struct p4tc_act *p4a_tmpl_find_byid(struct p4tc_pipeline *pipeline,
+				    const u32 a_id);
+
+static inline bool p4tc_action_put_ref(struct p4tc_act *act)
+{
+	return refcount_dec_not_one(&act->a_ref);
 }
 
 #define to_pipeline(t) ((struct p4tc_pipeline *)t)
+#define to_hdrfield(t) ((struct p4tc_hdrfield *)t)
+#define p4tc_to_act(t) ((struct p4tc_act *)t)
 
 #endif
diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
new file mode 100644
index 000000000..c5256d821
--- /dev/null
+++ b/include/net/tc_act/p4tc.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_TC_ACT_P4_H
+#define __NET_TC_ACT_P4_H
+
+#include <net/pkt_cls.h>
+#include <net/act_api.h>
+
+struct tcf_p4act_params {
+	struct idr params_idr;
+	struct p4tc_act_param **params_array;
+	struct rcu_head rcu;
+	u32 num_params;
+	u32 tot_params_sz;
+};
+
+struct tcf_p4act {
+	struct tc_action common;
+	/* Params IDR reference passed during runtime */
+	struct tcf_p4act_params __rcu *params;
+	u32 p_id;
+	u32 act_id;
+	struct list_head node;
+};
+
+#define to_p4act(a) ((struct tcf_p4act *)a)
+
+#endif /* __NET_TC_ACT_P4_H */
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 8d8ffcb9e..d07e331bc 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -4,6 +4,9 @@
 
 #include <linux/types.h>
 #include <linux/pkt_sched.h>
+#include <linux/pkt_cls.h>
+
+#include <linux/tc_act/tc_p4.h>
 
 /* pipeline header */
 struct p4tcmsg {
@@ -17,9 +20,12 @@ struct p4tcmsg {
 #define P4TC_MSGBATCH_SIZE 16
 
 #define P4TC_MAX_KEYSZ 512
+#define P4TC_DEFAULT_NUM_PREALLOC 16
 
 #define P4TC_TMPL_NAMSZ 32
 #define P4TC_PIPELINE_NAMSIZ P4TC_TMPL_NAMSZ
+#define P4TC_ACT_TMPL_NAMSZ P4TC_TMPL_NAMSZ
+#define P4TC_ACT_PARAM_NAMSIZ P4TC_TMPL_NAMSZ
 
 /* Root attributes */
 enum {
@@ -35,6 +41,7 @@ enum {
 enum {
 	P4TC_OBJ_UNSPEC,
 	P4TC_OBJ_PIPELINE,
+	P4TC_OBJ_ACT,
 	__P4TC_OBJ_MAX,
 };
 
@@ -45,6 +52,7 @@ enum {
 	P4TC_UNSPEC,
 	P4TC_PATH,
 	P4TC_PARAMS,
+	P4TC_COUNT,
 	__P4TC_MAX,
 };
 
@@ -93,6 +101,48 @@ enum {
 
 #define P4TC_T_MAX (__P4TC_T_MAX - 1)
 
+/* Action attributes */
+enum {
+	P4TC_ACT_UNSPEC,
+	P4TC_ACT_NAME, /* string - mandatory for create */
+	P4TC_ACT_PARMS, /* nested params */
+	P4TC_ACT_OPT, /* binary - struct tc_act_p4 */
+	P4TC_ACT_TM, /* binary - struct tcf_t */
+	P4TC_ACT_ACTIVE, /* u8 */
+	P4TC_ACT_NUM_PREALLOC, /* u32 num preallocated action instances */
+	P4TC_ACT_PAD,
+	__P4TC_ACT_MAX
+};
+
+#define P4TC_ACT_MAX (__P4TC_ACT_MAX - 1)
+enum {
+	P4TC_ACT_PARAMS_TYPE_UNSPEC,
+	P4TC_ACT_PARAMS_TYPE_BITEND, /* u16 */
+	P4TC_ACT_PARAMS_TYPE_CONTAINER_ID, /* u32 */
+	__P4TC_ACT_PARAMS_TYPE_MAX
+};
+
+#define P4TC_ACT_PARAMS_TYPE_MAX (__P4TC_ACT_PARAMS_TYPE_MAX - 1)
+
+enum {
+	P4TC_ACT_PARAMS_FLAGS_RUNT,
+	__P4TC_ACT_PARAMS_FLAGS_MAX
+};
+
+#define P4TC_ACT_PARAMS_FLAGS_MAX (__P4TC_ACT_PARAMS_FLAGS_MAX - 1)
+
+/* Action params attributes */
+enum {
+	P4TC_ACT_PARAMS_UNSPEC,
+	P4TC_ACT_PARAMS_NAME, /* string - mandatory for params create */
+	P4TC_ACT_PARAMS_ID, /* u32 */
+	P4TC_ACT_PARAMS_TYPE, /* nested type - mandatory for params create */
+	P4TC_ACT_PARAMS_FLAGS, /* u8 */
+	__P4TC_ACT_PARAMS_MAX
+};
+
+#define P4TC_ACT_PARAMS_MAX (__P4TC_ACT_PARAMS_MAX - 1)
+
 #define P4TC_RTA(r) \
 	((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
 
diff --git a/include/uapi/linux/tc_act/tc_p4.h b/include/uapi/linux/tc_act/tc_p4.h
new file mode 100644
index 000000000..874d85c9f
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_p4.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __LINUX_TC_P4_H
+#define __LINUX_TC_P4_H
+
+#include <linux/pkt_cls.h>
+
+struct tc_act_p4 {
+	tc_gen;
+};
+
+#endif
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 0881a7563..7dbcf8915 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o
+obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
+	p4tc_action.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
new file mode 100644
index 000000000..597a14006
--- /dev/null
+++ b/net/sched/p4tc/p4tc_action.c
@@ -0,0 +1,1109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_action.c	P4 TC ACTION
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/kmod.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <net/flow_offload.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/sch_generic.h>
+#include <net/sock.h>
+#include <net/tc_act/p4tc.h>
+
+static void p4a_parm_put(struct p4tc_act_param *param)
+{
+	kfree(param);
+}
+
+static const struct nla_policy p4a_parm_policy[P4TC_ACT_PARAMS_MAX + 1] = {
+	[P4TC_ACT_PARAMS_NAME] = {
+		.type = NLA_STRING,
+		.len = P4TC_ACT_PARAM_NAMSIZ
+	},
+	[P4TC_ACT_PARAMS_ID] = { .type = NLA_U32 },
+	[P4TC_ACT_PARAMS_TYPE] = { .type = NLA_NESTED },
+	[P4TC_ACT_PARAMS_FLAGS] =
+		NLA_POLICY_RANGE(NLA_U8, 0,
+				 BIT(P4TC_ACT_PARAMS_FLAGS_MAX + 1) - 1),
+};
+
+static struct p4tc_act_param *
+p4a_parm_find_byname(struct idr *params_idr, const char *param_name)
+{
+	struct p4tc_act_param *param;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(params_idr, param, tmp, id) {
+		if (param == ERR_PTR(-EBUSY))
+			continue;
+		if (strncmp(param->name, param_name,
+			    P4TC_ACT_PARAM_NAMSIZ) == 0)
+			return param;
+	}
+
+	return NULL;
+}
+
+static struct p4tc_act_param *
+p4a_parm_find_byid(struct idr *params_idr, const u32 param_id)
+{
+	return idr_find(params_idr, param_id);
+}
+
+static struct p4tc_act_param *
+p4a_parm_find_byany(struct p4tc_act *act, const char *param_name,
+		    const u32 param_id, struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *param;
+	int err;
+
+	if (param_id) {
+		param = p4a_parm_find_byid(&act->params_idr, param_id);
+		if (!param) {
+			NL_SET_ERR_MSG(extack, "Unable to find param by id");
+			err = -EINVAL;
+			goto out;
+		}
+	} else {
+		if (param_name) {
+			param = p4a_parm_find_byname(&act->params_idr,
+						     param_name);
+			if (!param) {
+				NL_SET_ERR_MSG(extack, "Param name not found");
+				err = -EINVAL;
+				goto out;
+			}
+		} else {
+			NL_SET_ERR_MSG(extack, "Must specify param name or id");
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	return param;
+
+out:
+	return ERR_PTR(err);
+}
+
+static struct p4tc_act_param *
+p4a_parm_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
+			const u32 param_id,
+			struct netlink_ext_ack *extack)
+{
+	char *param_name = NULL;
+
+	if (name_attr)
+		param_name = nla_data(name_attr);
+
+	return p4a_parm_find_byany(act, param_name, param_id, extack);
+}
+
+static const struct nla_policy
+p4a_parm_type_policy[P4TC_ACT_PARAMS_TYPE_MAX + 1] = {
+	[P4TC_ACT_PARAMS_TYPE_BITEND] = { .type = NLA_U16 },
+	[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID] = { .type = NLA_U32 },
+};
+
+static int
+__p4a_parm_init_type(struct p4tc_act_param *param, struct nlattr *nla,
+		     struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ACT_PARAMS_TYPE_MAX + 1];
+	struct p4tc_type *type;
+	u32 container_id;
+	u16 bitend;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_ACT_PARAMS_TYPE_MAX, nla,
+			       p4a_parm_type_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (tb[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID]) {
+		container_id =
+			nla_get_u32(tb[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID]);
+
+		type = p4type_find_byid(container_id);
+		if (!type) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Invalid container type id %u",
+					   container_id);
+			return -EINVAL;
+		}
+	} else {
+		NL_SET_ERR_MSG(extack, "Must specify type container id");
+		return -EINVAL;
+	}
+
+	if (tb[P4TC_ACT_PARAMS_TYPE_BITEND]) {
+		bitend = nla_get_u16(tb[P4TC_ACT_PARAMS_TYPE_BITEND]);
+	} else {
+		NL_SET_ERR_MSG(extack, "Must specify bitend");
+		return -EINVAL;
+	}
+
+	param->type = type;
+	param->bitend = bitend;
+
+	return 0;
+}
+
+static struct p4tc_act *
+p4a_tmpl_find_byname(const char *fullname, struct p4tc_pipeline *pipeline,
+		     struct netlink_ext_ack *extack)
+{
+	unsigned long tmp, id;
+	struct p4tc_act *act;
+
+	idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, id)
+		if (strncmp(act->fullname, fullname, ACTNAMSIZ) == 0)
+			return act;
+
+	return NULL;
+}
+
+static int p4a_parm_type_fill(struct sk_buff *skb, struct p4tc_act_param *param)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+
+	if (nla_put_u16(skb, P4TC_ACT_PARAMS_TYPE_BITEND, param->bitend))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, P4TC_ACT_PARAMS_TYPE_CONTAINER_ID,
+			param->type->typeid))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+struct p4tc_act *p4a_tmpl_find_byid(struct p4tc_pipeline *pipeline,
+				    const u32 a_id)
+{
+	return idr_find(&pipeline->p_act_idr, a_id);
+}
+
+static struct p4tc_act *
+p4a_tmpl_find_byany(struct p4tc_pipeline *pipeline,
+		    const char *act_name, const u32 a_id,
+		    struct netlink_ext_ack *extack)
+{
+	struct p4tc_act *act;
+	int err;
+
+	if (a_id) {
+		act = p4a_tmpl_find_byid(pipeline, a_id);
+		if (!act) {
+			NL_SET_ERR_MSG(extack, "Unable to find action by id");
+			err = -ENOENT;
+			goto out;
+		}
+	} else {
+		if (act_name) {
+			act = p4a_tmpl_find_byname(act_name, pipeline,
+						   extack);
+			if (!act) {
+				NL_SET_ERR_MSG(extack, "Action name not found");
+				err = -ENOENT;
+				goto out;
+			}
+		} else {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify action name or id");
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	return act;
+
+out:
+	return ERR_PTR(err);
+}
+
+struct p4tc_act *p4a_tmpl_get(struct p4tc_pipeline *pipeline,
+			      const char *act_name, const u32 a_id,
+			      struct netlink_ext_ack *extack)
+{
+	struct p4tc_act *act;
+
+	act = p4a_tmpl_find_byany(pipeline, act_name, a_id, extack);
+	if (IS_ERR(act))
+		return act;
+
+	if (!refcount_inc_not_zero(&act->a_ref)) {
+		NL_SET_ERR_MSG(extack, "Action is stale");
+		return ERR_PTR(-EBUSY);
+	}
+
+	return act;
+}
+
+static struct p4tc_act *
+p4a_tmpl_find_byanyattr(struct nlattr *attr, const u32 a_id,
+			struct p4tc_pipeline *pipeline,
+			struct netlink_ext_ack *extack)
+{
+	char fullname[ACTNAMSIZ] = {};
+	char *actname = NULL;
+
+	if (attr) {
+		actname = nla_data(attr);
+
+		snprintf(fullname, ACTNAMSIZ, "%s/%s", pipeline->common.name,
+			 actname);
+	}
+
+	return p4a_tmpl_find_byany(pipeline, fullname, a_id, extack);
+}
+
+static void p4a_tmpl_parms_put_many(struct idr *params_idr)
+{
+	struct p4tc_act_param *param;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(params_idr, param, tmp, id)
+		p4a_parm_put(param);
+}
+
+static int
+p4a_parm_init_type(struct p4tc_act_param *param, struct nlattr *nla,
+		   struct netlink_ext_ack *extack)
+{
+	struct p4tc_type *type;
+	int ret;
+
+	ret = __p4a_parm_init_type(param, nla, extack);
+	if (ret < 0)
+		return ret;
+
+	type = param->type;
+	ret = type->ops->validate_p4t(type, NULL, 0, param->bitend, extack);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
+static struct p4tc_act_param *
+p4a_tmpl_parm_create(struct p4tc_act *act, struct idr *params_idr,
+		     struct nlattr **tb, u32 param_id,
+		     struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *param;
+	char *name;
+	int ret;
+
+	if (tb[P4TC_ACT_PARAMS_NAME]) {
+		name = nla_data(tb[P4TC_ACT_PARAMS_NAME]);
+	} else {
+		NL_SET_ERR_MSG(extack, "Must specify param name");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	param = kzalloc(sizeof(*param), GFP_KERNEL);
+	if (!param) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (p4a_parm_find_byid(&act->params_idr, param_id) ||
+	    p4a_parm_find_byname(&act->params_idr, name)) {
+		NL_SET_ERR_MSG(extack, "Param already exists");
+		ret = -EEXIST;
+		goto free;
+	}
+
+	if (tb[P4TC_ACT_PARAMS_TYPE]) {
+		ret = p4a_parm_init_type(param, tb[P4TC_ACT_PARAMS_TYPE],
+					 extack);
+		if (ret < 0)
+			goto free;
+	} else {
+		NL_SET_ERR_MSG(extack, "Must specify param type");
+		ret = -EINVAL;
+		goto free;
+	}
+
+	if (param_id) {
+		ret = idr_alloc_u32(params_idr, param, &param_id,
+				    param_id, GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+			goto free;
+		}
+		param->id = param_id;
+	} else {
+		param->id = 1;
+
+		ret = idr_alloc_u32(params_idr, param, &param->id,
+				    UINT_MAX, GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+			goto free;
+		}
+	}
+
+	if (tb[P4TC_ACT_PARAMS_FLAGS])
+		param->flags = nla_get_u8(tb[P4TC_ACT_PARAMS_FLAGS]);
+
+	strscpy(param->name, name, P4TC_ACT_PARAM_NAMSIZ);
+
+	return param;
+
+free:
+	kfree(param);
+
+out:
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_act_param *
+p4a_tmpl_parm_update(struct p4tc_act *act, struct nlattr **tb,
+		     struct idr *params_idr, u32 param_id,
+		     struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *param_old, *param;
+	u8 flags;
+	int ret;
+
+	param_old = p4a_parm_find_byanyattr(act, tb[P4TC_ACT_PARAMS_NAME],
+					    param_id, extack);
+	if (IS_ERR(param_old))
+		return param_old;
+
+	flags = param_old->flags;
+
+	param = kzalloc(sizeof(*param), GFP_KERNEL);
+	if (!param) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	strscpy(param->name, param_old->name, P4TC_ACT_PARAM_NAMSIZ);
+	param->id = param_old->id;
+
+	if (tb[P4TC_ACT_PARAMS_TYPE]) {
+		ret = p4a_parm_init_type(param, tb[P4TC_ACT_PARAMS_TYPE],
+					 extack);
+		if (ret < 0)
+			goto free;
+	} else {
+		param->type = param_old->type;
+		param->bitend = param_old->bitend;
+	}
+
+	ret = idr_alloc_u32(params_idr, param, &param->id,
+			    param->id, GFP_KERNEL);
+	if (ret < 0) {
+		NL_SET_ERR_MSG(extack, "Unable to allocate param id");
+		goto free;
+	}
+
+	if (tb[P4TC_ACT_PARAMS_FLAGS])
+		flags = nla_get_u8(tb[P4TC_ACT_PARAMS_FLAGS]);
+
+	param->flags = flags;
+
+	return param;
+
+free:
+	kfree(param);
+out:
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_act_param *
+p4a_tmpl_parm_init(struct p4tc_act *act, struct nlattr *nla,
+		   struct idr *params_idr, bool update,
+		   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ACT_PARAMS_MAX + 1];
+	u32 param_id = 0;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_ACT_PARAMS_MAX, nla, p4a_parm_policy,
+			       extack);
+	if (ret < 0)
+		goto out;
+
+	if (tb[P4TC_ACT_PARAMS_ID])
+		param_id = nla_get_u32(tb[P4TC_ACT_PARAMS_ID]);
+
+	if (update)
+		return p4a_tmpl_parm_update(act, tb, params_idr, param_id,
+					    extack);
+	else
+		return p4a_tmpl_parm_create(act, params_idr, tb, param_id,
+					    extack);
+
+out:
+	return ERR_PTR(ret);
+}
+
+static int p4a_tmpl_parms_init(struct p4tc_act *act, struct nlattr *nla,
+			       struct idr *params_idr, bool update,
+			       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+	int ret;
+	int i;
+
+	ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+	if (ret < 0)
+		return ret;
+
+	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+		struct p4tc_act_param *param;
+
+		param = p4a_tmpl_parm_init(act, tb[i], params_idr, update,
+					   extack);
+		if (IS_ERR(param)) {
+			ret = PTR_ERR(param);
+			goto params_del;
+		}
+	}
+
+	return i - 1;
+
+params_del:
+	p4a_tmpl_parms_put_many(params_idr);
+	return ret;
+}
+
+static int p4a_tmpl_init(struct p4tc_act *act, struct nlattr *nla,
+			 struct netlink_ext_ack *extack)
+{
+	int num_params = 0;
+	int ret;
+
+	idr_init(&act->params_idr);
+
+	if (nla) {
+		num_params =
+			p4a_tmpl_parms_init(act, nla, &act->params_idr, false,
+					    extack);
+		if (num_params < 0) {
+			ret = num_params;
+			goto idr_destroy;
+		}
+	}
+
+	return num_params;
+
+idr_destroy:
+	/* Params were already freed by p4a_tmpl_parms_init() on failure */
+	idr_destroy(&act->params_idr);
+	return ret;
+}
+
+static struct netlink_range_validation prealloc_range = {
+	.min = 1,
+	.max = P4TC_MAX_TENTRIES,
+};
+
+static const struct nla_policy p4a_tmpl_policy[P4TC_ACT_MAX + 1] = {
+	[P4TC_ACT_NAME] = { .type = NLA_STRING, .len = P4TC_ACT_TMPL_NAMSZ },
+	[P4TC_ACT_PARMS] = { .type = NLA_NESTED },
+	[P4TC_ACT_OPT] = NLA_POLICY_EXACT_LEN(sizeof(struct tc_act_p4)),
+	[P4TC_ACT_NUM_PREALLOC] =
+		NLA_POLICY_FULL_RANGE(NLA_U32, &prealloc_range),
+	[P4TC_ACT_ACTIVE] = { .type = NLA_U8 },
+};
+
+static void p4a_tmpl_parms_put(struct p4tc_act *act)
+{
+	struct p4tc_act_param *act_param;
+	unsigned long param_id, tmp;
+
+	idr_for_each_entry_ul(&act->params_idr, act_param, tmp, param_id) {
+		idr_remove(&act->params_idr, param_id);
+		kfree(act_param);
+	}
+}
+
+static int __p4a_tmpl_put(struct net *net, struct p4tc_pipeline *pipeline,
+			  struct p4tc_act *act, bool teardown,
+			  struct netlink_ext_ack *extack)
+{
+	struct tcf_p4act *p4act, *tmp_act;
+
+	if (!teardown && refcount_read(&act->a_ref) > 1) {
+		NL_SET_ERR_MSG(extack,
+			       "Unable to delete referenced action template");
+		return -EBUSY;
+	}
+
+	p4a_tmpl_parms_put(act);
+
+	tcf_unregister_p4_action(net, &act->ops);
+	/* Free preallocated acts */
+	list_for_each_entry_safe(p4act, tmp_act, &act->prealloc_list, node) {
+		list_del_init(&p4act->node);
+		if (p4act->common.tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+			tcf_idr_release(&p4act->common, true);
+	}
+
+	idr_remove(&pipeline->p_act_idr, act->a_id);
+
+	list_del(&act->head);
+
+	kfree(act);
+
+	pipeline->num_created_acts--;
+
+	return 0;
+}
+
+static int _p4a_tmpl_fill_nlmsg(struct net *net, struct sk_buff *skb,
+				struct p4tc_act *act)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_act_param *param;
+	struct nlattr *nest, *parms;
+	unsigned long param_id, tmp;
+	int i = 1;
+
+	if (nla_put_u32(skb, P4TC_PATH, act->a_id))
+		goto out_nlmsg_trim;
+
+	nest = nla_nest_start(skb, P4TC_PARAMS);
+	if (!nest)
+		goto out_nlmsg_trim;
+
+	if (nla_put_string(skb, P4TC_ACT_NAME, act->fullname))
+		goto out_nlmsg_trim;
+
+	if (nla_put_u32(skb, P4TC_ACT_NUM_PREALLOC, act->num_prealloc_acts))
+		goto out_nlmsg_trim;
+
+	parms = nla_nest_start(skb, P4TC_ACT_PARMS);
+	if (!parms)
+		goto out_nlmsg_trim;
+
+	idr_for_each_entry_ul(&act->params_idr, param, tmp, param_id) {
+		struct nlattr *nest_count;
+		struct nlattr *nest_type;
+
+		nest_count = nla_nest_start(skb, i);
+		if (!nest_count)
+			goto out_nlmsg_trim;
+
+		if (nla_put_string(skb, P4TC_ACT_PARAMS_NAME, param->name))
+			goto out_nlmsg_trim;
+
+		if (nla_put_u32(skb, P4TC_ACT_PARAMS_ID, param->id))
+			goto out_nlmsg_trim;
+
+		nest_type = nla_nest_start(skb, P4TC_ACT_PARAMS_TYPE);
+		if (!nest_type)
+			goto out_nlmsg_trim;
+
+		if (p4a_parm_type_fill(skb, param) < 0)
+			goto out_nlmsg_trim;
+		nla_nest_end(skb, nest_type);
+
+		if (nla_put_u8(skb, P4TC_ACT_PARAMS_FLAGS, param->flags))
+			goto out_nlmsg_trim;
+
+		nla_nest_end(skb, nest_count);
+		i++;
+	}
+	nla_nest_end(skb, parms);
+
+	nla_nest_end(skb, nest);
+
+	return skb->len;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4a_tmpl_fill_nlmsg(struct net *net, struct sk_buff *skb,
+			       struct p4tc_template_common *tmpl,
+			       struct netlink_ext_ack *extack)
+{
+	return _p4a_tmpl_fill_nlmsg(net, skb, p4tc_to_act(tmpl));
+}
+
+static int p4a_tmpl_flush(struct sk_buff *skb, struct net *net,
+			  struct p4tc_pipeline *pipeline,
+			  struct netlink_ext_ack *extack)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	unsigned long tmp, act_id;
+	struct p4tc_act *act;
+	int ret = 0;
+	int i = 0;
+
+	if (nla_put_u32(skb, P4TC_PATH, 0))
+		goto out_nlmsg_trim;
+
+	if (idr_is_empty(&pipeline->p_act_idr)) {
+		NL_SET_ERR_MSG(extack,
+			       "There are no action templates to flush");
+		goto out_nlmsg_trim;
+	}
+
+	idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, act_id) {
+		if (__p4a_tmpl_put(net, pipeline, act, false, extack) < 0) {
+			ret = -EBUSY;
+			continue;
+		}
+		i++;
+	}
+
+	if (nla_put_u32(skb, P4TC_COUNT, i))
+		goto out_nlmsg_trim;
+
+	if (ret < 0) {
+		if (i == 0) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to flush any action template");
+			goto out_nlmsg_trim;
+		} else {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Flushed only %u action templates",
+					   i);
+		}
+	}
+
+	return i;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static int p4a_tmpl_gd(struct net *net, struct sk_buff *skb,
+		       struct nlmsghdr *n, struct nlattr *nla,
+		       struct p4tc_path_nlattrs *nl_path_attrs,
+		       struct netlink_ext_ack *extack)
+{
+	u32 *ids = nl_path_attrs->ids;
+	const u32 pipeid = ids[P4TC_PID_IDX], a_id = ids[P4TC_AID_IDX];
+	struct nlattr *tb[P4TC_ACT_MAX + 1] = { NULL };
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_act *act;
+	int ret = 0;
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE)
+		pipeline =
+			p4tc_pipeline_find_byany_unsealed(net,
+							  nl_path_attrs->pname,
+							  pipeid, extack);
+	else
+		pipeline = p4tc_pipeline_find_byany(net,
+						    nl_path_attrs->pname,
+						    pipeid, extack);
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	if (nla) {
+		ret = nla_parse_nested(tb, P4TC_ACT_MAX, nla,
+				       p4a_tmpl_policy, extack);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE && (n->nlmsg_flags & NLM_F_ROOT))
+		return p4a_tmpl_flush(skb, net, pipeline, extack);
+
+	act = p4a_tmpl_find_byanyattr(tb[P4TC_ACT_NAME], a_id, pipeline,
+				      extack);
+	if (IS_ERR(act))
+		return PTR_ERR(act);
+
+	if (_p4a_tmpl_fill_nlmsg(net, skb, act) < 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Failed to fill notification attributes for template action");
+		return -EINVAL;
+	}
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+		ret = __p4a_tmpl_put(net, pipeline, act, false, extack);
+		if (ret < 0)
+			goto out_nlmsg_trim;
+	}
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static int p4a_tmpl_put(struct p4tc_pipeline *pipeline,
+			struct p4tc_template_common *tmpl,
+			struct netlink_ext_ack *extack)
+{
+	struct p4tc_act *act = p4tc_to_act(tmpl);
+
+	return __p4a_tmpl_put(pipeline->net, pipeline, act, true, extack);
+}
+
+static void p4a_tmpl_parm_idx_set(struct idr *params_idr)
+{
+	struct p4tc_act_param *param;
+	unsigned long tmp, id;
+	int i = 0;
+
+	idr_for_each_entry_ul(params_idr, param, tmp, id) {
+		param->index = i;
+		i++;
+	}
+}
+
+static void p4a_tmpl_parms_replace_many(struct p4tc_act *act,
+					struct idr *params_idr)
+{
+	struct p4tc_act_param *param;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(params_idr, param, tmp, id) {
+		idr_remove(params_idr, param->id);
+		param = idr_replace(&act->params_idr, param, param->id);
+		p4a_parm_put(param);
+	}
+}
+
+static const struct p4tc_template_ops p4tc_act_ops;
+
+static struct p4tc_act *
+p4a_tmpl_create(struct net *net, struct nlattr **tb,
+		struct p4tc_pipeline *pipeline, u32 *ids,
+		struct netlink_ext_ack *extack)
+{
+	u32 a_id = ids[P4TC_AID_IDX];
+	char fullname[ACTNAMSIZ];
+	struct p4tc_act *act;
+	int num_params = 0;
+	size_t nbytes;
+	char *actname;
+	int ret = 0;
+
+	if (!tb[P4TC_ACT_NAME]) {
+		NL_SET_ERR_MSG(extack, "Must supply action name");
+		return ERR_PTR(-EINVAL);
+	}
+
+	actname = nla_data(tb[P4TC_ACT_NAME]);
+
+	nbytes = snprintf(fullname, ACTNAMSIZ, "%s/%s", pipeline->common.name,
+			  actname);
+	if (nbytes >= ACTNAMSIZ) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "Full action name should fit in %u bytes",
+				   ACTNAMSIZ);
+		return ERR_PTR(-E2BIG);
+	}
+
+	if (p4a_tmpl_find_byname(fullname, pipeline, extack)) {
+		NL_SET_ERR_MSG(extack, "Action already exists with same name");
+		return ERR_PTR(-EEXIST);
+	}
+
+	if (p4a_tmpl_find_byid(pipeline, a_id)) {
+		NL_SET_ERR_MSG(extack, "Action already exists with same id");
+		return ERR_PTR(-EEXIST);
+	}
+
+	act = kzalloc(sizeof(*act), GFP_KERNEL);
+	if (!act)
+		return ERR_PTR(-ENOMEM);
+
+	if (a_id) {
+		ret = idr_alloc_u32(&pipeline->p_act_idr, act, &a_id, a_id,
+				    GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to alloc action id");
+			goto free_act;
+		}
+
+		act->a_id = a_id;
+	} else {
+		act->a_id = 1;
+
+		ret = idr_alloc_u32(&pipeline->p_act_idr, act, &act->a_id,
+				    UINT_MAX, GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to alloc action id");
+			goto free_act;
+		}
+	}
+
+	/* Action instances are only preallocated once the action template
+	 * is activated during an update.
+	 */
+	if (tb[P4TC_ACT_NUM_PREALLOC])
+		act->num_prealloc_acts = nla_get_u32(tb[P4TC_ACT_NUM_PREALLOC]);
+	else
+		act->num_prealloc_acts = P4TC_DEFAULT_NUM_PREALLOC;
+
+	num_params = p4a_tmpl_init(act, tb[P4TC_ACT_PARMS], extack);
+	if (num_params < 0) {
+		ret = num_params;
+		goto idr_rm;
+	}
+	act->num_params = num_params;
+
+	p4a_tmpl_parm_idx_set(&act->params_idr);
+
+	act->pipeline = pipeline;
+
+	pipeline->num_created_acts++;
+
+	act->common.p_id = pipeline->common.p_id;
+
+	strscpy(act->fullname, fullname, ACTNAMSIZ);
+	strscpy(act->common.name, actname, P4TC_ACT_TMPL_NAMSZ);
+
+	act->common.ops = (struct p4tc_template_ops *)&p4tc_act_ops;
+
+	refcount_set(&act->a_ref, 1);
+
+	INIT_LIST_HEAD(&act->prealloc_list);
+	spin_lock_init(&act->list_lock);
+
+	return act;
+
+idr_rm:
+	idr_remove(&pipeline->p_act_idr, act->a_id);
+
+free_act:
+	kfree(act);
+
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_act *
+p4a_tmpl_update(struct net *net, struct nlattr **tb,
+		struct p4tc_pipeline *pipeline, u32 *ids,
+		u32 flags, struct netlink_ext_ack *extack)
+{
+	const u32 a_id = ids[P4TC_AID_IDX];
+	bool updates_params = false;
+	struct idr params_idr;
+	u32 num_prealloc_acts;
+	struct p4tc_act *act;
+	int num_params = 0;
+	s8 active = -1;
+	int ret = 0;
+
+	act = p4a_tmpl_find_byanyattr(tb[P4TC_ACT_NAME], a_id, pipeline,
+				      extack);
+	if (IS_ERR(act))
+		return act;
+
+	if (tb[P4TC_ACT_ACTIVE])
+		active = nla_get_u8(tb[P4TC_ACT_ACTIVE]);
+
+	if (act->active) {
+		if (!active) {
+			act->active = false;
+			return act;
+		}
+		NL_SET_ERR_MSG(extack, "Unable to update active action");
+
+		ret = -EINVAL;
+		goto out;
+	}
+
+	idr_init(&params_idr);
+	if (tb[P4TC_ACT_PARMS]) {
+		num_params = p4a_tmpl_parms_init(act, tb[P4TC_ACT_PARMS],
+						 &params_idr, true, extack);
+		if (num_params < 0) {
+			ret = num_params;
+			goto idr_destroy;
+		}
+		p4a_tmpl_parm_idx_set(&params_idr);
+		updates_params = true;
+	}
+
+	if (tb[P4TC_ACT_NUM_PREALLOC])
+		num_prealloc_acts = nla_get_u32(tb[P4TC_ACT_NUM_PREALLOC]);
+	else
+		num_prealloc_acts = act->num_prealloc_acts;
+
+	act->pipeline = pipeline;
+	if (active == 1) {
+		act->active = true;
+	} else if (!active) {
+		NL_SET_ERR_MSG(extack, "Action is already inactive");
+		ret = -EINVAL;
+		goto params_del;
+	}
+
+	act->num_prealloc_acts = num_prealloc_acts;
+
+	if (updates_params)
+		p4a_tmpl_parms_replace_many(act, &params_idr);
+
+	idr_destroy(&params_idr);
+
+	return act;
+
+params_del:
+	p4a_tmpl_parms_put_many(&params_idr);
+
+idr_destroy:
+	idr_destroy(&params_idr);
+
+out:
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4a_tmpl_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+	    struct p4tc_path_nlattrs *nl_path_attrs,
+	    struct netlink_ext_ack *extack)
+{
+	u32 *ids = nl_path_attrs->ids;
+	const u32 pipeid = ids[P4TC_PID_IDX];
+	struct nlattr *tb[P4TC_ACT_MAX + 1];
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_act *act;
+	int ret;
+
+	pipeline = p4tc_pipeline_find_byany_unsealed(net, nl_path_attrs->pname,
+						     pipeid, extack);
+	if (IS_ERR(pipeline))
+		return ERR_CAST(pipeline);
+
+	ret = nla_parse_nested(tb, P4TC_ACT_MAX, nla, p4a_tmpl_policy,
+			       extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	switch (n->nlmsg_type) {
+	case RTM_CREATEP4TEMPLATE:
+		act = p4a_tmpl_create(net, tb, pipeline, ids, extack);
+		break;
+	case RTM_UPDATEP4TEMPLATE:
+		act = p4a_tmpl_update(net, tb, pipeline, ids,
+				      n->nlmsg_flags, extack);
+		break;
+	default:
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (IS_ERR(act))
+		goto out;
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+out:
+	return (struct p4tc_template_common *)act;
+}
+
+static int p4a_tmpl_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+			 struct nlattr *nla, char **p_name, u32 *ids,
+			 struct netlink_ext_ack *extack)
+{
+	struct net *net = sock_net(skb->sk);
+	struct p4tc_pipeline *pipeline;
+
+	if (!ctx->ids[P4TC_PID_IDX]) {
+		pipeline = p4tc_pipeline_find_byany(net, *p_name,
+						    ids[P4TC_PID_IDX], extack);
+		if (IS_ERR(pipeline))
+			return PTR_ERR(pipeline);
+		ctx->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+	} else {
+		pipeline = p4tc_pipeline_find_byid(net, ctx->ids[P4TC_PID_IDX]);
+	}
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!(*p_name))
+		*p_name = pipeline->common.name;
+
+	return p4tc_tmpl_generic_dump(skb, ctx, &pipeline->p_act_idr,
+				      P4TC_AID_IDX, extack);
+}
+
+static int p4a_tmpl_dump_1(struct sk_buff *skb,
+			   struct p4tc_template_common *common)
+{
+	struct nlattr *param = nla_nest_start(skb, P4TC_PARAMS);
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_act *act = p4tc_to_act(common);
+
+	if (!param)
+		goto out_nlmsg_trim;
+
+	if (nla_put_string(skb, P4TC_ACT_NAME, act->fullname))
+		goto out_nlmsg_trim;
+
+	if (nla_put_u8(skb, P4TC_ACT_ACTIVE, act->active))
+		goto out_nlmsg_trim;
+
+	nla_nest_end(skb, param);
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -ENOMEM;
+}
+
+static const struct p4tc_template_ops p4tc_act_ops = {
+	.cu = p4a_tmpl_cu,
+	.put = p4a_tmpl_put,
+	.gd = p4a_tmpl_gd,
+	.fill_nlmsg = p4a_tmpl_fill_nlmsg,
+	.dump = p4a_tmpl_dump,
+	.dump_1 = p4a_tmpl_dump_1,
+	.obj_id = P4TC_OBJ_ACT,
+};
+
+static int __init p4tc_act_init(void)
+{
+	p4tc_tmpl_register_ops(&p4tc_act_ops);
+
+	return 0;
+}
+
+subsys_initcall(p4tc_act_init);
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index 936ec777a..626d98734 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -75,6 +75,8 @@ static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
 
 static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
 {
+	idr_destroy(&pipeline->p_act_idr);
+
 	kfree(pipeline);
 }
 
@@ -96,8 +98,12 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
 	struct net *net = pipeline->net;
 	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
 	struct net *pipeline_net = maybe_get_net(net);
+	unsigned long iter_act_id;
+	struct p4tc_act *act;
+	unsigned long tmp;
 
-	idr_remove(&pipe_net->pipeline_idr, pipeline->common.p_id);
+	idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, iter_act_id)
+		act->common.ops->put(pipeline, &act->common, extack);
 
 	/* If we are on netns cleanup we can't touch the pipeline_idr.
 	 * On pre_exit we will destroy the idr but never call into teardown
@@ -154,6 +160,7 @@ static int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
 	}
 
 	pipeline->p_state = P4TC_STATE_READY;
+
 	return true;
 }
 
@@ -253,6 +260,10 @@ p4tc_pipeline_create(struct net *net, struct nlmsghdr *n,
 	else
 		pipeline->num_tables = P4TC_DEFAULT_NUM_TABLES;
 
+	idr_init(&pipeline->p_act_idr);
+
+	pipeline->num_created_acts = 0;
+
 	pipeline->p_state = P4TC_STATE_NOT_READY;
 
 	pipeline->net = net;
@@ -507,7 +518,8 @@ static int p4tc_pipeline_gd(struct net *net, struct sk_buff *skb,
 		return PTR_ERR(pipeline);
 
 	tmpl = (struct p4tc_template_common *)pipeline;
-	if (p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack) < 0)
+	ret = p4tc_pipeline_fill_nlmsg(net, skb, tmpl, extack);
+	if (ret < 0)
 		return -1;
 
 	if (!ids[P4TC_PID_IDX])
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index bb973071a..cc7e23a4a 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -158,6 +158,11 @@ static int tc_ctl_p4_tmpl_1(struct sk_buff *skb, struct nlmsghdr *n,
 
 	ids[P4TC_PID_IDX] = t->pipeid;
 
+	if (tb[P4TC_PATH]) {
+		const u32 *arg_ids = nla_data(tb[P4TC_PATH]);
+
+		memcpy(&ids[P4TC_PID_IDX + 1], arg_ids, nla_len(tb[P4TC_PATH]));
+	}
 	nl_path_attrs.ids = ids;
 
 	nskb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
@@ -365,6 +370,11 @@ static int tc_ctl_p4_tmpl_dump_1(struct sk_buff *skb, struct nlattr *arg,
 	root = nla_nest_start(skb, P4TC_ROOT);
 
 	ids[P4TC_PID_IDX] = t->pipeid;
+	if (tb[P4TC_PATH]) {
+		const u32 *arg_ids = nla_data(tb[P4TC_PATH]);
+
+		memcpy(&ids[P4TC_PID_IDX + 1], arg_ids, nla_len(tb[P4TC_PATH]));
+	}
 
 	obj_op = (struct p4tc_template_ops *)p4tc_ops[t->obj];
 	ret = obj_op->dump(skb, ctx, tb[P4TC_PARAMS], &p_name, ids, extack);
-- 
2.34.1



* [PATCH net-next v12  10/15] p4tc: add runtime action support
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (8 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 09/15] p4tc: add template action create, update, delete, get, flush and dump Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 11/15] p4tc: add template table create, update, delete, get, flush and dump Jamal Hadi Salim
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

This commit deals with the runtime part of P4 actions, i.e. instantiation and
binding of action kinds created via templates (see the previous patch).

For illustration we repeat the P4 code snippet from the action template
commit:

action send_nh(@tc_type("macaddr") bit<48> dstAddr, @tc_type("dev") bit<8> port)
{
    hdr.ethernet.dstAddr = dstAddr;
    send_to_port(port);
}

table mytable {
        key = {
            hdr.ipv4.dstAddr @tc_type("ipv4"): lpm;
        }

        actions = {
            send_nh;
            drop;
            NoAction;
        }

        size = 1024;
}

One could create a table entry alongside an action instance as follows:

tc p4ctrl create aP4proggie/table/mycontrol/mytable \
   dstAddr 10.10.10.0/24 \
   action send_nh param dstAddr AA:BB:CC:DD:EE:FF param port eth0

As previously stated, we refer to the action by its "full name"
(pipeline_name/action_name). In the above, the pipeline_name is inherited from
the table path ("aP4proggie"), so it doesn't need to be specified.
Above we are creating an instance of the send_nh action, specifying as parameter
values AA:BB:CC:DD:EE:FF for dstAddr and eth0 for port.

We could also create the action instance outside of the table entry, in which
case it is added to the pool of send_nh action kind's instances, as follows:

tc actions add action aP4proggie/send_nh \
param dstAddr AA:BB:CC:DD:EE:FF param port eth0

Observe these are _exactly the same semantics_ as what tc provides today, with
the caveat that the keyword "param" must precede each parameter name and value.

Note: We can create as many instances of an action template as we wish, as
long as we do not exceed the maximum number of allowed action instances - in
this specific case 1024 for table "mytable". Any creation of action instances
beyond the allowed maximum will be rejected by the kernel.

Action sharing still works the same way as in classical tc. For example, if we
know that the action index of the previous instance is 100, then we can bind it
to a table entry as follows:

tc p4ctrl create aP4proggie/table/mycontrol/mytable \
   dstAddr 11.11.0.0/16 action send_nh index 100

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h           |  114 ++++
 include/uapi/linux/p4tc.h    |   13 +
 net/sched/p4tc/p4tc_action.c | 1239 +++++++++++++++++++++++++++++++++-
 3 files changed, 1356 insertions(+), 10 deletions(-)

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 33cc8bb13..21705e731 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -116,6 +116,56 @@ p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
 				  const u32 pipeid,
 				  struct netlink_ext_ack *extack);
 
+struct p4tc_act *p4a_runt_find(struct net *net,
+			       const struct tc_action_ops *a_o,
+			       struct netlink_ext_ack *extack);
+void
+p4a_runt_prealloc_put(struct p4tc_act *act, struct tcf_p4act *p4_act);
+
+static inline int p4tc_action_destroy(struct tc_action *acts[])
+{
+	struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
+	struct tc_action *a;
+	int ret = 0;
+	int j = 0;
+	int i;
+
+	tcf_act_for_each_action(i, a, acts) {
+		if (acts[i]->tcfa_flags & TCA_ACT_FLAGS_PREALLOC) {
+			struct tcf_p4act *p4act;
+			struct p4tc_act *act;
+			struct net *net;
+
+			p4act = (struct tcf_p4act *)acts[i];
+			net = maybe_get_net(acts[i]->idrinfo->net);
+
+			if (net) {
+				const struct tc_action_ops *ops;
+
+				ops = acts[i]->ops;
+				act = p4a_runt_find(net, ops, NULL);
+				p4a_runt_prealloc_put(act, p4act);
+				put_net(net);
+			} else {
+				/* If net is coming down, template
+				 * action will be deleted, so no need to
+				 * remove from prealloc list, just decr
+				 * refcounts.
+				 */
+				acts_non_prealloc[j] = acts[i];
+				j++;
+			}
+		} else {
+			acts_non_prealloc[j] = acts[i];
+			j++;
+		}
+	}
+
+	ret = tcf_action_destroy(acts_non_prealloc, TCA_ACT_UNBIND);
+
+	return ret;
+}
+
 struct p4tc_act_param {
 	struct list_head head;
 	struct rcu_head	rcu;
@@ -169,6 +219,62 @@ struct p4tc_act {
 	char                        fullname[ACTNAMSIZ];
 };
 
+static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
+				   struct tc_action *acts[], u32 pipeid,
+				   u32 flags, struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_ACT_MAX_PRIO + 1] = {};
+	int init_res[TCA_ACT_MAX_PRIO];
+	struct tc_action *a;
+	size_t attrs_size;
+	size_t nacts = 0;
+	int ret;
+	int i;
+
+	ret = nla_parse_nested_deprecated(tb, TCA_ACT_MAX_PRIO, nla, NULL,
+					  extack);
+	if (ret < 0)
+		return ret;
+
+	for (i = 1; i < TCA_ACT_MAX_PRIO + 1; i++)
+		nacts += !!tb[i];
+
+	if (nacts > 1) {
+		NL_SET_ERR_MSG(extack, "Only one action is allowed");
+		return -E2BIG;
+	}
+
+	/* If action was already created, just bind to existing one */
+	flags |= TCA_ACT_FLAGS_BIND;
+	flags |= TCA_ACT_FLAGS_FROM_P4TC;
+	ret = tcf_action_init(net, NULL, nla, NULL, acts, init_res, &attrs_size,
+			      flags, 0, extack);
+
+	/* Check if we are trying to bind to dynamic action from different
+	 * pipeline.
+	 */
+	tcf_act_for_each_action(i, a, acts) {
+		struct tcf_p4act *p;
+
+		if (a->ops->id <= TCA_ID_MAX)
+			continue;
+
+		p = to_p4act(a);
+		if (p->p_id != pipeid) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to bind to dynact from different pipeline");
+			ret = -EPERM;
+			goto destroy_acts;
+		}
+	}
+
+	return ret;
+
+destroy_acts:
+	p4tc_action_destroy(acts);
+	return ret;
+}
+
 struct p4tc_act *p4a_tmpl_get(struct p4tc_pipeline *pipeline,
 			      const char *act_name, const u32 a_id,
 			      struct netlink_ext_ack *extack);
@@ -180,6 +286,14 @@ static inline bool p4tc_action_put_ref(struct p4tc_act *act)
 	return refcount_dec_not_one(&act->a_ref);
 }
 
+struct tcf_p4act *
+p4a_runt_prealloc_get_next(struct p4tc_act *act);
+void p4a_runt_init_flags(struct tcf_p4act *p4act);
+void p4a_runt_parm_destroy(struct p4tc_act_param *parm);
+struct p4tc_act_param *
+p4a_runt_parm_init(struct net *net, struct p4tc_act *act,
+		   struct nlattr *nla, struct netlink_ext_ack *extack);
+
 #define to_pipeline(t) ((struct p4tc_pipeline *)t)
 #define to_hdrfield(t) ((struct p4tc_hdrfield *)t)
 #define p4tc_to_act(t) ((struct p4tc_act *)t)
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index d07e331bc..bb4533689 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -115,6 +115,17 @@ enum {
 };
 
 #define P4TC_ACT_MAX (__P4TC_ACT_MAX - 1)
+
+/* Action params value attributes */
+
+enum {
+	P4TC_ACT_PARAMS_VALUE_UNSPEC,
+	P4TC_ACT_PARAMS_VALUE_RAW, /* binary */
+	__P4TC_ACT_PARAMS_VALUE_MAX
+};
+
+#define P4TC_ACT_VALUE_PARAMS_MAX (__P4TC_ACT_PARAMS_VALUE_MAX - 1)
+
 enum {
 	P4TC_ACT_PARAMS_TYPE_UNSPEC,
 	P4TC_ACT_PARAMS_TYPE_BITEND, /* u16 */
@@ -138,6 +149,8 @@ enum {
 	P4TC_ACT_PARAMS_ID, /* u32 */
 	P4TC_ACT_PARAMS_TYPE, /* nested type - mandatory for params create */
 	P4TC_ACT_PARAMS_FLAGS, /* u8 */
+	P4TC_ACT_PARAMS_VALUE, /* bytes - mandatory for runtime params create */
+	P4TC_ACT_PARAMS_MASK, /* bytes */
 	__P4TC_ACT_PARAMS_MAX
 };
 
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
index 597a14006..d47ccd69a 100644
--- a/net/sched/p4tc/p4tc_action.c
+++ b/net/sched/p4tc/p4tc_action.c
@@ -30,11 +30,516 @@
 #include <net/sock.h>
 #include <net/tc_act/p4tc.h>
 
+static LIST_HEAD(dynact_list);
+
+#define P4TC_ACT_CREATED 1
+#define P4TC_ACT_PREALLOC 2
+#define P4TC_ACT_PREALLOC_UNINIT 3
+
+static int __p4a_runt_init(struct net *net, struct nlattr *est,
+			   struct p4tc_act *act, struct tc_act_p4 *parm,
+			   struct tc_action **a, struct tcf_proto *tp,
+			   struct tc_action_ops *a_o,
+			   struct tcf_chain **goto_ch, u32 flags,
+			   struct netlink_ext_ack *extack)
+{
+	bool from_p4tc = flags & TCA_ACT_FLAGS_FROM_P4TC;
+	bool prealloc = flags & TCA_ACT_FLAGS_PREALLOC;
+	bool replace = flags & TCA_ACT_FLAGS_REPLACE;
+	bool bind = flags & TCA_ACT_FLAGS_BIND;
+	struct p4tc_pipeline *pipeline;
+	struct tcf_p4act *p4act;
+	u32 index = parm->index;
+	bool exists = false;
+	int ret = 0;
+	int err;
+
+	if (from_p4tc && !prealloc && !replace && !index) {
+		p4act = p4a_runt_prealloc_get_next(act);
+
+		if (p4act) {
+			p4a_runt_init_flags(p4act);
+			*a = &p4act->common;
+			return P4TC_ACT_PREALLOC_UNINIT;
+		}
+	}
+
+	err = tcf_idr_check_alloc(act->tn, &index, a, bind);
+	if (err < 0)
+		return err;
+
+	exists = err;
+	if (!exists) {
+		struct tcf_p4act *p;
+
+		ret = tcf_idr_create(act->tn, index, est, a, a_o, bind, true,
+				     flags);
+		if (ret) {
+			tcf_idr_cleanup(act->tn, index);
+			return ret;
+		}
+
+		/* p4_ref here should never be 0, because if we are here, it
+		 * means that a template action of this kind was created. Thus
+		 * p4_ref should be at least 1. Also since this operation and
+		 * others that add or delete action templates run with
+		 * rtnl_lock held, we cannot do this op and a deletion op in
+		 * parallel.
+		 */
+		WARN_ON(!refcount_inc_not_zero(&a_o->p4_ref));
+
+		pipeline = act->pipeline;
+
+		p = to_p4act(*a);
+		p->p_id = pipeline->common.p_id;
+		p->act_id = act->a_id;
+
+		p->common.tcfa_flags |= TCA_ACT_FLAGS_PREALLOC;
+		if (!prealloc && !bind) {
+			spin_lock_bh(&act->list_lock);
+			list_add_tail(&p->node, &act->prealloc_list);
+			spin_unlock_bh(&act->list_lock);
+		}
+
+		ret = P4TC_ACT_CREATED;
+	} else {
+		const u32 tcfa_flags = (*a)->tcfa_flags;
+
+		if (bind) {
+			if ((tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+				if (tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED) {
+					p4act = to_p4act(*a);
+					p4a_runt_init_flags(p4act);
+					return P4TC_ACT_PREALLOC_UNINIT;
+				}
+
+				return P4TC_ACT_PREALLOC;
+			}
+
+			return 0;
+		}
+
+		if (replace) {
+			if ((tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+				if (tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED) {
+					p4act = to_p4act(*a);
+					p4a_runt_init_flags(p4act);
+					ret = P4TC_ACT_PREALLOC_UNINIT;
+				} else {
+					ret = P4TC_ACT_PREALLOC;
+				}
+			}
+		} else {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Action %s with index %u was already created",
+					   (*a)->ops->kind, index);
+			tcf_idr_release(*a, bind);
+			return -EEXIST;
+		}
+	}
+
+	err = tcf_action_check_ctrlact(parm->action, tp, goto_ch, extack);
+	if (err < 0) {
+		tcf_idr_release(*a, bind);
+		return err;
+	}
+
+	return ret;
+}
+
+static void p4a_runt_parm_val_free(struct p4tc_act_param *param)
+{
+	kfree(param->value);
+	kfree(param->mask);
+}
+
+static const struct nla_policy
+p4a_parm_val_policy[P4TC_ACT_VALUE_PARAMS_MAX + 1] = {
+	[P4TC_ACT_PARAMS_VALUE_RAW] = { .type = NLA_BINARY },
+};
+
+static const struct nla_policy
+p4a_parm_type_policy[P4TC_ACT_PARAMS_TYPE_MAX + 1] = {
+	[P4TC_ACT_PARAMS_TYPE_BITEND] = { .type = NLA_U16 },
+	[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID] = { .type = NLA_U32 },
+};
+
+static int p4a_runt_dev_parm_val_init(struct net *net,
+				      struct p4tc_act_param_ops *op,
+				      struct p4tc_act_param *nparam,
+				      struct nlattr **tb,
+				      struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb_value[P4TC_ACT_VALUE_PARAMS_MAX + 1];
+	u32 value_len;
+	u32 *ifindex;
+	int err;
+
+	if (!tb[P4TC_ACT_PARAMS_VALUE]) {
+		NL_SET_ERR_MSG(extack, "Must specify param value");
+		return -EINVAL;
+	}
+	err = nla_parse_nested(tb_value, P4TC_ACT_VALUE_PARAMS_MAX,
+			       tb[P4TC_ACT_PARAMS_VALUE],
+			       p4a_parm_val_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb_value[P4TC_ACT_PARAMS_VALUE_RAW]) {
+		NL_SET_ERR_MSG(extack, "Must specify raw value attr");
+		return -EINVAL;
+	}
+
+	value_len = nla_len(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+	if (value_len != sizeof(u32)) {
+		NL_SET_ERR_MSG(extack, "Value length differs from template's");
+		return -EINVAL;
+	}
+
+	ifindex = nla_data(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+	rcu_read_lock();
+	if (!dev_get_by_index_rcu(net, *ifindex)) {
+		NL_SET_ERR_MSG(extack, "Invalid ifindex");
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	nparam->value = kmemdup(ifindex, sizeof(*ifindex), GFP_KERNEL);
+	if (!nparam->value)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int p4a_runt_dev_parm_val_dump(struct sk_buff *skb,
+				      struct p4tc_act_param_ops *op,
+				      struct p4tc_act_param *param)
+{
+	const u32 *ifindex = param->value;
+	struct nlattr *nest;
+	int ret;
+
+	nest = nla_nest_start(skb, P4TC_ACT_PARAMS_VALUE);
+	if (nla_put_u32(skb, P4TC_ACT_PARAMS_VALUE_RAW, *ifindex)) {
+		ret = -EINVAL;
+		goto out_nla_cancel;
+	}
+	nla_nest_end(skb, nest);
+
+	return 0;
+
+out_nla_cancel:
+	nla_nest_cancel(skb, nest);
+	return ret;
+}
+
+static void p4a_runt_dev_parm_val_free(struct p4tc_act_param *param)
+{
+	kfree(param->value);
+}
+
+static const struct p4tc_act_param_ops param_ops[P4TC_T_MAX + 1] = {
+	[P4TC_T_DEV] = {
+		.init_value = p4a_runt_dev_parm_val_init,
+		.dump_value = p4a_runt_dev_parm_val_dump,
+		.free = p4a_runt_dev_parm_val_free,
+	},
+};
+
+void p4a_runt_parm_destroy(struct p4tc_act_param *parm)
+{
+	struct p4tc_act_param_ops *op;
+
+	op = (struct p4tc_act_param_ops *)&param_ops[parm->type->typeid];
+	if (op->free)
+		op->free(parm);
+	else
+		p4a_runt_parm_val_free(parm);
+	kfree(parm);
+}
+
+static void p4a_runt_parms_destroy(struct tcf_p4act_params *params)
+{
+	struct p4tc_act_param *parm;
+	unsigned long param_id, tmp;
+
+	idr_for_each_entry_ul(&params->params_idr, parm, tmp, param_id) {
+		idr_remove(&params->params_idr, param_id);
+		p4a_runt_parm_destroy(parm);
+	}
+
+	kfree(params->params_array);
+	idr_destroy(&params->params_idr);
+
+	kfree(params);
+}
+
+static void p4a_runt_parms_destroy_rcu(struct rcu_head *head)
+{
+	struct tcf_p4act_params *params;
+
+	params = container_of(head, struct tcf_p4act_params, rcu);
+	p4a_runt_parms_destroy(params);
+}
+
+static int __p4a_runt_init_set(struct p4tc_act *act, struct tc_action **a,
+			       struct tcf_p4act_params *params,
+			       struct tcf_chain *goto_ch,
+			       struct tc_act_p4 *parm, bool exists,
+			       struct netlink_ext_ack *extack)
+{
+	struct tcf_p4act_params *params_old;
+	struct tcf_p4act *p;
+
+	p = to_p4act(*a);
+
+	/* sparse is fooled by lock under conditionals.
+	 * To avoid false positives, we are repeating these two lines in both
+	 * branches of the if-statement
+	 */
+	if (exists) {
+		spin_lock_bh(&p->tcf_lock);
+		goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
+		params_old = rcu_replace_pointer(p->params, params, 1);
+		spin_unlock_bh(&p->tcf_lock);
+	} else {
+		goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
+		params_old = rcu_replace_pointer(p->params, params, 1);
+	}
+
+	if (goto_ch)
+		tcf_chain_put_by_act(goto_ch);
+
+	if (params_old)
+		call_rcu(&params_old->rcu, p4a_runt_parms_destroy_rcu);
+
+	return 0;
+}
+
+static int p4a_runt_init_from_tmpl(struct net *net, struct tc_action **a,
+				   struct p4tc_act *act,
+				   struct idr *params_idr,
+				   struct list_head *params_lst,
+				   struct tc_act_p4 *parm, u32 flags,
+				   struct netlink_ext_ack *extack);
+
+static struct tcf_p4act_params *p4a_runt_parms_alloc(struct p4tc_act *act)
+{
+	struct tcf_p4act_params *params;
+
+	params = kzalloc(sizeof(*params), GFP_KERNEL);
+	if (!params)
+		return ERR_PTR(-ENOMEM);
+
+	params->params_array = kcalloc(act->num_params,
+				       sizeof(struct p4tc_act_param *),
+				       GFP_KERNEL);
+	if (!params->params_array) {
+		kfree(params);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	idr_init(&params->params_idr);
+
+	return params;
+}
+
+static struct p4tc_act_param *
+p4a_runt_prealloc_init_param(struct p4tc_act *act, struct idr *params_idr,
+			     struct p4tc_act_param *param,
+			     unsigned long *param_id,
+			     struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *nparam;
+	void *value;
+
+	nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+	if (!nparam)
+		return ERR_PTR(-ENOMEM);
+
+	value = kzalloc(BITS_TO_BYTES(param->type->container_bitsz),
+			GFP_KERNEL);
+	if (!value) {
+		kfree(nparam);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	strscpy(nparam->name, param->name, P4TC_ACT_PARAM_NAMSIZ);
+	nparam->id = *param_id;
+	nparam->value = value;
+	nparam->type = param->type;
+
+	return nparam;
+}
+
 static void p4a_parm_put(struct p4tc_act_param *param)
 {
 	kfree(param);
 }
 
+static void p4a_runt_parm_put_val(struct p4tc_act_param *param)
+{
+	kfree(param->value);
+	p4a_parm_put(param);
+}
+
+static void p4a_runt_prealloc_list_free(struct list_head *params_list)
+{
+	struct p4tc_act_param *nparam, *p;
+
+	list_for_each_entry_safe(nparam, p, params_list, head) {
+		p4a_runt_parm_put_val(nparam);
+	}
+}
+
+static int p4a_runt_prealloc_params_init(struct p4tc_act *act,
+					 struct idr *params_idr,
+					 struct list_head *params_lst,
+					 struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *param;
+	unsigned long param_id = 0;
+	unsigned long tmp;
+
+	idr_for_each_entry_ul(params_idr, param, tmp, param_id) {
+		struct p4tc_act_param *nparam;
+
+		if (param->flags & BIT(P4TC_ACT_PARAMS_FLAGS_RUNT))
+			continue;
+
+		nparam = p4a_runt_prealloc_init_param(act, params_idr,
+						      param, &param_id,
+						      extack);
+		if (IS_ERR(nparam))
+			return PTR_ERR(nparam);
+
+		list_add_tail(&nparam->head, params_lst);
+	}
+
+	return 0;
+}
+
+static void
+p4a_runt_prealloc_list_add(struct p4tc_act *act_tmpl,
+			   struct tc_action **acts,
+			   u32 num_prealloc_acts)
+{
+	int i;
+
+	for (i = 0; i < num_prealloc_acts; i++) {
+		struct tcf_p4act *p4act = to_p4act(acts[i]);
+
+		list_add_tail(&p4act->node, &act_tmpl->prealloc_list);
+	}
+
+	tcf_idr_insert_n(acts, num_prealloc_acts);
+}
+
+static int
+p4a_runt_prealloc_create(struct net *net, struct p4tc_act *act,
+			 struct idr *params_idr, struct tc_action **acts,
+			 const u32 num_prealloc_acts,
+			 struct netlink_ext_ack *extack)
+{
+	int err;
+	int i;
+
+	for (i = 0; i < num_prealloc_acts; i++) {
+		u32 flags = TCA_ACT_FLAGS_PREALLOC | TCA_ACT_FLAGS_UNREFERENCED;
+		struct tc_action *a = acts[i];
+		struct tc_act_p4 parm = {0};
+		struct list_head params_lst;
+
+		parm.index = i + 1;
+		parm.action = TC_ACT_PIPE;
+
+		INIT_LIST_HEAD(&params_lst);
+
+		err = p4a_runt_prealloc_params_init(act, params_idr,
+						    &params_lst, extack);
+		if (err < 0) {
+			p4a_runt_prealloc_list_free(&params_lst);
+			goto destroy_acts;
+		}
+
+		err = p4a_runt_init_from_tmpl(net, &a, act, params_idr,
+					      &params_lst, &parm, flags,
+					      extack);
+		p4a_runt_prealloc_list_free(&params_lst);
+		if (err < 0)
+			goto destroy_acts;
+
+		acts[i] = a;
+	}
+
+	return 0;
+
+destroy_acts:
+	tcf_action_destroy(acts, false);
+
+	return err;
+}
+
+/* Pop the next idle instance, if any, from the action's prealloc list */
+struct tcf_p4act *
+p4a_runt_prealloc_get_next(struct p4tc_act *act)
+{
+	struct tcf_p4act *p4_act;
+
+	spin_lock_bh(&act->list_lock);
+	p4_act = list_first_entry_or_null(&act->prealloc_list, struct tcf_p4act,
+					  node);
+	if (p4_act) {
+		list_del_init(&p4_act->node);
+		refcount_set(&p4_act->common.tcfa_refcnt, 1);
+		atomic_set(&p4_act->common.tcfa_bindcnt, 1);
+	}
+	spin_unlock_bh(&act->list_lock);
+
+	return p4_act;
+}
+
+void p4a_runt_init_flags(struct tcf_p4act *p4act)
+{
+	struct tc_action *a;
+
+	a = (struct tc_action *)p4act;
+	a->tcfa_flags &= ~TCA_ACT_FLAGS_UNREFERENCED;
+}
+
+static void __p4a_runt_prealloc_put(struct p4tc_act *act,
+				    struct tcf_p4act *p4act)
+{
+	struct tcf_p4act_params *p4act_params;
+	struct p4tc_act_param *param;
+	unsigned long param_id, tmp;
+
+	spin_lock_bh(&p4act->tcf_lock);
+	p4act_params = rcu_dereference_protected(p4act->params, 1);
+	if (p4act_params) {
+		idr_for_each_entry_ul(&p4act_params->params_idr, param, tmp,
+				      param_id) {
+			const struct p4tc_type *type = param->type;
+			u32 type_bytesz = BITS_TO_BYTES(type->container_bitsz);
+
+			memset(param->value, 0, type_bytesz);
+		}
+	}
+	p4act->common.tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+	spin_unlock_bh(&p4act->tcf_lock);
+
+	spin_lock_bh(&act->list_lock);
+	list_add_tail(&p4act->node, &act->prealloc_list);
+	spin_unlock_bh(&act->list_lock);
+}
+
+void
+p4a_runt_prealloc_put(struct p4tc_act *act, struct tcf_p4act *p4act)
+{
+	if (refcount_read(&p4act->common.tcfa_refcnt) == 1) {
+		__p4a_runt_prealloc_put(act, p4act);
+	} else {
+		refcount_dec(&p4act->common.tcfa_refcnt);
+		atomic_dec(&p4act->common.tcfa_bindcnt);
+	}
+}
+
 static const struct nla_policy p4a_parm_policy[P4TC_ACT_PARAMS_MAX + 1] = {
 	[P4TC_ACT_PARAMS_NAME] = {
 		.type = NLA_STRING,
@@ -45,8 +550,100 @@ static const struct nla_policy p4a_parm_policy[P4TC_ACT_PARAMS_MAX + 1] = {
 	[P4TC_ACT_PARAMS_FLAGS] =
 		NLA_POLICY_RANGE(NLA_U8, 0,
 				 BIT(P4TC_ACT_PARAMS_FLAGS_MAX + 1) - 1),
+	[P4TC_ACT_PARAMS_VALUE] = { .type = NLA_NESTED },
+	[P4TC_ACT_PARAMS_MASK] = { .type = NLA_BINARY },
 };
 
+static int
+p4a_runt_parm_val_dump(struct sk_buff *skb, struct p4tc_type *type,
+		       struct p4tc_act_param *param)
+{
+	const u32 bytesz = BITS_TO_BYTES(type->container_bitsz);
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct nlattr *nla_value;
+
+	nla_value = nla_nest_start(skb, P4TC_ACT_PARAMS_VALUE);
+	if (nla_put(skb, P4TC_ACT_PARAMS_VALUE_RAW, bytesz,
+		    param->value))
+		goto out_nlmsg_trim;
+	nla_nest_end(skb, nla_value);
+
+	if (param->mask &&
+	    nla_put(skb, P4TC_ACT_PARAMS_MASK, bytesz, param->mask))
+		goto out_nlmsg_trim;
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int
+p4a_runt_parm_val_init(struct p4tc_act_param *nparam,
+		       struct p4tc_type *type, struct nlattr **tb,
+		       struct netlink_ext_ack *extack)
+{
+	const u32 alloc_len = BITS_TO_BYTES(type->container_bitsz);
+	struct nlattr *tb_value[P4TC_ACT_VALUE_PARAMS_MAX + 1];
+	const u32 len = BITS_TO_BYTES(type->bitsz);
+	void *value;
+	int err;
+
+	if (!tb[P4TC_ACT_PARAMS_VALUE]) {
+		NL_SET_ERR_MSG(extack, "Must specify param value");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb_value, P4TC_ACT_VALUE_PARAMS_MAX,
+			       tb[P4TC_ACT_PARAMS_VALUE],
+			       p4a_parm_val_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb_value[P4TC_ACT_PARAMS_VALUE_RAW]) {
+		NL_SET_ERR_MSG(extack, "Must specify raw value attr");
+		return -EINVAL;
+	}
+
+	if (nla_len(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]) != len)
+		return -EINVAL;
+
+	value = nla_data(tb_value[P4TC_ACT_PARAMS_VALUE_RAW]);
+	if (type->ops->validate_p4t) {
+		err = type->ops->validate_p4t(type, value, 0, nparam->bitend,
+					      extack);
+		if (err < 0)
+			return err;
+	}
+
+	nparam->value = kzalloc(alloc_len, GFP_KERNEL);
+	if (!nparam->value)
+		return -ENOMEM;
+
+	memcpy(nparam->value, value, len);
+
+	if (tb[P4TC_ACT_PARAMS_MASK]) {
+		const void *mask = nla_data(tb[P4TC_ACT_PARAMS_MASK]);
+
+		if (nla_len(tb[P4TC_ACT_PARAMS_MASK]) != len) {
+			NL_SET_ERR_MSG(extack,
+				       "Mask length differs from template's");
+			err = -EINVAL;
+			goto free_value;
+		}
+
+		nparam->mask = kzalloc(alloc_len, GFP_KERNEL);
+		if (!nparam->mask) {
+			err = -ENOMEM;
+			goto free_value;
+		}
+
+		memcpy(nparam->mask, mask, len);
+	}
+
+	return 0;
+
+free_value:
+	kfree(nparam->value);
+	return err;
+}
+
 static struct p4tc_act_param *
 p4a_parm_find_byname(struct idr *params_idr, const char *param_name)
 {
@@ -119,11 +716,29 @@ p4a_parm_find_byanyattr(struct p4tc_act *act, struct nlattr *name_attr,
 	return p4a_parm_find_byany(act, param_name, param_id, extack);
 }
 
-static const struct nla_policy
-p4a_parm_type_policy[P4TC_ACT_PARAMS_TYPE_MAX + 1] = {
-	[P4TC_ACT_PARAMS_TYPE_BITEND] = { .type = NLA_U16 },
-	[P4TC_ACT_PARAMS_TYPE_CONTAINER_ID] = { .type = NLA_U32 },
-};
+static int p4a_runt_parms_check(struct p4tc_act *act,
+				struct idr *params_idr,
+				struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *parm;
+	unsigned long param_id, tmp;
+
+	idr_for_each_entry_ul(&act->params_idr, parm, tmp, param_id) {
+		struct p4tc_act_param *parm_passed;
+
+		parm_passed = p4a_parm_find_byid(params_idr, param_id);
+		if (!parm_passed) {
+			if (!(parm->flags & BIT(P4TC_ACT_PARAMS_FLAGS_RUNT))) {
+				NL_SET_ERR_MSG_FMT(extack,
+						   "Must specify param %s",
+						   parm->name);
+				return -EINVAL;
+			}
+		}
+	}
+
+	return 0;
+}
 
 static int
 __p4a_parm_init_type(struct p4tc_act_param *param, struct nlattr *nla,
@@ -169,6 +784,123 @@ __p4a_parm_init_type(struct p4tc_act_param *param, struct nlattr *nla,
 	return 0;
 }
 
+struct p4tc_act_param *
+p4a_runt_parm_init(struct net *net, struct p4tc_act *act,
+		   struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ACT_PARAMS_MAX + 1];
+	struct p4tc_act_param *param, *nparam;
+	struct p4tc_act_param_ops *op;
+	u32 param_id = 0;
+	int err;
+
+	err = nla_parse_nested(tb, P4TC_ACT_PARAMS_MAX, nla, p4a_parm_policy,
+			       extack);
+	if (err < 0)
+		goto out;
+
+	if (tb[P4TC_ACT_PARAMS_ID])
+		param_id = nla_get_u32(tb[P4TC_ACT_PARAMS_ID]);
+
+	param = p4a_parm_find_byanyattr(act, tb[P4TC_ACT_PARAMS_NAME],
+					param_id, extack);
+	if (IS_ERR(param))
+		return param;
+
+	if (param->flags & BIT(P4TC_ACT_PARAMS_FLAGS_RUNT)) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "Param %s with runtime flag must not be specified",
+				   param->name);
+		err = -EINVAL;
+		goto out;
+	}
+
+	nparam = kzalloc(sizeof(*nparam), GFP_KERNEL);
+	if (!nparam) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = __p4a_parm_init_type(nparam, tb[P4TC_ACT_PARAMS_TYPE],
+				   extack);
+	if (err < 0)
+		goto free;
+
+	if (nparam->type != param->type) {
+		NL_SET_ERR_MSG(extack,
+			       "Param type differs from template");
+		err = -EINVAL;
+		goto free;
+	}
+
+	if (nparam->bitend != param->bitend) {
+		NL_SET_ERR_MSG(extack,
+			       "Param bitend differs from template");
+		err = -EINVAL;
+		goto free;
+	}
+
+	strscpy(nparam->name, param->name, P4TC_ACT_PARAM_NAMSIZ);
+
+	op = (struct p4tc_act_param_ops *)&param_ops[param->type->typeid];
+	if (op->init_value)
+		err = op->init_value(net, op, nparam, tb, extack);
+	else
+		err = p4a_runt_parm_val_init(nparam, nparam->type, tb,
+					     extack);
+	if (err < 0)
+		goto free;
+
+	nparam->id = param->id;
+	nparam->index = param->index;
+
+	return nparam;
+
+free:
+	kfree(nparam);
+out:
+	return ERR_PTR(err);
+}
+
+static int p4a_runt_parms_init(struct net *net, struct tcf_p4act_params *params,
+			       struct p4tc_act *act, struct nlattr *nla,
+			       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+	int err;
+	int i;
+
+	err = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, NULL);
+	if (err < 0)
+		return err;
+
+	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+		const struct p4tc_act_param_ops *op;
+		struct p4tc_act_param *param;
+
+		param = p4a_runt_parm_init(net, act, tb[i], extack);
+		if (IS_ERR(param))
+			return PTR_ERR(param);
+
+		err = idr_alloc_u32(&params->params_idr, param, &param->id,
+				    param->id, GFP_KERNEL);
+		op = &param_ops[param->type->typeid];
+		if (err < 0) {
+			if (op->free)
+				op->free(param);
+			else
+				p4a_runt_parm_val_free(param);
+			kfree(param);
+			return err;
+		}
+
+		if (params->params_array)
+			params->params_array[param->index] = param;
+	}
+
+	return p4a_runt_parms_check(act, &params->params_idr, extack);
+}
+
 static struct p4tc_act *
 p4a_tmpl_find_byname(const char *fullname, struct p4tc_pipeline *pipeline,
 		     struct netlink_ext_ack *extack)
@@ -183,6 +915,145 @@ p4a_tmpl_find_byname(const char *fullname, struct p4tc_pipeline *pipeline,
 	return NULL;
 }
 
+struct p4tc_act *p4a_runt_find(struct net *net,
+			       const struct tc_action_ops *a_o,
+			       struct netlink_ext_ack *extack)
+{
+	char *pname, *aname, fullname[ACTNAMSIZ];
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_act *act;
+
+	strscpy(fullname, a_o->kind, ACTNAMSIZ);
+
+	aname = fullname;
+	pname = strsep(&aname, "/");
+	pipeline = p4tc_pipeline_find_byany(net, pname, 0, NULL);
+	if (IS_ERR(pipeline))
+		return ERR_PTR(-ENOENT);
+
+	act = p4a_tmpl_find_byname(a_o->kind, pipeline, extack);
+	if (!act)
+		return ERR_PTR(-ENOENT);
+
+	return act;
+}
+
+static int p4a_runt_init(struct net *net, struct nlattr *nla,
+			 struct nlattr *est, struct tc_action **a,
+			 struct tcf_proto *tp, struct tc_action_ops *a_o,
+			 u32 flags, struct netlink_ext_ack *extack)
+{
+	bool bind = flags & TCA_ACT_FLAGS_BIND;
+	struct nlattr *tb[P4TC_ACT_MAX + 1];
+	struct tcf_chain *goto_ch = NULL;
+	struct tcf_p4act_params *params;
+	struct tcf_p4act *prealloc_act;
+	struct tc_act_p4 *parm;
+	struct p4tc_act *act;
+	bool exists = false;
+	int ret = 0;
+	int err;
+
+	if (flags & TCA_ACT_FLAGS_BIND &&
+	    !(flags & TCA_ACT_FLAGS_FROM_P4TC)) {
+		NL_SET_ERR_MSG(extack,
+			       "Can only bind to dynamic action from P4TC objects");
+		return -EPERM;
+	}
+
+	if (unlikely(!nla)) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify action netlink attributes");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, P4TC_ACT_MAX, nla, NULL, extack);
+	if (err < 0)
+		return err;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ACT_OPT)) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify option netlink attributes");
+		return -EINVAL;
+	}
+
+	act = p4a_runt_find(net, a_o, extack);
+	if (IS_ERR(act))
+		return PTR_ERR(act);
+
+	if (!act->active) {
+		NL_SET_ERR_MSG(extack,
+			       "Dynamic action must be active to create instance");
+		return -EINVAL;
+	}
+
+	parm = nla_data(tb[P4TC_ACT_OPT]);
+
+	ret = __p4a_runt_init(net, est, act, parm, a, tp, a_o, &goto_ch,
+			      flags, extack);
+	if (ret < 0)
+		return ret;
+	/* If trying to bind to uninitialised preallocated action, must init
+	 * below
+	 */
+	if (bind && ret == P4TC_ACT_PREALLOC)
+		return 0;
+
+	err = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack);
+	if (err < 0)
+		goto release_idr;
+
+	params = p4a_runt_parms_alloc(act);
+	if (IS_ERR(params)) {
+		err = PTR_ERR(params);
+		goto release_idr;
+	}
+
+	if (tb[P4TC_ACT_PARMS]) {
+		err = p4a_runt_parms_init(net, params, act, tb[P4TC_ACT_PARMS],
+					  extack);
+		if (err < 0)
+			goto release_params;
+	} else {
+		err = p4a_runt_parms_check(act, &params->params_idr, extack);
+		if (err < 0)
+			goto release_params;
+	}
+
+	exists = ret != P4TC_ACT_CREATED;
+	err = __p4a_runt_init_set(act, a, params, goto_ch, parm, exists,
+				  extack);
+	if (err < 0)
+		goto release_params;
+
+	return ret;
+
+release_params:
+	p4a_runt_parms_destroy(params);
+
+release_idr:
+	if (ret == P4TC_ACT_PREALLOC) {
+		prealloc_act = to_p4act(*a);
+		p4a_runt_prealloc_put(act, prealloc_act);
+		(*a)->tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+	} else if (!bind && !exists &&
+		   ((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+		prealloc_act = to_p4act(*a);
+		list_del_init(&prealloc_act->node);
+		tcf_idr_release(*a, bind);
+	} else {
+		tcf_idr_release(*a, bind);
+	}
+
+	return err;
+}
+
+static int p4a_runt_act(struct sk_buff *skb, const struct tc_action *a,
+			struct tcf_result *res)
+{
+	return 0;
+}
+
 static int p4a_parm_type_fill(struct sk_buff *skb, struct p4tc_act_param *param)
 {
 	unsigned char *b = nlmsg_get_pos(skb);
@@ -201,6 +1072,265 @@ static int p4a_parm_type_fill(struct sk_buff *skb, struct p4tc_act_param *param)
 	return -1;
 }
 
+static int p4a_runt_dump(struct sk_buff *skb, struct tc_action *a,
+			 int bind, int ref)
+{
+	struct tcf_p4act *dynact = to_p4act(a);
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct tc_act_p4 opt = {
+		.index = dynact->tcf_index,
+		.refcnt = refcount_read(&dynact->tcf_refcnt) - ref,
+		.bindcnt = atomic_read(&dynact->tcf_bindcnt) - bind,
+	};
+	struct p4tc_act_param *tmpl_parm;
+	struct tcf_p4act_params *params;
+	struct nlattr *nest_parms;
+	struct p4tc_act *act;
+	struct net *net;
+	struct tcf_t t;
+	int i = 1;
+	int id;
+
+	spin_lock_bh(&dynact->tcf_lock);
+
+	net = a->idrinfo->net;
+	act = p4a_runt_find(net, a->ops, NULL);
+	if (IS_ERR(act))
+		goto nla_put_failure;
+
+	opt.action = dynact->tcf_action;
+	if (nla_put(skb, P4TC_ACT_OPT, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, P4TC_ACT_NAME, a->ops->kind))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &dynact->tcf_tm);
+	if (nla_put_64bit(skb, P4TC_ACT_TM, sizeof(t), &t, P4TC_ACT_PAD))
+		goto nla_put_failure;
+
+	nest_parms = nla_nest_start(skb, P4TC_ACT_PARMS);
+	if (!nest_parms)
+		goto nla_put_failure;
+
+	params = rcu_dereference_protected(dynact->params, 1);
+	if (params) {
+		idr_for_each_entry(&act->params_idr, tmpl_parm, id) {
+			struct p4tc_act_param_ops *op;
+			struct p4tc_act_param *parm;
+			struct nlattr *nest_count;
+			struct nlattr *nest_type;
+
+			parm = p4a_parm_find_byid(&params->params_idr, id);
+			if (!parm)
+				parm = tmpl_parm;
+
+			nest_count = nla_nest_start(skb, i);
+			if (!nest_count)
+				goto nla_put_failure;
+
+			if (nla_put_string(skb, P4TC_ACT_PARAMS_NAME,
+					   parm->name))
+				goto nla_put_failure;
+
+			if (nla_put_u32(skb, P4TC_ACT_PARAMS_ID, parm->id))
+				goto nla_put_failure;
+
+			if (!(parm->flags & BIT(P4TC_ACT_PARAMS_FLAGS_RUNT))) {
+				op = (struct p4tc_act_param_ops *)
+					&param_ops[parm->type->typeid];
+				if (op->dump_value) {
+					if (op->dump_value(skb, op, parm) < 0)
+						goto nla_put_failure;
+				} else {
+					if (p4a_runt_parm_val_dump(skb,
+								   parm->type,
+								   parm))
+						goto nla_put_failure;
+				}
+			}
+
+			nest_type = nla_nest_start(skb, P4TC_ACT_PARAMS_TYPE);
+			if (!nest_type)
+				goto nla_put_failure;
+
+			p4a_parm_type_fill(skb, parm);
+			nla_nest_end(skb, nest_type);
+
+			if (nla_put_u8(skb, P4TC_ACT_PARAMS_FLAGS, parm->flags))
+				goto nla_put_failure;
+
+			nla_nest_end(skb, nest_count);
+			i++;
+		}
+	}
+	nla_nest_end(skb, nest_parms);
+
+	spin_unlock_bh(&dynact->tcf_lock);
+
+	return skb->len;
+
+nla_put_failure:
+	spin_unlock_bh(&dynact->tcf_lock);
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4a_runt_lookup(struct net *net,
+			   const struct tc_action_ops *ops,
+			   struct tc_action **a, u32 index)
+{
+	struct p4tc_act *act;
+	int err;
+
+	act = p4a_runt_find(net, ops, NULL);
+	if (IS_ERR(act))
+		return PTR_ERR(act);
+
+	err = tcf_idr_search(act->tn, a, index);
+	if (!err)
+		return err;
+
+	if ((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
+		return false;
+
+	return err;
+}
+
+static int p4a_runt_walker(struct net *net, struct sk_buff *skb,
+			   struct netlink_callback *cb, int type,
+			   const struct tc_action_ops *ops,
+			   struct netlink_ext_ack *extack)
+{
+	struct p4tc_act *act;
+
+	act = p4a_runt_find(net, ops, extack);
+	if (IS_ERR(act))
+		return PTR_ERR(act);
+
+	return tcf_generic_walker(act->tn, skb, cb, type, ops, extack);
+}
+
+static void p4a_runt_cleanup(struct tc_action *a)
+{
+	struct tc_action_ops *ops = (struct tc_action_ops *)a->ops;
+	struct tcf_p4act *m = to_p4act(a);
+	struct tcf_p4act_params *params;
+
+	params = rcu_dereference_protected(m->params, 1);
+
+	if (refcount_read(&ops->p4_ref) > 1)
+		refcount_dec(&ops->p4_ref);
+
+	if (params)
+		call_rcu(&params->rcu, p4a_runt_parms_destroy_rcu);
+}
+
+static void p4a_runt_net_exit(struct tc_action_net *tn)
+{
+	tcf_idrinfo_destroy(tn->ops, tn->idrinfo);
+	kfree(tn->idrinfo);
+	kfree(tn);
+}
+
+static int p4a_runt_parm_list_init(struct p4tc_act *act,
+				   struct tcf_p4act_params *params,
+				   struct list_head *params_lst)
+{
+	struct p4tc_act_param *nparam, *tmp;
+	u32 tot_params_sz = 0;
+	int err;
+
+	list_for_each_entry_safe(nparam, tmp, params_lst, head) {
+		err = idr_alloc_u32(&params->params_idr, nparam, &nparam->id,
+				    nparam->id, GFP_KERNEL);
+		if (err < 0)
+			return err;
+		list_del(&nparam->head);
+		params->num_params++;
+		tot_params_sz += nparam->type->container_bitsz;
+	}
+	/* Sum act_id */
+	params->tot_params_sz = tot_params_sz + (sizeof(u32) << 3);
+
+	return 0;
+}
+
+/* This is the action instantiation that is invoked from the template code,
+ * specifically when initialising preallocated dynamic actions.
+ * This function is analogous to p4a_runt_init.
+ */
+static int p4a_runt_init_from_tmpl(struct net *net, struct tc_action **a,
+				   struct p4tc_act *act,
+				   struct idr *params_idr,
+				   struct list_head *params_lst,
+				   struct tc_act_p4 *parm, u32 flags,
+				   struct netlink_ext_ack *extack)
+{
+	bool bind = flags & TCA_ACT_FLAGS_BIND;
+	struct tc_action_ops *a_o = &act->ops;
+	struct tcf_chain *goto_ch = NULL;
+	struct tcf_p4act_params *params;
+	struct tcf_p4act *prealloc_act;
+	bool exists = false;
+	int ret;
+	int err;
+
+	/* Don't need to check if action is active because we only call this
+	 * when we are on our way to activating the action.
+	 */
+	ret = __p4a_runt_init(net, NULL, act, parm, a, NULL, a_o, &goto_ch,
+			      flags, extack);
+	if (ret < 0)
+		return ret;
+
+	params = p4a_runt_parms_alloc(act);
+	if (IS_ERR(params)) {
+		err = PTR_ERR(params);
+		goto release_idr;
+	}
+
+	if (params_idr) {
+		err = p4a_runt_parm_list_init(act, params, params_lst);
+		if (err < 0)
+			goto release_params;
+	} else {
+		if (!idr_is_empty(&act->params_idr)) {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify action parameters");
+			err = -EINVAL;
+			goto release_params;
+		}
+	}
+
+	exists = ret != P4TC_ACT_CREATED;
+	err = __p4a_runt_init_set(act, a, params, goto_ch, parm, exists,
+				  extack);
+	if (err < 0)
+		goto release_params;
+
+	return err;
+
+release_params:
+	p4a_runt_parms_destroy(params);
+
+release_idr:
+	if (ret == P4TC_ACT_PREALLOC) {
+		prealloc_act = to_p4act(*a);
+		p4a_runt_prealloc_put(act, prealloc_act);
+		(*a)->tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
+	} else if (!bind && !exists &&
+		   ((*a)->tcfa_flags & TCA_ACT_FLAGS_PREALLOC)) {
+		prealloc_act = to_p4act(*a);
+		list_del_init(&prealloc_act->node);
+		tcf_idr_release(*a, bind);
+	} else {
+		tcf_idr_release(*a, bind);
+	}
+
+	return err;
+}
+
 struct p4tc_act *p4a_tmpl_find_byid(struct p4tc_pipeline *pipeline,
 				    const u32 a_id)
 {
@@ -281,6 +1411,18 @@ p4a_tmpl_find_byanyattr(struct nlattr *attr, const u32 a_id,
 	return p4a_tmpl_find_byany(pipeline, fullname, a_id, extack);
 }
 
+static void p4a_tmpl_check_set_runtime(struct p4tc_act *act)
+{
+	struct p4tc_act_param *parm;
+	unsigned long param_id, tmp;
+
+	act->num_runt_params = 0;
+	idr_for_each_entry_ul(&act->params_idr, parm, tmp, param_id) {
+		if (parm->flags & BIT(P4TC_ACT_PARAMS_FLAGS_RUNT))
+			act->num_runt_params++;
+	}
+}
+
 static void p4a_tmpl_parms_put_many(struct idr *params_idr)
 {
 	struct p4tc_act_param *param;
@@ -555,7 +1697,8 @@ static int __p4a_tmpl_put(struct net *net, struct p4tc_pipeline *pipeline,
 {
 	struct tcf_p4act *p4act, *tmp_act;
 
-	if (!teardown && refcount_read(&act->a_ref) > 1) {
+	if (!teardown && (refcount_read(&act->ops.p4_ref) > 1 ||
+			  refcount_read(&act->a_ref) > 1)) {
 		NL_SET_ERR_MSG(extack,
 			       "Unable to delete referenced action template");
 		return -EBUSY;
@@ -570,6 +1713,7 @@ static int __p4a_tmpl_put(struct net *net, struct p4tc_pipeline *pipeline,
 		if (p4act->common.tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)
 			tcf_idr_release(&p4act->common, true);
 	}
+	p4a_runt_net_exit(act->tn);
 
 	idr_remove(&pipeline->p_act_idr, act->a_id);
 
@@ -847,12 +1991,36 @@ p4a_tmpl_create(struct net *net, struct nlattr **tb,
 	if (!act)
 		return ERR_PTR(-ENOMEM);
 
+	strscpy(act->ops.kind, fullname, ACTNAMSIZ);
+	act->ops.owner = THIS_MODULE;
+	act->ops.act = p4a_runt_act;
+	act->ops.dump = p4a_runt_dump;
+	act->ops.cleanup = p4a_runt_cleanup;
+	act->ops.init_ops = p4a_runt_init;
+	act->ops.lookup = p4a_runt_lookup;
+	act->ops.walk = p4a_runt_walker;
+	act->ops.size = sizeof(struct tcf_p4act);
+	INIT_LIST_HEAD(&act->head);
+
+	act->tn = kzalloc(sizeof(*act->tn), GFP_KERNEL);
+	if (!act->tn) {
+		ret = -ENOMEM;
+		goto free_act_ops;
+	}
+
+	ret = tc_action_net_init(net, act->tn, &act->ops);
+	if (ret < 0) {
+		kfree(act->tn);
+		goto free_act_ops;
+	}
+	act->tn->ops = &act->ops;
+
 	if (a_id) {
 		ret = idr_alloc_u32(&pipeline->p_act_idr, act, &a_id, a_id,
 				    GFP_KERNEL);
 		if (ret < 0) {
 			NL_SET_ERR_MSG(extack, "Unable to alloc action id");
-			goto free_act;
+			goto free_action_net;
 		}
 
 		act->a_id = a_id;
@@ -863,7 +2031,7 @@ p4a_tmpl_create(struct net *net, struct nlattr **tb,
 				    UINT_MAX, GFP_KERNEL);
 		if (ret < 0) {
 			NL_SET_ERR_MSG(extack, "Unable to alloc action id");
-			goto free_act;
+			goto free_action_net;
 		}
 	}
 
@@ -875,10 +2043,18 @@ p4a_tmpl_create(struct net *net, struct nlattr **tb,
 	else
 		act->num_prealloc_acts = P4TC_DEFAULT_NUM_PREALLOC;
 
+	refcount_set(&act->ops.p4_ref, 1);
+	ret = tcf_register_p4_action(net, &act->ops);
+	if (ret < 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Unable to register new action template");
+		goto idr_rm;
+	}
+
 	num_params = p4a_tmpl_init(act, tb[P4TC_ACT_PARMS], extack);
 	if (num_params < 0) {
 		ret = num_params;
-		goto idr_rm;
+		goto unregister;
 	}
 	act->num_params = num_params;
 
@@ -897,15 +2073,22 @@ p4a_tmpl_create(struct net *net, struct nlattr **tb,
 
 	refcount_set(&act->a_ref, 1);
 
+	list_add_tail(&act->head, &dynact_list);
 	INIT_LIST_HEAD(&act->prealloc_list);
 	spin_lock_init(&act->list_lock);
 
 	return act;
 
+unregister:
+	tcf_unregister_p4_action(net, &act->ops);
+
 idr_rm:
 	idr_remove(&pipeline->p_act_idr, act->a_id);
 
-free_act:
+free_action_net:
+	p4a_runt_net_exit(act->tn);
+
+free_act_ops:
 	kfree(act);
 
 	return ERR_PTR(ret);
@@ -917,6 +2100,7 @@ p4a_tmpl_update(struct net *net, struct nlattr **tb,
 		u32 flags, struct netlink_ext_ack *extack)
 {
 	const u32 a_id = ids[P4TC_AID_IDX];
+	struct tc_action **prealloc_acts;
 	bool updates_params = false;
 	struct idr params_idr;
 	u32 num_prealloc_acts;
@@ -935,6 +2119,11 @@ p4a_tmpl_update(struct net *net, struct nlattr **tb,
 
 	if (act->active) {
 		if (!active) {
+			if (refcount_read(&act->ops.p4_ref) > 1) {
+				NL_SET_ERR_MSG(extack,
+					       "Unable to inactivate action with instances");
+				return ERR_PTR(-EINVAL);
+			}
 			act->active = false;
 			return act;
 		}
@@ -963,6 +2152,32 @@ p4a_tmpl_update(struct net *net, struct nlattr **tb,
 
 	act->pipeline = pipeline;
 	if (active == 1) {
+		struct idr *chosen_idr = updates_params ?
+			&params_idr : &act->params_idr;
+
+		prealloc_acts = kcalloc(num_prealloc_acts,
+					sizeof(*prealloc_acts),
+					GFP_KERNEL);
+		if (!prealloc_acts) {
+			ret = -ENOMEM;
+			goto params_del;
+		}
+
+		p4a_tmpl_check_set_runtime(act);
+
+		ret = p4a_runt_prealloc_create(pipeline->net, act,
+					       chosen_idr,
+					       prealloc_acts,
+					       num_prealloc_acts,
+					       extack);
+		if (ret < 0)
+			goto free_prealloc_acts;
+
+		p4a_runt_prealloc_list_add(act, prealloc_acts,
+					   num_prealloc_acts);
+
+		kfree(prealloc_acts);
+
 		act->active = true;
 	} else if (!active) {
 		NL_SET_ERR_MSG(extack, "Action is already inactive");
@@ -979,6 +2194,10 @@ p4a_tmpl_update(struct net *net, struct nlattr **tb,
 
 	return act;
 
+free_prealloc_acts:
+	act->num_runt_params = 0;
+	kfree(prealloc_acts);
+
 params_del:
 	p4a_tmpl_parms_put_many(&params_idr);
 
-- 
2.34.1



* [PATCH net-next v12  11/15] p4tc: add template table create, update, delete, get, flush and dump
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (9 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 10/15] p4tc: add runtime action support Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 12/15] p4tc: add runtime table entry create and update Jamal Hadi Salim
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

This commit introduces code for the creation and maintenance of P4 tables
defined in a P4 program.

As with all other P4TC objects, tables' lifetimes conform to extended CRUD
operations and are maintained via templates.
It's important to note that write operations, such as create, update and
delete, can only be made if the pipeline is not sealed.

Per the P4 specification, tables prefix their name with the control block
(although this could be overridden by P4 annotations).

As an example, if one were to create a table named table1 in a pipeline named
myprog1, on control block "mycontrol", one would use the following command:

tc p4template create table/myprog1/mycontrol/table1 tblid 1 \
   keysz 32 nummasks 8 tentries 8192

Above says that we are creating a table (table1) attached to pipeline
myprog1 on control block mycontrol. Table1's key size is 32 bits wide
and it can have up to 8 associated masks and 8192 entries. The table id
for table1 is 1. The table id is typically provided by the compiler.

Parameters such as nummasks (number of masks this table may have) and
tentries (maximum number of entries this table may have) may be omitted,
in which case they default to 8 masks and 256 entries respectively.

P4 tables have many associated attributes that further refine how these
tables operate. Some attributes are per table and others per entry.
Attributes include:

- Timer profiles: The timer profiles for the table entries belonging to the
  table. These profiles contain the supported aging values for the table.
  There are currently 4 different built-in profiles ID0=30s(default),
  ID1=60s, ID2=120s and ID3=180s.
  The user can override the defaults by specifying the number of profiles per
  table. In that case, the kernel will generate the specified number of
  profiles with their aging values abiding by the following rule:
    - Profile ID 0 - 30000ms
    - Profile ID(n) - Profile ID(n - 1) + 30000ms
  So, for example, if the user specified num_timer_profiles as 5, the
  profile IDs and aging values would be the following:
    - Profile ID 0 - 30000ms (default profile)
    - Profile ID 1 - 60000ms
    - Profile ID 2 - 90000ms
    - Profile ID 3 - 120000ms
    - Profile ID 4 - 150000ms
  The values of the different profiles could be changed at runtime.
  The default profile is used if the user specifies an out of range value.

- Match key type: The match type of the table key (exact, LPM, ternary,
  etc).
  Note that in P4 a key may comprise multiple (sub)key types, e.g. matching
  on srcip using an LPM prefix and an exact match on dstip. In such a case
  it is up to the compiler to generalize the key used (for example, the
  overall key may end up being LPM or ternary).

- Direct Counter: Table counter instances used directly by the table. When
  specified in the P4 program, there will be one counter per entry.

- Direct Meter: Table meter instances used directly by the table. When
  specified in the P4 program, there will be one meter per entry.

- CRUDXPS Permissions both for specific entries and tables. The permissions
  are applicable to both the control plane and the datapath (see "Table
  Permissions" further below).

- Allowed Actions List. This will be a list of all possible actions that
  can be added to table entries that are added to the specified table.

- Action profiles. When defined in a P4 program, action profiles provide a
  mechanism to share action instances.

- Action Selectors. When defined in a P4 program, they can be used to select
  the execution of one or more action instances at table lookup time by
  using a hash computation.

- Default hit action. When a default hit action is defined it is used when
  a matched table entry did not define an action. Depending on the P4
  program the default hit action can be updated at runtime (in addition to
  being specified in the template).

- Default miss action. When a default miss action is defined it is used
  when a lookup on that table fails.

- Action scope. In addition to actions being annotated as default hit or
  miss, they can also be annotated to be either specific to a table or
  globally available to multiple tables within the same P4 program.

- Max entries. This is an upper bound for number of entries a specific
  table allows.

- Num masks. In the case of LPM or ternary matches, this defines the
  maximum allowed masks for that table.

- Timers. When defined in a P4 program, each entry has an associated timer.
  Depending on the programmed timer profile (see above), an entry gets a
  timeout. The timer attribute specifies the behavior of a table entry
  expiration.
  The timer is refreshed every time there's a hit. Using this attribute,
  the P4 program can define whether, after an idle period, it wants an
  event generated to user space (and have user space delete the entry), or
  whether it wants the kernel to delete the entry and send an event
  announcing the deletion. The default in P4TC is to both delete and
  generate an event.

- per entry "static" vs "dynamic" entry. By default all entries created
  from the control plane are "static" unless otherwise specified. All
  entries added from datapath are "dynamic" unless otherwise specified.
  "Dynamic" entries are subject to deletion when idle (subject to the rules
  specified in "Timers" above).

If one were to retrieve the template details of a table named table1 (before or
after the pipeline is sealed) one would use the following command:

tc p4template get table/myprog1/mycontrol/table1

If one were to dump all the tables from a pipeline named myprog1, one would
use the following command:

tc p4template get table/myprog1

If one were to update table1 (before the pipeline is sealed) one would use
the following command:

tc p4template update table/myprog1/mycontrol/table1 ....

If one were to delete table1 (before the pipeline is sealed) one would use
the following command:

tc p4template del table/myprog1/mycontrol/table1

If one were to flush all the tables from a pipeline named myprog1, control
block "mycontrol" one would use the following command:

tc p4template del table/myprog1/mycontrol/

___Table Permissions___

Tables can have permissions which apply to all the entries in the specified
table. Permissions are defined for both what the control plane (user space)
as well as the data path are allowed to do on the table.

The permissions field is a 16bit value which will hold CRUDXPS (create,
read, update, delete, execute, publish and subscribe) permissions for
control and data path. Bits 13-7 will have the CRUDXPS values for control
and bits 6-0 will have CRUDXPS values for data path. By default each table
has the following permissions:

CRUD-PS-R--X--

Which means the control plane can perform CRUDPS operations whereas the
data path (-R--X--) can only Read and execute on the entries.
The user can override these permissions when creating the table or when
updating.

For example, the following command will create a table which will not allow
the datapath to create, update or delete entries but gives the control
plane create, read, update, delete and publish (CRUD-P) permissions.

$TC p4template create table/aP4proggie/cb/tname tblid 1 keysz 64 type lpm \
permissions 0x3D24 ...

Recall that these permissions come in the form of CRUDXPSCRUDXPS, where the
first CRUDXPS block is for control and the last is for data path.

So 0x3D24 is equivalent to CRUD-P--R--X--

If we were to issue a read command on a table (tname):

$TC -j p4template get table/aP4proggie/cb/tname | jq .

The output would be the following:

[
  {
    "obj": "table",
    "pname": "aP4Proggie",
    "pipeid": 22
  },
  {
    "templates": [
      {
        "tblid": 1,
        "tname": "cb/tname",
        "keysz": 64,
        "max_entries": 256,
        "masks": 8,
        "entries": 0,
        "permissions": "CRUD--S-R--X--",
        "table_type": "lpm",
        "acts_list": []
      }
    ]
  }
]

Note, the permissions concept is more powerful than the classical const
definition currently taken by P4, which makes everything in a table
read-only.

___Initial Table Entries___

Templating can create initial table entries. For example:

tc p4template create table/myprog/cb/tname \
  entry srcAddr 10.10.10.10/32 dstAddr 1.1.1.0/24 prio 17

In this command we are creating table tname with an entry which will be
present even before the P4 program starts executing. This entry has as its
key IPv4 srcAddr and dstAddr, with priority 17.

If one was to read back the entry by issuing the following runtime command:

tc p4ctrl get myprog/table/cb/tname

They would get:
...

    entry priority 17[permissions-RUD--S-R--X--]
    entry key
        srcAddr id:1 size:32b type:ipv4 exact fieldval  10.10.10.10/32
        dstAddr id:2 size:32b type:ipv4 exact fieldval  1.1.1.0/24

We will explain the p4ctrl commands in more detail in subsequent patches.

Before the pipeline is sealed we can only do a p4ctrl get. The write
commands (create, update, delete and flush) are not allowed before
sealing.

___Table Actions List___

P4 tables can be programmed to allow only a specified list of actions to be
part of a match entry on a table. P4 also defines default actions to be
executed when no entries match; we have extended this concept with a default
hit action, which is executed upon matching an entry that has no action
associated with it.

We also allow flags for each of the actions in this list that specify if
the action can be added only as a table entry (tableonly), or only as a
default action (defaultonly). If no flags are specified, it is assumed
that the action can be used in both contexts.

Both default hit and default miss actions are optional.

An example of specifying a default miss action is as follows:

tc p4template update table/myprog/cb/mytable \
    default_miss_action permissions 0x1124 action drop

The above will drop packets if the entry is not found in mytable.
Note that the above makes the default action a const, meaning the control
plane can neither replace nor delete it.

tc p4template update table/myprog/mytable \
  default_hit_action permissions 0x3004 action ok

Whereas the above allows a default hit action to accept the packet.
The permission 0x3004 (binary 11000000000100) means we have only Create and
Read permissions in the control plane and eXecute permissions in the data
plane. This means, for example, that the control plane can no longer update
or delete the default hit action.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h             |  156 +++-
 include/net/p4tc_types.h       |    2 +-
 include/uapi/linux/p4tc.h      |  119 +++
 net/sched/p4tc/Makefile        |    2 +-
 net/sched/p4tc/p4tc_action.c   |    4 +-
 net/sched/p4tc/p4tc_pipeline.c |   23 +-
 net/sched/p4tc/p4tc_table.c    | 1606 ++++++++++++++++++++++++++++++++
 7 files changed, 1900 insertions(+), 12 deletions(-)
 create mode 100644 net/sched/p4tc/p4tc_table.c

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 21705e731..d11a9efa7 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -16,10 +16,23 @@
 #define P4TC_DEFAULT_MAX_RULES 1
 #define P4TC_PATH_MAX 3
 #define P4TC_MAX_TENTRIES 0x2000000
+#define P4TC_DEFAULT_TENTRIES 256
+#define P4TC_MAX_TMASKS 1024
+#define P4TC_DEFAULT_TMASKS 8
+#define P4TC_MAX_T_AGING_MS 864000000
+#define P4TC_DEFAULT_T_AGING_MS 30000
+#define P4TC_DEFAULT_NUM_TIMER_PROFILES 4
+#define P4TC_MAX_NUM_TIMER_PROFILES P4TC_MSGBATCH_SIZE
+
+#define P4TC_TIMER_PROFILE_ZERO_AGING_MS 30000
+#define P4TC_DEFAULT_TIMER_PROFILE_ID 0
+
+#define P4TC_MAX_PERMISSION (GENMASK(P4TC_PERM_MAX_BIT, 0))
 
 #define P4TC_KERNEL_PIPEID 0
 
 #define P4TC_PID_IDX 0
+#define P4TC_TBLID_IDX 1
 #define P4TC_AID_IDX 1
 #define P4TC_PARSEID_IDX 1
 
@@ -68,6 +81,7 @@ struct p4tc_template_common {
 struct p4tc_pipeline {
 	struct p4tc_template_common common;
 	struct idr                  p_act_idr;
+	struct idr                  p_tbl_idr;
 	struct rcu_head             rcu;
 	struct net                  *net;
 	u32                         num_created_acts;
@@ -122,6 +136,11 @@ struct p4tc_act *p4a_runt_find(struct net *net,
 void
 p4a_runt_prealloc_put(struct p4tc_act *act, struct tcf_p4act *p4_act);
 
+static inline bool p4tc_pipeline_sealed(struct p4tc_pipeline *pipeline)
+{
+	return pipeline->p_state == P4TC_STATE_READY;
+}
+
 static inline int p4tc_action_destroy(struct tc_action *acts[])
 {
 	struct tc_action *acts_non_prealloc[TCA_ACT_MAX_PRIO] = {NULL};
@@ -166,6 +185,65 @@ static inline int p4tc_action_destroy(struct tc_action *acts[])
 	return ret;
 }
 
+#define P4TC_CONTROL_PERMISSIONS (GENMASK(13, 7))
+#define P4TC_DATA_PERMISSIONS (GENMASK(6, 0))
+
+#define P4TC_TABLE_DEFAULT_PERMISSIONS                                   \
+	((GENMASK(P4TC_CTRL_PERM_C_BIT, P4TC_CTRL_PERM_D_BIT)) | \
+	 P4TC_CTRL_PERM_P | P4TC_CTRL_PERM_S | P4TC_DATA_PERM_R | \
+	 P4TC_DATA_PERM_X)
+
+#define P4TC_PERMISSIONS_UNINIT (1 << P4TC_PERM_MAX_BIT)
+
+struct p4tc_table_defact {
+	struct tc_action *acts[2];
+	/* Will have two 7-bit blocks containing CRUDXPS (create, read, update,
+	 * delete, execute, publish and subscribe) permissions for control plane
+	 * and data plane. The first 7 bits are for control and the next 7
+	 * are for data plane: |crudxpscrudxps| if we were to denote it as UNIX
+	 * permission flags.
+	 */
+	__u16 perm;
+	struct rcu_head  rcu;
+};
+
+struct p4tc_table_perm {
+	__u16           permissions;
+	struct rcu_head rcu;
+};
+
+struct p4tc_table {
+	struct p4tc_template_common         common;
+	struct list_head                    tbl_acts_list;
+	struct idr                          tbl_masks_idr;
+	struct idr                          tbl_prio_idr;
+	struct xarray                       tbl_profiles_xa;
+	struct rhltable                     tbl_entries;
+	/* Mutex that protects tbl_profiles_xa */
+	struct mutex                        tbl_profiles_xa_lock;
+	struct p4tc_table_defact __rcu      *tbl_dflt_hitact;
+	struct p4tc_table_defact __rcu      *tbl_dflt_missact;
+	struct p4tc_table_perm __rcu        *tbl_permissions;
+	struct p4tc_table_entry_mask __rcu  **tbl_masks_array;
+	unsigned long __rcu                 *tbl_free_masks_bitmap;
+	/* Locks the available masks IDR which will be used when adding and
+	 * deleting table entries.
+	 */
+	spinlock_t                          tbl_masks_idr_lock;
+	u32                                 tbl_keysz;
+	u32                                 tbl_id;
+	u32                                 tbl_max_entries;
+	u32                                 tbl_max_masks;
+	u32                                 tbl_curr_num_masks;
+	atomic_t                            tbl_num_timer_profiles;
+	/* Accounts for how many entities refer to this table. Usually just the
+	 * pipeline it belongs to.
+	 */
+	refcount_t                          tbl_ctrl_ref;
+	u16                                 tbl_type;
+	u16                                 __pad0;
+};
+
 struct p4tc_act_param {
 	struct list_head head;
 	struct rcu_head	rcu;
@@ -219,6 +297,18 @@ struct p4tc_act {
 	char                        fullname[ACTNAMSIZ];
 };
 
+struct p4tc_table_act {
+	struct list_head node;
+	struct p4tc_act *act;
+	u8     flags;
+};
+
+struct p4tc_table_timer_profile {
+	struct rcu_head rcu;
+	u64 aging_ms;
+	u32 profile_id;
+};
+
 static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
 				   struct tc_action *acts[], u32 pipeid,
 				   u32 flags, struct netlink_ext_ack *extack)
@@ -286,6 +376,70 @@ static inline bool p4tc_action_put_ref(struct p4tc_act *act)
 	return refcount_dec_not_one(&act->a_ref);
 }
 
+struct p4tc_act_param *p4a_parm_find_byid(struct idr *params_idr,
+					  const u32 param_id);
+struct p4tc_act_param *
+p4a_parm_find_byany(struct p4tc_act *act, const char *param_name,
+		    const u32 param_id, struct netlink_ext_ack *extack);
+
+struct p4tc_table *p4tc_table_find_byany(struct p4tc_pipeline *pipeline,
+					 const char *tblname, const u32 tbl_id,
+					 struct netlink_ext_ack *extack);
+struct p4tc_table *p4tc_table_find_byid(struct p4tc_pipeline *pipeline,
+					const u32 tbl_id);
+int p4tc_table_try_set_state_ready(struct p4tc_pipeline *pipeline,
+				   struct netlink_ext_ack *extack);
+void p4tc_table_put_mask_array(struct p4tc_pipeline *pipeline);
+struct p4tc_table *p4tc_table_find_get(struct p4tc_pipeline *pipeline,
+				       const char *tblname, const u32 tbl_id,
+				       struct netlink_ext_ack *extack);
+
+static inline bool p4tc_table_put_ref(struct p4tc_table *table)
+{
+	return refcount_dec_not_one(&table->tbl_ctrl_ref);
+}
+
+struct p4tc_table_defact_params {
+	struct p4tc_table_defact *hitact;
+	struct p4tc_table_defact *missact;
+	struct nlattr *nla_hit;
+	struct nlattr *nla_miss;
+};
+
+int p4tc_table_init_default_acts(struct net *net,
+				 struct p4tc_table_defact_params *dflt,
+				 struct p4tc_table *table,
+				 struct list_head *acts_list,
+				 struct netlink_ext_ack *extack);
+
+static inline bool p4tc_table_act_is_noaction(struct p4tc_table_act *table_act)
+{
+	return !table_act->act->common.p_id && !table_act->act->a_id;
+}
+
+static inline bool p4tc_table_defact_is_noaction(struct tcf_p4act *p4_defact)
+{
+	return !p4_defact->p_id && !p4_defact->act_id;
+}
+
+static inline void p4tc_table_defacts_acts_copy(struct p4tc_table_defact *dst,
+						struct p4tc_table_defact *src)
+{
+	dst->acts[0] = src->acts[0];
+	dst->acts[1] = NULL;
+}
+
+void p4tc_table_replace_default_acts(struct p4tc_table *table,
+				     struct p4tc_table_defact_params *dflt,
+				     bool lock_rtnl);
+
+struct p4tc_table_perm *
+p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
+			    struct netlink_ext_ack *extack);
+void p4tc_table_replace_permissions(struct p4tc_table *table,
+				    struct p4tc_table_perm *tbl_perm,
+				    bool lock_rtnl);
+
 struct tcf_p4act *
 p4a_runt_prealloc_get_next(struct p4tc_act *act);
 void p4a_runt_init_flags(struct tcf_p4act *p4act);
@@ -295,7 +449,7 @@ p4a_runt_parm_init(struct net *net, struct p4tc_act *act,
 		   struct nlattr *nla, struct netlink_ext_ack *extack);
 
 #define to_pipeline(t) ((struct p4tc_pipeline *)t)
-#define to_hdrfield(t) ((struct p4tc_hdrfield *)t)
 #define p4tc_to_act(t) ((struct p4tc_act *)t)
+#define p4tc_to_table(t) ((struct p4tc_table *)t)
 
 #endif
diff --git a/include/net/p4tc_types.h b/include/net/p4tc_types.h
index af9f51fc1..0b1d79740 100644
--- a/include/net/p4tc_types.h
+++ b/include/net/p4tc_types.h
@@ -8,7 +8,7 @@
 
 #include <uapi/linux/p4tc.h>
 
-#define P4TC_T_MAX_BITSZ 128
+#define P4TC_T_MAX_BITSZ P4TC_MAX_KEYSZ
 
 struct p4tc_type_mask_shift {
 	void *mask;
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index bb4533689..92ed964ab 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -26,6 +26,67 @@ struct p4tcmsg {
 #define P4TC_PIPELINE_NAMSIZ P4TC_TMPL_NAMSZ
 #define P4TC_ACT_TMPL_NAMSZ P4TC_TMPL_NAMSZ
 #define P4TC_ACT_PARAM_NAMSIZ P4TC_TMPL_NAMSZ
+#define P4TC_TABLE_NAMSIZ P4TC_TMPL_NAMSZ
+
+enum {
+	P4TC_TABLE_TYPE_UNSPEC,
+	P4TC_TABLE_TYPE_EXACT = 1,
+	P4TC_TABLE_TYPE_LPM = 2,
+	P4TC_TABLE_TYPE_TERNARY = 3,
+	__P4TC_TABLE_TYPE_MAX,
+};
+
+#define P4TC_TABLE_TYPE_MAX (__P4TC_TABLE_TYPE_MAX - 1)
+
+#define P4TC_CTRL_PERM_C_BIT 13
+#define P4TC_CTRL_PERM_R_BIT 12
+#define P4TC_CTRL_PERM_U_BIT 11
+#define P4TC_CTRL_PERM_D_BIT 10
+#define P4TC_CTRL_PERM_X_BIT 9
+#define P4TC_CTRL_PERM_P_BIT 8
+#define P4TC_CTRL_PERM_S_BIT 7
+
+#define P4TC_DATA_PERM_C_BIT 6
+#define P4TC_DATA_PERM_R_BIT 5
+#define P4TC_DATA_PERM_U_BIT 4
+#define P4TC_DATA_PERM_D_BIT 3
+#define P4TC_DATA_PERM_X_BIT 2
+#define P4TC_DATA_PERM_P_BIT 1
+#define P4TC_DATA_PERM_S_BIT 0
+
+#define P4TC_PERM_MAX_BIT P4TC_CTRL_PERM_C_BIT
+
+#define P4TC_CTRL_PERM_C (1 << P4TC_CTRL_PERM_C_BIT)
+#define P4TC_CTRL_PERM_R (1 << P4TC_CTRL_PERM_R_BIT)
+#define P4TC_CTRL_PERM_U (1 << P4TC_CTRL_PERM_U_BIT)
+#define P4TC_CTRL_PERM_D (1 << P4TC_CTRL_PERM_D_BIT)
+#define P4TC_CTRL_PERM_X (1 << P4TC_CTRL_PERM_X_BIT)
+#define P4TC_CTRL_PERM_P (1 << P4TC_CTRL_PERM_P_BIT)
+#define P4TC_CTRL_PERM_S (1 << P4TC_CTRL_PERM_S_BIT)
+
+#define P4TC_DATA_PERM_C (1 << P4TC_DATA_PERM_C_BIT)
+#define P4TC_DATA_PERM_R (1 << P4TC_DATA_PERM_R_BIT)
+#define P4TC_DATA_PERM_U (1 << P4TC_DATA_PERM_U_BIT)
+#define P4TC_DATA_PERM_D (1 << P4TC_DATA_PERM_D_BIT)
+#define P4TC_DATA_PERM_X (1 << P4TC_DATA_PERM_X_BIT)
+#define P4TC_DATA_PERM_P (1 << P4TC_DATA_PERM_P_BIT)
+#define P4TC_DATA_PERM_S (1 << P4TC_DATA_PERM_S_BIT)
+
+#define p4tc_ctrl_create_ok(perm)   ((perm) & P4TC_CTRL_PERM_C)
+#define p4tc_ctrl_read_ok(perm)     ((perm) & P4TC_CTRL_PERM_R)
+#define p4tc_ctrl_update_ok(perm)   ((perm) & P4TC_CTRL_PERM_U)
+#define p4tc_ctrl_delete_ok(perm)   ((perm) & P4TC_CTRL_PERM_D)
+#define p4tc_ctrl_exec_ok(perm)     ((perm) & P4TC_CTRL_PERM_X)
+#define p4tc_ctrl_pub_ok(perm)      ((perm) & P4TC_CTRL_PERM_P)
+#define p4tc_ctrl_sub_ok(perm)      ((perm) & P4TC_CTRL_PERM_S)
+
+#define p4tc_data_create_ok(perm)   ((perm) & P4TC_DATA_PERM_C)
+#define p4tc_data_read_ok(perm)     ((perm) & P4TC_DATA_PERM_R)
+#define p4tc_data_update_ok(perm)   ((perm) & P4TC_DATA_PERM_U)
+#define p4tc_data_delete_ok(perm)   ((perm) & P4TC_DATA_PERM_D)
+#define p4tc_data_exec_ok(perm)     ((perm) & P4TC_DATA_PERM_X)
+#define p4tc_data_pub_ok(perm)      ((perm) & P4TC_DATA_PERM_P)
+#define p4tc_data_sub_ok(perm)      ((perm) & P4TC_DATA_PERM_S)
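For readers of the UAPI above: the CRUDXPS permissions pack the control-path bits (13..7) and the data-path bits (6..0) into a single u16. A standalone userspace sketch, with the bit values copied from the header above (`example_defact_perm` is illustrative and not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions and masks copied from the UAPI header above */
#define P4TC_CTRL_PERM_R_BIT 12
#define P4TC_CTRL_PERM_U_BIT 11
#define P4TC_DATA_PERM_R_BIT 5
#define P4TC_DATA_PERM_X_BIT 2

#define P4TC_CTRL_PERM_R (1 << P4TC_CTRL_PERM_R_BIT)
#define P4TC_CTRL_PERM_U (1 << P4TC_CTRL_PERM_U_BIT)
#define P4TC_DATA_PERM_R (1 << P4TC_DATA_PERM_R_BIT)
#define P4TC_DATA_PERM_X (1 << P4TC_DATA_PERM_X_BIT)

#define p4tc_ctrl_read_ok(perm)   ((perm) & P4TC_CTRL_PERM_R)
#define p4tc_ctrl_update_ok(perm) ((perm) & P4TC_CTRL_PERM_U)
#define p4tc_data_read_ok(perm)   ((perm) & P4TC_DATA_PERM_R)
#define p4tc_data_exec_ok(perm)   ((perm) & P4TC_DATA_PERM_X)

/* A plausible default-action permission: the control plane may read and
 * update it; the data path may read and execute it.
 */
static inline uint16_t example_defact_perm(void)
{
	return P4TC_CTRL_PERM_R | P4TC_CTRL_PERM_U |
	       P4TC_DATA_PERM_R | P4TC_DATA_PERM_X;
}
```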
 
 /* Root attributes */
 enum {
@@ -42,6 +103,7 @@ enum {
 	P4TC_OBJ_UNSPEC,
 	P4TC_OBJ_PIPELINE,
 	P4TC_OBJ_ACT,
+	P4TC_OBJ_TABLE,
 	__P4TC_OBJ_MAX,
 };
 
@@ -101,6 +163,63 @@ enum {
 
 #define P4TC_T_MAX (__P4TC_T_MAX - 1)
 
+enum {
+	P4TC_TABLE_DEFAULT_ACTION_UNSPEC,
+	P4TC_TABLE_DEFAULT_ACTION,
+	P4TC_TABLE_DEFAULT_ACTION_NOACTION,
+	P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS,
+	__P4TC_TABLE_DEFAULT_ACTION_MAX
+};
+
+#define P4TC_TABLE_DEFAULT_ACTION_MAX (__P4TC_TABLE_DEFAULT_ACTION_MAX - 1)
+
+enum {
+	P4TC_TABLE_ACTS_DEFAULT_ONLY,
+	P4TC_TABLE_ACTS_TABLE_ONLY,
+	__P4TC_TABLE_ACTS_FLAGS_MAX,
+};
+
+#define P4TC_TABLE_ACTS_FLAGS_MAX (__P4TC_TABLE_ACTS_FLAGS_MAX - 1)
+
+enum {
+	P4TC_TABLE_ACT_UNSPEC,
+	P4TC_TABLE_ACT_FLAGS, /* u8 */
+	P4TC_TABLE_ACT_NAME, /* string */
+	__P4TC_TABLE_ACT_MAX
+};
+
+#define P4TC_TABLE_ACT_MAX (__P4TC_TABLE_ACT_MAX - 1)
+
+/* Table type attributes */
+enum {
+	P4TC_TABLE_UNSPEC,
+	P4TC_TABLE_NAME, /* string - mandatory for create and update */
+	P4TC_TABLE_KEYSZ, /* u32 - mandatory for create */
+	P4TC_TABLE_MAX_ENTRIES, /* u32 */
+	P4TC_TABLE_MAX_MASKS, /* u32 */
+	P4TC_TABLE_NUM_ENTRIES, /* u32 */
+	P4TC_TABLE_PERMISSIONS, /* u16 */
+	P4TC_TABLE_TYPE, /* u8 */
+	P4TC_TABLE_DEFAULT_HIT, /* nested default hit action attributes */
+	P4TC_TABLE_DEFAULT_MISS, /* nested default miss action attributes */
+	P4TC_TABLE_CONST_ENTRY, /* nested const table entry */
+	P4TC_TABLE_ACTS_LIST, /* nested table actions list */
+	P4TC_TABLE_NUM_TIMER_PROFILES, /* u32 - number of timer profiles */
+	P4TC_TABLE_TIMER_PROFILES, /* nested timer profiles
+				    * kernel -> user space only
+				    */
+	__P4TC_TABLE_MAX
+};
+
+#define P4TC_TABLE_MAX (__P4TC_TABLE_MAX - 1)
+
+enum {
+	P4TC_TIMER_PROFILE_UNSPEC,
+	P4TC_TIMER_PROFILE_ID, /* u32 */
+	P4TC_TIMER_PROFILE_AGING, /* u64 */
+	__P4TC_TIMER_PROFILE_MAX
+};
+
+#define P4TC_TIMER_PROFILE_MAX (__P4TC_TIMER_PROFILE_MAX - 1)
+
 /* Action attributes */
 enum {
 	P4TC_ACT_UNSPEC,
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 7dbcf8915..7a9c13f86 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
-	p4tc_action.o
+	p4tc_action.o p4tc_table.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
index d47ccd69a..4b7b5501a 100644
--- a/net/sched/p4tc/p4tc_action.c
+++ b/net/sched/p4tc/p4tc_action.c
@@ -661,13 +661,13 @@ p4a_parm_find_byname(struct idr *params_idr, const char *param_name)
 	return NULL;
 }
 
-static struct p4tc_act_param *
+struct p4tc_act_param *
 p4a_parm_find_byid(struct idr *params_idr, const u32 param_id)
 {
 	return idr_find(params_idr, param_id);
 }
 
-static struct p4tc_act_param *
+struct p4tc_act_param *
 p4a_parm_find_byany(struct p4tc_act *act, const char *param_name,
 		    const u32 param_id, struct netlink_ext_ack *extack)
 {
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index 626d98734..9b3cc9245 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -76,6 +76,7 @@ static const struct nla_policy tc_pipeline_policy[P4TC_PIPELINE_MAX + 1] = {
 static void p4tc_pipeline_destroy(struct p4tc_pipeline *pipeline)
 {
 	idr_destroy(&pipeline->p_act_idr);
+	idr_destroy(&pipeline->p_tbl_idr);
 
 	kfree(pipeline);
 }
@@ -98,9 +99,13 @@ static void p4tc_pipeline_teardown(struct p4tc_pipeline *pipeline,
 	struct net *net = pipeline->net;
 	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
 	struct net *pipeline_net = maybe_get_net(net);
-	unsigned long iter_act_id;
+	unsigned long iter_act_id, tmp;
+	struct p4tc_table *table;
 	struct p4tc_act *act;
-	unsigned long tmp;
+	unsigned long tbl_id;
+
+	idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, tbl_id)
+		table->common.ops->put(pipeline, &table->common, extack);
 
 	idr_for_each_entry_ul(&pipeline->p_act_idr, act, tmp, iter_act_id)
 		act->common.ops->put(pipeline, &act->common, extack);
@@ -153,22 +158,23 @@ static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
 static int pipeline_try_set_state_ready(struct p4tc_pipeline *pipeline,
 					struct netlink_ext_ack *extack)
 {
+	int ret;
+
 	if (pipeline->curr_tables != pipeline->num_tables) {
 		NL_SET_ERR_MSG(extack,
 			       "Must have all table defined to update state to ready");
 		return -EINVAL;
 	}
 
+	ret = p4tc_table_try_set_state_ready(pipeline, extack);
+	if (ret < 0)
+		return ret;
+
 	pipeline->p_state = P4TC_STATE_READY;
 
 	return true;
 }
 
-static bool p4tc_pipeline_sealed(struct p4tc_pipeline *pipeline)
-{
-	return pipeline->p_state == P4TC_STATE_READY;
-}
-
 struct p4tc_pipeline *p4tc_pipeline_find_byid(struct net *net, const u32 pipeid)
 {
 	struct p4tc_pipeline_net *pipe_net;
@@ -262,6 +268,9 @@ p4tc_pipeline_create(struct net *net, struct nlmsghdr *n,
 
 	idr_init(&pipeline->p_act_idr);
 
+	idr_init(&pipeline->p_tbl_idr);
+	pipeline->curr_tables = 0;
+
 	pipeline->num_created_acts = 0;
 
 	pipeline->p_state = P4TC_STATE_NOT_READY;
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
new file mode 100644
index 000000000..6ac1af3f1
--- /dev/null
+++ b/net/sched/p4tc/p4tc_table.c
@@ -0,0 +1,1606 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_table.c	P4 TC TABLE
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+static int __p4tc_table_try_set_state_ready(struct p4tc_table *table,
+					    struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_entry_mask __rcu **masks_array;
+	unsigned long *tbl_free_masks_bitmap;
+
+	masks_array = kcalloc(table->tbl_max_masks,
+			      sizeof(*table->tbl_masks_array),
+			      GFP_KERNEL);
+	if (!masks_array)
+		return -ENOMEM;
+
+	tbl_free_masks_bitmap =
+		bitmap_alloc(P4TC_MAX_TMASKS, GFP_KERNEL);
+	if (!tbl_free_masks_bitmap) {
+		kfree(masks_array);
+		return -ENOMEM;
+	}
+
+	bitmap_fill(tbl_free_masks_bitmap, P4TC_MAX_TMASKS);
+
+	table->tbl_masks_array = masks_array;
+	rcu_replace_pointer_rtnl(table->tbl_free_masks_bitmap,
+				 tbl_free_masks_bitmap);
+
+	return 0;
+}
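The function above allocates the per-table masks array and a bitmap in which every set bit marks a free mask ID. A simplified userspace stand-in for the kernel bitmap helpers (`bitmap_fill`, find-first-bit), assuming a GCC/Clang builtin for the bit scan; names here are illustrative:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MAX_TMASKS 128 /* stand-in for P4TC_MAX_TMASKS */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS ((MAX_TMASKS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Mark every mask ID as free, like bitmap_fill() when the table is sealed */
static void masks_bitmap_fill(unsigned long *map)
{
	memset(map, 0xff, BITMAP_LONGS * sizeof(unsigned long));
}

/* Claim the lowest free mask ID (find first set bit, then clear it);
 * returns -1 when all mask IDs are in use.
 */
static int masks_bitmap_claim(unsigned long *map)
{
	for (size_t i = 0; i < BITMAP_LONGS; i++) {
		if (map[i]) {
			int bit = __builtin_ctzl(map[i]);

			map[i] &= ~(1UL << bit);
			return (int)(i * BITS_PER_LONG) + bit;
		}
	}
	return -1;
}
```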
+
+static void free_table_cache_array(struct p4tc_table **set_tables,
+				   int num_tables)
+{
+	int i;
+
+	for (i = 0; i < num_tables; i++) {
+		struct p4tc_table_entry_mask __rcu **masks_array;
+		struct p4tc_table *table = set_tables[i];
+		unsigned long *free_masks_bitmap;
+
+		masks_array = table->tbl_masks_array;
+
+		kfree(masks_array);
+		free_masks_bitmap =
+			rtnl_dereference(table->tbl_free_masks_bitmap);
+		bitmap_free(free_masks_bitmap);
+	}
+}
+
+int p4tc_table_try_set_state_ready(struct p4tc_pipeline *pipeline,
+				   struct netlink_ext_ack *extack)
+{
+	struct p4tc_table **set_tables;
+	struct p4tc_table *table;
+	unsigned long tmp, id;
+	int i = 0;
+	int ret;
+
+	set_tables = kcalloc(pipeline->num_tables, sizeof(*set_tables),
+			     GFP_KERNEL);
+	if (!set_tables)
+		return -ENOMEM;
+
+	idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id) {
+		ret = __p4tc_table_try_set_state_ready(table, extack);
+		if (ret < 0)
+			goto free_set_tables;
+		set_tables[i] = table;
+		i++;
+	}
+	kfree(set_tables);
+
+	return 0;
+
+free_set_tables:
+	free_table_cache_array(set_tables, i);
+	kfree(set_tables);
+	return ret;
+}
+
+static const struct netlink_range_validation keysz_range = {
+	.min = 1,
+	.max = P4TC_MAX_KEYSZ,
+};
+
+static const struct netlink_range_validation max_entries_range = {
+	.min = 1,
+	.max = P4TC_MAX_TENTRIES,
+};
+
+static const struct netlink_range_validation max_masks_range = {
+	.min = 1,
+	.max = P4TC_MAX_TMASKS,
+};
+
+static const struct netlink_range_validation permissions_range = {
+	.min = 0,
+	.max = P4TC_MAX_PERMISSION,
+};
+
+static const struct nla_policy p4tc_table_policy[P4TC_TABLE_MAX + 1] = {
+	[P4TC_TABLE_NAME] = { .type = NLA_STRING, .len = P4TC_TABLE_NAMSIZ },
+	[P4TC_TABLE_KEYSZ] = NLA_POLICY_FULL_RANGE(NLA_U32, &keysz_range),
+	[P4TC_TABLE_MAX_ENTRIES] =
+		NLA_POLICY_FULL_RANGE(NLA_U32, &max_entries_range),
+	[P4TC_TABLE_MAX_MASKS] =
+		NLA_POLICY_FULL_RANGE(NLA_U32, &max_masks_range),
+	[P4TC_TABLE_PERMISSIONS] =
+		NLA_POLICY_FULL_RANGE(NLA_U16, &permissions_range),
+	[P4TC_TABLE_TYPE] =
+		NLA_POLICY_RANGE(NLA_U8, P4TC_TABLE_TYPE_EXACT,
+				 P4TC_TABLE_TYPE_MAX),
+	[P4TC_TABLE_DEFAULT_HIT] = { .type = NLA_NESTED },
+	[P4TC_TABLE_DEFAULT_MISS] = { .type = NLA_NESTED },
+	[P4TC_TABLE_ACTS_LIST] = { .type = NLA_NESTED },
+	[P4TC_TABLE_NUM_TIMER_PROFILES] =
+		NLA_POLICY_RANGE(NLA_U32, 1, P4TC_MAX_NUM_TIMER_PROFILES),
+};
+
+static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
+{
+	struct p4tc_table_timer_profile *timer_profile;
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_table_perm *tbl_perm;
+	struct p4tc_table_act *table_act;
+	struct nlattr *nested_profiles;
+	struct nlattr *nested_tbl_acts;
+	struct nlattr *default_missact;
+	struct nlattr *default_hitact;
+	struct nlattr *nested_count;
+	unsigned long profile_id;
+	struct nlattr *nest;
+	int i = 1;
+
+	if (nla_put_u32(skb, P4TC_PATH, table->tbl_id))
+		goto out_nlmsg_trim;
+
+	nest = nla_nest_start(skb, P4TC_PARAMS);
+	if (!nest)
+		goto out_nlmsg_trim;
+
+	if (nla_put_string(skb, P4TC_TABLE_NAME, table->common.name))
+		goto out_nlmsg_trim;
+
+	if (table->tbl_dflt_hitact) {
+		struct p4tc_table_defact *hitact;
+		struct tcf_p4act *p4_hitact;
+
+		default_hitact = nla_nest_start(skb, P4TC_TABLE_DEFAULT_HIT);
+		rcu_read_lock();
+		hitact = rcu_dereference_rtnl(table->tbl_dflt_hitact);
+		p4_hitact = to_p4act(hitact->acts[0]);
+		if (p4tc_table_defact_is_noaction(p4_hitact)) {
+			if (nla_put_u8(skb,
+				       P4TC_TABLE_DEFAULT_ACTION_NOACTION,
+				       1) < 0) {
+				rcu_read_unlock();
+				goto out_nlmsg_trim;
+			}
+		} else if (hitact->acts[0]) {
+			struct nlattr *nest_defact;
+
+			nest_defact = nla_nest_start(skb,
+						     P4TC_TABLE_DEFAULT_ACTION);
+			if (tcf_action_dump(skb, hitact->acts, 0, 0,
+					    false) < 0) {
+				rcu_read_unlock();
+				goto out_nlmsg_trim;
+			}
+			nla_nest_end(skb, nest_defact);
+		}
+		if (nla_put_u16(skb, P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS,
+				hitact->perm) < 0) {
+			rcu_read_unlock();
+			goto out_nlmsg_trim;
+		}
+		rcu_read_unlock();
+		nla_nest_end(skb, default_hitact);
+	}
+
+	if (table->tbl_dflt_missact) {
+		struct p4tc_table_defact *missact;
+		struct tcf_p4act *p4_missact;
+
+		default_missact = nla_nest_start(skb, P4TC_TABLE_DEFAULT_MISS);
+		rcu_read_lock();
+		missact = rcu_dereference_rtnl(table->tbl_dflt_missact);
+		p4_missact = to_p4act(missact->acts[0]);
+		if (p4tc_table_defact_is_noaction(p4_missact)) {
+			if (nla_put_u8(skb,
+				       P4TC_TABLE_DEFAULT_ACTION_NOACTION,
+				       1) < 0) {
+				rcu_read_unlock();
+				goto out_nlmsg_trim;
+			}
+		} else if (missact->acts[0]) {
+			struct nlattr *nest_defact;
+
+			nest_defact = nla_nest_start(skb,
+						     P4TC_TABLE_DEFAULT_ACTION);
+			if (tcf_action_dump(skb, missact->acts, 0, 0,
+					    false) < 0) {
+				rcu_read_unlock();
+				goto out_nlmsg_trim;
+			}
+			nla_nest_end(skb, nest_defact);
+		}
+		if (nla_put_u16(skb, P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS,
+				missact->perm) < 0) {
+			rcu_read_unlock();
+			goto out_nlmsg_trim;
+		}
+		rcu_read_unlock();
+		nla_nest_end(skb, default_missact);
+	}
+
+	if (nla_put_u32(skb, P4TC_TABLE_NUM_TIMER_PROFILES,
+			atomic_read(&table->tbl_num_timer_profiles)) < 0)
+		goto out_nlmsg_trim;
+
+	nested_profiles = nla_nest_start(skb, P4TC_TABLE_TIMER_PROFILES);
+	i = 1;
+	rcu_read_lock();
+	xa_for_each(&table->tbl_profiles_xa, profile_id, timer_profile) {
+		nested_count = nla_nest_start(skb, i);
+		if (nla_put_u32(skb, P4TC_TIMER_PROFILE_ID,
+				timer_profile->profile_id)) {
+			rcu_read_unlock();
+			goto out_nlmsg_trim;
+		}
+
+		if (nla_put(skb, P4TC_TIMER_PROFILE_AGING, sizeof(u64),
+			    &timer_profile->aging_ms)) {
+			rcu_read_unlock();
+			goto out_nlmsg_trim;
+		}
+
+		nla_nest_end(skb, nested_count);
+		i++;
+	}
+	rcu_read_unlock();
+	nla_nest_end(skb, nested_profiles);
+
+	nested_tbl_acts = nla_nest_start(skb, P4TC_TABLE_ACTS_LIST);
+	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+		nested_count = nla_nest_start(skb, i);
+		if (nla_put_string(skb, P4TC_TABLE_ACT_NAME,
+				   table_act->act->common.name) < 0)
+			goto out_nlmsg_trim;
+		if (nla_put_u8(skb, P4TC_TABLE_ACT_FLAGS,
+			       table_act->flags) < 0)
+			goto out_nlmsg_trim;
+
+		nla_nest_end(skb, nested_count);
+		i++;
+	}
+	nla_nest_end(skb, nested_tbl_acts);
+
+	if (nla_put_u32(skb, P4TC_TABLE_KEYSZ, table->tbl_keysz))
+		goto out_nlmsg_trim;
+
+	if (nla_put_u32(skb, P4TC_TABLE_MAX_ENTRIES, table->tbl_max_entries))
+		goto out_nlmsg_trim;
+
+	if (nla_put_u32(skb, P4TC_TABLE_MAX_MASKS, table->tbl_max_masks))
+		goto out_nlmsg_trim;
+
+	tbl_perm = rcu_dereference_rtnl(table->tbl_permissions);
+	if (nla_put_u16(skb, P4TC_TABLE_PERMISSIONS, tbl_perm->permissions))
+		goto out_nlmsg_trim;
+
+	nla_nest_end(skb, nest);
+
+	return skb->len;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4tc_table_fill_nlmsg(struct net *net, struct sk_buff *skb,
+				 struct p4tc_template_common *template,
+				 struct netlink_ext_ack *extack)
+{
+	struct p4tc_table *table = p4tc_to_table(template);
+
+	if (_p4tc_table_fill_nlmsg(skb, table) <= 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Failed to fill notification attributes for table");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
+{
+	if (defact) {
+		p4tc_action_destroy(defact->acts);
+		kfree(defact);
+	}
+}
+
+static void
+p4tc_table_timer_profile_destroy(struct p4tc_table *table,
+				 struct p4tc_table_timer_profile *table_profile)
+{
+	struct xarray *profiles_xa = &table->tbl_profiles_xa;
+
+	atomic_dec(&table->tbl_num_timer_profiles);
+	xa_erase(profiles_xa, table_profile->profile_id);
+
+	kfree_rcu(table_profile, rcu);
+}
+
+static void p4tc_table_timer_profiles_destroy(struct p4tc_table *table)
+{
+	struct p4tc_table_timer_profile *table_profile;
+	unsigned long profile_id;
+
+	mutex_lock(&table->tbl_profiles_xa_lock);
+	xa_for_each(&table->tbl_profiles_xa, profile_id, table_profile)
+		p4tc_table_timer_profile_destroy(table, table_profile);
+
+	xa_destroy(&table->tbl_profiles_xa);
+	mutex_unlock(&table->tbl_profiles_xa_lock);
+}
+
+/* From the template, the user may only specify the number of timer profiles
+ * they want for the table. If this number is not specified during the table
+ * creation command, the kernel will create 4 timer profiles:
+ * - ID 0: 30000ms
+ * - ID 1: 60000ms
+ * - ID 2: 90000ms
+ * - ID 3: 120000ms
+ * If the user specifies the number of timer profiles, the aging for those
+ * profiles will be assigned using the same pattern as shown above, i.e.
+ * profile ID 0 will have aging 30000ms and the rest will conform to the
+ * following pattern:
+ * Aging(IDn) = Aging(IDn-1) + 30000ms
+ * These values may only be updated with the runtime command (p4ctrl) after
+ * the pipeline is sealed.
+ */
+static int
+p4tc_tmpl_timer_profiles_init(struct p4tc_table *table, const u32 num_profiles)
+{
+	struct xarray *profiles_xa = &table->tbl_profiles_xa;
+	u64 aging_ms = P4TC_TIMER_PROFILE_ZERO_AGING_MS;
+	struct p4tc_table_timer_profile *table_profile;
+	int ret;
+	int i;
+
+	/* No need for locking here because the pipeline is sealed and we are
+	 * protected by the RTNL lock
+	 */
+	xa_init(profiles_xa);
+	for (i = P4TC_DEFAULT_TIMER_PROFILE_ID; i < num_profiles; i++) {
+		table_profile = kzalloc(sizeof(*table_profile), GFP_KERNEL);
+		if (unlikely(!table_profile))
+			return -ENOMEM;
+
+		table_profile->profile_id = i;
+		table_profile->aging_ms = aging_ms;
+
+		ret = xa_insert(profiles_xa, i, table_profile, GFP_KERNEL);
+		if (ret < 0) {
+			kfree(table_profile);
+			goto profiles_destroy;
+		}
+		atomic_inc(&table->tbl_num_timer_profiles);
+		aging_ms += P4TC_TIMER_PROFILE_ZERO_AGING_MS;
+	}
+	mutex_init(&table->tbl_profiles_xa_lock);
+
+	return 0;
+
+profiles_destroy:
+	p4tc_table_timer_profiles_destroy(table);
+	return ret;
+}
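The recurrence in the comment above, Aging(IDn) = Aging(IDn-1) + 30000ms with Aging(ID0) = 30000ms, reduces to a closed form. A small sketch of the default aging computation (`default_profile_aging_ms` is an illustrative helper, not a function in the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for P4TC_TIMER_PROFILE_ZERO_AGING_MS */
#define PROFILE_ZERO_AGING_MS 30000ULL

/* Default aging for profile ID n:
 * Aging(IDn) = Aging(IDn-1) + 30000ms, i.e. (n + 1) * 30000ms
 */
static inline uint64_t default_profile_aging_ms(uint32_t profile_id)
{
	return (profile_id + 1) * PROFILE_ZERO_AGING_MS;
}
```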
+
+static void p4tc_table_acts_list_destroy(struct list_head *acts_list)
+{
+	struct p4tc_table_act *table_act, *tmp;
+
+	list_for_each_entry_safe(table_act, tmp, acts_list, node) {
+		list_del(&table_act->node);
+		if (!p4tc_table_act_is_noaction(table_act))
+			p4tc_action_put_ref(table_act->act);
+		kfree(table_act);
+	}
+}
+
+static void p4tc_table_acts_list_replace(struct list_head *dst,
+					 struct list_head *src)
+{
+	p4tc_table_acts_list_destroy(dst);
+	list_splice_init(src, dst);
+	kfree(src);
+}
+
+static void __p4tc_table_put_mask_array(struct p4tc_table *table)
+{
+	unsigned long *free_masks_bitmap;
+
+	kfree(table->tbl_masks_array);
+
+	free_masks_bitmap = rcu_dereference_rtnl(table->tbl_free_masks_bitmap);
+	bitmap_free(free_masks_bitmap);
+}
+
+void p4tc_table_put_mask_array(struct p4tc_pipeline *pipeline)
+{
+	struct p4tc_table *table;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id) {
+		__p4tc_table_put_mask_array(table);
+	}
+}
+
+static int _p4tc_table_put(struct net *net, struct nlattr **tb,
+			   struct p4tc_pipeline *pipeline,
+			   struct p4tc_table *table,
+			   struct netlink_ext_ack *extack)
+{
+	bool default_act_del = false;
+	struct p4tc_table_perm *perm;
+
+	if (tb)
+		default_act_del = tb[P4TC_TABLE_DEFAULT_HIT] ||
+			tb[P4TC_TABLE_DEFAULT_MISS];
+
+	if (!default_act_del) {
+		if (!refcount_dec_if_one(&table->tbl_ctrl_ref)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to delete referenced table");
+			return -EBUSY;
+		}
+	}
+
+	if (tb && tb[P4TC_TABLE_DEFAULT_HIT]) {
+		struct p4tc_table_defact *hitact;
+
+		rcu_read_lock();
+		hitact = rcu_dereference(table->tbl_dflt_hitact);
+		if (hitact && !p4tc_ctrl_delete_ok(hitact->perm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to delete default hitact");
+			rcu_read_unlock();
+			return -EPERM;
+		}
+		rcu_read_unlock();
+	}
+
+	if (tb && tb[P4TC_TABLE_DEFAULT_MISS]) {
+		struct p4tc_table_defact *missact;
+
+		rcu_read_lock();
+		missact = rcu_dereference(table->tbl_dflt_missact);
+		if (missact && !p4tc_ctrl_delete_ok(missact->perm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to delete default missact");
+			rcu_read_unlock();
+			return -EPERM;
+		}
+		rcu_read_unlock();
+	}
+
+	if (!default_act_del || tb[P4TC_TABLE_DEFAULT_HIT]) {
+		struct p4tc_table_defact *hitact;
+
+		hitact = rtnl_dereference(table->tbl_dflt_hitact);
+		if (hitact) {
+			rcu_replace_pointer_rtnl(table->tbl_dflt_hitact, NULL);
+			synchronize_rcu();
+			p4tc_table_defact_destroy(hitact);
+		}
+	}
+
+	if (!default_act_del || tb[P4TC_TABLE_DEFAULT_MISS]) {
+		struct p4tc_table_defact *missact;
+
+		missact = rtnl_dereference(table->tbl_dflt_missact);
+		if (missact) {
+			rcu_replace_pointer_rtnl(table->tbl_dflt_missact, NULL);
+			synchronize_rcu();
+			p4tc_table_defact_destroy(missact);
+		}
+	}
+
+	if (default_act_del)
+		return 0;
+
+	p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+	p4tc_table_timer_profiles_destroy(table);
+
+	idr_destroy(&table->tbl_masks_idr);
+	idr_destroy(&table->tbl_prio_idr);
+
+	perm = rcu_replace_pointer_rtnl(table->tbl_permissions, NULL);
+	kfree_rcu(perm, rcu);
+
+	idr_remove(&pipeline->p_tbl_idr, table->tbl_id);
+	pipeline->curr_tables -= 1;
+
+	__p4tc_table_put_mask_array(table);
+
+	kfree(table);
+
+	return 0;
+}
+
+static int p4tc_table_put(struct p4tc_pipeline *pipeline,
+			  struct p4tc_template_common *tmpl,
+			  struct netlink_ext_ack *extack)
+{
+	struct p4tc_table *table = p4tc_to_table(tmpl);
+
+	return _p4tc_table_put(pipeline->net, NULL, pipeline, table, extack);
+}
+
+struct p4tc_table *p4tc_table_find_byid(struct p4tc_pipeline *pipeline,
+					const u32 tbl_id)
+{
+	return idr_find(&pipeline->p_tbl_idr, tbl_id);
+}
+
+static struct p4tc_table *p4tc_table_find_byname(const char *tblname,
+						 struct p4tc_pipeline *pipeline)
+{
+	struct p4tc_table *table;
+	unsigned long tmp, id;
+
+	idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, id)
+		if (strncmp(table->common.name, tblname,
+			    P4TC_TABLE_NAMSIZ) == 0)
+			return table;
+
+	return NULL;
+}
+
+struct p4tc_table *p4tc_table_find_byany(struct p4tc_pipeline *pipeline,
+					 const char *tblname, const u32 tbl_id,
+					 struct netlink_ext_ack *extack)
+{
+	struct p4tc_table *table;
+	int err;
+
+	if (tbl_id) {
+		table = p4tc_table_find_byid(pipeline, tbl_id);
+		if (!table) {
+			NL_SET_ERR_MSG(extack, "Unable to find table by id");
+			err = -EINVAL;
+			goto out;
+		}
+	} else {
+		if (tblname) {
+			table = p4tc_table_find_byname(tblname, pipeline);
+			if (!table) {
+				NL_SET_ERR_MSG(extack, "Table name not found");
+				err = -EINVAL;
+				goto out;
+			}
+		} else {
+			NL_SET_ERR_MSG(extack, "Must specify table name or id");
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	return table;
+out:
+	return ERR_PTR(err);
+}
+
+static bool p4tc_table_get(struct p4tc_table *table)
+{
+	return refcount_inc_not_zero(&table->tbl_ctrl_ref);
+}
+
+struct p4tc_table *p4tc_table_find_get(struct p4tc_pipeline *pipeline,
+				       const char *tblname, const u32 tbl_id,
+				       struct netlink_ext_ack *extack)
+{
+	struct p4tc_table *table;
+
+	table = p4tc_table_find_byany(pipeline, tblname, tbl_id, extack);
+	if (IS_ERR(table))
+		return table;
+
+	if (!p4tc_table_get(table)) {
+		NL_SET_ERR_MSG(extack, "Table is marked for deletion");
+		return ERR_PTR(-EBUSY);
+	}
+
+	return table;
+}
+
+static struct p4tc_act NoAction = {
+	.common.p_id = 0,
+	.common.name = "NoAction",
+	.a_id = 0,
+};
+
+/* Permissions can also be updated by the runtime command */
+static struct p4tc_table_defact *
+__p4tc_table_init_defact(struct net *net, struct nlattr **tb, u32 pipeid,
+			 __u16 perm, struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_defact *defact;
+	int ret;
+
+	defact = kzalloc(sizeof(*defact), GFP_KERNEL);
+	if (!defact) {
+		NL_SET_ERR_MSG(extack, "Failed to initialize default actions");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (tb[P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS]) {
+		__u16 nperm;
+
+		nperm = nla_get_u16(tb[P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS]);
+		if (!p4tc_ctrl_read_ok(nperm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Default action must have ctrl path read permissions");
+			ret = -EINVAL;
+			goto err;
+		}
+		if (!p4tc_data_read_ok(nperm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Default action must have data path read permissions");
+			ret = -EINVAL;
+			goto err;
+		}
+		if (!p4tc_data_exec_ok(nperm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Default action must have data path execute permissions");
+			ret = -EINVAL;
+			goto err;
+		}
+		defact->perm = nperm;
+	} else {
+		defact->perm = perm;
+	}
+
+	if (tb[P4TC_TABLE_DEFAULT_ACTION_NOACTION] &&
+	    tb[P4TC_TABLE_DEFAULT_ACTION]) {
+		NL_SET_ERR_MSG(extack,
+			       "Specifying no action and action simultaneously is not allowed");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (tb[P4TC_TABLE_DEFAULT_ACTION]) {
+		if (!p4tc_ctrl_update_ok(perm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to update default action");
+			ret = -EPERM;
+			goto err;
+		}
+
+		ret = p4tc_action_init(net, tb[P4TC_TABLE_DEFAULT_ACTION],
+				       defact->acts, pipeid, 0, extack);
+		if (ret < 0)
+			goto err;
+	} else if (tb[P4TC_TABLE_DEFAULT_ACTION_NOACTION]) {
+		struct tcf_p4act *p4_defact;
+
+		if (!p4tc_ctrl_update_ok(perm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to update default action");
+			ret = -EPERM;
+			goto err;
+		}
+
+		p4_defact = kzalloc(sizeof(*p4_defact), GFP_KERNEL);
+		if (!p4_defact) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		p4_defact->p_id = 0;
+		p4_defact->act_id = 0;
+		defact->acts[0] = (struct tc_action *)p4_defact;
+	}
+
+	return defact;
+
+err:
+	kfree(defact);
+	return ERR_PTR(ret);
+}
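The three permission checks in `__p4tc_table_init_defact()` above can be summarized as: a default action must be readable from the control path and both readable and executable from the data path. A userspace sketch of that predicate (bit values copied from the UAPI header; `defact_perm_valid` is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Masks copied from the UAPI header above */
#define P4TC_CTRL_PERM_R (1 << 12)
#define P4TC_DATA_PERM_R (1 << 5)
#define P4TC_DATA_PERM_X (1 << 2)

/* Mirrors the validation in __p4tc_table_init_defact(): reject any
 * default-action permission lacking ctrl read, data read, or data exec.
 */
static bool defact_perm_valid(uint16_t perm)
{
	return (perm & P4TC_CTRL_PERM_R) &&
	       (perm & P4TC_DATA_PERM_R) &&
	       (perm & P4TC_DATA_PERM_X);
}
```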
+
+static bool p4tc_table_check_defacts(struct tc_action *defact,
+				     struct list_head *acts_list)
+{
+	struct tcf_p4act *p4_defact = to_p4act(defact);
+	struct p4tc_table_act *table_act;
+
+	list_for_each_entry(table_act, acts_list, node) {
+		if (table_act->act->common.p_id == p4_defact->p_id &&
+		    table_act->act->a_id == p4_defact->act_id &&
+		    !(table_act->flags & BIT(P4TC_TABLE_ACTS_TABLE_ONLY)))
+			return true;
+	}
+
+	return false;
+}
+
+static const struct nla_policy
+p4tc_table_default_policy[P4TC_TABLE_DEFAULT_ACTION_MAX + 1] = {
+	[P4TC_TABLE_DEFAULT_ACTION] = { .type = NLA_NESTED },
+	[P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS] =
+		NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+	[P4TC_TABLE_DEFAULT_ACTION_NOACTION] =
+		NLA_POLICY_RANGE(NLA_U8, 1, 1),
+};
+
+/* Called from both the runtime and the template code paths */
+static struct p4tc_table_defact *
+p4tc_table_init_default_act(struct net *net, struct nlattr *nla,
+			    struct p4tc_table *table,
+			    u16 perm, struct list_head *acts_list,
+			    struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_TABLE_DEFAULT_ACTION_MAX + 1];
+	struct p4tc_table_defact *defact;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_TABLE_DEFAULT_ACTION_MAX, nla,
+			       p4tc_table_default_policy, extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	if (!tb[P4TC_TABLE_DEFAULT_ACTION] &&
+	    !tb[P4TC_TABLE_DEFAULT_ACTION_PERMISSIONS] &&
+	    !tb[P4TC_TABLE_DEFAULT_ACTION_NOACTION]) {
+		NL_SET_ERR_MSG(extack,
+			       "Nested P4TC_TABLE_DEFAULT_ACTION attr is empty");
+		return ERR_PTR(-EINVAL);
+	}
+
+	defact = __p4tc_table_init_defact(net, tb, table->common.p_id, perm,
+					  extack);
+	if (IS_ERR(defact))
+		return defact;
+
+	if (defact->acts[0] &&
+	    !p4tc_table_check_defacts(defact->acts[0], acts_list)) {
+		NL_SET_ERR_MSG(extack,
+			       "Action is not allowed as default action");
+		p4tc_table_defact_destroy(defact);
+		return ERR_PTR(-EPERM);
+	}
+
+	return defact;
+}
+
+struct p4tc_table_perm *
+p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
+			    struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_perm *tbl_perm;
+
+	tbl_perm = kzalloc(sizeof(*tbl_perm), GFP_KERNEL);
+	if (!tbl_perm)
+		return ERR_PTR(-ENOMEM);
+
+	tbl_perm->permissions = permissions;
+
+	return tbl_perm;
+}
+
+void p4tc_table_replace_permissions(struct p4tc_table *table,
+				    struct p4tc_table_perm *tbl_perm,
+				    bool lock_rtnl)
+{
+	if (!tbl_perm)
+		return;
+
+	if (lock_rtnl)
+		rtnl_lock();
+	tbl_perm = rcu_replace_pointer_rtnl(table->tbl_permissions, tbl_perm);
+	if (lock_rtnl)
+		rtnl_unlock();
+	kfree_rcu(tbl_perm, rcu);
+}
+
+int p4tc_table_init_default_acts(struct net *net,
+				 struct p4tc_table_defact_params *dflt,
+				 struct p4tc_table *table,
+				 struct list_head *acts_list,
+				 struct netlink_ext_ack *extack)
+{
+	int ret;
+
+	dflt->missact = NULL;
+	dflt->hitact = NULL;
+
+	if (dflt->nla_hit) {
+		struct p4tc_table_defact *hitact;
+		u16 perm;
+
+		perm = P4TC_CONTROL_PERMISSIONS | P4TC_DATA_PERMISSIONS;
+
+		rcu_read_lock();
+		if (table->tbl_dflt_hitact)
+			perm = rcu_dereference(table->tbl_dflt_hitact)->perm;
+		rcu_read_unlock();
+
+		hitact = p4tc_table_init_default_act(net, dflt->nla_hit, table,
+						     perm, acts_list, extack);
+		if (IS_ERR(hitact))
+			return PTR_ERR(hitact);
+
+		dflt->hitact = hitact;
+	}
+
+	if (dflt->nla_miss) {
+		struct p4tc_table_defact *missact;
+		u16 perm;
+
+		perm = P4TC_CONTROL_PERMISSIONS | P4TC_DATA_PERMISSIONS;
+
+		rcu_read_lock();
+		if (table->tbl_dflt_missact)
+			perm = rcu_dereference(table->tbl_dflt_missact)->perm;
+		rcu_read_unlock();
+
+		missact = p4tc_table_init_default_act(net, dflt->nla_miss,
+						      table, perm, acts_list,
+						      extack);
+		if (IS_ERR(missact)) {
+			ret = PTR_ERR(missact);
+			goto default_hitacts_free;
+		}
+
+		dflt->missact = missact;
+	}
+
+	return 0;
+
+default_hitacts_free:
+	p4tc_table_defact_destroy(dflt->hitact);
+	return ret;
+}
+
+static const struct nla_policy p4tc_acts_list_policy[P4TC_TABLE_ACT_MAX + 1] = {
+	[P4TC_TABLE_ACT_FLAGS] =
+		NLA_POLICY_RANGE(NLA_U8, 0, BIT(P4TC_TABLE_ACTS_FLAGS_MAX)),
+	[P4TC_TABLE_ACT_NAME] = { .type = NLA_STRING, .len = ACTNAMSIZ },
+};
+
+static struct p4tc_table_act *
+p4tc_table_act_init(struct nlattr *nla, struct p4tc_pipeline *pipeline,
+		    struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_TABLE_ACT_MAX + 1];
+	struct p4tc_table_act *table_act;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_TABLE_ACT_MAX, nla,
+			       p4tc_acts_list_policy, extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	table_act = kzalloc(sizeof(*table_act), GFP_KERNEL);
+	if (unlikely(!table_act))
+		return ERR_PTR(-ENOMEM);
+
+	if (tb[P4TC_TABLE_ACT_NAME]) {
+		const char *fullname = nla_data(tb[P4TC_TABLE_ACT_NAME]);
+		char *pname, *aname, actname[ACTNAMSIZ];
+		struct p4tc_act *act;
+
+		nla_strscpy(actname, tb[P4TC_TABLE_ACT_NAME], ACTNAMSIZ);
+		aname = actname;
+
+		pname = strsep(&aname, "/");
+		if (!aname) {
+			if (strcmp(pname, "NoAction") == 0) {
+				table_act->act = &NoAction;
+				return table_act;
+			}
+
+			NL_SET_ERR_MSG(extack,
+				       "Action name must have format pname/actname");
+			ret = -EINVAL;
+			goto free_table_act;
+		}
+
+		if (strncmp(pipeline->common.name, pname,
+			    P4TC_PIPELINE_NAMSIZ)) {
+			NL_SET_ERR_MSG_FMT(extack, "Pipeline name must be %s\n",
+					   pipeline->common.name);
+			ret = -EINVAL;
+			goto free_table_act;
+		}
+
+		act = p4a_tmpl_get(pipeline, fullname, 0, extack);
+		if (IS_ERR(act)) {
+			ret = PTR_ERR(act);
+			goto free_table_act;
+		}
+
+		table_act->act = act;
+	} else {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify allowed table action name");
+		ret = -EINVAL;
+		goto free_table_act;
+	}
+
+	if (tb[P4TC_TABLE_ACT_FLAGS]) {
+		u8 *flags = nla_data(tb[P4TC_TABLE_ACT_FLAGS]);
+
+		if (*flags & BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY) &&
+		    *flags & BIT(P4TC_TABLE_ACTS_TABLE_ONLY)) {
+			NL_SET_ERR_MSG(extack,
+				       "defaultonly and tableonly are mutually exclusive");
+			ret = -EINVAL;
+			goto act_put;
+		}
+
+		table_act->flags = *flags;
+	}
+
+	if (table_act->act->num_runt_params) {
+		if (table_act->flags != BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY)) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Action with runtime parameters (%s) may only be default action",
+					   table_act->act->common.name);
+			ret = -EINVAL;
+			goto act_put;
+		}
+	}
+
+	return table_act;
+
+act_put:
+	p4tc_action_put_ref(table_act->act);
+
+free_table_act:
+	kfree(table_act);
+	return ERR_PTR(ret);
+}
+
+void p4tc_table_replace_default_acts(struct p4tc_table *table,
+				     struct p4tc_table_defact_params *dflt,
+				     bool lock_rtnl)
+{
+	if (dflt->hitact) {
+		bool updated_actions = !!dflt->hitact->acts[0];
+		struct p4tc_table_defact *hitact;
+
+		if (lock_rtnl)
+			rtnl_lock();
+		if (!updated_actions) {
+			hitact = rcu_dereference_rtnl(table->tbl_dflt_hitact);
+			p4tc_table_defacts_acts_copy(dflt->hitact, hitact);
+		}
+
+		hitact = rcu_replace_pointer_rtnl(table->tbl_dflt_hitact,
+						  dflt->hitact);
+		if (lock_rtnl)
+			rtnl_unlock();
+		if (hitact) {
+			synchronize_rcu();
+			if (updated_actions)
+				p4tc_table_defact_destroy(hitact);
+			else
+				kfree(hitact);
+		}
+	}
+
+	if (dflt->missact) {
+		bool updated_actions = !!dflt->missact->acts[0];
+		struct p4tc_table_defact *missact;
+
+		if (lock_rtnl)
+			rtnl_lock();
+		if (!updated_actions) {
+			missact = rcu_dereference_rtnl(table->tbl_dflt_missact);
+			p4tc_table_defacts_acts_copy(dflt->missact, missact);
+		}
+
+		missact = rcu_replace_pointer_rtnl(table->tbl_dflt_missact,
+						   dflt->missact);
+		if (lock_rtnl)
+			rtnl_unlock();
+		if (missact) {
+			synchronize_rcu();
+			if (updated_actions)
+				p4tc_table_defact_destroy(missact);
+			else
+				kfree(missact);
+		}
+	}
+}
+
+static int p4tc_table_acts_list_init(struct nlattr *nla,
+				     struct p4tc_pipeline *pipeline,
+				     struct list_head *acts_list,
+				     struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MSGBATCH_SIZE + 1];
+	struct p4tc_table_act *table_act;
+	int ret;
+	int i;
+
+	ret = nla_parse_nested(tb, P4TC_MSGBATCH_SIZE, nla, NULL, extack);
+	if (ret < 0)
+		return ret;
+
+	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && tb[i]; i++) {
+		table_act = p4tc_table_act_init(tb[i], pipeline, extack);
+		if (IS_ERR(table_act)) {
+			ret = PTR_ERR(table_act);
+			goto free_acts_list_list;
+		}
+		list_add_tail(&table_act->node, acts_list);
+	}
+
+	return 0;
+
+free_acts_list_list:
+	p4tc_table_acts_list_destroy(acts_list);
+
+	return ret;
+}
+
+static struct p4tc_table *
+p4tc_table_find_byanyattr(struct p4tc_pipeline *pipeline,
+			  struct nlattr *name_attr, const u32 tbl_id,
+			  struct netlink_ext_ack *extack)
+{
+	char *tblname = NULL;
+
+	if (name_attr)
+		tblname = nla_data(name_attr);
+
+	return p4tc_table_find_byany(pipeline, tblname, tbl_id, extack);
+}
+
+static const struct p4tc_template_ops p4tc_table_ops;
+
+static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
+					    u32 tbl_id,
+					    struct p4tc_pipeline *pipeline,
+					    struct netlink_ext_ack *extack)
+{
+	u32 num_profiles = P4TC_DEFAULT_NUM_TIMER_PROFILES;
+	struct p4tc_table_perm *tbl_init_perms = NULL;
+	struct p4tc_table_defact_params dflt = { 0 };
+	struct p4tc_table *table;
+	char *tblname;
+	int ret;
+
+	if (pipeline->curr_tables == pipeline->num_tables) {
+		NL_SET_ERR_MSG(extack,
+			       "Table range exceeded max allowed value");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Name has the following syntax cb/tname */
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_TABLE_NAME)) {
+		NL_SET_ERR_MSG(extack, "Must specify table name");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	tblname =
+		strnchr(nla_data(tb[P4TC_TABLE_NAME]), P4TC_TABLE_NAMSIZ, '/');
+	if (!tblname) {
+		NL_SET_ERR_MSG(extack, "Table name must contain control block");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	tblname += 1;
+	if (tblname[0] == '\0') {
+		NL_SET_ERR_MSG(extack, "Control block name is too big");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+					  NULL);
+	if (!IS_ERR(table)) {
+		NL_SET_ERR_MSG(extack, "Table already exists");
+		ret = -EEXIST;
+		goto out;
+	}
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table) {
+		NL_SET_ERR_MSG(extack, "Unable to create table");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	table->common.p_id = pipeline->common.p_id;
+	strscpy(table->common.name, nla_data(tb[P4TC_TABLE_NAME]),
+		P4TC_TABLE_NAMSIZ);
+
+	if (tb[P4TC_TABLE_KEYSZ]) {
+		table->tbl_keysz = nla_get_u32(tb[P4TC_TABLE_KEYSZ]);
+	} else {
+		NL_SET_ERR_MSG(extack, "Must specify table keysz");
+		ret = -EINVAL;
+		goto free;
+	}
+
+	if (tb[P4TC_TABLE_MAX_ENTRIES])
+		table->tbl_max_entries =
+			nla_get_u32(tb[P4TC_TABLE_MAX_ENTRIES]);
+	else
+		table->tbl_max_entries = P4TC_DEFAULT_TENTRIES;
+
+	if (tb[P4TC_TABLE_MAX_MASKS])
+		table->tbl_max_masks = nla_get_u32(tb[P4TC_TABLE_MAX_MASKS]);
+	else
+		table->tbl_max_masks = P4TC_DEFAULT_TMASKS;
+
+	if (tb[P4TC_TABLE_PERMISSIONS]) {
+		u16 tbl_permissions = nla_get_u16(tb[P4TC_TABLE_PERMISSIONS]);
+
+		tbl_init_perms = p4tc_table_init_permissions(table,
+							     tbl_permissions,
+							     extack);
+		if (IS_ERR(tbl_init_perms)) {
+			ret = PTR_ERR(tbl_init_perms);
+			goto free;
+		}
+		rcu_assign_pointer(table->tbl_permissions, tbl_init_perms);
+	} else {
+		u16 tbl_permissions = P4TC_TABLE_DEFAULT_PERMISSIONS;
+
+		tbl_init_perms = p4tc_table_init_permissions(table,
+							     tbl_permissions,
+							     extack);
+		if (IS_ERR(tbl_init_perms)) {
+			ret = PTR_ERR(tbl_init_perms);
+			goto free;
+		}
+		rcu_assign_pointer(table->tbl_permissions, tbl_init_perms);
+	}
+
+	if (tb[P4TC_TABLE_TYPE])
+		table->tbl_type = nla_get_u8(tb[P4TC_TABLE_TYPE]);
+	else
+		table->tbl_type = P4TC_TABLE_TYPE_EXACT;
+
+	refcount_set(&table->tbl_ctrl_ref, 1);
+
+	if (tbl_id) {
+		table->tbl_id = tbl_id;
+		ret = idr_alloc_u32(&pipeline->p_tbl_idr, table, &table->tbl_id,
+				    table->tbl_id, GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to allocate table id");
+			goto free_permissions;
+		}
+	} else {
+		table->tbl_id = 1;
+		ret = idr_alloc_u32(&pipeline->p_tbl_idr, table, &table->tbl_id,
+				    UINT_MAX, GFP_KERNEL);
+		if (ret < 0) {
+			NL_SET_ERR_MSG(extack, "Unable to allocate table id");
+			goto free_permissions;
+		}
+	}
+
+	INIT_LIST_HEAD(&table->tbl_acts_list);
+	if (tb[P4TC_TABLE_ACTS_LIST]) {
+		ret = p4tc_table_acts_list_init(tb[P4TC_TABLE_ACTS_LIST],
+						pipeline, &table->tbl_acts_list,
+						extack);
+		if (ret < 0)
+			goto idr_rm;
+	}
+
+	dflt.nla_hit = tb[P4TC_TABLE_DEFAULT_HIT];
+	dflt.nla_miss = tb[P4TC_TABLE_DEFAULT_MISS];
+
+	ret = p4tc_table_init_default_acts(net, &dflt, table,
+					   &table->tbl_acts_list, extack);
+	if (ret < 0)
+		goto idr_rm;
+
+	if (dflt.hitact && !dflt.hitact->acts[0]) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify defaults_hit_actions's action values");
+		ret = -EINVAL;
+		goto defaultacts_destroy;
+	}
+
+	if (dflt.missact && !dflt.missact->acts[0]) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify defaults_miss_actions's action values");
+		ret = -EINVAL;
+		goto defaultacts_destroy;
+	}
+
+	rcu_replace_pointer_rtnl(table->tbl_dflt_hitact, dflt.hitact);
+	rcu_replace_pointer_rtnl(table->tbl_dflt_missact, dflt.missact);
+
+	if (tb[P4TC_TABLE_NUM_TIMER_PROFILES])
+		num_profiles = nla_get_u32(tb[P4TC_TABLE_NUM_TIMER_PROFILES]);
+
+	atomic_set(&table->tbl_num_timer_profiles, 0);
+	ret = p4tc_tmpl_timer_profiles_init(table, num_profiles);
+	if (ret < 0)
+		goto defaultacts_destroy;
+
+	idr_init(&table->tbl_masks_idr);
+	idr_init(&table->tbl_prio_idr);
+	spin_lock_init(&table->tbl_masks_idr_lock);
+
+	pipeline->curr_tables += 1;
+
+	table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
+
+	return table;
+
+defaultacts_destroy:
+	p4tc_table_defact_destroy(dflt.hitact);
+	p4tc_table_defact_destroy(dflt.missact);
+
+idr_rm:
+	idr_remove(&pipeline->p_tbl_idr, table->tbl_id);
+
+free_permissions:
+	kfree(tbl_init_perms);
+
+	p4tc_table_acts_list_destroy(&table->tbl_acts_list);
+
+free:
+	kfree(table);
+
+out:
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
+					    u32 tbl_id,
+					    struct p4tc_pipeline *pipeline,
+					    u32 flags,
+					    struct netlink_ext_ack *extack)
+{
+	u32 tbl_max_masks = 0, tbl_max_entries = 0, tbl_keysz = 0;
+	struct p4tc_table_defact_params dflt = { 0 };
+	struct p4tc_table_perm *perm = NULL;
+	struct list_head *tbl_acts_list;
+	struct p4tc_table *table;
+	u8 tbl_type;
+	int ret = 0;
+
+	table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+					  extack);
+	if (IS_ERR(table))
+		return table;
+
+	if (tb[P4TC_TABLE_NUM_TIMER_PROFILES]) {
+		NL_SET_ERR_MSG(extack, "Num timer profiles is not updatable");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* Check if we are replacing this at the end */
+	if (tb[P4TC_TABLE_ACTS_LIST]) {
+		tbl_acts_list = kzalloc(sizeof(*tbl_acts_list), GFP_KERNEL);
+		if (!tbl_acts_list)
+			return ERR_PTR(-ENOMEM);
+
+		INIT_LIST_HEAD(tbl_acts_list);
+		ret = p4tc_table_acts_list_init(tb[P4TC_TABLE_ACTS_LIST],
+						pipeline, tbl_acts_list,
+						extack);
+		if (ret < 0)
+			goto table_acts_destroy;
+	} else {
+		tbl_acts_list = &table->tbl_acts_list;
+	}
+
+	dflt.nla_hit = tb[P4TC_TABLE_DEFAULT_HIT];
+	dflt.nla_miss = tb[P4TC_TABLE_DEFAULT_MISS];
+
+	ret = p4tc_table_init_default_acts(net, &dflt, table, tbl_acts_list,
+					   extack);
+	if (ret < 0)
+		goto table_acts_destroy;
+
+	tbl_type = table->tbl_type;
+
+	if (tb[P4TC_TABLE_KEYSZ])
+		tbl_keysz = nla_get_u32(tb[P4TC_TABLE_KEYSZ]);
+
+	if (tb[P4TC_TABLE_MAX_ENTRIES])
+		tbl_max_entries = nla_get_u32(tb[P4TC_TABLE_MAX_ENTRIES]);
+
+	if (tb[P4TC_TABLE_MAX_MASKS])
+		tbl_max_masks = nla_get_u32(tb[P4TC_TABLE_MAX_MASKS]);
+
+	if (tb[P4TC_TABLE_PERMISSIONS]) {
+		__u16 nperm = nla_get_u16(tb[P4TC_TABLE_PERMISSIONS]);
+
+		perm = p4tc_table_init_permissions(table, nperm, extack);
+		if (IS_ERR(perm)) {
+			ret = PTR_ERR(perm);
+			goto defaultacts_destroy;
+		}
+	}
+
+	if (tb[P4TC_TABLE_TYPE])
+		tbl_type = nla_get_u8(tb[P4TC_TABLE_TYPE]);
+
+	p4tc_table_replace_default_acts(table, &dflt, false);
+	p4tc_table_replace_permissions(table, perm, false);
+
+	if (tbl_keysz)
+		table->tbl_keysz = tbl_keysz;
+	if (tbl_max_entries)
+		table->tbl_max_entries = tbl_max_entries;
+	if (tbl_max_masks)
+		table->tbl_max_masks = tbl_max_masks;
+	table->tbl_type = tbl_type;
+
+	if (tb[P4TC_TABLE_ACTS_LIST])
+		p4tc_table_acts_list_replace(&table->tbl_acts_list,
+					     tbl_acts_list);
+
+	return table;
+
+defaultacts_destroy:
+	p4tc_table_defact_destroy(dflt.missact);
+	p4tc_table_defact_destroy(dflt.hitact);
+
+table_acts_destroy:
+	if (tb[P4TC_TABLE_ACTS_LIST]) {
+		p4tc_table_acts_list_destroy(tbl_acts_list);
+		kfree(tbl_acts_list);
+	}
+
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_template_common *
+p4tc_table_cu(struct net *net, struct nlmsghdr *n, struct nlattr *nla,
+	      struct p4tc_path_nlattrs *nl_path_attrs,
+	      struct netlink_ext_ack *extack)
+{
+	u32 *ids = nl_path_attrs->ids;
+	u32 pipeid = ids[P4TC_PID_IDX], tbl_id = ids[P4TC_TBLID_IDX];
+	struct nlattr *tb[P4TC_TABLE_MAX + 1];
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+	int ret;
+
+	pipeline = p4tc_pipeline_find_byany_unsealed(net, nl_path_attrs->pname,
+						     pipeid, extack);
+	if (IS_ERR(pipeline))
+		return (void *)pipeline;
+
+	ret = nla_parse_nested(tb, P4TC_TABLE_MAX, nla, p4tc_table_policy,
+			       extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	switch (n->nlmsg_type) {
+	case RTM_CREATEP4TEMPLATE:
+		table = p4tc_table_create(net, tb, tbl_id, pipeline, extack);
+		break;
+	case RTM_UPDATEP4TEMPLATE:
+		table = p4tc_table_update(net, tb, tbl_id, pipeline,
+					  n->nlmsg_flags, extack);
+		break;
+	default:
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	if (IS_ERR(table))
+		goto out;
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!ids[P4TC_TBLID_IDX])
+		ids[P4TC_TBLID_IDX] = table->tbl_id;
+
+out:
+	return (struct p4tc_template_common *)table;
+}
+
+static int p4tc_table_flush(struct net *net, struct sk_buff *skb,
+			    struct p4tc_pipeline *pipeline,
+			    struct netlink_ext_ack *extack)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	unsigned long tmp, tbl_id;
+	struct p4tc_table *table;
+	int ret = 0;
+	int i = 0;
+
+	if (nla_put_u32(skb, P4TC_PATH, 0))
+		goto out_nlmsg_trim;
+
+	if (idr_is_empty(&pipeline->p_tbl_idr)) {
+		NL_SET_ERR_MSG(extack, "There are no tables to flush");
+		goto out_nlmsg_trim;
+	}
+
+	idr_for_each_entry_ul(&pipeline->p_tbl_idr, table, tmp, tbl_id) {
+		if (_p4tc_table_put(net, NULL, pipeline, table, extack) < 0) {
+			ret = -EBUSY;
+			continue;
+		}
+		i++;
+	}
+
+	if (nla_put_u32(skb, P4TC_COUNT, i))
+		goto out_nlmsg_trim;
+
+	if (ret < 0) {
+		if (i == 0) {
+			NL_SET_ERR_MSG(extack, "Unable to flush any table");
+			goto out_nlmsg_trim;
+		} else {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Flushed only %u tables", i);
+		}
+	}
+
+	return i;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static int p4tc_table_gd(struct net *net, struct sk_buff *skb,
+			 struct nlmsghdr *n, struct nlattr *nla,
+			 struct p4tc_path_nlattrs *nl_path_attrs,
+			 struct netlink_ext_ack *extack)
+{
+	u32 *ids = nl_path_attrs->ids;
+	u32 pipeid = ids[P4TC_PID_IDX], tbl_id = ids[P4TC_TBLID_IDX];
+	struct nlattr *tb[P4TC_TABLE_MAX + 1] = {};
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+	int ret = 0;
+
+	if (nla) {
+		ret = nla_parse_nested(tb, P4TC_TABLE_MAX, nla,
+				       p4tc_table_policy, extack);
+
+		if (ret < 0)
+			return ret;
+	}
+
+	if (n->nlmsg_type == RTM_GETP4TEMPLATE) {
+		pipeline = p4tc_pipeline_find_byany(net,
+						    nl_path_attrs->pname,
+						    pipeid,
+						    extack);
+	} else {
+		const char *pname = nl_path_attrs->pname;
+
+		pipeline = p4tc_pipeline_find_byany_unsealed(net, pname,
+							     pipeid, extack);
+	}
+
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE && (n->nlmsg_flags & NLM_F_ROOT))
+		return p4tc_table_flush(net, skb, pipeline, extack);
+
+	table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+					  extack);
+	if (IS_ERR(table))
+		return PTR_ERR(table);
+
+	if (_p4tc_table_fill_nlmsg(skb, table) < 0) {
+		NL_SET_ERR_MSG(extack,
+			       "Failed to fill notification attributes for table");
+		return -EINVAL;
+	}
+
+	if (n->nlmsg_type == RTM_DELP4TEMPLATE) {
+		ret = _p4tc_table_put(net, tb, pipeline, table, extack);
+		if (ret < 0)
+			goto out_nlmsg_trim;
+	}
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static int p4tc_table_dump(struct sk_buff *skb, struct p4tc_dump_ctx *ctx,
+			   struct nlattr *nla, char **p_name, u32 *ids,
+			   struct netlink_ext_ack *extack)
+{
+	struct net *net = sock_net(skb->sk);
+	struct p4tc_pipeline *pipeline;
+
+	if (!ctx->ids[P4TC_PID_IDX]) {
+		pipeline = p4tc_pipeline_find_byany(net, *p_name,
+						    ids[P4TC_PID_IDX], extack);
+		if (IS_ERR(pipeline))
+			return PTR_ERR(pipeline);
+		ctx->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+	} else {
+		pipeline = p4tc_pipeline_find_byid(net, ctx->ids[P4TC_PID_IDX]);
+	}
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!(*p_name))
+		*p_name = pipeline->common.name;
+
+	return p4tc_tmpl_generic_dump(skb, ctx, &pipeline->p_tbl_idr,
+				      P4TC_TBLID_IDX, extack);
+}
+
+static int p4tc_table_dump_1(struct sk_buff *skb,
+			     struct p4tc_template_common *common)
+{
+	struct nlattr *nest = nla_nest_start(skb, P4TC_PARAMS);
+	struct p4tc_table *table = p4tc_to_table(common);
+
+	if (!nest)
+		return -ENOMEM;
+
+	if (nla_put_string(skb, P4TC_TABLE_NAME, table->common.name)) {
+		nla_nest_cancel(skb, nest);
+		return -ENOMEM;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return 0;
+}
+
+static const struct p4tc_template_ops p4tc_table_ops = {
+	.cu = p4tc_table_cu,
+	.fill_nlmsg = p4tc_table_fill_nlmsg,
+	.gd = p4tc_table_gd,
+	.put = p4tc_table_put,
+	.dump = p4tc_table_dump,
+	.dump_1 = p4tc_table_dump_1,
+	.obj_id = P4TC_OBJ_TABLE,
+};
+
+static int __init p4tc_table_init(void)
+{
+	p4tc_tmpl_register_ops(&p4tc_table_ops);
+
+	return 0;
+}
+
+subsys_initcall(p4tc_table_init);
-- 
2.34.1



* [PATCH net-next v12  12/15] p4tc: add runtime table entry create and update
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 11/15] p4tc: add template table create, update, delete, get, flush and dump Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 13/15] p4tc: add runtime table entry get, delete, flush and dump Jamal Hadi Salim
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

Tables are conceptually similar to TCAMs and this implementation could be
labelled as an "algorithmic" TCAM. Tables have a key of a specific size,
maximum number of entries and masks allowed. The basic P4 key types
are supported (exact, LPM, ternary, and ranges) although the kernel side is
oblivious of all that and sees only bit blobs which it masks before a
lookup is performed.
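
The masked-lookup idea can be sketched in a few lines of user-space C. This is illustrative only (names and types are made up, not the P4TC implementation): the kernel sees opaque bit blobs, ANDs the lookup key with each entry's mask, compares the result exactly, and on multiple hits prefers the lowest priority number.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical "algorithmic TCAM" entry; value holds the key bits
 * already masked at insertion time.
 */
struct tern_entry {
	uint64_t value;
	uint64_t mask;
	uint32_t prio;	/* lower number wins on multiple matches */
};

static const struct tern_entry *
tern_lookup(const struct tern_entry *tbl, size_t n, uint64_t key)
{
	const struct tern_entry *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((key & tbl[i].mask) != tbl[i].value)
			continue;
		if (!best || tbl[i].prio < best->prio)
			best = &tbl[i];
	}
	return best;
}
```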

This commit allows users to create and update table _entries_ (table
templates were described in an earlier patch). We'll add get, delete and
dump in the next commit.

Note that table entries can only be created once the pipeline template is
sealed.

For example, a user issuing the following command:

tc p4ctrl create myprog/table/cb/tname  \
  dstAddr 10.10.10.0/24 srcAddr 192.168.0.0/16 prio 16 \
  action send param port port1

indicates we are creating a table entry in table "cb/tname" (a table
residing in control block "cb") on a pipeline named "myprog".

User space tc will create a key which has a value of 0x0a0a0a00c0a80000
(10.10.10.0 concatenated with 192.168.0.0) and a mask value of
0xffffff00ffff0000 (/24 concatenated with /16) that will be sent to the
kernel. In addition, a priority field of 16 is passed to the kernel, as
well as the action definition.
The priority field is needed to disambiguate in case two entries
match. In that case, the kernel will choose the one with lowest priority
number.
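
For concreteness, the key/mask concatenation for the two IPv4 prefixes above (dstAddr 10.10.10.0/24, srcAddr 192.168.0.0/16) can be derived with a couple of helpers. These are hypothetical user-space functions for illustration, not part of tc or P4TC:

```c
#include <stdint.h>
#include <assert.h>

/* Expand an IPv4 prefix length into a 32-bit netmask */
static uint64_t ipv4_plen_to_mask(unsigned int plen)
{
	return plen ? (0xffffffffull << (32 - plen)) & 0xffffffffull : 0;
}

/* Concatenate two 32-bit fields into the 64-bit blob the kernel sees */
static uint64_t concat32(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}
```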

If the user wanted to, for example, update our just-created entry with a
new action, they'd issue the following command:

tc p4ctrl update myprog/table/cb/tname dstAddr 10.10.10.0/24 \
  srcAddr 192.168.0.0/16 prio 16 action send param port port5

In this case, the user needs to specify the pipeline name, the table name,
the key and the priority, so that we can locate the table entry.

__Table Attributes__

Just for context, these are the table attributes (presented in the table
template commit):

- Timer profiles: The timer profiles for the table entries belonging to the
  table. These profiles contain the supported aging values for the table.
  There are currently 4 different built-in profiles: ID0=30s (default),
  ID1=60s, ID2=120s and ID3=180s.
  The user can override the defaults by specifying the number of profiles per
  table. In that case, the kernel will generate the specified number of
  profiles with their aging values abiding by the following rule:
    - Profile ID 0 - 30000ms
    - Profile ID(n) - Profile ID(n - 1) + 30000ms
  So, for example, if the user specified num_timer_profiles as 5, the
  profile IDs and aging values would be the following:
    - Profile ID 0 - 30000ms (default profile)
    - Profile ID 1 - 60000ms
    - Profile ID 2 - 90000ms
    - Profile ID 3 - 120000ms
    - Profile ID 4 - 150000ms
  The values of the different profiles could be changed at runtime.
  The default profile is used if the user specifies an out of range value.
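
The generation rule above reduces to a one-liner. This is a sketch of the stated rule only (the helper name is made up, not a kernel function): profile 0 ages at 30000ms and each subsequent profile adds another 30000ms.

```c
#include <stdint.h>
#include <assert.h>

/* Aging value in ms for a kernel-generated timer profile, per the
 * rule: Profile ID 0 = 30000ms, Profile ID(n) = Profile ID(n-1) + 30000ms.
 */
static uint64_t p4_profile_aging_ms(unsigned int profile_id)
{
	return 30000ull * (profile_id + 1);
}
```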

- Match key type: The match type of the table key (exact, LPM, ternary,
  etc).
  Note that in P4 a key may comprise multiple (sub)key types, e.g. matching
  on srcip using a prefix and an exact match on dstip. In such a case it is
  up to the compiler to generalize the key used (for example, the overall
  key may end up being LPM or ternary).

- Direct Counter: Table counter instances used directly by the table; when
  specified in the P4 program, there will be one counter per entry.

- Direct Meter: Table meter instances used directly by the table; when
  specified in the P4 program, there will be one meter per entry.

- CRUDXPS Permissions both for specific entries and tables. The permissions
  are applicable to both the control plane and the datapath (see "Table
  Permissions" further below).

- Allowed Actions List. This is a list of all the actions that may be used
  by table entries added to the specified table.

- Action profiles. When defined in a P4 program, action profiles provide a
  mechanism to share action instances.

- Action Selectors. When defined in a P4 program, they can be used to
  select, via a hash computation at table lookup time, one or more action
  instances to execute.

- Default hit action. When a default hit action is defined it is used when
  a matched table entry did not define an action. Depending on the P4
  program the default hit action can be updated at runtime (in addition to
  being specified in the template).

- Default miss action. When a default miss action is defined it is used
  when a lookup in that table fails.

- Action scope. In addition to actions being annotated as default hit or
  miss, they can also be annotated to be either specific to a table or
  globally available to multiple tables within the same P4 program.

- Max entries. This is an upper bound for the number of entries a specific
  table allows.

- Num masks. In the case of LPM or ternary matches, this defines the
  maximum allowed masks for that table.

- Timers. When defined in a P4 program, each entry has an associated timer.
  Depending on the programmed timer profile (see above), an entry gets a
  timeout. The timer attribute specifies the behavior on table entry
  expiration.
  The timer is refreshed every time there's a hit. Using this attribute,
  the P4 program can define whether, after an idle period, it wants an
  event generated to user space (and have user space delete the entry), or
  whether it wants the kernel to delete the entry and send an event
  announcing the deletion. The default in P4TC is to both delete and
  generate an event.

- Per-entry "static" vs "dynamic" entries. By default all entries created
  from the control plane are "static" unless otherwise specified. All
  entries added from the datapath are "dynamic" unless otherwise specified.
  "Dynamic" entries are subject to deletion when idle (subject to the rules
  specified in "Timers" above).

__Table Entry Permissions__

Whereas templates define the default permissions for tables, the runtime
has the ability to define table entry permissions as entries are added.

To reiterate there are two types of permissions:
 - Table permissions which are a property of the table (think directory in
   file systems). These are set by the template (see earlier patch on table
   template).
 - Table entry permissions which are specific to a table entry (think a
   file in a directory). This patch describes those permissions.

Furthermore, in both cases the permissions are split into datapath vs
control path. The template definition can set either one. For example, one
could allow the datapath to add table entries in case PNA add-on-miss is
needed.
By default table entries have control plane RUD permissions, meaning the
control plane can Read, Update or Delete entries. By default, as well, the
control plane can create new entries unless specified otherwise by the
template.

Let's see an example which creates the table "cb/tname" at template time:

    tc p4template create table/aP4proggie/cb/tname tblid 1 keysz 64 \
      permissions 0x3C24 ...

The above sets table tname's permissions to 0x3C24, which is equivalent to
CRUD----R--X--, meaning:

The control plane can Create, Read, Update, Delete
The datapath can only Read and Execute table entries.
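
The mapping between the hex permission word and the printed string can be sketched as a small decoder. The bit layout here (bits 13..7 = control-plane CRUDXPS, bits 6..0 = datapath CRUDXPS) is inferred from the printed strings in this commit message, not taken from a kernel header:

```c
#include <assert.h>
#include <string.h>

/* Decode a 14-bit CRUDXPS|CRUDXPS permission word into the 14-char
 * string format shown in the p4template dumps (e.g. "CRUD----R--X--").
 */
static void perm_to_string(unsigned short perm, char out[15])
{
	static const char letters[] = "CRUDXPS";
	int i;

	for (i = 0; i < 14; i++) {
		int bit = 13 - i;

		out[i] = (perm >> bit) & 1 ? letters[i % 7] : '-';
	}
	out[14] = '\0';
}
```

With this layout, 0x3C24 decodes to CRUD----R--X--, matching the dump below.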
If one were to dump this table with:

tc -j p4template get table/aP4proggie/cb/tname | jq .

The output would be the following:

[
  {
    "obj": "table",
    "pname": "aP4proggie",
    "pipeid": 22
  },
  {
    "templates": [
      {
        "tblid": 1,
        "tname": "cb/tname",
        "keysz": 64,
        "max_entries": 256,
        "masks": 8,
        "entries": 0,
        "permissions": "CRUD----R--X--",
        "table_type": "exact",
        "acts_list": []
      }
    ]
  }
]

The expressed permissions above are probably the most practical for most
use cases.

__Constant Tables And P4-programmed Defined Entries__

If one wanted to restrict the table to be an equivalent to a "const" then
the permissions would be set to be: -R------R--X-- by the template.

In such a case, typically the P4 program will have some entries defined
(see the famous P4 calculator example). The "initial entries" specified in
the P4 program will have to be added by the template (as generated by the
compiler), as such:

tc p4template create table/aP4proggie/cb/tname \
  entry srcAddr 10.10.10.10/24 dstAddr 1.1.1.0/24 prio 17

This table cannot be updated at runtime. Any attempt to add an entry to a
table which is read-only at runtime will get a "permission denied" response
back from the kernel.

Note: If one were to create an equivalent of the PNA add-on-miss feature
for this table, then the template would set the table permissions to
-R-----CR--X--. PNA doesn't specify whether the datapath can also delete or
update entries, but if it did then more appropriate permissions would be:
-R-----CRUDX--

__Mix And Match of RW vs Constant Entries__
Let's look at other scenarios; let's say the table has CRUD----R--X--
permissions as defined by the template...
At runtime the user could add entries which are "const" by specifying the
entry's permissions as -R------R--X--, for example:

tc p4ctrl create aP4proggie/table/cb/tname srcAddr 10.10.10.10/24 \
  dstAddr 1.1.1.0/24 prio 17 permissions 0x1024 action drop

or not specify permissions at all as such:

tc p4ctrl create aP4proggie/table/cb/tname srcAddr 10.10.10.10/24 \
  dstAddr 1.1.1.0/24 prio 17 action drop

in which case the table's permissions defined at template time
(CRUD----R--X--) are assumed, meaning the table entry can be deleted or
updated by the control plane.

__Entry Permissions Allowed On Table Entry Creation At Runtime__

When entries are created without specifying permissions they inherit the
table permissions set by the template.
When an entry is added with explicit permissions, it may have at most what
the table's template definition expressed, but can request fewer
permissions. For example, assuming a table with template-specified
permissions of CR-D----R--X--:
An entry created at runtime with permissions of -R------R--X-- is allowed,
but an entry with -RUD----R--X-- will be rejected.
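
The acceptance rule is a bitwise subset check. A minimal sketch, assuming the 14-bit CRUDXPS|CRUDXPS layout used in the examples (the helper name is hypothetical, not the kernel's):

```c
#include <assert.h>

/* An entry's requested permissions are accepted only if every bit it
 * asks for is also present in the table's templated permissions.
 */
static int entry_perm_allowed(unsigned short table_perm,
			      unsigned short entry_perm)
{
	return (entry_perm & ~table_perm) == 0;
}
```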

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h                |  119 +-
 include/uapi/linux/p4tc.h         |   73 +-
 include/uapi/linux/rtnetlink.h    |    9 +
 net/sched/p4tc/Makefile           |    3 +-
 net/sched/p4tc/p4tc_runtime_api.c |   82 ++
 net/sched/p4tc/p4tc_table.c       |   96 +-
 net/sched/p4tc/p4tc_tbl_entry.c   | 1804 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c    |    4 +-
 security/selinux/nlmsgtab.c       |    6 +-
 9 files changed, 2179 insertions(+), 17 deletions(-)
 create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index d11a9efa7..bc32b73ec 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -124,6 +124,11 @@ static inline bool p4tc_pipeline_get(struct p4tc_pipeline *pipeline)
 	return refcount_inc_not_zero(&pipeline->p_ctrl_ref);
 }
 
+static inline void p4tc_pipeline_put_ref(struct p4tc_pipeline *pipeline)
+{
+	refcount_dec(&pipeline->p_ctrl_ref);
+}
+
 void p4tc_pipeline_put(struct p4tc_pipeline *pipeline);
 struct p4tc_pipeline *
 p4tc_pipeline_find_byany_unsealed(struct net *net, const char *p_name,
@@ -195,6 +200,8 @@ static inline int p4tc_action_destroy(struct tc_action *acts[])
 
 #define P4TC_PERMISSIONS_UNINIT (1 << P4TC_PERM_MAX_BIT)
 
+#define P4TC_MAX_PARAM_DATA_SIZE 124
+
 struct p4tc_table_defact {
 	struct tc_action *acts[2];
 	/* Will have two 7 bits blocks containing CRUDXPS (Create, read, update,
@@ -216,11 +223,12 @@ struct p4tc_table {
 	struct p4tc_template_common         common;
 	struct list_head                    tbl_acts_list;
 	struct idr                          tbl_masks_idr;
-	struct idr                          tbl_prio_idr;
+	struct ida                          tbl_prio_idr;
 	struct xarray                       tbl_profiles_xa;
 	struct rhltable                     tbl_entries;
 	/* Mutex that protects tbl_profiles_xa */
 	struct mutex                        tbl_profiles_xa_lock;
+	struct p4tc_table_entry             *tbl_entry;
 	struct p4tc_table_defact __rcu      *tbl_dflt_hitact;
 	struct p4tc_table_defact __rcu      *tbl_dflt_missact;
 	struct p4tc_table_perm __rcu        *tbl_permissions;
@@ -236,6 +244,8 @@ struct p4tc_table {
 	u32                                 tbl_max_masks;
 	u32                                 tbl_curr_num_masks;
 	atomic_t                            tbl_num_timer_profiles;
+	/* Accounts for how many entries this table has */
+	atomic_t                            tbl_nelems;
 	/* Accounts for how many entities refer to this table. Usually just the
 	 * pipeline it belongs to.
 	 */
@@ -309,6 +319,76 @@ struct p4tc_table_timer_profile {
 	u32 profile_id;
 };
 
+extern const struct rhashtable_params entry_hlt_params;
+
+struct p4tc_table_entry;
+struct p4tc_table_entry_work {
+	struct work_struct   work;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table_entry *entry;
+	struct p4tc_table *table;
+	u16 who_deleted;
+	bool send_event;
+};
+
+struct p4tc_table_entry_key {
+	u32 keysz;
+	/* Key start */
+	u32 maskid;
+	unsigned char fa_key[] __aligned(8);
+};
+
+struct p4tc_table_entry_value {
+	struct tc_action                         *acts[2];
+	/* Accounts for how many entities are referencing it
+	 * eg: Data path, one or more control path and timer.
+	 */
+	u32                                      prio;
+	refcount_t                               entries_ref;
+	u32                                      permissions;
+	u32                                      __pad0;
+	struct p4tc_table_entry_tm __rcu         *tm;
+	struct p4tc_table_entry_work             *entry_work;
+	u64                                      aging_ms;
+	struct hrtimer                           entry_timer;
+	bool                                     tmpl_created;
+};
+
+struct p4tc_table_entry_mask {
+	struct rcu_head	 rcu;
+	u32              sz;
+	u32              mask_index;
+	/* Accounts for how many entries are using this mask */
+	refcount_t       mask_ref;
+	u32              mask_id;
+	unsigned char fa_value[] __aligned(8);
+};
+
+struct p4tc_table_entry {
+	struct rcu_head rcu;
+	struct rhlist_head ht_node;
+	struct p4tc_table_entry_key key;
+	/* fallthrough: key data + value */
+};
+
+#define P4TC_KEYSZ_BYTES(bits) (round_up(BITS_TO_BYTES(bits), 8))
+
+static inline void *p4tc_table_entry_value(struct p4tc_table_entry *entry)
+{
+	return entry->key.fa_key + P4TC_KEYSZ_BYTES(entry->key.keysz);
+}
+
+static inline struct p4tc_table_entry_work *
+p4tc_table_entry_work(struct p4tc_table_entry *entry)
+{
+	struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+	return value->entry_work;
+}
+
+extern const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1];
+extern const struct nla_policy p4tc_policy[P4TC_MAX + 1];
+
 static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
 				   struct tc_action *acts[], u32 pipeid,
 				   u32 flags, struct netlink_ext_ack *extack)
@@ -422,6 +502,21 @@ static inline bool p4tc_table_defact_is_noaction(struct tcf_p4act *p4_defact)
 	return !p4_defact->p_id && !p4_defact->act_id;
 }
 
+static inline void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
+{
+	if (defact) {
+		if (defact->acts[0]) {
+			struct tcf_p4act *p4_defact = to_p4act(defact->acts[0]);
+
+			if (p4tc_table_defact_is_noaction(p4_defact))
+				kfree(p4_defact);
+			else
+				p4tc_action_destroy(defact->acts);
+		}
+		kfree(defact);
+	}
+}
+
 static inline void p4tc_table_defacts_acts_copy(struct p4tc_table_defact *dst,
 						struct p4tc_table_defact *src)
 {
@@ -435,10 +530,25 @@ void p4tc_table_replace_default_acts(struct p4tc_table *table,
 
 struct p4tc_table_perm *
 p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
-			    struct netlink_ext_ack *extack);
+			   struct netlink_ext_ack *extack);
 void p4tc_table_replace_permissions(struct p4tc_table *table,
 				    struct p4tc_table_perm *tbl_perm,
 				    bool lock_rtnl);
+void p4tc_table_entry_destroy_hash(void *ptr, void *arg);
+
+struct p4tc_table_entry *
+p4tc_tmpl_table_entry_cu(struct net *net, struct nlattr *arg,
+			 struct p4tc_pipeline *pipeline,
+			 struct p4tc_table *table,
+			 struct netlink_ext_ack *extack);
+int p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
+			struct nlmsghdr *n, int cmd,
+			struct netlink_ext_ack *extack);
+int p4tc_tbl_entry_dumpit(struct net *net, struct sk_buff *skb,
+			  struct netlink_callback *cb,
+			  struct nlattr *arg, char *p_name);
+int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
+			struct p4tc_table_entry *entry, u32 tbl_id);
 
 struct tcf_p4act *
 p4a_runt_prealloc_get_next(struct p4tc_act *act);
@@ -448,6 +558,11 @@ struct p4tc_act_param *
 p4a_runt_parm_init(struct net *net, struct p4tc_act *act,
 		   struct nlattr *nla, struct netlink_ext_ack *extack);
 
+static inline bool p4tc_runtime_msg_is_update(struct nlmsghdr *n)
+{
+	return n->nlmsg_type == RTM_P4TC_UPDATE;
+}
+
 #define to_pipeline(t) ((struct p4tc_pipeline *)t)
 #define p4tc_to_act(t) ((struct p4tc_act *)t)
 #define p4tc_to_table(t) ((struct p4tc_table *)t)
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 92ed964ab..adac8024c 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -80,6 +80,9 @@ enum {
 #define p4tc_ctrl_pub_ok(perm)      ((perm) & P4TC_CTRL_PERM_P)
 #define p4tc_ctrl_sub_ok(perm)      ((perm) & P4TC_CTRL_PERM_S)
 
+#define p4tc_ctrl_perm_rm_create(perm) \
+	(((perm) & ~P4TC_CTRL_PERM_C))
+
 #define p4tc_data_create_ok(perm)   ((perm) & P4TC_DATA_PERM_C)
 #define p4tc_data_read_ok(perm)     ((perm) & P4TC_DATA_PERM_R)
 #define p4tc_data_update_ok(perm)   ((perm) & P4TC_DATA_PERM_U)
@@ -88,6 +91,9 @@ enum {
 #define p4tc_data_pub_ok(perm)      ((perm) & P4TC_DATA_PERM_P)
 #define p4tc_data_sub_ok(perm)      ((perm) & P4TC_DATA_PERM_S)
 
+#define p4tc_data_perm_rm_create(perm) \
+	(((perm) & ~P4TC_DATA_PERM_C))
+
 /* Root attributes */
 enum {
 	P4TC_ROOT_UNSPEC,
@@ -109,6 +115,15 @@ enum {
 
 #define P4TC_OBJ_MAX (__P4TC_OBJ_MAX - 1)
 
+/* P4 runtime Object types */
+enum {
+	P4TC_OBJ_RUNTIME_UNSPEC,
+	P4TC_OBJ_RUNTIME_TABLE,
+	__P4TC_OBJ_RUNTIME_MAX,
+};
+
+#define P4TC_OBJ_RUNTIME_MAX (__P4TC_OBJ_RUNTIME_MAX - 1)
+
 /* P4 attributes */
 enum {
 	P4TC_UNSPEC,
@@ -202,7 +217,7 @@ enum {
 	P4TC_TABLE_TYPE, /* u8 */
 	P4TC_TABLE_DEFAULT_HIT, /* nested default hit action attributes */
 	P4TC_TABLE_DEFAULT_MISS, /* nested default miss action attributes */
-	P4TC_TABLE_CONST_ENTRY, /* nested const table entry */
+	P4TC_TABLE_ENTRY, /* nested table entry */
 	P4TC_TABLE_ACTS_LIST, /* nested table actions list */
 	P4TC_TABLE_NUM_TIMER_PROFILES, /* u32 - number of timer profiles */
 	P4TC_TABLE_TIMER_PROFILES, /* nested timer profiles
@@ -275,6 +290,62 @@ enum {
 
 #define P4TC_ACT_PARAMS_MAX (__P4TC_ACT_PARAMS_MAX - 1)
 
+struct p4tc_table_entry_tm {
+	__u64 created;
+	__u64 lastused;
+	__u64 firstused;
+	__u16 who_created;
+	__u16 who_updated;
+	__u16 who_deleted;
+	__u16 permissions;
+};
+
+enum {
+	P4TC_ENTRY_TBL_ATTRS_UNSPEC,
+	P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT, /* nested default hit attrs */
+	P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS, /* nested default miss attrs */
+	P4TC_ENTRY_TBL_ATTRS_PERMISSIONS, /* u16 table permissions */
+	__P4TC_ENTRY_TBL_ATTRS,
+};
+
+#define P4TC_ENTRY_TBL_ATTRS_MAX (__P4TC_ENTRY_TBL_ATTRS - 1)
+
+/* Table entry attributes */
+enum {
+	P4TC_ENTRY_UNSPEC,
+	P4TC_ENTRY_TBLNAME, /* string - mandatory for create */
+	P4TC_ENTRY_KEY_BLOB, /* Key blob - mandatory for create, update, delete,
+			      * and get
+			      */
+	P4TC_ENTRY_MASK_BLOB, /* Mask blob */
+	P4TC_ENTRY_PRIO, /* u32 - mandatory for delete and get for non-exact
+			  * table
+			  */
+	P4TC_ENTRY_ACT, /* nested actions */
+	P4TC_ENTRY_TM, /* entry data path timestamps */
+	P4TC_ENTRY_WHODUNNIT, /* tells who's modifying the entry */
+	P4TC_ENTRY_CREATE_WHODUNNIT, /* tells who created the entry */
+	P4TC_ENTRY_UPDATE_WHODUNNIT, /* tells who updated the entry last */
+	P4TC_ENTRY_DELETE_WHODUNNIT, /* tells who deleted the entry */
+	P4TC_ENTRY_PERMISSIONS, /* entry CRUDXPS permissions */
+	P4TC_ENTRY_TBL_ATTRS, /* nested table attributes */
+	P4TC_ENTRY_TMPL_CREATED, /* u8 tells whether entry was create by
+				  * template
+				  */
+	P4TC_ENTRY_PAD,
+	__P4TC_ENTRY_MAX
+};
+
+#define P4TC_ENTRY_MAX (__P4TC_ENTRY_MAX - 1)
+
+enum {
+	P4TC_ENTITY_UNSPEC,
+	P4TC_ENTITY_KERNEL,
+	P4TC_ENTITY_TC,
+	P4TC_ENTITY_TIMER,
+	P4TC_ENTITY_MAX
+};
+
 #define P4TC_RTA(r) \
 	((struct rtattr *)(((char *)(r)) + NLMSG_ALIGN(sizeof(struct p4tcmsg))))
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 4f9ebe3e7..76645560b 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -203,6 +203,15 @@ enum {
 	RTM_UPDATEP4TEMPLATE,
 #define RTM_UPDATEP4TEMPLATE	RTM_UPDATEP4TEMPLATE
 
+	RTM_P4TC_CREATE = 128,
+#define RTM_P4TC_CREATE	RTM_P4TC_CREATE
+	RTM_P4TC_DEL,
+#define RTM_P4TC_DEL		RTM_P4TC_DEL
+	RTM_P4TC_GET,
+#define RTM_P4TC_GET		RTM_P4TC_GET
+	RTM_P4TC_UPDATE,
+#define RTM_P4TC_UPDATE	RTM_P4TC_UPDATE
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 7a9c13f86..921909ac4 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
-	p4tc_action.o p4tc_table.o
+	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
+	p4tc_runtime_api.o
diff --git a/net/sched/p4tc/p4tc_runtime_api.c b/net/sched/p4tc/p4tc_runtime_api.c
new file mode 100644
index 000000000..d80103d36
--- /dev/null
+++ b/net/sched/p4tc/p4tc_runtime_api.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_runtime_api.c P4 TC RUNTIME API
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+static int tc_ctl_p4_root(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
+			  struct netlink_ext_ack *extack)
+{
+	struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+
+	switch (t->obj) {
+	case P4TC_OBJ_RUNTIME_TABLE: {
+		struct net *net = sock_net(skb->sk);
+		int ret;
+
+		net = maybe_get_net(net);
+		if (!net) {
+			NL_SET_ERR_MSG(extack, "Net namespace is going down");
+			return -EBUSY;
+		}
+
+		ret = p4tc_tbl_entry_root(net, skb, n, cmd, extack);
+
+		put_net(net);
+
+		return ret;
+	}
+	default:
+		NL_SET_ERR_MSG(extack, "Unknown P4 runtime object type");
+		return -EOPNOTSUPP;
+	}
+}
+
+static int tc_ctl_p4_cu(struct sk_buff *skb, struct nlmsghdr *n,
+			struct netlink_ext_ack *extack)
+{
+	int ret;
+
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = tc_ctl_p4_root(skb, n, n->nlmsg_type, extack);
+
+	return ret;
+}
+
+static int __init p4tc_tbl_init(void)
+{
+	rtnl_register(PF_UNSPEC, RTM_P4TC_CREATE, tc_ctl_p4_cu, NULL,
+		      RTNL_FLAG_DOIT_UNLOCKED);
+	rtnl_register(PF_UNSPEC, RTM_P4TC_UPDATE, tc_ctl_p4_cu, NULL,
+		      RTNL_FLAG_DOIT_UNLOCKED);
+
+	return 0;
+}
+
+subsys_initcall(p4tc_tbl_init);
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index 6ac1af3f1..4bfff14bd 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -143,6 +143,7 @@ static const struct nla_policy p4tc_table_policy[P4TC_TABLE_MAX + 1] = {
 	[P4TC_TABLE_ACTS_LIST] = { .type = NLA_NESTED },
 	[P4TC_TABLE_NUM_TIMER_PROFILES] =
 		NLA_POLICY_RANGE(NLA_U32, 1, P4TC_MAX_NUM_TIMER_PROFILES),
+	[P4TC_TABLE_ENTRY] = { .type = NLA_NESTED },
 };
 
 static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
@@ -284,6 +285,18 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
 	}
 	nla_nest_end(skb, nested_tbl_acts);
 
+	if (table->tbl_entry) {
+		struct nlattr *entry_nest;
+
+		entry_nest = nla_nest_start(skb, P4TC_TABLE_ENTRY);
+		if (p4tc_tbl_entry_fill(skb, table, table->tbl_entry,
+					table->tbl_id) < 0)
+			goto out_nlmsg_trim;
+
+		nla_nest_end(skb, entry_nest);
+	}
+	table->tbl_entry = NULL;
+
 	if (nla_put_u32(skb, P4TC_TABLE_KEYSZ, table->tbl_keysz))
 		goto out_nlmsg_trim;
 
@@ -293,6 +306,10 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
 	if (nla_put_u32(skb, P4TC_TABLE_MAX_MASKS, table->tbl_max_masks))
 		goto out_nlmsg_trim;
 
+	if (nla_put_u32(skb, P4TC_TABLE_NUM_ENTRIES,
+			atomic_read(&table->tbl_nelems)))
+		goto out_nlmsg_trim;
+
 	tbl_perm = rcu_dereference_rtnl(table->tbl_permissions);
 	if (nla_put_u16(skb, P4TC_TABLE_PERMISSIONS, tbl_perm->permissions))
 		goto out_nlmsg_trim;
@@ -321,14 +338,6 @@ static int p4tc_table_fill_nlmsg(struct net *net, struct sk_buff *skb,
 	return 0;
 }
 
-static void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
-{
-	if (defact) {
-		p4tc_action_destroy(defact->acts);
-		kfree(defact);
-	}
-}
-
 static void
 p4tc_table_timer_profile_destroy(struct p4tc_table *table,
 				 struct p4tc_table_timer_profile *table_profile)
@@ -523,8 +532,11 @@ static int _p4tc_table_put(struct net *net, struct nlattr **tb,
 	p4tc_table_acts_list_destroy(&table->tbl_acts_list);
 	p4tc_table_timer_profiles_destroy(table);
 
+	rhltable_free_and_destroy(&table->tbl_entries,
+				  p4tc_table_entry_destroy_hash, table);
+
 	idr_destroy(&table->tbl_masks_idr);
-	idr_destroy(&table->tbl_prio_idr);
+	ida_destroy(&table->tbl_prio_idr);
 
 	perm = rcu_replace_pointer_rtnl(table->tbl_permissions, NULL);
 	kfree_rcu(perm, rcu);
@@ -1072,11 +1084,50 @@ p4tc_table_find_byanyattr(struct p4tc_pipeline *pipeline,
 
 static const struct p4tc_template_ops p4tc_table_ops;
 
+static bool p4tc_table_entry_create_only(struct nlattr **tb)
+{
+	int i;
+
+	/* Excluding table name on purpose */
+	for (i = P4TC_TABLE_KEYSZ; i < P4TC_TABLE_MAX; i++)
+		if (tb[i] && i != P4TC_TABLE_ENTRY)
+			return false;
+
+	return true;
+}
+
+static struct p4tc_table *
+p4tc_table_entry_create(struct net *net, struct nlattr **tb,
+			u32 tbl_id, struct p4tc_pipeline *pipeline,
+			struct netlink_ext_ack *extack)
+{
+	struct p4tc_table *table;
+
+	table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
+					  extack);
+	if (IS_ERR(table))
+		return table;
+
+	if (tb[P4TC_TABLE_ENTRY]) {
+		struct p4tc_table_entry *entry;
+
+		entry = p4tc_tmpl_table_entry_cu(net, tb[P4TC_TABLE_ENTRY],
+						 pipeline, table, extack);
+		if (IS_ERR(entry))
+			return (struct p4tc_table *)entry;
+
+		table->tbl_entry = entry;
+	}
+
+	return table;
+}
+
 static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
 					    u32 tbl_id,
 					    struct p4tc_pipeline *pipeline,
 					    struct netlink_ext_ack *extack)
 {
+	struct rhashtable_params table_hlt_params = entry_hlt_params;
 	u32 num_profiles = P4TC_DEFAULT_NUM_TIMER_PROFILES;
 	struct p4tc_table_perm *tbl_init_perms = NULL;
 	struct p4tc_table_defact_params dflt = { 0 };
@@ -1084,6 +1135,10 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
 	char *tblname;
 	int ret;
 
+	if (p4tc_table_entry_create_only(tb))
+		return p4tc_table_entry_create(net, tb, tbl_id, pipeline,
+					       extack);
+
 	if (pipeline->curr_tables == pipeline->num_tables) {
 		NL_SET_ERR_MSG(extack,
 			       "Table range exceeded max allowed value");
@@ -1243,15 +1298,30 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
 		goto defaultacts_destroy;
 
 	idr_init(&table->tbl_masks_idr);
-	idr_init(&table->tbl_prio_idr);
+	ida_init(&table->tbl_prio_idr);
 	spin_lock_init(&table->tbl_masks_idr_lock);
 
+	table_hlt_params.max_size = table->tbl_max_entries;
+	if (table->tbl_max_entries > U16_MAX)
+		table_hlt_params.nelem_hint = U16_MAX / 4 * 3;
+	else
+		table_hlt_params.nelem_hint = table->tbl_max_entries / 4 * 3;
+
+	if (rhltable_init(&table->tbl_entries, &table_hlt_params) < 0) {
+		ret = -EINVAL;
+		goto profiles_destroy;
+	}
+
 	pipeline->curr_tables += 1;
 
 	table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
+	atomic_set(&table->tbl_nelems, 0);
 
 	return table;
 
+profiles_destroy:
+	p4tc_table_timer_profiles_destroy(table);
+
 defaultacts_destroy:
 	p4tc_table_defact_destroy(dflt.hitact);
 	p4tc_table_defact_destroy(dflt.missact);
@@ -1285,6 +1355,12 @@ static struct p4tc_table *p4tc_table_update(struct net *net, struct nlattr **tb,
 	u8 tbl_type;
 	int ret = 0;
 
+	if (tb[P4TC_TABLE_ENTRY]) {
+		NL_SET_ERR_MSG(extack,
+			       "Entry update not supported from template");
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
 	table = p4tc_table_find_byanyattr(pipeline, tb[P4TC_TABLE_NAME], tbl_id,
 					  extack);
 	if (IS_ERR(table))
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
new file mode 100644
index 000000000..a2f9ab959
--- /dev/null
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -0,0 +1,1804 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_tbl_entry.c P4 TC TABLE ENTRY
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/bitmap.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+#include <net/flow_offload.h>
+
+#define SIZEOF_MASKID (sizeof(((struct p4tc_table_entry_key *)0)->maskid))
+
+#define STARTOF_KEY(key) (&((key)->maskid))
+
+/* In this code we avoid locks for create/updating/deleting table entries by
+ * using a refcount (entries_ref). We also use RCU to avoid locks for reading.
+ * Everytime we try to get the entry, we increment and check the refcount to see
+ * whether a delete is happening in parallel.
+ */
+
+static bool p4tc_tbl_entry_put(struct p4tc_table_entry_value *value)
+{
+	return refcount_dec_if_one(&value->entries_ref);
+}
+
+static u32 p4tc_entry_hash_fn(const void *data, u32 len, u32 seed)
+{
+	const struct p4tc_table_entry_key *key = data;
+	u32 keysz;
+
+	/* The key memory area is always zero allocated aligned to 8 */
+	keysz = round_up(SIZEOF_MASKID + BITS_TO_BYTES(key->keysz), 4);
+
+	return jhash2(STARTOF_KEY(key), keysz / sizeof(u32), seed);
+}
+
+static int p4tc_entry_hash_cmp(struct rhashtable_compare_arg *arg,
+			       const void *ptr)
+{
+	const struct p4tc_table_entry_key *key = arg->key;
+	const struct p4tc_table_entry *entry = ptr;
+	u32 keysz;
+
+	keysz = SIZEOF_MASKID + BITS_TO_BYTES(entry->key.keysz);
+
+	return memcmp(STARTOF_KEY(&entry->key), STARTOF_KEY(key), keysz);
+}
+
+static u32 p4tc_entry_obj_hash_fn(const void *data, u32 len, u32 seed)
+{
+	const struct p4tc_table_entry *entry = data;
+
+	return p4tc_entry_hash_fn(&entry->key, len, seed);
+}
+
+const struct rhashtable_params entry_hlt_params = {
+	.obj_cmpfn = p4tc_entry_hash_cmp,
+	.obj_hashfn = p4tc_entry_obj_hash_fn,
+	.hashfn = p4tc_entry_hash_fn,
+	.head_offset = offsetof(struct p4tc_table_entry, ht_node),
+	.key_offset = offsetof(struct p4tc_table_entry, key),
+	.automatic_shrinking = true,
+};
+
+static struct rhlist_head *
+p4tc_entry_lookup_bucket(struct p4tc_table *table,
+			 struct p4tc_table_entry_key *key)
+{
+	return rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+}
+
+static struct p4tc_table_entry *
+__p4tc_entry_lookup_fast(struct p4tc_table *table,
+			 struct p4tc_table_entry_key *key)
+	__must_hold(RCU)
+{
+	struct p4tc_table_entry *entry_curr;
+	struct rhlist_head *bucket_list;
+
+	bucket_list =
+		p4tc_entry_lookup_bucket(table, key);
+	if (!bucket_list)
+		return NULL;
+
+	rht_entry(entry_curr, bucket_list, ht_node);
+
+	return entry_curr;
+}
+
+static struct p4tc_table_entry *
+p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key,
+		  u32 prio) __must_hold(RCU)
+{
+	struct rhlist_head *tmp, *bucket_list;
+	struct p4tc_table_entry *entry;
+
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+		return __p4tc_entry_lookup_fast(table, key);
+
+	bucket_list =
+		p4tc_entry_lookup_bucket(table, key);
+	if (!bucket_list)
+		return NULL;
+
+	rhl_for_each_entry_rcu(entry, tmp, bucket_list, ht_node) {
+		struct p4tc_table_entry_value *value =
+			p4tc_table_entry_value(entry);
+
+		if (value->prio == prio)
+			return entry;
+	}
+
+	return NULL;
+}
+
+#define p4tc_table_entry_mask_find_byid(table, id) \
+	(idr_find(&(table)->tbl_masks_idr, id))
+
+static void gen_exact_mask(u8 *mask, u32 mask_size)
+{
+	memset(mask, 0xFF, mask_size);
+}
+
+static int p4tca_table_get_entry_keys(struct sk_buff *skb,
+				      struct p4tc_table *table,
+				      struct p4tc_table_entry *entry)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_table_entry_mask *mask;
+	int ret = -ENOMEM;
+	u32 key_sz_bytes;
+
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+		u8 mask_value[BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+
+		key_sz_bytes = BITS_TO_BYTES(entry->key.keysz);
+		if (nla_put(skb, P4TC_ENTRY_KEY_BLOB, key_sz_bytes,
+			    entry->key.fa_key))
+			goto out_nlmsg_trim;
+
+		gen_exact_mask(mask_value, key_sz_bytes);
+		if (nla_put(skb, P4TC_ENTRY_MASK_BLOB, key_sz_bytes,
+			    mask_value))
+			goto out_nlmsg_trim;
+	} else {
+		key_sz_bytes = BITS_TO_BYTES(entry->key.keysz);
+		if (nla_put(skb, P4TC_ENTRY_KEY_BLOB, key_sz_bytes,
+			    entry->key.fa_key))
+			goto out_nlmsg_trim;
+
+		mask = p4tc_table_entry_mask_find_byid(table,
+						       entry->key.maskid);
+		if (nla_put(skb, P4TC_ENTRY_MASK_BLOB, key_sz_bytes,
+			    mask->fa_value))
+			goto out_nlmsg_trim;
+	}
+
+	return 0;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static void p4tc_table_entry_tm_dump(struct p4tc_table_entry_tm *dtm,
+				     struct p4tc_table_entry_tm *stm)
+{
+	unsigned long now = jiffies;
+	u64 last_used;
+
+	dtm->created = stm->created ?
+		jiffies_to_clock_t(now - stm->created) : 0;
+
+	last_used = READ_ONCE(stm->lastused);
+	dtm->lastused = stm->lastused ?
+		jiffies_to_clock_t(now - last_used) : 0;
+	dtm->firstused = stm->firstused ?
+		jiffies_to_clock_t(now - stm->firstused) : 0;
+}
+
+#define P4TC_ENTRY_MAX_IDS (P4TC_PATH_MAX - 1)
+
+int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
+			struct p4tc_table_entry *entry, u32 tbl_id)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry_tm dtm, *tm;
+	struct nlattr *nest, *nest_acts;
+	u32 ids[P4TC_ENTRY_MAX_IDS];
+	int ret = -ENOMEM;
+
+	ids[P4TC_TBLID_IDX - 1] = tbl_id;
+
+	if (nla_put(skb, P4TC_PATH, P4TC_ENTRY_MAX_IDS * sizeof(u32), ids))
+		goto out_nlmsg_trim;
+
+	nest = nla_nest_start(skb, P4TC_PARAMS);
+	if (!nest)
+		goto out_nlmsg_trim;
+
+	value = p4tc_table_entry_value(entry);
+
+	if (nla_put_u32(skb, P4TC_ENTRY_PRIO, value->prio))
+		goto out_nlmsg_trim;
+
+	if (p4tca_table_get_entry_keys(skb, table, entry) < 0)
+		goto out_nlmsg_trim;
+
+	if (value->acts[0]) {
+		nest_acts = nla_nest_start(skb, P4TC_ENTRY_ACT);
+		if (tcf_action_dump(skb, value->acts, 0, 0, false) < 0)
+			goto out_nlmsg_trim;
+		nla_nest_end(skb, nest_acts);
+	}
+
+	if (nla_put_u16(skb, P4TC_ENTRY_PERMISSIONS, value->permissions))
+		goto out_nlmsg_trim;
+
+	tm = rcu_dereference_protected(value->tm, 1);
+
+	if (nla_put_u8(skb, P4TC_ENTRY_CREATE_WHODUNNIT, tm->who_created))
+		goto out_nlmsg_trim;
+
+	if (tm->who_updated) {
+		if (nla_put_u8(skb, P4TC_ENTRY_UPDATE_WHODUNNIT,
+			       tm->who_updated))
+			goto out_nlmsg_trim;
+	}
+
+	p4tc_table_entry_tm_dump(&dtm, tm);
+	if (nla_put_64bit(skb, P4TC_ENTRY_TM, sizeof(dtm), &dtm,
+			  P4TC_ENTRY_PAD))
+		goto out_nlmsg_trim;
+
+	if (value->tmpl_created) {
+		if (nla_put_u8(skb, P4TC_ENTRY_TMPL_CREATED, 1))
+			goto out_nlmsg_trim;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return skb->len;
+
+out_nlmsg_trim:
+	nlmsg_trim(skb, b);
+	return ret;
+}
+
+static const struct nla_policy p4tc_entry_policy[P4TC_ENTRY_MAX + 1] = {
+	[P4TC_ENTRY_TBLNAME] = { .type = NLA_STRING },
+	[P4TC_ENTRY_KEY_BLOB] = { .type = NLA_BINARY },
+	[P4TC_ENTRY_MASK_BLOB] = { .type = NLA_BINARY },
+	[P4TC_ENTRY_PRIO] = { .type = NLA_U32 },
+	[P4TC_ENTRY_ACT] = { .type = NLA_NESTED },
+	[P4TC_ENTRY_TM] =
+		NLA_POLICY_EXACT_LEN(sizeof(struct p4tc_table_entry_tm)),
+	[P4TC_ENTRY_WHODUNNIT] = { .type = NLA_U8 },
+	[P4TC_ENTRY_CREATE_WHODUNNIT] = { .type = NLA_U8 },
+	[P4TC_ENTRY_UPDATE_WHODUNNIT] = { .type = NLA_U8 },
+	[P4TC_ENTRY_DELETE_WHODUNNIT] = { .type = NLA_U8 },
+	[P4TC_ENTRY_PERMISSIONS] = NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+	[P4TC_ENTRY_TBL_ATTRS] = { .type = NLA_NESTED },
+};
+
+static struct p4tc_table_entry_mask *
+p4tc_table_entry_mask_find_byvalue(struct p4tc_table *table,
+				   struct p4tc_table_entry_mask *mask)
+{
+	struct p4tc_table_entry_mask *mask_cur;
+	unsigned long mask_id, tmp;
+
+	idr_for_each_entry_ul(&table->tbl_masks_idr, mask_cur, tmp, mask_id) {
+		if (mask_cur->sz == mask->sz) {
+			u32 mask_sz_bytes = BITS_TO_BYTES(mask->sz);
+			void *curr_mask_value = mask_cur->fa_value;
+			void *mask_value = mask->fa_value;
+
+			if (memcmp(curr_mask_value, mask_value,
+				   mask_sz_bytes) == 0)
+				return mask_cur;
+		}
+	}
+
+	return NULL;
+}
+
+static void __p4tc_table_entry_mask_del(struct p4tc_table *table,
+					struct p4tc_table_entry_mask *mask)
+{
+	if (table->tbl_type == P4TC_TABLE_TYPE_TERNARY) {
+		struct p4tc_table_entry_mask __rcu **masks_array;
+		unsigned long *free_masks_bitmap;
+
+		masks_array = table->tbl_masks_array;
+		rcu_assign_pointer(masks_array[mask->mask_index], NULL);
+
+		free_masks_bitmap =
+			rcu_dereference_protected(table->tbl_free_masks_bitmap,
+						  1);
+		bitmap_set(free_masks_bitmap, mask->mask_index, 1);
+	} else if (table->tbl_type == P4TC_TABLE_TYPE_LPM) {
+		struct p4tc_table_entry_mask __rcu **masks_array;
+		int i;
+
+		masks_array = table->tbl_masks_array;
+
+		for (i = mask->mask_index; i < table->tbl_curr_num_masks - 1;
+		     i++) {
+			struct p4tc_table_entry_mask *mask_tmp;
+
+			mask_tmp = rcu_dereference_protected(masks_array[i + 1],
+							     1);
+			rcu_assign_pointer(masks_array[i], mask_tmp);
+		}
+
+		rcu_assign_pointer(masks_array[table->tbl_curr_num_masks - 1],
+				   NULL);
+	}
+
+	table->tbl_curr_num_masks--;
+}
+
+static void p4tc_table_entry_mask_del(struct p4tc_table *table,
+				      struct p4tc_table_entry *entry)
+{
+	struct p4tc_table_entry_mask *mask_found;
+	const u32 mask_id = entry->key.maskid;
+
+	/* Will always be found */
+	mask_found = p4tc_table_entry_mask_find_byid(table, mask_id);
+
+	/* Last reference, can delete */
+	if (refcount_dec_if_one(&mask_found->mask_ref)) {
+		spin_lock_bh(&table->tbl_masks_idr_lock);
+		idr_remove(&table->tbl_masks_idr, mask_found->mask_id);
+		__p4tc_table_entry_mask_del(table, mask_found);
+		spin_unlock_bh(&table->tbl_masks_idr_lock);
+		kfree_rcu(mask_found, rcu);
+	} else {
+		if (!refcount_dec_not_one(&mask_found->mask_ref))
+			pr_warn("Mask was deleted in parallel\n");
+	}
+}
+
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+static u32 p4tc_fls(u8 *ptr, size_t len)
+{
+	int i;
+
+	for (i = len - 1; i >= 0; i--) {
+		int pos = fls(ptr[i]);
+
+		if (pos)
+			return (i * 8) + pos;
+	}
+
+	return 0;
+}
+#else
+static u32 p4tc_ffs(u8 *ptr, size_t len)
+{
+	int i;
+
+	for (i = 0; i < len; i++) {
+		int pos = ffs(ptr[i]);
+
+		if (pos)
+			return (i * 8) + pos;
+	}
+
+	return 0;
+}
+#endif
+
+static u32 find_lpm_mask(struct p4tc_table *table, u8 *ptr)
+{
+	u32 ret;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	ret = p4tc_fls(ptr, BITS_TO_BYTES(table->tbl_keysz));
+#else
+	ret = p4tc_ffs(ptr, BITS_TO_BYTES(table->tbl_keysz));
+#endif
+	return ret ?: table->tbl_keysz;
+}
+
+static int p4tc_table_lpm_mask_insert(struct p4tc_table *table,
+				      struct p4tc_table_entry_mask *mask)
+{
+	struct p4tc_table_entry_mask __rcu **masks_array =
+		table->tbl_masks_array;
+	const u32 nmasks = table->tbl_curr_num_masks ?: 1;
+	int pos;
+
+	for (pos = 0; pos < nmasks; pos++) {
+		u32 mask_value = find_lpm_mask(table, mask->fa_value);
+
+		if (table->tbl_masks_array[pos]) {
+			struct p4tc_table_entry_mask *mask_pos;
+			u32 array_mask_value;
+
+			mask_pos = rcu_dereference_protected(masks_array[pos],
+							     1);
+			array_mask_value =
+				find_lpm_mask(table, mask_pos->fa_value);
+
+			if (mask_value > array_mask_value) {
+				/* shift masks to the right (will keep
+				 * invariant)
+				 */
+				u32 tail = nmasks;
+
+				while (tail > pos + 1) {
+					rcu_assign_pointer(masks_array[tail],
+							   masks_array[tail - 1]);
+					table->tbl_masks_array[tail] =
+						table->tbl_masks_array[tail - 1];
+					tail--;
+				}
+				rcu_assign_pointer(masks_array[pos + 1],
+						   masks_array[pos]);
+				/* assign to pos */
+				break;
+			}
+		} else {
+			/* pos is empty, assign to pos */
+			break;
+		}
+	}
+
+	mask->mask_index = pos;
+	rcu_assign_pointer(masks_array[pos], mask);
+	table->tbl_curr_num_masks++;
+
+	return 0;
+}
+
+static int
+p4tc_table_ternary_mask_insert(struct p4tc_table *table,
+			       struct p4tc_table_entry_mask *mask)
+{
+	unsigned long *free_masks_bitmap =
+		rcu_dereference_protected(table->tbl_free_masks_bitmap, 1);
+	unsigned long pos =
+		find_first_bit(free_masks_bitmap, P4TC_MAX_TMASKS);
+	struct p4tc_table_entry_mask __rcu **masks_array =
+		table->tbl_masks_array;
+
+	if (pos == P4TC_MAX_TMASKS)
+		return -ENOSPC;
+
+	mask->mask_index = pos;
+	rcu_assign_pointer(masks_array[pos], mask);
+	bitmap_clear(free_masks_bitmap, pos, 1);
+	table->tbl_curr_num_masks++;
+
+	return 0;
+}
+
+static int p4tc_table_add_mask_array(struct p4tc_table *table,
+				     struct p4tc_table_entry_mask *mask)
+{
+	if (table->tbl_max_masks < table->tbl_curr_num_masks + 1)
+		return -ENOSPC;
+
+	switch (table->tbl_type) {
+	case P4TC_TABLE_TYPE_TERNARY:
+		return p4tc_table_ternary_mask_insert(table, mask);
+	case P4TC_TABLE_TYPE_LPM:
+		return p4tc_table_lpm_mask_insert(table, mask);
+	default:
+		return -ENOSPC;
+	}
+}
+
+static struct p4tc_table_entry_mask *
+p4tc_table_entry_mask_add(struct p4tc_table *table,
+			  struct p4tc_table_entry *entry,
+			  struct p4tc_table_entry_mask *mask)
+{
+	struct p4tc_table_entry_mask *found;
+	int ret;
+
+	found = p4tc_table_entry_mask_find_byvalue(table, mask);
+	/* Only add mask if it was not already added */
+	if (!found) {
+		struct p4tc_table_entry_mask *nmask;
+		size_t masksz_bytes = BITS_TO_BYTES(mask->sz);
+
+		nmask = kzalloc(struct_size(nmask, fa_value, masksz_bytes),
+				GFP_ATOMIC);
+		if (unlikely(!nmask))
+			return ERR_PTR(-ENOMEM);
+
+		memcpy(nmask->fa_value, mask->fa_value, masksz_bytes);
+
+		nmask->mask_id = 1;
+		nmask->sz = mask->sz;
+		refcount_set(&nmask->mask_ref, 1);
+
+		spin_lock_bh(&table->tbl_masks_idr_lock);
+		ret = idr_alloc_u32(&table->tbl_masks_idr, nmask,
+				    &nmask->mask_id, UINT_MAX, GFP_ATOMIC);
+		if (ret < 0)
+			goto unlock;
+
+		ret = p4tc_table_add_mask_array(table, nmask);
+unlock:
+		spin_unlock_bh(&table->tbl_masks_idr_lock);
+		if (ret < 0) {
+			kfree(nmask);
+			return ERR_PTR(ret);
+		}
+		entry->key.maskid = nmask->mask_id;
+		found = nmask;
+	} else {
+		if (!refcount_inc_not_zero(&found->mask_ref))
+			return ERR_PTR(-EBUSY);
+		entry->key.maskid = found->mask_id;
+	}
+
+	return found;
+}
+
+static int p4tc_tbl_entry_emit_event(struct p4tc_table_entry_work *entry_work,
+				     int cmd, gfp_t alloc_flags)
+{
+	struct p4tc_pipeline *pipeline = entry_work->pipeline;
+	struct p4tc_table_entry *entry = entry_work->entry;
+	struct p4tc_table *table = entry_work->table;
+	struct net *net = pipeline->net;
+	struct sock *rtnl = net->rtnl;
+	struct nlmsghdr *nlh;
+	struct nlattr *nest;
+	struct sk_buff *skb;
+	struct nlattr *root;
+	struct p4tcmsg *t;
+	int err = -ENOMEM;
+
+	if (!rtnl_has_listeners(net, RTNLGRP_TC))
+		return 0;
+
+	skb = alloc_skb(NLMSG_GOODSIZE, alloc_flags);
+	if (!skb)
+		return err;
+
+	nlh = nlmsg_put(skb, 1, 1, cmd, sizeof(*t), NLM_F_REQUEST);
+	if (!nlh)
+		goto free_skb;
+
+	t = nlmsg_data(nlh);
+
+	t->pipeid = pipeline->common.p_id;
+	t->obj = P4TC_OBJ_RUNTIME_TABLE;
+
+	if (nla_put_string(skb, P4TC_ROOT_PNAME, pipeline->common.name))
+		goto free_skb;
+
+	root = nla_nest_start(skb, P4TC_ROOT);
+	if (!root)
+		goto free_skb;
+
+	nest = nla_nest_start(skb, 1);
+	if (!nest)
+		goto free_skb;
+	if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id) < 0)
+		goto free_skb;
+	nla_nest_end(skb, nest);
+
+	nla_nest_end(skb, root);
+
+	nlmsg_end(skb, nlh);
+
+	return nlmsg_notify(rtnl, skb, 0, RTNLGRP_TC, 0, alloc_flags);
+
+free_skb:
+	kfree_skb(skb);
+	return err;
+}
+
+static void __p4tc_table_entry_put(struct p4tc_table_entry *entry)
+{
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry_tm *tm;
+
+	value = p4tc_table_entry_value(entry);
+
+	if (value->acts[0])
+		p4tc_action_destroy(value->acts);
+
+	kfree(value->entry_work);
+	tm = rcu_dereference_protected(value->tm, 1);
+	kfree(tm);
+
+	kfree(entry);
+}
+
+static void p4tc_table_entry_del_work(struct work_struct *work)
+{
+	struct p4tc_table_entry_work *entry_work =
+		container_of(work, typeof(*entry_work), work);
+	struct p4tc_pipeline *pipeline = entry_work->pipeline;
+	struct p4tc_table_entry *entry = entry_work->entry;
+	struct p4tc_table_entry_value *value;
+
+	value = p4tc_table_entry_value(entry);
+
+	if (entry_work->send_event && p4tc_ctrl_pub_ok(value->permissions))
+		p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_DEL, GFP_KERNEL);
+
+	put_net(pipeline->net);
+	p4tc_pipeline_put_ref(pipeline);
+
+	__p4tc_table_entry_put(entry);
+}
+
+static void p4tc_table_entry_put(struct p4tc_table_entry *entry, bool deferred)
+{
+	struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+	if (deferred) {
+		struct p4tc_table_entry_work *entry_work = value->entry_work;
+		struct p4tc_pipeline *pipeline = entry_work->pipeline;
+
+		/* tc actions must be freed in a sleepable context */
+
+		/* Avoid pipeline del before deferral ends */
+		p4tc_pipeline_get(pipeline);
+		get_net(pipeline->net); /* avoid action cleanup */
+		schedule_work(&entry_work->work);
+	} else {
+		__p4tc_table_entry_put(entry);
+	}
+}
+
+static void p4tc_table_entry_put_rcu(struct rcu_head *rcu)
+{
+	struct p4tc_table_entry *entry =
+		container_of(rcu, struct p4tc_table_entry, rcu);
+	struct p4tc_table_entry_work *entry_work =
+		p4tc_table_entry_work(entry);
+	struct p4tc_pipeline *pipeline = entry_work->pipeline;
+
+	p4tc_table_entry_put(entry, true);
+
+	p4tc_pipeline_put_ref(pipeline);
+	put_net(pipeline->net);
+}
+
+static void __p4tc_table_entry_destroy(struct p4tc_table *table,
+				       struct p4tc_table_entry *entry,
+				       bool remove_from_hash, bool send_event,
+				       u16 who_deleted)
+{
+	/* !remove_from_hash and deferred deletion are incompatible,
+	 * as entries that defer their deletion past an RCU grace period
+	 * __must__ first be removed from the hash
+	 */
+	if (remove_from_hash)
+		rhltable_remove(&table->tbl_entries, &entry->ht_node,
+				entry_hlt_params);
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		p4tc_table_entry_mask_del(table, entry);
+
+	if (remove_from_hash) {
+		struct p4tc_table_entry_work *entry_work =
+			p4tc_table_entry_work(entry);
+
+		entry_work->send_event = send_event;
+		entry_work->who_deleted = who_deleted;
+
+		/* get pipeline/net for async task */
+		get_net(entry_work->pipeline->net);
+		p4tc_pipeline_get(entry_work->pipeline);
+
+		call_rcu(&entry->rcu, p4tc_table_entry_put_rcu);
+	} else {
+		p4tc_table_entry_put(entry, false);
+	}
+}
+
+#define P4TC_TABLE_EXACT_PRIO 64000
+
+static int p4tc_table_entry_exact_prio(void)
+{
+	return P4TC_TABLE_EXACT_PRIO;
+}
+
+static int p4tc_table_entry_alloc_new_prio(struct p4tc_table *table)
+{
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+		return p4tc_table_entry_exact_prio();
+
+	return ida_alloc_min(&table->tbl_prio_idr, 1, GFP_ATOMIC);
+}
+
+static void p4tc_table_entry_free_prio(struct p4tc_table *table, u32 prio)
+{
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		ida_free(&table->tbl_prio_idr, prio);
+}
+
+static int p4tc_table_entry_destroy(struct p4tc_table *table,
+				    struct p4tc_table_entry *entry,
+				    bool remove_from_hash,
+				    bool send_event, u16 who_deleted)
+{
+	struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+	/* Entry was deleted in parallel */
+	if (!p4tc_tbl_entry_put(value))
+		return -EBUSY;
+
+	p4tc_table_entry_free_prio(table, value->prio);
+
+	__p4tc_table_entry_destroy(table, entry, remove_from_hash, send_event,
+				   who_deleted);
+
+	atomic_dec(&table->tbl_nelems);
+
+	return 0;
+}
+
+static void p4tc_table_entry_destroy_noida(struct p4tc_table *table,
+					   struct p4tc_table_entry *entry)
+{
+	/* Entry refcount was already decremented */
+	__p4tc_table_entry_destroy(table, entry, true, false, 0);
+}
+
+/* Only deletes entries when called from pipeline put */
+void p4tc_table_entry_destroy_hash(void *ptr, void *arg)
+{
+	struct p4tc_table_entry *entry = ptr;
+	struct p4tc_table *table = arg;
+
+	p4tc_table_entry_destroy(table, entry, false, false,
+				 P4TC_ENTITY_TC);
+}
+
+struct p4tc_table_get_state {
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+};
+
+static void
+p4tc_table_entry_put_table(struct p4tc_table_get_state *table_get_state)
+{
+	p4tc_table_put_ref(table_get_state->table);
+	p4tc_pipeline_put_ref(table_get_state->pipeline);
+}
+
+static int
+p4tc_table_entry_get_table(struct net *net, int cmd,
+			   struct p4tc_table_get_state *table_get_state,
+			   struct nlattr **tb,
+			   struct p4tc_path_nlattrs *nl_path_attrs,
+			   struct netlink_ext_ack *extack)
+{
+	/* The following can only race with user driven events.
+	 * The netns is guaranteed to be alive.
+	 */
+	struct p4tc_pipeline *pipeline;
+	u32 *ids = nl_path_attrs->ids;
+	struct p4tc_table *table;
+	u32 pipeid, tbl_id;
+	char *tblname;
+	int ret;
+
+	rcu_read_lock();
+
+	pipeid = ids[P4TC_PID_IDX];
+
+	pipeline = p4tc_pipeline_find_get(net, nl_path_attrs->pname, pipeid,
+					  extack);
+	if (IS_ERR(pipeline)) {
+		ret = PTR_ERR(pipeline);
+		goto out;
+	}
+
+	if (cmd != RTM_P4TC_GET && !p4tc_pipeline_sealed(pipeline)) {
+		switch (cmd) {
+		case RTM_P4TC_CREATE:
+			NL_SET_ERR_MSG(extack,
+				       "Pipeline must be sealed for runtime create");
+			break;
+		case RTM_P4TC_UPDATE:
+			NL_SET_ERR_MSG(extack,
+				       "Pipeline must be sealed for runtime update");
+			break;
+		case RTM_P4TC_DEL:
+			NL_SET_ERR_MSG(extack,
+				       "Pipeline must be sealed for runtime delete");
+			break;
+		default:
+			/* Will never happen */
+			break;
+		}
+		ret = -EINVAL;
+		goto put;
+	}
+
+	tbl_id = ids[P4TC_TBLID_IDX];
+	tblname = tb[P4TC_ENTRY_TBLNAME] ?
+		nla_data(tb[P4TC_ENTRY_TBLNAME]) : NULL;
+
+	table = p4tc_table_find_get(pipeline, tblname, tbl_id, extack);
+	if (IS_ERR(table)) {
+		ret = PTR_ERR(table);
+		goto put;
+	}
+
+	rcu_read_unlock();
+
+	table_get_state->pipeline = pipeline;
+	table_get_state->table = table;
+
+	return 0;
+
+put:
+	p4tc_pipeline_put_ref(pipeline);
+
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+static void
+p4tc_table_entry_assign_key_exact(struct p4tc_table_entry_key *key, u8 *keyblob)
+{
+	memcpy(key->fa_key, keyblob, BITS_TO_BYTES(key->keysz));
+}
+
+static void
+p4tc_table_entry_assign_key_generic(struct p4tc_table_entry_key *key,
+				    struct p4tc_table_entry_mask *mask,
+				    u8 *keyblob, u8 *maskblob)
+{
+	u32 keysz = BITS_TO_BYTES(key->keysz);
+
+	memcpy(key->fa_key, keyblob, keysz);
+	memcpy(mask->fa_value, maskblob, keysz);
+}
+
+static int p4tc_table_entry_extract_key(struct p4tc_table *table,
+					struct nlattr **tb,
+					struct p4tc_table_entry_key *key,
+					struct p4tc_table_entry_mask *mask,
+					struct netlink_ext_ack *extack)
+{
+	bool is_exact = table->tbl_type == P4TC_TABLE_TYPE_EXACT;
+	void *keyblob, *maskblob;
+	u32 keysz;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ENTRY_KEY_BLOB)) {
+		NL_SET_ERR_MSG(extack, "Must specify key blobs");
+		return -EINVAL;
+	}
+
+	keysz = nla_len(tb[P4TC_ENTRY_KEY_BLOB]);
+	if (BITS_TO_BYTES(key->keysz) != keysz) {
+		NL_SET_ERR_MSG(extack,
+			       "Key blob size and table key size differ");
+		return -EINVAL;
+	}
+
+	if (!is_exact) {
+		if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ENTRY_MASK_BLOB)) {
+			NL_SET_ERR_MSG(extack, "Must specify mask blob");
+			return -EINVAL;
+		}
+
+		if (keysz != nla_len(tb[P4TC_ENTRY_MASK_BLOB])) {
+			NL_SET_ERR_MSG(extack,
+				       "Key and mask blob must have the same length");
+			return -EINVAL;
+		}
+	}
+
+	keyblob = nla_data(tb[P4TC_ENTRY_KEY_BLOB]);
+	if (is_exact) {
+		p4tc_table_entry_assign_key_exact(key, keyblob);
+	} else {
+		maskblob = nla_data(tb[P4TC_ENTRY_MASK_BLOB]);
+		p4tc_table_entry_assign_key_generic(key, mask, keyblob,
+						    maskblob);
+	}
+
+	return 0;
+}
+
+static void p4tc_table_entry_build_key(struct p4tc_table *table,
+				       struct p4tc_table_entry_key *key,
+				       struct p4tc_table_entry_mask *mask)
+{
+	int i;
+
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+		return;
+
+	key->maskid = mask->mask_id;
+
+	for (i = 0; i < BITS_TO_BYTES(key->keysz); i++)
+		key->fa_key[i] &= mask->fa_value[i];
+}
+
+static struct p4tc_table_entry_tm *
+p4tc_table_entry_create_tm(const u16 whodunnit)
+{
+	struct p4tc_table_entry_tm *dtm;
+
+	dtm = kzalloc(sizeof(*dtm), GFP_ATOMIC);
+	if (unlikely(!dtm))
+		return ERR_PTR(-ENOMEM);
+
+	dtm->who_created = whodunnit;
+	dtm->who_deleted = P4TC_ENTITY_UNSPEC;
+	dtm->created = jiffies;
+	dtm->firstused = 0;
+	dtm->lastused = jiffies;
+
+	return dtm;
+}
+
+/* Invoked from both control and data path */
+static int __p4tc_table_entry_create(struct p4tc_pipeline *pipeline,
+				     struct p4tc_table *table,
+				     struct p4tc_table_entry *entry,
+				     struct p4tc_table_entry_mask *mask,
+				     u16 whodunnit, bool from_control)
+__must_hold(RCU)
+{
+	struct p4tc_table_entry_mask *mask_found = NULL;
+	struct p4tc_table_entry_work *entry_work;
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_perm *tbl_perm;
+	struct p4tc_table_entry_tm *dtm;
+	u16 permissions;
+	int ret;
+
+	value = p4tc_table_entry_value(entry);
+	/* We set it to zero on create and update to avoid having the entry
+	 * deleted in parallel before we report to user space.
+	 */
+	refcount_set(&value->entries_ref, 0);
+
+	tbl_perm = rcu_dereference(table->tbl_permissions);
+	permissions = tbl_perm->permissions;
+	if (from_control) {
+		if (!p4tc_ctrl_create_ok(permissions))
+			return -EPERM;
+	} else {
+		if (!p4tc_data_create_ok(permissions))
+			return -EPERM;
+	}
+
+	/* From data plane we can only create entries on exact match */
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+		mask_found = p4tc_table_entry_mask_add(table, entry, mask);
+		if (IS_ERR(mask_found)) {
+			ret = PTR_ERR(mask_found);
+			goto out;
+		}
+	}
+
+	p4tc_table_entry_build_key(table, &entry->key, mask_found);
+
+	if (p4tc_entry_lookup(table, &entry->key, value->prio)) {
+		ret = -EEXIST;
+		goto rm_masks_idr;
+	}
+
+	dtm = p4tc_table_entry_create_tm(whodunnit);
+	if (IS_ERR(dtm)) {
+		ret = PTR_ERR(dtm);
+		goto rm_masks_idr;
+	}
+
+	rcu_assign_pointer(value->tm, dtm);
+
+	entry_work = kzalloc(sizeof(*entry_work), GFP_ATOMIC);
+	if (unlikely(!entry_work)) {
+		ret = -ENOMEM;
+		goto free_tm;
+	}
+
+	entry_work->pipeline = pipeline;
+	entry_work->table = table;
+	entry_work->entry = entry;
+	value->entry_work = entry_work;
+
+	INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
+
+	if (atomic_inc_return(&table->tbl_nelems) > table->tbl_max_entries) {
+		atomic_dec(&table->tbl_nelems);
+		ret = -ENOSPC;
+		goto free_work;
+	}
+
+	if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
+			    entry_hlt_params) < 0) {
+		atomic_dec(&table->tbl_nelems);
+		ret = -EBUSY;
+		goto free_work;
+	}
+
+	if (!from_control && p4tc_ctrl_pub_ok(value->permissions))
+		p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_CREATE,
+					  GFP_ATOMIC);
+
+	return 0;
+
+free_work:
+	kfree(entry_work);
+
+free_tm:
+	kfree(dtm);
+
+rm_masks_idr:
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		p4tc_table_entry_mask_del(table, entry);
+out:
+	return ret;
+}
+
+/* Invoked from both control and data path */
+static int __p4tc_table_entry_update(struct p4tc_pipeline *pipeline,
+				     struct p4tc_table *table,
+				     struct p4tc_table_entry *entry,
+				     struct p4tc_table_entry_mask *mask,
+				     u16 whodunnit, bool from_control)
+__must_hold(RCU)
+{
+	struct p4tc_table_entry_mask *mask_found = NULL;
+	struct p4tc_table_entry_work *entry_work;
+	struct p4tc_table_entry_value *value_old;
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry *entry_old;
+	struct p4tc_table_entry_tm *tm_old;
+	struct p4tc_table_entry_tm *tm;
+	int ret;
+
+	value = p4tc_table_entry_value(entry);
+	/* We set it to zero on update to avoid having the entry removed
+	 * from the rhashtable in parallel before we report to user space.
+	 */
+	refcount_set(&value->entries_ref, 0);
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+		mask_found = p4tc_table_entry_mask_add(table, entry, mask);
+		if (IS_ERR(mask_found)) {
+			ret = PTR_ERR(mask_found);
+			goto out;
+		}
+	}
+
+	p4tc_table_entry_build_key(table, &entry->key, mask_found);
+
+	entry_old = p4tc_entry_lookup(table, &entry->key, value->prio);
+	if (!entry_old) {
+		ret = -ENOENT;
+		goto rm_masks_idr;
+	}
+
+	/* In case of parallel update, the thread that arrives here first will
+	 * get the right to update.
+	 *
+	 * In case of a parallel get/update, whoever is second will fail
+	 * appropriately.
+	 */
+	value_old = p4tc_table_entry_value(entry_old);
+	if (!p4tc_tbl_entry_put(value_old)) {
+		ret = -EAGAIN;
+		goto rm_masks_idr;
+	}
+
+	if (from_control) {
+		if (!p4tc_ctrl_update_ok(value_old->permissions)) {
+			ret = -EPERM;
+			goto set_entries_refcount;
+		}
+	} else {
+		if (!p4tc_data_update_ok(value_old->permissions)) {
+			ret = -EPERM;
+			goto set_entries_refcount;
+		}
+	}
+
+	tm = kzalloc(sizeof(*tm), GFP_ATOMIC);
+	if (unlikely(!tm)) {
+		ret = -ENOMEM;
+		goto set_entries_refcount;
+	}
+
+	tm_old = rcu_dereference_protected(value_old->tm, 1);
+	*tm = *tm_old;
+
+	tm->lastused = jiffies;
+	tm->who_updated = whodunnit;
+
+	if (value->permissions == P4TC_PERMISSIONS_UNINIT)
+		value->permissions = value_old->permissions;
+
+	rcu_assign_pointer(value->tm, tm);
+
+	entry_work = kzalloc(sizeof(*entry_work), GFP_ATOMIC);
+	if (unlikely(!entry_work)) {
+		ret = -ENOMEM;
+		goto free_tm;
+	}
+
+	entry_work->pipeline = pipeline;
+	entry_work->table = table;
+	entry_work->entry = entry;
+	value->entry_work = entry_work;
+
+	INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
+
+	if (rhltable_insert(&table->tbl_entries, &entry->ht_node,
+			    entry_hlt_params) < 0) {
+		ret = -EEXIST;
+		goto free_entry_work;
+	}
+
+	p4tc_table_entry_destroy_noida(table, entry_old);
+
+	if (!from_control && p4tc_ctrl_pub_ok(value->permissions))
+		p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_UPDATE,
+					  GFP_ATOMIC);
+
+	return 0;
+
+free_entry_work:
+	kfree(entry_work);
+
+free_tm:
+	kfree(tm);
+
+set_entries_refcount:
+	refcount_set(&value_old->entries_ref, 1);
+
+rm_masks_idr:
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		p4tc_table_entry_mask_del(table, entry);
+
+out:
+	return ret;
+}
+
+static bool p4tc_table_check_entry_act(struct p4tc_table *table,
+				       struct tc_action *entry_act)
+{
+	struct tcf_p4act *entry_p4act = to_p4act(entry_act);
+	struct p4tc_table_act *table_act;
+
+	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+		if (table_act->act->common.p_id != entry_p4act->p_id ||
+		    table_act->act->a_id != entry_p4act->act_id)
+			continue;
+
+		if (!(table_act->flags &
+		      BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY)))
+			return true;
+	}
+
+	return false;
+}
+
+static bool p4tc_table_check_no_act(struct p4tc_table *table)
+{
+	struct p4tc_table_act *table_act;
+
+	if (list_empty(&table->tbl_acts_list))
+		return false;
+
+	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+		if (p4tc_table_act_is_noaction(table_act))
+			return true;
+	}
+
+	return false;
+}
+
+static const struct nla_policy
+p4tc_table_attrs_policy[P4TC_ENTRY_TBL_ATTRS_MAX + 1] = {
+	[P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT] = { .type = NLA_NESTED },
+	[P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS] = { .type = NLA_NESTED },
+	[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS] =
+		NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+};
+
+static int p4tc_tbl_attrs_update(struct net *net, struct p4tc_table *table,
+				 struct nlattr *attrs,
+				 struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ENTRY_TBL_ATTRS_MAX + 1];
+	struct p4tc_table_defact_params dflt = { 0 };
+	struct p4tc_table_perm *tbl_perm = NULL;
+	int err;
+
+	err = nla_parse_nested(tb, P4TC_ENTRY_TBL_ATTRS_MAX, attrs,
+			       p4tc_table_attrs_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (tb[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS]) {
+		u16 permissions;
+
+		if (atomic_read(&table->tbl_nelems) > 0) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to set table permissions if it already has entries");
+			return -EINVAL;
+		}
+
+		permissions = nla_get_u16(tb[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS]);
+		tbl_perm = p4tc_table_init_permissions(table, permissions,
+						       extack);
+		if (IS_ERR(tbl_perm))
+			return PTR_ERR(tbl_perm);
+	}
+
+	dflt.nla_hit = tb[P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT];
+	dflt.nla_miss = tb[P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS];
+
+	err = p4tc_table_init_default_acts(net, &dflt, table,
+					   &table->tbl_acts_list, extack);
+	if (err < 0)
+		goto free_tbl_perm;
+
+	p4tc_table_replace_default_acts(table, &dflt, true);
+	p4tc_table_replace_permissions(table, tbl_perm, true);
+
+	return 0;
+
+free_tbl_perm:
+	kfree(tbl_perm);
+	return err;
+}
+
+static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
+{
+	return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
+}
+
+#define P4TC_TBL_ENTRY_CU_FLAG_CREATE 0x1
+#define P4TC_TBL_ENTRY_CU_FLAG_UPDATE 0x2
+#define P4TC_TBL_ENTRY_CU_FLAG_SET 0x4
+
+static struct p4tc_table_entry *
+__p4tc_table_entry_cu(struct net *net, u8 cu_flags, struct nlattr **tb,
+		      struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		      struct netlink_ext_ack *extack)
+{
+	bool replace = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+	bool set = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_SET;
+	u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+		BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+	struct p4tc_table_entry_mask *mask = (void *)&__mask;
+	struct p4tc_table_entry_value *value;
+	u8 whodunnit = P4TC_ENTITY_UNSPEC;
+	struct p4tc_table_entry *entry;
+	u32 keysz_bytes;
+	u32 keysz_bits;
+	u16 tblperm;
+	int ret = 0;
+	u32 entrysz;
+	u32 prio;
+
+	prio = tb[P4TC_ENTRY_PRIO] ? nla_get_u32(tb[P4TC_ENTRY_PRIO]) : 0;
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT && replace) {
+		if (!prio) {
+			NL_SET_ERR_MSG(extack, "Must specify entry priority");
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+			if (prio) {
+				NL_SET_ERR_MSG(extack,
+					       "Mustn't specify entry priority for exact");
+				return ERR_PTR(-EINVAL);
+			}
+			prio = p4tc_table_entry_alloc_new_prio(table);
+		} else {
+			if (prio)
+				ret = ida_alloc_range(&table->tbl_prio_idr,
+						      prio, prio, GFP_ATOMIC);
+			else
+				ret = p4tc_table_entry_alloc_new_prio(table);
+			if (ret < 0) {
+				NL_SET_ERR_MSG(extack,
+					       "Unable to allocate priority");
+				return ERR_PTR(ret);
+			}
+			prio = ret;
+		}
+	}
+
+	keysz_bits = table->tbl_keysz;
+	keysz_bytes = P4TC_KEYSZ_BYTES(keysz_bits);
+
+	/* Entry memory layout:
+	 * { entry:key __aligned(8):value }
+	 */
+	entrysz = sizeof(*entry) + keysz_bytes +
+		sizeof(struct p4tc_table_entry_value);
+
+	entry = kzalloc(entrysz, GFP_KERNEL);
+	if (unlikely(!entry)) {
+		NL_SET_ERR_MSG(extack, "Unable to allocate table entry");
+		ret = -ENOMEM;
+		goto idr_rm;
+	}
+
+	entry->key.keysz = keysz_bits;
+	mask->sz = keysz_bits;
+
+	ret = p4tc_table_entry_extract_key(table, tb, &entry->key, mask,
+					   extack);
+	if (ret < 0)
+		goto free_entry;
+
+	value = p4tc_table_entry_value(entry);
+	value->prio = prio;
+
+	rcu_read_lock();
+	tblperm = rcu_dereference(table->tbl_permissions)->permissions;
+	rcu_read_unlock();
+
+	if (tb[P4TC_ENTRY_PERMISSIONS]) {
+		u16 nlperm;
+
+		nlperm = nla_get_u16(tb[P4TC_ENTRY_PERMISSIONS]);
+		if (~tblperm & nlperm) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to set permission bit that is not allowed by table");
+			ret = -EINVAL;
+			goto free_entry;
+		}
+
+		if (p4tc_ctrl_create_ok(nlperm) ||
+		    p4tc_data_create_ok(nlperm)) {
+			NL_SET_ERR_MSG(extack,
+				       "Create permission for table entry doesn't make sense");
+			ret = -EINVAL;
+			goto free_entry;
+		}
+		value->permissions = nlperm;
+	} else {
+		if (replace)
+			value->permissions = P4TC_PERMISSIONS_UNINIT;
+		else
+			value->permissions =
+				p4tc_table_entry_tbl_permcpy(tblperm);
+	}
+
+	if (tb[P4TC_ENTRY_ACT]) {
+		ret = p4tc_action_init(net, tb[P4TC_ENTRY_ACT], value->acts,
+				       table->common.p_id,
+				       TCA_ACT_FLAGS_NO_RTNL, extack);
+		if (unlikely(ret < 0))
+			goto free_entry;
+
+		if (!p4tc_table_check_entry_act(table, value->acts[0])) {
+			ret = -EPERM;
+			NL_SET_ERR_MSG(extack,
+				       "Action is not allowed as entry action");
+			goto free_acts;
+		}
+	} else {
+		if (!p4tc_table_check_no_act(table)) {
+			NL_SET_ERR_MSG(extack,
+				       "Entry must have action associated with it");
+			ret = -EPERM;
+			goto free_entry;
+		}
+	}
+
+	whodunnit = nla_get_u8(tb[P4TC_ENTRY_WHODUNNIT]);
+
+	rcu_read_lock();
+	if (replace) {
+		ret = __p4tc_table_entry_update(pipeline, table, entry, mask,
+						whodunnit, true);
+	} else {
+		ret = __p4tc_table_entry_create(pipeline, table, entry, mask,
+						whodunnit, true);
+		if (set && ret == -EEXIST)
+			ret = __p4tc_table_entry_update(pipeline, table, entry,
+							mask, whodunnit, true);
+	}
+	rcu_read_unlock();
+	if (ret < 0) {
+		if ((replace || set) && ret == -EAGAIN)
+			NL_SET_ERR_MSG(extack,
+				       "Entry was being updated in parallel");
+		else if (ret == -ENOSPC)
+			NL_SET_ERR_MSG(extack, "Table max entries reached");
+		else
+			NL_SET_ERR_MSG(extack, "Failed to create/update entry");
+
+		goto free_acts;
+	}
+
+	return entry;
+
+free_acts:
+	p4tc_action_destroy(value->acts);
+
+free_entry:
+	kfree(entry);
+
+idr_rm:
+	if (!replace)
+		p4tc_table_entry_free_prio(table, prio);
+
+	return ERR_PTR(ret);
+}
+
+static int p4tc_table_entry_cu(struct net *net, struct sk_buff *skb,
+			       u8 cu_flags, u16 *permissions,
+			       struct nlattr *arg,
+			       struct p4tc_path_nlattrs *nl_path_attrs,
+			       struct netlink_ext_ack *extack)
+{
+	bool replace = cu_flags == P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+	int cmd = replace ? RTM_P4TC_UPDATE : RTM_P4TC_CREATE;
+	struct p4tc_table_get_state table_get_state = { NULL };
+	struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+	struct p4tc_table_entry_value *value;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table_entry *entry;
+	u32 *ids = nl_path_attrs->ids;
+	bool has_listener = !!skb;
+	struct p4tc_table *table;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+			       extack);
+	if (ret < 0)
+		return ret;
+
+	ret = p4tc_table_entry_get_table(net, cmd, &table_get_state, tb,
+					 nl_path_attrs, extack);
+	if (ret < 0)
+		return ret;
+
+	pipeline = table_get_state.pipeline;
+	table = table_get_state.table;
+
+	if (replace && tb[P4TC_ENTRY_TBL_ATTRS]) {
+		/* Table attributes update */
+		ret = p4tc_tbl_attrs_update(net, table,
+					    tb[P4TC_ENTRY_TBL_ATTRS], extack);
+		goto table_put;
+	} else {
+		/* Table entry create or update */
+		if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_WHODUNNIT)) {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify whodunnit attribute");
+			ret = -EINVAL;
+			goto table_put;
+		}
+	}
+
+	entry = __p4tc_table_entry_cu(net, cu_flags, tb, pipeline, table,
+				      extack);
+	if (IS_ERR(entry)) {
+		ret = PTR_ERR(entry);
+		goto table_put;
+	}
+
+	value = p4tc_table_entry_value(entry);
+	if (has_listener) {
+		if (p4tc_ctrl_pub_ok(value->permissions)) {
+			if (p4tc_tbl_entry_fill(skb, table, entry,
+						table->tbl_id) <= 0)
+				NL_SET_ERR_MSG(extack,
+					       "Unable to fill table entry attributes");
+
+			if (!nl_path_attrs->pname_passed)
+				strscpy(nl_path_attrs->pname,
+					pipeline->common.name,
+					P4TC_PIPELINE_NAMSIZ);
+
+			if (!ids[P4TC_PID_IDX])
+				ids[P4TC_PID_IDX] = pipeline->common.p_id;
+		}
+
+		*permissions = value->permissions;
+	}
+
+	/* We set it to zero on create and update to avoid having the entry
+	 * deleted in parallel before we report to user space.
+	 * We only set it to 1 here, after reporting.
+	 */
+	refcount_set(&value->entries_ref, 1);
+
+table_put:
+	p4tc_table_entry_put_table(&table_get_state);
+	return ret;
+}
+
+struct p4tc_table_entry *
+p4tc_tmpl_table_entry_cu(struct net *net, struct nlattr *arg,
+			 struct p4tc_pipeline *pipeline,
+			 struct p4tc_table *table,
+			 struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+	u8 cu_flags = P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry *entry;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+			       extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_WHODUNNIT)) {
+		NL_SET_ERR_MSG(extack, "Must specify whodunnit attribute");
+		return ERR_PTR(-EINVAL);
+	}
+
+	entry = __p4tc_table_entry_cu(net, cu_flags, tb, pipeline, table,
+				      extack);
+	if (IS_ERR(entry))
+		return entry;
+
+	value = p4tc_table_entry_value(entry);
+	refcount_set(&value->entries_ref, 1);
+	value->tmpl_created = true;
+
+	return entry;
+}
+
+static int p4tc_tbl_entry_cu_1(struct net *net, struct sk_buff *skb,
+			       u8 cu_flags, u16 *permissions,
+			       struct nlattr *nla,
+			       struct p4tc_path_nlattrs *nl_path_attrs,
+			       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MAX + 1];
+	u32 *tbl_id;
+	int ret = 0;
+
+	ret = nla_parse_nested(tb, P4TC_MAX, nla, p4tc_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PATH)) {
+		NL_SET_ERR_MSG(extack, "Must specify object path");
+		return -EINVAL;
+	}
+
+	if (NL_REQ_ATTR_CHECK(extack, nla, tb, P4TC_PARAMS)) {
+		NL_SET_ERR_MSG(extack, "Must specify object attributes");
+		return -EINVAL;
+	}
+
+	tbl_id = nla_data(tb[P4TC_PATH]);
+	memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], tbl_id,
+	       nla_len(tb[P4TC_PATH]));
+
+	return p4tc_table_entry_cu(net, skb, cu_flags, permissions,
+				   tb[P4TC_PARAMS], nl_path_attrs, extack);
+}
+
+static int __p4tc_entry_root_num_batched(struct nlattr *p4tca[])
+{
+	int i = 1;
+
+	while (i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i])
+		i++;
+
+	return i - 1;
+}
+
+static int __p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
+				 struct nlmsghdr *n, int cmd, char *p_name,
+				 struct nlattr *p4tca[],
+				 struct netlink_ext_ack *extack)
+{
+	struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+	struct p4tc_path_nlattrs nl_path_attrs = { 0 };
+	u32 portid = NETLINK_CB(skb).portid;
+	u16 permissions = P4TC_CTRL_PERM_P;
+	u32 ids[P4TC_PATH_MAX] = { 0 };
+	int i, num_pub_permission = 0;
+	int ret = 0, ret_send;
+	struct p4tcmsg *t_new;
+	struct sk_buff *nskb;
+	struct nlmsghdr *nlh;
+	struct nlattr *pn_att;
+	struct nlattr *root;
+
+	nskb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (unlikely(!nskb))
+		return -ENOBUFS;
+
+	nlh = nlmsg_put(nskb, portid, n->nlmsg_seq, cmd, sizeof(*t),
+			n->nlmsg_flags);
+	if (unlikely(!nlh)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	t_new = nlmsg_data(nlh);
+	t_new->pipeid = t->pipeid;
+	t_new->obj = t->obj;
+	ids[P4TC_PID_IDX] = t_new->pipeid;
+	nl_path_attrs.ids = ids;
+
+	pn_att = nla_reserve(nskb, P4TC_ROOT_PNAME, P4TC_PIPELINE_NAMSIZ);
+	if (unlikely(!pn_att)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	nl_path_attrs.pname = nla_data(pn_att);
+	if (!p_name) {
+		/* Filled up by the operation or forced failure */
+		memset(nl_path_attrs.pname, 0, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = false;
+	} else {
+		strscpy(nl_path_attrs.pname, p_name, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = true;
+	}
+
+	root = nla_nest_start(nskb, P4TC_ROOT);
+	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i]; i++) {
+		struct nlattr *nest = nla_nest_start(nskb, i);
+
+		if (cmd == RTM_P4TC_CREATE || cmd == RTM_P4TC_UPDATE) {
+			u8 cu_flags;
+
+			if (cmd == RTM_P4TC_UPDATE)
+				cu_flags = P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+			else if (n->nlmsg_flags & NLM_F_REPLACE)
+				cu_flags = P4TC_TBL_ENTRY_CU_FLAG_SET;
+			else
+				cu_flags = P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+
+			ret = p4tc_tbl_entry_cu_1(net, nskb, cu_flags,
+						  &permissions,
+						  p4tca[i], &nl_path_attrs,
+						  extack);
+		}
+
+		if (p4tc_ctrl_pub_ok(permissions)) {
+			num_pub_permission++;
+		} else {
+			nla_nest_cancel(nskb, nest);
+			continue;
+		}
+
+		if (ret < 0) {
+			int num_batched = __p4tc_entry_root_num_batched(p4tca);
+
+			NL_SET_ERR_MSG_FMT(extack,
+					   "%s\nProcessed %d/%d entries",
+					   extack->_msg, i, num_batched);
+			if (i == 1) {
+				goto out;
+			} else {
+				nla_nest_cancel(nskb, nest);
+				break;
+			}
+		}
+		nla_nest_end(nskb, nest);
+	}
+	nla_nest_end(nskb, root);
+
+	if (!t_new->pipeid)
+		t_new->pipeid = ids[P4TC_PID_IDX];
+
+	nlmsg_end(nskb, nlh);
+
+	if (num_pub_permission) {
+		ret_send = rtnetlink_send(nskb, net, portid, RTNLGRP_TC,
+					  n->nlmsg_flags & NLM_F_ECHO);
+	} else {
+		ret_send = 0;
+		kfree_skb(nskb);
+	}
+
+	return ret_send ? ret_send : ret;
+
+out:
+	kfree_skb(nskb);
+	return ret;
+}
+
+static int __p4tc_tbl_entry_root_fast(struct net *net, struct nlmsghdr *n,
+				      int cmd, char *p_name,
+				      struct nlattr *p4tca[],
+				      struct netlink_ext_ack *extack)
+{
+	struct p4tcmsg *t = (struct p4tcmsg *)nlmsg_data(n);
+	struct p4tc_path_nlattrs nl_path_attrs = {0};
+	u32 ids[P4TC_PATH_MAX] = { 0 };
+	int ret = 0;
+	int i;
+
+	ids[P4TC_PID_IDX] = t->pipeid;
+	nl_path_attrs.ids = ids;
+
+	/* Only read for searching the pipeline */
+	nl_path_attrs.pname = p_name;
+
+	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i]; i++) {
+		if (cmd == RTM_P4TC_CREATE ||
+		    cmd == RTM_P4TC_UPDATE) {
+			u8 cu_flags;
+
+			if (cmd == RTM_P4TC_UPDATE)
+				cu_flags = P4TC_TBL_ENTRY_CU_FLAG_UPDATE;
+			else
+				if (n->nlmsg_flags & NLM_F_REPLACE)
+					cu_flags = P4TC_TBL_ENTRY_CU_FLAG_SET;
+				else
+					cu_flags =
+						P4TC_TBL_ENTRY_CU_FLAG_CREATE;
+
+			ret = p4tc_tbl_entry_cu_1(net, NULL, cu_flags, NULL,
+						  p4tca[i], &nl_path_attrs,
+						  extack);
+		}
+
+		if (ret < 0)
+			goto out;
+	}
+
+out:
+	return ret;
+}
+
+int p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
+			struct nlmsghdr *n, int cmd,
+			struct netlink_ext_ack *extack)
+{
+	struct nlattr *p4tca[P4TC_MSGBATCH_SIZE + 1];
+	int echo = n->nlmsg_flags & NLM_F_ECHO;
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
+	int listeners;
+	int ret = 0;
+
+	ret = nlmsg_parse(n, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(extack, "Netlink P4TC table attributes missing");
+		return -EINVAL;
+	}
+
+	ret = nla_parse_nested(p4tca, P4TC_MSGBATCH_SIZE, tb[P4TC_ROOT], NULL,
+			       extack);
+	if (ret < 0)
+		return ret;
+
+	if (!p4tca[1]) {
+		NL_SET_ERR_MSG(extack, "No elements in root table array");
+		return -EINVAL;
+	}
+
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	listeners = rtnl_has_listeners(net, RTNLGRP_TC);
+
+	if ((echo || listeners) || cmd == RTM_P4TC_GET)
+		ret = __p4tc_tbl_entry_root(net, skb, n, cmd, p_name, p4tca,
+					    extack);
+	else
+		ret = __p4tc_tbl_entry_root_fast(net, n, cmd, p_name, p4tca,
+						 extack);
+	return ret;
+}
diff --git a/net/sched/p4tc/p4tc_tmpl_api.c b/net/sched/p4tc/p4tc_tmpl_api.c
index cc7e23a4a..385553479 100644
--- a/net/sched/p4tc/p4tc_tmpl_api.c
+++ b/net/sched/p4tc/p4tc_tmpl_api.c
@@ -27,12 +27,12 @@
 #include <net/netlink.h>
 #include <net/flow_offload.h>
 
-static const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
+const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1] = {
 	[P4TC_ROOT] = { .type = NLA_NESTED },
 	[P4TC_ROOT_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ },
 };
 
-static const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
+const struct nla_policy p4tc_policy[P4TC_MAX + 1] = {
 	[P4TC_PATH] = { .type = NLA_BINARY,
 			.len = P4TC_PATH_MAX * sizeof(u32) },
 	[P4TC_PARAMS] = { .type = NLA_NESTED },
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index e50a1c1ff..da7902404 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -98,6 +98,10 @@ static const struct nlmsg_perm nlmsg_route_perms[] = {
 	{ RTM_DELP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 	{ RTM_GETP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_READ },
 	{ RTM_UPDATEP4TEMPLATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_P4TC_CREATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_P4TC_DEL,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_P4TC_GET,	NETLINK_ROUTE_SOCKET__NLMSG_READ },
+	{ RTM_P4TC_UPDATE,	NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
 };
 
 static const struct nlmsg_perm nlmsg_tcpdiag_perms[] = {
@@ -181,7 +185,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
 		 * structures at the top of this file with the new mappings
 		 * before updating the BUILD_BUG_ON() macro!
 		 */
-		BUILD_BUG_ON(RTM_MAX != (RTM_CREATEP4TEMPLATE + 3));
+		BUILD_BUG_ON(RTM_MAX != (RTM_P4TC_CREATE + 3));
 		err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms,
 				 sizeof(nlmsg_route_perms));
 		break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH net-next v12 13/15] p4tc: add runtime table entry get, delete, flush and dump
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (11 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 12/15] p4tc: add runtime table entry create and update Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-25 16:54 ` [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

This commit allows users to get, delete, flush and dump table _entries_
(templates were described in an earlier patch, as were entry create and
update).

If the user wants to, for example, read an entry from a table (nh_table)
that belongs to the pipeline (routing), is keyed on nh_index (bit32),
and whose nh_index key value is 1, they'd issue the following command:

tc p4ctrl get routing/table/Main/nh_table nh_index 1

Which will give us the following output:

pipeline:  routing(id 1)
 table: Main/nh_table(id 1)entry priority 64000[permissions -RUD-PS-R--X--]
    entry key
     nh_index id:1 size:32b type:bit32 exact fieldval  1
    entry actions:
	action order 1: routing/Main/set_nh  index 2 ref 1 bind 1
	 params:
	  dmac type macaddr  value: 13:37:13:37:13:37 id 1
	  port type dev  value: port1 id 2

    created by: tc (id 2)
    dynamic false
    created 20338 sec  used 39 sec

Note that, as with create and update, we need to specify the pipeline name,
the table name, the key and the priority, so that we can locate the table
entry. Also, in this case, the entry has an action with two parameters:
dmac (a MAC address) and port (a net device, here port1).

If the user wanted to delete the same table entry, they'd issue the
following command:

tc p4ctrl del routing/table/Main/nh_table nh_index 1

Note that, again, we need to specify the pipeline name, the table
name, the key and the priority, so that we can locate the table entry.

We can also flush all the table entries from a specific table.
To flush the entries of table nh_table in pipeline routing,
the user would issue the following command:

tc p4ctrl del routing/table/Main/nh_table

Likewise, we can also dump all the table entries from a specific table.
To dump the entries of table nh_table in pipeline routing, the user
would issue the following command:

tc p4ctrl get routing/table/Main/nh_table

Which will give the following output (in case we had one entry with
nh_index key value 1):

pipeline:  routing(id 1)
 table: Main/nh_table(id 1)entry priority 64000[permissions -RUD-PS-R--X--]
    entry key
     nh_index id:1 size:32b type:bit32 exact fieldval  1
    entry actions:
	action order 1: routing/Main/set_nh  index 2 ref 1 bind 1
	 params:
	  dmac type macaddr  value: 13:37:13:37:13:37 id 1
	  port type dev  value: port1 id 2

    created by: tc (id 2)
    dynamic false
    created 20338 sec  used 39 sec

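For quick reference, the four runtime operations described above can be
summarized as a single sequence. All names (routing, Main, nh_table,
nh_index) come from the example above; exact CLI syntax depends on the
matching iproute2-p4tc userspace tooling:

```
# Read one entry by key
tc p4ctrl get routing/table/Main/nh_table nh_index 1

# Delete that same entry
tc p4ctrl del routing/table/Main/nh_table nh_index 1

# Flush all entries (no key given)
tc p4ctrl del routing/table/Main/nh_table

# Dump all entries (no key given)
tc p4ctrl get routing/table/Main/nh_table
```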
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/p4tc.h                |  28 +-
 include/uapi/linux/p4tc.h         |  90 +++-
 net/sched/p4tc/Makefile           |   2 +-
 net/sched/p4tc/p4tc_filter.c      | 870 ++++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_runtime_api.c |  63 +++
 net/sched/p4tc/p4tc_table.c       | 113 +++-
 net/sched/p4tc/p4tc_tbl_entry.c   | 853 ++++++++++++++++++++++++++++-
 7 files changed, 2010 insertions(+), 9 deletions(-)
 create mode 100644 net/sched/p4tc/p4tc_filter.c

diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index bc32b73ec..231936df4 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -36,9 +36,15 @@
 #define P4TC_AID_IDX 1
 #define P4TC_PARSEID_IDX 1
 
+struct p4tc_filter {
+	struct p4tc_filter_oper *operation;
+	int cmd; /* CRUD command */
+};
+
 struct p4tc_dump_ctx {
 	u32 ids[P4TC_PATH_MAX];
 	struct rhashtable_iter *iter;
+	struct p4tc_filter *entry_filter;
 };
 
 struct p4tc_template_common;
@@ -351,6 +357,7 @@ struct p4tc_table_entry_value {
 	struct p4tc_table_entry_work             *entry_work;
 	u64                                      aging_ms;
 	struct hrtimer                           entry_timer;
+	bool                                     is_dyn;
 	bool                                     tmpl_created;
 };
 
@@ -534,6 +541,15 @@ p4tc_table_init_permissions(struct p4tc_table *table, u16 permissions,
 void p4tc_table_replace_permissions(struct p4tc_table *table,
 				    struct p4tc_table_perm *tbl_perm,
 				    bool lock_rtnl);
+int p4tc_table_timer_profile_update(struct p4tc_table *table,
+				    struct nlattr *nla,
+				    struct netlink_ext_ack *extack);
+struct p4tc_table_timer_profile *
+p4tc_table_timer_profile_find_byaging(struct p4tc_table *table,
+				      u64 aging_ms);
+struct p4tc_table_timer_profile *
+p4tc_table_timer_profile_find(struct p4tc_table *table, u32 profile_id);
+
 void p4tc_table_entry_destroy_hash(void *ptr, void *arg);
 
 struct p4tc_table_entry *
@@ -548,7 +564,17 @@ int p4tc_tbl_entry_dumpit(struct net *net, struct sk_buff *skb,
 			  struct netlink_callback *cb,
 			  struct nlattr *arg, char *p_name);
 int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
-			struct p4tc_table_entry *entry, u32 tbl_id);
+			struct p4tc_table_entry *entry, u32 tbl_id,
+			u16 who_deleted);
+void p4tc_tbl_entry_mask_key(u8 *masked_key, u8 *key, const u8 *mask,
+			     u32 masksz);
+
+struct p4tc_filter *
+p4tc_filter_build(struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		  struct nlattr *nla, struct netlink_ext_ack *extack);
+bool p4tc_filter_exec(struct p4tc_filter *filter,
+		      struct p4tc_table_entry *entry);
+void p4tc_filter_destroy(struct p4tc_filter *filter);
 
 struct tcf_p4act *
 p4a_runt_prealloc_get_next(struct p4tc_act *act);
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index adac8024c..3f1444ad9 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -226,6 +226,8 @@ enum {
 	__P4TC_TABLE_MAX
 };
 
+#define P4TC_TABLE_MAX (__P4TC_TABLE_MAX - 1)
+
 enum {
 	P4TC_TIMER_PROFILE_UNSPEC,
 	P4TC_TIMER_PROFILE_ID, /* u32 */
@@ -233,7 +235,7 @@ enum {
 	__P4TC_TIMER_PROFILE_MAX
 };
 
-#define P4TC_TABLE_MAX (__P4TC_TABLE_MAX - 1)
+#define P4TC_TIMER_PROFILE_MAX (__P4TC_TIMER_PROFILE_MAX - 1)
 
 /* Action attributes */
 enum {
@@ -300,11 +302,93 @@ struct p4tc_table_entry_tm {
 	__u16 permissions;
 };
 
+enum {
+	P4TC_FILTER_OPND_ACT_UNSPEC,
+	P4TC_FILTER_OPND_ACT_NAME, /* string  */
+	P4TC_FILTER_OPND_ACT_ID, /* u32 */
+	P4TC_FILTER_OPND_ACT_PARAMS, /* nested params */
+	__P4TC_FILTER_OPND_ACT_MAX
+};
+
+#define P4TC_FILTER_OPND_ACT_MAX (__P4TC_FILTER_OPND_ACT_MAX - 1)
+
+enum {
+	P4TC_FILTER_OPND_UNSPEC,
+	P4TC_FILTER_OPND_ENTRY_KEY_BLOB, /* Key blob */
+	P4TC_FILTER_OPND_ENTRY_MASK_BLOB, /* Mask blob */
+	P4TC_FILTER_OPND_ACT, /* nested action - P4TC_FILTER_OPND_ACT_XXX */
+	P4TC_FILTER_OPND_PRIO, /* u32 */
+	P4TC_FILTER_OPND_TIME_DELTA, /* in msecs */
+	__P4TC_FILTER_OPND_MAX
+};
+
+#define P4TC_FILTER_OPND_MAX (__P4TC_FILTER_OPND_MAX - 1)
+
+enum {
+	P4TC_FILTER_OP_UNSPEC,
+	P4TC_FILTER_OP_REL,
+	P4TC_FILTER_OP_LOGICAL,
+	__P4TC_FILTER_OP_MAX
+};
+
+#define P4TC_FILTER_OP_MAX (__P4TC_FILTER_OP_MAX - 1)
+
+enum {
+	P4TC_FILTER_OP_REL_UNSPEC,
+	P4TC_FILTER_OP_REL_EQ,
+	P4TC_FILTER_OP_REL_NEQ,
+	P4TC_FILTER_OP_REL_LT,
+	P4TC_FILTER_OP_REL_GT,
+	P4TC_FILTER_OP_REL_LE,
+	P4TC_FILTER_OP_REL_GE,
+	__P4TC_FILTER_OP_REL_MAX
+};
+
+#define P4TC_FILTER_OP_REL_MAX (__P4TC_FILTER_OP_REL_MAX - 1)
+
+enum {
+	P4TC_FILTER_OP_LOGICAL_UNSPEC,
+	P4TC_FILTER_OP_LOGICAL_AND,
+	P4TC_FILTER_OP_LOGICAL_OR,
+	P4TC_FILTER_OP_LOGICAL_NOT,
+	P4TC_FILTER_OP_LOGICAL_XOR,
+	__P4TC_FILTER_OP_LOGICAL_MAX
+};
+
+#define P4TC_FILTER_OP_LOGICAL_MAX (__P4TC_FILTER_OP_LOGICAL_MAX - 1)
+
+enum p4tc_filter_ntype {
+	P4TC_FILTER_NODE_UNSPEC,
+	P4TC_FILTER_NODE_PARENT, /* nested - P4TC_FILTER_XXX */
+	P4TC_FILTER_NODE_LEAF, /* nested - P4TC_FILTER_OPND_XXX */
+	__P4TC_FILTER_NODE_MAX
+};
+
+#define P4TC_FILTER_NODE_MAX (__P4TC_FILTER_NODE_MAX - 1)
+
+enum {
+	P4TC_FILTER_UNSPEC,
+	P4TC_FILTER_OP_KIND, /* P4TC_FILTER_OP_REL || P4TC_FILTER_OP_LOGICAL */
+	P4TC_FILTER_OP_VALUE, /* P4TC_FILTER_OP_REL_XXX ||
+			       * P4TC_FILTER_OP_LOGICAL_XXX
+			       */
+	P4TC_FILTER_NODE1, /* nested - P4TC_FILTER_NODE_XXX */
+	P4TC_FILTER_NODE2, /* nested - P4TC_FILTER_NODE_XXX - Present only for
+			    * LOGICAL OPS with LOGICAL_NOT being the exception
+			    */
+	__P4TC_FILTER_MAX
+};
+
+#define P4TC_FILTER_MAX (__P4TC_FILTER_MAX - 1)
+
+#define P4TC_FILTER_DEPTH_LIMIT 5
+
 enum {
 	P4TC_ENTRY_TBL_ATTRS_UNSPEC,
 	P4TC_ENTRY_TBL_ATTRS_DEFAULT_HIT, /* nested default hit attrs */
 	P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS, /* nested default miss attrs */
 	P4TC_ENTRY_TBL_ATTRS_PERMISSIONS, /* u16 table permissions */
+	P4TC_ENTRY_TBL_ATTRS_TIMER_PROFILE, /* nested timer profile */
 	__P4TC_ENTRY_TBL_ATTRS,
 };
 
@@ -332,6 +416,10 @@ enum {
 	P4TC_ENTRY_TMPL_CREATED, /* u8 tells whether entry was create by
 				  * template
 				  */
+	P4TC_ENTRY_DYNAMIC, /* u8 tells if table entry is dynamic */
+	P4TC_ENTRY_AGING, /* u64 table entry aging */
+	P4TC_ENTRY_PROFILE_ID, /* u32 table entry profile ID */
+	P4TC_ENTRY_FILTER, /* nested filter */
 	P4TC_ENTRY_PAD,
 	__P4TC_ENTRY_MAX
 };
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 921909ac4..56a8adc74 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -2,4 +2,4 @@
 
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
 	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
-	p4tc_runtime_api.o
+	p4tc_filter.o p4tc_runtime_api.o
diff --git a/net/sched/p4tc/p4tc_filter.c b/net/sched/p4tc/p4tc_filter.c
new file mode 100644
index 000000000..4db726816
--- /dev/null
+++ b/net/sched/p4tc/p4tc_filter.c
@@ -0,0 +1,870 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/p4tc/p4tc_filter.c P4 TC FILTER
+ *
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <net/p4tc.h>
+#include <net/netlink.h>
+
+enum {
+	P4TC_FILTER_OPND_KIND_UNSPEC,
+	P4TC_FILTER_OPND_KIND_ENTRY_KEY,
+	P4TC_FILTER_OPND_KIND_ACT,
+	P4TC_FILTER_OPND_KIND_ACT_PARAM,
+	P4TC_FILTER_OPND_KIND_PRIO,
+	P4TC_FILTER_OPND_KIND_MSECS,
+};
+
+struct p4tc_filter_opnd {
+	u32 opnd_kind;
+	union {
+		struct {
+			u8 val[BITS_TO_BYTES(P4TC_MAX_KEYSZ)];
+			u8 mask[BITS_TO_BYTES(P4TC_MAX_KEYSZ)];
+		} entry_key;
+		struct {
+			struct p4tc_act *val;
+			struct p4tc_act_param *param;
+		} act;
+		u32 prio;
+		u32 msecs_since;
+	};
+};
+
+struct p4tc_filter_oper;
+struct p4tc_filter_node {
+	enum p4tc_filter_ntype ntype;
+	union {
+		struct p4tc_filter_opnd *opnd;
+		struct p4tc_filter_oper *operation;
+	};
+};
+
+struct p4tc_filter_oper {
+	struct p4tc_filter_node *node1;
+	struct p4tc_filter_node *node2;
+	u16 op_kind;
+	u16 op_value;
+};
+
+static const struct nla_policy
+p4tc_entry_filter_act_policy[P4TC_FILTER_OPND_ACT_MAX + 1] = {
+	[P4TC_FILTER_OPND_ACT_NAME] = {
+		.type = NLA_STRING,
+		.len = P4TC_ACT_TMPL_NAMSZ
+	},
+	[P4TC_FILTER_OPND_ACT_ID] = { .type = NLA_U32 },
+	[P4TC_FILTER_OPND_ACT_PARAMS] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+p4tc_entry_filter_opnd_policy[P4TC_FILTER_OPND_MAX + 1] = {
+	[P4TC_FILTER_OPND_ENTRY_KEY_BLOB] = { .type = NLA_BINARY },
+	[P4TC_FILTER_OPND_ENTRY_MASK_BLOB] = { .type = NLA_BINARY },
+	[P4TC_FILTER_OPND_ACT] = { .type = NLA_NESTED },
+	[P4TC_FILTER_OPND_PRIO] = { .type = NLA_U32 },
+	[P4TC_FILTER_OPND_TIME_DELTA] = { .type = NLA_U32 },
+};
+
+static const struct nla_policy
+p4tc_entry_filter_node_policy[P4TC_FILTER_NODE_MAX + 1] = {
+	[P4TC_FILTER_NODE_PARENT] = { .type = NLA_NESTED },
+	[P4TC_FILTER_NODE_LEAF] = { .type = NLA_NESTED },
+};
+
+static struct netlink_range_validation range_filter_op_kind = {
+	.min = P4TC_FILTER_OP_REL,
+	.max = P4TC_FILTER_OP_MAX,
+};
+
+static const struct nla_policy p4tc_entry_filter_policy[P4TC_FILTER_MAX + 1] = {
+	[P4TC_FILTER_OP_KIND] =
+		NLA_POLICY_FULL_RANGE(NLA_U16, &range_filter_op_kind),
+	[P4TC_FILTER_OP_VALUE] = { .type = NLA_U16 },
+	[P4TC_FILTER_NODE1] = { .type = NLA_NESTED },
+	[P4TC_FILTER_NODE2] = { .type = NLA_NESTED },
+};
+
+static bool p4tc_filter_msg_valid(struct nlattr **tb,
+				  struct netlink_ext_ack *extack)
+{
+	bool is_empty = true;
+	int i;
+
+	if ((tb[P4TC_FILTER_OPND_ENTRY_KEY_BLOB] &&
+	     !tb[P4TC_FILTER_OPND_ENTRY_MASK_BLOB]) ||
+	    (tb[P4TC_FILTER_OPND_ENTRY_MASK_BLOB] &&
+	     !tb[P4TC_FILTER_OPND_ENTRY_KEY_BLOB])) {
+		NL_SET_ERR_MSG(extack, "Must specify key with mask");
+		return false;
+	}
+
+	for (i = P4TC_FILTER_OPND_ENTRY_MASK_BLOB; i < P4TC_FILTER_OPND_MAX + 1;
+	     i++) {
+		if (tb[i]) {
+			if (!is_empty) {
+				NL_SET_ERR_MSG(extack,
+					       "May only specify one filter key attribute");
+				return false;
+			}
+			is_empty = false;
+		}
+	}
+
+	if (is_empty) {
+		NL_SET_ERR_MSG(extack, "Filter opnd message is empty");
+		return false;
+	}
+
+	return true;
+}
+
+static bool p4tc_filter_op_value_valid(const u16 filter_op_kind,
+				       const u16 filter_op_value,
+				       struct netlink_ext_ack *extack)
+{
+	switch (filter_op_kind) {
+	case P4TC_FILTER_OP_REL:
+		if (filter_op_value < P4TC_FILTER_OP_REL_EQ ||
+		    filter_op_value > P4TC_FILTER_OP_REL_MAX) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Invalid filter relational op %u\n",
+					   filter_op_value);
+			return false;
+		}
+		break;
+	case P4TC_FILTER_OP_LOGICAL:
+		if (filter_op_value < P4TC_FILTER_OP_LOGICAL_AND ||
+		    filter_op_value > P4TC_FILTER_OP_LOGICAL_MAX) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Invalid filter logical op %u\n",
+					   filter_op_value);
+			return false;
+		}
+		break;
+	default:
+		/* Will never happen */
+		return false;
+	}
+
+	return true;
+}
+
+static bool p4tc_filter_op_requires_node2(const u16 filter_op_kind,
+					  const u16 filter_op_value)
+{
+	switch (filter_op_kind) {
+	case P4TC_FILTER_OP_LOGICAL:
+		switch (filter_op_value) {
+		case P4TC_FILTER_OP_LOGICAL_AND:
+		case P4TC_FILTER_OP_LOGICAL_OR:
+		case P4TC_FILTER_OP_LOGICAL_XOR:
+			return true;
+		default:
+			return false;
+		}
+	case P4TC_FILTER_OP_REL:
+		return false;
+	default:
+		return false;
+	}
+}
+
+static void p4tc_filter_opnd_destroy(struct p4tc_filter_opnd *opnd)
+{
+	switch (opnd->opnd_kind) {
+	case P4TC_FILTER_OPND_KIND_ACT:
+		p4tc_action_put_ref(opnd->act.val);
+		break;
+	case P4TC_FILTER_OPND_KIND_ACT_PARAM:
+		p4a_runt_parm_destroy(opnd->act.param);
+		break;
+	default:
+		break;
+	}
+
+	kfree(opnd);
+}
+
+static void
+p4tc_filter_oper_destroy(struct p4tc_filter_oper *operation);
+
+static void p4tc_filter_node_destroy(struct p4tc_filter_node *node)
+{
+	if (!node)
+		return;
+
+	if (node->ntype == P4TC_FILTER_NODE_LEAF)
+		p4tc_filter_opnd_destroy(node->opnd);
+	else
+		p4tc_filter_oper_destroy(node->operation);
+	kfree(node);
+}
+
+static void p4tc_filter_oper_destroy(struct p4tc_filter_oper *operation)
+{
+	p4tc_filter_node_destroy(operation->node1);
+	p4tc_filter_node_destroy(operation->node2);
+	kfree(operation);
+}
+
+void p4tc_filter_destroy(struct p4tc_filter *filter)
+{
+	if (filter)
+		p4tc_filter_oper_destroy(filter->operation);
+	kfree(filter);
+}
+
+static void p4tc_filter_opnd_prio_build(struct p4tc_filter_opnd *filter_opnd,
+					struct nlattr *nla)
+{
+	filter_opnd->opnd_kind = P4TC_FILTER_OPND_KIND_PRIO;
+	filter_opnd->prio = nla_get_u32(nla);
+}
+
+static void
+p4tc_filter_opnd_msecs_since_build(struct p4tc_filter_opnd *filter_opnd,
+				   struct nlattr *nla)
+{
+	filter_opnd->opnd_kind = P4TC_FILTER_OPND_KIND_MSECS;
+	filter_opnd->msecs_since = nla_get_u32(nla);
+}
+
+static int
+p4tc_filter_opnd_act_build(struct p4tc_pipeline *pipeline, struct nlattr *nla,
+			   struct p4tc_filter_opnd *filter_opnd,
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_FILTER_OPND_MAX + 1];
+	struct p4tc_act_param *param = NULL;
+	struct p4tc_act *act;
+	char *act_name;
+	u32 act_id;
+	int ret;
+
+	if (!nla)
+		return 0;
+
+	ret = nla_parse_nested(tb, P4TC_FILTER_OPND_ACT_MAX, nla,
+			       p4tc_entry_filter_act_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	act_id = tb[P4TC_FILTER_OPND_ACT_ID] ?
+		nla_get_u32(tb[P4TC_FILTER_OPND_ACT_ID]) : 0;
+
+	act_name = tb[P4TC_FILTER_OPND_ACT_NAME] ?
+		nla_data(tb[P4TC_FILTER_OPND_ACT_NAME]) : NULL;
+
+	act = p4a_tmpl_get(pipeline, act_name, act_id, extack);
+	if (IS_ERR(act))
+		return PTR_ERR(act);
+
+	if (tb[P4TC_FILTER_OPND_ACT_PARAMS]) {
+		/* Don't call p4a_runt_parm_alloc because we needn't allocate
+		 * params_array for the filter_opnd.
+		 */
+		param = p4a_runt_parm_init(pipeline->net, act,
+					   tb[P4TC_FILTER_OPND_ACT_PARAMS],
+					   extack);
+		if (IS_ERR(param)) {
+			ret = PTR_ERR(param);
+			goto params_destroy;
+		}
+
+		filter_opnd->act.param = param;
+		filter_opnd->opnd_kind = P4TC_FILTER_OPND_KIND_ACT_PARAM;
+	} else {
+		filter_opnd->opnd_kind = P4TC_FILTER_OPND_KIND_ACT;
+	}
+
+	filter_opnd->act.val = act;
+
+	return 0;
+
+params_destroy:
+	p4a_runt_parm_destroy(param);
+
+	p4tc_action_put_ref(act);
+
+	return ret;
+}
+
+static int
+p4tc_filter_opnd_entry_key_build(struct nlattr **tb, struct p4tc_table *table,
+				 struct p4tc_filter_opnd *filter_opnd,
+				 struct netlink_ext_ack *extack)
+{
+	u32 maskblob_len;
+	u32 keysz;
+
+	keysz = nla_len(tb[P4TC_FILTER_OPND_ENTRY_KEY_BLOB]);
+	if (keysz != BITS_TO_BYTES(table->tbl_keysz)) {
+		NL_SET_ERR_MSG(extack,
+			       "Filter key size and table key size differ");
+		return -EINVAL;
+	}
+
+	nla_memcpy(filter_opnd->entry_key.val,
+		   tb[P4TC_FILTER_OPND_ENTRY_KEY_BLOB], keysz);
+
+	maskblob_len =
+		nla_len(tb[P4TC_FILTER_OPND_ENTRY_MASK_BLOB]);
+	if (keysz != maskblob_len) {
+		NL_SET_ERR_MSG(extack,
+			       "Key and mask blob must have the same length");
+		return -EINVAL;
+	}
+
+	nla_memcpy(filter_opnd->entry_key.mask,
+		   tb[P4TC_FILTER_OPND_ENTRY_MASK_BLOB], keysz);
+	p4tc_tbl_entry_mask_key(filter_opnd->entry_key.val,
+				filter_opnd->entry_key.val,
+				filter_opnd->entry_key.mask, keysz);
+
+	filter_opnd->opnd_kind = P4TC_FILTER_OPND_KIND_ENTRY_KEY;
+
+	return 0;
+}
+
+static struct p4tc_filter_opnd *
+p4tc_filter_opnd_build(struct p4tc_pipeline *pipeline,
+		       struct p4tc_table *table, struct nlattr *nla,
+		       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_FILTER_OPND_MAX + 1];
+	struct p4tc_filter_opnd *filter_opnd;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_FILTER_OPND_MAX, nla,
+			       p4tc_entry_filter_opnd_policy, extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	if (!p4tc_filter_msg_valid(tb, extack))
+		return ERR_PTR(-EINVAL);
+
+	filter_opnd = kzalloc(sizeof(*filter_opnd), GFP_KERNEL);
+	if (!filter_opnd)
+		return ERR_PTR(-ENOMEM);
+
+	if (tb[P4TC_FILTER_OPND_PRIO]) {
+		p4tc_filter_opnd_prio_build(filter_opnd,
+					    tb[P4TC_FILTER_OPND_PRIO]);
+	} else if (tb[P4TC_FILTER_OPND_TIME_DELTA]) {
+		struct nlattr *msecs_attr = tb[P4TC_FILTER_OPND_TIME_DELTA];
+
+		p4tc_filter_opnd_msecs_since_build(filter_opnd, msecs_attr);
+	} else if (tb[P4TC_FILTER_OPND_ACT]) {
+		ret = p4tc_filter_opnd_act_build(pipeline,
+						 tb[P4TC_FILTER_OPND_ACT],
+						 filter_opnd, extack);
+		if (ret < 0)
+			goto free_filter_opnd;
+	} else if (tb[P4TC_FILTER_OPND_ENTRY_KEY_BLOB]) {
+		ret = p4tc_filter_opnd_entry_key_build(tb, table, filter_opnd,
+						       extack);
+		if (ret < 0)
+			goto free_filter_opnd;
+	} else {
+		ret = -EINVAL;
+		goto free_filter_opnd;
+	}
+
+	return filter_opnd;
+
+free_filter_opnd:
+	kfree(filter_opnd);
+	return ERR_PTR(ret);
+}
+
+static bool p4tc_filter_oper_rel_opnd_is_comp(struct p4tc_filter_opnd *opnd1,
+					      struct netlink_ext_ack *extack)
+{
+	switch (opnd1->opnd_kind) {
+	case P4TC_FILTER_OPND_KIND_ENTRY_KEY:
+		NL_SET_ERR_MSG(extack,
+			       "Compare with key operand isn't allowed");
+		return false;
+	case P4TC_FILTER_OPND_KIND_ACT:
+		NL_SET_ERR_MSG(extack,
+			       "Compare with act operand is forbidden");
+		return false;
+	case P4TC_FILTER_OPND_KIND_ACT_PARAM: {
+		struct p4tc_act_param *param;
+
+		param = opnd1->act.param;
+		if (!p4tc_is_type_numeric(param->type->typeid)) {
+			NL_SET_ERR_MSG(extack,
+				       "May only compare numeric act parameters");
+			return false;
+		}
+		return true;
+	}
+	default:
+		return true;
+	}
+}
+
+static bool p4tc_filter_oper_rel_is_valid(struct p4tc_filter_oper *filter_oper,
+					  struct netlink_ext_ack *extack)
+{
+	struct p4tc_filter_node *filter_node1 = filter_oper->node1;
+	struct p4tc_filter_opnd *opnd = filter_node1->opnd;
+
+	switch (filter_oper->op_value) {
+	case P4TC_FILTER_OP_REL_EQ:
+	case P4TC_FILTER_OP_REL_NEQ:
+		return true;
+	case P4TC_FILTER_OP_REL_LT:
+	case P4TC_FILTER_OP_REL_GT:
+	case P4TC_FILTER_OP_REL_LE:
+	case P4TC_FILTER_OP_REL_GE:
+		return p4tc_filter_oper_rel_opnd_is_comp(opnd, extack);
+	default:
+		/* Will never happen */
+		return false;
+	}
+}
+
+static bool p4tc_filter_oper_is_valid(struct p4tc_filter_oper *filter_oper,
+				      struct netlink_ext_ack *extack)
+{
+	switch (filter_oper->op_kind) {
+	case P4TC_FILTER_OP_LOGICAL:
+		return true;
+	case P4TC_FILTER_OP_REL:
+		return p4tc_filter_oper_rel_is_valid(filter_oper, extack);
+	default:
+		/* Will never happen */
+		return false;
+	}
+}
+
+static struct p4tc_filter_oper *
+p4tc_filter_oper_build(struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		       struct nlattr *nla, u32 depth,
+		       struct netlink_ext_ack *extack);
+
+static struct p4tc_filter_node *
+p4tc_filter_node_build(struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		       struct nlattr *nla, u32 depth,
+		       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_FILTER_NODE_MAX + 1];
+	struct p4tc_filter_oper *operation;
+	struct p4tc_filter_node *node;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_FILTER_NODE_MAX, nla,
+			       p4tc_entry_filter_node_policy, extack);
+	if (ret < 0)
+		return ERR_PTR(-EINVAL);
+
+	if ((!tb[P4TC_FILTER_NODE_PARENT] && !tb[P4TC_FILTER_NODE_LEAF]) ||
+	    (tb[P4TC_FILTER_NODE_PARENT] && tb[P4TC_FILTER_NODE_LEAF])) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify either P4TC_FILTER_NODE_PARENT or P4TC_FILTER_NODE_LEAF");
+		return ERR_PTR(-EINVAL);
+	}
+
+	node = kzalloc(sizeof(*node), GFP_KERNEL);
+	if (!node)
+		return ERR_PTR(-ENOMEM);
+
+	if (tb[P4TC_FILTER_NODE_LEAF]) {
+		struct p4tc_filter_opnd *opnd;
+
+		opnd = p4tc_filter_opnd_build(pipeline, table,
+					      tb[P4TC_FILTER_NODE_LEAF],
+					      extack);
+		if (IS_ERR(opnd)) {
+			ret = PTR_ERR(opnd);
+			goto free_node;
+		}
+		node->ntype = P4TC_FILTER_NODE_LEAF;
+		node->opnd = opnd;
+
+		return node;
+	}
+
+	if (depth == P4TC_FILTER_DEPTH_LIMIT) {
+		NL_SET_ERR_MSG_FMT(extack, "Recursion limit (%d) exceeded",
+				   P4TC_FILTER_DEPTH_LIMIT);
+		ret = -EINVAL;
+		goto free_node;
+	}
+
+	operation = p4tc_filter_oper_build(pipeline, table,
+					   tb[P4TC_FILTER_NODE_PARENT],
+					   depth + 1, extack);
+	if (IS_ERR(operation)) {
+		ret = PTR_ERR(operation);
+		goto free_node;
+	}
+	node->ntype = P4TC_FILTER_NODE_PARENT;
+	node->operation = operation;
+
+	return node;
+
+free_node:
+	kfree(node);
+	return ERR_PTR(ret);
+}
+
+static struct p4tc_filter_oper *
+p4tc_filter_oper_build(struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		       struct nlattr *nla, u32 depth,
+		       struct netlink_ext_ack *extack)
+{
+	struct p4tc_filter_node *filter_node2 = NULL;
+	struct p4tc_filter_node *filter_node1;
+	struct nlattr *tb[P4TC_FILTER_MAX + 1];
+	struct p4tc_filter_oper *filter_oper;
+	u16 filter_op_value;
+	u16 filter_op_kind;
+	int ret;
+
+	if (!nla)
+		return ERR_PTR(-EINVAL);
+
+	ret = nla_parse_nested(tb, P4TC_FILTER_MAX, nla,
+			       p4tc_entry_filter_policy, extack);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	if (!tb[P4TC_FILTER_OP_KIND] || !tb[P4TC_FILTER_OP_VALUE]) {
+		NL_SET_ERR_MSG(extack, "Must specify filter op kind and value");
+		return ERR_PTR(-EINVAL);
+	}
+
+	filter_op_kind = nla_get_u16(tb[P4TC_FILTER_OP_KIND]);
+	filter_op_value = nla_get_u16(tb[P4TC_FILTER_OP_VALUE]);
+
+	/* filter_op_kind is checked by netlink policy */
+	if (!p4tc_filter_op_value_valid(filter_op_kind, filter_op_value,
+					extack))
+		return ERR_PTR(-EINVAL);
+
+	if (!tb[P4TC_FILTER_NODE1]) {
+		NL_SET_ERR_MSG_FMT(extack, "Must specify filter node1");
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (p4tc_filter_op_requires_node2(filter_op_kind, filter_op_value)) {
+		if (!tb[P4TC_FILTER_NODE2]) {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify filter node2");
+			return ERR_PTR(-EINVAL);
+		}
+	}
+
+	filter_oper = kzalloc(sizeof(*filter_oper), GFP_KERNEL);
+	if (!filter_oper)
+		return ERR_PTR(-ENOMEM);
+
+	filter_node1 = p4tc_filter_node_build(pipeline, table,
+					      tb[P4TC_FILTER_NODE1],
+					      depth, extack);
+	if (IS_ERR(filter_node1)) {
+		ret = PTR_ERR(filter_node1);
+		goto free_operation;
+	}
+
+	if (tb[P4TC_FILTER_NODE2]) {
+		filter_node2 = p4tc_filter_node_build(pipeline, table,
+						      tb[P4TC_FILTER_NODE2],
+						      depth, extack);
+		if (IS_ERR(filter_node2)) {
+			ret = PTR_ERR(filter_node2);
+			goto free_node1;
+		}
+	}
+
+	filter_oper->op_kind = filter_op_kind;
+	filter_oper->op_value = filter_op_value;
+	filter_oper->node1 = filter_node1;
+	filter_oper->node2 = filter_node2;
+
+	if (!p4tc_filter_oper_is_valid(filter_oper, extack)) {
+		ret = -EINVAL;
+		goto free_node2;
+	}
+
+	return filter_oper;
+
+free_node2:
+	p4tc_filter_node_destroy(filter_node2);
+
+free_node1:
+	p4tc_filter_node_destroy(filter_node1);
+
+free_operation:
+	kfree(filter_oper);
+
+	return ERR_PTR(ret);
+}
+
+struct p4tc_filter *
+p4tc_filter_build(struct p4tc_pipeline *pipeline, struct p4tc_table *table,
+		  struct nlattr *nla, struct netlink_ext_ack *extack)
+{
+	struct p4tc_filter_oper *filter_oper;
+	struct p4tc_filter *filter;
+
+	if (!nla)
+		return NULL;
+
+	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
+	if (!filter)
+		return ERR_PTR(-ENOMEM);
+
+	filter_oper = p4tc_filter_oper_build(pipeline, table, nla, 0, extack);
+	if (IS_ERR(filter_oper)) {
+		kfree(filter);
+		return (struct p4tc_filter *)filter_oper;
+	}
+
+	filter->operation = filter_oper;
+
+	return filter;
+}
+
+static int
+p4tc_filter_act_param(struct p4tc_act_param *entry_act_param,
+		      struct p4tc_act_param *filter_act_param)
+{
+	return p4t_cmp(NULL, entry_act_param->type, entry_act_param->value,
+		       NULL, filter_act_param->type, filter_act_param->value);
+}
+
+static bool p4tc_filter_cmp_op(u16 op_value, int cmp)
+{
+	switch (op_value) {
+	case P4TC_FILTER_OP_REL_EQ:
+		return !cmp;
+	case P4TC_FILTER_OP_REL_NEQ:
+		return !!cmp;
+	case P4TC_FILTER_OP_REL_LT:
+		return cmp < 0;
+	case P4TC_FILTER_OP_REL_GT:
+		return cmp > 0;
+	case P4TC_FILTER_OP_REL_LE:
+		return cmp <= 0;
+	case P4TC_FILTER_OP_REL_GE:
+		return cmp >= 0;
+	default:
+		return false;
+	}
+}
+
+static bool
+p4tc_filter_act_params(struct p4tc_filter_oper *filter_oper,
+		       struct tcf_p4act_params *entry_act_params,
+		       struct p4tc_act_param *filter_act_param)
+{
+	struct idr *entry_act_params_idr = &entry_act_params->params_idr;
+	struct p4tc_act_param *entry_act_param;
+	int cmp;
+
+	entry_act_param = p4a_parm_find_byid(entry_act_params_idr,
+					     filter_act_param->id);
+	if (!entry_act_param)
+		return false;
+
+	cmp = p4tc_filter_act_param(entry_act_param,
+				    filter_act_param);
+	return p4tc_filter_cmp_op(filter_oper->op_value, cmp);
+}
+
+static bool
+p4tc_filter_exec_act(struct p4tc_filter_oper *filter_oper,
+		     struct p4tc_table_entry_value *value,
+		     struct p4tc_filter_opnd *filter_opnd)
+{
+	struct tcf_p4act *p4act;
+
+	if (!filter_opnd)
+		return true;
+
+	if (!value->acts[0])
+		return false;
+
+	p4act = to_p4act(value->acts[0]);
+	if (filter_opnd->act.val->a_id != p4act->act_id)
+		return false;
+
+	if (filter_opnd->opnd_kind == P4TC_FILTER_OPND_KIND_ACT_PARAM) {
+		struct tcf_p4act_params *params;
+
+		params = rcu_dereference(p4act->params);
+		return p4tc_filter_act_params(filter_oper, params,
+					      filter_opnd->act.param);
+	}
+
+	return true;
+}
+
+static bool
+p4tc_filter_exec_opnd(struct p4tc_filter_oper *filter_oper,
+		      struct p4tc_table_entry *entry,
+		      struct p4tc_filter_opnd *filter_opnd)
+{
+	switch (filter_opnd->opnd_kind) {
+	case P4TC_FILTER_OPND_KIND_ENTRY_KEY: {
+		u8 key[BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = {0};
+		u32 keysz;
+		int cmp;
+
+		keysz = BITS_TO_BYTES(entry->key.keysz);
+		p4tc_tbl_entry_mask_key(key, entry->key.fa_key,
+					filter_opnd->entry_key.mask, keysz);
+
+		cmp = memcmp(key, filter_opnd->entry_key.val, keysz);
+		return p4tc_filter_cmp_op(filter_oper->op_value, cmp);
+	}
+	case P4TC_FILTER_OPND_KIND_ACT:
+	case P4TC_FILTER_OPND_KIND_ACT_PARAM:
+		return p4tc_filter_exec_act(filter_oper,
+					     p4tc_table_entry_value(entry),
+					     filter_opnd);
+	case P4TC_FILTER_OPND_KIND_PRIO: {
+		struct p4tc_table_entry_value *value;
+
+		value = p4tc_table_entry_value(entry);
+		switch (filter_oper->op_value) {
+		case P4TC_FILTER_OP_REL_EQ:
+			return value->prio == filter_opnd->prio;
+		case P4TC_FILTER_OP_REL_NEQ:
+			return value->prio != filter_opnd->prio;
+		case P4TC_FILTER_OP_REL_LT:
+			return value->prio < filter_opnd->prio;
+		case P4TC_FILTER_OP_REL_GT:
+			return value->prio > filter_opnd->prio;
+		case P4TC_FILTER_OP_REL_LE:
+			return value->prio <= filter_opnd->prio;
+		case P4TC_FILTER_OP_REL_GE:
+			return value->prio >= filter_opnd->prio;
+		default:
+			return false;
+		}
+	}
+	case P4TC_FILTER_OPND_KIND_MSECS: {
+		struct p4tc_table_entry_value *value;
+		unsigned long jiffy_since;
+		unsigned long last_used;
+
+		jiffy_since = jiffies -
+			msecs_to_jiffies(filter_opnd->msecs_since);
+
+		value = p4tc_table_entry_value(entry);
+		rcu_read_lock();
+		last_used = rcu_dereference(value->tm)->lastused;
+		rcu_read_unlock();
+
+		switch (filter_oper->op_value) {
+		case P4TC_FILTER_OP_REL_EQ:
+			return jiffy_since == last_used;
+		case P4TC_FILTER_OP_REL_NEQ:
+			return jiffy_since != last_used;
+		case P4TC_FILTER_OP_REL_LT:
+			return time_before(jiffy_since, last_used);
+		case P4TC_FILTER_OP_REL_GT:
+			return time_after(jiffy_since, last_used);
+		case P4TC_FILTER_OP_REL_LE:
+			return time_before_eq(jiffy_since, last_used);
+		case P4TC_FILTER_OP_REL_GE:
+			return time_after_eq(jiffy_since, last_used);
+		default:
+			/* Will never happen */
+			return false;
+		}
+	}
+	default:
+		return false;
+	}
+}
+
+static bool p4tc_filter_exec_oper(struct p4tc_filter_oper *filter_oper,
+				  struct p4tc_table_entry *entry);
+
+static bool p4tc_filter_exec_node(struct p4tc_filter_oper *filter_oper,
+				  struct p4tc_table_entry *entry,
+				  struct p4tc_filter_node *node)
+{
+	if (node->ntype == P4TC_FILTER_NODE_PARENT)
+		return p4tc_filter_exec_oper(node->operation, entry);
+
+	return p4tc_filter_exec_opnd(filter_oper, entry, node->opnd);
+}
+
+static bool
+p4tc_filter_exec_oper_logical(struct p4tc_filter_oper *filter_oper,
+			      struct p4tc_table_entry *entry)
+{
+	bool ret;
+
+	ret = p4tc_filter_exec_node(filter_oper, entry, filter_oper->node1);
+
+	switch (filter_oper->op_value) {
+	case P4TC_FILTER_OP_LOGICAL_AND:
+		return ret && p4tc_filter_exec_node(filter_oper, entry,
+						    filter_oper->node2);
+	case P4TC_FILTER_OP_LOGICAL_OR:
+		return ret || p4tc_filter_exec_node(filter_oper, entry,
+						    filter_oper->node2);
+	case P4TC_FILTER_OP_LOGICAL_NOT:
+		return !ret;
+	case P4TC_FILTER_OP_LOGICAL_XOR:
+		return ret != p4tc_filter_exec_node(filter_oper, entry,
+						    filter_oper->node2);
+	default:
+		/* Never happens */
+		return false;
+	}
+}
+
+static bool
+p4tc_filter_exec_oper_rel(struct p4tc_filter_oper *filter_oper,
+			  struct p4tc_table_entry *entry)
+{
+	return p4tc_filter_exec_node(filter_oper, entry,
+				     filter_oper->node1);
+}
+
+static bool
+p4tc_filter_exec_oper(struct p4tc_filter_oper *filter_oper,
+		      struct p4tc_table_entry *entry)
+{
+	switch (filter_oper->op_kind) {
+	case P4TC_FILTER_OP_REL:
+		return p4tc_filter_exec_oper_rel(filter_oper, entry);
+	case P4TC_FILTER_OP_LOGICAL:
+		return p4tc_filter_exec_oper_logical(filter_oper, entry);
+	default:
+		return false;
+	}
+}
+
+bool p4tc_filter_exec(struct p4tc_filter *filter,
+		      struct p4tc_table_entry *entry)
+{
+	if (!filter)
+		return true;
+
+	return p4tc_filter_exec_oper(filter->operation, entry);
+}
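As an aside for readers following the control flow: p4tc_filter_exec_oper() and p4tc_filter_exec_node() above form a small mutually recursive tree walk, where a node is either a nested operation or a leaf operand comparison. A minimal userspace C sketch of the same shape is below; all type, enum, and function names here are illustrative stand-ins, not the kernel's, and leaf comparisons are precomputed booleans rather than real operand matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model: an operation combines one or two child nodes;
 * a child is either another operation or a leaf result.
 */
enum op_kind { OP_REL, OP_LOGICAL };
enum op_value { REL_EQ, REL_NEQ, LOG_AND, LOG_OR, LOG_NOT, LOG_XOR };

struct node;

struct oper {
	enum op_kind kind;
	enum op_value value;
	struct node *node1;
	struct node *node2;	/* unused for NOT and relational ops */
};

struct node {
	struct oper *oper;	/* set for nested-operation nodes */
	bool leaf_result;	/* precomputed match result for leaves */
};

static bool exec_oper(const struct oper *op);

static bool exec_node(const struct node *n)
{
	return n->oper ? exec_oper(n->oper) : n->leaf_result;
}

static bool exec_oper(const struct oper *op)
{
	bool r1 = exec_node(op->node1);

	/* Relational ops are fully evaluated at the leaf */
	if (op->kind == OP_REL)
		return r1;

	switch (op->value) {
	case LOG_AND: return r1 && exec_node(op->node2);
	case LOG_OR:  return r1 || exec_node(op->node2);
	case LOG_NOT: return !r1;
	case LOG_XOR: return r1 != exec_node(op->node2);
	default:      return false;
	}
}
```

Like the kernel code, AND and OR short-circuit on the first child's result, so node2 is only evaluated when it can change the outcome.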
diff --git a/net/sched/p4tc/p4tc_runtime_api.c b/net/sched/p4tc/p4tc_runtime_api.c
index d80103d36..44239cb22 100644
--- a/net/sched/p4tc/p4tc_runtime_api.c
+++ b/net/sched/p4tc/p4tc_runtime_api.c
@@ -56,6 +56,21 @@ static int tc_ctl_p4_root(struct sk_buff *skb, struct nlmsghdr *n, int cmd,
 	}
 }
 
+static int tc_ctl_p4_get(struct sk_buff *skb, struct nlmsghdr *n,
+			 struct netlink_ext_ack *extack)
+{
+	return tc_ctl_p4_root(skb, n, RTM_P4TC_GET, extack);
+}
+
+static int tc_ctl_p4_delete(struct sk_buff *skb, struct nlmsghdr *n,
+			    struct netlink_ext_ack *extack)
+{
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	return tc_ctl_p4_root(skb, n, RTM_P4TC_DEL, extack);
+}
+
 static int tc_ctl_p4_cu(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
@@ -69,12 +84,60 @@ static int tc_ctl_p4_cu(struct sk_buff *skb, struct nlmsghdr *n,
 	return ret;
 }
 
+static int tc_ctl_p4_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct nlattr *tb[P4TC_ROOT_MAX + 1];
+	char *p_name = NULL;
+	struct p4tcmsg *t;
+	int ret = 0;
+
+	/* Dump is always called with the nlk->cb_mutex held.
+	 * In rtnl this mutex is set to rtnl_lock, which makes dumps,
+	 * even for table entries, serialize over the rtnl_lock.
+	 *
+	 * For table entries, this guarantees the net namespace is alive.
+	 * For externs, we don't need to hold the rtnl_lock.
+	 */
+	ASSERT_RTNL();
+
+	ret = nlmsg_parse(cb->nlh, sizeof(struct p4tcmsg), tb, P4TC_ROOT_MAX,
+			  p4tc_root_policy, cb->extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(cb->extack, NULL, tb, P4TC_ROOT)) {
+		NL_SET_ERR_MSG(cb->extack,
+			       "Netlink P4TC Runtime attributes missing");
+		return -EINVAL;
+	}
+
+	if (tb[P4TC_ROOT_PNAME])
+		p_name = nla_data(tb[P4TC_ROOT_PNAME]);
+
+	t = nlmsg_data(cb->nlh);
+
+	switch (t->obj) {
+	case P4TC_OBJ_RUNTIME_TABLE:
+		return p4tc_tbl_entry_dumpit(sock_net(skb->sk), skb, cb,
+					     tb[P4TC_ROOT], p_name);
+	default:
+		NL_SET_ERR_MSG_FMT(cb->extack,
+				   "Unknown p4 runtime object type %u",
+				   t->obj);
+		return -ENOENT;
+	}
+}
+
 static int __init p4tc_tbl_init(void)
 {
 	rtnl_register(PF_UNSPEC, RTM_P4TC_CREATE, tc_ctl_p4_cu, NULL,
 		      RTNL_FLAG_DOIT_UNLOCKED);
 	rtnl_register(PF_UNSPEC, RTM_P4TC_UPDATE, tc_ctl_p4_cu, NULL,
 		      RTNL_FLAG_DOIT_UNLOCKED);
+	rtnl_register(PF_UNSPEC, RTM_P4TC_DEL, tc_ctl_p4_delete, NULL,
+		      RTNL_FLAG_DOIT_UNLOCKED);
+	rtnl_register(PF_UNSPEC, RTM_P4TC_GET, tc_ctl_p4_get, tc_ctl_p4_dump,
+		      RTNL_FLAG_DOIT_UNLOCKED);
 
 	return 0;
 }
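The doit/dumpit handlers registered above all rely on nlmsg_parse() splitting the message into a tb[] array indexed by attribute type. A simplified userspace model of that TLV walk is sketched below; it is illustrative only (the real nla_* code additionally enforces policies, nesting, and strict validation), and all names here are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified netlink-style attribute: a 4-byte header (u16 len
 * including the header, u16 type) followed by the payload, with
 * each attribute padded to a 4-byte boundary.
 */
struct tlv {
	uint16_t len;
	uint16_t type;
};

#define TLV_ALIGN(x) (((x) + 3u) & ~3u)

/* Fill tb[type] with a pointer to each attribute's payload.
 * Unknown (out-of-range) types are skipped, as in non-strict parsing.
 */
static int tlv_parse(const uint8_t *buf, size_t buflen,
		     const uint8_t *tb[], uint16_t maxtype)
{
	size_t off = 0;

	memset(tb, 0, sizeof(tb[0]) * (maxtype + 1));
	while (off + sizeof(struct tlv) <= buflen) {
		struct tlv h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || off + h.len > buflen)
			return -1;	/* malformed attribute */
		if (h.type <= maxtype)
			tb[h.type] = buf + off + sizeof(h);
		off += TLV_ALIGN(h.len);
	}
	return 0;
}
```

After a successful parse, handlers test `tb[TYPE]` for presence exactly as the code above tests `tb[P4TC_ROOT]` or `tb[P4TC_ROOT_PNAME]`.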
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index 4bfff14bd..e1b2beed2 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -106,6 +106,11 @@ int p4tc_table_try_set_state_ready(struct p4tc_pipeline *pipeline,
 	return ret;
 }
 
+static const struct netlink_range_validation aging_range = {
+	.min = 1,
+	.max = P4TC_MAX_T_AGING_MS,
+};
+
 static const struct netlink_range_validation keysz_range = {
 	.min = 1,
 	.max = P4TC_MAX_KEYSZ,
@@ -290,7 +295,7 @@ static int _p4tc_table_fill_nlmsg(struct sk_buff *skb, struct p4tc_table *table)
 
 		entry_nest = nla_nest_start(skb, P4TC_TABLE_ENTRY);
 		if (p4tc_tbl_entry_fill(skb, table, table->tbl_entry,
-					table->tbl_id) < 0)
+					table->tbl_id, P4TC_ENTITY_UNSPEC) < 0)
 			goto out_nlmsg_trim;
 
 		nla_nest_end(skb, entry_nest);
@@ -363,6 +368,112 @@ static void p4tc_table_timer_profiles_destroy(struct p4tc_table *table)
 	mutex_unlock(&table->tbl_profiles_xa_lock);
 }
 
+static const struct nla_policy
+p4tc_timer_profile_policy[P4TC_TIMER_PROFILE_MAX + 1] = {
+	[P4TC_TIMER_PROFILE_ID] =
+		NLA_POLICY_RANGE(NLA_U32, 0, P4TC_MAX_NUM_TIMER_PROFILES),
+	[P4TC_TIMER_PROFILE_AGING] =
+		NLA_POLICY_FULL_RANGE(NLA_U64, &aging_range),
+};
+
+struct p4tc_table_timer_profile *
+p4tc_table_timer_profile_find_byaging(struct p4tc_table *table, u64 aging_ms)
+__must_hold(RCU)
+{
+	struct p4tc_table_timer_profile *timer_profile;
+	unsigned long profile_id;
+
+	xa_for_each(&table->tbl_profiles_xa, profile_id, timer_profile) {
+		if (timer_profile->aging_ms == aging_ms)
+			return timer_profile;
+	}
+
+	return NULL;
+}
+
+struct p4tc_table_timer_profile *
+p4tc_table_timer_profile_find(struct p4tc_table *table, u32 profile_id)
+__must_hold(RCU)
+{
+	return xa_load(&table->tbl_profiles_xa, profile_id);
+}
+
+/* This function will be exercised via a runtime command.
+ * Note that two profile IDs can't have the same aging value.
+ */
+int p4tc_table_timer_profile_update(struct p4tc_table *table,
+				    struct nlattr *nla,
+				    struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_timer_profile *old_timer_profile;
+	struct p4tc_table_timer_profile *timer_profile;
+	struct nlattr *tb[P4TC_TIMER_PROFILE_MAX + 1];
+	u32 profile_id;
+	u64 aging_ms;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_TIMER_PROFILE_MAX, nla,
+			       p4tc_timer_profile_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[P4TC_TIMER_PROFILE_ID]) {
+		NL_SET_ERR_MSG(extack, "Must specify table profile ID");
+		return -EINVAL;
+	}
+	profile_id = nla_get_u32(tb[P4TC_TIMER_PROFILE_ID]);
+
+	if (!tb[P4TC_TIMER_PROFILE_AGING]) {
+		NL_SET_ERR_MSG(extack, "Must specify table profile aging");
+		return -EINVAL;
+	}
+	aging_ms = nla_get_u64(tb[P4TC_TIMER_PROFILE_AGING]);
+
+	rcu_read_lock();
+	timer_profile = p4tc_table_timer_profile_find_byaging(table,
+							      aging_ms);
+	if (timer_profile && timer_profile->profile_id != profile_id) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "Aging %llu was already specified by profile ID %u",
+				   aging_ms, timer_profile->profile_id);
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+	rcu_read_unlock();
+
+	timer_profile = kzalloc(sizeof(*timer_profile), GFP_KERNEL);
+	if (unlikely(!timer_profile))
+		return -ENOMEM;
+
+	timer_profile->profile_id = profile_id;
+	timer_profile->aging_ms = aging_ms;
+
+	mutex_lock(&table->tbl_profiles_xa_lock);
+	old_timer_profile = xa_load(&table->tbl_profiles_xa, profile_id);
+	if (!old_timer_profile) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "Unable to find timer profile with ID %u",
+				   profile_id);
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	old_timer_profile = xa_cmpxchg(&table->tbl_profiles_xa,
+				       timer_profile->profile_id,
+				       old_timer_profile,
+				       timer_profile, GFP_KERNEL);
+	kfree_rcu(old_timer_profile, rcu);
+	mutex_unlock(&table->tbl_profiles_xa_lock);
+
+	return 0;
+
+unlock:
+	mutex_unlock(&table->tbl_profiles_xa_lock);
+
+	kfree(timer_profile);
+	return ret;
+}
+
 /* From the template, the user may only specify the number of timer profiles
  * they want for the table. If this number is not specified during the table
  * creation command, the kernel will create 4 timer profiles:
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
index a2f9ab959..7a644eb40 100644
--- a/net/sched/p4tc/p4tc_tbl_entry.c
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -38,11 +38,21 @@
  * whether a delete is happening in parallel.
  */
 
+static int p4tc_tbl_entry_get(struct p4tc_table_entry_value *value)
+{
+	return refcount_inc_not_zero(&value->entries_ref);
+}
+
 static bool p4tc_tbl_entry_put(struct p4tc_table_entry_value *value)
 {
 	return refcount_dec_if_one(&value->entries_ref);
 }
 
+static bool p4tc_tbl_entry_put_ref(struct p4tc_table_entry_value *value)
+{
+	return refcount_dec_not_one(&value->entries_ref);
+}
+
 static u32 p4tc_entry_hash_fn(const void *data, u32 len, u32 seed)
 {
 	const struct p4tc_table_entry_key *key = data;
@@ -133,6 +143,15 @@ p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key,
 	return NULL;
 }
 
+void p4tc_tbl_entry_mask_key(u8 *masked_key, u8 *key, const u8 *mask,
+			     u32 masksz)
+{
+	int i;
+
+	for (i = 0; i < masksz; i++)
+		masked_key[i] = key[i] & mask[i];
+}
+
 #define p4tc_table_entry_mask_find_byid(table, id) \
 	(idr_find(&(table)->tbl_masks_idr, id))
 
@@ -201,7 +220,8 @@ static void p4tc_table_entry_tm_dump(struct p4tc_table_entry_tm *dtm,
 #define P4TC_ENTRY_MAX_IDS (P4TC_PATH_MAX - 1)
 
 int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
-			struct p4tc_table_entry *entry, u32 tbl_id)
+			struct p4tc_table_entry *entry, u32 tbl_id,
+			u16 who_deleted)
 {
 	unsigned char *b = nlmsg_get_pos(skb);
 	struct p4tc_table_entry_value *value;
@@ -248,11 +268,28 @@ int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
 			goto out_nlmsg_trim;
 	}
 
+	if (who_deleted) {
+		if (nla_put_u8(skb, P4TC_ENTRY_DELETE_WHODUNNIT,
+			       who_deleted))
+			goto out_nlmsg_trim;
+	}
+
 	p4tc_table_entry_tm_dump(&dtm, tm);
 	if (nla_put_64bit(skb, P4TC_ENTRY_TM, sizeof(dtm), &dtm,
 			  P4TC_ENTRY_PAD))
 		goto out_nlmsg_trim;
 
+	if (value->is_dyn) {
+		if (nla_put_u8(skb, P4TC_ENTRY_DYNAMIC, 1))
+			goto out_nlmsg_trim;
+
+		if (value->aging_ms) {
+			if (nla_put_u64_64bit(skb, P4TC_ENTRY_AGING,
+					      value->aging_ms, P4TC_ENTRY_PAD))
+				goto out_nlmsg_trim;
+		}
+	}
+
 	if (value->tmpl_created) {
 		if (nla_put_u8(skb, P4TC_ENTRY_TMPL_CREATED, 1))
 			goto out_nlmsg_trim;
@@ -267,6 +304,11 @@ int p4tc_tbl_entry_fill(struct sk_buff *skb, struct p4tc_table *table,
 	return ret;
 }
 
+static const struct netlink_range_validation range_aging = {
+	.min = 1,
+	.max = P4TC_MAX_T_AGING_MS,
+};
+
 static const struct nla_policy p4tc_entry_policy[P4TC_ENTRY_MAX + 1] = {
 	[P4TC_ENTRY_TBLNAME] = { .type = NLA_STRING },
 	[P4TC_ENTRY_KEY_BLOB] = { .type = NLA_BINARY },
@@ -281,6 +323,11 @@ static const struct nla_policy p4tc_entry_policy[P4TC_ENTRY_MAX + 1] = {
 	[P4TC_ENTRY_DELETE_WHODUNNIT] = { .type = NLA_U8 },
 	[P4TC_ENTRY_PERMISSIONS] = NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
 	[P4TC_ENTRY_TBL_ATTRS] = { .type = NLA_NESTED },
+	[P4TC_ENTRY_DYNAMIC] = NLA_POLICY_RANGE(NLA_U8, 1, 1),
+	[P4TC_ENTRY_AGING] = NLA_POLICY_FULL_RANGE(NLA_U64, &range_aging),
+	[P4TC_ENTRY_PROFILE_ID] =
+		NLA_POLICY_RANGE(NLA_U32, 0, P4TC_MAX_NUM_TIMER_PROFILES - 1),
+	[P4TC_ENTRY_FILTER] = { .type = NLA_NESTED },
 };
 
 static struct p4tc_table_entry_mask *
@@ -548,6 +595,7 @@ static int p4tc_tbl_entry_emit_event(struct p4tc_table_entry_work *entry_work,
 	struct p4tc_pipeline *pipeline = entry_work->pipeline;
 	struct p4tc_table_entry *entry = entry_work->entry;
 	struct p4tc_table *table = entry_work->table;
+	u16 who_deleted = entry_work->who_deleted;
 	struct net *net = pipeline->net;
 	struct sock *rtnl = net->rtnl;
 	struct nlmsghdr *nlh;
@@ -583,7 +631,8 @@ static int p4tc_tbl_entry_emit_event(struct p4tc_table_entry_work *entry_work,
 		goto free_skb;
 
 	nest = nla_nest_start(skb, 1);
-	if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id) < 0)
+	if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id,
+				who_deleted) < 0)
 		goto free_skb;
 	nla_nest_end(skb, nest);
 
@@ -628,6 +677,9 @@ static void p4tc_table_entry_del_work(struct work_struct *work)
 	if (entry_work->send_event && p4tc_ctrl_pub_ok(value->permissions))
 		p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_DEL, GFP_KERNEL);
 
+	if (value->is_dyn)
+		hrtimer_cancel(&value->entry_timer);
+
 	put_net(pipeline->net);
 	p4tc_pipeline_put_ref(pipeline);
 
@@ -650,6 +702,9 @@ static void p4tc_table_entry_put(struct p4tc_table_entry *entry, bool deferred)
 		get_net(pipeline->net); /* avoid action cleanup */
 		schedule_work(&entry_work->work);
 	} else {
+		if (value->is_dyn)
+			hrtimer_cancel(&value->entry_timer);
+
 		__p4tc_table_entry_put(entry);
 	}
 }
@@ -926,6 +981,388 @@ static void p4tc_table_entry_build_key(struct p4tc_table *table,
 		key->fa_key[i] &= mask->fa_value[i];
 }
 
+static int ___p4tc_table_entry_del(struct p4tc_pipeline *pipeline,
+				   struct p4tc_table *table,
+				   struct p4tc_table_entry *entry,
+				   bool from_control)
+__must_hold(RCU)
+{
+	u16 who_deleted = from_control ?
+		P4TC_ENTITY_UNSPEC : P4TC_ENTITY_KERNEL;
+	struct p4tc_table_entry_value *value = p4tc_table_entry_value(entry);
+
+	if (from_control) {
+		if (!p4tc_ctrl_delete_ok(value->permissions))
+			return -EPERM;
+	} else {
+		if (!p4tc_data_delete_ok(value->permissions))
+			return -EPERM;
+	}
+
+	if (p4tc_table_entry_destroy(table, entry, true, !from_control,
+				     who_deleted) < 0)
+		return -EBUSY;
+
+	return 0;
+}
+
+static int p4tc_table_entry_gd(struct net *net, struct sk_buff *skb,
+			       int cmd, u16 *permissions, struct nlattr *arg,
+			       struct p4tc_path_nlattrs *nl_path_attrs,
+			       struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_get_state table_get_state = { NULL };
+	struct p4tc_table_entry_mask *mask = NULL, *new_mask;
+	struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+	struct p4tc_table_entry *entry = NULL;
+	struct p4tc_pipeline *pipeline = NULL;
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry_key *key;
+	bool get = cmd == RTM_P4TC_GET;
+	u32 *ids = nl_path_attrs->ids;
+	bool has_listener = !!skb;
+	struct p4tc_table *table;
+	u16 who_deleted = 0;
+	bool del = !get;
+	u32 keysz_bytes;
+	u32 keysz_bits;
+	u32 prio;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg, p4tc_entry_policy,
+			       extack);
+	if (ret < 0)
+		return ret;
+
+	ret = p4tc_table_entry_get_table(net, cmd, &table_get_state, tb,
+					 nl_path_attrs, extack);
+	if (ret < 0)
+		return ret;
+
+	pipeline = table_get_state.pipeline;
+	table = table_get_state.table;
+
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT) {
+		prio = p4tc_table_entry_exact_prio();
+	} else {
+		if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_ENTRY_PRIO)) {
+			NL_SET_ERR_MSG(extack,
+				       "Must specify table entry priority");
+			ret = -EINVAL;
+			goto table_put;
+		}
+		prio = nla_get_u32(tb[P4TC_ENTRY_PRIO]);
+	}
+
+	keysz_bits = table->tbl_keysz;
+	keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+
+	key = kzalloc(struct_size(key, fa_key, keysz_bytes), GFP_KERNEL);
+	if (unlikely(!key)) {
+		NL_SET_ERR_MSG(extack, "Unable to allocate key");
+		ret = -ENOMEM;
+		goto table_put;
+	}
+
+	key->keysz = keysz_bits;
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+		mask = kzalloc(struct_size(mask, fa_value, keysz_bytes),
+			       GFP_KERNEL);
+		if (unlikely(!mask)) {
+			NL_SET_ERR_MSG(extack, "Failed to allocate mask");
+			ret = -ENOMEM;
+			goto free_key;
+		}
+		mask->sz = key->keysz;
+	}
+
+	ret = p4tc_table_entry_extract_key(table, tb, key, mask, extack);
+	if (unlikely(ret < 0)) {
+		if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+			kfree(mask);
+
+		goto free_key;
+	}
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT) {
+		new_mask = p4tc_table_entry_mask_find_byvalue(table, mask);
+		kfree(mask);
+		if (!new_mask) {
+			NL_SET_ERR_MSG(extack, "Unable to find entry mask");
+			ret = -ENOENT;
+			goto free_key;
+		}
+		mask = new_mask;
+	}
+
+	p4tc_table_entry_build_key(table, key, mask);
+
+	rcu_read_lock();
+	entry = p4tc_entry_lookup(table, key, prio);
+	if (!entry) {
+		NL_SET_ERR_MSG(extack, "Unable to find entry");
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	/* As we can run delete/update in parallel, we might get a
+	 * soon-to-be-purged entry from the lookup.
+	 */
+	value = p4tc_table_entry_value(entry);
+	if (get && !p4tc_tbl_entry_get(value)) {
+		NL_SET_ERR_MSG(extack, "Entry deleted in parallel");
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	if (del) {
+		if (tb[P4TC_ENTRY_WHODUNNIT])
+			who_deleted = nla_get_u8(tb[P4TC_ENTRY_WHODUNNIT]);
+	} else {
+		if (!p4tc_ctrl_read_ok(value->permissions)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to read table entry");
+			ret = -EPERM;
+			goto entry_put;
+		}
+
+		if (!p4tc_ctrl_pub_ok(value->permissions)) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to publish read entry");
+			ret = -EPERM;
+			goto entry_put;
+		}
+	}
+
+	if (has_listener) {
+		if (p4tc_tbl_entry_fill(skb, table, entry, table->tbl_id,
+					who_deleted) <= 0) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to fill table entry attributes");
+			ret = -EINVAL;
+			goto entry_put;
+		}
+		*permissions = value->permissions;
+	}
+
+	if (del) {
+		ret = ___p4tc_table_entry_del(pipeline, table, entry, true);
+		if (ret < 0) {
+			if (ret == -EBUSY)
+				NL_SET_ERR_MSG(extack,
+					       "Entry was deleted in parallel");
+			goto entry_put;
+		}
+
+		if (!has_listener)
+			goto out;
+	}
+
+	if (!ids[P4TC_PID_IDX])
+		ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+out:
+	ret = 0;
+
+entry_put:
+	if (get)
+		p4tc_tbl_entry_put_ref(value);
+
+unlock:
+	rcu_read_unlock();
+
+free_key:
+	kfree(key);
+
+table_put:
+	p4tc_table_entry_put_table(&table_get_state);
+
+	return ret;
+}
+
+static int p4tc_table_entry_flush(struct net *net, struct sk_buff *skb,
+				  struct nlattr *arg,
+				  struct p4tc_path_nlattrs *nl_path_attrs,
+				  struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_get_state table_get_state = { NULL };
+	struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+	u32 arg_ids[P4TC_PATH_MAX - 1];
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table_entry *entry;
+	u32 *ids = nl_path_attrs->ids;
+	struct rhashtable_iter iter;
+	struct p4tc_filter *filter;
+	bool has_listener = !!skb;
+	struct p4tc_table *table;
+	unsigned char *b;
+	int fails = 0;
+	int ret = 0;
+	int i = 0;
+
+	if (arg) {
+		ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg,
+				       p4tc_entry_policy, extack);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = p4tc_table_entry_get_table(net, RTM_P4TC_DEL, &table_get_state,
+					 tb, nl_path_attrs, extack);
+	if (ret < 0)
+		return ret;
+
+	if (has_listener) {
+		b = nlmsg_get_pos(skb);
+	} else {
+		if (tb[P4TC_ENTRY_FILTER]) {
+			NL_SET_ERR_MSG(extack,
+				       "Can't specify filter attributes without a listener");
+			ret = -EINVAL;
+			goto table_put;
+		}
+	}
+
+	pipeline = table_get_state.pipeline;
+	table = table_get_state.table;
+
+	if (!ids[P4TC_TBLID_IDX])
+		arg_ids[P4TC_TBLID_IDX - 1] = table->tbl_id;
+
+	filter = p4tc_filter_build(pipeline, table, tb[P4TC_ENTRY_FILTER],
+				   extack);
+	if (IS_ERR(filter)) {
+		ret = PTR_ERR(filter);
+		goto table_put;
+	}
+
+	if (has_listener && nla_put(skb, P4TC_PATH, sizeof(arg_ids), arg_ids)) {
+		ret = -ENOMEM;
+		goto filter_destroy;
+	}
+
+	/* There is an issue here regarding the stability of walking an
+	 * rhashtable. If an insert or a delete happens in parallel, we may see
+	 * duplicate entries or skip some valid entries. To solve this we are
+	 * going to have an auxiliary list that also stores the entries and will
+	 * be used for flushing, instead of walking over the rhashtable.
+	 */
+	rhltable_walk_enter(&table->tbl_entries, &iter);
+	do {
+		rhashtable_walk_start(&iter);
+
+		while ((entry = rhashtable_walk_next(&iter)) &&
+		       !IS_ERR(entry)) {
+			struct p4tc_table_entry_value *value =
+				p4tc_table_entry_value(entry);
+
+			if (!p4tc_ctrl_delete_ok(value->permissions)) {
+				ret = -EPERM;
+				fails++;
+				continue;
+			}
+
+			if (!p4tc_filter_exec(filter, entry))
+				continue;
+
+			ret = p4tc_table_entry_destroy(table, entry, true,
+						       false,
+						       P4TC_ENTITY_UNSPEC);
+			if (ret < 0) {
+				fails++;
+				continue;
+			}
+
+			i++;
+		}
+
+		rhashtable_walk_stop(&iter);
+	} while (entry == ERR_PTR(-EAGAIN));
+	rhashtable_walk_exit(&iter);
+
+	/* If another user creates a table entry in parallel with this flush,
+	 * we may not be able to flush all the entries. So the user should
+	 * dump the table after the flush to check whether any remain.
+	 */
+
+	if (has_listener) {
+		if (nla_put_u32(skb, P4TC_COUNT, i))
+			goto out_nlmsg_trim;
+	}
+
+	if (fails) {
+		if (i == 0) {
+			NL_SET_ERR_MSG(extack,
+				       "Unable to flush any entries");
+			goto out_nlmsg_trim;
+		} else {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Flushed %u table entries and %u failed",
+					   i, fails);
+		}
+	}
+
+	if (has_listener) {
+		if (!ids[P4TC_PID_IDX])
+			ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+		if (!nl_path_attrs->pname_passed)
+			strscpy(nl_path_attrs->pname, pipeline->common.name,
+				P4TC_PIPELINE_NAMSIZ);
+	}
+
+	ret = 0;
+	goto filter_destroy;
+
+out_nlmsg_trim:
+	if (has_listener)
+		nlmsg_trim(skb, b);
+
+filter_destroy:
+	p4tc_filter_destroy(filter);
+
+table_put:
+	p4tc_table_entry_put_table(&table_get_state);
+
+	return ret;
+}
+
+static enum hrtimer_restart entry_timer_handle(struct hrtimer *timer)
+{
+	struct p4tc_table_entry_value *value =
+		container_of(timer, struct p4tc_table_entry_value, entry_timer);
+	struct p4tc_table_entry_tm *tm;
+	struct p4tc_table_entry *entry;
+	u64 aging_ms = value->aging_ms;
+	struct p4tc_table *table;
+	u64 tdiff, lastused;
+
+	rcu_read_lock();
+	tm = rcu_dereference(value->tm);
+	lastused = tm->lastused;
+	rcu_read_unlock();
+
+	tdiff = jiffies64_to_msecs(get_jiffies_64() - lastused);
+
+	if (tdiff < aging_ms) {
+		hrtimer_forward_now(timer, ms_to_ktime(aging_ms));
+		return HRTIMER_RESTART;
+	}
+
+	entry = value->entry_work->entry;
+	table = value->entry_work->table;
+
+	p4tc_table_entry_destroy(table, entry, true,
+				 true, P4TC_ENTITY_TIMER);
+
+	return HRTIMER_NORESTART;
+}
+
 static struct p4tc_table_entry_tm *
 p4tc_table_entry_create_tm(const u16 whodunnit)
 {
@@ -1026,6 +1463,14 @@ __must_hold(RCU)
 		goto free_work;
 	}
 
+	if (value->is_dyn) {
+		hrtimer_init(&value->entry_timer, CLOCK_MONOTONIC,
+			     HRTIMER_MODE_REL);
+		value->entry_timer.function = &entry_timer_handle;
+		hrtimer_start(&value->entry_timer, ms_to_ktime(value->aging_ms),
+			      HRTIMER_MODE_REL);
+	}
+
 	if (!from_control && p4tc_ctrl_pub_ok(value->permissions))
 		p4tc_tbl_entry_emit_event(entry_work, RTM_P4TC_CREATE,
 					  GFP_ATOMIC);
@@ -1135,6 +1580,20 @@ __must_hold(RCU)
 	entry_work->table = table;
 	entry_work->entry = entry;
 	value->entry_work = entry_work;
+	if (!value->is_dyn)
+		value->is_dyn = value_old->is_dyn;
+
+	if (value->is_dyn) {
+		/* Only use old entry value if user didn't specify new one */
+		value->aging_ms = value->aging_ms ?: value_old->aging_ms;
+
+		hrtimer_init(&value->entry_timer, CLOCK_MONOTONIC,
+			     HRTIMER_MODE_REL);
+		value->entry_timer.function = &entry_timer_handle;
+
+		hrtimer_start(&value->entry_timer, ms_to_ktime(value->aging_ms),
+			      HRTIMER_MODE_REL);
+	}
 
 	INIT_WORK(&entry_work->work, p4tc_table_entry_del_work);
 
@@ -1209,6 +1668,7 @@ p4tc_table_attrs_policy[P4TC_ENTRY_TBL_ATTRS_MAX + 1] = {
 	[P4TC_ENTRY_TBL_ATTRS_DEFAULT_MISS] = { .type = NLA_NESTED },
 	[P4TC_ENTRY_TBL_ATTRS_PERMISSIONS] =
 		NLA_POLICY_MAX(NLA_U16, P4TC_MAX_PERMISSION),
+	[P4TC_ENTRY_TBL_ATTRS_TIMER_PROFILE] = { .type = NLA_NESTED },
 };
 
 static int p4tc_tbl_attrs_update(struct net *net, struct p4tc_table *table,
@@ -1249,11 +1709,23 @@ static int p4tc_tbl_attrs_update(struct net *net, struct p4tc_table *table,
 	if (err < 0)
 		goto free_tbl_perm;
 
+	if (tb[P4TC_ENTRY_TBL_ATTRS_TIMER_PROFILE]) {
+		struct nlattr *attr = tb[P4TC_ENTRY_TBL_ATTRS_TIMER_PROFILE];
+
+		err = p4tc_table_timer_profile_update(table, attr, extack);
+		if (err < 0)
+			goto default_acts_free;
+	}
+
 	p4tc_table_replace_default_acts(table, &dflt, true);
 	p4tc_table_replace_permissions(table, tbl_perm, true);
 
 	return 0;
 
+default_acts_free:
+	p4tc_table_defact_destroy(dflt.hitact);
+	p4tc_table_defact_destroy(dflt.missact);
+
 free_tbl_perm:
 	kfree(tbl_perm);
 	return err;
@@ -1397,6 +1869,75 @@ __p4tc_table_entry_cu(struct net *net, u8 cu_flags, struct nlattr **tb,
 		}
 	}
 
+	if (tb[P4TC_ENTRY_AGING] && tb[P4TC_ENTRY_PROFILE_ID]) {
+		NL_SET_ERR_MSG(extack,
+			       "Must specify either aging or profile ID, not both");
+		ret = -EINVAL;
+		goto free_acts;
+	}
+
+	if (!replace) {
+		if (tb[P4TC_ENTRY_AGING] && !tb[P4TC_ENTRY_DYNAMIC]) {
+			NL_SET_ERR_MSG(extack,
+				       "Aging may only be set alongside dynamic");
+			ret = -EINVAL;
+			goto free_acts;
+		}
+		if (tb[P4TC_ENTRY_PROFILE_ID] && !tb[P4TC_ENTRY_DYNAMIC]) {
+			NL_SET_ERR_MSG(extack,
+				       "Profile may only be set alongside dynamic");
+			ret = -EINVAL;
+			goto free_acts;
+		}
+	}
+
+	if (tb[P4TC_ENTRY_DYNAMIC])
+		value->is_dyn = true;
+
+	if (tb[P4TC_ENTRY_AGING]) {
+		u64 aging_ms = nla_get_u64(tb[P4TC_ENTRY_AGING]);
+		struct p4tc_table_timer_profile *timer_profile;
+
+		/* The aging value specified in an entry create/update command
+		 * must match one of the timer profiles. We'll lift this
+		 * requirement for software-only pipelines in the future.
+		 */
+		rcu_read_lock();
+		timer_profile = p4tc_table_timer_profile_find_byaging(table,
+								      aging_ms);
+		if (!timer_profile) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Specified aging %llu doesn't match any timer profile",
+					   aging_ms);
+			ret = -EINVAL;
+			rcu_read_unlock();
+			goto free_acts;
+		}
+		rcu_read_unlock();
+		value->aging_ms = aging_ms;
+	} else if (tb[P4TC_ENTRY_PROFILE_ID]) {
+		u32 profile_id = nla_get_u32(tb[P4TC_ENTRY_PROFILE_ID]);
+		struct p4tc_table_timer_profile *timer_profile;
+
+		rcu_read_lock();
+		timer_profile = p4tc_table_timer_profile_find(table,
+							      profile_id);
+		if (!timer_profile) {
+			ret = -ENOENT;
+			rcu_read_unlock();
+			goto free_acts;
+		}
+		value->aging_ms = timer_profile->aging_ms;
+		rcu_read_unlock();
+	} else if (value->is_dyn) {
+		struct p4tc_table_timer_profile *timer_profile;
+
+		rcu_read_lock();
+		timer_profile = p4tc_table_timer_profile_find(table, 0);
+		value->aging_ms = timer_profile->aging_ms;
+		rcu_read_unlock();
+	}
+
 	whodunnit = nla_get_u8(tb[P4TC_ENTRY_WHODUNNIT]);
 
 	rcu_read_lock();
@@ -1495,8 +2036,10 @@ static int p4tc_table_entry_cu(struct net *net, struct sk_buff *skb,
 	value = p4tc_table_entry_value(entry);
 	if (has_listener) {
 		if (p4tc_ctrl_pub_ok(value->permissions)) {
+			u16 who_del = P4TC_ENTITY_UNSPEC;
+
 			if (p4tc_tbl_entry_fill(skb, table, entry,
-						table->tbl_id) <= 0)
+						table->tbl_id, who_del) <= 0)
 				NL_SET_ERR_MSG(extack,
 					       "Unable to fill table entry attributes");
 
@@ -1557,6 +2100,76 @@ p4tc_tmpl_table_entry_cu(struct net *net, struct nlattr *arg,
 	return entry;
 }
 
+static int p4tc_tbl_entry_get_1(struct net *net, struct sk_buff *skb,
+				struct nlattr *arg, u16 *permissions,
+				struct p4tc_path_nlattrs *nl_path_attrs,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MAX + 1];
+	u32 *arg_ids;
+	int ret = 0;
+
+	ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+		NL_SET_ERR_MSG(extack, "Must specify object path");
+		return -EINVAL;
+	}
+
+	if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PARAMS)) {
+		NL_SET_ERR_MSG(extack, "Must specify parameters");
+		return -EINVAL;
+	}
+
+	arg_ids = nla_data(tb[P4TC_PATH]);
+	memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], arg_ids,
+	       nla_len(tb[P4TC_PATH]));
+
+	return p4tc_table_entry_gd(net, skb, RTM_P4TC_GET, permissions,
+				   tb[P4TC_PARAMS], nl_path_attrs, extack);
+}
+
+static int p4tc_tbl_entry_del_1(struct net *net, struct sk_buff *skb,
+				bool flush, u16 *permissions,
+				struct nlattr *arg,
+				struct p4tc_path_nlattrs *nl_path_attrs,
+				struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[P4TC_MAX + 1];
+	u32 *arg_ids;
+	int ret = 0;
+
+	ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+		NL_SET_ERR_MSG(extack, "Must specify object path");
+		return -EINVAL;
+	}
+
+	arg_ids = nla_data(tb[P4TC_PATH]);
+	memcpy(&nl_path_attrs->ids[P4TC_TBLID_IDX], arg_ids,
+	       nla_len(tb[P4TC_PATH]));
+
+	if (flush) {
+		ret = p4tc_table_entry_flush(net, skb, tb[P4TC_PARAMS],
+					     nl_path_attrs, extack);
+	} else {
+		if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PARAMS)) {
+			NL_SET_ERR_MSG(extack, "Must specify parameters");
+			return -EINVAL;
+		}
+		ret = p4tc_table_entry_gd(net, skb, RTM_P4TC_DEL, permissions,
+					  tb[P4TC_PARAMS], nl_path_attrs,
+					  extack);
+	}
+
+	return ret;
+}
+
 static int p4tc_tbl_entry_cu_1(struct net *net, struct sk_buff *skb,
 			       u8 cu_flags, u16 *permissions,
 			       struct nlattr *nla,
@@ -1652,7 +2265,11 @@ static int __p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
 	for (i = 1; i < P4TC_MSGBATCH_SIZE + 1 && p4tca[i]; i++) {
 		struct nlattr *nest = nla_nest_start(nskb, i);
 
-		if (cmd == RTM_P4TC_CREATE || cmd == RTM_P4TC_UPDATE) {
+		if (cmd == RTM_P4TC_GET)
+			ret = p4tc_tbl_entry_get_1(net, nskb, p4tca[i],
+						   &permissions, &nl_path_attrs,
+						   extack);
+		else if (cmd == RTM_P4TC_CREATE || cmd == RTM_P4TC_UPDATE) {
 			u8 cu_flags;
 
 			if (cmd == RTM_P4TC_UPDATE)
@@ -1668,6 +2285,12 @@ static int __p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
 						  &permissions,
 						  p4tca[i], &nl_path_attrs,
 						  extack);
+		} else if (cmd == RTM_P4TC_DEL) {
+			bool flush = nlh->nlmsg_flags & NLM_F_ROOT;
+
+			ret = p4tc_tbl_entry_del_1(net, nskb, flush,
+						   &permissions, p4tca[i],
+						   &nl_path_attrs, extack);
 		}
 
 		if (p4tc_ctrl_pub_ok(permissions)) {
@@ -1699,7 +2322,9 @@ static int __p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
 
 	nlmsg_end(nskb, nlh);
 
-	if (num_pub_permission) {
+	if (cmd == RTM_P4TC_GET) {
+		ret_send = rtnl_unicast(nskb, net, portid);
+	} else if (num_pub_permission) {
 		ret_send = rtnetlink_send(nskb, net, portid, RTNLGRP_TC,
 					  n->nlmsg_flags & NLM_F_ECHO);
 	} else {
@@ -1748,6 +2373,12 @@ static int __p4tc_tbl_entry_root_fast(struct net *net, struct nlmsghdr *n,
 			ret = p4tc_tbl_entry_cu_1(net, NULL, cu_flags, NULL,
 						  p4tca[i], &nl_path_attrs,
 						  extack);
+		} else if (cmd == RTM_P4TC_DEL) {
+			bool flush = n->nlmsg_flags & NLM_F_ROOT;
+
+			ret = p4tc_tbl_entry_del_1(net, NULL, flush, NULL,
+						   p4tca[i], &nl_path_attrs,
+						   extack);
 		}
 
 		if (ret < 0)
@@ -1802,3 +2433,215 @@ int p4tc_tbl_entry_root(struct net *net, struct sk_buff *skb,
 						 extack);
 	return ret;
 }
+
+static void p4tc_table_entry_dump_ctx_destroy(struct p4tc_dump_ctx *ctx)
+{
+	kfree(ctx->iter);
+	if (ctx->entry_filter)
+		p4tc_filter_destroy(ctx->entry_filter);
+}
+
+static int p4tc_table_entry_dump(struct net *net, struct sk_buff *skb,
+				 struct nlattr *arg,
+				 struct p4tc_path_nlattrs *nl_path_attrs,
+				 struct netlink_callback *cb,
+				 struct netlink_ext_ack *extack)
+{
+	struct p4tc_table_get_state table_get_state = { NULL };
+	struct nlattr *tb[P4TC_ENTRY_MAX + 1] = { NULL };
+	struct p4tc_dump_ctx *ctx = (void *)cb->ctx;
+	struct p4tc_pipeline *pipeline = NULL;
+	struct p4tc_table_entry *entry = NULL;
+	struct p4tc_table *table;
+	int i = 0;
+	int ret;
+
+	if (arg) {
+		ret = nla_parse_nested(tb, P4TC_ENTRY_MAX, arg,
+				       p4tc_entry_policy, extack);
+		if (ret < 0) {
+			p4tc_table_entry_dump_ctx_destroy(ctx);
+			return ret;
+		}
+	}
+
+	ret = p4tc_table_entry_get_table(net, RTM_P4TC_GET, &table_get_state,
+					 tb, nl_path_attrs, extack);
+	if (ret < 0) {
+		p4tc_table_entry_dump_ctx_destroy(ctx);
+		return ret;
+	}
+
+	pipeline = table_get_state.pipeline;
+	table = table_get_state.table;
+
+	if (!ctx->iter) {
+		struct p4tc_filter *entry_filter;
+
+		ctx->iter = kzalloc(sizeof(*ctx->iter), GFP_KERNEL);
+		if (!ctx->iter) {
+			ret = -ENOMEM;
+			goto table_put;
+		}
+
+		entry_filter = p4tc_filter_build(pipeline, table,
+						 tb[P4TC_ENTRY_FILTER],
+						 extack);
+		if (IS_ERR(entry_filter)) {
+			kfree(ctx->iter);
+			ret = PTR_ERR(entry_filter);
+			goto table_put;
+		}
+		ctx->entry_filter = entry_filter;
+
+		rhltable_walk_enter(&table->tbl_entries, ctx->iter);
+	}
+
+	/* There is an issue here regarding the stability of walking an
+	 * rhashtable. If an insert or a delete happens in parallel, we may see
+	 * duplicate entries or skip some valid entries. To solve this we are
+	 * going to have an auxiliary list that also stores the entries and will
+	 * be used for dump, instead of walking over the rhashtable.
+	 */
+	ret = -ENOMEM;
+	rhashtable_walk_start(ctx->iter);
+	do {
+		i = 0;
+		while (i < P4TC_MSGBATCH_SIZE &&
+		       (entry = rhashtable_walk_next(ctx->iter)) &&
+		       !IS_ERR(entry)) {
+			struct p4tc_table_entry_value *value =
+				p4tc_table_entry_value(entry);
+			struct nlattr *count;
+
+			if (p4tc_ctrl_read_ok(value->permissions) &&
+			    p4tc_filter_exec(ctx->entry_filter, entry)) {
+				count = nla_nest_start(skb, i + 1);
+				if (!count) {
+					rhashtable_walk_stop(ctx->iter);
+					goto table_put;
+				}
+
+				ret = p4tc_tbl_entry_fill(skb, table, entry,
+							  table->tbl_id,
+							  P4TC_ENTITY_UNSPEC);
+				if (ret == -ENOMEM) {
+					ret = 1;
+					nla_nest_cancel(skb, count);
+					rhashtable_walk_stop(ctx->iter);
+					goto table_put;
+				}
+				nla_nest_end(skb, count);
+
+				i++;
+			}
+		}
+	} while (entry == ERR_PTR(-EAGAIN));
+	rhashtable_walk_stop(ctx->iter);
+
+	if (!i) {
+		rhashtable_walk_exit(ctx->iter);
+
+		ret = 0;
+		p4tc_table_entry_dump_ctx_destroy(ctx);
+
+		goto table_put;
+	}
+
+	if (!nl_path_attrs->pname_passed)
+		strscpy(nl_path_attrs->pname, pipeline->common.name,
+			P4TC_PIPELINE_NAMSIZ);
+
+	if (!nl_path_attrs->ids[P4TC_PID_IDX])
+		nl_path_attrs->ids[P4TC_PID_IDX] = pipeline->common.p_id;
+
+	if (!nl_path_attrs->ids[P4TC_TBLID_IDX])
+		nl_path_attrs->ids[P4TC_TBLID_IDX] = table->tbl_id;
+
+	ret = skb->len;
+
+table_put:
+	p4tc_table_entry_put_table(&table_get_state);
+
+	return ret;
+}
+
+int p4tc_tbl_entry_dumpit(struct net *net, struct sk_buff *skb,
+			  struct netlink_callback *cb,
+			  struct nlattr *arg, char *p_name)
+{
+	struct p4tc_path_nlattrs nl_path_attrs = {0};
+	struct netlink_ext_ack *extack = cb->extack;
+	u32 portid = NETLINK_CB(cb->skb).portid;
+	const struct nlmsghdr *n = cb->nlh;
+	struct nlattr *tb[P4TC_MAX + 1];
+	u32 ids[P4TC_PATH_MAX] = { 0 };
+	struct p4tcmsg *t_new;
+	struct nlmsghdr *nlh;
+	struct nlattr *pnatt;
+	struct nlattr *root;
+	struct p4tcmsg *t;
+	u32 *arg_ids;
+	int ret;
+
+	ret = nla_parse_nested(tb, P4TC_MAX, arg, p4tc_policy, extack);
+	if (ret < 0)
+		return ret;
+
+	nlh = nlmsg_put(skb, portid, n->nlmsg_seq, RTM_P4TC_GET, sizeof(*t),
+			n->nlmsg_flags);
+	if (!nlh)
+		return -ENOSPC;
+
+	t = (struct p4tcmsg *)nlmsg_data(n);
+	t_new = nlmsg_data(nlh);
+	t_new->pipeid = t->pipeid;
+	t_new->obj = t->obj;
+
+	if (NL_REQ_ATTR_CHECK(extack, arg, tb, P4TC_PATH)) {
+		NL_SET_ERR_MSG(extack, "Must specify object path");
+		return -EINVAL;
+	}
+
+	pnatt = nla_reserve(skb, P4TC_ROOT_PNAME, P4TC_PIPELINE_NAMSIZ);
+	if (!pnatt)
+		return -ENOMEM;
+
+	ids[P4TC_PID_IDX] = t_new->pipeid;
+	arg_ids = nla_data(tb[P4TC_PATH]);
+	memcpy(&ids[P4TC_TBLID_IDX], arg_ids, nla_len(tb[P4TC_PATH]));
+	nl_path_attrs.ids = ids;
+
+	nl_path_attrs.pname = nla_data(pnatt);
+	if (!p_name) {
+		/* Filled up by the operation or forced failure */
+		memset(nl_path_attrs.pname, 0, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = false;
+	} else {
+		strscpy(nl_path_attrs.pname, p_name, P4TC_PIPELINE_NAMSIZ);
+		nl_path_attrs.pname_passed = true;
+	}
+
+	root = nla_nest_start(skb, P4TC_ROOT);
+	ret = p4tc_table_entry_dump(net, skb, tb[P4TC_PARAMS], &nl_path_attrs,
+				    cb, extack);
+	if (ret <= 0)
+		goto out;
+	nla_nest_end(skb, root);
+
+	if (nla_put_string(skb, P4TC_ROOT_PNAME, nl_path_attrs.pname)) {
+		ret = -1;
+		goto out;
+	}
+
+	if (!t_new->pipeid)
+		t_new->pipeid = ids[P4TC_PID_IDX];
+
+	nlmsg_end(skb, nlh);
+
+	return skb->len;
+
+out:
+	nlmsg_cancel(skb, nlh);
+	return ret;
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH net-next v12  14/15] p4tc: add set of P4TC table kfuncs
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (12 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 13/15] p4tc: add runtime table entry get, delete, flush and dump Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-03-01  6:53   ` Martin KaFai Lau
  2024-02-25 16:54 ` [PATCH net-next v12 15/15] p4tc: add P4 classifier Jamal Hadi Salim
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

We add an initial set of kfuncs to allow interactions from eBPF programs
to the P4TC domain.

- bpf_p4tc_tbl_read: Used to look up a table entry from a BPF
program installed in TC. To find the table entry we take in an skb, the
pipeline ID, the table ID, a key and a key size.
We use the skb to get the network namespace structure where all the
pipelines are stored. After that we use the pipeline ID and the table
ID to find the table. We then use the key to search for the entry.
We return an entry on success and NULL on failure.

- xdp_p4tc_tbl_read: Used to look up a table entry from a BPF
program installed in XDP. To find the table entry we take in an xdp_md,
the pipeline ID, the table ID, a key and a key size.
We use struct xdp_md to get the network namespace structure where all
the pipelines are stored. After that we use the pipeline ID and the table
ID to find the table. We then use the key to search for the entry.
We return an entry on success and NULL on failure.

- bpf_p4tc_entry_create: Used to create a table entry from a BPF
program installed in TC. To create the table entry we take an skb, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.

- xdp_p4tc_entry_create: Used to create a table entry from a BPF
program installed in XDP. To create the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.

- bpf_p4tc_entry_create_on_miss: conforms to PNA "add on miss".
It first does a lookup using the passed key and upon a miss adds the entry
to the table.
We return 0 on success and a negative errno on failure.

- xdp_p4tc_entry_create_on_miss: conforms to PNA "add on miss".
It first does a lookup using the passed key and upon a miss adds the entry
to the table.
We return 0 on success and a negative errno on failure.

- bpf_p4tc_entry_update: Used to update a table entry from a BPF
program installed in TC. To update the table entry we take an skb, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.

- xdp_p4tc_entry_update: Used to update a table entry from a BPF
program installed in XDP. To update the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and its size, and an action which will
be associated with the new entry.
We return 0 on success and a negative errno on failure.

- bpf_p4tc_entry_delete: Used to delete a table entry from a BPF
program installed in TC. To delete the table entry we take an skb, the
pipeline ID, the table ID, a key and a key size.
We return 0 on success and a negative errno on failure.

- xdp_p4tc_entry_delete: Used to delete a table entry from a BPF
program installed in XDP. To delete the table entry we take an xdp_md, the
pipeline ID, the table ID, a key and a key size.
We return 0 on success and a negative errno on failure.

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/linux/bitops.h          |   1 +
 include/net/p4tc.h              |  89 ++++++-
 include/net/tc_act/p4tc.h       |  51 ++++
 include/uapi/linux/p4tc.h       |   2 +
 net/sched/p4tc/Makefile         |   1 +
 net/sched/p4tc/p4tc_action.c    |  71 ++++-
 net/sched/p4tc/p4tc_bpf.c       | 342 ++++++++++++++++++++++++
 net/sched/p4tc/p4tc_pipeline.c  |  43 +++
 net/sched/p4tc/p4tc_table.c     |  41 +++
 net/sched/p4tc/p4tc_tbl_entry.c | 454 ++++++++++++++++++++++++++++++--
 10 files changed, 1062 insertions(+), 33 deletions(-)
 create mode 100644 net/sched/p4tc/p4tc_bpf.c

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 2ba557e06..290c2399a 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -19,6 +19,7 @@
 #define BITS_TO_LONGS(nr)	__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
 #define BITS_TO_U64(nr)		__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u64))
 #define BITS_TO_U32(nr)		__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u32))
+#define BITS_TO_U16(nr)		__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u16))
 #define BITS_TO_BYTES(nr)	__KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(char))
 
 extern unsigned int __sw_hweight8(unsigned int w);
diff --git a/include/net/p4tc.h b/include/net/p4tc.h
index 231936df4..9e6317dea 100644
--- a/include/net/p4tc.h
+++ b/include/net/p4tc.h
@@ -100,8 +100,28 @@ struct p4tc_pipeline {
 	u8                          p_state;
 };
 
+#define P4TC_PIPELINE_MAX_ARRAY 32
+
+struct p4tc_tbl_cache_key {
+	u32 pipeid;
+	u32 tblid;
+};
+
+extern const struct rhashtable_params tbl_cache_ht_params;
+
+struct p4tc_table;
+
+int p4tc_tbl_cache_insert(struct net *net, u32 pipeid,
+			  struct p4tc_table *table);
+void p4tc_tbl_cache_remove(struct net *net, struct p4tc_table *table);
+struct p4tc_table *p4tc_tbl_cache_lookup(struct net *net, u32 pipeid,
+					 u32 tblid);
+
+#define P4TC_TBLS_CACHE_SIZE 32
+
 struct p4tc_pipeline_net {
-	struct idr pipeline_idr;
+	struct list_head  tbls_cache[P4TC_TBLS_CACHE_SIZE];
+	struct idr        pipeline_idr;
 };
 
 static inline bool p4tc_tmpl_msg_is_update(struct nlmsghdr *n)
@@ -227,6 +247,7 @@ struct p4tc_table_perm {
 
 struct p4tc_table {
 	struct p4tc_template_common         common;
+	struct list_head                    tbl_cache_node;
 	struct list_head                    tbl_acts_list;
 	struct idr                          tbl_masks_idr;
 	struct ida                          tbl_prio_idr;
@@ -327,6 +348,17 @@ struct p4tc_table_timer_profile {
 
 extern const struct rhashtable_params entry_hlt_params;
 
+struct p4tc_table_entry_act_bpf_params {
+	u32 pipeid;
+	u32 tblid;
+};
+
+struct p4tc_table_entry_create_bpf_params {
+	u32 profile_id;
+	u32 pipeid;
+	u32 tblid;
+};
+
 struct p4tc_table_entry;
 struct p4tc_table_entry_work {
 	struct work_struct   work;
@@ -378,8 +410,24 @@ struct p4tc_table_entry {
 	/* fallthrough: key data + value */
 };
 
+struct p4tc_entry_key_bpf {
+	void *key;
+	void *mask;
+	u32 key_sz;
+	u32 mask_sz;
+};
+
 #define P4TC_KEYSZ_BYTES(bits) (round_up(BITS_TO_BYTES(bits), 8))
 
+#define P4TC_ENTRY_KEY_SZ_BYTES(bits) \
+	(P4TC_ENTRY_KEY_OFFSET + P4TC_KEYSZ_BYTES(bits))
+
+#define P4TC_ENTRY_KEY_OFFSET (offsetof(struct p4tc_table_entry_key, fa_key))
+
+#define P4TC_ENTRY_VALUE_OFFSET(entry) \
+	(offsetof(struct p4tc_table_entry, key) + P4TC_ENTRY_KEY_OFFSET \
+	 + P4TC_KEYSZ_BYTES((entry)->key.keysz))
+
 static inline void *p4tc_table_entry_value(struct p4tc_table_entry *entry)
 {
 	return entry->key.fa_key + P4TC_KEYSZ_BYTES(entry->key.keysz);
@@ -396,6 +444,29 @@ p4tc_table_entry_work(struct p4tc_table_entry *entry)
 extern const struct nla_policy p4tc_root_policy[P4TC_ROOT_MAX + 1];
 extern const struct nla_policy p4tc_policy[P4TC_MAX + 1];
 
+struct p4tc_table_entry *
+p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+			       struct p4tc_table_entry_key *key);
+
+struct p4tc_table_entry_act_bpf *
+p4tc_table_entry_create_act_bpf(struct tc_action *action,
+				struct netlink_ext_ack *extack);
+int register_p4tc_tbl_bpf(void);
+int p4tc_table_entry_create_bpf(struct p4tc_pipeline *pipeline,
+				struct p4tc_table *table,
+				struct p4tc_table_entry_key *key,
+				struct p4tc_table_entry_act_bpf *act_bpf,
+				u32 profile_id);
+int p4tc_table_entry_update_bpf(struct p4tc_pipeline *pipeline,
+				struct p4tc_table *table,
+				struct p4tc_table_entry_key *key,
+				struct p4tc_table_entry_act_bpf *act_bpf,
+				u32 profile_id);
+
+int p4tc_table_entry_del_bpf(struct p4tc_pipeline *pipeline,
+			     struct p4tc_table *table,
+			     struct p4tc_table_entry_key *key);
+
 static inline int p4tc_action_init(struct net *net, struct nlattr *nla,
 				   struct tc_action *acts[], u32 pipeid,
 				   u32 flags, struct netlink_ext_ack *extack)
@@ -465,6 +536,7 @@ static inline bool p4tc_action_put_ref(struct p4tc_act *act)
 
 struct p4tc_act_param *p4a_parm_find_byid(struct idr *params_idr,
 					  const u32 param_id);
+
 struct p4tc_act_param *
 p4a_parm_find_byany(struct p4tc_act *act, const char *param_name,
 		    const u32 param_id, struct netlink_ext_ack *extack);
@@ -513,12 +585,19 @@ static inline void p4tc_table_defact_destroy(struct p4tc_table_defact *defact)
 {
 	if (defact) {
 		if (defact->acts[0]) {
-			struct tcf_p4act *p4_defact = to_p4act(defact->acts[0]);
+			struct tcf_p4act *dflt = to_p4act(defact->acts[0]);
+
+			if (p4tc_table_defact_is_noaction(dflt)) {
+				struct p4tc_table_entry_act_bpf_kern *act_bpf;
 
-			if (p4tc_table_defact_is_noaction(p4_defact))
-				kfree(p4_defact);
-			else
+				act_bpf =
+					rcu_dereference_protected(dflt->act_bpf,
+								  1);
+				kfree(act_bpf);
+				kfree(dflt);
+			} else {
 				p4tc_action_destroy(defact->acts);
+			}
 		}
 		kfree(defact);
 	}
diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
index c5256d821..155068de0 100644
--- a/include/net/tc_act/p4tc.h
+++ b/include/net/tc_act/p4tc.h
@@ -13,10 +13,26 @@ struct tcf_p4act_params {
 	u32 tot_params_sz;
 };
 
+#define P4TC_MAX_PARAM_DATA_SIZE 124
+
+struct p4tc_table_entry_act_bpf {
+	u32 act_id;
+	u32 hit:1,
+	    is_default_miss_act:1,
+	    is_default_hit_act:1;
+	u8 params[P4TC_MAX_PARAM_DATA_SIZE];
+} __packed;
+
+struct p4tc_table_entry_act_bpf_kern {
+	struct rcu_head rcu;
+	struct p4tc_table_entry_act_bpf act_bpf;
+};
+
 struct tcf_p4act {
 	struct tc_action common;
 	/* Params IDR reference passed during runtime */
 	struct tcf_p4act_params __rcu *params;
+	struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
 	u32 p_id;
 	u32 act_id;
 	struct list_head node;
@@ -24,4 +40,39 @@ struct tcf_p4act {
 
 #define to_p4act(a) ((struct tcf_p4act *)a)
 
+static inline struct p4tc_table_entry_act_bpf *
+p4tc_table_entry_act_bpf(struct tc_action *action)
+{
+	struct p4tc_table_entry_act_bpf_kern *act_bpf;
+	struct tcf_p4act *p4act = to_p4act(action);
+
+	act_bpf = rcu_dereference(p4act->act_bpf);
+
+	return &act_bpf->act_bpf;
+}
+
+static inline int
+p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
+				      u32 dflt_miss, u32 dflt_hit)
+{
+	struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
+	struct tcf_p4act *p4act = to_p4act(action);
+
+	act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
+	if (!act_bpf)
+		return -ENOMEM;
+
+	spin_lock_bh(&p4act->tcf_lock);
+	act_bpf_old = rcu_dereference_protected(p4act->act_bpf, 1);
+	act_bpf->act_bpf = act_bpf_old->act_bpf;
+	act_bpf->act_bpf.hit = hit;
+	act_bpf->act_bpf.is_default_hit_act = dflt_hit;
+	act_bpf->act_bpf.is_default_miss_act = dflt_miss;
+	rcu_replace_pointer(p4act->act_bpf, act_bpf, 1);
+	kfree_rcu(act_bpf_old, rcu);
+	spin_unlock_bh(&p4act->tcf_lock);
+
+	return 0;
+}
+
 #endif /* __NET_TC_ACT_P4_H */
diff --git a/include/uapi/linux/p4tc.h b/include/uapi/linux/p4tc.h
index 3f1444ad9..943c79fbc 100644
--- a/include/uapi/linux/p4tc.h
+++ b/include/uapi/linux/p4tc.h
@@ -19,6 +19,8 @@ struct p4tcmsg {
 #define P4TC_MINTABLES_COUNT 0
 #define P4TC_MSGBATCH_SIZE 16
 
+#define P4TC_ACT_MAX_NUM_PARAMS P4TC_MSGBATCH_SIZE
+
 #define P4TC_MAX_KEYSZ 512
 #define P4TC_DEFAULT_NUM_PREALLOC 16
 
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 56a8adc74..73ccb53c4 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -3,3 +3,4 @@
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
 	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
 	p4tc_filter.o p4tc_runtime_api.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/p4tc_action.c b/net/sched/p4tc/p4tc_action.c
index 4b7b5501a..6108bcf65 100644
--- a/net/sched/p4tc/p4tc_action.c
+++ b/net/sched/p4tc/p4tc_action.c
@@ -278,29 +278,85 @@ static void p4a_runt_parms_destroy_rcu(struct rcu_head *head)
 	p4a_runt_parms_destroy(params);
 }
 
+static struct p4tc_table_entry_act_bpf_kern *
+p4a_runt_create_bpf(struct tcf_p4act *p4act,
+		    struct tcf_p4act_params *act_params,
+		    struct netlink_ext_ack *extack)
+{
+	struct p4tc_act_param *params[P4TC_ACT_MAX_NUM_PARAMS];
+	struct p4tc_table_entry_act_bpf_kern *act_bpf;
+	struct p4tc_act_param *param;
+	unsigned long param_id, tmp;
+	size_t tot_params_sz = 0;
+	u8 *params_cursor;
+	int nparams = 0;
+	int i;
+
+	act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
+	if (!act_bpf)
+		return ERR_PTR(-ENOMEM);
+
+	idr_for_each_entry_ul(&act_params->params_idr, param, tmp, param_id) {
+		const struct p4tc_type *type = param->type;
+
+		tot_params_sz += BITS_TO_BYTES(type->container_bitsz);
+		if (tot_params_sz > P4TC_MAX_PARAM_DATA_SIZE) {
+			NL_SET_ERR_MSG(extack,
+				       "Maximum parameter byte size reached");
+			kfree(act_bpf);
+			return ERR_PTR(-EINVAL);
+		}
+
+		params[nparams++] = param;
+	}
+
+	act_bpf->act_bpf.act_id = p4act->act_id;
+	params_cursor = act_bpf->act_bpf.params;
+	for (i = 0; i < nparams; i++) {
+		u32 type_bytesz;
+
+		param = params[i];
+		type_bytesz = BITS_TO_BYTES(param->type->container_bitsz);
+		memcpy(params_cursor, param->value, type_bytesz);
+		params_cursor += type_bytesz;
+	}
+	act_bpf->act_bpf.hit = true;
+
+	return act_bpf;
+}
+
 static int __p4a_runt_init_set(struct p4tc_act *act, struct tc_action **a,
 			       struct tcf_p4act_params *params,
 			       struct tcf_chain *goto_ch,
 			       struct tc_act_p4 *parm, bool exists,
 			       struct netlink_ext_ack *extack)
 {
+	struct p4tc_table_entry_act_bpf_kern *act_bpf = NULL, *act_bpf_old;
 	struct tcf_p4act_params *params_old;
 	struct tcf_p4act *p;
 
 	p = to_p4act(*a);
 
+	if (!((*a)->tcfa_flags & TCA_ACT_FLAGS_UNREFERENCED)) {
+		act_bpf = p4a_runt_create_bpf(p, params, extack);
+		if (IS_ERR(act_bpf))
+			return PTR_ERR(act_bpf);
+	}
+
 	/* sparse is fooled by lock under conditionals.
-	 * To avoid false positives, we are repeating these two lines in both
+	 * To avoid false positives, we are repeating these three lines in both
 	 * branches of the if-statement
 	 */
 	if (exists) {
 		spin_lock_bh(&p->tcf_lock);
 		goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
 		params_old = rcu_replace_pointer(p->params, params, 1);
+		act_bpf_old = rcu_replace_pointer(p->act_bpf, act_bpf, 1);
 		spin_unlock_bh(&p->tcf_lock);
 	} else {
 		goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
 		params_old = rcu_replace_pointer(p->params, params, 1);
+		act_bpf_old = rcu_replace_pointer(p->act_bpf, act_bpf, 1);
 	}
 
 	if (goto_ch)
@@ -309,6 +365,9 @@ static int __p4a_runt_init_set(struct p4tc_act *act, struct tc_action **a,
 	if (params_old)
 		call_rcu(&params_old->rcu, p4a_runt_parms_destroy_rcu);
 
+	if (act_bpf_old)
+		kfree_rcu(act_bpf_old, rcu);
+
 	return 0;
 }
 
@@ -506,6 +565,7 @@ void p4a_runt_init_flags(struct tcf_p4act *p4act)
 static void __p4a_runt_prealloc_put(struct p4tc_act *act,
 				    struct tcf_p4act *p4act)
 {
+	struct p4tc_table_entry_act_bpf_kern *act_bpf_old;
 	struct tcf_p4act_params *p4act_params;
 	struct p4tc_act_param *param;
 	unsigned long param_id, tmp;
@@ -524,6 +584,10 @@ static void __p4a_runt_prealloc_put(struct p4tc_act *act,
 	p4act->common.tcfa_flags |= TCA_ACT_FLAGS_UNREFERENCED;
 	spin_unlock_bh(&p4act->tcf_lock);
 
+	act_bpf_old = rcu_replace_pointer(p4act->act_bpf, NULL, 1);
+	if (act_bpf_old)
+		kfree_rcu(act_bpf_old, rcu);
+
 	spin_lock_bh(&act->list_lock);
 	list_add_tail(&p4act->node, &act->prealloc_list);
 	spin_unlock_bh(&act->list_lock);
@@ -1214,16 +1278,21 @@ static int p4a_runt_walker(struct net *net, struct sk_buff *skb,
 static void p4a_runt_cleanup(struct tc_action *a)
 {
 	struct tc_action_ops *ops = (struct tc_action_ops *)a->ops;
+	struct p4tc_table_entry_act_bpf_kern *act_bpf;
 	struct tcf_p4act *m = to_p4act(a);
 	struct tcf_p4act_params *params;
 
 	params = rcu_dereference_protected(m->params, 1);
+	act_bpf = rcu_dereference_protected(m->act_bpf, 1);
 
 	if (refcount_read(&ops->p4_ref) > 1)
 		refcount_dec(&ops->p4_ref);
 
 	if (params)
 		call_rcu(&params->rcu, p4a_runt_parms_destroy_rcu);
+
+	if (act_bpf)
+		kfree_rcu(act_bpf, rcu);
 }
 
 static void p4a_runt_net_exit(struct tc_action_net *tn)
diff --git a/net/sched/p4tc/p4tc_bpf.c b/net/sched/p4tc/p4tc_bpf.c
new file mode 100644
index 000000000..0eb1002ca
--- /dev/null
+++ b/net/sched/p4tc/p4tc_bpf.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <linux/btf_ids.h>
+#include <linux/net_namespace.h>
+#include <net/p4tc.h>
+#include <linux/netdevice.h>
+#include <net/sock.h>
+#include <net/xdp.h>
+
+BTF_ID_LIST(btf_p4tc_ids)
+BTF_ID(struct, p4tc_table_entry_act_bpf)
+BTF_ID(struct, p4tc_table_entry_act_bpf_params)
+BTF_ID(struct, p4tc_table_entry_act_bpf)
+BTF_ID(struct, p4tc_table_entry_create_bpf_params)
+
+static struct p4tc_table_entry_act_bpf p4tc_no_action_hit_bpf = {
+	.hit = 1,
+};
+
+static struct p4tc_table_entry_act_bpf *
+__bpf_p4tc_tbl_read(struct net *caller_net,
+		    struct p4tc_table_entry_act_bpf_params *params,
+		    void *key, const u32 key__sz)
+{
+	struct p4tc_table_entry_key *entry_key = key;
+	struct p4tc_table_defact *defact_hit;
+	struct p4tc_table_entry_value *value;
+	struct p4tc_table_entry *entry;
+	struct p4tc_table *table;
+	u32 pipeid;
+	u32 tblid;
+
+	if (!params || !key)
+		return NULL;
+
+	pipeid = params->pipeid;
+	tblid = params->tblid;
+
+	if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
+		return NULL;
+
+	table = p4tc_tbl_cache_lookup(caller_net, pipeid, tblid);
+	if (!table)
+		return NULL;
+
+	if (entry_key->keysz != table->tbl_keysz)
+		return NULL;
+
+	entry = p4tc_table_entry_lookup_direct(table, entry_key);
+	if (!entry) {
+		struct p4tc_table_defact *defact;
+
+		defact = rcu_dereference(table->tbl_dflt_missact);
+		return defact ? p4tc_table_entry_act_bpf(defact->acts[0]) :
+				NULL;
+	}
+
+	value = p4tc_table_entry_value(entry);
+
+	if (value->acts[0])
+		return p4tc_table_entry_act_bpf(value->acts[0]);
+
+	defact_hit = rcu_dereference(table->tbl_dflt_hitact);
+	return defact_hit ? p4tc_table_entry_act_bpf(defact_hit->acts[0]) :
+		&p4tc_no_action_hit_bpf;
+}
+
+__bpf_kfunc static struct p4tc_table_entry_act_bpf *
+bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,
+		  struct p4tc_table_entry_act_bpf_params *params,
+		  void *key, const u32 key__sz)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *caller_net;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+	return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
+}
+
+__bpf_kfunc static struct p4tc_table_entry_act_bpf *
+xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
+		  struct p4tc_table_entry_act_bpf_params *params,
+		  void *key, const u32 key__sz)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *caller_net;
+
+	caller_net = dev_net(ctx->rxq->dev);
+
+	return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
+}
+
+static int
+__bpf_p4tc_entry_create(struct net *net,
+			struct p4tc_table_entry_create_bpf_params *params,
+			void *key, const u32 key__sz,
+			struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct p4tc_table_entry_key *entry_key = key;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+
+	if (!params || !key)
+		return -EINVAL;
+	if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
+		return -EINVAL;
+
+	pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+	if (!pipeline)
+		return -ENOENT;
+
+	table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+	if (!table)
+		return -ENOENT;
+
+	if (entry_key->keysz != table->tbl_keysz)
+		return -EINVAL;
+
+	return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
+					   params->profile_id);
+}
+
+__bpf_kfunc static int
+bpf_p4tc_entry_create(struct __sk_buff *skb_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz,
+		      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *net;
+
+	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc static int
+xdp_p4tc_entry_create(struct xdp_md *xdp_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz,
+		      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *net;
+
+	net = dev_net(ctx->rxq->dev);
+
+	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc static int
+bpf_p4tc_entry_create_on_miss(struct __sk_buff *skb_ctx,
+			      struct p4tc_table_entry_create_bpf_params *params,
+			      void *key, const u32 key__sz,
+			      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *net;
+
+	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc static int
+xdp_p4tc_entry_create_on_miss(struct xdp_md *xdp_ctx,
+			      struct p4tc_table_entry_create_bpf_params *params,
+			      void *key, const u32 key__sz,
+			      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *net;
+
+	net = dev_net(ctx->rxq->dev);
+
+	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
+}
+
+static int
+__bpf_p4tc_entry_update(struct net *net,
+			struct p4tc_table_entry_create_bpf_params *params,
+			void *key, const u32 key__sz,
+			struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct p4tc_table_entry_key *entry_key = key;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+
+	if (!params || !key)
+		return -EINVAL;
+
+	if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
+		return -EINVAL;
+
+	pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+	if (!pipeline)
+		return -ENOENT;
+
+	table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+	if (!table)
+		return -ENOENT;
+
+	if (entry_key->keysz != table->tbl_keysz)
+		return -EINVAL;
+
+	return p4tc_table_entry_update_bpf(pipeline, table, entry_key,
+					  act_bpf, params->profile_id);
+}
+
+__bpf_kfunc static int
+bpf_p4tc_entry_update(struct __sk_buff *skb_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz,
+		      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *net;
+
+	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+	return __bpf_p4tc_entry_update(net, params, key, key__sz, act_bpf);
+}
+
+__bpf_kfunc static int
+xdp_p4tc_entry_update(struct xdp_md *xdp_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz,
+		      struct p4tc_table_entry_act_bpf *act_bpf)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *net;
+
+	net = dev_net(ctx->rxq->dev);
+
+	return __bpf_p4tc_entry_update(net, params, key, key__sz, act_bpf);
+}
+
+static int
+__bpf_p4tc_entry_delete(struct net *net,
+			struct p4tc_table_entry_create_bpf_params *params,
+			void *key, const u32 key__sz)
+{
+	struct p4tc_table_entry_key *entry_key = key;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_table *table;
+
+	if (!params || !key)
+		return -EINVAL;
+
+	if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
+		return -EINVAL;
+
+	pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
+	if (!pipeline)
+		return -ENOENT;
+
+	table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
+	if (!table)
+		return -ENOENT;
+
+	if (entry_key->keysz != table->tbl_keysz)
+		return -EINVAL;
+
+	return p4tc_table_entry_del_bpf(pipeline, table, entry_key);
+}
+
+__bpf_kfunc static int
+bpf_p4tc_entry_delete(struct __sk_buff *skb_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct net *net;
+
+	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+
+	return __bpf_p4tc_entry_delete(net, params, key, key__sz);
+}
+
+__bpf_kfunc static int
+xdp_p4tc_entry_delete(struct xdp_md *xdp_ctx,
+		      struct p4tc_table_entry_create_bpf_params *params,
+		      void *key, const u32 key__sz)
+{
+	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
+	struct net *net;
+
+	net = dev_net(ctx->rxq->dev);
+
+	return __bpf_p4tc_entry_delete(net, params, key, key__sz);
+}
+
+BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
+BTF_ID_FLAGS(func, bpf_p4tc_tbl_read, KF_RET_NULL);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_create);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_create_on_miss);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_update);
+BTF_ID_FLAGS(func, bpf_p4tc_entry_delete);
+BTF_SET8_END(p4tc_kfunc_check_tbl_set_skb)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_skb = {
+	.owner = THIS_MODULE,
+	.set = &p4tc_kfunc_check_tbl_set_skb,
+};
+
+BTF_SET8_START(p4tc_kfunc_check_tbl_set_xdp)
+BTF_ID_FLAGS(func, xdp_p4tc_tbl_read, KF_RET_NULL);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_create);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_create_on_miss);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_update);
+BTF_ID_FLAGS(func, xdp_p4tc_entry_delete);
+BTF_SET8_END(p4tc_kfunc_check_tbl_set_xdp)
+
+static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_xdp = {
+	.owner = THIS_MODULE,
+	.set = &p4tc_kfunc_check_tbl_set_xdp,
+};
+
+int register_p4tc_tbl_bpf(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT,
+					&p4tc_kfunc_tbl_set_skb);
+	if (ret < 0)
+		return ret;
+
+	/* There is no unregister_btf_kfunc_id_set function */
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+					 &p4tc_kfunc_tbl_set_xdp);
+}
diff --git a/net/sched/p4tc/p4tc_pipeline.c b/net/sched/p4tc/p4tc_pipeline.c
index 9b3cc9245..90f81dedc 100644
--- a/net/sched/p4tc/p4tc_pipeline.c
+++ b/net/sched/p4tc/p4tc_pipeline.c
@@ -37,6 +37,44 @@ static __net_init int pipeline_init_net(struct net *net)
 
 	idr_init(&pipe_net->pipeline_idr);
 
+	for (int i = 0; i < P4TC_TBLS_CACHE_SIZE; i++)
+		INIT_LIST_HEAD(&pipe_net->tbls_cache[i]);
+
+	return 0;
+}
+
+static size_t p4tc_tbl_cache_hash(u32 pipeid, u32 tblid)
+{
+	return (pipeid + tblid) % P4TC_TBLS_CACHE_SIZE;
+}
+
+struct p4tc_table *p4tc_tbl_cache_lookup(struct net *net, u32 pipeid, u32 tblid)
+{
+	size_t hash = p4tc_tbl_cache_hash(pipeid, tblid);
+	struct p4tc_pipeline_net *pipe_net;
+	struct p4tc_table *pos, *tmp;
+	struct net_generic *ng;
+
+	/* RCU read lock is already being held */
+	ng = rcu_dereference(net->gen);
+	pipe_net = ng->ptr[pipeline_net_id];
+
+	list_for_each_entry_safe(pos, tmp, &pipe_net->tbls_cache[hash],
+				 tbl_cache_node) {
+		if (pos->common.p_id == pipeid && pos->tbl_id == tblid)
+			return pos;
+	}
+
+	return NULL;
+}
+
+int p4tc_tbl_cache_insert(struct net *net, u32 pipeid, struct p4tc_table *table)
+{
+	struct p4tc_pipeline_net *pipe_net = net_generic(net, pipeline_net_id);
+	size_t hash = p4tc_tbl_cache_hash(pipeid, table->tbl_id);
+
+	list_add_tail(&table->tbl_cache_node, &pipe_net->tbls_cache[hash]);
+
 	return 0;
 }
 
@@ -44,6 +82,11 @@ static int __p4tc_pipeline_put(struct p4tc_pipeline *pipeline,
 			       struct p4tc_template_common *template,
 			       struct netlink_ext_ack *extack);
 
+void p4tc_tbl_cache_remove(struct net *net, struct p4tc_table *table)
+{
+	list_del(&table->tbl_cache_node);
+}
+
 static void __net_exit pipeline_exit_net(struct net *net)
 {
 	struct p4tc_pipeline_net *pipe_net;
diff --git a/net/sched/p4tc/p4tc_table.c b/net/sched/p4tc/p4tc_table.c
index e1b2beed2..2bc758d85 100644
--- a/net/sched/p4tc/p4tc_table.c
+++ b/net/sched/p4tc/p4tc_table.c
@@ -645,6 +645,7 @@ static int _p4tc_table_put(struct net *net, struct nlattr **tb,
 
 	rhltable_free_and_destroy(&table->tbl_entries,
 				  p4tc_table_entry_destroy_hash, table);
+	p4tc_tbl_cache_remove(net, table);
 
 	idr_destroy(&table->tbl_masks_idr);
 	ida_destroy(&table->tbl_prio_idr);
@@ -816,6 +817,7 @@ __p4tc_table_init_defact(struct net *net, struct nlattr **tb, u32 pipeid,
 		if (ret < 0)
 			goto err;
 	} else if (tb[P4TC_TABLE_DEFAULT_ACTION_NOACTION]) {
+		struct p4tc_table_entry_act_bpf_kern *no_action_bpf_kern;
 		struct tcf_p4act *p4_defact;
 
 		if (!p4tc_ctrl_update_ok(perm)) {
@@ -825,11 +827,20 @@ __p4tc_table_init_defact(struct net *net, struct nlattr **tb, u32 pipeid,
 			goto err;
 		}
 
+		no_action_bpf_kern = kzalloc(sizeof(*no_action_bpf_kern),
+					     GFP_KERNEL);
+		if (!no_action_bpf_kern) {
+			ret = -ENOMEM;
+			goto err;
+		}
+
 		p4_defact = kzalloc(sizeof(*p4_defact), GFP_KERNEL);
 		if (!p4_defact) {
+			kfree(no_action_bpf_kern);
 			ret = -ENOMEM;
 			goto err;
 		}
+		rcu_assign_pointer(p4_defact->act_bpf, no_action_bpf_kern);
 		p4_defact->p_id = 0;
 		p4_defact->act_id = 0;
 		defact->acts[0] = (struct tc_action *)p4_defact;
@@ -964,6 +975,14 @@ int p4tc_table_init_default_acts(struct net *net,
 		if (IS_ERR(hitact))
 			return PTR_ERR(hitact);
 
+		if (hitact->acts[0]) {
+			struct tc_action *_hitact = hitact->acts[0];
+
+			ret = p4tc_table_entry_act_bpf_change_flags(_hitact, 1,
+								    0, 1);
+			if (ret < 0)
+				goto default_hitacts_free;
+		}
 		dflt->hitact = hitact;
 	}
 
@@ -986,11 +1005,22 @@ int p4tc_table_init_default_acts(struct net *net,
 			goto default_hitacts_free;
 		}
 
+		if (missact->acts[0]) {
+			struct tc_action *_missact = missact->acts[0];
+
+			ret = p4tc_table_entry_act_bpf_change_flags(_missact, 0,
+								    1, 0);
+			if (ret < 0)
+				goto default_missacts_free;
+		}
 		dflt->missact = missact;
 	}
 
 	return 0;
 
+default_missacts_free:
+	p4tc_table_defact_destroy(dflt->missact);
+
 default_hitacts_free:
 	p4tc_table_defact_destroy(dflt->hitact);
 	return ret;
@@ -1423,6 +1453,10 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
 		goto profiles_destroy;
 	}
 
+	ret = p4tc_tbl_cache_insert(net, pipeline->common.p_id, table);
+	if (ret < 0)
+		goto entries_hashtable_destroy;
+
 	pipeline->curr_tables += 1;
 
 	table->common.ops = (struct p4tc_template_ops *)&p4tc_table_ops;
@@ -1430,6 +1464,9 @@ static struct p4tc_table *p4tc_table_create(struct net *net, struct nlattr **tb,
 
 	return table;
 
+entries_hashtable_destroy:
+	rhltable_destroy(&table->tbl_entries);
+
 profiles_destroy:
 	p4tc_table_timer_profiles_destroy(table);
 
@@ -1787,6 +1824,10 @@ static int __init p4tc_table_init(void)
 {
 	p4tc_tmpl_register_ops(&p4tc_table_ops);
 
+#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
+	register_p4tc_tbl_bpf();
+#endif
+
 	return 0;
 }
 
diff --git a/net/sched/p4tc/p4tc_tbl_entry.c b/net/sched/p4tc/p4tc_tbl_entry.c
index 7a644eb40..3904f62e7 100644
--- a/net/sched/p4tc/p4tc_tbl_entry.c
+++ b/net/sched/p4tc/p4tc_tbl_entry.c
@@ -143,6 +143,32 @@ p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key,
 	return NULL;
 }
 
+static struct p4tc_table_entry *
+__p4tc_entry_lookup(struct p4tc_table *table, struct p4tc_table_entry_key *key)
+	__must_hold(RCU)
+{
+	struct p4tc_table_entry *entry = NULL;
+	struct rhlist_head *tmp, *bucket_list;
+	struct p4tc_table_entry *entry_curr;
+	u32 smallest_prio = U32_MAX;
+
+	bucket_list =
+		rhltable_lookup(&table->tbl_entries, key, entry_hlt_params);
+	if (!bucket_list)
+		return NULL;
+
+	rhl_for_each_entry_rcu(entry_curr, tmp, bucket_list, ht_node) {
+		struct p4tc_table_entry_value *value =
+			p4tc_table_entry_value(entry_curr);
+		if (value->prio <= smallest_prio) {
+			smallest_prio = value->prio;
+			entry = entry_curr;
+		}
+	}
+
+	return entry;
+}
+
 void p4tc_tbl_entry_mask_key(u8 *masked_key, u8 *key, const u8 *mask,
 			     u32 masksz)
 {
@@ -152,6 +178,79 @@ void p4tc_tbl_entry_mask_key(u8 *masked_key, u8 *key, const u8 *mask,
 		masked_key[i] = key[i] & mask[i];
 }
 
+static void update_last_used(struct p4tc_table_entry *entry)
+{
+	struct p4tc_table_entry_tm *entry_tm;
+	struct p4tc_table_entry_value *value;
+
+	value = p4tc_table_entry_value(entry);
+	entry_tm = rcu_dereference(value->tm);
+	WRITE_ONCE(entry_tm->lastused, get_jiffies_64());
+
+	if (value->is_dyn && !hrtimer_active(&value->entry_timer))
+		hrtimer_start(&value->entry_timer, ms_to_ktime(1000),
+			      HRTIMER_MODE_REL);
+}
+
+static struct p4tc_table_entry *
+__p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+				 struct p4tc_table_entry_key *key)
+{
+	struct p4tc_table_entry *entry = NULL;
+	u32 smallest_prio = U32_MAX;
+	int i;
+
+	if (table->tbl_type == P4TC_TABLE_TYPE_EXACT)
+		return __p4tc_entry_lookup_fast(table, key);
+
+	for (i = 0; i < table->tbl_curr_num_masks; i++) {
+		u8 __mkey[sizeof(*key) + BITS_TO_BYTES(P4TC_MAX_KEYSZ)];
+		struct p4tc_table_entry_key *mkey = (void *)&__mkey;
+		struct p4tc_table_entry_mask *mask =
+			rcu_dereference(table->tbl_masks_array[i]);
+		struct p4tc_table_entry *entry_curr = NULL;
+
+		mkey->keysz = key->keysz;
+		mkey->maskid = mask->mask_id;
+		p4tc_tbl_entry_mask_key(mkey->fa_key, key->fa_key,
+					mask->fa_value,
+					BITS_TO_BYTES(mask->sz));
+
+		if (table->tbl_type == P4TC_TABLE_TYPE_LPM) {
+			entry_curr = __p4tc_entry_lookup_fast(table, mkey);
+			if (entry_curr)
+				return entry_curr;
+		} else {
+			entry_curr = __p4tc_entry_lookup(table, mkey);
+
+			if (entry_curr) {
+				struct p4tc_table_entry_value *value =
+					p4tc_table_entry_value(entry_curr);
+				if (value->prio <= smallest_prio) {
+					smallest_prio = value->prio;
+					entry = entry_curr;
+				}
+			}
+		}
+	}
+
+	return entry;
+}
+
+struct p4tc_table_entry *
+p4tc_table_entry_lookup_direct(struct p4tc_table *table,
+			       struct p4tc_table_entry_key *key)
+{
+	struct p4tc_table_entry *entry;
+
+	entry = __p4tc_table_entry_lookup_direct(table, key);
+
+	if (entry)
+		update_last_used(entry);
+
+	return entry;
+}
+
 #define p4tc_table_entry_mask_find_byid(table, id) \
 	(idr_find(&(table)->tbl_masks_idr, id))
 
@@ -1006,6 +1105,44 @@ __must_hold(RCU)
 	return 0;
 }
 
+/* Internal function which will be called by the data path */
+static int __p4tc_table_entry_del(struct p4tc_pipeline *pipeline,
+				  struct p4tc_table *table,
+				  struct p4tc_table_entry_key *key,
+				  struct p4tc_table_entry_mask *mask, u32 prio)
+{
+	struct p4tc_table_entry *entry;
+	int ret;
+
+	p4tc_table_entry_build_key(table, key, mask);
+
+	entry = p4tc_entry_lookup(table, key, prio);
+	if (!entry)
+		return -ENOENT;
+
+	ret = ___p4tc_table_entry_del(pipeline, table, entry, false);
+
+	return ret;
+}
+
+int p4tc_table_entry_del_bpf(struct p4tc_pipeline *pipeline,
+			     struct p4tc_table *table,
+			     struct p4tc_table_entry_key *key)
+{
+	u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+		  BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+	const u32 keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+	struct p4tc_table_entry_mask *mask = (void *)&__mask;
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		return -EINVAL;
+
+	if (keysz_bytes != P4TC_KEYSZ_BYTES(key->keysz))
+		return -EINVAL;
+
+	return __p4tc_table_entry_del(pipeline, table, key, mask, 0);
+}
+
 static int p4tc_table_entry_gd(struct net *net, struct sk_buff *skb,
 			       int cmd, u16 *permissions, struct nlattr *arg,
 			       struct p4tc_path_nlattrs *nl_path_attrs,
@@ -1332,6 +1469,44 @@ static int p4tc_table_entry_flush(struct net *net, struct sk_buff *skb,
 	return ret;
 }
 
+static int
+p4tc_table_tc_act_from_bpf_act(struct tcf_p4act *p4act,
+			       struct p4tc_table_entry_value *value,
+			       struct p4tc_table_entry_act_bpf *act_bpf)
+__must_hold(RCU)
+{
+	struct p4tc_table_entry_act_bpf_kern *new_act_bpf;
+	struct tcf_p4act_params *p4act_params;
+	struct p4tc_act_param *param;
+	unsigned long param_id, tmp;
+	u8 *params_cursor;
+
+	p4act_params = rcu_dereference(p4act->params);
+	/* Skip act_id */
+	params_cursor = (u8 *)act_bpf + sizeof(act_bpf->act_id);
+	idr_for_each_entry_ul(&p4act_params->params_idr, param, tmp, param_id) {
+		const struct p4tc_type *type = param->type;
+		const u32 type_bytesz = BITS_TO_BYTES(type->container_bitsz);
+
+		memcpy(param->value, params_cursor, type_bytesz);
+		params_cursor += type_bytesz;
+	}
+
+	new_act_bpf = kzalloc(sizeof(*new_act_bpf), GFP_ATOMIC);
+	if (unlikely(!new_act_bpf))
+		return -ENOMEM;
+
+	new_act_bpf->act_bpf = *act_bpf;
+	new_act_bpf->act_bpf.hit = 1;
+	new_act_bpf->act_bpf.is_default_hit_act = 0;
+	new_act_bpf->act_bpf.is_default_miss_act = 0;
+
+	rcu_assign_pointer(p4act->act_bpf, new_act_bpf);
+	value->acts[0] = (struct tc_action *)p4act;
+
+	return 0;
+}
+
 static enum hrtimer_restart entry_timer_handle(struct hrtimer *timer)
 {
 	struct p4tc_table_entry_value *value =
@@ -1490,6 +1665,158 @@ __must_hold(RCU)
 	return ret;
 }
 
+static bool p4tc_table_check_entry_act(struct p4tc_table *table,
+				       struct tc_action *entry_act)
+{
+	struct tcf_p4act *entry_p4act = to_p4act(entry_act);
+	struct p4tc_table_act *table_act;
+
+	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+		if (table_act->act->common.p_id != entry_p4act->p_id ||
+		    table_act->act->a_id != entry_p4act->act_id)
+			continue;
+
+		if (!(table_act->flags &
+		      BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY)))
+			return true;
+	}
+
+	return false;
+}
+
+static bool p4tc_table_check_no_act(struct p4tc_table *table)
+{
+	struct p4tc_table_act *table_act;
+
+	if (list_empty(&table->tbl_acts_list))
+		return false;
+
+	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
+		if (p4tc_table_act_is_noaction(table_act))
+			return true;
+	}
+
+	return false;
+}
+
+struct p4tc_table_entry_create_state {
+	struct p4tc_act *act;
+	struct tcf_p4act *p4_act;
+	struct p4tc_table_entry *entry;
+	u64 aging_ms;
+	u16 permissions;
+};
+
+static int
+p4tc_table_entry_init_bpf(struct p4tc_pipeline *pipeline,
+			  struct p4tc_table *table, u32 entry_key_sz,
+			  struct p4tc_table_entry_act_bpf *act_bpf,
+			  struct p4tc_table_entry_create_state *state)
+{
+	const u32 keysz_bytes = P4TC_KEYSZ_BYTES(table->tbl_keysz);
+	struct p4tc_table_entry_value *entry_value;
+	const u32 keysz_bits = table->tbl_keysz;
+	struct tcf_p4act *p4_act = NULL;
+	struct p4tc_table_entry *entry;
+	struct p4tc_act *act = NULL;
+	int err = -EINVAL;
+	u32 entrysz;
+
+	if (table->tbl_type != P4TC_TABLE_TYPE_EXACT)
+		goto out;
+
+	if (keysz_bytes != P4TC_KEYSZ_BYTES(entry_key_sz))
+		goto out;
+
+	if (atomic_read(&table->tbl_nelems) + 1 > table->tbl_max_entries)
+		goto out;
+
+	if (act_bpf) {
+		act = p4a_tmpl_get(pipeline, NULL, act_bpf->act_id, NULL);
+		if (!act) {
+			err = -ENOENT;
+			goto out;
+		}
+	} else {
+		if (!p4tc_table_check_no_act(table)) {
+			err = -EPERM;
+			goto out;
+		}
+	}
+
+	entrysz = sizeof(*entry) + keysz_bytes +
+		  sizeof(struct p4tc_table_entry_value);
+
+	entry = kzalloc(entrysz, GFP_ATOMIC);
+	if (unlikely(!entry)) {
+		err = -ENOMEM;
+		goto act_put;
+	}
+	entry->key.keysz = keysz_bits;
+
+	entry_value = p4tc_table_entry_value(entry);
+	entry_value->prio = p4tc_table_entry_exact_prio();
+	entry_value->permissions = state->permissions;
+	entry_value->aging_ms = state->aging_ms;
+
+	if (act) {
+		p4_act = p4a_runt_prealloc_get_next(act);
+		if (!p4_act) {
+			err = -ENOENT;
+			goto idr_rm;
+		}
+
+		if (!p4tc_table_check_entry_act(table, &p4_act->common)) {
+			err = -EPERM;
+			goto free_prealloc;
+		}
+
+		err = p4tc_table_tc_act_from_bpf_act(p4_act, entry_value,
+						     act_bpf);
+		if (err < 0)
+			goto free_prealloc;
+	}
+
+	state->act = act;
+	state->p4_act = p4_act;
+	state->entry = entry;
+
+	return 0;
+
+free_prealloc:
+	if (p4_act)
+		p4a_runt_prealloc_put(act, p4_act);
+
+idr_rm:
+	p4tc_table_entry_free_prio(table, entry_value->prio);
+
+	kfree(entry);
+
+act_put:
+	if (act)
+		p4tc_action_put_ref(act);
+out:
+	return err;
+}
+
+static void
+p4tc_table_entry_create_state_put(struct p4tc_table *table,
+				  struct p4tc_table_entry_create_state *state)
+{
+	struct p4tc_table_entry_value *value;
+
+	if (state->act)
+		p4a_runt_prealloc_put(state->act, state->p4_act);
+
+	value = p4tc_table_entry_value(state->entry);
+	p4tc_table_entry_free_prio(table, value->prio);
+
+	kfree(state->entry);
+
+	if (state->act)
+		p4tc_action_put_ref(state->act);
+}
+
 /* Invoked from both control and data path  */
 static int __p4tc_table_entry_update(struct p4tc_pipeline *pipeline,
 				     struct p4tc_table *table,
@@ -1628,38 +1955,111 @@ __must_hold(RCU)
 	return ret;
 }
 
-static bool p4tc_table_check_entry_act(struct p4tc_table *table,
-				       struct tc_action *entry_act)
+static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
 {
-	struct tcf_p4act *entry_p4act = to_p4act(entry_act);
-	struct p4tc_table_act *table_act;
+	return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
+}
 
-	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
-		if (table_act->act->common.p_id != entry_p4act->p_id ||
-		    table_act->act->a_id != entry_p4act->act_id)
-			continue;
+/* If the profile_id specified by the eBPF program for entry create or update is
+ * invalid, we'll use the default profile ID's aging value
+ */
+static void
+p4tc_table_entry_assign_aging(struct p4tc_table *table,
+			      struct p4tc_table_entry_create_state *state,
+			      u32 profile_id)
+{
+	struct p4tc_table_timer_profile *timer_profile;
 
-		if (!(table_act->flags &
-		      BIT(P4TC_TABLE_ACTS_DEFAULT_ONLY)))
-			return true;
-	}
+	timer_profile = p4tc_table_timer_profile_find(table, profile_id);
+	if (!timer_profile)
+		timer_profile = p4tc_table_timer_profile_find(table,
+							      P4TC_DEFAULT_TIMER_PROFILE_ID);
 
-	return false;
+	state->aging_ms = timer_profile->aging_ms;
 }
 
-static bool p4tc_table_check_no_act(struct p4tc_table *table)
+int p4tc_table_entry_create_bpf(struct p4tc_pipeline *pipeline,
+				struct p4tc_table *table,
+				struct p4tc_table_entry_key *key,
+				struct p4tc_table_entry_act_bpf *act_bpf,
+				u32 profile_id)
 {
-	struct p4tc_table_act *table_act;
+	u16 tblperm = rcu_dereference(table->tbl_permissions)->permissions;
+	u8 __mask[sizeof(struct p4tc_table_entry_mask) +
+		  BITS_TO_BYTES(P4TC_MAX_KEYSZ)] = { 0 };
+	struct p4tc_table_entry_mask *mask = (void *)&__mask;
+	struct p4tc_table_entry_create_state state = {0};
+	struct p4tc_table_entry_value *value;
+	int err;
 
-	if (list_empty(&table->tbl_acts_list))
-		return false;
+	p4tc_table_entry_assign_aging(table, &state, profile_id);
 
-	list_for_each_entry(table_act, &table->tbl_acts_list, node) {
-		if (p4tc_table_act_is_noaction(table_act))
-			return true;
-	}
+	state.permissions = p4tc_table_entry_tbl_permcpy(tblperm);
+	err = p4tc_table_entry_init_bpf(pipeline, table, key->keysz,
+					act_bpf, &state);
+	if (err < 0)
+		return err;
+	p4tc_table_entry_assign_key_exact(&state.entry->key, key->fa_key);
 
-	return false;
+	value = p4tc_table_entry_value(state.entry);
+	/* Entry is always dynamic when it comes from the data path */
+	value->is_dyn = true;
+
+	err = __p4tc_table_entry_create(pipeline, table, state.entry, mask,
+					P4TC_ENTITY_KERNEL, false);
+	if (err < 0)
+		goto put_state;
+
+	refcount_set(&value->entries_ref, 1);
+	if (state.p4_act)
+		p4a_runt_init_flags(state.p4_act);
+
+	return 0;
+
+put_state:
+	p4tc_table_entry_create_state_put(table, &state);
+
+	return err;
+}
+
+int p4tc_table_entry_update_bpf(struct p4tc_pipeline *pipeline,
+				struct p4tc_table *table,
+				struct p4tc_table_entry_key *key,
+				struct p4tc_table_entry_act_bpf *act_bpf,
+				u32 profile_id)
+{
+	struct p4tc_table_entry_create_state state = {0};
+	struct p4tc_table_entry_value *value;
+	int err;
+
+	p4tc_table_entry_assign_aging(table, &state, profile_id);
+
+	state.permissions = P4TC_PERMISSIONS_UNINIT;
+	err = p4tc_table_entry_init_bpf(pipeline, table, key->keysz, act_bpf,
+					&state);
+	if (err < 0)
+		return err;
+
+	p4tc_table_entry_assign_key_exact(&state.entry->key, key->fa_key);
+
+	value = p4tc_table_entry_value(state.entry);
+	value->is_dyn = !!state.aging_ms;
+	err = __p4tc_table_entry_update(pipeline, table, state.entry, NULL,
+					P4TC_ENTITY_KERNEL, false);
+
+	if (err < 0)
+		goto put_state;
+
+	refcount_set(&value->entries_ref, 1);
+	if (state.p4_act)
+		p4a_runt_init_flags(state.p4_act);
+
+	return 0;
+
+put_state:
+	p4tc_table_entry_create_state_put(table, &state);
+
+	return err;
 }
 
 static struct nla_policy
@@ -1731,11 +2131,6 @@ static int p4tc_tbl_attrs_update(struct net *net, struct p4tc_table *table,
 	return err;
 }
 
-static u16 p4tc_table_entry_tbl_permcpy(const u16 tblperm)
-{
-	return p4tc_ctrl_perm_rm_create(p4tc_data_perm_rm_create(tblperm));
-}
-
 #define P4TC_TBL_ENTRY_CU_FLAG_CREATE 0x1
 #define P4TC_TBL_ENTRY_CU_FLAG_UPDATE 0x2
 #define P4TC_TBL_ENTRY_CU_FLAG_SET 0x4
@@ -1860,6 +2255,11 @@ __p4tc_table_entry_cu(struct net *net, u8 cu_flags, struct nlattr **tb,
 				       "Action is not allowed as entry action");
 			goto free_acts;
 		}
+
+		ret = p4tc_table_entry_act_bpf_change_flags(value->acts[0], 1,
+							    0, 0);
+		if (ret < 0)
+			goto free_acts;
 	} else {
 		if (!p4tc_table_check_no_act(table)) {
 			NL_SET_ERR_MSG_FMT(extack,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH net-next v12 15/15] p4tc: add P4 classifier
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (13 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
@ 2024-02-25 16:54 ` Jamal Hadi Salim
  2024-02-28 17:11 ` [PATCH net-next v12 00/15] Introducing P4TC (series 1) John Fastabend
  2024-02-29 17:13 ` Paolo Abeni
  16 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-25 16:54 UTC (permalink / raw)
  To: netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf

Introduce the P4 tc classifier. The main task of this classifier is to manage
the lifetime of pipeline instances across one or more netdev ports.
Note: a defined pipeline may be instantiated multiple times across one or
more tc chains and at different priorities.

Note that part or all of the P4 pipeline could reside in tc, XDP or even
hardware, depending on how the P4 program was compiled. This classifier
deals only with the tc layer.
To use the P4 classifier you must specify a pipeline name that will be
associated with the filter instance, a software parser (eBPF) and a P4
control block datapath program (eBPF). Although this patchset does not deal
with offloads, it is also possible to load the h/w part using this filter.
We illustrate a few examples further below to clarify. Please treat the
illustrated split as an example - there are probably more pragmatic
approaches to splitting the pipeline; however, regardless of where the
different pieces of the pipeline are placed (tc, XDP, HW) and which part of
the pipeline each layer implements, these examples merely show what is
possible.

The pipeline is assumed to have already been created via a template.

For example, if we were to add a filter to ingress of a group of netdevs
(tc block 22) and associate it to P4 pipeline simple_l3 we could issue the
following command:

tc filter add block 22 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    action bpf obj $PARSER.o ... \
    action bpf obj $PROGNAME.o section p4tc/main

The above uses the classical tc action mechanism in which the first action
runs the P4 parser and, if that succeeds, the P4 control block is executed.
Note that, although not shown above, one could also append other
traditional tc actions to the command line.

Given that one of the objectives of this classifier is to manage the
lifetime of the P4 program, and said program may be split across tc, XDP
and hardware, we allow specifying where the XDP (and, in the future,
hardware) programs can be found. For this reason, when instantiating the
filter one can specify the location of the associated XDP program using
the syntax "prog type xdp progname", where progname refers to the XDP eBPF
program name. The control plane side (below we show iproute2) is
responsible for loading the XDP program; the kernel is unaware of the XDP
side.
There is an ongoing discussion in the P4TC community biweekly meetings
which is likely to add another location definition, "prog type hw", to
specify the hardware object file name and other related attributes. The
current thinking is that this h/w piece will also go via the p4 classifier.

An example using xdp and tc:

tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \
    prog type xdp obj $PARSER.o section p4tc/parse-xdp \
    action bpf obj $PROGNAME.o section p4tc/main

In this case, the parser will be executed at the XDP layer and the rest of
the P4 control block as a tc action.

For illustration's sake, the h/w one looks as follows (please note there
are still ongoing discussions in the meetings - the example is here merely
to illustrate the tc filter functionality):

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "mypnameprog.o" ... \
   prog type xdp obj $PARSER.o section p4tc/parse-xdp \
   action bpf obj $PROGNAME.o section p4tc/main

The theory of operations is as follows:

================================1. PARSING================================

The packet first encounters the parser.
The parser is implemented in eBPF, residing at either the TC or XDP
level. The parsed header values are stored in a shared per-cpu eBPF map.
When the parser runs at the XDP level, we load it into XDP using the
control plane (tc filter command) and pin it to a file.
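
The parsing step above can be sketched in plain C. This is a userspace
illustration only - names such as parsed_hdrs and parse_l3 are hypothetical
and not part of P4TC. In the real datapath this logic is generated eBPF
running at the TC or XDP level, and the extracted values land in the shared
per-cpu map rather than a local struct:

```c
/* Userspace sketch (NOT the actual P4TC parser) of header extraction.
 * In the real program this is generated eBPF and the output is written
 * into a shared per-cpu eBPF map.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

struct parsed_hdrs {		/* stand-in for the per-cpu map value */
	uint8_t  ip_proto;
	uint32_t ip_saddr;	/* network byte order */
	uint32_t ip_daddr;	/* network byte order */
};

/* Parse Ethernet + IPv4 out of a raw frame; return 0 on success. */
static int parse_l3(const uint8_t *pkt, size_t len, struct parsed_hdrs *h)
{
	if (len < 14 + 20)	/* Ethernet header + minimal IPv4 header */
		return -1;
	if (pkt[12] != 0x08 || pkt[13] != 0x00)	/* EtherType != IPv4 */
		return -1;
	h->ip_proto = pkt[14 + 9];
	memcpy(&h->ip_saddr, pkt + 14 + 12, 4);
	memcpy(&h->ip_daddr, pkt + 14 + 16, 4);
	return 0;
}
```

A control block running later in the pipeline would then read these parsed
values (from the map, in the eBPF case) to build its table lookup key.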

=============================2. ACTIONS=============================

In the above example, the P4 program (minus the parser) is encoded in an
action ($PROGNAME.o). It should be noted that classical tc actions
continue to work: in other words, someone could decide to add a mirred
action to mirror all packets before or after the eBPF action.

tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
    action bpf obj $PARSER.o section p4tc/parse \
    action bpf obj $PROGNAME.o section p4tc/main \
    action mirred egress mirror index 1 dev $P1 \
    action bpf obj $ANOTHERPROG.o section mysect/section-1

It should also be noted that it is feasible to split some of the ingress
datapath into XDP first and more into TC later (as shown above, for
example, where the parser runs at the XDP level). YMMV.
Regardless of which scheme is chosen, none of these choices affect the
UAPI. It all depends on whether you generate code to load on XDP vs
tc, etc. We expect the compiler to evolve over time (but that has
nothing to do with the kernel part).
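
As a side note on how lookups behave once entries exist: for non-exact
tables, the kernel selects among all matching entries the one with the
lowest numeric priority (this is the smallest_prio loop in
__p4tc_entry_lookup in the patch above). A minimal userspace sketch of that
selection rule, with hypothetical names (tbl_entry, select_entry):

```c
/* Userspace sketch of the entry-selection rule used by the ternary
 * lookup path: among the entries in a matching bucket, the winner is
 * the one with the smallest numeric priority value. struct tbl_entry
 * is a simplified stand-in for struct p4tc_table_entry{,_value}.
 */
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct tbl_entry {
	uint32_t prio;
	int id;		/* illustrative payload */
};

/* Mirrors the kernel loop: track the smallest priority seen so far. */
static const struct tbl_entry *select_entry(const struct tbl_entry *e,
					    size_t n)
{
	const struct tbl_entry *best = NULL;
	uint32_t smallest_prio = UINT32_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		if (e[i].prio <= smallest_prio) {
			smallest_prio = e[i].prio;
			best = &e[i];
		}
	}
	return best;
}
```

Note the `<=` comparison matches the kernel code, so when two entries tie
on priority the one visited later in the bucket wins.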

Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 include/uapi/linux/pkt_cls.h |  14 ++
 net/sched/Kconfig            |  12 ++
 net/sched/Makefile           |   1 +
 net/sched/cls_p4.c           | 305 +++++++++++++++++++++++++++++++++++
 net/sched/p4tc/Makefile      |   4 +-
 net/sched/p4tc/trace.c       |  10 ++
 net/sched/p4tc/trace.h       |  44 +++++
 7 files changed, 389 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index dd313a727..4a811e3c1 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -692,6 +692,20 @@ enum {
 
 #define TCA_MATCHALL_MAX (__TCA_MATCHALL_MAX - 1)
 
+/* P4 classifier */
+
+enum {
+	TCA_P4_UNSPEC,
+	TCA_P4_CLASSID,
+	TCA_P4_ACT,
+	TCA_P4_PNAME,
+	TCA_P4_PIPEID,
+	TCA_P4_PAD,
+	__TCA_P4_MAX,
+};
+
+#define TCA_P4_MAX (__TCA_P4_MAX - 1)
+
 /* Extended Matches */
 
 struct tcf_ematch_tree_hdr {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 5dbae579b..66d7fed27 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -565,6 +565,18 @@ config NET_CLS_MATCHALL
 	  To compile this code as a module, choose M here: the module will
 	  be called cls_matchall.
 
+config NET_CLS_P4
+	tristate "P4 classifier"
+	select NET_CLS
+	select NET_P4TC
+	help
+	  If you say Y here, you will be able to bind a P4 pipeline
+	  program. You will need to install a P4 template representing the
+	  program successfully to use this feature.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called cls_p4.
+
 config NET_EMATCH
 	bool "Extended Matches"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 581f9dd69..b4f9ef48d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_NET_CLS_CGROUP)	+= cls_cgroup.o
 obj-$(CONFIG_NET_CLS_BPF)	+= cls_bpf.o
 obj-$(CONFIG_NET_CLS_FLOWER)	+= cls_flower.o
 obj-$(CONFIG_NET_CLS_MATCHALL)	+= cls_matchall.o
+obj-$(CONFIG_NET_CLS_P4)	+= cls_p4.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)	+= em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
diff --git a/net/sched/cls_p4.c b/net/sched/cls_p4.c
new file mode 100644
index 000000000..a266e777b
--- /dev/null
+++ b/net/sched/cls_p4.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/cls_p4.c - P4 Classifier
+ * Copyright (c) 2022-2024, Mojatatu Networks
+ * Copyright (c) 2022-2024, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+
+#include <net/p4tc.h>
+
+#include "p4tc/trace.h"
+
+struct cls_p4_head {
+	struct tcf_exts exts;
+	struct tcf_result res;
+	struct rcu_work rwork;
+	struct p4tc_pipeline *pipeline;
+	u32 handle;
+};
+
+static int p4_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+		       struct tcf_result *res)
+{
+	struct cls_p4_head *head = rcu_dereference_bh(tp->root);
+
+	if (unlikely(!head)) {
+		pr_err("P4 classifier not found\n");
+		return -1;
+	}
+
+	trace_p4_classify(skb, head->pipeline);
+
+	*res = head->res;
+
+	return tcf_exts_exec(skb, &head->exts, res);
+}
+
+static int p4_init(struct tcf_proto *tp)
+{
+	return 0;
+}
+
+static void __p4_destroy(struct cls_p4_head *head)
+{
+	tcf_exts_destroy(&head->exts);
+	tcf_exts_put_net(&head->exts);
+	p4tc_pipeline_put(head->pipeline);
+	kfree(head);
+}
+
+static void p4_destroy_work(struct work_struct *work)
+{
+	struct cls_p4_head *head =
+		container_of(to_rcu_work(work), struct cls_p4_head, rwork);
+
+	rtnl_lock();
+	__p4_destroy(head);
+	rtnl_unlock();
+}
+
+static void p4_destroy(struct tcf_proto *tp, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (!head)
+		return;
+
+	tcf_unbind_filter(tp, &head->res);
+
+	if (tcf_exts_get_net(&head->exts))
+		tcf_queue_work(&head->rwork, p4_destroy_work);
+	else
+		__p4_destroy(head);
+}
+
+static void *p4_get(struct tcf_proto *tp, u32 handle)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (head && head->handle == handle)
+		return head;
+
+	return NULL;
+}
+
+static const struct nla_policy p4_policy[TCA_P4_MAX + 1] = {
+	[TCA_P4_UNSPEC] = { .type = NLA_UNSPEC },
+	[TCA_P4_CLASSID] = { .type = NLA_U32 },
+	[TCA_P4_ACT] = { .type = NLA_NESTED },
+	[TCA_P4_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ },
+	[TCA_P4_PIPEID] = { .type = NLA_U32 },
+};
+
+static int p4_set_parms(struct net *net, struct tcf_proto *tp,
+			struct cls_p4_head *head, unsigned long base,
+			struct nlattr **tb, struct nlattr *est, u32 flags,
+			struct netlink_ext_ack *extack)
+{
+	int err;
+
+	err = tcf_exts_validate_ex(net, tp, tb, est, &head->exts, flags, 0,
+				   extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_P4_CLASSID]) {
+		head->res.classid = nla_get_u32(tb[TCA_P4_CLASSID]);
+		tcf_bind_filter(tp, &head->res, base);
+	}
+
+	return 0;
+}
+
+static int p4_change(struct net *net, struct sk_buff *in_skb,
+		     struct tcf_proto *tp, unsigned long base, u32 handle,
+		     struct nlattr **tca, void **arg, u32 flags,
+		     struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+	struct p4tc_pipeline *pipeline = NULL;
+	struct nlattr *tb[TCA_P4_MAX + 1];
+	struct cls_p4_head *new_cls;
+	char *pname = NULL;
+	u32 pipeid = 0;
+	int err;
+
+	if (!tca[TCA_OPTIONS]) {
+		NL_SET_ERR_MSG(extack, "Must provide pipeline options");
+		return -EINVAL;
+	}
+
+	if (head)
+		return -EEXIST;
+
+	err = nla_parse_nested(tb, TCA_P4_MAX, tca[TCA_OPTIONS], p4_policy,
+			       extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_P4_PNAME])
+		pname = nla_data(tb[TCA_P4_PNAME]);
+
+	if (tb[TCA_P4_PIPEID])
+		pipeid = nla_get_u32(tb[TCA_P4_PIPEID]);
+
+	pipeline = p4tc_pipeline_find_get(net, pname, pipeid, extack);
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	if (!p4tc_pipeline_sealed(pipeline)) {
+		err = -EINVAL;
+		NL_SET_ERR_MSG(extack, "Pipeline must be sealed before use");
+		goto pipeline_put;
+	}
+
+	new_cls = kzalloc(sizeof(*new_cls), GFP_KERNEL);
+	if (!new_cls) {
+		err = -ENOMEM;
+		goto pipeline_put;
+	}
+
+	err = tcf_exts_init(&new_cls->exts, net, TCA_P4_ACT, 0);
+	if (err)
+		goto err_exts_init;
+
+	if (!handle)
+		handle = 1;
+
+	new_cls->handle = handle;
+
+	err = p4_set_parms(net, tp, new_cls, base, tb, tca[TCA_RATE], flags,
+			   extack);
+	if (err)
+		goto err_set_parms;
+
+	new_cls->pipeline = pipeline;
+	*arg = head;
+	rcu_assign_pointer(tp->root, new_cls);
+	return 0;
+
+err_set_parms:
+	tcf_exts_destroy(&new_cls->exts);
+err_exts_init:
+	kfree(new_cls);
+pipeline_put:
+	p4tc_pipeline_put(pipeline);
+	return err;
+}
+
+static int p4_delete(struct tcf_proto *tp, void *arg, bool *last,
+		     bool rtnl_held, struct netlink_ext_ack *extack)
+{
+	*last = true;
+	return 0;
+}
+
+static void p4_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		    bool rtnl_held)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (arg->count < arg->skip)
+		goto skip;
+
+	if (!head)
+		return;
+	if (arg->fn(tp, head, arg) < 0)
+		arg->stop = 1;
+skip:
+	arg->count++;
+}
+
+static int p4_dump(struct net *net, struct tcf_proto *tp, void *fh,
+		   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
+{
+	struct cls_p4_head *head = fh;
+	struct nlattr *nest;
+
+	if (!head)
+		return skb->len;
+
+	t->tcm_handle = head->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, TCA_P4_PNAME, head->pipeline->common.name))
+		goto nla_put_failure;
+
+	if (head->res.classid &&
+	    nla_put_u32(skb, TCA_P4_CLASSID, head->res.classid))
+		goto nla_put_failure;
+
+	if (tcf_exts_dump(skb, &head->exts))
+		goto nla_put_failure;
+
+	nla_nest_end(skb, nest);
+
+	if (tcf_exts_dump_stats(skb, &head->exts) < 0)
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static void p4_bind_class(void *fh, u32 classid, unsigned long cl, void *q,
+			  unsigned long base)
+{
+	struct cls_p4_head *head = fh;
+
+	if (head && head->res.classid == classid) {
+		if (cl)
+			__tcf_bind_filter(q, &head->res, base);
+		else
+			__tcf_unbind_filter(q, &head->res);
+	}
+}
+
+static struct tcf_proto_ops cls_p4_ops __read_mostly = {
+	.kind		= "p4",
+	.classify	= p4_classify,
+	.init		= p4_init,
+	.destroy	= p4_destroy,
+	.get		= p4_get,
+	.change		= p4_change,
+	.delete		= p4_delete,
+	.walk		= p4_walk,
+	.dump		= p4_dump,
+	.bind_class	= p4_bind_class,
+	.owner		= THIS_MODULE,
+};
+
+static int __init cls_p4_init(void)
+{
+	return register_tcf_proto_ops(&cls_p4_ops);
+}
+
+static void __exit cls_p4_exit(void)
+{
+	unregister_tcf_proto_ops(&cls_p4_ops);
+}
+
+module_init(cls_p4_init);
+module_exit(cls_p4_exit);
+
+MODULE_AUTHOR("Mojatatu Networks");
+MODULE_DESCRIPTION("P4 Classifier");
+MODULE_LICENSE("GPL");
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 73ccb53c4..04302a3ac 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 
+CFLAGS_trace.o := -I$(src)
+
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
 	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
-	p4tc_filter.o p4tc_runtime_api.o
+	p4tc_filter.o p4tc_runtime_api.o trace.o
 obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/trace.c b/net/sched/p4tc/trace.c
new file mode 100644
index 000000000..683313407
--- /dev/null
+++ b/net/sched/p4tc/trace.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+#include <net/p4tc.h>
+
+#ifndef __CHECKER__
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+EXPORT_TRACEPOINT_SYMBOL_GPL(p4_classify);
+#endif
diff --git a/net/sched/p4tc/trace.h b/net/sched/p4tc/trace.h
new file mode 100644
index 000000000..80abec13b
--- /dev/null
+++ b/net/sched/p4tc/trace.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM p4tc
+
+#if !defined(__P4TC_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __P4TC_TRACE_H
+
+#include <linux/tracepoint.h>
+
+struct p4tc_pipeline;
+
+TRACE_EVENT(p4_classify,
+	    TP_PROTO(struct sk_buff *skb, struct p4tc_pipeline *pipeline),
+
+	    TP_ARGS(skb, pipeline),
+
+	    TP_STRUCT__entry(__string(pname, pipeline->common.name)
+			     __field(u32,  p_id)
+			     __field(u32,  ifindex)
+			     __field(u32,  ingress)
+			    ),
+
+	    TP_fast_assign(__assign_str(pname, pipeline->common.name);
+			   __entry->p_id = pipeline->common.p_id;
+			   __entry->ifindex = skb->dev->ifindex;
+			   __entry->ingress = skb_at_tc_ingress(skb);
+			  ),
+
+	    TP_printk("dev=%u dir=%s pipeline=%s p_id=%u",
+		      __entry->ifindex,
+		      __entry->ingress ? "ingress" : "egress",
+		      __get_str(pname),
+		      __entry->p_id
+		     )
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+#include <trace/define_trace.h>
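
Once the module is loaded, the new tracepoint can be observed through
tracefs in the usual way (paths assume tracefs is mounted at
/sys/kernel/tracing; the sample output values are illustrative, only the
format string comes from the TP_printk above):

```shell
echo 1 > /sys/kernel/tracing/events/p4tc/p4_classify/enable
cat /sys/kernel/tracing/trace_pipe
# example line (illustrative values):
#   ... p4_classify: dev=2 dir=ingress pipeline=simple_l3 p_id=1
```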
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* RE: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (14 preceding siblings ...)
  2024-02-25 16:54 ` [PATCH net-next v12 15/15] p4tc: add P4 classifier Jamal Hadi Salim
@ 2024-02-28 17:11 ` John Fastabend
  2024-02-28 18:23   ` Jamal Hadi Salim
  2024-03-01  7:02   ` Martin KaFai Lau
  2024-02-29 17:13 ` Paolo Abeni
  16 siblings, 2 replies; 71+ messages in thread
From: John Fastabend @ 2024-02-28 17:11 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

Jamal Hadi Salim wrote:
> This is the first patchset of two. In this patch we are submitting 15 which
> cover the minimal viable P4 PNA architecture.
> 
> __Description of these Patches__
> 
> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> as need basis for the P4 program requirement. This patch makes a small incision
> into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> effect the classical tc action (example patch#2 just increases the size of the
> action names from 16->64B).
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
> 
> The core P4TC code implements several P4 objects.
> 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
>    for P4 pipelines.
> 4) Patch #9 introduces the action templates and associated CRUD commands.
> 5) Patch #10 introduce the action runtime infrastructure.
> 6) Patch #11 introduces the concept of P4 table templates and associated
>    CRUD commands for tables.
> 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> 10) Patch #15 introduces the TC classifier P4 used at runtime.
> 
> Daniel, please look again at patch #15.
> 
> There are a few more patches (5) not in this patchset that deal with test
> cases, etc.
> 
> What is P4?
> -----------
> 
> The Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
> 
> The current P4 landscape includes an extensive range of deployments, products,
> projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> currently offer P4-native NICs. P4 is currently curated by the Linux
> Foundation[9].
> 
> On why P4 - see small treatise here:[4].
> 
> What is P4TC?
> -------------
> 
> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> and its associated objects and state are attachend to a kernel _netns_ structure.
> IOW, if we had two programs across netns' or within a netns they have no
> visibility to each others objects (unlike for example TC actions whose kinds are
> "global" in nature or eBPF maps visavis bpftool).

[...]

Although I appreciate that a good amount of work went into building the
above, I'll add my concerns here so they are not lost. These are
architecture concerns, not "this line of code needs some tweak".

 - It encodes a DSL into the kernel. It's unclear how we pick which DSLs
   get pushed into the kernel and which do not. Do we take any DSL folks
   can code up?
   I would prefer a lower-level intermediate language. My view is this is
   a lesson we should have learned from OVS. OVS had wider adoption and
   still struggled in some ways; my belief is this is very similar to OVS.
   (Also, OVS was novel/great at a lot of things, fwiw.)

 - We have a general purpose language in BPF that can already implement
   the P4 DSL. I don't see any need for another set of code when the end
   goal, running P4 in the Linux network stack, is doable. Typically we
   reject duplicate things when they don't have concrete benefits.

 - P4 as a DSL is not optimized for general purpose CPUs, but rather for
   hardware pipelines. Although it can be optimized for CPUs, that is a
   harder problem. A review of some of the VPP/DPDK work here is useful.

 - The P4 infrastructure already has a p4c backend; this adds another P4
   backend. Instead of getting the rather small group of people to work
   on a single backend, we are now creating another one.

 - Common reasons that I think would justify a new P4 backend and
   implementation would be speed/efficiency or expressiveness. I think this
   implementation is neither more efficient nor more expressive. Concrete
   examples on expressiveness would be interesting, but I don't see any.
   Loops were mentioned once, but the latest kernels have loop support.

 - The main talking point in many slide decks about p4tc is hardware
   offload. This seems like the main benefit of pushing the P4 DSL into the
   kernel. But we have no hw implementation, not even a vendor stepping up
   to comment on this implementation and how it will work for them. HW
   introduces all sorts of interesting problems that I don't see how we
   solve in this framework. For example, a few off the top of my head:
   syncing current state into tc, how an operator programs tc inside
   hardware constraints, who writes the p4 models for these hardware
   devices, whether they fit into this 'tc' infrastructure, partial
   updates into hardware seeming unlikely to work for most hardware, ...

 - The kfuncs are mostly duplicates of map ops we already have in the BPF
   API. The motivation, by my read, is to use netlink instead of bpf
   commands. I don't agree with this; optimizing for some low-level debug
   workflow a developer uses is the wrong design space. Actual users should
   not be deploying this via ssh into boxes. The workflow will not scale,
   and really we need tooling and infra to land P4 programs across the
   network. This is orders of magnitude more pain if it's an endpoint
   solution and not a middlebox/switch solution. As a switch solution I
   don't see how p4tc sw scales to even TOR packet rates. So you need
   tooling on top, and users interact with the tooling, not the Linux
   widget/debugger at the bottom.

 - There is no performance analysis: the comment was functionality before
   performance, which I disagree with. If this were a first implementation
   and we didn't already have a way to do the P4 DSL, then I might agree,
   but here we have an existing solution, so this one should be at least
   as good as, and ideally better than, the existing backend. Software
   datapath adoption is going to be critically based on performance. I
   don't see taking even a 5% hit when porting over to P4 from an
   existing datapath.

Commentary: I think it's 100% correct to debate how the P4 DSL is
implemented in the kernel. I can't see why this is off limits; this
patch set proposes one approach, and there could be many. BPF comes up
not because I'm some BPF zealot that needs the P4 DSL in BPF, but because
it exists today and there is even a P4 backend. Fundamentally, I don't
see the value we add by creating two P4 pipelines; this is going to
create duplication all the way from the P4 tooling/infra down through
the kernel.
From your side you keep saying I'm bike-shedding and demanding BPF, but
from my perspective you're introducing another entire toolchain simply
because you want some low-level debug commands that 99% of P4 users
should not be using or caring about.

To try and be constructive, some things that would change my mind: a
vendor showing how the hardware can be used would be compelling; or
performance numbers showing this somehow yields a more performant
implementation; or, lastly, the current p4c implementation being
fundamentally broken somehow.

Thanks
John

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-28 17:11 ` [PATCH net-next v12 00/15] Introducing P4TC (series 1) John Fastabend
@ 2024-02-28 18:23   ` Jamal Hadi Salim
  2024-02-28 21:13     ` John Fastabend
  2024-03-01  7:02   ` Martin KaFai Lau
  1 sibling, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-28 18:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

On Wed, Feb 28, 2024 at 12:11 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Jamal Hadi Salim wrote:
> > This is the first patchset of two. In this patch we are submitting 15 which
> > cover the minimal viable P4 PNA architecture.
> >
> > __Description of these Patches__
> >
> > Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> > as need basis for the P4 program requirement. This patch makes a small incision
> > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> > effect the classical tc action (example patch#2 just increases the size of the
> > action names from 16->64B).
> > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> >
> > The core P4TC code implements several P4 objects.
> > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
> >    for P4 pipelines.
> > 4) Patch #9 introduces the action templates and associated CRUD commands.
> > 5) Patch #10 introduce the action runtime infrastructure.
> > 6) Patch #11 introduces the concept of P4 table templates and associated
> >    CRUD commands for tables.
> > 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> > 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> > 10) Patch #15 introduces the TC classifier P4 used at runtime.
> >
> > Daniel, please look again at patch #15.
> >
> > There are a few more patches (5) not in this patchset that deal with test
> > cases, etc.
> >
> > What is P4?
> > -----------
> >
> > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > domain-specific programming language for specifying data plane behavior.
> >
> > The current P4 landscape includes an extensive range of deployments, products,
> > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> > currently offer P4-native NICs. P4 is currently curated by the Linux
> > Foundation[9].
> >
> > On why P4 - see small treatise here:[4].
> >
> > What is P4TC?
> > -------------
> >
> > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> > and its associated objects and state are attachend to a kernel _netns_ structure.
> > IOW, if we had two programs across netns' or within a netns they have no
> > visibility to each others objects (unlike for example TC actions whose kinds are
> > "global" in nature or eBPF maps visavis bpftool).
>
> [...]
>
> Although I appreciate a good amount of work went into building above I'll
> add my concerns here so they are not lost. These are architecture concerns
> not this line of code needs some tweak.
>
>  - It encodes a DSL into the kernel. Its unclear how we pick which DSL gets
>    pushed into the kernel and which do not. Do we take any DSL folks can code
>    up?
>    I would prefer a lower level  intermediate langauge. My view is this is
>    a lesson we should have learned from OVS. OVS had wider adoption and
>    still struggled in some ways my belief is this is very similar to OVS.
>    (Also OVS was novel/great at a lot of things fwiw.)
>
>  - We have a general purpose language in BPF that can implement the P4 DSL
>    already. I don't see any need for another set of code when the end goal
>    is running P4 in Linux network stack is doable. Typically we reject
>    duplicate things when they don't have concrete benefits.
>
>  - P4 as a DSL is not optimized for general purpose CPUs, but
>    rather hardware pipelines. Although it can be optimized for CPUs its
>    a harder problem. A review of some of the VPP/DPDK work here is useful.
>
>  - P4 infrastructure already has a p4c backend this is adding another P4
>    backend instead of getting the rather small group of people to work on
>    a single backend we are now creating another one.
>
>  - Common reasons I think would justify a new P4 backend and implementation
>    would be: speed efficiency, or expressiveness. I think this
>    implementation is neither more efficient nor more expressive. Concrete
>    examples on expressiveness would be interesting, but I don't see any.
>    Loops were mentioned once but latest kernels have loop support.
>
>  - The main talking point for many slide decks about p4tc is hardware
>    offload. This seems like the main benefit of pushing the P4 DSL into the
>    kernel. But, we have no hw implementation, not even a vendor stepping up
>    to comment on this implementation and how it will work for them. HW
>    introduces all sorts of interesting problems that I don't see how we
>    solve in this framework. For example a few off the top of my head:
>    syncing current state into tc, how does operator program tc inside
>    constraints, who writes the p4 models for these hardware devices, do
>    they fit into this 'tc' infrastructure, partial updates into hardware
>    seems unlikely to work for most hardware, ...
>
>  - The kfuncs are mostly duplicates of map ops we already have in BPF API.
>    The motivation by my read is to use netlink instead of bpf commands. I
>    don't agree with this, optimizing for some low level debug a developer
>    uses is the wrong design space. Actual users should not be deploying
>    this via ssh into boxes. The workflow will not scale and really we need
>    tooling and infra to land P4 programs across the network. This is orders
>    of more pain if its an endpoint solution and not a middlebox/switch
>    solution. As a switch solution I don't see how p4tc sw scales to even TOR
>    packet rates. So you need tooling on top and user interact with the
>    tooling not the Linux widget/debugger at the bottom.
>
>  - There is no performance analysis: The comment was functionality before
>    performance which I disagree with. If it was a first implementation and
>    we didn't have a way to do P4 DSL already than I might agree, but here
>    we have an existing solution so it should be at least as good and should
>    be better than existing backend. A software datapath adoption is going
>    to be critically based on performance. I don't see taking even a 5% hit
>    when porting over to P4 from existing datapath.
>
> Commentary: I think its 100% correct to debate how the P4 DSL is
> implemented in the kernel. I can't see why this is off limits somehow this
> patch set proposes an approach there could be many approaches. BPF comes up
> not because I'm some BPF zealot that needs P4 DSL in BPF, but because it
> exists today there is even a P4 backend. Fundamentally I don't see the
> value add we get by creating two P4 pipelines this is going to create
> duplication all the way up to the P4 tooling/infra through to the kernel.
> From your side you keep saying I'm bike shedding and demanding BPF, but
> from my perspective your introducing another entire toolchain simply
> because you want some low level debug commands that 99% of P4 users should
> not be using or caring about.
>
> To try and be constructive some things that would change my mind would
> be a vendor showing how hardware can be used. This would be compelling.
> Or performance showing its somehow gets a more performant implementation.
> Or lastly if the current p4c implementation is fundamentally broken
> somehow.
>

John,
With all due respect, we are going back over the same points again,
recycled many times over, to which I have responded many times.
It's getting tiring. This is exactly why I called it bikeshedding.
Let's just agree to disagree.

cheers,
jamal

> Thanks
> John

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-28 18:23   ` Jamal Hadi Salim
@ 2024-02-28 21:13     ` John Fastabend
  0 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2024-02-28 21:13 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

Jamal Hadi Salim wrote:
> On Wed, Feb 28, 2024 at 12:11 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Jamal Hadi Salim wrote:
> > > This is the first patchset of two. In this patch we are submitting 15 which
> > > cover the minimal viable P4 PNA architecture.
> > >
> > > __Description of these Patches__
> > >
> > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> > > as need basis for the P4 program requirement. This patch makes a small incision
> > > into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> > > effect the classical tc action (example patch#2 just increases the size of the
> > > action names from 16->64B).
> > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > >
> > > The core P4TC code implements several P4 objects.
> > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> > > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> > > 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
> > >    for P4 pipelines.
> > > 4) Patch #9 introduces the action templates and associated CRUD commands.
> > > 5) Patch #10 introduce the action runtime infrastructure.
> > > 6) Patch #11 introduces the concept of P4 table templates and associated
> > >    CRUD commands for tables.
> > > 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> > > 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> > > 10) Patch #15 introduces the TC classifier P4 used at runtime.
> > >
> > > Daniel, please look again at patch #15.
> > >
> > > There are a few more patches (5) not in this patchset that deal with test
> > > cases, etc.
> > >
> > > What is P4?
> > > -----------
> > >
> > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > domain-specific programming language for specifying data plane behavior.
> > >
> > > The current P4 landscape includes an extensive range of deployments, products,
> > > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> > > currently offer P4-native NICs. P4 is currently curated by the Linux
> > > Foundation[9].
> > >
> > > On why P4 - see small treatise here:[4].
> > >
> > > What is P4TC?
> > > -------------
> > >
> > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> > > and its associated objects and state are attachend to a kernel _netns_ structure.
> > > IOW, if we had two programs across netns' or within a netns they have no
> > > visibility to each others objects (unlike for example TC actions whose kinds are
> > > "global" in nature or eBPF maps visavis bpftool).
> >
> > [...]
> >
> > Although I appreciate a good amount of work went into building above I'll
> > add my concerns here so they are not lost. These are architecture concerns
> > not this line of code needs some tweak.
> >
> >  - It encodes a DSL into the kernel. Its unclear how we pick which DSL gets
> >    pushed into the kernel and which do not. Do we take any DSL folks can code
> >    up?
> >    I would prefer a lower level  intermediate langauge. My view is this is
> >    a lesson we should have learned from OVS. OVS had wider adoption and
> >    still struggled in some ways my belief is this is very similar to OVS.
> >    (Also OVS was novel/great at a lot of things fwiw.)
> >
> >  - We have a general purpose language in BPF that can implement the P4 DSL
> >    already. I don't see any need for another set of code when the end goal
> >    is running P4 in Linux network stack is doable. Typically we reject
> >    duplicate things when they don't have concrete benefits.
> >
> >  - P4 as a DSL is not optimized for general purpose CPUs, but
> >    rather hardware pipelines. Although it can be optimized for CPUs its
> >    a harder problem. A review of some of the VPP/DPDK work here is useful.
> >
> >  - P4 infrastructure already has a p4c backend this is adding another P4
> >    backend instead of getting the rather small group of people to work on
> >    a single backend we are now creating another one.
> >
> >  - Common reasons I think would justify a new P4 backend and implementation
> >    would be: speed efficiency, or expressiveness. I think this
> >    implementation is neither more efficient nor more expressive. Concrete
> >    examples on expressiveness would be interesting, but I don't see any.
> >    Loops were mentioned once but latest kernels have loop support.
> >
> >  - The main talking point for many slide decks about p4tc is hardware
> >    offload. This seems like the main benefit of pushing the P4 DSL into the
> >    kernel. But we have no hw implementation, not even a vendor stepping up
> >    to comment on this implementation and how it will work for them. HW
> >    introduces all sorts of interesting problems that I don't see how we
> >    solve in this framework. For example, a few off the top of my head:
> >    syncing current state into tc, how an operator programs tc inside hw
> >    constraints, who writes the P4 models for these hardware devices,
> >    whether they fit into this 'tc' infrastructure, partial updates into
> >    hardware seeming unlikely to work for most hardware, ...
> >
> >  - The kfuncs are mostly duplicates of map ops we already have in the BPF
> >    API. The motivation, by my read, is to use netlink instead of bpf
> >    commands. I don't agree with this; optimizing for some low-level debug
> >    workflow a developer uses is the wrong design space. Actual users should
> >    not be deploying this via ssh into boxes. The workflow will not scale,
> >    and really we need tooling and infra to land P4 programs across the
> >    network. This is orders of magnitude more pain if it's an endpoint
> >    solution and not a middlebox/switch solution. As a switch solution I
> >    don't see how p4tc sw scales to even TOR packet rates. So you need
> >    tooling on top, and users interact with the tooling, not the Linux
> >    widget/debugger at the bottom.
> >
> >  - There is no performance analysis. The comment was "functionality before
> >    performance", which I disagree with. If this were a first implementation
> >    and we didn't already have a way to do the P4 DSL, then I might agree;
> >    but here we have an existing solution, so this should be at least as
> >    good as, and ideally better than, the existing backend. Software
> >    datapath adoption is going to be critically based on performance. I
> >    don't see taking even a 5% hit when porting over to P4 from an existing
> >    datapath.
> >
> > Commentary: I think it's 100% correct to debate how the P4 DSL is
> > implemented in the kernel. I can't see why this is somehow off limits; this
> > patch set proposes one approach, but there could be many approaches. BPF
> > comes up not because I'm some BPF zealot that needs the P4 DSL in BPF, but
> > because it exists today and there is even a P4 backend. Fundamentally I
> > don't see the value add we get by creating two P4 pipelines; this is going
> > to create duplication all the way from the P4 tooling/infra down through to
> > the kernel. From your side you keep saying I'm bikeshedding and demanding
> > BPF, but from my perspective you're introducing another entire toolchain
> > simply because you want some low-level debug commands that 99% of P4 users
> > should not be using or caring about.
> >
> > To try and be constructive, some things that would change my mind would
> > be: a vendor showing how hardware can be used, which would be compelling;
> > performance numbers showing this somehow yields a more performant
> > implementation; or lastly, evidence that the current p4c implementation is
> > fundamentally broken somehow.
> >
> 
> John,
> With all due respect, we are going back over the same points again,
> recycled many times over, points to which I have responded to you many
> times. It's getting tiring. This is exactly why I called it bikeshedding.
> Let's just agree to disagree.

Yep, we agree to disagree, and I put them here as a summary so others
can see them and think it over/decide where they stand on it. In the
end you don't need my ACK here, but I wanted my opinion summarized.

> 
> cheers,
> jamal
> 
> > Thanks
> > John



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12  01/15] net: sched: act_api: Introduce P4 actions list
  2024-02-25 16:54 ` [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list Jamal Hadi Salim
@ 2024-02-29 15:05   ` Paolo Abeni
  2024-02-29 18:21     ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Abeni @ 2024-02-29 15:05 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> In P4 we need to generate new actions "on the fly" based on the
> specified P4 action definition. P4 action kinds, like the pipeline
> they are attached to, must be per net namespace, as opposed to native
> action kinds, which are global. For that reason, we chose to create a
> separate structure to store P4 actions.
> 
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/net/act_api.h |   8 ++-
>  net/sched/act_api.c   | 123 +++++++++++++++++++++++++++++++++++++-----
>  net/sched/cls_api.c   |   2 +-
>  3 files changed, 116 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index 77ee0c657..f22be14bb 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
>  
>  struct tc_action_ops {
>  	struct list_head head;
> +	struct list_head p4_head;
>  	char    kind[IFNAMSIZ];
>  	enum tca_id  id; /* identifier should match kind */
>  	unsigned int	net_id;
> @@ -199,10 +200,12 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>  int tcf_idr_release(struct tc_action *a, bool bind);
>  
>  int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
> +int tcf_register_p4_action(struct net *net, struct tc_action_ops *act);
>  int tcf_unregister_action(struct tc_action_ops *a,
>  			  struct pernet_operations *ops);
>  #define NET_ACT_ALIAS_PREFIX "net-act-"
>  #define MODULE_ALIAS_NET_ACT(kind)	MODULE_ALIAS(NET_ACT_ALIAS_PREFIX kind)
> +void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act);
>  int tcf_action_destroy(struct tc_action *actions[], int bind);
>  int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
>  		    int nr_actions, struct tcf_result *res);
> @@ -210,8 +213,9 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
>  		    struct nlattr *est,
>  		    struct tc_action *actions[], int init_res[], size_t *attr_size,
>  		    u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
> -struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
> -					 struct netlink_ext_ack *extack);
> +struct tc_action_ops *
> +tc_action_load_ops(struct net *net, struct nlattr *nla,
> +		   u32 flags, struct netlink_ext_ack *extack);
>  struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
>  				    struct nlattr *nla, struct nlattr *est,
>  				    struct tc_action_ops *a_o, int *init_res,
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 9ee622fb1..23ef394f2 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
>  	kfree(cookie);
>  }
>  
> +static unsigned int p4_act_net_id;
> +
> +struct tcf_p4_act_net {
> +	struct list_head act_base;
> +	rwlock_t act_mod_lock;

Note that rwlock in networking code is discouraged, as they have to be
unfair, see commit 0daf07e527095e64ee8927ce297ab626643e9f51.

In this specific case I think there should be no problems, as it is
extremely hard/impossible to have serious contention on the write
side. Also, there is already an existing rwlock nearby. So not a
blocker, but IMHO worth noting.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12  06/15] p4tc: add P4 data types
  2024-02-25 16:54 ` [PATCH net-next v12 06/15] p4tc: add P4 data types Jamal Hadi Salim
@ 2024-02-29 15:09   ` Paolo Abeni
  2024-02-29 18:31     ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Abeni @ 2024-02-29 15:09 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> Introduce abstraction that represents P4 data types.
> This also introduces the Kconfig and Makefile which later patches use.
> Numeric types can be little endian, host order or big endian definitions.
> The abstraction also supports defining:
> 
> a) bitstrings using P4 annotations that look like "bit<X>" where X
>    is the number of bits defined in a type
> 
> b) bitslices, such that one can define in P4 slices like bit<8>[0-3] and
>    bit<16>[4-9]: a 4-bit slice from bits 0-3 and a 6-bit slice from bits
>    4-9, respectively.
> 
> c) specialized types like dev (which stands for a netdev), key, etc
> 
> Each type has a bitsize, a name (for debugging purposes), an ID and
> methods/ops. The P4 types will be used by externs, dynamic actions, packet
> headers and other parts of P4TC.
> 
> Each type has four ops:
> 
> - validate_p4t: Which validates if a given value of a specific type
>   meets valid boundary conditions.
> 
> - create_bitops: Which, given a bitsize, bitstart and bitend allocates and
>   returns a mask and a shift value. For example, if we have type
>   bit<8>[3-3] meaning bitstart = 3 and bitend = 3, we'll create a mask
>   which would only give us the fourth bit of a bit8 value, that is, 0x08.
>   Since we are interested in the fourth bit, the bit shift value will be 3.
>   This is also useful if an "irregular" bitsize is used, for example,
>   bit24. In that case bitstart = 0 and bitend = 23. Shift will be 0 and
>   the mask will be 0xFFFFFF00 if the machine is big endian.
> 
> - host_read: Which reads the value of a given type and transforms it to
>   host order (if needed)
> 
> - host_write: Which writes a provided host order value and transforms it
>   to the type's native order (if needed)

The type has a 'print' op, but I can't easily find where such an op is
used or what its role is?!?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12  03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions
  2024-02-25 16:54 ` [PATCH net-next v12 03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions Jamal Hadi Salim
@ 2024-02-29 16:19   ` Paolo Abeni
  2024-02-29 18:30     ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Abeni @ 2024-02-29 16:19 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> The initialisation of P4TC action instances requires access to a struct
> p4tc_act (which appears in later patches) to help us retrieve
> information like the P4 action parameters, etc. In order to retrieve
> struct p4tc_act we need the pipeline name or id and the action name or id.
> Also recall that P4TC action IDs are P4-specific and net namespace scoped,
> not global like standard tc action IDs.
> The init callback parameters from tc_action_ops had no way of
> supplying us that information. To solve this issue, we decided to create a
> new tc_action_ops callback (init_ops), that provides us with the
> tc_action_ops struct which then gives us the pipeline and action
> name.

The new init op looks a bit unfortunate. I *think* it would be better
to add the new argument to the existing init op.

> In addition we add a new refcount to struct tc_action_ops called
> dyn_ref, which accounts for how many action instances we have of a specific
> action.
> 
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/net/act_api.h |  6 ++++++
>  net/sched/act_api.c   | 14 +++++++++++---
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/act_api.h b/include/net/act_api.h
> index c839ff57c..69be5ed83 100644
> --- a/include/net/act_api.h
> +++ b/include/net/act_api.h
> @@ -109,6 +109,7 @@ struct tc_action_ops {
>  	char    kind[ACTNAMSIZ];
>  	enum tca_id  id; /* identifier should match kind */
>  	unsigned int	net_id;
> +	refcount_t p4_ref;
>  	size_t	size;
>  	struct module		*owner;
>  	int     (*act)(struct sk_buff *, const struct tc_action *,
> @@ -120,6 +121,11 @@ struct tc_action_ops {
>  			struct nlattr *est, struct tc_action **act,
>  			struct tcf_proto *tp,
>  			u32 flags, struct netlink_ext_ack *extack);
> +	/* This should be merged with the original init action */
> +	int     (*init_ops)(struct net *net, struct nlattr *nla,
> +			    struct nlattr *est, struct tc_action **act,
> +			   struct tcf_proto *tp, struct tc_action_ops *ops,

shouldn't the 'ops' argument be 'const'?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
                   ` (15 preceding siblings ...)
  2024-02-28 17:11 ` [PATCH net-next v12 00/15] Introducing P4TC (series 1) John Fastabend
@ 2024-02-29 17:13 ` Paolo Abeni
  2024-02-29 18:49   ` Jamal Hadi Salim
                     ` (2 more replies)
  16 siblings, 3 replies; 71+ messages in thread
From: Paolo Abeni @ 2024-02-29 17:13 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, dan.daly, andy.fingerhut,
	chris.sommers, mattyk, bpf

On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> This is the first patchset of two. In this series we are submitting 15
> patches which cover the minimal viable P4 PNA architecture.
> 
> __Description of these Patches__
> 
> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> an as-needed basis per the P4 program's requirements. This patch makes a small
> incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have
> no effect on classical tc actions (for example, patch #2 just increases the
> size of action names from 16B to 64B).
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
> 
> The core P4TC code implements several P4 objects.
> 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> 2) Patch #7 introduces the templating API, i.e. CRUD commands for templates
> 3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
>    for P4 pipelines.
> 4) Patch #9 introduces the action templates and associated CRUD commands.
> 5) Patch #10 introduces the action runtime infrastructure.
> 6) Patch #11 introduces the concept of P4 table templates and associated
>    CRUD commands for tables.
> 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> 10) Patch #15 introduces the TC classifier P4 used at runtime.
> 
> Daniel, please look again at patch #15.
> 
> There are a few more patches (5) not in this patchset that deal with test
> cases, etc.
> 
> What is P4?
> -----------
> 
> The Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
> 
> The current P4 landscape includes an extensive range of deployments, products,
> projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> currently offer P4-native NICs. P4 is currently curated by the Linux
> Foundation[9].
> 
> On why P4 - see small treatise here:[4].
> 
> What is P4TC?
> -------------
> 
> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> and its associated objects and state are attached to a kernel _netns_ structure.
> IOW, whether two programs are in different netns' or within the same netns,
> they have no visibility into each other's objects (unlike for example TC
> actions, whose kinds are "global" in nature, or eBPF maps vis-a-vis bpftool).
> 
> P4TC builds on top of many years of Linux TC experiences of a netlink control
> path interface coupled with a software datapath with an equivalent offloadable
> hardware datapath. In this patch series we are focussing only on the s/w
> datapath. The s/w and h/w path equivalence that TC provides is relevant
> for a primary use case of P4 where some (currently) large consumers of NICs
> provide vendors their datapath specs in P4. In such a case one could generate
> the specified datapaths in s/w and test/validate the requirements before
> hardware acquisition (example: [12]).
> 
> Unlike other approaches, such as TC Flower, which require kernel and user
> space changes when new datapath objects like packet headers are introduced,
> P4TC, with these patches, provides _kernel and user space code change
> independence_. Meaning:
> A P4 program describes headers, parsers, etc alongside the datapath processing;
> the compiler uses the P4 program as input and generates several artifacts which
> are then loaded into the kernel to manifest the intended datapath. In addition
> to the generated datapath, control path constructs are generated. The process is
> described further below in "P4TC Workflow".
> 
> There have been many discussions and meetings within the community since
> about 2015 in regards to P4 over TC[2] and we are finally proving to the
> naysayers that we do get stuff done!
> 
> A lot more of the P4TC motivation is captured at:
> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> 
> __P4TC Architecture__
> 
> The current architecture was described at netdevconf 0x17[14] and if you prefer
> academic conference papers, a short paper is available here[15].
> 
> There are 4 parts:
> 
> 1) A Template CRUD provisioning API for manifesting a P4 program and its
> associated objects in the kernel. The template provisioning API uses netlink.
> See patch in part 2.
> 
> 2) A Runtime CRUD+ API which is used for controlling the different runtime
> behaviors of the P4 objects. The runtime API uses netlink. See notes further
> down and the patch descriptions later.
> 
> 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> Any object that requires control plane interaction resides in the TC domain
> and is subject to the CRUD runtime API.  The intended goal is to make use of the
> tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
> 
> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> by a compiler based on the P4 spec. When accessing any P4 object that requires
> control plane interfaces, the eBPF code accesses the P4TC side from #3 above
> using kfuncs.
> 
> The generated eBPF code is derived from [13] with enhancements and fixes to meet
> our requirements.
> 
> __P4TC Workflow__
> 
> The Development and instantiation workflow for P4TC is as follows:
> 
>   A) A developer writes a P4 program, "myprog"
> 
>   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> 
>      a) A shell script which forms the template definitions for the different
>      P4 objects "myprog" utilizes (tables, externs, actions, etc). See #1 above.
> 
>      b) The parser and the rest of the datapath are generated as eBPF and need
>      to be compiled into binaries. At the moment the parser and the main control
>      block are generated as separate eBPF programs, but this could change in
>      the future (without affecting any kernel code). See #4 above.
> 
>      c) A json introspection file used for the control plane (by iproute2/tc).
> 
>   C) At this point the artifacts from #1,#4 could be handed to an operator
>      (the operator could be the same person as the developer from #A, #B).
> 
>      i) For the eBPF part, either the operator is handed an ebpf binary or
>      source which they compile at this point into a binary.
>      The operator executes the shell script(s) to manifest the functional
>      "myprog" into the kernel.
> 
>      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
>      (illustrated below as "block 22").
> 
>      Example instantiation where the parser is a separate action:
>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>         action bpf obj $PARSER.o section p4tc/parse \
>         action bpf obj $PROGNAME.o section p4tc/main"
> 
> See individual patches in part 2 for more examples, tc vs xdp, etc. Also see
> the section on "challenges" (further below in this cover letter).
> 
> Once "myprog" P4 program is instantiated one can start performing operations
> on table entries and/or actions at runtime as described below.
> 
> __P4TC Runtime Control Path__
> 
> The control interface builds on past tc experience and tries to get things
> right from the beginning (for example, filtering is decoupled from the
> existing object TLVs and made generic); also, the code is written in
> such a way that it is mostly lockless.
> 
> The P4TC control interface, using netlink, provides what we call a CRUDPS
> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> Publish.  From a high level PoV the following describes a conformant high level
> API (both on netlink data model and code level):
> 
> 	Create(</path/to/object, DATA>+)
> 	Read(</path/to/object>, [optional filter])
> 	Update(</path/to/object>, DATA>+)
> 	Delete(</path/to/object>, [optional filter])
> 	Subscribe(</path/to/object>, [optional filter])
> 
> Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
> to a table then a "Delete" implies "flush" and a "Read" implies "dump"; but if
> it points to an entry (by specifying a key) then "Delete" implies deleting
> that entry and "Read" implies reading that single entry. It should be noted
> that both "Delete" and "Read" take an optional filter parameter. The filter
> can define further refinements to what the control plane wants read or deleted.
> "Subscribe" uses built in netlink event management. It, as well, takes a filter
> which can further refine what events get generated to the control plane (taken
> out of this patchset, to be re-added with consideration of [16]).
> 
> Lets show some runtime samples:
> 
> ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
>   tc p4ctrl create myprog/table/mytable \
>    dstAddr 10.0.1.2/32 action send_to_port param port eno1
> 
> ..Batch create entries
>   tc p4ctrl create myprog/table/mytable \
>   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
>   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
>   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
> 
> ..Get an entry (note "read" is interchangeably used as "get" which is a common
> 		semantic in tc):
>   tc p4ctrl read myprog/table/mytable \
>    dstAddr 10.0.2.2/32
> 
> ..dump mytable
>   tc p4ctrl read myprog/table/mytable
> 
> ..dump mytable for all entries whose key fits within 10.1.0.0/16
>   tc p4ctrl read myprog/table/mytable \
>   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
> 
> ..dump all mytable entries which have an action send_to_port with param "eno1"
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1"
> 
> The filter expression is powerful, f.e you could say:
> 
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1" && \
>          key/myprog/mytable/dstAddr = 10.1.0.0/16
> 
> It also works on built-in metadata; for example, the following dumps
> entries from mytable that have seen activity in the last 10 secs:
>   tc p4ctrl get myprog/table/mytable \
>   filter msecs_since < 10000
> 
> Delete follows the same syntax as get/read, so for the sake of brevity we
> won't show more examples beyond how to flush mytable:
> 
>   tc p4ctrl delete myprog/table/mytable
> 
> Mystery question: How do we achieve iproute2-kernel independence and
> how does "tc p4ctrl" as a cli know how to program the kernel given an
> arbitrary command line as shown above? Answer(s): It queries the
> compiler generated json file in "P4TC Workflow" #B.c above. The json file has
> enough details to figure out that we have a program called "myprog" which has a
> table "mytable" that has a key name "dstAddr" which happens to be type ipv4
> address prefix. The json file also provides details to show that the table
> "mytable" supports an action called "send_to_port" which accepts a parameter
> "port" of type netdev (see the types patch for all supported P4 data types).
> All P4 components have names, IDs, and types - so this makes it very easy to map
> into netlink.
> Once user space tc/p4ctrl validates the human command input, it creates
> standard binary netlink structures (TLVs etc) which are sent to the kernel.
> See the runtime table entry patch for more details.
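For illustration only, a hypothetical fragment of such a compiler-generated introspection file might look like the following. The field names here are invented to mirror the description above ("myprog", "mytable", "dstAddr", "send_to_port", "port"), and are not the actual p4c-tc output format:

```json
{
  "pipeline": { "name": "myprog", "id": 1 },
  "tables": [
    {
      "name": "mytable",
      "id": 1,
      "keys": [
        { "name": "dstAddr", "type": "ipv4-prefix" }
      ],
      "actions": [
        {
          "name": "send_to_port",
          "params": [ { "name": "port", "type": "dev" } ]
        }
      ]
    }
  ]
}
```

With names, IDs and types carried this way, a cli can translate a human command line into netlink TLVs without any hardcoded program knowledge.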
> 
> __P4TC Datapath__
> 
> The P4TC s/w datapath execution is generated as eBPF. Any objects that require
> control interfacing reside in the "P4TC domain" and are controlled via netlink
> as described above. Per packet execution and state and even objects that do not
> require control interfacing (like the P4 parser) are generated as eBPF.
> 
> A packet arriving on s/w ingress of any of the ports on block 22 will first be
> run through the (generated eBPF) parser component to extract the headers (the
> IP destination address labelled "dstAddr" above).
> The datapath then proceeds to use "dstAddr", table ID and pipeline ID
> as a key to do a lookup in myprog's "mytable" which returns the action params
> which are then used to execute the action in the eBPF datapath (eventually
> sending out packets to eno1).
> On a table miss, mytable's default miss action (not described) is executed.
> 
> __Testing__
> 
> Speaking of testing - we have 200-300 tdc test cases (which will be in the
> second patchset).
> These tests are run on our CICD system on pull requests and after commits are
> approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
> input) including:
> checkpatch, sparse, smatch, coccinelle, and 32-bit and 64-bit builds tested
> on X86, ARM64, and emulated BE via qemu s390. We trigger performance testing
> in the CICD to catch performance regressions (currently only on the control
> path, but in the future for the datapath).
> Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory
> sanitizer but recently added support for concurrency sanitizer.
> Before main releases we ensure each patch will compile on its own to help
> with git bisect, and we run the xmas tree tool. We eventually run the code
> through Coverity.
> 
> In addition we are working on enabling a tool that will take a P4 program, run
> it through the compiler, and generate permutations of traffic patterns via
> symbolic execution that will test both positive and negative datapath code
> paths. The test generator tool integration is still work in progress.
> Also: We have other code that tests parallelization, etc, which we are trying
> to find a fit for in the kernel tree's testing infra.
> 
> 
> __References__
> 
> [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> [9]https://p4.org/
> [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> [11]https://www.amd.com/en/accelerators/pensando
> [12]https://github.com/sonic-net/DASH/tree/main
> [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> [17.b]man tc-u32
> [18]man tc-pedit
> [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
> 
> --------
> HISTORY
> --------
> 
> Changes in Version 12
> ----------------------
> 
> 0) Introduce back 15 patches (v11 had 5)
> 
> 1) From discussions with Daniel:
>    i) Remove the XDP program association altogether. No refcounting, nothing.
>    ii) Remove prog type tc - everything is now an ebpf tc action.
> 
> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
> 
> 3) Add extack to specify how many entries (N of M) specified in a batch for
>    any of requested Create/Update/Delete succeeded. Prior to this it would
>    only tell us the batch failed to complete without giving us details of
>    which of M failed. Added as a debug aid.
> 
> Changes in Version 11
> ----------------------
> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>    will go out after this is merged.
> 
> 2) Change any references of IFNAMSIZ in the action code when referencing the
>    action name size to ACTNAMSIZ. Thanks to Marcelo.
> 
> Changes in Version 10
> ----------------------
> 1) A couple of patches from the earlier version were clean enough to submit,
>    so we did. This gave us room to split the two largest patches each into
>    two. Even though the split is not git-bisectable, and really some of it
>    didn't make much sense (eg splitting a create and update into one patch and
>    a delete and get into another), we made sure each of the split patches
>    compiled
>    independently. The idea is to reduce the number of lines of code to review
>    and when we get sufficient reviews we will put the splits together again.
>    See patch #12 and #13 as well as patches #7 and #8).
> 
> 2) Add more context in patch 0. Please READ!
> 
> 3) Added dump/delete filters back to the code - we had taken them out in the
>    earlier patches to reduce the amount of code for review - but in retrospect
>    we feel they are important enough to push earlier rather than later.
> 
> 
> Changes In version 9
> ---------------------
> 
> 1) Remove the largest patch (externs) to ease review.
> 
> 2) Break up action patches into two to ease review bringing down the patches
>    that need more scrutiny to 8 (the first 7 are almost trivial).
> 
> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
>    to provide consistency (Jiri).
> 
> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
>    by making them static. TBH, not sure if this is the right solution
>    but it makes sparse happy and hopefully someone will comment.
> 
> Changes In Version 8
> ---------------------
> 
> 1) Fix all the patchwork warnings and improve our ci to catch them in the future
> 
> 2) Reduce the number of patches to a basic max (15) to ease review.
> 
> Changes In Version 7
> -------------------------
> 
> 0) First time removing the RFC tag!
> 
> 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
> using bpf links is sufficient to protect us from someone replacing or deleting
> an eBPF program after it has been bound to a netdev.
> 
> 2) Add some reviewed-bys from Vlad.
> 
> 3) Small bug fixes from v6 based on testing for ebpf.
> 
> 4) Added the counter extern as a sample extern. We chose this example because
>    it is slightly complex: it is possible to invoke it directly from
>    the P4TC domain (in the case of direct counters) or from eBPF (indirect
>    counters). It is not exactly the most efficient implementation (a
>    reasonable counter implementation should be per-cpu).
> 
> Changes In RFC Version 6
> -------------------------
> 
> 1) Completed integration from the scriptable view to eBPF. Completed
>    integration of externs.
> 
> 2) Small bug fixes from v5 based on testing.
> 
> Changes In RFC Version 5
> -------------------------
> 
> 1) More integration from scriptable view to eBPF. Small bug fixes from last
>    integration.
> 
> 2) More streamlining support of externs via kfunc (create-on-miss, etc)
> 
> 3) eBPF linking for XDP.
> 
> There is more eBPF integration/streamlining coming (we are getting close to
> conversion from scriptable domain).
> 
> Changes In RFC Version 4
> -------------------------
> 
> 1) More integration from scriptable to eBPF. Small bug fixes.
> 
> 2) More streamlining support of externs via kfunc (one additional kfunc).
> 
> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
> 
> There is more eBPF integration coming. One thing we looked at but is not in this
> patchset but should be in the next is use of eBPF link in our loading (see
> "challenge #1" further below).
> 
> Changes In RFC Version 3
> -------------------------
> 
> These patches are still in a little bit of flux as we adjust to integrating
> eBPF. So there are small constructs that are used in V1 and 2 but no longer
> used in this version. We will make a V4 which will remove those.
> The changes from V2 are as follows:
> 
> 1) Feedback we got in V2 was to try to stick to one of the two modes. In this
> version we take that step and go solely with mode 2, whereas V2 had both modes.
> 
> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> into eBPF we introduce another kfunc which encapsulates Register as part of the
> extern interface.
> 
> 3) We have improved our CICD to include tools pointed to us by Simon. See
>    "Testing" further below. Thanks to Simon for that and other issues he caught.
>    Simon, we discussed on issue [7] but decided to keep that log since we think
>    it is useful.
> 
> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
>    re-discuss though; see: [5], [6].
> 
> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
> 
> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
>    guaranteed that either A or B must exist; however, let's make smatch happy.
>    Thanks to Simon and Dan Carpenter.
> 
> Changes In RFC Version 2
> -------------------------
> 
> Version 2 is the initial integration of the eBPF datapath.
> We took into consideration suggestions provided to use eBPF and put effort into
> analyzing eBPF as datapath which involved extensive testing.
> We implemented 6 approaches with eBPF, ran performance analysis, and presented
> our results at the P4 2023 workshop in Santa Clara [see: 1, 3], comparing each
> of the 6 against the scriptable P4TC, and concluded that 2 of the approaches
> are sensible (4 if you count XDP and TC separately).
> 
> Conclusions from the exercise: We lose the simple operational model we had
> prior to integrating eBPF. We do gain performance in most cases when the
> datapath is less compute-bound.
> For more discussion on our requirements vs journeying the eBPF path please
> scroll down to "Restating Our Requirements" and "Challenges".
> 
> This patch set presented two modes.
> mode1: the parser is entirely based on eBPF - whereas the rest of the
> SW datapath stays as _scriptable_ as in Version 1.
> mode2: All of the kernel s/w datapath (including parser) is in eBPF.
> 
> The key ingredient for eBPF, that we did not have access to in the past, is
> kfunc (it made a big difference for us to reconsider eBPF).
> 
> In V2 the two modes are mutually exclusive (IOW, you get to choose one
> or the other via Kconfig).

I think/fear that this series has a "quorum" problem: different voices
raised opposition, and nobody (?) outside the authors supported the
code and the feature.

Could the missing H/W offload support in the current form be the root
cause for such lack of support? Or are there interested parties that
have been quiet so far?

Thanks,

Paolo



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list
  2024-02-29 15:05   ` Paolo Abeni
@ 2024-02-29 18:21     ` Jamal Hadi Salim
  2024-03-01  7:30       ` Paolo Abeni
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-29 18:21 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Thu, Feb 29, 2024 at 10:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > In P4 we need to generate new actions "on the fly" based on the
> > specified P4 action definition. P4 action kinds, like the pipeline
> > they are attached to, must be per net namespace, as opposed to native
> > action kinds which are global. For that reason, we chose to create a
> > separate structure to store P4 actions.
> >
> > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> > Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  include/net/act_api.h |   8 ++-
> >  net/sched/act_api.c   | 123 +++++++++++++++++++++++++++++++++++++-----
> >  net/sched/cls_api.c   |   2 +-
> >  3 files changed, 116 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/net/act_api.h b/include/net/act_api.h
> > index 77ee0c657..f22be14bb 100644
> > --- a/include/net/act_api.h
> > +++ b/include/net/act_api.h
> > @@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
> >
> >  struct tc_action_ops {
> >       struct list_head head;
> > +     struct list_head p4_head;
> >       char    kind[IFNAMSIZ];
> >       enum tca_id  id; /* identifier should match kind */
> >       unsigned int    net_id;
> > @@ -199,10 +200,12 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
> >  int tcf_idr_release(struct tc_action *a, bool bind);
> >
> >  int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
> > +int tcf_register_p4_action(struct net *net, struct tc_action_ops *act);
> >  int tcf_unregister_action(struct tc_action_ops *a,
> >                         struct pernet_operations *ops);
> >  #define NET_ACT_ALIAS_PREFIX "net-act-"
> >  #define MODULE_ALIAS_NET_ACT(kind)   MODULE_ALIAS(NET_ACT_ALIAS_PREFIX kind)
> > +void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act);
> >  int tcf_action_destroy(struct tc_action *actions[], int bind);
> >  int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
> >                   int nr_actions, struct tcf_result *res);
> > @@ -210,8 +213,9 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
> >                   struct nlattr *est,
> >                   struct tc_action *actions[], int init_res[], size_t *attr_size,
> >                   u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
> > -struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
> > -                                      struct netlink_ext_ack *extack);
> > +struct tc_action_ops *
> > +tc_action_load_ops(struct net *net, struct nlattr *nla,
> > +                u32 flags, struct netlink_ext_ack *extack);
> >  struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
> >                                   struct nlattr *nla, struct nlattr *est,
> >                                   struct tc_action_ops *a_o, int *init_res,
> > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > index 9ee622fb1..23ef394f2 100644
> > --- a/net/sched/act_api.c
> > +++ b/net/sched/act_api.c
> > @@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
> >       kfree(cookie);
> >  }
> >
> > +static unsigned int p4_act_net_id;
> > +
> > +struct tcf_p4_act_net {
> > +     struct list_head act_base;
> > +     rwlock_t act_mod_lock;
>
> Note that rwlock in networking code is discouraged, as they have to be
> unfair, see commit 0daf07e527095e64ee8927ce297ab626643e9f51.
>
> In this specific case I think there should be no problems, as it is
> extremely hard/impossible to have serious contention on the write
> side. Also, there is already an existing rwlock nearby, so this is not
> a blocker but IMHO worth noting.
>

Sure - we can replace it. What's the preference? Spinlock?

cheers,
jamal

> Cheers,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions
  2024-02-29 16:19   ` Paolo Abeni
@ 2024-02-29 18:30     ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-29 18:30 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Thu, Feb 29, 2024 at 11:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > The initialisation of P4TC action instances requires access to a struct
> > p4tc_act (which appears in later patches) to help us retrieve
> > information like the P4 action parameters etc. In order to retrieve
> > struct p4tc_act we need the pipeline name or id and the action name or id.
> > Also recall that P4TC action IDs are P4 specific and net namespace specific,
> > not global like standard tc actions.
> > The init callback from tc_action_ops had no way of
> > supplying us that information. To solve this issue, we decided to create a
> > new tc_action_ops callback (init_ops), that provides us with the
> > tc_action_ops struct which then provides us with the pipeline and action
> > name.
>
> The new init op looks a bit unfortunate. I *think* it would be better
> to add the new argument to the existing init op.
>

Our initial goal was to avoid creating a much larger patch by changing
every other action's code, and we observe that ->init() already has 8
params ;-> Also, only dynamic actions need this extra extension.
If you still feel the change is needed, sure, we can make that change.

> > In addition we add a new refcount to struct tc_action_ops called
> > dyn_ref, which accounts for how many action instances we have of a specific
> > action.
> >
> > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> > Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  include/net/act_api.h |  6 ++++++
> >  net/sched/act_api.c   | 14 +++++++++++---
> >  2 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/net/act_api.h b/include/net/act_api.h
> > index c839ff57c..69be5ed83 100644
> > --- a/include/net/act_api.h
> > +++ b/include/net/act_api.h
> > @@ -109,6 +109,7 @@ struct tc_action_ops {
> >       char    kind[ACTNAMSIZ];
> >       enum tca_id  id; /* identifier should match kind */
> >       unsigned int    net_id;
> > +     refcount_t p4_ref;
> >       size_t  size;
> >       struct module           *owner;
> >       int     (*act)(struct sk_buff *, const struct tc_action *,
> > @@ -120,6 +121,11 @@ struct tc_action_ops {
> >                       struct nlattr *est, struct tc_action **act,
> >                       struct tcf_proto *tp,
> >                       u32 flags, struct netlink_ext_ack *extack);
> > +     /* This should be merged with the original init action */
> > +     int     (*init_ops)(struct net *net, struct nlattr *nla,
> > +                         struct nlattr *est, struct tc_action **act,
> > +                        struct tcf_proto *tp, struct tc_action_ops *ops,
>
> shouldn't the 'ops' argument be 'const'?
>

As it is right now this would be hard to do because we carry around a
refcnt in that struct. We will think about it.


cheers,
jamal


> Thanks,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 06/15] p4tc: add P4 data types
  2024-02-29 15:09   ` Paolo Abeni
@ 2024-02-29 18:31     ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-29 18:31 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Thu, Feb 29, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > Introduce abstraction that represents P4 data types.
> > This also introduces the Kconfig and Makefile which later patches use.
> > Numeric types could be little, host or big endian definitions. The abstraction
> > also supports defining:
> >
> > a) bitstrings using P4 annotations that look like "bit<X>" where X
> >    is the number of bits defined in a type
> >
> > b) bitslices such that one can define in P4 as bit<8>[0-3] and
> >    bit<16>[4-9]. A 4-bit slice from bits 0-3 and a 6-bit slice from bits
> >    4-9 respectively.
> >
> > c) specialized types like dev (which stands for a netdev), key, etc
> >
> > Each type has a bitsize, a name (for debugging purposes), an ID and
> > methods/ops. The P4 types will be used by externs, dynamic actions, packet
> > headers and other parts of P4TC.
> >
> > Each type has four ops:
> >
> > - validate_p4t: Which validates if a given value of a specific type
> >   meets valid boundary conditions.
> >
> > - create_bitops: Which, given a bitsize, bitstart and bitend allocates and
> >   returns a mask and a shift value. For example, if we have type
> >   bit<8>[3-3] meaning bitstart = 3 and bitend = 3, we'll create a mask
> >   which would only give us the fourth bit of a bit8 value, that is, 0x08.
> >   Since we are interested in the fourth bit, the bit shift value will be 3.
> >   This is also useful if an "irregular" bitsize is used, for example,
> >   bit24. In that case bitstart = 0 and bitend = 23. Shift will be 0 and
> >   the mask will be 0xFFFFFF00 if the machine is big endian.
> >
> > - host_read : Which reads the value of a given type and transforms it to
> >   host order (if needed)
> >
> > - host_write : Which writes a provided host order value and transforms it
> >   to the type's native order (if needed)
>
> The type has a 'print' op, but I can't easily find where such op is
> used and its role?!?
>

Thanks for catching that. We'll remove it. It was part of an
operational debugging patch that was not submitted.

cheers,
jamal

> Thanks,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 17:13 ` Paolo Abeni
@ 2024-02-29 18:49   ` Jamal Hadi Salim
  2024-02-29 20:52     ` John Fastabend
  2024-02-29 21:49   ` Singhai, Anjali
  2024-03-01 18:53   ` Chris Sommers
  2 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-29 18:49 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, dan.daly, andy.fingerhut,
	chris.sommers, mattyk, bpf

On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > This is the first patchset of two. In this patchset we are submitting 15
> > patches which cover the minimal viable P4 PNA architecture.
> >
> > __Description of these Patches__
> >
> > Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> > an as-needed basis per the P4 program's requirements. This patch makes a small
> > incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have
> > no effect on the classical tc actions (for example, patch #2 just increases
> > the size of action names from 16B to 64B).
> > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> >
> > The core P4TC code implements several P4 objects.
> > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> > 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> > 3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
> >    for P4 pipelines.
> > 4) Patch #9 introduces the action templates and associated CRUD commands.
> > 5) Patch #10 introduces the action runtime infrastructure.
> > 6) Patch #11 introduces the concept of P4 table templates and associated
> >    CRUD commands for tables.
> > 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> > 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> > 10) Patch #15 introduces the TC classifier P4 used at runtime.
> >
> > Daniel, please look again at patch #15.
> >
> > There are a few more patches (5) not in this patchset that deal with test
> > cases, etc.
> >
> > What is P4?
> > -----------
> >
> > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > domain-specific programming language for specifying data plane behavior.
> >
> > The current P4 landscape includes an extensive range of deployments, products,
> > projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
> > currently offer P4-native NICs. P4 is currently curated by the Linux
> > Foundation[9].
> >
> > On why P4 - see small treatise here:[4].
> >
> > What is P4TC?
> > -------------
> >
> > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> > and its associated objects and state are attached to a kernel _netns_ structure.
> > IOW, two programs, whether across netns' or within a netns, have no
> > visibility into each other's objects (unlike, for example, TC actions whose
> > kinds are "global" in nature, or eBPF maps vis-a-vis bpftool).
> >
> > P4TC builds on top of many years of Linux TC experiences of a netlink control
> > path interface coupled with a software datapath with an equivalent offloadable
> > hardware datapath. In this patch series we are focussing only on the s/w
> > datapath. The s/w and h/w path equivalence that TC provides is relevant
> > for a primary use case of P4 where some (currently) large consumers of NICs
> > provide vendors their datapath specs in P4. In such a case one could generate
> > specified datapaths in s/w and test/validate the requirements before hardware
> > acquisition (example [12]).
> >
> > Unlike other approaches such as TC Flower, which require kernel and user space
> > changes when new datapath objects like packet headers are introduced, P4TC,
> > with these patches, provides _kernel and user space code change independence_.
> > Meaning:
> > A P4 program describes headers, parsers, etc alongside the datapath processing;
> > the compiler uses the P4 program as input and generates several artifacts which
> > are then loaded into the kernel to manifest the intended datapath. In addition
> > to the generated datapath, control path constructs are generated. The process is
> > described further below in "P4TC Workflow".
> >
> > There have been many discussions and meetings within the community since
> > about 2015 in regards to P4 over TC[2] and we are finally proving to the
> > naysayers that we do get stuff done!
> >
> > A lot more of the P4TC motivation is captured at:
> > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> >
> > __P4TC Architecture__
> >
> > The current architecture was described at netdevconf 0x17[14] and if you prefer
> > academic conference papers, a short paper is available here[15].
> >
> > There are 4 parts:
> >
> > 1) A Template CRUD provisioning API for manifesting a P4 program and its
> > associated objects in the kernel. The template provisioning API uses netlink.
> > See patch in part 2.
> >
> > 2) A Runtime CRUD+ API code which is used for controlling the different runtime
> > behavior of the P4 objects. The runtime API uses netlink. See notes further
> > down and the patch descriptions later.
> >
> > 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> > Any object that requires control plane interaction resides in the TC domain
> > and is subject to the CRUD runtime API.  The intended goal is to make use of the
> > tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
> >
> > 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> > by a compiler based on the P4 spec. When accessing any P4 object that requires
> > control plane interfaces, the eBPF code accesses the P4TC side from #3 above
> > using kfuncs.
> >
> > The generated eBPF code is derived from [13] with enhancements and fixes to meet
> > our requirements.
> >
> > __P4TC Workflow__
> >
> > The Development and instantiation workflow for P4TC is as follows:
> >
> >   A) A developer writes a P4 program, "myprog"
> >
> >   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> >
> >      a) A shell script which forms the template definitions for the different
> >      P4 objects "myprog" utilizes (tables, externs, actions, etc). See #1 above.
> >
> >      b) the parser and the rest of the datapath are generated as eBPF and need
> >      to be compiled into binaries. At the moment the parser and the main control
> >      block are generated as separate eBPF programs, but this could change in
> >      the future (without affecting any kernel code). See #4 above.
> >
> >      c) A json introspection file used for the control plane (by iproute2/tc).
> >
> >   C) At this point the artifacts from #1 and #4 could be handed to an operator
> >      (the operator could be the same person as the developer from #A, #B).
> >
> >      i) For the eBPF part, the operator is handed either an eBPF binary or
> >      source which they compile into a binary at this point.
> >      The operator executes the shell script(s) to manifest the functional
> >      "myprog" into the kernel.
> >
> >      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
> >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
> >      (illustrated below as "block 22").
> >
> >      Example instantiation where the parser is a separate action:
> >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> >         action bpf obj $PARSER.o section p4tc/parse \
> >         action bpf obj $PROGNAME.o section p4tc/main"
> >
> > See individual patches in part c for more examples (tc vs xdp, etc). Also see
> > the section on "challenges" (further below in this cover letter).
> >
> > Once "myprog" P4 program is instantiated one can start performing operations
> > on table entries and/or actions at runtime as described below.
> >
> > __P4TC Runtime Control Path__
> >
> > The control interface builds on past tc experience and tries to get things
> > right from the beginning (for example, filtering is made generic rather than
> > depending on existing object TLVs); also the code is written in
> > such a way it is mostly lockless.
> >
> > The P4TC control interface, using netlink, provides what we call a CRUDPS
> > abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> > Publish.  From a high level PoV the following describes a conformant high level
> > API (both on netlink data model and code level):
> >
> >       Create(</path/to/object, DATA>+)
> >       Read(</path/to/object>, [optional filter])
> >       Update(</path/to/object>, DATA>+)
> >       Delete(</path/to/object>, [optional filter])
> >       Subscribe(</path/to/object>, [optional filter])
> >
> > Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
> > to a table then a "Delete" implies "flush" and a "Read" implies dump, but if
> > it points to an entry (by specifying a key) then "Delete" implies deleting
> > an entry and "Read" implies reading that single entry. It should be noted that
> > both "Delete" and "Read" take an optional filter parameter. The filter can
> > define further refinements to what the control plane wants read or deleted.
> > "Subscribe" uses built in netlink event management. It, as well, takes a filter
> > which can further refine what events get generated to the control plane (taken
> > out of this patchset, to be re-added with consideration of [16]).
> >
> > Lets show some runtime samples:
> >
> > ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
> >   tc p4ctrl create myprog/table/mytable \
> >    dstAddr 10.0.1.2/32 action send_to_port param port eno1
> >
> > ..Batch create entries
> >   tc p4ctrl create myprog/table/mytable \
> >   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
> >   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
> >   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
> >
> > ..Get an entry (note "read" is interchangeably used as "get" which is a common
> >               semantic in tc):
> >   tc p4ctrl read myprog/table/mytable \
> >    dstAddr 10.0.2.2/32
> >
> > ..dump mytable
> >   tc p4ctrl read myprog/table/mytable
> >
> > ..dump mytable for all entries whose key fits within 10.1.0.0/16
> >   tc p4ctrl read myprog/table/mytable \
> >   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
> >
> > ..dump all mytable entries which have an action send_to_port with param "eno1"
> >   tc p4ctrl get myprog/table/mytable \
> >   filter param/act/myprog/send_to_port/port = "eno1"
> >
> > The filter expression is powerful, f.e you could say:
> >
> >   tc p4ctrl get myprog/table/mytable \
> >   filter param/act/myprog/send_to_port/port = "eno1" && \
> >          key/myprog/mytable/dstAddr = 10.1.0.0/16
> >
> > It also works on built in metadata, example in the following case dumping
> > entries from mytable that have seen activity in the last 10 secs:
> >   tc p4ctrl get myprog/table/mytable \
> >   filter msecs_since < 10000
> >
> > Delete follows the same syntax as get/read, so for sake of brevity we won't
> > show more example than how to flush mytable:
> >
> >   tc p4ctrl delete myprog/table/mytable
> >
> > Mystery question: How do we achieve iproute2-kernel independence and
> > how does "tc p4ctrl" as a cli know how to program the kernel given an
> > arbitrary command line as shown above? Answer(s): It queries the
> > compiler generated json file in "P4TC Workflow" #B.c above. The json file has
> > enough details to figure out that we have a program called "myprog" which has a
> > table "mytable" that has a key name "dstAddr" which happens to be type ipv4
> > address prefix. The json file also provides details to show that the table
> > "mytable" supports an action called "send_to_port" which accepts a parameter
> > "port" of type netdev (see the types patch for all supported P4 data types).
> > All P4 components have names, IDs, and types - so this makes it very easy to map
> > into netlink.
> > Once user space tc/p4ctrl validates the human command input, it creates
> > standard binary netlink structures (TLVs etc) which are sent to the kernel.
> > See the runtime table entry patch for more details.
> >
> > __P4TC Datapath__
> >
> > The P4TC s/w datapath execution is generated as eBPF. Any objects that require
> > control interfacing reside in the "P4TC domain" and are controlled via netlink
> > as described above. Per packet execution and state and even objects that do not
> > require control interfacing (like the P4 parser) are generated as eBPF.
> >
> > A packet arriving on s/w ingress of any of the ports on block 22 will first be
> > exercised via the (generated eBPF) parser component to extract the headers (the
> > ip destination address in labelled "dstAddr" above).
> > The datapath then proceeds to use "dstAddr", table ID and pipeline ID
> > as a key to do a lookup in myprog's "mytable" which returns the action params
> > which are then used to execute the action in the eBPF datapath (eventually
> > sending out packets to eno1).
> > On a table miss, mytable's default miss action (not described) is executed.
> >
> > __Testing__
> >
> > Speaking of testing - we have 200-300 tdc test cases (which will be in the
> > second patchset).
> > These tests are run on our CICD system on pull requests and after commits are
> > approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
> > input) including:
> > checkpatch, sparse, smatch, coccinelle, and 32-bit and 64-bit builds tested on
> > X86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
> > CICD to catch performance regressions (currently only on the control path, but
> > in the future for the datapath).
> > Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory
> > sanitizer but recently added support for concurrency sanitizer.
> > Before main releases we ensure each patch will compile on its own, to help
> > git bisect, and we run the xmas tree tool. We eventually run the code through
> > coverity.
> >
> > In addition we are working on enabling a tool that will take a P4 program, run
> > it through the compiler, and generate permutations of traffic patterns via
> > symbolic execution that will test both positive and negative datapath code
> > paths. The test generator tool integration is still work in progress.
> > Also: We have other code that tests parallelization etc. which we are trying to
> > find a fit for in the kernel tree's testing infra.
> >
> >
> > __References__
> >
> > [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> > [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> > [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> > [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> > [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> > [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> > [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> > [9]https://p4.org/
> > [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> > [11]https://www.amd.com/en/accelerators/pensando
> > [12]https://github.com/sonic-net/DASH/tree/main
> > [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> > [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> > [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> > [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> > [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> > [17.b]man tc-u32
> > [18]man tc-pedit
> > [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> > [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> > [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
> >
> > --------
> > HISTORY
> > --------
> >
> > Changes in Version 12
> > ----------------------
> >
> > 0) Reintroduce 15 patches (v11 had 5)
> >
> > 1) From discussions with Daniel:
> >    i) Remove the XDP program association altogether. No refcounting, nothing.
> >    ii) Remove prog type tc - everything is now an eBPF tc action.
> >
> > 2) s/PAD0/__pad0/g. Thanks to Marcelo.
> >
> > 3) Add extack to report how many entries (N of M) specified in a batch for
> >    any requested Create/Update/Delete succeeded. Prior to this it would
> >    only tell us the batch failed to complete without giving us details of
> >    which of the M failed. Added as a debug aid.
> >
> > Changes in Version 11
> > ----------------------
> > 1) Split the series into two. Original patches 1-5 in this patchset. The rest
> >    will go out after this is merged.
> >
> > 2) Change any references of IFNAMSIZ in the action code when referencing the
> >    action name size to ACTNAMSIZ. Thanks to Marcelo.
> >
> > Changes in Version 10
> > ----------------------
> > 1) A couple of patches from the earlier version were clean enough to submit,
> >    so we did. This gave us room to split the two largest patches each into
> >    two. Even though the split is not git-bisectable, and really some of it didn't
> >    make much sense (e.g. splitting create and update into one patch and delete
> >    and get into another), we made sure each of the split patches compiled
> >    independently. The idea is to reduce the number of lines of code to review,
> >    and when we get sufficient reviews we will put the splits together again.
> >    See patches #12 and #13 as well as patches #7 and #8.
> >
> > 2) Add more context in patch 0. Please READ!
> >
> > 3) Added dump/delete filters back to the code - we had taken them out in the
> >    earlier patches to reduce the amount of code for review - but in retrospect
> >    we feel they are important enough to push earlier rather than later.
> >
> >
> > Changes In version 9
> > ---------------------
> >
> > 1) Remove the largest patch (externs) to ease review.
> >
> > 2) Break up the action patches into two to ease review, bringing down the
> >    patches that need more scrutiny to 8 (the first 7 are almost trivial).
> >
> > 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
> >    to provide consistency (Jiri).
> >
> > 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
> >    by making them static. TBH, not sure if this is the right solution
> >    but it makes sparse happy and hopefully someone will comment.
> >
> > Changes In Version 8
> > ---------------------
> >
> > 1) Fix all the patchwork warnings and improve our CI to catch them in the future
> >
> > 2) Reduce the number of patches to a max of 15 to ease review.
> >
> > Changes In Version 7
> > -------------------------
> >
> > 0) First time removing the RFC tag!
> >
> > 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
> > using bpf links is sufficient to protect us from someone replacing or deleting
> > an eBPF program after it has been bound to a netdev.
> >
> > 2) Add some reviewed-bys from Vlad.
> >
> > 3) Small bug fixes from v6 based on testing for ebpf.
> >
> > 4) Added the counter extern as a sample extern. We illustrate this example
> >    because it is slightly complex: it is possible to invoke it directly from
> >    the P4TC domain (in the case of direct counters) or from eBPF (indirect counters).
> >    It is not exactly the most efficient implementation (a reasonable counter impl
> >    should be per-cpu).
> >
> > Changes In RFC Version 6
> > -------------------------
> >
> > 1) Completed integration from the scriptable view to eBPF. Completed
> >    integration of externs.
> >
> > 2) Small bug fixes from v5 based on testing.
> >
> > Changes In RFC Version 5
> > -------------------------
> >
> > 1) More integration from scriptable view to eBPF. Small bug fixes from last
> >    integration.
> >
> > 2) More streamlining support of externs via kfunc (create-on-miss, etc)
> >
> > 3) eBPF linking for XDP.
> >
> > There is more eBPF integration/streamlining coming (we are getting close to
> > conversion from scriptable domain).
> >
> > Changes In RFC Version 4
> > -------------------------
> >
> > 1) More integration from scriptable to eBPF. Small bug fixes.
> >
> > 2) More streamlining support of externs via kfunc (one additional kfunc).
> >
> > 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
> >
> > There is more eBPF integration coming. One thing we looked at but is not in this
> > patchset but should be in the next is use of eBPF link in our loading (see
> > "challenge #1" further below).
> >
> > Changes In RFC Version 3
> > -------------------------
> >
> > These patches are still in a little bit of flux as we adjust to integrating
> > eBPF. So there are small constructs that are used in V1 and 2 but no longer
> > used in this version. We will make a V4 which will remove those.
> > The changes from V2 are as follows:
> >
> > 1) Feedback we got in V2 was to try to stick to one of the two modes. In this
> > version we take one more step and go down the path of mode2, versus v2 where we
> > had 2 modes.
> >
> > 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> > into eBPF we introduce another kfunc which encapsulates Register as part of the
> > extern interface.
> >
> > 3) We have improved our CICD to include tools pointed to us by Simon. See
> >    "Testing" further below. Thanks to Simon for that and other issues he caught.
> >    Simon, we discussed on issue [7] but decided to keep that log since we think
> >    it is useful.
> >
> > 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
> >    re-discuss though; see: [5], [6].
> >
> > 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
> >
> > 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
> >    guaranteed that either A or B must exist; however, let's make smatch happy.
> >    Thanks to Simon and Dan Carpenter.
> >
> > Changes In RFC Version 2
> > -------------------------
> >
> > Version 2 is the initial integration of the eBPF datapath.
> > We took into consideration suggestions provided to use eBPF and put effort into
> > analyzing eBPF as datapath which involved extensive testing.
> > We implemented 6 approaches with eBPF and ran performance analysis and presented
> > our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
> > vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
> > you account for XDP or TC separately).
> >
> > Conclusions from the exercise: We lose the simple operational model we had
> > prior to integrating eBPF. We do gain performance in most cases when the
> > datapath is less compute-bound.
> > For more discussion on our requirements vs journeying the eBPF path please
> > scroll down to "Restating Our Requirements" and "Challenges".
> >
> > This patch set presented two modes.
> > mode1: the parser is entirely based on eBPF - whereas the rest of the
> > SW datapath stays as _scriptable_ as in Version 1.
> > mode2: All of the kernel s/w datapath (including parser) is in eBPF.
> >
> > The key eBPF ingredient that we did not have access to in the past is the
> > kfunc (it made a big difference in our decision to reconsider eBPF).
> >
> > In V2 the two modes are mutually exclusive (IOW, you get to choose one
> > or the other via Kconfig).
>
> I think/fear that this series has a "quorum" problem: different voices
> raise opposition, and nobody (?) outside the authors has supported the
> code and the feature.
>
> Could the missing H/W offload support in the current form be the
> root cause for such lack of support? Or are there interested parties that
> have been quiet so far?

Some of the people who attend our meetings and have a vested interest in
this are on Cc.  But the cover letter is clear on this (right at the
top under "What is P4" and "what is P4TC").

cheers,
jamal


> Thanks,
>
> Paolo
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 18:49   ` Jamal Hadi Salim
@ 2024-02-29 20:52     ` John Fastabend
  0 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2024-02-29 20:52 UTC (permalink / raw)
  To: Jamal Hadi Salim, Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, dan.daly, andy.fingerhut,
	chris.sommers, mattyk, bpf

Jamal Hadi Salim wrote:
> On Thu, Feb 29, 2024 at 12:14 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > > This is the first patchset of two. In it we are submitting 15 patches which
> > > cover the minimal viable P4 PNA architecture.
> > >
> > > __Description of these Patches__
> > >
> > > Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> > > an as-needed basis per the P4 program's requirements. This patch makes a small
> > > incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
> > > effect on the classical tc actions (e.g. patch #2 just increases the size of
> > > action names from 16B to 64B).
> > > Patch 5 adds infrastructure support for preallocation of dynamic actions.
> > >
> > > The core P4TC code implements several P4 objects.
> > > 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
> > > 2) Patch #7 introduces the templating API, i.e. CRUD commands for templates
> > > 3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
> > >    for P4 pipelines.
> > > 4) Patch #9 introduces the action templates and associated CRUD commands.
> > > 5) Patch #10 introduces the action runtime infrastructure.
> > > 6) Patch #11 introduces the concept of P4 table templates and associated
> > >    CRUD commands for tables.
> > > 7) Patch #12 introduces runtime table entry infra and associated CU commands.
> > > 8) Patch #13 introduces runtime table entry infra and associated RD commands.
> > > 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> > > 10) Patch #15 introduces the TC classifier P4 used at runtime.
> > >
> > > Daniel, please look again at patch #15.
> > >
> > > There are a few more patches (5) not in this patchset that deal with test
> > > cases, etc.
> > >
> > > What is P4?
> > > -----------
> > >
> > > The Programming Protocol-independent Packet Processors (P4) is an open source,
> > > domain-specific programming language for specifying data plane behavior.
> > >
> > > The current P4 landscape includes an extensive range of deployments, products,
> > > projects, services, etc.[9][12]. Two major NIC vendors, Intel[10] and AMD[11],
> > > currently offer P4-native NICs. P4 is currently curated by the Linux
> > > Foundation[9].
> > >
> > > On why P4 - see small treatise here:[4].
> > >
> > > What is P4TC?
> > > -------------
> > >
> > > P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
> > > and its associated objects and state are attached to a kernel _netns_ structure.
> > > IOW, two programs, whether across netns' or within a netns, have no
> > > visibility into each other's objects (unlike for example TC actions whose kinds
> > > are "global" in nature, or eBPF maps vis-a-vis bpftool).
> > >
> > > P4TC builds on top of many years of Linux TC experiences of a netlink control
> > > path interface coupled with a software datapath with an equivalent offloadable
> > > hardware datapath. In this patch series we are focusing only on the s/w
> > > datapath. The s/w and h/w path equivalence that TC provides is relevant
> > > for a primary use case of P4 where some (currently) large consumers of NICs
> > > provide vendors their datapath specs in P4. In such a case one could generate
> > > specified datapaths in s/w and test/validate the requirements before hardware
> > > acquisition (example [12]).
> > >
> > > Unlike other approaches such as TC Flower, which require kernel and user space
> > > changes when new datapath objects like packet headers are introduced, P4TC,
> > > with these patches, provides _kernel and user space code change independence_.
> > > Meaning:
> > > A P4 program describes headers, parsers, etc alongside the datapath processing;
> > > the compiler uses the P4 program as input and generates several artifacts which
> > > are then loaded into the kernel to manifest the intended datapath. In addition
> > > to the generated datapath, control path constructs are generated. The process is
> > > described further below in "P4TC Workflow".
> > >
> > > There have been many discussions and meetings within the community since
> > > about 2015 in regards to P4 over TC[2] and we are finally proving to the
> > > naysayers that we do get stuff done!
> > >
> > > A lot more of the P4TC motivation is captured at:
> > > https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
> > >
> > > __P4TC Architecture__
> > >
> > > The current architecture was described at netdevconf 0x17[14] and if you prefer
> > > academic conference papers, a short paper is available here[15].
> > >
> > > There are 4 parts:
> > >
> > > 1) A Template CRUD provisioning API for manifesting a P4 program and its
> > > associated objects in the kernel. The template provisioning API uses netlink.
> > > See patch in part 2.
> > >
> > > 2) A Runtime CRUD+ API code which is used for controlling the different runtime
> > > behavior of the P4 objects. The runtime API uses netlink. See notes further
> > > down and the patch descriptions.
> > >
> > > 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> > > Any object that requires control plane interaction resides in the TC domain
> > > and is subject to the CRUD runtime API.  The intended goal is to make use of the
> > > tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
> > >
> > > 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> > > by a compiler based on the P4 spec. When accessing any P4 object that requires
> > > control plane interfaces, the eBPF code accesses the P4TC side from #3 above
> > > using kfuncs.
> > >
> > > The generated eBPF code is derived from [13] with enhancements and fixes to meet
> > > our requirements.
> > >
> > > __P4TC Workflow__
> > >
> > > The Development and instantiation workflow for P4TC is as follows:
> > >
> > >   A) A developer writes a P4 program, "myprog"
> > >
> > >   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
> > >
> > >      a) A shell script which forms the template definitions for the different
> > >      P4 objects "myprog" utilizes (tables, externs, actions, etc). See #1 above.
> > >
> > >      b) the parser and the rest of the datapath are generated as eBPF and need
> > >      to be compiled into binaries. At the moment the parser and the main control
> > >      block are generated as separate eBPF programs but this could change in
> > >      the future (without affecting any kernel code). See #4 above.
> > >
> > >      c) A json introspection file used for the control plane (by iproute2/tc).
> > >
> > >   C) At this point the artifacts from #1,#4 could be handed to an operator
> > >      (the operator could be the same person as the developer from #A, #B).
> > >
> > >      i) For the eBPF part, either the operator is handed an eBPF binary or
> > >      source which they compile into a binary at this point.
> > >      The operator executes the shell script(s) to manifest the functional
> > >      "myprog" into the kernel.
> > >
> > >      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
> > >      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
> > >      (illustrated below as "block 22").
> > >
> > >      Example instantiation where the parser is a separate action:
> > >        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
> > >         action bpf obj $PARSER.o section p4tc/parse \
> > >         action bpf obj $PROGNAME.o section p4tc/main"
> > >
> > > See individual patches in part 2 for more examples (tc vs xdp, etc). Also see
> > > the section on "challenges" (further below in this cover letter).
> > >
> > > Once "myprog" P4 program is instantiated one can start performing operations
> > > on table entries and/or actions at runtime as described below.
> > >
> > > __P4TC Runtime Control Path__
> > >
> > > The control interface builds on past tc experience and tries to get things
> > > right from the beginning (for example, filtering is made generic rather than
> > > depending on existing object TLVs); also the code is written in
> > > such a way that it is mostly lockless.
> > >
> > > The P4TC control interface, using netlink, provides what we call a CRUDPS
> > > abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> > > Publish.  From a high level PoV the following describes a conformant high level
> > > API (both on netlink data model and code level):
> > >
> > >       Create(</path/to/object, DATA>+)
> > >       Read(</path/to/object>, [optional filter])
> > >       Update(</path/to/object>, DATA>+)
> > >       Delete(</path/to/object>, [optional filter])
> > >       Subscribe(</path/to/object>, [optional filter])
> > >
> > > Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
> > > to a table then a "Delete" implies "flush" and a "Read" implies dump, but if
> > > it points to an entry (by specifying a key) then "Delete" implies deleting
> > > that entry and "Read" implies reading that single entry. It should be noted that
> > > both "Delete" and "Read" take an optional filter parameter. The filter can
> > > define further refinements to what the control plane wants read or deleted.
> > > "Subscribe" uses built in netlink event management. It, as well, takes a filter
> > > which can further refine what events get generated to the control plane (taken
> > > out of this patchset, to be re-added with consideration of [16]).
> > >
> > > Let's show some runtime samples:
> > >
> > > ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
> > >   tc p4ctrl create myprog/table/mytable \
> > >    dstAddr 10.0.1.2/32 action send_to_port param port eno1
> > >
> > > ..Batch create entries
> > >   tc p4ctrl create myprog/table/mytable \
> > >   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
> > >   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
> > >   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
> > >
> > > ..Get an entry (note "read" is used interchangeably with "get", which is a
> > >               common semantic in tc):
> > >   tc p4ctrl read myprog/table/mytable \
> > >    dstAddr 10.0.2.2/32
> > >
> > > ..dump mytable
> > >   tc p4ctrl read myprog/table/mytable
> > >
> > > ..dump mytable for all entries whose key fits within 10.1.0.0/16
> > >   tc p4ctrl read myprog/table/mytable \
> > >   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
> > >
> > > ..dump all mytable entries which have an action send_to_port with param "eno1"
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter param/act/myprog/send_to_port/port = "eno1"
> > >
> > > The filter expression is powerful, f.e you could say:
> > >
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter param/act/myprog/send_to_port/port = "eno1" && \
> > >          key/myprog/mytable/dstAddr = 10.1.0.0/16
> > >
> > > It also works on built-in metadata; for example, in the following case we dump
> > > entries from mytable that have seen activity in the last 10 secs:
> > >   tc p4ctrl get myprog/table/mytable \
> > >   filter msecs_since < 10000
> > >
> > > Delete follows the same syntax as get/read, so for the sake of brevity we
> > > won't show more examples beyond how to flush mytable:
> > >
> > >   tc p4ctrl delete myprog/table/mytable
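[Aside: the path semantics above - a table path makes Read a dump and Delete a flush, while adding a key addresses a single entry, with an optional filter narrowing either - can be sketched as a toy model. Names and structures are invented for illustration; the real control path speaks netlink TLVs to the kernel.]

```python
# Toy model of the runtime CRUD path semantics described in the cover letter.
def read(tables, path, key=None, flt=None):
    entries = tables[path]
    if key is not None:
        return {key: entries[key]}              # single-entry get
    return {k: v for k, v in entries.items()
            if flt is None or flt(k, v)}        # dump, optionally filtered

def delete(tables, path, key=None, flt=None):
    entries = tables[path]
    if key is not None:
        del entries[key]                        # delete one entry
        return
    for k in [k for k, v in entries.items() if flt is None or flt(k, v)]:
        del entries[k]                          # flush, optionally filtered

tables = {"myprog/table/mytable": {
    "10.0.1.2/32": {"action": "send_to_port", "port": "eno1"},
    "10.1.1.2/32": {"action": "send_to_port", "port": "eno10"},
}}

print(read(tables, "myprog/table/mytable", key="10.0.1.2/32"))
delete(tables, "myprog/table/mytable",
       flt=lambda k, v: v["port"] == "eno10")   # filtered delete
print(list(tables["myprog/table/mytable"]))    # the eno1 entry remains
```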
> > >
> > > Mystery question: How do we achieve iproute2-kernel independence and
> > > how does "tc p4ctrl" as a cli know how to program the kernel given an
> > > arbitrary command line as shown above? Answer(s): It queries the
> > > compiler-generated json file from "P4TC Workflow" #B.c above. The json file has
> > > enough details to figure out that we have a program called "myprog" which has a
> > > table "mytable" that has a key name "dstAddr" which happens to be type ipv4
> > > address prefix. The json file also provides details to show that the table
> > > "mytable" supports an action called "send_to_port" which accepts a parameter
> > > "port" of type netdev (see the types patch for all supported P4 data types).
> > > All P4 components have names, IDs, and types - so this makes it very easy to map
> > > into netlink.
> > > Once user space tc/p4ctrl validates the human command input, it creates
> > > standard binary netlink structures (TLVs etc) which are sent to the kernel.
> > > See the runtime table entry patch for more details.
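[Aside: the name-to-ID mapping step can be sketched as follows. The json layout below is invented for illustration - the real introspection format is defined by the p4c TC backend, not by this sketch - but it shows how the cli can type-check input and resolve IDs without any kernel-side knowledge of the program.]

```python
# Sketch: resolve human-readable names against a (hypothetical) compiler-
# generated introspection file, yielding the IDs/types a netlink TLV
# encoding step would consume.
import json

introspection = json.loads("""
{
  "pipeline": {"name": "myprog", "id": 1,
    "tables": [{"name": "mytable", "id": 1,
      "key": [{"name": "dstAddr", "type": "ipv4_prefix"}],
      "actions": [{"name": "send_to_port",
        "params": [{"name": "port", "type": "netdev"}]}]}]}
}
""")

def resolve(intro, table_name, action_name):
    """Map names to the IDs and types the netlink encoding would carry."""
    pipe = intro["pipeline"]
    for tbl in pipe["tables"]:
        if tbl["name"] != table_name:
            continue
        for act in tbl["actions"]:
            if act["name"] == action_name:
                return {"pipeline_id": pipe["id"], "table_id": tbl["id"],
                        "key_types": {k["name"]: k["type"] for k in tbl["key"]},
                        "param_types": {p["name"]: p["type"] for p in act["params"]}}
    raise KeyError(f"{table_name}/{action_name} not described by the program")

print(resolve(introspection, "mytable", "send_to_port"))
```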
> > >
> > > __P4TC Datapath__
> > >
> > > The P4TC s/w datapath execution is generated as eBPF. Any objects that require
> > > control interfacing reside in the "P4TC domain" and are controlled via netlink
> > > as described above. Per packet execution and state and even objects that do not
> > > require control interfacing (like the P4 parser) are generated as eBPF.
> > >
> > > A packet arriving on s/w ingress of any of the ports on block 22 will first be
> > > exercised via the (generated eBPF) parser component to extract the headers (the
> > > IP destination address labelled "dstAddr" above).
> > > The datapath then proceeds to use "dstAddr", table ID and pipeline ID
> > > as a key to do a lookup in myprog's "mytable" which returns the action params
> > > which are then used to execute the action in the eBPF datapath (eventually
> > > sending out packets to eno1).
> > > On a table miss, mytable's default miss action (not described) is executed.
> > >
> > > __Testing__
> > >
> > > Speaking of testing - we have 200-300 tdc test cases (which will be in the
> > > second patchset).
> > > These tests are run on our CICD system on pull requests and after commits are
> > > approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
> > > input) including:
> > > checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on both
> > > X86, ARM 64 and emulated BE via qemu s390. We trigger performance testing in the
> > > CICD to catch performance regressions (currently only on the control path, but
> > > in the future for the datapath).
> > > Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
> > > memory sanitizer but recently added support for the concurrency sanitizer.
> > > Before main releases we ensure each patch compiles on its own (to help with
> > > git bisect) and run the xmas tree tool. We eventually run the code through Coverity.
> > >
> > > In addition we are working on enabling a tool that will take a P4 program, run
> > > it through the compiler, and generate permutations of traffic patterns via
> > > symbolic execution that will test both positive and negative datapath code
> > > paths. The test generator tool integration is still work in progress.
> > > Also: we have other code that tests parallelization etc., which we are trying
> > > to fit into the kernel tree's testing infra.
> > >
> > >
> > > __References__
> > >
> > > [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> > > [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> > > [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> > > [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> > > [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> > > [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> > > [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> > > [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> > > [9]https://p4.org/
> > > [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> > > [11]https://www.amd.com/en/accelerators/pensando
> > > [12]https://github.com/sonic-net/DASH/tree/main
> > > [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> > > [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> > > [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> > > [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> > > [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> > > [17.b]man tc-u32
> > > [18]man tc-pedit
> > > [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> > > [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> > > [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
> > >
> > > --------
> > > HISTORY
> > > --------
> > >
> > > Changes in Version 12
> > > ----------------------
> > >
> > > 0) Reintroduce 15 patches (v11 had 5)
> > >
> > > 1) From discussions with Daniel:
> > >    i) Remove the XDP program association altogether. No refcounting, nothing.
> > >    ii) Remove prog type tc - everything is now an eBPF tc action.
> > >
> > > 2) s/PAD0/__pad0/g. Thanks to Marcelo.
> > >
> > > 3) Add extack to report how many entries (N of M) specified in a batch for
> > >    any requested Create/Update/Delete succeeded. Prior to this it would
> > >    only tell us the batch failed to complete without giving us details of
> > >    which of the M failed. Added as a debug aid.
> > >
> > > Changes in Version 11
> > > ----------------------
> > > 1) Split the series into two. Original patches 1-5 in this patchset. The rest
> > >    will go out after this is merged.
> > >
> > > 2) Change any references of IFNAMSIZ in the action code when referencing the
> > >    action name size to ACTNAMSIZ. Thanks to Marcelo.
> > >
> > > Changes in Version 10
> > > ----------------------
> > > 1) A couple of patches from the earlier version were clean enough to submit,
> > >    so we did. This gave us room to split the two largest patches each into
> > >    two. Even though the split is not git-bisectable, and really some of it didn't
> > >    make much sense (e.g. splitting create and update into one patch and delete
> > >    and get into another), we made sure each of the split patches compiled
> > >    independently. The idea is to reduce the number of lines of code to review,
> > >    and when we get sufficient reviews we will put the splits together again.
> > >    See patches #12 and #13 as well as patches #7 and #8.
> > >
> > > 2) Add more context in patch 0. Please READ!
> > >
> > > 3) Added dump/delete filters back to the code - we had taken them out in the
> > >    earlier patches to reduce the amount of code for review - but in retrospect
> > >    we feel they are important enough to push earlier rather than later.
> > >
> > >
> > > Changes In version 9
> > > ---------------------
> > >
> > > 1) Remove the largest patch (externs) to ease review.
> > >
> > > 2) Break up the action patches into two to ease review, bringing down the
> > >    patches that need more scrutiny to 8 (the first 7 are almost trivial).
> > >
> > > 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
> > >    to provide consistency (Jiri).
> > >
> > > 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
> > >    by making them static. TBH, not sure if this is the right solution
> > >    but it makes sparse happy and hopefully someone will comment.
> > >
> > > Changes In Version 8
> > > ---------------------
> > >
> > > 1) Fix all the patchwork warnings and improve our CI to catch them in the future
> > >
> > > 2) Reduce the number of patches to a max of 15 to ease review.
> > >
> > > Changes In Version 7
> > > -------------------------
> > >
> > > 0) First time removing the RFC tag!
> > >
> > > 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
> > > using bpf links is sufficient to protect us from someone replacing or deleting
> > > an eBPF program after it has been bound to a netdev.
> > >
> > > 2) Add some reviewed-bys from Vlad.
> > >
> > > 3) Small bug fixes from v6 based on testing for ebpf.
> > >
> > > 4) Added the counter extern as a sample extern. We illustrate this example
> > >    because it is slightly complex: it is possible to invoke it directly from
> > >    the P4TC domain (in the case of direct counters) or from eBPF (indirect counters).
> > >    It is not exactly the most efficient implementation (a reasonable counter impl
> > >    should be per-cpu).
> > >
> > > Changes In RFC Version 6
> > > -------------------------
> > >
> > > 1) Completed integration from the scriptable view to eBPF. Completed
> > >    integration of externs.
> > >
> > > 2) Small bug fixes from v5 based on testing.
> > >
> > > Changes In RFC Version 5
> > > -------------------------
> > >
> > > 1) More integration from scriptable view to eBPF. Small bug fixes from last
> > >    integration.
> > >
> > > 2) More streamlining support of externs via kfunc (create-on-miss, etc)
> > >
> > > 3) eBPF linking for XDP.
> > >
> > > There is more eBPF integration/streamlining coming (we are getting close to
> > > conversion from scriptable domain).
> > >
> > > Changes In RFC Version 4
> > > -------------------------
> > >
> > > 1) More integration from scriptable to eBPF. Small bug fixes.
> > >
> > > 2) More streamlining support of externs via kfunc (one additional kfunc).
> > >
> > > 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
> > >
> > > There is more eBPF integration coming. One thing we looked at but is not in this
> > > patchset but should be in the next is use of eBPF link in our loading (see
> > > "challenge #1" further below).
> > >
> > > Changes In RFC Version 3
> > > -------------------------
> > >
> > > These patches are still in a little bit of flux as we adjust to integrating
> > > eBPF. So there are small constructs that are used in V1 and 2 but no longer
> > > used in this version. We will make a V4 which will remove those.
> > > The changes from V2 are as follows:
> > >
> > > 1) Feedback we got in V2 was to try to stick to one of the two modes. In this version
> > > we are taking one more step and going the path of mode2, vs V2 where we had 2 modes.
> > >
> > > 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> > > into eBPF we introduce another kfunc which encapsulates Register as part of the
> > > extern interface.
> > >
> > > 3) We have improved our CICD to include tools pointed to us by Simon. See
> > >    "Testing" further below. Thanks to Simon for that and other issues he caught.
> > >    Simon, we discussed on issue [7] but decided to keep that log since we think
> > >    it is useful.
> > >
> > > 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
> > >    re-discuss though; see: [5], [6].
> > >
> > > 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
> > >
> > > 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
> > >    guaranteed that either A or B must exist; however, let's make smatch happy.
> > >    Thanks to Simon and Dan Carpenter.
> > >
> > > Changes In RFC Version 2
> > > -------------------------
> > >
> > > Version 2 is the initial integration of the eBPF datapath.
> > > We took into consideration suggestions provided to use eBPF and put effort into
> > > analyzing eBPF as datapath which involved extensive testing.
> > > We implemented 6 approaches with eBPF, ran performance analysis, and presented
> > > our results at the P4 2023 workshop in Santa Clara [see: 1, 3], comparing each
> > > of the 6 to the scriptable P4TC, and concluded that 2 of the approaches are
> > > sensible (4 if you account for XDP or TC separately).
> > >
> > > Conclusions from the exercise: We lose the simple operational model we had
> > > prior to integrating eBPF. We do gain performance in most cases when the
> > > datapath is less compute-bound.
> > > For more discussion on our requirements vs journeying the eBPF path please
> > > scroll down to "Restating Our Requirements" and "Challenges".
> > >
> > > This patch set presented two modes.
> > > mode1: the parser is entirely based on eBPF - whereas the rest of the
> > > SW datapath stays as _scriptable_ as in Version 1.
> > > mode2: All of the kernel s/w datapath (including parser) is in eBPF.
> > >
> > > The key ingredient for eBPF, that we did not have access to in the past, is
> > > kfunc (it made a big difference for us to reconsider eBPF).
> > >
> > > In V2 the two modes are mutually exclusive (IOW, you get to choose one
> > > or the other via Kconfig).
> >
> > I think/fear that this series has a "quorum" problem: different voices
> > raised opposition, and nobody (?) outside the authors supported the
> > code and the feature.
> >
> > Could the missing H/W offload support in the current form be the
> > root cause for such lack of support? Or are there interested parties
> > that have been quiet so far?

Yeah, agree with the h/w comment; I would be interested to hear from these
folks that have h/w. For me to get on board, the obvious things that would be
interesting are: (a) hardware offload, (b) some fundamental problem with the
existing p4c backend we already have, or (c) a significant performance improvement.

> 
> Some of the people who attend our meetings and have vested interest in
> this are on Cc.  But the cover letter is clear on this (right at the
> top under "What is P4" and "what is P4TC").
> 
> cheers,
> jamal
> 
> 
> > Thanks,
> >
> > Paolo
> >
> >
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 17:13 ` Paolo Abeni
  2024-02-29 18:49   ` Jamal Hadi Salim
@ 2024-02-29 21:49   ` Singhai, Anjali
  2024-02-29 22:33     ` John Fastabend
  2024-03-01 18:53   ` Chris Sommers
  2 siblings, 1 reply; 71+ messages in thread
From: Singhai, Anjali @ 2024-02-29 21:49 UTC (permalink / raw)
  To: Paolo Abeni, Hadi Salim, Jamal, netdev@vger.kernel.org
  Cc: Chatterjee, Deb, Limaye, Namrata, tom@sipanda.io,
	mleitner@redhat.com, Mahesh.Shirshyad@amd.com, Vipin.Jain@amd.com,
	Osinski, Tomasz, jiri@resnulli.us, xiyou.wangcong@gmail.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	vladbu@nvidia.com, horms@kernel.org, khalidm@nvidia.com,
	toke@redhat.com, daniel@iogearbox.net, victor@mojatatu.com,
	Tammela, Pedro, Daly, Dan, andy.fingerhut@gmail.com,
	Sommers, Chris, mattyk@nvidia.com, bpf@vger.kernel.org

From: Paolo Abeni <pabeni@redhat.com> 

> I think/fear that this series has a "quorum" problem: different voices raised opposition, and nobody (?) outside the authors
> supported the code and the feature.

> Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> interested that have been quiet so far?

Hi,
   Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipelines (smart switches and smart NICs) prefer standard kernel APIs and interfaces (netlink and the tc ndo). Intel and other vendors have native P4-capable HW and are invested in P4 as a dataplane specification.

- Customers run P4 dataplane in multiple targets including SW pipeline as well as programmable Switches and DPUs.
- Standardized kernel APIs and a standard implementation bring portability across vendors and across targets (CPU/SW and DPUs).
- A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. 
- This patch series helps create a SW pipeline and standard API.

Thanks,
Anjali


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 21:49   ` Singhai, Anjali
@ 2024-02-29 22:33     ` John Fastabend
  2024-02-29 22:48       ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: John Fastabend @ 2024-02-29 22:33 UTC (permalink / raw)
  To: Singhai, Anjali, Paolo Abeni, Hadi Salim, Jamal,
	netdev@vger.kernel.org
  Cc: Chatterjee, Deb, Limaye, Namrata, tom@sipanda.io,
	mleitner@redhat.com, Mahesh.Shirshyad@amd.com, Vipin.Jain@amd.com,
	Osinski, Tomasz, jiri@resnulli.us, xiyou.wangcong@gmail.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	vladbu@nvidia.com, horms@kernel.org, khalidm@nvidia.com,
	toke@redhat.com, daniel@iogearbox.net, victor@mojatatu.com,
	Tammela, Pedro, Daly, Dan, andy.fingerhut@gmail.com,
	Sommers, Chris, mattyk@nvidia.com, bpf@vger.kernel.org

Singhai, Anjali wrote:
> From: Paolo Abeni <pabeni@redhat.com> 
> 
> > I think/fear that this series has a "quorum" problem: different voices raised opposition, and nobody (?) outside the authors
> > supported the code and the feature.
> 
> > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> > interested that have been quiet so far?
> 
> Hi,
>    Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipeline (smart switch and smart NIC) prefer kernel standard APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4 capable HW and are invested in P4 as a dataplane specification.

Great. What hardware/driver, and how do we get that code here so we can see
it working? Is the hardware available, e.g. can I get hold of one?

What is programmable on your devices? Is this 'just' the parser graph, or
are you slicing up tables and so on? Is it an FPGA, DPU architecture or a
TCAM architecture? How do you reprogram the device? I somehow doubt it's
through a piecemeal ndo, but let me know if I'm wrong; maybe my internal
architecture details are dated. Fully speculating: is the interface one
big FW thunk to the device?

Without any details it's difficult to get community feedback on how the
hw programmable interface should work. The only reason I've even
bothered with this thread is I want to see P4 working.

Who owns the AMD side, or some other vendor, so we can get something that
works across at least two vendors, which is our usual bar for adding hw
offload things?

Note if you just want a kernel SW pipeline we already have that, so
I'm not seeing that as particularly motivating. Again, my point of view:
P4 as a dataplane specification is great, but I don't see the connection
to this patchset without real hardware in a driver.

> 
> - Customers run P4 dataplane in multiple targets including SW pipeline as well as programmable Switches and DPUs.
> - A standardized kernel APIs and implementation brings in portability across vendors and across targets (CPU/SW and DPUs).
> - A P4 pipeline can be built using both SW and HW (DPU/switch) components and the P4 pipeline should seamlessly move between the two. 
> - This patch series helps create a SW pipeline and standard API.
> 
> Thanks,
> Anjali
> 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 22:33     ` John Fastabend
@ 2024-02-29 22:48       ` Jamal Hadi Salim
       [not found]         ` <CAOuuhY8qbsYCjdUYUZv8J3jz8HGXmtxLmTDP6LKgN5uRVZwMnQ@mail.gmail.com>
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-02-29 22:48 UTC (permalink / raw)
  To: John Fastabend
  Cc: Singhai, Anjali, Paolo Abeni, netdev@vger.kernel.org,
	Chatterjee, Deb, Limaye, Namrata, tom@sipanda.io,
	mleitner@redhat.com, Mahesh.Shirshyad@amd.com, Vipin.Jain@amd.com,
	Osinski, Tomasz, jiri@resnulli.us, xiyou.wangcong@gmail.com,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	vladbu@nvidia.com, horms@kernel.org, khalidm@nvidia.com,
	toke@redhat.com, daniel@iogearbox.net, victor@mojatatu.com,
	Tammela, Pedro, Daly, Dan, andy.fingerhut@gmail.com,
	Sommers, Chris, mattyk@nvidia.com, bpf@vger.kernel.org

On Thu, Feb 29, 2024 at 5:33 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Singhai, Anjali wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> >
> > > I think/fear that this series has a "quorum" problem: different voices raised opposition, and nobody (?) outside the authors
> > > supported the code and the feature.
> >
> > > Could the missing H/W offload support in the current form be the root cause for such lack of support? Or are there parties
> > > interested that have been quiet so far?
> >
> > Hi,
> >    Intel/AMD definitely need the p4tc offload support and a kernel SW pipeline, as a lot of customers using programmable pipeline (smart switch and smart NIC) prefer kernel standard APIs and interfaces (netlink and tc ndo). Intel and other vendors have native P4 capable HW and are invested in P4 as a dataplane specification.
>
> Great what hardware/driver and how do we get that code here so we can see
> it working? Is the hardware available e.g. can I get ahold of one?
>
> What is programmable on your devices? Is this 'just' the parser graph or
> are you slicing up tables and so on. Is it a FPGA, DPU architecture or a
> TCAM architecture? How do you reprogram the device? I somehow doubt its
> through a piecemeal ndo. But let me know if I'm wrong maybe my internal
> architecture details are dated. Fully speculating the interface is a FW
> big thunk to the device?
>
> Without any details its difficult to get community feedback on how the
> hw programmable interface should work. The only reason I've even
> bothered with this thread is I want to see P4 working.
>
> Who owns the AMD side or some other vendor so we can get something that
> works across at least two vendors which is our usual bar for adding hw
> offload things.
>
> Note if you just want a kernel SW pipeline we already have that so
> I'm not seeing that as paticularly motivating. Again my point of view.
> P4 as a dataplane specification is great but I don't see the connection
> to this patchset without real hardware in a driver.

Here's what you can buy on the market that is native P4 (not that it
hasn't been mentioned from day 1 in the patch 0 references):
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando

I want to emphasize again that these patches are about the P4 s/w pipeline,
which is intended to work seamlessly with hw offload. If you are
interested in h/w offload and want to contribute, just show up at the
meetings - they are open to all. The current offloadable piece is the
match-action tables. The P4 specs may change in the future to include
parsers or other objects etc. (but I'm not sure why we should discuss
this in the thread).

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-02-25 16:54 ` [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
@ 2024-03-01  6:53   ` Martin KaFai Lau
  2024-03-01 12:31     ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-01  6:53 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On 2/25/24 8:54 AM, Jamal Hadi Salim wrote:
> +struct p4tc_table_entry_act_bpf_params {

Will this struct be extended in the future?

> +	u32 pipeid;
> +	u32 tblid;
> +};
> +
> +struct p4tc_table_entry_create_bpf_params {
> +	u32 profile_id;
> +	u32 pipeid;
> +	u32 tblid;
> +};
> +

[ ... ]

> diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
> index c5256d821..155068de0 100644
> --- a/include/net/tc_act/p4tc.h
> +++ b/include/net/tc_act/p4tc.h
> @@ -13,10 +13,26 @@ struct tcf_p4act_params {
>   	u32 tot_params_sz;
>   };
>   
> +#define P4TC_MAX_PARAM_DATA_SIZE 124
> +
> +struct p4tc_table_entry_act_bpf {
> +	u32 act_id;
> +	u32 hit:1,
> +	    is_default_miss_act:1,
> +	    is_default_hit_act:1;
> +	u8 params[P4TC_MAX_PARAM_DATA_SIZE];
> +} __packed;
> +
> +struct p4tc_table_entry_act_bpf_kern {
> +	struct rcu_head rcu;
> +	struct p4tc_table_entry_act_bpf act_bpf;
> +};
> +
>   struct tcf_p4act {
>   	struct tc_action common;
>   	/* Params IDR reference passed during runtime */
>   	struct tcf_p4act_params __rcu *params;
> +	struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
>   	u32 p_id;
>   	u32 act_id;
>   	struct list_head node;
> @@ -24,4 +40,39 @@ struct tcf_p4act {
>   
>   #define to_p4act(a) ((struct tcf_p4act *)a)
>   
> +static inline struct p4tc_table_entry_act_bpf *
> +p4tc_table_entry_act_bpf(struct tc_action *action)
> +{
> +	struct p4tc_table_entry_act_bpf_kern *act_bpf;
> +	struct tcf_p4act *p4act = to_p4act(action);
> +
> +	act_bpf = rcu_dereference(p4act->act_bpf);
> +
> +	return &act_bpf->act_bpf;
> +}
> +
> +static inline int
> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
> +				      u32 dflt_miss, u32 dflt_hit)
> +{
> +	struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
> +	struct tcf_p4act *p4act = to_p4act(action);
> +
> +	act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);


[ ... ]

> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> +bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,

The argument could be "struct sk_buff *skb" instead of __sk_buff. Take a look at 
commit 2f4643934670.

> +		  struct p4tc_table_entry_act_bpf_params *params,
> +		  void *key, const u32 key__sz)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct net *caller_net;
> +
> +	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +
> +	return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> +}
> +
> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> +xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
> +		  struct p4tc_table_entry_act_bpf_params *params,
> +		  void *key, const u32 key__sz)
> +{
> +	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> +	struct net *caller_net;
> +
> +	caller_net = dev_net(ctx->rxq->dev);
> +
> +	return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> +}
> +
> +static int
> +__bpf_p4tc_entry_create(struct net *net,
> +			struct p4tc_table_entry_create_bpf_params *params,
> +			void *key, const u32 key__sz,
> +			struct p4tc_table_entry_act_bpf *act_bpf)
> +{
> +	struct p4tc_table_entry_key *entry_key = key;
> +	struct p4tc_pipeline *pipeline;
> +	struct p4tc_table *table;
> +
> +	if (!params || !key)
> +		return -EINVAL;
> +	if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
> +		return -EINVAL;
> +
> +	pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
> +	if (!pipeline)
> +		return -ENOENT;
> +
> +	table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
> +	if (!table)
> +		return -ENOENT;
> +
> +	if (entry_key->keysz != table->tbl_keysz)
> +		return -EINVAL;
> +
> +	return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
> +					   params->profile_id);

My understanding is this kfunc will allocate a "struct 
p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is 
never called and the bpf prog is unloaded, how will the act_bpf object be 
cleaned up?

> +}
> +
> +__bpf_kfunc static int
> +bpf_p4tc_entry_create(struct __sk_buff *skb_ctx,
> +		      struct p4tc_table_entry_create_bpf_params *params,
> +		      void *key, const u32 key__sz,
> +		      struct p4tc_table_entry_act_bpf *act_bpf)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct net *net;
> +
> +	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +
> +	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> +}
> +
> +__bpf_kfunc static int
> +xdp_p4tc_entry_create(struct xdp_md *xdp_ctx,
> +		      struct p4tc_table_entry_create_bpf_params *params,
> +		      void *key, const u32 key__sz,
> +		      struct p4tc_table_entry_act_bpf *act_bpf)
> +{
> +	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> +	struct net *net;
> +
> +	net = dev_net(ctx->rxq->dev);
> +
> +	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> +}
> +
> +__bpf_kfunc static int
> +bpf_p4tc_entry_create_on_miss(struct __sk_buff *skb_ctx,
> +			      struct p4tc_table_entry_create_bpf_params *params,
> +			      void *key, const u32 key__sz,
> +			      struct p4tc_table_entry_act_bpf *act_bpf)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct net *net;
> +
> +	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +
> +	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> +}
> +
> +__bpf_kfunc static int
> +xdp_p4tc_entry_create_on_miss(struct xdp_md *xdp_ctx,

Same here. "struct xdp_buff *xdp".

> +			      struct p4tc_table_entry_create_bpf_params *params,
> +			      void *key, const u32 key__sz,
> +			      struct p4tc_table_entry_act_bpf *act_bpf)
> +{
> +	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> +	struct net *net;
> +
> +	net = dev_net(ctx->rxq->dev);
> +
> +	return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> +}
> +

[ ... ]

> +__bpf_kfunc static int
> +bpf_p4tc_entry_delete(struct __sk_buff *skb_ctx,
> +		      struct p4tc_table_entry_create_bpf_params *params,
> +		      void *key, const u32 key__sz)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct net *net;
> +
> +	net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> +
> +	return __bpf_p4tc_entry_delete(net, params, key, key__sz);
> +}
> +
> +__bpf_kfunc static int
> +xdp_p4tc_entry_delete(struct xdp_md *xdp_ctx,
> +		      struct p4tc_table_entry_create_bpf_params *params,
> +		      void *key, const u32 key__sz)
> +{
> +	struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> +	struct net *net;
> +
> +	net = dev_net(ctx->rxq->dev);
> +
> +	return __bpf_p4tc_entry_delete(net, params, key, key__sz);
> +}
> +
> +BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)

This will soon be broken with the latest change in bpf-next: BTF_SET8_START is 
replaced by BTF_KFUNCS_START. See commit a05e90427ef6.

What is the plan on the selftest ?

> +BTF_ID_FLAGS(func, bpf_p4tc_tbl_read, KF_RET_NULL);
> +BTF_ID_FLAGS(func, bpf_p4tc_entry_create);
> +BTF_ID_FLAGS(func, bpf_p4tc_entry_create_on_miss);
> +BTF_ID_FLAGS(func, bpf_p4tc_entry_update);
> +BTF_ID_FLAGS(func, bpf_p4tc_entry_delete);
> +BTF_SET8_END(p4tc_kfunc_check_tbl_set_skb)
> +
> +static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_skb = {
> +	.owner = THIS_MODULE,
> +	.set = &p4tc_kfunc_check_tbl_set_skb,
> +};
> +
> +BTF_SET8_START(p4tc_kfunc_check_tbl_set_xdp)
> +BTF_ID_FLAGS(func, xdp_p4tc_tbl_read, KF_RET_NULL);
> +BTF_ID_FLAGS(func, xdp_p4tc_entry_create);
> +BTF_ID_FLAGS(func, xdp_p4tc_entry_create_on_miss);
> +BTF_ID_FLAGS(func, xdp_p4tc_entry_update);
> +BTF_ID_FLAGS(func, xdp_p4tc_entry_delete);
> +BTF_SET8_END(p4tc_kfunc_check_tbl_set_xdp)


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-28 17:11 ` [PATCH net-next v12 00/15] Introducing P4TC (series 1) John Fastabend
  2024-02-28 18:23   ` Jamal Hadi Salim
@ 2024-03-01  7:02   ` Martin KaFai Lau
  2024-03-01 12:36     ` Jamal Hadi Salim
  1 sibling, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-01  7:02 UTC (permalink / raw)
  To: John Fastabend, Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf, netdev

On 2/28/24 9:11 AM, John Fastabend wrote:
>   - The kfuncs are mostly duplicates of map ops we already have in BPF API.
>     The motivation by my read is to use netlink instead of bpf commands. I

I also have similar thoughts on the kfuncs (create/update/delete), which are mostly 
bpf map ops. It could have one single kfunc to allocate a kernel-specific p4 
entry/object and then store that in a bpf map. With the bpf_rbtree, bpf_list, 
and other recent advancements, it should be possible to describe them in a bpf map. 
The reply in v9 was that the p4 table will also be used in the future HW 
piece/driver, but the HW piece is not ready yet; bpf is the only consumer of the 
kernel p4 table now, and this makes mimicking the bpf map api with kfuncs not 
convincing. A bpf "tc / xdp" program uses netlink to attach/detach, and the policy 
also stays in the bpf map.

When there is a HW piece that consumes the p4 table, that will be a better time 
to discuss the kfunc interface.

>     don't agree with this, optimizing for some low level debug a developer
>     uses is the wrong design space. Actual users should not be deploying
>     this via ssh into boxes. The workflow will not scale and really we need
>     tooling and infra to land P4 programs across the network. This is orders
>     of more pain if its an endpoint solution and not a middlebox/switch
>     solution. As a switch solution I don't see how p4tc sw scales to even TOR
>     packet rates. So you need tooling on top and user interact with the
>     tooling not the Linux widget/debugger at the bottom.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list
  2024-02-29 18:21     ` Jamal Hadi Salim
@ 2024-03-01  7:30       ` Paolo Abeni
  2024-03-01 12:39         ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Paolo Abeni @ 2024-03-01  7:30 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Thu, 2024-02-29 at 13:21 -0500, Jamal Hadi Salim wrote:
> On Thu, Feb 29, 2024 at 10:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > > In P4 we require to generate new actions "on the fly" based on the
> > > specified P4 action definition. P4 action kinds, like the pipeline
> > > they are attached to, must be per net namespace, as opposed to native
> > > action kinds which are global. For that reason, we chose to create a
> > > separate structure to store P4 actions.
> > > 
> > > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> > > Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > ---
> > >  include/net/act_api.h |   8 ++-
> > >  net/sched/act_api.c   | 123 +++++++++++++++++++++++++++++++++++++-----
> > >  net/sched/cls_api.c   |   2 +-
> > >  3 files changed, 116 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/include/net/act_api.h b/include/net/act_api.h
> > > index 77ee0c657..f22be14bb 100644
> > > --- a/include/net/act_api.h
> > > +++ b/include/net/act_api.h
> > > @@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
> > > 
> > >  struct tc_action_ops {
> > >       struct list_head head;
> > > +     struct list_head p4_head;
> > >       char    kind[IFNAMSIZ];
> > >       enum tca_id  id; /* identifier should match kind */
> > >       unsigned int    net_id;
> > > @@ -199,10 +200,12 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
> > >  int tcf_idr_release(struct tc_action *a, bool bind);
> > > 
> > >  int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
> > > +int tcf_register_p4_action(struct net *net, struct tc_action_ops *act);
> > >  int tcf_unregister_action(struct tc_action_ops *a,
> > >                         struct pernet_operations *ops);
> > >  #define NET_ACT_ALIAS_PREFIX "net-act-"
> > >  #define MODULE_ALIAS_NET_ACT(kind)   MODULE_ALIAS(NET_ACT_ALIAS_PREFIX kind)
> > > +void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act);
> > >  int tcf_action_destroy(struct tc_action *actions[], int bind);
> > >  int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
> > >                   int nr_actions, struct tcf_result *res);
> > > @@ -210,8 +213,9 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
> > >                   struct nlattr *est,
> > >                   struct tc_action *actions[], int init_res[], size_t *attr_size,
> > >                   u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
> > > -struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
> > > -                                      struct netlink_ext_ack *extack);
> > > +struct tc_action_ops *
> > > +tc_action_load_ops(struct net *net, struct nlattr *nla,
> > > +                u32 flags, struct netlink_ext_ack *extack);
> > >  struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
> > >                                   struct nlattr *nla, struct nlattr *est,
> > >                                   struct tc_action_ops *a_o, int *init_res,
> > > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > > index 9ee622fb1..23ef394f2 100644
> > > --- a/net/sched/act_api.c
> > > +++ b/net/sched/act_api.c
> > > @@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
> > >       kfree(cookie);
> > >  }
> > > 
> > > +static unsigned int p4_act_net_id;
> > > +
> > > +struct tcf_p4_act_net {
> > > +     struct list_head act_base;
> > > +     rwlock_t act_mod_lock;
> > 
> > Note that rwlock in networking code is discouraged, as they have to be
> > unfair, see commit 0daf07e527095e64ee8927ce297ab626643e9f51.
> > 
> > In this specific case I think there should be no problems, as is
> > extremely hard/impossible to have serious contention on the write
> > side,. Also there is already an existing rwlock nearby, no not a
> > blocker but IMHO worthy to be noted.
> > 
> 
> Sure - we can replace it. What's the preference? Spinlock?

Plain spinlock will work. Using spinlock + RCU should be quite straightforward
and will provide faster lookups.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-01  6:53   ` Martin KaFai Lau
@ 2024-03-01 12:31     ` Jamal Hadi Salim
  2024-03-03  1:32       ` Martin KaFai Lau
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-01 12:31 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On Fri, Mar 1, 2024 at 1:53 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/25/24 8:54 AM, Jamal Hadi Salim wrote:
> > +struct p4tc_table_entry_act_bpf_params {
>
> Will this struct be extended in the future?
>
> > +     u32 pipeid;
> > +     u32 tblid;
> > +};
> > +

Not that I can think of. We probably want to have the option to do so
if needed. Do you see any harm if we were to make changes for whatever
reason in the future?

> > +struct p4tc_table_entry_create_bpf_params {
> > +     u32 profile_id;
> > +     u32 pipeid;
> > +     u32 tblid;
> > +};
> > +
>
> [ ... ]
>
> > diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
> > index c5256d821..155068de0 100644
> > --- a/include/net/tc_act/p4tc.h
> > +++ b/include/net/tc_act/p4tc.h
> > @@ -13,10 +13,26 @@ struct tcf_p4act_params {
> >       u32 tot_params_sz;
> >   };
> >
> > +#define P4TC_MAX_PARAM_DATA_SIZE 124
> > +
> > +struct p4tc_table_entry_act_bpf {
> > +     u32 act_id;
> > +     u32 hit:1,
> > +         is_default_miss_act:1,
> > +         is_default_hit_act:1;
> > +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
> > +} __packed;
> > +
> > +struct p4tc_table_entry_act_bpf_kern {
> > +     struct rcu_head rcu;
> > +     struct p4tc_table_entry_act_bpf act_bpf;
> > +};
> > +
> >   struct tcf_p4act {
> >       struct tc_action common;
> >       /* Params IDR reference passed during runtime */
> >       struct tcf_p4act_params __rcu *params;
> > +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
> >       u32 p_id;
> >       u32 act_id;
> >       struct list_head node;
> > @@ -24,4 +40,39 @@ struct tcf_p4act {
> >
> >   #define to_p4act(a) ((struct tcf_p4act *)a)
> >
> > +static inline struct p4tc_table_entry_act_bpf *
> > +p4tc_table_entry_act_bpf(struct tc_action *action)
> > +{
> > +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
> > +     struct tcf_p4act *p4act = to_p4act(action);
> > +
> > +     act_bpf = rcu_dereference(p4act->act_bpf);
> > +
> > +     return &act_bpf->act_bpf;
> > +}
> > +
> > +static inline int
> > +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
> > +                                   u32 dflt_miss, u32 dflt_hit)
> > +{
> > +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
> > +     struct tcf_p4act *p4act = to_p4act(action);
> > +
> > +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
>
>
> [ ... ]
>
> > +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> > +bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,
>
> The argument could be "struct sk_buff *skb" instead of __sk_buff. Take a look at
> commit 2f4643934670.

We'll make that change.
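As a sketch, the converted kfunc would then take the kernel type
directly (kernel-only fragment assuming the helpers from the patch;
not the actual respin):

```c
/* The verifier maps the program-visible __sk_buff to the kernel's
 * struct sk_buff for kfunc arguments, so the cast goes away.
 */
__bpf_kfunc static struct p4tc_table_entry_act_bpf *
bpf_p4tc_tbl_read(struct sk_buff *skb,
		  struct p4tc_table_entry_act_bpf_params *params,
		  void *key, const u32 key__sz)
{
	struct net *caller_net;

	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);

	return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
}
```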

>
> > +               struct p4tc_table_entry_act_bpf_params *params,
> > +               void *key, const u32 key__sz)
> > +{
> > +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +     struct net *caller_net;
> > +
> > +     caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > +
> > +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> > +}
> > +
> > +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> > +xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
> > +               struct p4tc_table_entry_act_bpf_params *params,
> > +               void *key, const u32 key__sz)
> > +{
> > +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> > +     struct net *caller_net;
> > +
> > +     caller_net = dev_net(ctx->rxq->dev);
> > +
> > +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> > +}
> > +
> > +static int
> > +__bpf_p4tc_entry_create(struct net *net,
> > +                     struct p4tc_table_entry_create_bpf_params *params,
> > +                     void *key, const u32 key__sz,
> > +                     struct p4tc_table_entry_act_bpf *act_bpf)
> > +{
> > +     struct p4tc_table_entry_key *entry_key = key;
> > +     struct p4tc_pipeline *pipeline;
> > +     struct p4tc_table *table;
> > +
> > +     if (!params || !key)
> > +             return -EINVAL;
> > +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
> > +             return -EINVAL;
> > +
> > +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
> > +     if (!pipeline)
> > +             return -ENOENT;
> > +
> > +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
> > +     if (!table)
> > +             return -ENOENT;
> > +
> > +     if (entry_key->keysz != table->tbl_keysz)
> > +             return -EINVAL;
> > +
> > +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
> > +                                        params->profile_id);
>
> My understanding is this kfunc will allocate a "struct
> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
> never called and the bpf prog is unloaded, how the act_bpf object will be
> cleaned up?
>

The TC code takes care of this. Unloading the bpf prog does not affect
the deletion; it is the TC control side that will take care of it. If,
on the other hand, we delete the pipeline, then not just this entry
but all entries will be flushed.

> > +}
> > +
> > +__bpf_kfunc static int
> > +bpf_p4tc_entry_create(struct __sk_buff *skb_ctx,
> > +                   struct p4tc_table_entry_create_bpf_params *params,
> > +                   void *key, const u32 key__sz,
> > +                   struct p4tc_table_entry_act_bpf *act_bpf)
> > +{
> > +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +     struct net *net;
> > +
> > +     net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > +
> > +     return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> > +}
> > +
> > +__bpf_kfunc static int
> > +xdp_p4tc_entry_create(struct xdp_md *xdp_ctx,
> > +                   struct p4tc_table_entry_create_bpf_params *params,
> > +                   void *key, const u32 key__sz,
> > +                   struct p4tc_table_entry_act_bpf *act_bpf)
> > +{
> > +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> > +     struct net *net;
> > +
> > +     net = dev_net(ctx->rxq->dev);
> > +
> > +     return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> > +}
> > +
> > +__bpf_kfunc static int
> > +bpf_p4tc_entry_create_on_miss(struct __sk_buff *skb_ctx,
> > +                           struct p4tc_table_entry_create_bpf_params *params,
> > +                           void *key, const u32 key__sz,
> > +                           struct p4tc_table_entry_act_bpf *act_bpf)
> > +{
> > +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +     struct net *net;
> > +
> > +     net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > +
> > +     return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> > +}
> > +
> > +__bpf_kfunc static int
> > +xdp_p4tc_entry_create_on_miss(struct xdp_md *xdp_ctx,
>
> Same here. "struct xdp_buff *xdp".
>

ACK

> > +                           struct p4tc_table_entry_create_bpf_params *params,
> > +                           void *key, const u32 key__sz,
> > +                           struct p4tc_table_entry_act_bpf *act_bpf)
> > +{
> > +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> > +     struct net *net;
> > +
> > +     net = dev_net(ctx->rxq->dev);
> > +
> > +     return __bpf_p4tc_entry_create(net, params, key, key__sz, act_bpf);
> > +}
> > +
>
> [ ... ]
>
> > +__bpf_kfunc static int
> > +bpf_p4tc_entry_delete(struct __sk_buff *skb_ctx,
> > +                   struct p4tc_table_entry_create_bpf_params *params,
> > +                   void *key, const u32 key__sz)
> > +{
> > +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +     struct net *net;
> > +
> > +     net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> > +
> > +     return __bpf_p4tc_entry_delete(net, params, key, key__sz);
> > +}
> > +
> > +__bpf_kfunc static int
> > +xdp_p4tc_entry_delete(struct xdp_md *xdp_ctx,
> > +                   struct p4tc_table_entry_create_bpf_params *params,
> > +                   void *key, const u32 key__sz)
> > +{
> > +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> > +     struct net *net;
> > +
> > +     net = dev_net(ctx->rxq->dev);
> > +
> > +     return __bpf_p4tc_entry_delete(net, params, key, key__sz);
> > +}
> > +
> > +BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
>
> This soon will be broken with the latest change in bpf-next. It is replaced by
> BTF_KFUNCS_START. commit a05e90427ef6.
>

Ok, this wasn't in net-next when we pushed. We base our changes on
net-next. When do you plan to merge that into net-next?
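For reference, the converted registration would look roughly like this
(a sketch against bpf-next commit a05e90427ef6; kernel-only fragment,
kfunc names taken from the quoted patch):

```c
/* BTF_KFUNCS_START/END replace BTF_SET8_START/END for kfunc ID sets
 * and implicitly mark every entry in the set as a kfunc.
 */
BTF_KFUNCS_START(p4tc_kfunc_check_tbl_set_skb)
BTF_ID_FLAGS(func, bpf_p4tc_tbl_read, KF_RET_NULL);
BTF_ID_FLAGS(func, bpf_p4tc_entry_create);
BTF_ID_FLAGS(func, bpf_p4tc_entry_create_on_miss);
BTF_ID_FLAGS(func, bpf_p4tc_entry_update);
BTF_ID_FLAGS(func, bpf_p4tc_entry_delete);
BTF_KFUNCS_END(p4tc_kfunc_check_tbl_set_skb)
```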

> What is the plan on the selftest ?
>

We may need some guidance. How do you see us writing a selftest for
this? We have extensive testing on the control side, which is netlink
(not part of the current series).

Overall: thank you for taking the time to review - it is the kind of
feedback we were hoping for from the eBPF side.

cheers,
jamal

> > +BTF_ID_FLAGS(func, bpf_p4tc_tbl_read, KF_RET_NULL);
> > +BTF_ID_FLAGS(func, bpf_p4tc_entry_create);
> > +BTF_ID_FLAGS(func, bpf_p4tc_entry_create_on_miss);
> > +BTF_ID_FLAGS(func, bpf_p4tc_entry_update);
> > +BTF_ID_FLAGS(func, bpf_p4tc_entry_delete);
> > +BTF_SET8_END(p4tc_kfunc_check_tbl_set_skb)
> > +
> > +static const struct btf_kfunc_id_set p4tc_kfunc_tbl_set_skb = {
> > +     .owner = THIS_MODULE,
> > +     .set = &p4tc_kfunc_check_tbl_set_skb,
> > +};
> > +
> > +BTF_SET8_START(p4tc_kfunc_check_tbl_set_xdp)
> > +BTF_ID_FLAGS(func, xdp_p4tc_tbl_read, KF_RET_NULL);
> > +BTF_ID_FLAGS(func, xdp_p4tc_entry_create);
> > +BTF_ID_FLAGS(func, xdp_p4tc_entry_create_on_miss);
> > +BTF_ID_FLAGS(func, xdp_p4tc_entry_update);
> > +BTF_ID_FLAGS(func, xdp_p4tc_entry_delete);
> > +BTF_SET8_END(p4tc_kfunc_check_tbl_set_xdp)
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-01  7:02   ` Martin KaFai Lau
@ 2024-03-01 12:36     ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-01 12:36 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: John Fastabend, deb.chatterjee, anjali.singhai, namrata.limaye,
	tom, mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf, netdev

On Fri, Mar 1, 2024 at 2:02 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/28/24 9:11 AM, John Fastabend wrote:
> >   - The kfuncs are mostly duplicates of map ops we already have in BPF API.
> >     The motivation by my read is to use netlink instead of bpf commands. I
>
> I also have similar thought on the kfuncs (create/update/delete) which is mostly
> bpf map ops. It could have one single kfunc to allocate a kernel specific p4
> entry/object and then store that in a bpf map. With the bpf_rbtree, bpf_list,
> and other recent advancements, it should be able to describe them in a bpf map.
> The reply in v9 was that the p4 table will also be used in the future HW
> piece/driver but the HW piece is not ready yet, bpf is the only consumer of the
> kernel p4 table now and this makes mimicking the bpf map api to kfuncs not
> convincing. bpf "tc / xdp" program uses netlink to attach/detach and the policy
> also stays in the bpf map.
>

It's a lot more complex than just attaching/detaching. Our control
plane uses netlink (regardless of whether it is offloaded or not) for
all object controls (not just table entries), for the many reasons
that have been stated in the cover letters since the beginning. I
unfortunately took out some of that text after v10 to try to shorten
the cover letter. I will be adding it back. If you can't find it I
could cut-and-paste it and send it privately.

cheers,
jamal

> When there is a HW piece that consumes the p4 table, that will be a better time
> to discuss the kfunc interface.
>
> >     don't agree with this, optimizing for some low level debug a developer
> >     uses is the wrong design space. Actual users should not be deploying
> >     this via ssh into boxes. The workflow will not scale and really we need
> >     tooling and infra to land P4 programs across the network. This is orders
> >     of more pain if its an endpoint solution and not a middlebox/switch
> >     solution. As a switch solution I don't see how p4tc sw scales to even TOR
> >     packet rates. So you need tooling on top and user interact with the
> >     tooling not the Linux widget/debugger at the bottom.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list
  2024-03-01  7:30       ` Paolo Abeni
@ 2024-03-01 12:39         ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-01 12:39 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, deb.chatterjee, anjali.singhai, namrata.limaye, tom,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, vladbu, horms, khalidm,
	toke, daniel, victor, pctammela, bpf

On Fri, Mar 1, 2024 at 2:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-02-29 at 13:21 -0500, Jamal Hadi Salim wrote:
> > On Thu, Feb 29, 2024 at 10:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
> > > > In P4 we require to generate new actions "on the fly" based on the
> > > > specified P4 action definition. P4 action kinds, like the pipeline
> > > > they are attached to, must be per net namespace, as opposed to native
> > > > action kinds which are global. For that reason, we chose to create a
> > > > separate structure to store P4 actions.
> > > >
> > > > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > > > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > > > Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> > > > Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> > > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > > Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
> > > > Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > > > ---
> > > >  include/net/act_api.h |   8 ++-
> > > >  net/sched/act_api.c   | 123 +++++++++++++++++++++++++++++++++++++-----
> > > >  net/sched/cls_api.c   |   2 +-
> > > >  3 files changed, 116 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/include/net/act_api.h b/include/net/act_api.h
> > > > index 77ee0c657..f22be14bb 100644
> > > > --- a/include/net/act_api.h
> > > > +++ b/include/net/act_api.h
> > > > @@ -105,6 +105,7 @@ typedef void (*tc_action_priv_destructor)(void *priv);
> > > >
> > > >  struct tc_action_ops {
> > > >       struct list_head head;
> > > > +     struct list_head p4_head;
> > > >       char    kind[IFNAMSIZ];
> > > >       enum tca_id  id; /* identifier should match kind */
> > > >       unsigned int    net_id;
> > > > @@ -199,10 +200,12 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
> > > >  int tcf_idr_release(struct tc_action *a, bool bind);
> > > >
> > > >  int tcf_register_action(struct tc_action_ops *a, struct pernet_operations *ops);
> > > > +int tcf_register_p4_action(struct net *net, struct tc_action_ops *act);
> > > >  int tcf_unregister_action(struct tc_action_ops *a,
> > > >                         struct pernet_operations *ops);
> > > >  #define NET_ACT_ALIAS_PREFIX "net-act-"
> > > >  #define MODULE_ALIAS_NET_ACT(kind)   MODULE_ALIAS(NET_ACT_ALIAS_PREFIX kind)
> > > > +void tcf_unregister_p4_action(struct net *net, struct tc_action_ops *act);
> > > >  int tcf_action_destroy(struct tc_action *actions[], int bind);
> > > >  int tcf_action_exec(struct sk_buff *skb, struct tc_action **actions,
> > > >                   int nr_actions, struct tcf_result *res);
> > > > @@ -210,8 +213,9 @@ int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
> > > >                   struct nlattr *est,
> > > >                   struct tc_action *actions[], int init_res[], size_t *attr_size,
> > > >                   u32 flags, u32 fl_flags, struct netlink_ext_ack *extack);
> > > > -struct tc_action_ops *tc_action_load_ops(struct nlattr *nla, u32 flags,
> > > > -                                      struct netlink_ext_ack *extack);
> > > > +struct tc_action_ops *
> > > > +tc_action_load_ops(struct net *net, struct nlattr *nla,
> > > > +                u32 flags, struct netlink_ext_ack *extack);
> > > >  struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
> > > >                                   struct nlattr *nla, struct nlattr *est,
> > > >                                   struct tc_action_ops *a_o, int *init_res,
> > > > diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> > > > index 9ee622fb1..23ef394f2 100644
> > > > --- a/net/sched/act_api.c
> > > > +++ b/net/sched/act_api.c
> > > > @@ -57,6 +57,40 @@ static void tcf_free_cookie_rcu(struct rcu_head *p)
> > > >       kfree(cookie);
> > > >  }
> > > >
> > > > +static unsigned int p4_act_net_id;
> > > > +
> > > > +struct tcf_p4_act_net {
> > > > +     struct list_head act_base;
> > > > +     rwlock_t act_mod_lock;
> > >
> > > Note that rwlock in networking code is discouraged, as they have to be
> > > unfair, see commit 0daf07e527095e64ee8927ce297ab626643e9f51.
> > >
> > > In this specific case I think there should be no problems, as it is
> > > extremely hard/impossible to have serious contention on the write
> > > side. Also there is already an existing rwlock nearby, so not a
> > > blocker but IMHO worth noting.
> > >
> >
> > Sure - we can replace it. What's the preference? Spinlock?
>
> Plain spinlock will work. Using spinlock + RCU should be quite straight
> forward and will provide faster lookup.
>

RCU + spinlock sounds like a bit of overkill, but we'll look into it.

cheers,
jamal

> Cheers,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
       [not found]         ` <CAOuuhY8qbsYCjdUYUZv8J3jz8HGXmtxLmTDP6LKgN5uRVZwMnQ@mail.gmail.com>
@ 2024-03-01 17:00           ` Jakub Kicinski
  2024-03-01 17:39             ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-01 17:00 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote:
> > I want to emphasize again these patches are about the P4 s/w pipeline
> > that is intended to work seamlessly with hw offload. If you are
> > interested in h/w offload and want to contribute just show up at the
> > meetings - they are open to all. The current offloadable piece is the
> > match-action tables. The P4 specs may change to include parsers in the
> > future or other objects etc (but not sure why we should discuss this
> > in the thread).
> 
> Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> target? How does going through TC make this seamless?

+1

My intuition is that for offload the device would be programmed at
start-of-day / probe. By loading the compiled P4 from /lib/firmware.
Then the _device_ tells the kernel what tables and parser graph it's
got.

Plus, if we're talking about offloads, aren't we getting back into
the same controversies we had when merging OvS (not that I was around).
The "standalone stack to the side" problem. Some of the tables in the
pipeline may be for routing, not ACLs. Should they be fed from the
routing stack? How is that integration going to work? The parsing
graph feels a bit like global device configuration, not a piece of
functionality that should sit under sub-sub-system in the corner.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-01 17:00           ` Jakub Kicinski
@ 2024-03-01 17:39             ` Jamal Hadi Salim
  2024-03-02  1:32               ` Jakub Kicinski
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-01 17:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 29 Feb 2024 19:00:50 -0800 Tom Herbert wrote:
> > > I want to emphasize again these patches are about the P4 s/w pipeline
> > > that is intended to work seamlessly with hw offload. If you are
> > > interested in h/w offload and want to contribute just show up at the
> > > meetings - they are open to all. The current offloadable piece is the
> > > match-action tables. The P4 specs may change to include parsers in the
> > > future or other objects etc (but not sure why we should discuss this
> > > in the thread).
> >
> > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > target? How does going through TC make this seamless?
>
> +1
>

I should clarify what I meant by "seamless": the same control API is
used for s/w or h/w. This is a feature of tc and is not being
introduced by P4TC. P4 control only deals with match-action tables -
just as TC does.

> My intuition is that for offload the device would be programmed at
> start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> Then the _device_ tells the kernel what tables and parser graph it's
> got.
>

BTW: I just want to say that these patches are about s/w - not
offload. Someone asked about offload, so as in normal discussions we
steered in that direction. The hardware piece will require additional
patchsets which still require discussion. I hope we don't steer off
too much; otherwise I can start a new thread just to discuss the
current view of the h/w.

It's not the device telling the kernel what it has; it's the other way
around. From the P4 program you generate the s/w (the eBPF code and
other auxiliary pieces) and the h/w pieces using a compiler. You
compile the eBPF, etc., then load.

The current point of discussion is that the h/w binary is to be
"activated" through the same tc filter that handles the s/w. So one
could say:

tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3 \
   prog type hw filename "simple_l3.o" ... \
   action bpf obj $PARSER.o section p4tc/parser \
   action bpf obj $PROGNAME.o section p4tc/main

And that would, through tc driver callbacks, signal the driver to
find the binary, possibly via /lib/firmware.
Some of the original discussion was to use devlink for loading the
binary - but that went nowhere.

Once you have this in place, runtime control is then plain netlink
with tc skip_sw/skip_hw. This is what I meant by "seamless".
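For illustration, with hypothetical runtime syntax (the offload
patchsets are not posted yet, so the exact placement of the flags may
well differ), table entries would then follow the familiar
skip_sw/skip_hw convention:

```
..install the entry in both s/w and h/w (the default)
  tc p4ctrl create myprog/table/mytable \
   dstAddr 10.0.1.2/32 action send_to_port param port eno1

..same entry, h/w only
  tc p4ctrl create myprog/table/mytable \
   dstAddr 10.0.1.2/32 skip_sw action send_to_port param port eno1
```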

> Plus, if we're talking about offloads, aren't we getting back into
> the same controversies we had when merging OvS (not that I was around).
> The "standalone stack to the side" problem. Some of the tables in the
> pipeline may be for routing, not ACLs. Should they be fed from the
> routing stack? How is that integration going to work? The parsing
> graph feels a bit like global device configuration, not a piece of
> functionality that should sit under sub-sub-system in the corner.

The current (maybe I should say initial) thought is that the P4
program does not touch existing kernel infra such as the fdb, etc.
Of course we can model the kernel datapath using P4, but you won't be
using "ip route add ..." or "bridge fdb ...".
In the future, a P4 extern could be used to model existing infra, and
we should be able to use the same tooling. That is a discussion that
comes up on and off (I think it did in the last meeting).

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-02-29 17:13 ` Paolo Abeni
  2024-02-29 18:49   ` Jamal Hadi Salim
  2024-02-29 21:49   ` Singhai, Anjali
@ 2024-03-01 18:53   ` Chris Sommers
  2 siblings, 0 replies; 71+ messages in thread
From: Chris Sommers @ 2024-03-01 18:53 UTC (permalink / raw)
  To: Paolo Abeni, Jamal Hadi Salim, netdev@vger.kernel.org
  Cc: deb.chatterjee@intel.com, anjali.singhai@intel.com,
	namrata.limaye@intel.com, tom@sipanda.io, mleitner@redhat.com,
	Mahesh.Shirshyad@amd.com, Vipin.Jain@amd.com,
	tomasz.osinski@intel.com, jiri@resnulli.us,
	xiyou.wangcong@gmail.com, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, vladbu@nvidia.com,
	horms@kernel.org, khalidm@nvidia.com, toke@redhat.com,
	daniel@iogearbox.net, victor@mojatatu.com, pctammela@mojatatu.com,
	dan.daly@intel.com, andy.fingerhut@gmail.com, mattyk@nvidia.com,
	bpf@vger.kernel.org

>From: Paolo Abeni <pabeni@redhat.com>
>Sent: Thursday, February 29, 2024 9:14 AM
>To: Jamal Hadi Salim <jhs@mojatatu.com>; netdev@vger.kernel.org
>Cc: deb.chatterjee@intel.com; anjali.singhai@intel.com; namrata.limaye@intel.com; tom@sipanda.io; mleitner@redhat.com; Mahesh.Shirshyad@amd.com; Vipin.Jain@amd.com; tomasz.osinski@intel.com; jiri@resnulli.us; xiyou.wangcong@gmail.com; davem@davemloft.net; edumazet@google.com; kuba@kernel.org; vladbu@nvidia.com; horms@kernel.org; khalidm@nvidia.com; toke@redhat.com; daniel@iogearbox.net; victor@mojatatu.com; pctammela@mojatatu.com; dan.daly@intel.com; andy.fingerhut@gmail.com; Chris Sommers <chris.sommers@keysight.com>; mattyk@nvidia.com; bpf@vger.kernel.org
>Subject: Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
>
>On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
>> This is the first patchset of two. In this patch we are submitting 15 which
>> cover the minimal viable P4 PNA architecture.
>> 
>> __Description of these Patches__
>> 
>> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
>> as need basis for the P4 program requirement. This patch makes a small incision
>> into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
>> effect the classical tc action (example patch#2 just increases the size of the
>> action names from 16->64B).
>> Patch 5 adds infrastructure support for preallocation of dynamic actions.
>> 
>> The core P4TC code implements several P4 objects.
>> 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
>> 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
>> 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD commands
>>    for P4 pipelines.
>> 4) Patch #9 introduces the action templates and associated CRUD commands.
>> 5) Patch #10 introduce the action runtime infrastructure.
>> 6) Patch #11 introduces the concept of P4 table templates and associated
>>    CRUD commands for tables.
>> 7) Patch #12 introduces runtime table entry infra and associated CU commands.
>> 8) Patch #13 introduces runtime table entry infra and associated RD commands.
>> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
>> 10) Patch #15 introduces the TC classifier P4 used at runtime.
>> 
>> Daniel, please look again at patch #15.
>> 
>> There are a few more patches (5) not in this patchset that deal with test
>> cases, etc.
>> 
>> What is P4?
>> -----------
>> 
>> The Programming Protocol-independent Packet Processors (P4) is an open source,
>> domain-specific programming language for specifying data plane behavior.
>> 
>> The current P4 landscape includes an extensive range of deployments, products,
>> projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
>> currently offer P4-native NICs. P4 is currently curated by the Linux
>> Foundation[9].
>> 
>> On why P4 - see small treatise here:[4].
>> 
>> What is P4TC?
>> -------------
>> 
>> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
>> and its associated objects and state are attached to a kernel _netns_ structure.
>> IOW, if we had two programs across netns' or within a netns they have no
>> visibility to each other's objects (unlike for example TC actions whose kinds are
>> "global" in nature, or eBPF maps vis-a-vis bpftool).
>> 
>> P4TC builds on top of many years of Linux TC experiences of a netlink control
>> path interface coupled with a software datapath with an equivalent offloadable
>> hardware datapath. In this patch series we are focusing only on the s/w
>> datapath. The s/w and h/w path equivalence that TC provides is relevant
>> for a primary use case of P4 where some (currently) large consumers of NICs
>> provide vendors their datapath specs in P4. In such a case one could generate
>> specified datapaths in s/w and test/validate the requirements before hardware
>> acquisition(example [12]).
>> 
>> Unlike other approaches such as TC Flower which require kernel and user space
>> changes when new datapath objects like packet headers are introduced P4TC, with
>> these patches, provides _kernel and user space code change independence_.
>> Meaning:
>> A P4 program describes headers, parsers, etc alongside the datapath processing;
>> the compiler uses the P4 program as input and generates several artifacts which
>> are then loaded into the kernel to manifest the intended datapath. In addition
>> to the generated datapath, control path constructs are generated. The process is
>> described further below in "P4TC Workflow".
>> 
>> There have been many discussions and meetings within the community since
>> about 2015 in regards to P4 over TC[2] and we are finally proving to the
>> naysayers that we do get stuff done!
>> 
>> A lot more of the P4TC motivation is captured at:
>> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
>> 
>> __P4TC Architecture__
>> 
>> The current architecture was described at netdevconf 0x17[14] and if you prefer
>> academic conference papers, a short paper is available here[15].
>> 
>> There are 4 parts:
>> 
>> 1) A Template CRUD provisioning API for manifesting a P4 program and its
>> associated objects in the kernel. The template provisioning API uses netlink.
>> See patch in part 2.
>> 
>> 2) A Runtime CRUD+ API code which is used for controlling the different runtime
>> behavior of the P4 objects. The runtime API uses netlink. See notes further
>> down. See patch description later..
>> 
>> 3) P4 objects and their control interfaces: tables, actions, externs, etc.
>> Any object that requires control plane interaction resides in the TC domain
>> and is subject to the CRUD runtime API.  The intended goal is to make use of the
>> tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
>> 
>> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
>> by a compiler based on the P4 spec. When accessing any P4 object that requires
>> control plane interfaces, the eBPF code accesses the P4TC side from #3 above
>> using kfuncs.
>> 
>> The generated eBPF code is derived from [13] with enhancements and fixes to meet
>> our requirements.
>> 
>> __P4TC Workflow__
>> 
>> The Development and instantiation workflow for P4TC is as follows:
>> 
>>   A) A developer writes a P4 program, "myprog"
>> 
>>   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
>> 
>>      a) A shell script which form template definitions for the different P4
>>      objects "myprog" utilizes (tables, externs, actions etc). See #1 above..
>> 
>>      b) the parser and the rest of the datapath are generated as eBPF and need
>>      to be compiled into binaries. At the moment the parser and the main control
>>      block are generated as separate eBPF program but this could change in
>>      the future (without affecting any kernel code). See #4 above.
>> 
>>      c) A json introspection file used for the control plane (by iproute2/tc).
>> 
>>   C) At this point the artifacts from #1,#4 could be handed to an operator
>>      (the operator could be the same person as the developer from #A, #B).
>> 
>>      i) For the eBPF part, either the operator is handed an ebpf binary or
>>      source which they compile at this point into a binary.
>>      The operator executes the shell script(s) to manifest the functional
>>      "myprog" into the kernel.
>> 
>>      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
>>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
>>      (illustrated below as "block 22").
>> 
>>      Example instantiation where the parser is a separate action:
>>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>>         action bpf obj $PARSER.o section p4tc/parse \
>>         action bpf obj $PROGNAME.o section p4tc/main"
>> 
>> See individual patches in part C for more examples (tc vs xdp, etc). Also see
>> the section on "challenges" (further below in this cover letter).
>> 
>> Once "myprog" P4 program is instantiated one can start performing operations
>> on table entries and/or actions at runtime as described below.
>> 
>> __P4TC Runtime Control Path__
>> 
>> The control interface builds on past tc experience and tries to get things
>> right from the beginning (example filtering is separated from depending
>> on existing object TLVs and made generic); also the code is written in
>> such a way it is mostly lockless.
>> 
>> The P4TC control interface, using netlink, provides what we call a CRUDPS
>> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
>> Publish. From a high-level PoV, the following describes a conformant API
>> (both at the netlink data model and code level):
>> 
>>  Create(</path/to/object, DATA>+)
>>  Read(</path/to/object>, [optional filter])
>>  Update(</path/to/object, DATA>+)
>>  Delete(</path/to/object>, [optional filter])
>>  Subscribe(</path/to/object>, [optional filter])
>> 
>> Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
>> to a table then a "Delete" implies "flush" and a "Read" implies a dump, but if
>> it points to an entry (by specifying a key) then "Delete" implies deleting
>> that entry and "Read" implies reading that single entry. It should be noted that
>> both "Delete" and "Read" take an optional filter parameter. The filter can
>> define further refinements to what the control plane wants read or deleted.
>> "Subscribe" uses built in netlink event management. It, as well, takes a filter
>> which can further refine what events get generated to the control plane (taken
>> out of this patchset, to be re-added with consideration of [16]).
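To make the path semantics above concrete, here is a toy model (illustrative Python; all names are invented and this is not the kernel implementation) of how "Delete" and "Read" change meaning depending on whether the path points at a table or at an entry:

```python
# Toy model of the path-addressed CRUD semantics described above.
# Illustrative only: names are invented, not the P4TC implementation.

class Table:
    def __init__(self):
        self.entries = {}  # key -> entry data

    def delete(self, key=None, filter=None):
        if key is not None:
            # Path names a single entry: delete just that entry.
            return self.entries.pop(key)
        # Path names the table: "Delete" implies "flush". An optional
        # filter narrows the flush to matching entries only.
        keep = {k: v for k, v in self.entries.items()
                if filter is not None and not filter(v)}
        flushed = len(self.entries) - len(keep)
        self.entries = keep
        return flushed

    def read(self, key=None, filter=None):
        if key is not None:
            # Path names a single entry: read just that entry.
            return self.entries[key]
        # Path names the table: "Read" implies a dump, optionally filtered.
        return [v for v in self.entries.values()
                if filter is None or filter(v)]
```

In this model, `read(key)` corresponds to a single-entry get, while `read()` with no key is the table dump.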
>> 
>> Let's show some runtime samples:
>> 
>> ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
>>   tc p4ctrl create myprog/table/mytable \
>>    dstAddr 10.0.1.2/32 action send_to_port param port eno1
>> 
>> ..Batch create entries
>>   tc p4ctrl create myprog/table/mytable \
>>   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
>>   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
>>   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
>> 
>> ..Get an entry (note "read" is interchangeably used as "get" which is a common
>>      semantic in tc):
>>   tc p4ctrl read myprog/table/mytable \
>>    dstAddr 10.0.2.2/32
>> 
>> ..dump mytable
>>   tc p4ctrl read myprog/table/mytable
>> 
>> ..dump mytable for all entries whose key fits within 10.1.0.0/16
>>   tc p4ctrl read myprog/table/mytable \
>>   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
>> 
>> ..dump all mytable entries which have an action send_to_port with param "eno1"
>>   tc p4ctrl get myprog/table/mytable \
>>   filter param/act/myprog/send_to_port/port = "eno1"
>> 
>> The filter expression is powerful; for example you could say:
>> 
>>   tc p4ctrl get myprog/table/mytable \
>>   filter param/act/myprog/send_to_port/port = "eno1" && \
>>          key/myprog/mytable/dstAddr = 10.1.0.0/16
>> 
>> It also works on built-in metadata; for example, the following dumps
>> entries from mytable that have seen activity in the last 10 seconds:
>>   tc p4ctrl get myprog/table/mytable \
>>   filter msecs_since < 10000
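As a rough illustration of how such filter terms compose (a hedged Python sketch; the real filters travel as netlink TLVs and are evaluated in the kernel, and all names here are invented):

```python
import operator

# Sketch of ANDed filter terms like the examples above, e.g.
# 'msecs_since < 10000' or 'port = "eno1"'. Invented names;
# not the actual P4TC filter implementation.

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def match_term(entry, field, op, value):
    """Evaluate a single term such as ('msecs_since', '<', 10000)."""
    return OPS[op](entry.get(field), value)

def match_all(entry, terms):
    """AND all terms together, as the '&&' example above does."""
    return all(match_term(entry, field, op, value)
               for field, op, value in terms)
```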
>> 
>> Delete follows the same syntax as get/read, so for the sake of brevity we won't
>> show more examples beyond how to flush mytable:
>> 
>>   tc p4ctrl delete myprog/table/mytable
>> 
>> Mystery question: How do we achieve iproute2-kernel independence and
>> how does "tc p4ctrl" as a cli know how to program the kernel given an
>> arbitrary command line as shown above? Answer(s): It queries the
>> compiler-generated json file from "P4TC Workflow" #B.c above. The json file has
>> enough details to figure out that we have a program called "myprog" which has a
>> table "mytable" with a key named "dstAddr" which happens to be of type ipv4
>> address prefix. The json file also provides details showing that the table
>> "mytable" supports an action called "send_to_port" which accepts a parameter
>> "port" of type netdev (see the types patch for all supported P4 data types).
>> All P4 components have names, IDs, and types - so this makes it very easy to map
>> into netlink.
>> Once user space tc/p4ctrl validates the human command input, it creates
>> standard binary netlink structures (TLVs etc) which are sent to the kernel.
>> See the runtime table entry patch for more details.
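As a hypothetical sketch of that validation step (the json layout below is invented for illustration; the real format is whatever the compiler backend emits):

```python
import ipaddress
import json

# Hypothetical sketch: a CLI consults the compiler-generated
# introspection file to type-check user input before building the
# binary netlink TLVs. The json layout is invented for illustration.

INTROSPECTION = json.loads("""
{
  "pipeline": "myprog",
  "tables": {
    "mytable": {
      "key": {"dstAddr": "ipv4-prefix"},
      "actions": {"send_to_port": {"port": "netdev"}}
    }
  }
}
""")

def validate_key(table, key_name, value):
    """Validate a key value against its declared type."""
    key_type = INTROSPECTION["tables"][table]["key"][key_name]
    if key_type == "ipv4-prefix":
        # Raises ValueError for anything that is not a valid prefix.
        return ipaddress.ip_network(value)
    raise NotImplementedError(key_type)
```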
>> 
>> __P4TC Datapath__
>> 
>> The P4TC s/w datapath execution is generated as eBPF. Any objects that require
>> control interfacing reside in the "P4TC domain" and are controlled via netlink
>> as described above. Per-packet execution and state, and even objects that do
>> not require control interfacing (like the P4 parser), are generated as eBPF.
>> 
>> A packet arriving on s/w ingress of any of the ports on block 22 will first be
>> run through the (generated eBPF) parser component to extract the headers (the
>> IP destination address labelled "dstAddr" above).
>> The datapath then proceeds to use "dstAddr", table ID and pipeline ID
>> as a key to do a lookup in myprog's "mytable" which returns the action params
>> which are then used to execute the action in the eBPF datapath (eventually
>> sending out packets to eno1).
>> On a table miss, mytable's default miss action (not described) is executed.
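A rough per-packet model of that flow (illustrative Python standing in for the generated eBPF; the actual datapath is compiled eBPF, and all names here are invented):

```python
# Rough model of the per-packet flow described above: parse, then
# look up (pipeline ID, table ID, key) and run the returned action,
# falling back to the table's default miss action on a miss.
# Illustrative only; the real datapath is generated eBPF.

def run_pipeline(tables, pipeline_id, table_id, dst_addr,
                 miss_action=("drop", None)):
    entry = tables.get((pipeline_id, table_id, dst_addr))
    if entry is None:
        # Table miss: execute the default miss action.
        return miss_action
    action, params = entry  # e.g. ("send_to_port", {"port": "eno1"})
    return (action, params)
```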
>> 
>> __Testing__
>> 
>> Speaking of testing - we have 200-300 tdc test cases (which will be in the
>> second patchset).
>> These tests are run on our CICD system on pull requests and after commits are
>> approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
>> input) including:
>> checkpatch, sparse, smatch, coccinelle, and 32-bit and 64-bit builds tested on
>> x86, ARM64 and emulated BE via qemu s390. We trigger performance testing in the
>> CICD to catch performance regressions (currently only on the control path, but
>> in the future for the datapath as well).
>> Syzkaller runs 24/7 on dedicated hardware; originally we focused only on the
>> memory sanitizer but recently added support for the concurrency sanitizer.
>> Before main releases we ensure each patch compiles on its own to help with
>> git bisect, and we run the xmas tree tool. We eventually run the code through
>> Coverity.
>> 
>> In addition we are working on enabling a tool that will take a P4 program, run
>> it through the compiler, and generate permutations of traffic patterns via
>> symbolic execution that will test both positive and negative datapath code
>> paths. The test generator tool integration is still work in progress.
>> Also: We have other code that tests parallelization etc. which we are trying to
>> find a fit for in the kernel tree's testing infra.
>> 
>> 
>> __References__
>> 
>> [1] https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> [2] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
>> [3] https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
>> [4] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
>> [5] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
>> [6] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
>> [7] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
>> [8] https://github.com/p4lang/p4c/tree/main/backends/tc
>> [9] https://p4.org/
>> [10] https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
>> [11] https://www.amd.com/en/accelerators/pensando
>> [12] https://github.com/sonic-net/DASH/tree/main
>> [13] https://github.com/p4lang/p4c/tree/main/backends/ebpf
>> [14] https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
>> [15] https://dl.acm.org/doi/10.1145/3630047.3630193
>> [16] https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
>> [17.a] https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
>> [17.b] man tc-u32
>> [18] man tc-pedit
>> [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
>> [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
>> [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
>> 
>> --------
>> HISTORY
>> --------
>> 
>> Changes in Version 12
>> ----------------------
>> 
>> 0) Introduce back 15 patches (v11 had 5)
>> 
>> 1) From discussions with Daniel:
>>    i) Remove the XDP program association altogether. No refcounting, nothing.
>>    ii) Remove prog type tc - everything is now an ebpf tc action.
>> 
>> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
>> 
>> 3) Add extack to specify how many entries (N of M) specified in a batch for
>>    any requested Create/Update/Delete succeeded. Prior to this it would
>>    only tell us that the batch failed to complete without giving details of
>>    which of the M entries failed. Added as a debug aid.
>> 
>> Changes in Version 11
>> ----------------------
>> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>>    will go out after this is merged.
>> 
>> 2) Change any references of IFNAMSIZ in the action code when referencing the
>>    action name size to ACTNAMSIZ. Thanks to Marcelo.
>> 
>> Changes in Version 10
>> ----------------------
>> 1) A couple of patches from the earlier version were clean enough to submit,
>>    so we did. This gave us room to split the two largest patches each into
>>    two. Even though the split is not git-bisectable and some of it didn't
>>    make much sense (e.g. splitting create and update into one patch and delete
>>    and get into another), we made sure each of the split patches compiled
>>    independently. The idea is to reduce the number of lines of code to review,
>>    and when we get sufficient reviews we will put the splits together again.
>>    See patches #12 and #13 as well as patches #7 and #8.
>> 
>> 2) Add more context in patch 0. Please READ!
>> 
>> 3) Added dump/delete filters back to the code - we had taken them out in the
>>    earlier patches to reduce the amount of code for review - but in retrospect
>>    we feel they are important enough to push earlier rather than later.
>> 
>> 
>> Changes In version 9
>> ---------------------
>> 
>> 1) Remove the largest patch (externs) to ease review.
>> 
>> 2) Break up action patches into two to ease review bringing down the patches
>>    that need more scrutiny to 8 (the first 7 are almost trivial).
>> 
>> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
>>    to provide consistency (Jiri).
>> 
>> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
>>    by making them static. TBH, not sure if this is the right solution
>>    but it makes sparse happy and hopefully someone will comment.
>> 
>> Changes In Version 8
>> ---------------------
>> 
>> 1) Fix all the patchwork warnings and improve our ci to catch them in the future
>> 
>> 2) Reduce the number of patches to a maximum of 15 to ease review.
>> 
>> Changes In Version 7
>> -------------------------
>> 
>> 0) First time removing the RFC tag!
>> 
>> 1) Removed the XDP cookie. It turns out, as was pointed out by Toke (thanks!),
>> that using bpf links is sufficient to protect us from someone replacing or
>> deleting an eBPF program after it has been bound to a netdev.
>> 
>> 2) Add some reviewed-bys from Vlad.
>> 
>> 3) Small bug fixes from v6 based on testing for ebpf.
>> 
>> 4) Added the counter extern as a sample extern. We illustrate this example
>>    because it is slightly complex: it is possible to invoke it directly from
>>    the P4TC domain (in the case of direct counters) or from eBPF (indirect
>>    counters). It is not exactly the most efficient implementation (a reasonable
>>    counter implementation should be per-cpu).
>> 
>> Changes In RFC Version 6
>> -------------------------
>> 
>> 1) Completed integration from the scriptable view to eBPF. Completed
>>    integration of externs.
>> 
>> 2) Small bug fixes from v5 based on testing.
>> 
>> Changes In RFC Version 5
>> -------------------------
>> 
>> 1) More integration from scriptable view to eBPF. Small bug fixes from last
>>    integration.
>> 
>> 2) More streamlining support of externs via kfunc (create-on-miss, etc)
>> 
>> 3) eBPF linking for XDP.
>> 
>> There is more eBPF integration/streamlining coming (we are getting close to
>> conversion from scriptable domain).
>> 
>> Changes In RFC Version 4
>> -------------------------
>> 
>> 1) More integration from scriptable to eBPF. Small bug fixes.
>> 
>> 2) More streamlining support of externs via kfunc (one additional kfunc).
>> 
>> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
>> 
>> There is more eBPF integration coming. One thing we looked at, which is not in
>> this patchset but should be in the next, is the use of eBPF links in our loading
>> (see "challenge #1" further below).
>> 
>> Changes In RFC Version 3
>> -------------------------
>> 
>> These patches are still in a little bit of flux as we adjust to integrating
>> eBPF. So there are small constructs that are used in V1 and 2 but no longer
>> used in this version. We will make a V4 which will remove those.
>> The changes from V2 are as follows:
>> 
>> 1) Feedback we got in V2 was to try to stick to one of the two modes. In this
>> version we take one more step and commit to mode 2, whereas v2 had both modes.
>> 
>> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
>> into eBPF we introduce another kfunc which encapsulates Register as part of the
>> extern interface.
>> 
>> 3) We have improved our CICD to include tools pointed out to us by Simon. See
>>    "Testing" further below. Thanks to Simon for that and other issues he caught.
>>    Simon, we discussed on issue [7] but decided to keep that log since we think
>>    it is useful.
>> 
>> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
>>    re-discuss though; see: [5], [6].
>> 
>> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
>> 
>> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
>>    guaranteed that either A or B must exist; however, let's make smatch happy.
>>    Thanks to Simon and Dan Carpenter.
>> 
>> Changes In RFC Version 2
>> -------------------------
>> 
>> Version 2 is the initial integration of the eBPF datapath.
>> We took into consideration the suggestions to use eBPF and put effort into
>> analyzing eBPF as a datapath, which involved extensive testing.
>> We implemented 6 approaches with eBPF and ran performance analysis and presented
>> our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
>> vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
>> you account for XDP or TC separately).
>> 
>> Conclusions from the exercise: We lose the simple operational model we had
>> prior to integrating eBPF. We do gain performance in most cases when the
>> datapath is less compute-bound.
>> For more discussion on our requirements vs journeying the eBPF path please
>> scroll down to "Restating Our Requirements" and "Challenges".
>> 
>> This patch set presented two modes.
>> mode1: the parser is entirely based on eBPF - whereas the rest of the
>> SW datapath stays as _scriptable_ as in Version 1.
>> mode2: All of the kernel s/w datapath (including parser) is in eBPF.
>> 
>> The key ingredient for eBPF, that we did not have access to in the past, is
>> kfunc (it made a big difference for us to reconsider eBPF).
>> 
>> In V2 the two modes are mutually exclusive (IOW, you get to choose one
>> or the other via Kconfig).
>
>I think/fear that this series has a "quorum" problem: different voices
>raise opposition, and nobody (?) outside the authors has supported the
>code and the feature.
>
>Could the missing H/W offload support in the current form be the
>root cause for such lack of support? Or are there parties interested that
>have been quiet so far?
>
>Thanks,
>
>Paolo
>

Hi Paolo, thanks. I am one of those "parties interested that have been quiet so far."

I wanted to voice my staunch support for accepting P4TC into the kernel. None of the present objections in the various threads reduce my enthusiasm. I find the following aspects most compelling:

- Performant, highly functional, pure-SW P4 dataplane

- Near-ubiquitous availability on all platforms, once it's upstreamed. Saves having to install a bunch of other p4 ecosystem tools, lowers the barrier to entry, and increases the likelihood an application can run on any platform.

- larger dev community. Anything added to the Linux kernel benefits from a large, thriving community, vast and rigorous regression testing, long-term support, etc.

- well-conceived CRUDX northbound API and clever use of existing well-understood netlink, easy to overlay other northbound APIs such as TDI (Table driven interface) used in IPDK; P4Runtime gRPC API; etc.

- integration with popular and well-understood tc provides a good impedance match for users.

- extensibility, ability to add externs, and interface to eBPF. The ability to add externs is especially compelling. It is not easy to do so in current backends such as bmv2, P4DPDK or p4-ebpf. 

- roadmap to hardware offload for even greater performance. Even _without_ offload, the above benefits justify it in my mind. There are many applications for a pure-SW P4 dataplane, both in userland like P4DPDK, and the proposed P4TC - running as part of the kernel is _exciting_. Vendors have already voiced their support for offload and this initial set of patches paves the way and lets the community benefit from it and start to make it better, now.

It is possible the detractors of P4TC are not active P4 users, so I hope to provide a bit of perspective. Besides the pioneering switch ASIC (Tofino) use-cases which provided the initial impetus, P4 is used extensively in at least two commercial IPUs/DPUs. In addition, there are multiple toolchains to run P4 code on FPGAs. The dream is to write P4 code which can be run in a scalable fashion on a range of targets. It shouldn’t be necessary to “prove” P4 is worthy, those who’ve already embraced it know this.

There are several use-cases for a SW implementation of a P4 dataplane, including behavioral modeling and production uses. P4 allows one to write core functionality which can run on multiple platforms: pure SW, FPGAs, offload NICs/DPUs/IPUs, switch ASICs.

Behavioral modeling of a pipeline using P4:

- The SONiC-DASH project (https://github.com/sonic-net/DASH) is a thriving, multi-vendor collaboration which specifies advanced, high-performance features to accelerate datacenter services. These overlay services are specified using a P4 program which allows all concerned to agree on the packet pipeline and even the control-plane APIs (using SAI, the Switch Abstraction Interface). The actual implementation on a vendor's offload device (DPU/IPU) may or may not use any of the reference P4 code, but that is not important. What is important is that we specify the dataplane in P4, and execute it on the bmv2 backend in a container. We run conformance and regression suites with standard test vectors, which can also be run against actual production implementations to verify compliance. The bmv2 backend has many limitations, including performance and the difficulty of extending its functionality. As a major contributor to this project, I am helping to explore alternatives.

- Large-scale cloud-service providers use P4 extensively as a dataplane (fabric switch) modeling language. One of the driving use-cases in the P4-API working group (I’m a co-chair) is to control SDN switches using P4-Runtime. The switches’ pipelines are modeled in P4 by some users, similar to the DASH use-case. Having a performant, pure-SW implementation is invaluable for modeling and simulation.

Running P4 code in pure SW for production use-cases (not just modeling):

There are many use-cases for running a custom dataplane written in P4. The productivity of P4 code cannot be overstated. With the right framework, P4 apps can be developed (and controlled/managed) in literally hours. It is much more productive than writing, say, C or eBPF. I can do all three, and P4 is way more productive for certain applications.

In conclusion, I hope we can upstream P4-TC soon. Please move this forward with all due speed. Thanks!

Chris Sommers
Keysight Technologies

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-01 17:39             ` Jamal Hadi Salim
@ 2024-03-02  1:32               ` Jakub Kicinski
  2024-03-02  2:20                 ` Tom Herbert
  2024-03-02  2:59                 ` Hardware Offload discussion WAS(Re: " Jamal Hadi Salim
  0 siblings, 2 replies; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-02  1:32 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > target? How does going through TC make this seamless?  
> >
> > +1
> 
> I should clarify what i meant by "seamless". It means the same control
> API is used for s/w or h/w. This is a feature of tc, and is not being
> introduced by P4TC. P4 control only deals with Match-action tables -
> just as TC does.

Right, and the compiled P4 pipeline is tacked onto that API.
Loading that presumably implies a pipeline reset. There's 
no precedent for loading things into TC resulting in a device
datapath reset.

> > My intuition is that for offload the device would be programmed at
> > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > Then the _device_ tells the kernel what tables and parser graph it's
> > got.
> 
> BTW: I just want to say that these patches are about s/w - not
> offload. Someone asked about offload so as in normal discussions we
> steered in that direction. The hardware piece will require additional
> patchsets which still require discussions. I hope we dont steer off
> too much, otherwise i can start a new thread just to discuss current
> view of the h/w.
> 
> Its not the device telling the kernel what it has. Its the other way around.

Yes, I'm describing how I'd have designed it :) If it was the same
as what you've already implemented - why would I be typing it into
an email.. ? :)

> From the P4 program you generate the s/w (the ebpf code and other
> auxiliary stuff) and h/w pieces using a compiler.
> You compile ebpf, etc, then load.

That part is fine.

> The current point of discussion is the hw binary is to be "activated"
> through the same tc filter that does the s/w. So one could say:
> 
> tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> \
>    prog type hw filename "simple_l3.o" ... \
>    action bpf obj $PARSER.o section p4tc/parser \
>    action bpf obj $PROGNAME.o section p4tc/main
> 
> And that would through tc driver callbacks signal to the driver to
> find the binary possibly via  /lib/firmware
> Some of the original discussion was to use devlink for loading the
> binary - but that went nowhere.

Back to the device reset, unless the load has no impact on inflight
traffic the loading doesn't belong in TC, IMO. Plus you're going to
run into (what IIRC was Jiri's complaint) that you're loading arbitrary
binary blobs, opaque to the kernel.

> Once you have this in place then netlink with tc skip_sw/hw. This is
> what i meant by "seamless"
> 
> > Plus, if we're talking about offloads, aren't we getting back into
> > the same controversies we had when merging OvS (not that I was around).
> > The "standalone stack to the side" problem. Some of the tables in the
> > pipeline may be for routing, not ACLs. Should they be fed from the
> > routing stack? How is that integration going to work? The parsing
> > graph feels a bit like global device configuration, not a piece of
> > functionality that should sit under sub-sub-system in the corner.  
> 
> The current (maybe i should say initial) thought is the P4 program
> does not touch the existing kernel infra such as fdb etc.

It's an off-to-the-side thing. Ignoring the fact that *all* networking
devices already have parsers which would benefit from being accurately
described.

> Of course we can model the kernel datapath using P4 but you wont be
> using "ip route add..." or "bridge fdb...".
> In the future, P4 extern could be used to model existing infra and we
> should be able to use the same tooling. That is a discussion that
> comes on/off (i think it did in the last meeting).

Maybe, IDK. I thought prevailing wisdom, at least for offloads,
is to offload the existing networking stack, and fill in the gaps.
Not build a completely new implementation from scratch, and "integrate
later". Or at least "fill in the gaps" is how I like to think.

I can't quite fit together in my head how this is okay, but OvS
was not allowed to add their offload API. And what's supposed to
be part of TC and what isn't, where you only expect to have one 
filter here, and create a whole new object universe inside TC.

But that's just my opinions. The way things work we may wake up one 
day and find out that Dave has applied this :)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-02  1:32               ` Jakub Kicinski
@ 2024-03-02  2:20                 ` Tom Herbert
  2024-03-03  3:15                   ` Jakub Kicinski
  2024-03-02  2:59                 ` Hardware Offload discussion WAS(Re: " Jamal Hadi Salim
  1 sibling, 1 reply; 71+ messages in thread
From: Tom Herbert @ 2024-03-02  2:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jamal Hadi Salim, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Fri, Mar 1, 2024 at 5:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > target? How does going through TC make this seamless?
> > >
> > > +1
> >
> > I should clarify what i meant by "seamless". It means the same control
> > API is used for s/w or h/w. This is a feature of tc, and is not being
> > introduced by P4TC. P4 control only deals with Match-action tables -
> > just as TC does.
>
> Right, and the compiled P4 pipeline is tacked onto that API.
> Loading that presumably implies a pipeline reset. There's
> no precedent for loading things into TC resulting in a device
> datapath reset.
>
> > > My intuition is that for offload the device would be programmed at
> > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > Then the _device_ tells the kernel what tables and parser graph it's
> > > got.
> >
> > BTW: I just want to say that these patches are about s/w - not
> > offload. Someone asked about offload so as in normal discussions we
> > steered in that direction. The hardware piece will require additional
> > patchsets which still require discussions. I hope we dont steer off
> > too much, otherwise i can start a new thread just to discuss current
> > view of the h/w.
> >
> > Its not the device telling the kernel what it has. Its the other way around.
>
> Yes, I'm describing how I'd have designed it :) If it was the same
> as what you've already implemented - why would I be typing it into
> an email.. ? :)
>
> > From the P4 program you generate the s/w (the ebpf code and other
> > auxiliary stuff) and h/w pieces using a compiler.
> > You compile ebpf, etc, then load.
>
> That part is fine.
>
> > The current point of discussion is the hw binary is to be "activated"
> > through the same tc filter that does the s/w. So one could say:
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > \
> >    prog type hw filename "simple_l3.o" ... \
> >    action bpf obj $PARSER.o section p4tc/parser \
> >    action bpf obj $PROGNAME.o section p4tc/main
> >
> > And that would through tc driver callbacks signal to the driver to
> > find the binary possibly via  /lib/firmware
> > Some of the original discussion was to use devlink for loading the
> > binary - but that went nowhere.
>
> Back to the device reset, unless the load has no impact on inflight
> traffic the loading doesn't belong in TC, IMO. Plus you're going to
> run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> binary blobs, opaque to the kernel.
>
> > Once you have this in place then netlink with tc skip_sw/hw. This is
> > what i meant by "seamless"
> >
> > > Plus, if we're talking about offloads, aren't we getting back into
> > > the same controversies we had when merging OvS (not that I was around).
> > > The "standalone stack to the side" problem. Some of the tables in the
> > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > routing stack? How is that integration going to work? The parsing
> > > graph feels a bit like global device configuration, not a piece of
> > > functionality that should sit under sub-sub-system in the corner.
> >
> > The current (maybe i should say initial) thought is the P4 program
> > does not touch the existing kernel infra such as fdb etc.
>
> It's off to the side thing. Ignoring the fact that *all*, networking
> devices already have parsers which would benefit from being accurately
> described.

Jakub,

This is configurability versus programmability. The table driven
approach as input (configurability) might work fine for generic
match-action tables up to the point that tables are expressive enough
to satisfy the requirements. But parsing doesn't fall into the table
driven paradigm: parsers want to be *programmed*. This is why we
removed kParser from this patch set and fell back to eBPF for parsing.
But the problem we quickly hit is that eBPF is not offloadable to
network devices; for example, when we compile P4 into an eBPF parser
we've lost the declarative representation that parsers in the devices
could consume (they're not CPUs running eBPF).

I think the key here is what we mean by kernel offload. When we do
kernel offload, is it the kernel implementation or the kernel
functionality that's being offloaded? If it's the latter then we have
a lot more flexibility. What we'd need is a safe and secure way to
synchronize with that offload device that precisely supports the
kernel functionality we'd like to offload. This can be done if both
the kernel bits and programmed offload are derived from the same
source (i.e. tag source code with a sha-1). For example, if someone
writes a parser in P4, we can compile that into both eBPF and a P4
backend using independent tool chains and program downloads. At
runtime, the kernel can safely offload the functionality of the eBPF
parser to the device if its hash matches the one reported by the
device.
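
As a toy illustration of that check (plain userspace C; every name
and the fixed-size digest are assumptions, with sha-1 standing in for
whatever digest the tool chains embed), both artifacts carry a digest
of the common P4 source and offload is permitted only on a match:

```c
#include <string.h>

/* Illustrative model: the eBPF object and the device firmware are both
 * compiled from the same P4 source, and each carries a digest of that
 * source (e.g. a sha-1). The kernel would allow offload only when the
 * digest of the loaded eBPF artifact matches the one the device
 * reports. All names here are made up for the sketch. */
#define SRC_DIGEST_LEN 20 /* sha-1 */

struct artifact {
	unsigned char src_digest[SRC_DIGEST_LEN]; /* set at compile time */
};

/* Same source digest => same functionality; offload considered safe. */
static int offload_allowed(const struct artifact *ebpf_obj,
			   const struct artifact *device_fw)
{
	return memcmp(ebpf_obj->src_digest, device_fw->src_digest,
		      SRC_DIGEST_LEN) == 0;
}
```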

Tom

>
> > Of course we can model the kernel datapath using P4 but you wont be
> > using "ip route add..." or "bridge fdb...".
> > In the future, P4 extern could be used to model existing infra and we
> > should be able to use the same tooling. That is a discussion that
> > comes on/off (i think it did in the last meeting).
>
> Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> is to offload the existing networking stack, and fill in the gaps.
> Not build a completely new implementation from scratch, and "integrate
> later". Or at least "fill in the gaps" is how I like to think.
>
> I can't quite fit together in my head how this is okay, but OvS
> was not allowed to add their offload API. And what's supposed to
> be part of TC and what isn't, where you only expect to have one
> filter here, and create a whole new object universe inside TC.
>
> But that's just my opinions. The way things work we may wake up one
> day and find out that Dave has applied this :)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-02  1:32               ` Jakub Kicinski
  2024-03-02  2:20                 ` Tom Herbert
@ 2024-03-02  2:59                 ` Jamal Hadi Salim
  2024-03-02 14:36                   ` Jamal Hadi Salim
  1 sibling, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-02  2:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > target? How does going through TC make this seamless?
> > >
> > > +1
> >
> > I should clarify what i meant by "seamless". It means the same control
> > API is used for s/w or h/w. This is a feature of tc, and is not being
> > introduced by P4TC. P4 control only deals with Match-action tables -
> > just as TC does.
>
> Right, and the compiled P4 pipeline is tacked onto that API.
> Loading that presumably implies a pipeline reset. There's
> no precedent for loading things into TC resulting a device
> datapath reset.

I've changed the subject to reflect that this discussion is about h/w
offload, so we don't drift too much from the intent of the patches.

AFAIK, all these devices have some HA built in to do program
replacement, i.e. no device reset.
I believe the Tofino switch in the earlier generations may have needed
resets, which caused a few packet drops during a live-environment
update. Granted, there may be devices (none that I am aware of) that
may not be able to do HA. All this needs to be considered for
offloads.

> > > My intuition is that for offload the device would be programmed at
> > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > Then the _device_ tells the kernel what tables and parser graph it's
> > > got.
> >
> > BTW: I just want to say that these patches are about s/w - not
> > offload. Someone asked about offload so as in normal discussions we
> > steered in that direction. The hardware piece will require additional
> > patchsets which still require discussions. I hope we dont steer off
> > too much, otherwise i can start a new thread just to discuss current
> > view of the h/w.
> >
> > It's not the device telling the kernel what it has. It's the other way around.
>
> Yes, I'm describing how I'd have designed it :) If it was the same
> as what you've already implemented - why would I be typing it into
> an email.. ? :)
>

I think i misunderstood you and thought I needed to provide context.
The P4 pipelines are meant to be re-programmed multiple times in a
live environment. IOW, I should be able to delete/create a pipeline
while another is running. Some hardware may require that the parser is
shared etc, but you can certainly replace the match-action tables or
add entirely new logic. In any case this is all still under discussion
and can be further refined.

> > From the P4 program you generate the s/w (the ebpf code and other
> > auxiliary stuff) and h/w pieces using a compiler.
> > You compile ebpf, etc, then load.
>
> That part is fine.
>
> > The current point of discussion is the hw binary is to be "activated"
> > through the same tc filter that does the s/w. So one could say:
> >
> > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > \
> >    prog type hw filename "simple_l3.o" ... \
> >    action bpf obj $PARSER.o section p4tc/parser \
> >    action bpf obj $PROGNAME.o section p4tc/main
> >
> > And that would through tc driver callbacks signal to the driver to
> > find the binary possibly via  /lib/firmware
> > Some of the original discussion was to use devlink for loading the
> > binary - but that went nowhere.
>
> Back to the device reset, unless the load has no impact on inflight
> traffic the loading doesn't belong in TC, IMO. Plus you're going to
> run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> binary blobs, opaque to the kernel.
>

And you said at that time that binary blobs are already a way of life.
Let's take DDP as a use case: they load the firmware (via ethtool),
and we were recently discussing whether they should use flower or u32
etc. I would say this is in the same spirit. Doing it via ethtool may
be a bit disconnected, but that is up for discussion as well.
There has been concern in some of the discussions that we need to have
some authentication. Is that what you mean?

> > Once you have this in place then netlink with tc skip_sw/hw. This is
> > what i meant by "seamless"
> >
> > > Plus, if we're talking about offloads, aren't we getting back into
> > > the same controversies we had when merging OvS (not that I was around).
> > > The "standalone stack to the side" problem. Some of the tables in the
> > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > routing stack? How is that integration going to work? The parsing
> > > graph feels a bit like global device configuration, not a piece of
> > > functionality that should sit under sub-sub-system in the corner.
> >
> > The current (maybe i should say initial) thought is the P4 program
> > does not touch the existing kernel infra such as fdb etc.
>
> It's off to the side thing. Ignoring the fact that *all*, networking
> devices already have parsers which would benefit from being accurately
> described.
>

I am not following this point.

> > Of course we can model the kernel datapath using P4 but you wont be
> > using "ip route add..." or "bridge fdb...".
> > In the future, P4 extern could be used to model existing infra and we
> > should be able to use the same tooling. That is a discussion that
> > comes on/off (i think it did in the last meeting).
>
> Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> is to offload the existing networking stack, and fill in the gaps.
> Not build a completely new implementation from scratch, and "integrate
> later". Or at least "fill in the gaps" is how I like to think.
>
> I can't quite fit together in my head how this is okay, but OvS
> was not allowed to add their offload API. And what's supposed to
> be part of TC and what isn't, where you only expect to have one
> filter here, and create a whole new object universe inside TC.
>

I was there.
Ovs matched what tc already had functionally, 10 years after tc
existed, and they were busy rewriting what tc offered. So naturally we
pushed for them to use what TC had. You still need to write whatever
extensions are needed into the kernel etc in order to support what the
hardware can offer.

I hope i am not stating the obvious: P4 provides a more malleable
approach. Assume a blank template in h/w and s/w where you specify
what you need, and then both the s/w and the hardware support it.
Flower is analogous to a "fixed pipeline", meaning you can only extend
flower by changing the kernel and datapath. Often it does not cover
all potential hw match-action engines, and we often see patches to do
one more thing, requiring more kernel changes. If you replace flower
with P4 you remove the need to update the kernel, user space etc for
the same features that flower needs to be extended for today. You just
tell the compiler what you need (within hardware capacity, of course).
So i dont see P4 as "offload the existing kernel infra aka flower" but
rather remove the limitations that flower constrains us with today. As
far as other kernel infra (fdb etc), that can be added as i stated -
it is just not a starting point.

cheers,
jamal


> But that's just my opinions. The way things work we may wake up one
> day and find out that Dave has applied this :)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-02  2:59                 ` Hardware Offload discussion WAS(Re: " Jamal Hadi Salim
@ 2024-03-02 14:36                   ` Jamal Hadi Salim
  2024-03-03  3:27                     ` Jakub Kicinski
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-02 14:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Fri, Mar 1, 2024 at 9:59 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Fri, Mar 1, 2024 at 8:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 1 Mar 2024 12:39:56 -0500 Jamal Hadi Salim wrote:
> > > On Fri, Mar 1, 2024 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > Pardon my ignorance, but doesn't P4 want to be compiled to a backend
> > > > > target? How does going through TC make this seamless?
> > > >
> > > > +1
> > >
> > > I should clarify what i meant by "seamless". It means the same control
> > > API is used for s/w or h/w. This is a feature of tc, and is not being
> > > introduced by P4TC. P4 control only deals with Match-action tables -
> > > just as TC does.
> >
> > Right, and the compiled P4 pipeline is tacked onto that API.
> > Loading that presumably implies a pipeline reset. There's
> > no precedent for loading things into TC resulting a device
> > datapath reset.
>
> I've changed the subject to reflect that this discussion is about h/w
> offload, so we don't drift too much from the intent of the patches.
>
> AFAIK, all these devices have some HA built in to do program
> replacement, i.e. no device reset.
> I believe the Tofino switch in the earlier generations may have needed
> resets, which caused a few packet drops during a live-environment
> update. Granted, there may be devices (none that I am aware of) that
> may not be able to do HA. All this needs to be considered for
> offloads.
>
> > > > My intuition is that for offload the device would be programmed at
> > > > start-of-day / probe. By loading the compiled P4 from /lib/firmware.
> > > > Then the _device_ tells the kernel what tables and parser graph it's
> > > > got.
> > >
> > > BTW: I just want to say that these patches are about s/w - not
> > > offload. Someone asked about offload so as in normal discussions we
> > > steered in that direction. The hardware piece will require additional
> > > patchsets which still require discussions. I hope we dont steer off
> > > too much, otherwise i can start a new thread just to discuss current
> > > view of the h/w.
> > >
> > > It's not the device telling the kernel what it has. It's the other way around.
> >
> > Yes, I'm describing how I'd have designed it :) If it was the same
> > as what you've already implemented - why would I be typing it into
> > an email.. ? :)
> >
>
> I think i misunderstood you and thought I needed to provide context.
> The P4 pipelines are meant to be re-programmed multiple times in a
> live environment. IOW, I should be able to delete/create a pipeline
> while another is running. Some hardware may require that the parser is
> shared etc, but you can certainly replace the match-action tables or
> add entirely new logic. In any case this is all still under discussion
> and can be further refined.
>
> > > From the P4 program you generate the s/w (the ebpf code and other
> > > auxiliary stuff) and h/w pieces using a compiler.
> > > You compile ebpf, etc, then load.
> >
> > That part is fine.
> >
> > > The current point of discussion is the hw binary is to be "activated"
> > > through the same tc filter that does the s/w. So one could say:
> > >
> > > tc filter add block 22 ingress protocol all prio 1 p4 pname simple_l3
> > > \
> > >    prog type hw filename "simple_l3.o" ... \
> > >    action bpf obj $PARSER.o section p4tc/parser \
> > >    action bpf obj $PROGNAME.o section p4tc/main
> > >
> > > And that would through tc driver callbacks signal to the driver to
> > > find the binary possibly via  /lib/firmware
> > > Some of the original discussion was to use devlink for loading the
> > > binary - but that went nowhere.
> >
> > Back to the device reset, unless the load has no impact on inflight
> > traffic the loading doesn't belong in TC, IMO. Plus you're going to
> > run into (what IIRC was Jiri's complaint) that you're loading arbitrary
> > binary blobs, opaque to the kernel.
> >
>
> And you said at that time that binary blobs are already a way of life.
> Let's take DDP as a use case: they load the firmware (via ethtool),
> and we were recently discussing whether they should use flower or u32
> etc. I would say this is in the same spirit. Doing it via ethtool may
> be a bit disconnected, but that is up for discussion as well.
> There has been concern in some of the discussions that we need to have
> some authentication. Is that what you mean?
>
> > > Once you have this in place then netlink with tc skip_sw/hw. This is
> > > what i meant by "seamless"
> > >
> > > > Plus, if we're talking about offloads, aren't we getting back into
> > > > the same controversies we had when merging OvS (not that I was around).
> > > > The "standalone stack to the side" problem. Some of the tables in the
> > > > pipeline may be for routing, not ACLs. Should they be fed from the
> > > > routing stack? How is that integration going to work? The parsing
> > > > graph feels a bit like global device configuration, not a piece of
> > > > functionality that should sit under sub-sub-system in the corner.
> > >
> > > The current (maybe i should say initial) thought is the P4 program
> > > does not touch the existing kernel infra such as fdb etc.
> >
> > It's off to the side thing. Ignoring the fact that *all*, networking
> > devices already have parsers which would benefit from being accurately
> > described.
> >
>
> I am not following this point.
>
> > > Of course we can model the kernel datapath using P4 but you wont be
> > > using "ip route add..." or "bridge fdb...".
> > > In the future, P4 extern could be used to model existing infra and we
> > > should be able to use the same tooling. That is a discussion that
> > > comes on/off (i think it did in the last meeting).
> >
> > Maybe, IDK. I thought prevailing wisdom, at least for offloads,
> > is to offload the existing networking stack, and fill in the gaps.
> > Not build a completely new implementation from scratch, and "integrate
> > later". Or at least "fill in the gaps" is how I like to think.
> >
> > I can't quite fit together in my head how this is okay, but OvS
> > was not allowed to add their offload API. And what's supposed to
> > be part of TC and what isn't, where you only expect to have one
> > filter here, and create a whole new object universe inside TC.
> >
>
> I was there.
> Ovs matched what tc already had functionally, 10 years after tc
> existed, and they were busy rewriting what tc offered. So naturally we
> pushed for them to use what TC had. You still need to write whatever
> extensions are needed into the kernel etc in order to support what the
> hardware can offer.
>
> I hope i am not stating the obvious: P4 provides a more malleable
> approach. Assume a blank template in h/w and s/w where you specify
> what you need, and then both the s/w and the hardware support it.
> Flower is analogous to a "fixed pipeline", meaning you can only extend
> flower by changing the kernel and datapath. Often it does not cover
> all potential hw match-action engines, and we often see patches to do
> one more thing, requiring more kernel changes. If you replace flower
> with P4 you remove the need to update the kernel, user space etc for
> the same features that flower needs to be extended for today. You just
> tell the compiler what you need (within hardware capacity, of course).
> So i dont see P4 as "offload the existing kernel infra aka flower" but
> rather remove the limitations that flower constrains us with today. As
> far as other kernel infra (fdb etc), that can be added as i stated -
> it is just not a starting point.
>

Sorry, after getting some coffee i believe I mumbled too much in my
previous email. Let me summarize your points and reduce the mumbling:
1) Your point that triggering the pipeline re/programming via the
filter would require a reset of the device in a live environment:
AFAIK, the "P4 native" devices that I know of do allow multiple
programs and have operational schemes to allow updates without resets.
I will gather more info and post it after one of our meetings.
Having said that, we really have not paid much attention to this
detail, so it is a valid concern that needs to be ironed out.
It is even more imperative if we want to support a device that is not
"P4 native", or one that requires a reset whether it is P4 native or
not; then what you referred to as "programmed at start-of-day / probe"
becomes a valid concern.

2) Your point on "integrate later", or at least "fill in the gaps":
This part i am probably going to mumble on. I am going to consider
more than just doing ACLs/MAT via flower/u32 for the sake of
discussion.
True, "fill the gaps" has been our model so far. It requires kernel
changes, user space code changes etc, justifiably so, because most of
the time such datapaths are subject to standardization via IETF, IEEE,
etc, and new extensions come in on a regular basis. And sometimes we
do add features that only one or two users or a single vendor needs,
at the cost of kernel and user/control extension. Given our work
process, any features added this way take a long time to make it to
the end user. At the risk of sounding controversial, i am going to
call things like fdb, fib, etc, which have fixed datapaths in the
kernel, "legacy". These "legacy" datapaths almost always have very
strong user bases with strong infra tooling which took years to get
in shape. So they must be supported. I see two approaches:
-  you can leave those "legacy" ndo ops alone and not go via the tc
ndo ops used by P4TC.
-  or write a P4 program that looks _exactly_ like what current
bridging looks like and add helpers to allow existing tools to
continue to work via tc ndo and then phase out the "fixed datapath"
ndos. This will take a long long time but it could be a goal.

There is another caveat: often different vendor hardware has slightly
different features which can't be exposed, because either they are
very specific to the vendor or it's just very hard to express them
with existing "legacy" infra without making intrusive changes. With P4
we are going to be able to allow these vendors/users to expose as much
or as little as is needed for a specific deployment, without
subjecting anyone else to new kernel/user code.

On the "integrate later" aspect: that is probably because most of the
time we want to avoid doing intrusive niche changes (which is
resolvable with the above).

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-01 12:31     ` Jamal Hadi Salim
@ 2024-03-03  1:32       ` Martin KaFai Lau
  2024-03-03 17:20         ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-03  1:32 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On 3/1/24 4:31 AM, Jamal Hadi Salim wrote:
> On Fri, Mar 1, 2024 at 1:53 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 2/25/24 8:54 AM, Jamal Hadi Salim wrote:
>>> +struct p4tc_table_entry_act_bpf_params {
>>
>> Will this struct be extended in the future?
>>
>>> +     u32 pipeid;
>>> +     u32 tblid;
>>> +};
>>> +
> 
> Not that i can think of. We probably want to have the option to do so
> if needed. Do you see any harm if we were to make changes for whatever
> reason in the future?

It will be useful to add an argument named with a "__sz" suffix to the
kfunc. Take a look at how the kfunc in nf_conntrack_bpf.c handles the
"opts" and "opts__sz" arguments.
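
For illustration, that convention can be modeled in plain userspace C
(struct and function names below are made up, not the P4TC code): the
callee validates the size the caller compiled against, so fields can
be appended to the params struct later without breaking old programs.

```c
#include <stdint.h>
#include <string.h>

/* Userspace model of the "__sz" convention used by kfuncs such as the
 * nf_conntrack ones (opts/opts__sz): validate the caller-supplied size
 * so the struct can grow in later kernels. Illustrative names only. */
struct tbl_params_v1 { uint32_t pipeid, tblid; };
struct tbl_params_v2 { uint32_t pipeid, tblid, flags; }; /* grown later */

static int tbl_read(const void *params, uint32_t params__sz)
{
	struct tbl_params_v2 p = {0};

	/* Accept any layout between the original and the current one. */
	if (!params || params__sz < sizeof(struct tbl_params_v1) ||
	    params__sz > sizeof(struct tbl_params_v2))
		return -1;
	memcpy(&p, params, params__sz); /* fields the caller lacks stay 0 */
	return (int)p.flags;            /* v1 callers get the default */
}
```

An old caller passing sizeof(struct tbl_params_v1) keeps working with
defaults for the new fields, while a mismatched size is rejected up
front.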

> 
>>> +struct p4tc_table_entry_create_bpf_params {
>>> +     u32 profile_id;
>>> +     u32 pipeid;
>>> +     u32 tblid;
>>> +};
>>> +
>>
>> [ ... ]
>>
>>> diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
>>> index c5256d821..155068de0 100644
>>> --- a/include/net/tc_act/p4tc.h
>>> +++ b/include/net/tc_act/p4tc.h
>>> @@ -13,10 +13,26 @@ struct tcf_p4act_params {
>>>        u32 tot_params_sz;
>>>    };
>>>
>>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
>>> +
>>> +struct p4tc_table_entry_act_bpf {
>>> +     u32 act_id;
>>> +     u32 hit:1,
>>> +         is_default_miss_act:1,
>>> +         is_default_hit_act:1;
>>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
>>> +} __packed;
>>> +
>>> +struct p4tc_table_entry_act_bpf_kern {
>>> +     struct rcu_head rcu;
>>> +     struct p4tc_table_entry_act_bpf act_bpf;
>>> +};
>>> +
>>>    struct tcf_p4act {
>>>        struct tc_action common;
>>>        /* Params IDR reference passed during runtime */
>>>        struct tcf_p4act_params __rcu *params;
>>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
>>>        u32 p_id;
>>>        u32 act_id;
>>>        struct list_head node;
>>> @@ -24,4 +40,39 @@ struct tcf_p4act {
>>>
>>>    #define to_p4act(a) ((struct tcf_p4act *)a)
>>>
>>> +static inline struct p4tc_table_entry_act_bpf *
>>> +p4tc_table_entry_act_bpf(struct tc_action *action)
>>> +{
>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>> +
>>> +     act_bpf = rcu_dereference(p4act->act_bpf);
>>> +
>>> +     return &act_bpf->act_bpf;
>>> +}
>>> +
>>> +static inline int
>>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
>>> +                                   u32 dflt_miss, u32 dflt_hit)
>>> +{
>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>> +
>>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
>>
>>
>> [ ... ]
>>
>>> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
>>> +bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,
>>
>> The argument could be "struct sk_buff *skb" instead of __sk_buff. Take a look at
>> commit 2f4643934670.
> 
> We'll make that change.
> 
>>
>>> +               struct p4tc_table_entry_act_bpf_params *params,
>>> +               void *key, const u32 key__sz)
>>> +{
>>> +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
>>> +     struct net *caller_net;
>>> +
>>> +     caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
>>> +
>>> +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
>>> +}
>>> +
>>> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
>>> +xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
>>> +               struct p4tc_table_entry_act_bpf_params *params,
>>> +               void *key, const u32 key__sz)
>>> +{
>>> +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
>>> +     struct net *caller_net;
>>> +
>>> +     caller_net = dev_net(ctx->rxq->dev);
>>> +
>>> +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
>>> +}
>>> +
>>> +static int
>>> +__bpf_p4tc_entry_create(struct net *net,
>>> +                     struct p4tc_table_entry_create_bpf_params *params,
>>> +                     void *key, const u32 key__sz,
>>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
>>> +{
>>> +     struct p4tc_table_entry_key *entry_key = key;
>>> +     struct p4tc_pipeline *pipeline;
>>> +     struct p4tc_table *table;
>>> +
>>> +     if (!params || !key)
>>> +             return -EINVAL;
>>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
>>> +             return -EINVAL;
>>> +
>>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
>>> +     if (!pipeline)
>>> +             return -ENOENT;
>>> +
>>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
>>> +     if (!table)
>>> +             return -ENOENT;
>>> +
>>> +     if (entry_key->keysz != table->tbl_keysz)
>>> +             return -EINVAL;
>>> +
>>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
>>> +                                        params->profile_id);
>>
>> My understanding is this kfunc will allocate a "struct
>> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
>> never called and the bpf prog is unloaded, how the act_bpf object will be
>> cleaned up?
>>
> 
> The TC code takes care of this. Unloading the bpf prog does not affect
> the deletion, it is the TC control side that will take care of it. If
> we delete the pipeline otoh then not just this entry but all entries
> will be flushed.

It looks like the "struct p4tc_table_entry_act_bpf_kern" object is
allocated by the bpf prog through a kfunc and will only be useful to
the bpf prog, not to other parts of the kernel. However, if the bpf
prog is unloaded, these bpf-specific objects will be left over in the
kernel until the tc pipeline (where the act_bpf_kern object resides)
is gone.

The expectation for bpf progs (not only tc/xdp bpf progs) regarding
resource cleanup is that these bpf objects will be gone after the bpf
prog is unloaded and its bpf maps are unpinned.
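
That contract can be sketched with a toy ownership model (plain
userspace C, illustrative names; this shows the expectation, not the
P4TC implementation): anything a program allocates through a kfunc is
tracked against that program and released when the program is
unloaded, even if no delete kfunc was ever called.

```c
#include <stdlib.h>

/* Toy model of the lifetime contract: kfunc allocations are owned by
 * the bpf prog that made them, so unloading the prog frees them rather
 * than leaving them around until the tc pipeline is torn down. */
struct entry {
	struct entry *next;
};

struct prog {
	struct entry *allocs; /* everything this prog created via kfuncs */
};

static struct entry *prog_create_entry(struct prog *p)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->next = p->allocs; /* register against the owning prog */
	p->allocs = e;
	return e;
}

/* Called at prog unload: releases every leftover kfunc allocation. */
static int prog_unload(struct prog *p)
{
	int freed = 0;

	while (p->allocs) {
		struct entry *e = p->allocs;

		p->allocs = e->next;
		free(e);
		freed++;
	}
	return freed;
}
```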

[ ... ]

>>> +BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
>>
>> This soon will be broken with the latest change in bpf-next. It is replaced by
>> BTF_KFUNCS_START. commit a05e90427ef6.

It has already been included in the latest bpf-next pull request, so it
should reach net-next soon.

>>
> 
> Ok, this wasn't in net-next when we pushed. We base our changes on
> net-next. When do you plan to merge that into net-next?
> 
>> What is the plan on the selftest ?
>>
> 
> We may need some guidance. How do you see us writing a selftest for this?
> We have extensive testing on the control side which is netlink (not
> part of the current series).

There are examples in tools/testing/selftests/bpf; e.g. test_bpf_nf.c
tests the kfuncs in nf_conntrack_bpf mentioned above. There are also
selftests that do netlink to set up the test. bpf/test_progs tries to
avoid external dependencies as much as possible, so linking to an
extra external library or using an extra tool/binary will be
unacceptable; only the bpf/test_progs binary will be run by bpf CI.

The selftest does not have to be complicated. It can exercise the
kfuncs and show how the new structs (e.g. struct
p4tc_table_entry_bpf_*) will be used. There is BPF_PROG_RUN for tc and
xdp progs, so it should be quite doable.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-02  2:20                 ` Tom Herbert
@ 2024-03-03  3:15                   ` Jakub Kicinski
  2024-03-03 16:31                     ` Tom Herbert
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-03  3:15 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jamal Hadi Salim, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> This is configurability versus programmability. The table driven
> approach as input (configurability) might work fine for generic
> match-action tables up to the point that tables are expressive enough
> to satisfy the requirements. But parsing doesn't fall into the table
> driven paradigm: parsers want to be *programmed*. This is why we
> removed kParser from this patch set and fell back to eBPF for parsing.
> But the problem we quickly hit is that eBPF is not offloadable to
> network devices; for example, when we compile P4 into an eBPF parser
> we've lost the declarative representation that parsers in the devices
> could consume (they're not CPUs running eBPF).
> 
> I think the key here is what we mean by kernel offload. When we do
> kernel offload, is it the kernel implementation or the kernel
> functionality that's being offloaded? If it's the latter then we have
> a lot more flexibility. What we'd need is a safe and secure way to
> synchronize with that offload device that precisely supports the
> kernel functionality we'd like to offload. This can be done if both
> the kernel bits and programmed offload are derived from the same
> source (i.e. tag source code with a sha-1). For example, if someone
> writes a parser in P4, we can compile that into both eBPF and a P4
> backend using independent tool chains and program download. At
> runtime, the kernel can safely offload the functionality of the eBPF
> parser to the device if it matches the hash to that reported by the
> device

Good points. If I understand you correctly you're saying that parsers
are more complex than just a basic parsing tree a'la u32.
Then we can take this argument further. P4 has grown to encompass a lot
of functionality of quite complex devices. How do we square that with
the kernel functionality offload model? If the entire device is modeled,
including f.e. TSO, an offload would mean that the user has to write
a TSO implementation which they then load into TC? That seems odd.

IOW I don't quite know how to square in my head the "total
functionality" with being a TC-based "plugin".

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-02 14:36                   ` Jamal Hadi Salim
@ 2024-03-03  3:27                     ` Jakub Kicinski
  2024-03-03 17:00                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-03  3:27 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> 2) Your point on:  "integrate later", or at least "fill in the gaps"
> This part i am probably going to mumble on. I am going to consider
> more than just doing ACLs/MAT via flower/u32 for the sake of
> discussion.
> True, "fill the gaps" has been our model so far. It requires kernel
> changes, user space code changes etc justifiably so because most of
> the time such datapaths are subject to standardization via IETF, IEEE,
> etc and new extensions come in on a regular basis.  And sometimes we
> do add features that one or two users or a single vendor has need for
> at the cost of kernel and user/control extension. Given our work
> process, any features added this way take a long time to make it to
> the end user.

What I had in mind was more of a DDP model. The device loads its binary
blob FW in whatever way it does, then it tells the kernel its parser
graph, and tables. The kernel exposes those tables to user space.
All dynamic, no need to change the kernel for each new protocol.

But that's different in two ways:
 1. the device tells kernel the tables, no "dynamic reprogramming"
 2. you don't need the SW side, the only use of the API is to interact
    with the device

User can still do BPF kfuncs to look up in the tables (like in FIB), 
but call them from cls_bpf.

I think in P4 terms that may be something more akin to only providing
the runtime API? I seem to recall they had some distinction...

> At the cost of this sounding controversial, i am going
> to call things like fdb, fib, etc which have fixed datapaths in the
> kernel "legacy". These "legacy" datapaths almost all the time have

The cynic in me sometimes thinks that the biggest problem with "legacy"
protocols is that it's hard to make money on them :)

> very strong user bases with strong infra tooling which took years to
> get in shape. So they must be supported. I see two approaches:
> -  you can leave those "legacy" ndo ops alone and not go via the tc
> ndo ops used by P4TC.
> -  or write a P4 program that looks _exactly_ like what current
> bridging looks like and add helpers to allow existing tools to
> continue to work via tc ndo and then phase out the "fixed datapath"
> ndos. This will take a long long time but it could be a goal.
> 
> There is another caveat: Often different vendor hardware has slightly
> different features which can't be exposed because either they are very
> specific to the vendor or it's just very hard to express with existing
> "legacy" without making intrusive changes. So we are going to be able
> to allow these vendors/users to expose as much or as little as is
> needed for a specific deployment without affecting anyone else with
> new kernel/user code.
> 
> On the "integrate later" aspect: That is probably because most of the
> times we want to avoid doing intrusive niche changes (which is
> resolvable with the above).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03  3:15                   ` Jakub Kicinski
@ 2024-03-03 16:31                     ` Tom Herbert
  2024-03-04 20:07                       ` Jakub Kicinski
  2024-03-04 21:19                       ` Stanislav Fomichev
  0 siblings, 2 replies; 71+ messages in thread
From: Tom Herbert @ 2024-03-03 16:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jamal Hadi Salim, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > This is configurability versus programmability. The table driven
> > approach as input (configurability) might work fine for generic
> > match-action tables up to the point that tables are expressive enough
> > to satisfy the requirements. But parsing doesn't fall into the table
> > driven paradigm: parsers want to be *programmed*. This is why we
> > removed kParser from this patch set and fell back to eBPF for parsing.
> > But the problem we quickly hit is that eBPF is not offloadable to network
> > devices, for example when we compile P4 in an eBPF parser we've lost
> > the declarative representation that parsers in the devices could
> > consume (they're not CPUs running eBPF).
> >
> > I think the key here is what we mean by kernel offload. When we do
> > kernel offload, is it the kernel implementation or the kernel
> > functionality that's being offloaded? If it's the latter then we have
> > a lot more flexibility. What we'd need is a safe and secure way to
> > synchronize with that offload device that precisely supports the
> > kernel functionality we'd like to offload. This can be done if both
> > the kernel bits and programmed offload are derived from the same
> > source (i.e. tag source code with a sha-1). For example, if someone
> > writes a parser in P4, we can compile that into both eBPF and a P4
> > backend using independent tool chains and program download. At
> > runtime, the kernel can safely offload the functionality of the eBPF
> > parser to the device if it matches the hash to that reported by the
> > device
>
> Good points. If I understand you correctly you're saying that parsers
> are more complex than just a basic parsing tree a'la u32.

Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
isn't conducive to u32. We also want the advantages of compiler
optimizations to unroll loops, squash nodes in the parse graph, etc.

> Then we can take this argument further. P4 has grown to encompass a lot
> of functionality of quite complex devices. How do we square that with
> the kernel functionality offload model? If the entire device is modeled,
> including f.e. TSO, an offload would mean that the user has to write
> a TSO implementation which they then load into TC? That seems odd.
>
> IOW I don't quite know how to square in my head the "total
> functionality" with being a TC-based "plugin".

Hi Jakub,

I believe the solution is to replace kernel code with eBPF in cases
where we need programmability. This effectively means that we would
ship eBPF code as part of the kernel. So in the case of TSO, the
kernel would include a standard implementation in eBPF that could be
compiled into the kernel by default. The restricted C source code is
tagged with a hash, so if someone wants to offload TSO they could
compile the source into their target and retain the hash. At runtime
it's a matter of querying the driver to see if the device supports the
TSO program the kernel is running by comparing hash values. Scaling
this, a device could support a catalogue of programs: TSO, LRO,
parser, IPtables, etc. If the kernel can match the hash of its eBPF
code to one reported by the driver then it can assume functionality is
offloadable. This is an elaboration of "device features", but instead
of the device telling us they think they support an adequate GRO
implementation by reporting NETIF_F_GRO, the device would tell the
kernel that it not only supports GRO but provides functionality
identical to the kernel's GRO (which IMO is the first requirement of
kernel offload).
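
As a standalone sketch of that matching step (illustrative only: FNV-1a
stands in for the sha-1 digest of the restricted C source, and the
structure and function names are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the hash-matching scheme: the kernel and the device each
 * carry a digest of the same source; the feature is treated as
 * offloadable only when the digests agree. */
static uint64_t fnv1a(const char *s)
{
	uint64_t h = 1469598103934665603ULL;
	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 1099511628211ULL;
	}
	return h;
}

struct offload_dev {
	uint64_t reported_hash;	/* digest the device FW claims to implement */
};

/* Offload only when the device implements exactly the kernel's program. */
static int can_offload(const struct offload_dev *dev, const char *kernel_src)
{
	return dev->reported_hash == fnv1a(kernel_src);
}
```

A device reporting a digest for an older or modified program would simply
fail the comparison and the kernel would keep running its own eBPF copy.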

Even before considering hardware offload, I think this approach
addresses a more fundamental problem to make the kernel programmable.
Since the code is in eBPF, the kernel can be reprogrammed at runtime
which could be controlled by TC. This allows local customization of
kernel features, but also is the simplest way to "patch" the kernel
with security and bug fixes (nobody is ever excited to do a kernel
rebase in their datacenter!). Flow dissector is a prime candidate for
this, and I am still planning to replace it with an all eBPF program
(https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).

Tom

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03  3:27                     ` Jakub Kicinski
@ 2024-03-03 17:00                       ` Jamal Hadi Salim
  2024-03-03 18:10                         ` Tom Herbert
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-03 17:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > This part i am probably going to mumble on. I am going to consider
> > more than just doing ACLs/MAT via flower/u32 for the sake of
> > discussion.
> > True, "fill the gaps" has been our model so far. It requires kernel
> > changes, user space code changes etc justifiably so because most of
> > the time such datapaths are subject to standardization via IETF, IEEE,
> > etc and new extensions come in on a regular basis.  And sometimes we
> > do add features that one or two users or a single vendor has need for
> > at the cost of kernel and user/control extension. Given our work
> > process, any features added this way take a long time to make it to
> > the end user.
>
> What I had in mind was more of a DDP model. The device loads its binary
> blob FW in whatever way it does, then it tells the kernel its parser
> graph, and tables. The kernel exposes those tables to user space.
> All dynamic, no need to change the kernel for each new protocol.
>
> But that's different in two ways:
>  1. the device tells kernel the tables, no "dynamic reprogramming"
>  2. you don't need the SW side, the only use of the API is to interact
>     with the device
>
> User can still do BPF kfuncs to look up in the tables (like in FIB),
> but call them from cls_bpf.
>

This is not far off from what is envisioned today in the discussions.
The main issue is who loads the binary? We went from devlink to the
filter doing the loading. DDP is ethtool. We still need to tie a PCI
device/tc block to the "program" so we can do skip_sw and it works.
Meaning a device that is capable of handling multiple programs can
have multiple blobs loaded. A "program" is mapped to a tc filter and
MAT control works the same way as it does today (netlink/tc ndo).

A program in P4 has a name and ID, and people have been suggesting a sha1
identity (or a signature of some kind generated by the
compiler). So the upward propagation could be tied to discovering
these 3 tuples from the driver. Then the control plane targets a
program via those tuples via netlink (as we do currently).
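
A minimal sketch of that discovery/targeting step (all structures and
names here are invented for illustration; the real propagation would be
driver- and netlink-specific):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the 3-tuple (name, ID, sha1) identity: drivers report the
 * programs their loaded FW carries, and the control plane only targets a
 * program when all three identifiers match. */
struct prog_ident {
	char name[32];
	uint32_t id;
	char sha1[41];	/* hex digest emitted by the compiler */
};

static const struct prog_ident *
find_program(const struct prog_ident *reported, int n,
	     const struct prog_ident *want)
{
	for (int i = 0; i < n; i++) {
		if (!strcmp(reported[i].name, want->name) &&
		    reported[i].id == want->id &&
		    !strcmp(reported[i].sha1, want->sha1))
			return &reported[i];
	}
	return NULL;	/* device blob does not match the control plane */
}
```

A stale sha1 (e.g. a recompiled program that was never reloaded on the
device) would make the lookup fail rather than silently misprogram the NIC.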

I do note, taking DDP as the sample space, that currently whatever gets
loaded is "trusted" and you really need human knowledge of what the
NIC's parsing + MAT is to send the control. With P4 that is all
visible/programmable by the end user (i am not a proponent of vendors
"shipping" things or calling them for support) - so it should be
sufficient to just discover what is in the binary and send the correct
control messages down.

> I think in P4 terms that may be something more akin to only providing
> the runtime API? I seem to recall they had some distinction...

There are several solutions out there (ex: TDI, P4runtime) - our API
is netlink, and those could be written on top of netlink; there's no
controversy there.
So the starting point is defining the datapath using P4, generating
the binary blob and whatever constraints needed using the vendor
backend and for s/w equivalent generating the eBPF datapath.

> > At the cost of this sounding controversial, i am going
> > to call things like fdb, fib, etc which have fixed datapaths in the
> > kernel "legacy". These "legacy" datapaths almost all the time have
>
> The cynic in me sometimes thinks that the biggest problem with "legacy"
> protocols is that it's hard to make money on them :)

That's a big motivation without a doubt, but there are also people
who want to experiment with things. One of the craziest examples we
have is someone who created a P4 program for "in network calculator",
essentially a calculator in the datapath. You send it two operands and
an operator using custom headers, it does the math and responds with a
result in a new header. By itself this program is a toy but it
demonstrates that if one wanted to, they could have something custom
in hardware and/or kernel datapath.

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-03  1:32       ` Martin KaFai Lau
@ 2024-03-03 17:20         ` Jamal Hadi Salim
  2024-03-05  7:40           ` Martin KaFai Lau
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-03 17:20 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On Sat, Mar 2, 2024 at 8:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/1/24 4:31 AM, Jamal Hadi Salim wrote:
> > On Fri, Mar 1, 2024 at 1:53 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 2/25/24 8:54 AM, Jamal Hadi Salim wrote:
> >>> +struct p4tc_table_entry_act_bpf_params {
> >>
> >> Will this struct be extended in the future?
> >>
> >>> +     u32 pipeid;
> >>> +     u32 tblid;
> >>> +};
> >>> +
> >
> > Not that i can think of. We probably want to have the option to do so
> > if needed. Do you see any harm if we were to make changes for whatever
> > reason in the future?
>
> It will be useful to add an argument named with "__sz" suffix to the kfunc.
> Take a look at how the kfunc in nf_conntrack_bpf.c is handling the "opts" and
> "opts__sz" argument in its kfunc.
>

Ok, will look.
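
For reference, the "__sz" convention can be modeled in plain C roughly
like this (a standalone sketch, not the kernel code; the struct and
function names are invented - the real kfunc takes the caller-declared
size so the struct can grow later without breaking old callers):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical v2 of a params struct: a field was appended since v1. */
struct tbl_params_v2 {
	uint32_t pipeid;
	uint32_t tblid;
	uint32_t flags;	/* added in a later version */
};

/* Accept both old (8-byte) and new (12-byte) callers: copy what the
 * caller passed and zero the fields it does not know about. */
static int tbl_read_checked(const void *params, uint32_t params__sz,
			    struct tbl_params_v2 *out)
{
	if (params__sz < 8 || params__sz > sizeof(*out))
		return -1;			/* reject unknown layouts */
	memset(out, 0, sizeof(*out));
	memcpy(out, params, params__sz);
	return 0;
}
```

This mirrors the "opts"/"opts__sz" pattern in nf_conntrack_bpf.c that
Martin points to above.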

> >
> >>> +struct p4tc_table_entry_create_bpf_params {
> >>> +     u32 profile_id;
> >>> +     u32 pipeid;
> >>> +     u32 tblid;
> >>> +};
> >>> +
> >>
> >> [ ... ]
> >>
> >>> diff --git a/include/net/tc_act/p4tc.h b/include/net/tc_act/p4tc.h
> >>> index c5256d821..155068de0 100644
> >>> --- a/include/net/tc_act/p4tc.h
> >>> +++ b/include/net/tc_act/p4tc.h
> >>> @@ -13,10 +13,26 @@ struct tcf_p4act_params {
> >>>        u32 tot_params_sz;
> >>>    };
> >>>
> >>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
> >>> +
> >>> +struct p4tc_table_entry_act_bpf {
> >>> +     u32 act_id;
> >>> +     u32 hit:1,
> >>> +         is_default_miss_act:1,
> >>> +         is_default_hit_act:1;
> >>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
> >>> +} __packed;
> >>> +
> >>> +struct p4tc_table_entry_act_bpf_kern {
> >>> +     struct rcu_head rcu;
> >>> +     struct p4tc_table_entry_act_bpf act_bpf;
> >>> +};
> >>> +
> >>>    struct tcf_p4act {
> >>>        struct tc_action common;
> >>>        /* Params IDR reference passed during runtime */
> >>>        struct tcf_p4act_params __rcu *params;
> >>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
> >>>        u32 p_id;
> >>>        u32 act_id;
> >>>        struct list_head node;
> >>> @@ -24,4 +40,39 @@ struct tcf_p4act {
> >>>
> >>>    #define to_p4act(a) ((struct tcf_p4act *)a)
> >>>
> >>> +static inline struct p4tc_table_entry_act_bpf *
> >>> +p4tc_table_entry_act_bpf(struct tc_action *action)
> >>> +{
> >>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
> >>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>> +
> >>> +     act_bpf = rcu_dereference(p4act->act_bpf);
> >>> +
> >>> +     return &act_bpf->act_bpf;
> >>> +}
> >>> +
> >>> +static inline int
> >>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
> >>> +                                   u32 dflt_miss, u32 dflt_hit)
> >>> +{
> >>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
> >>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>> +
> >>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
> >>
> >>
> >> [ ... ]
> >>
> >>> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> >>> +bpf_p4tc_tbl_read(struct __sk_buff *skb_ctx,
> >>
> >> The argument could be "struct sk_buff *skb" instead of __sk_buff. Take a look at
> >> commit 2f4643934670.
> >
> > We'll make that change.
> >
> >>
> >>> +               struct p4tc_table_entry_act_bpf_params *params,
> >>> +               void *key, const u32 key__sz)
> >>> +{
> >>> +     struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> >>> +     struct net *caller_net;
> >>> +
> >>> +     caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
> >>> +
> >>> +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> >>> +}
> >>> +
> >>> +__bpf_kfunc static struct p4tc_table_entry_act_bpf *
> >>> +xdp_p4tc_tbl_read(struct xdp_md *xdp_ctx,
> >>> +               struct p4tc_table_entry_act_bpf_params *params,
> >>> +               void *key, const u32 key__sz)
> >>> +{
> >>> +     struct xdp_buff *ctx = (struct xdp_buff *)xdp_ctx;
> >>> +     struct net *caller_net;
> >>> +
> >>> +     caller_net = dev_net(ctx->rxq->dev);
> >>> +
> >>> +     return __bpf_p4tc_tbl_read(caller_net, params, key, key__sz);
> >>> +}
> >>> +
> >>> +static int
> >>> +__bpf_p4tc_entry_create(struct net *net,
> >>> +                     struct p4tc_table_entry_create_bpf_params *params,
> >>> +                     void *key, const u32 key__sz,
> >>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
> >>> +{
> >>> +     struct p4tc_table_entry_key *entry_key = key;
> >>> +     struct p4tc_pipeline *pipeline;
> >>> +     struct p4tc_table *table;
> >>> +
> >>> +     if (!params || !key)
> >>> +             return -EINVAL;
> >>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
> >>> +             return -EINVAL;
> >>> +
> >>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
> >>> +     if (!pipeline)
> >>> +             return -ENOENT;
> >>> +
> >>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
> >>> +     if (!table)
> >>> +             return -ENOENT;
> >>> +
> >>> +     if (entry_key->keysz != table->tbl_keysz)
> >>> +             return -EINVAL;
> >>> +
> >>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
> >>> +                                        params->profile_id);
> >>
> >> My understanding is this kfunc will allocate a "struct
> >> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
> >> never called and the bpf prog is unloaded, how the act_bpf object will be
> >> cleaned up?
> >>
> >
> > The TC code takes care of this. Unloading the bpf prog does not affect
> > the deletion, it is the TC control side that will take care of it. If
> > we delete the pipeline otoh then not just this entry but all entries
> > will be flushed.
>
> It looks like the "struct p4tc_table_entry_act_bpf_kern" object is allocated by
> the bpf prog through kfunc and will only be useful for the bpf prog but not
> other parts of the kernel. However, if the bpf prog is unloaded, these bpf
> specific objects will be left over in the kernel until the tc pipeline (where
> the act_bpf_kern object resided) is gone.
>
> It is the expectation on bpf prog (not only tc/xdp bpf prog) about resources
> clean up that these bpf objects will be gone after unloading the bpf prog and
> unpinning its bpf map.
>

The table (residing on the TC side) could be shared by multiple bpf
programs. Entries are allocated on the TC side of the fence.
IOW, the memory is not owned by the bpf prog but rather by the pipeline.
We do have a "whodunnit" field, i.e. we keep track of which entity
added an entry and we are capable of deleting all entries when we
detect a bpf program being deleted (this would be via deleting the tc
filter). But my thinking is we should make that a policy decision as
opposed to something which is default.

> [ ... ]
>
> >>> +BTF_SET8_START(p4tc_kfunc_check_tbl_set_skb)
> >>
> >> This soon will be broken with the latest change in bpf-next. It is replaced by
> >> BTF_KFUNCS_START. commit a05e90427ef6.
>
> It has already been included in the latest bpf-next pull-request, so should
> reach net-next soon.
>

Ok, we'll wait for it.

>> We may need some guidance. How do you see us writing a selftest for this?
>> We have extensive testing on the control side which is netlink (not
>> part of the current series).

>There are examples in tools/testing/selftests/bpf, e.g. the test_bpf_nf.c to
>test the kfuncs in nf_conntrack_bpf mentioned above. There are also selftests
>doing netlink to set up the test. The bpf/test_progs tries to avoid external
>dependency as much as possible, so linking to an extra external library or
>using an extra tool/binary would be unacceptable; only the bpf/test_progs
>binary will be run by bpf CI.
>
>The selftest does not have to be complicated. It can exercise the kfunc and show
>how the new struct (e.g. struct p4tc_table_entry_bpf_*) will be used. There is
>BPF_PROG_RUN for the tc and xdp prog, so it should be quite doable.

We will look into it.

Thanks for your feedback.

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 17:00                       ` Jamal Hadi Salim
@ 2024-03-03 18:10                         ` Tom Herbert
  2024-03-03 19:04                           ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Herbert @ 2024-03-03 18:10 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jakub Kicinski, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > This part i am probably going to mumble on. I am going to consider
> > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > discussion.
> > > True, "fill the gaps" has been our model so far. It requires kernel
> > > changes, user space code changes etc justifiably so because most of
> > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > etc and new extensions come in on a regular basis.  And sometimes we
> > > do add features that one or two users or a single vendor has need for
> > > at the cost of kernel and user/control extension. Given our work
> > > process, any features added this way take a long time to make it to
> > > the end user.
> >
> > What I had in mind was more of a DDP model. The device loads its binary
> > blob FW in whatever way it does, then it tells the kernel its parser
> > graph, and tables. The kernel exposes those tables to user space.
> > All dynamic, no need to change the kernel for each new protocol.
> >
> > But that's different in two ways:
> >  1. the device tells kernel the tables, no "dynamic reprogramming"
> >  2. you don't need the SW side, the only use of the API is to interact
> >     with the device
> >
> > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > but call them from cls_bpf.
> >
>
> This is not far off from what is envisioned today in the discussions.
> The main issue is who loads the binary? We went from devlink to the
> filter doing the loading. DDP is ethtool. We still need to tie a PCI
> device/tc block to the "program" so we can do skip_sw and it works.
> Meaning a device that is capable of handling multiple programs can
> have multiple blobs loaded. A "program" is mapped to a tc filter and
> MAT control works the same way as it does today (netlink/tc ndo).
>
> A program in P4 has a name, ID and people have been suggesting a sha1
> identity (or a signature of some kind should be generated by the
> compiler). So the upward propagation could be tied to discovering
> these 3 tuples from the driver. Then the control plane targets a
> program via those tuples via netlink (as we do currently).
>
> I do note, using the DDP sample space, currently whatever gets loaded
> is "trusted" and really you need to have human knowledge of what the
> NIC's parsing + MAT is to send the control. With P4 that is all
> visible/programmable by the end user (i am not a proponent of vendors
> "shipping" things or calling them for support) - so should be
> sufficient to just discover what is in the binary and send the correct
> control messages down.
>
> > I think in P4 terms that may be something more akin to only providing
> > the runtime API? I seem to recall they had some distinction...
>
> There are several solutions out there (ex: TDI, P4runtime) - our API
> is netlink and those could be written on top of netlink, there's no
> controversy there.
> So the starting point is defining the datapath using P4, generating
> the binary blob and whatever constraints needed using the vendor
> backend and for s/w equivalent generating the eBPF datapath.
>
> > > At the cost of this sounding controversial, i am going
> > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > kernel "legacy". These "legacy" datapaths almost all the time have
> >
> > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > protocols is that it's hard to make money on them :)
>
> That's a big motivation without a doubt, but also there are people
> that want to experiment with things. One of the craziest examples we
> have is someone who created a P4 program for "in network calculator",
> essentially a calculator in the datapath. You send it two operands and
> an operator using custom headers, it does the math and responds with a
> result in a new header. By itself this program is a toy but it
> demonstrates that if one wanted to, they could have something custom
> in hardware and/or kernel datapath.

Jamal,

Given how long P4 has been around it's surprising that the best
publicly available code example is "the network calculator" toy. At
this point in its lifetime, eBPF had far more examples of real world
use cases publicly available. That being said, there's nothing
unique about P4 supporting the network calculator. We could just as
easily write this in eBPF (either plain C or P4)  and "offload" it to
an ARM core on a SmartNIC.

If we are going to support programmable device offload in the Linux
kernel then I maintain it should be a generic mechanism that's
agnostic to *both* the frontend programming language as well as the
backend target. For frontend languages we want to let the user program
in a language that's convenient for *them*, which honestly in most
cases isn't going to be a narrow use case DSL (i.e. typically users
want to code in C/C++, Python, Rust, etc.). For the backend it's the
same story, maybe we're compiling to run in host, maybe we're
offloading to P4 runtime, maybe we're offloading to another CPU, maybe
we're offloading some other programmable NPU. The only real
requirement is a compiler that can take the frontend code and compile
for the desired backend target, but above all we want this to be easy
for the programmer, the compiler needs to do the heavy lifting and we
should never require the user to understand the nuances of a target.

IMO, the model we want for programmable kernel offload is "write once,
run anywhere, run well". Which is the Java tagline amended with "run
well". Users write one program for their datapath processing, it runs
on various targets, and for any given target we want it to run at the
highest performance levels possible given the target's capabilities.

Tom

>
> cheers,
> jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 18:10                         ` Tom Herbert
@ 2024-03-03 19:04                           ` Jamal Hadi Salim
  2024-03-04 20:18                             ` Jakub Kicinski
  2024-03-04 21:23                             ` Stanislav Fomichev
  0 siblings, 2 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-03 19:04 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jakub Kicinski, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
>
> On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > This part i am probably going to mumble on. I am going to consider
> > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > discussion.
> > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > changes, user space code changes etc justifiably so because most of
> > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > do add features that one or two users or a single vendor has need for
> > > > at the cost of kernel and user/control extension. Given our work
> > > > process, any features added this way take a long time to make it to
> > > > the end user.
> > >
> > > What I had in mind was more of a DDP model. The device loads it binary
> > > blob FW in whatever way it does, then it tells the kernel its parser
> > > graph, and tables. The kernel exposes those tables to user space.
> > > All dynamic, no need to change the kernel for each new protocol.
> > >
> > > But that's different in two ways:
> > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > >  2. you don't need the SW side, the only use of the API is to interact
> > >     with the device
> > >
> > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > but call them from cls_bpf.
> > >
> >
> > This is not far off from what is envisioned today in the discussions.
> > The main issue is who loads the binary? We went from devlink to the
> > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > device/tc block to the "program" so we can do skip_sw and it works.
> > Meaning a device that is capable of handling multiple programs can
> > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > MAT control works the same way as it does today (netlink/tc ndo).
> >
> > A program in P4 has a name, ID and people have been suggesting a sha1
> > identity (or a signature of some kind should be generated by the
> > compiler). So the upward propagation could be tied to discovering
> > these 3 tuples from the driver. Then the control plane targets a
> > program via those tuples via netlink (as we do currently).
> >
> > I do note, using the DDP sample space, currently whatever gets loaded
> > is "trusted" and really you need to have human knowledge of what the
> > NIC's parsing + MAT is to send the control. With P4 that is all
> > visible/programmable by the end user (i am not a proponent of vendors
> > "shipping" things or calling them for support) - so should be
> > sufficient to just discover what is in the binary and send the correct
> > control messages down.
> >
> > > I think in P4 terms that may be something more akin to only providing
> > > the runtime API? I seem to recall they had some distinction...
> >
> > There are several solutions out there (ex: TDI, P4runtime) - our API
> > is netlink and those could be written on top of netlink, there's no
> > controversy there.
> > So the starting point is defining the datapath using P4, generating
> > the binary blob and whatever constraints needed using the vendor
> > backend and for s/w equivalent generating the eBPF datapath.
> >
> > > > At the cost of this sounding controversial, i am going
> > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > >
> > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > protocols is that it's hard to make money on them :)
> >
> > That's a big motivation without a doubt, but also there are people
> > that want to experiment with things. One of the craziest examples we
> > have is someone who created a P4 program for "in network calculator",
> > essentially a calculator in the datapath. You send it two operands and
> > an operator using custom headers, it does the math and responds with a
> > result in a new header. By itself this program is a toy but it
> > demonstrates that if one wanted to, they could have something custom
> > in hardware and/or kernel datapath.
>
> Jamal,
>
> Given how long P4 has been around it's surprising that the best
> publicly available code example is "the network calculator" toy.

Come on Tom ;-> That was just an example of something "crazy" to
demonstrate freedom. I can run that in any of the P4-friendly NICs
today. You are probably being facetious - there are some serious
publicly available projects out there, some of which I quote in the
cover letter (like DASH).

> At
> this point in its lifetime, eBPF had far more examples of real world
> use cases publicly available. That being said, there's nothing
> unique about P4 supporting the network calculator. We could just as
> easily write this in eBPF (either plain C or P4)  and "offload" it to
> an ARM core on a SmartNIC.

With current port speeds hitting 800Gbps, do you want to use Arm cores
as your offload engine? ;-> Running the generated eBPF on the Arm core
is a valid P4 target, i.e. there is no contradiction.
Note: P4 is a DSL specialized for datapath definition; it is not in
competition with eBPF - two different worlds. I see eBPF as an
infrastructure tool, nothing more.

> If we are going to support programmable device offload in the Linux
> kernel then I maintain it should be a generic mechanism that's
> agnostic to *both* the frontend programming language as well as the
> backend target. For frontend languages we want to let the user program
> in a language that's convenient for *them*, which honestly in most
> cases isn't going to be a narrow use case DSL (i.e. typically users
> want to code in C/C++, Python, Rust, etc.).

You and I have never agreed philosophically on this point, ever.
Developers are expensive and not economically scalable. IOW, in the
era of automation (generative AI, etc.) tooling is king. Let's build
the right tooling. Whenever you make this statement I get the vision
of Steve Ballmer ranting on stage with "developers! developers!
developers!" - but that was eons ago. To use your strong view: learn
compilers! And the future is probably to replace compilers with AI.

> For the backend it's the
> same story, maybe we're compiling to run in host, maybe we're
> offloading to P4 runtime, maybe we're offloading to another CPU, maybe
> we're offloading to some other programmable NPU. The only real
> requirement is a compiler that can take the frontend code and compile
> for the desired backend target, but above all we want this to be easy
> for the programmer, the compiler needs to do the heavy lifting and we
> should never require the user to understand the nuances of a target.
>

Agreed, it is possible to use other languages in the frontend. It is
also possible to extend P4.

> IMO, the model we want for programmable kernel offload is "write once,
> run anywhere, run well". Which is the Java tagline amended with "run
> well". Users write one program for their datapath processing; it runs
> on various targets, and on any given target it runs at the highest
> performance level possible given that target's capabilities.
>

I would like to emphasize: our target is P4 - vendors have put out
hardware, people are deploying and evolving things. It is real today,
with deployments, not some science project. I am not arguing you can't
do what you suggested, but we want to initially focus on P4. Neither
am I saying we can't influence P4 to be more Linux-friendly. But none
of that matters. We are only concerned with P4.

cheers,
jamal



> Tom
>
> >
> > cheers,
> > jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 16:31                     ` Tom Herbert
@ 2024-03-04 20:07                       ` Jakub Kicinski
  2024-03-04 20:58                         ` eBPF to implement core functionality WAS " Tom Herbert
  2024-03-04 21:19                       ` Stanislav Fomichev
  1 sibling, 1 reply; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-04 20:07 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jamal Hadi Salim, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	mleitner, Mahesh.Shirshyad, Vipin.Jain, Osinski, Tomasz,
	Jiri Pirko, Cong Wang, David S . Miller, edumazet, Vlad Buslov,
	horms, khalidm, Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, andy.fingerhut,
	Sommers, Chris, mattyk, bpf

On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote:
> Even before considering hardware offload, I think this approach
> addresses a more fundamental problem to make the kernel programmable.

I like some aspects of what you're describing, but my understanding
is that it'd be a noticeable shift in direction.
I'm not sure if merging P4TC is the most effective way of taking
a first step in that direction. (I mean that in the literal sense
of lack of confidence, not polite way to indicate holding a conviction
to the contrary.)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 19:04                           ` Jamal Hadi Salim
@ 2024-03-04 20:18                             ` Jakub Kicinski
  2024-03-04 21:02                               ` Jamal Hadi Salim
  2024-03-04 21:23                             ` Stanislav Fomichev
  1 sibling, 1 reply; 71+ messages in thread
From: Jakub Kicinski @ 2024-03-04 20:18 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote:
> > At
> > this point in its lifetime, eBPF had far more examples of real world
> > use cases publicly available. That being said, there's nothing
> > unique about P4 supporting the network calculator. We could just as
> > easily write this in eBPF (either plain C or P4)  and "offload" it to
> > an ARM core on a SmartNIC.  
> 
> > With current port speeds hitting 800Gbps, do you want to use Arm cores
> > as your offload engine? ;-> Running the generated eBPF on the Arm core
> > is a valid P4 target, i.e. there is no contradiction.
> > Note: P4 is a DSL specialized for datapath definition; it is not in
> > competition with eBPF - two different worlds. I see eBPF as an
> > infrastructure tool, nothing more.

I wonder how much we're benefiting from calling this thing P4 and how
much we should focus on filling in the tech gaps.
Exactly like you said, BPF is not competition, but neither does 
the kernel "support P4", any more than it supports bpftrace and:

$ git grep --files-with-matches bpftrace
Documentation/bpf/redirect.rst
tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c

Filling in tech gaps would also help DDP; IDK how much DDP is based
on or uses P4, nor should I have to care, frankly :S

^ permalink raw reply	[flat|nested] 71+ messages in thread

* eBPF to implement core functionality WAS Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 20:07                       ` Jakub Kicinski
@ 2024-03-04 20:58                         ` Tom Herbert
  0 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2024-03-04 20:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jamal Hadi Salim, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Mon, Mar 4, 2024 at 12:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 3 Mar 2024 08:31:11 -0800 Tom Herbert wrote:
> > Even before considering hardware offload, I think this approach
> > addresses a more fundamental problem to make the kernel programmable.
>
> I like some aspects of what you're describing, but my understanding
> is that it'd be a noticeable shift in direction.
> I'm not sure if merging P4TC is the most effective way of taking
> a first step in that direction. (I mean that in the literal sense
> of lack of confidence, not as a polite way of indicating a conviction
> to the contrary.)

Jakub,

My comments were with regard to making the kernel offloadable by
first making it programmable. The P4TC patches are very good for
describing processing that is table-driven, like filtering or
iptables, but I was thinking more of kernel datapath processing that
isn't table-driven, like GSO, GRO, the flow dissector, and even up to
revisiting TCP offload.

Basically, I'm proposing that instead of eBPF always being side
functionality, there are cases where it could natively be used to
implement the main functionality of the kernel datapath! It is a
noticeable shift in direction, but I also think it's the logical
outcome of eBPF :-).
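As a toy illustration of the kind of datapath function being discussed (purely a Python sketch, not kernel or eBPF code), GSO-style segmentation is essentially splitting an oversized payload into MSS-sized segments while advancing the sequence number:

```python
# Toy sketch of GSO-style segmentation: split an oversized TCP payload
# into MSS-sized segments, advancing the sequence number per segment.
# Purely illustrative; the real implementation lives in the kernel.

def gso_segment(payload: bytes, seq: int, mss: int):
    segments = []
    for off in range(0, len(payload), mss):
        chunk = payload[off:off + mss]
        segments.append({"seq": seq + off, "len": len(chunk), "data": chunk})
    return segments

segs = gso_segment(b"x" * 3000, seq=1000, mss=1448)
for s in segs:
    print(s["seq"], s["len"])   # three segments: 1448 + 1448 + 104 bytes
```

The point of expressing such a function in restricted C/eBPF is that the same source can be hashed and compiled for both the kernel and a device backend.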

Tom

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 20:18                             ` Jakub Kicinski
@ 2024-03-04 21:02                               ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-04 21:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tom Herbert, John Fastabend, Singhai, Anjali, Paolo Abeni,
	Linux Kernel Network Developers, Chatterjee, Deb, Limaye, Namrata,
	Marcelo Ricardo Leitner, Shirshyad, Mahesh, Jain, Vipin,
	Osinski, Tomasz, Jiri Pirko, Cong Wang, David S . Miller,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, Tammela, Pedro, Daly, Dan, Andy Fingerhut,
	Sommers, Chris, Matty Kadosh, bpf

On Mon, Mar 4, 2024 at 3:18 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 3 Mar 2024 14:04:11 -0500 Jamal Hadi Salim wrote:
> > > At
> > > this point in its lifetime, eBPF had far more examples of real world
> > > use cases publicly available. That being said, there's nothing
> > > unique about P4 supporting the network calculator. We could just as
> > > easily write this in eBPF (either plain C or P4)  and "offload" it to
> > > an ARM core on a SmartNIC.
> >
> > With current port speeds hitting 800Gbps, do you want to use Arm cores
> > as your offload engine? ;-> Running the generated eBPF on the Arm core
> > is a valid P4 target, i.e. there is no contradiction.
> > Note: P4 is a DSL specialized for datapath definition; it is not in
> > competition with eBPF - two different worlds. I see eBPF as an
> > infrastructure tool, nothing more.
>
> I wonder how much we're benefiting from calling this thing P4 and how
> much we should focus on filling in the tech gaps.

We are implementing based on the P4 standard specification. I fear it
is confusing to call it something else when everyone else is calling
it P4 (including the vendors whose devices are being targeted in the
case of offload).
If the name is an issue, sure, we can change it.
It just so happens that TC has similar semantics to P4 (match-action
tables) - hence the name P4TC and an implementation encompassing code
that fits nicely with TC.
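For readers unfamiliar with the shared semantics: a match-action table pairs a lookup key with an action to execute, falling back to a default action on a miss. A minimal sketch (all names here are invented for illustration, not the P4TC API):

```python
# Minimal sketch of P4/TC-style exact-match table semantics.
# All names are illustrative, not the P4TC or P4Runtime API.

def drop(pkt):
    return None                     # default action: drop the packet

def forward(port):
    def act(pkt):
        return {**pkt, "out_port": port}
    return act

class ExactMatchTable:
    def __init__(self, key_fields, default_action):
        self.key_fields = key_fields
        self.default_action = default_action
        self.entries = {}           # key tuple -> action

    def add_entry(self, key, action):
        self.entries[key] = action

    def apply(self, pkt):
        key = tuple(pkt[f] for f in self.key_fields)
        action = self.entries.get(key, self.default_action)
        return action(pkt)

fib = ExactMatchTable(key_fields=("dst",), default_action=drop)
fib.add_entry(("10.0.0.1",), forward(port=2))

print(fib.apply({"dst": "10.0.0.1"}))   # hit: forwarded out port 2
print(fib.apply({"dst": "10.0.0.9"}))   # miss: default action (drop)
```

The control plane's job, whether via netlink, TDI or P4Runtime, is essentially populating `entries` at runtime.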

> Exactly like you said, BPF is not competition, but neither does
> the kernel "support P4", any more than it supports bpftrace and:
>

Like i said if name is an issue, let's change the name;->

> $ git grep --files-with-matches bpftrace
> Documentation/bpf/redirect.rst
> tools/testing/selftests/bpf/progs/test_xdp_attach_fail.c
>
> Filling in tech gaps would also help DDP; IDK how much DDP is based
> on or uses P4, nor should I have to care, frankly :S

DDP is an Intel-specific approach that predates P4. As for P4: at
least two vendors (on Cc, including AMD) have NICs targeting the P4
specification, and there are FPGA variants out there as well.
From my discussions with folks at Intel, it is easy to transform DDP
to P4 - my understanding is it is the same compiler folks. The beauty
is that you don't have to use the Intel version of the loaded program
to offload if you want to customize what the hardware does (within
the constraints of what the hardware can do).

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 16:31                     ` Tom Herbert
  2024-03-04 20:07                       ` Jakub Kicinski
@ 2024-03-04 21:19                       ` Stanislav Fomichev
  2024-03-04 22:01                         ` Tom Herbert
  1 sibling, 1 reply; 71+ messages in thread
From: Stanislav Fomichev @ 2024-03-04 21:19 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jakub Kicinski, Jamal Hadi Salim, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, mleitner, Mahesh.Shirshyad, Vipin.Jain,
	tomasz.osinski, Jiri Pirko, Cong Wang, davem, edumazet,
	Vlad Buslov, horms, khalidm, Toke Høiland-Jørgensen,
	Daniel Borkmann, Victor Nogueira, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

On 03/03, Tom Herbert wrote:
> On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > This is configurability versus programmability. The table driven
> > > approach as input (configurability) might work fine for generic
> > > match-action tables up to the point that tables are expressive enough
> > > to satisfy the requirements. But parsing doesn't fall into the table
> > > driven paradigm: parsers want to be *programmed*. This is why we
> > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > But the problem we quickly hit that eBPF is not offloadable to network
> > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > the declarative representation that parsers in the devices could
> > > consume (they're not CPUs running eBPF).
> > >
> > > I think the key here is what we mean by kernel offload. When we do
> > > kernel offload, is it the kernel implementation or the kernel
> > > functionality that's being offloaded? If it's the latter then we have
> > > a lot more flexibility. What we'd need is a safe and secure way to
> > > synchronize with that offload device that precisely supports the
> > > kernel functionality we'd like to offload. This can be done if both
> > > the kernel bits and programmed offload are derived from the same
> > > source (i.e. tag source code with a sha-1). For example, if someone
> > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > backend using independent tool chains and program download. At
> > > runtime, the kernel can safely offload the functionality of the eBPF
> > > parser to the device if it matches the hash to that reported by the
> > > device
> >
> > Good points. If I understand you correctly you're saying that parsers
> > are more complex than just a basic parsing tree a'la u32.
> 
> Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> isn't conducive to u32. We also want the advantages of compiler
> optimizations to unroll loops, squash nodes in the parse graph, etc.
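The data-dependent parse loop Tom describes can be sketched as follows (TCP-option-style TLV encoding assumed; illustrative Python, not a real parser):

```python
# Minimal TLV parse loop (TCP-option style): 1-byte type, 1-byte length
# covering type+len+value. This kind of data-dependent loop is exactly
# what a fixed-offset match tree like u32's cannot express.

def parse_tlvs(buf: bytes):
    tlvs = []
    off = 0
    while off < len(buf):
        t = buf[off]
        if t == 0:                  # end-of-options
            break
        if t == 1:                  # one-byte NOP, no length field
            off += 1
            continue
        length = buf[off + 1]
        if length < 2 or off + length > len(buf):
            raise ValueError("malformed TLV")
        tlvs.append((t, buf[off + 2:off + length]))
        off += length
    return tlvs

# e.g. MSS option (type 2, len 4, value 0x05b4), then NOP and EOL
print(parse_tlvs(bytes([2, 4, 0x05, 0xb4, 1, 0])))
```

A compiler can unroll or bound such loops for a hardware parser; a static match tree cannot.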
> 
> > Then we can take this argument further. P4 has grown to encompass a lot
> > of functionality of quite complex devices. How do we square that with
> > the kernel functionality offload model. If the entire device is modeled,
> > including f.e. TSO, an offload would mean that the user has to write
> > a TSO implementation which they then load into TC? That seems odd.
> >
> > IOW I don't quite know how to square in my head the "total
> > functionality" with being a TC-based "plugin".
> 
> Hi Jakub,
> 
> I believe the solution is to replace kernel code with eBPF in cases
> where we need programmability. This effectively means that we would
> ship eBPF code as part of the kernel. So in the case of TSO, the
> kernel would include a standard implementation in eBPF that could be
> compiled into the kernel by default. The restricted C source code is
> tagged with a hash, so if someone wants to offload TSO they could
> compile the source into their target and retain the hash. At runtime
> it's a matter of querying the driver to see if the device supports the
> TSO program the kernel is running by comparing hash values. Scaling
> this, a device could support a catalogue of programs: TSO, LRO,
> parser, IPtables, etc., If the kernel can match the hash of its eBPF
> code to one reported by the driver then it can assume functionality is
> offloadable. This is an elaboration of "device features", but instead
> of the device telling us they think they support an adequate GRO
> implementation by reporting NETIF_F_GRO, the device would tell the
> kernel that they not only support GRO but they provide identical
> functionality of the kernel GRO (which IMO is the first requirement of
> kernel offload).
> 
> Even before considering hardware offload, I think this approach
> addresses a more fundamental problem to make the kernel programmable.
> Since the code is in eBPF, the kernel can be reprogrammed at runtime
> which could be controlled by TC. This allows local customization of
> kernel features, but also is the simplest way to "patch" the kernel
> with security and bug fixes (nobody is ever excited to do a kernel

[..]

> rebase in their datacenter!). Flow dissector is a prime candidate for
> this, and I am still planning to replace it with an all eBPF program
> (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).

So you're suggesting to bundle (and extend)
tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
similar lines here. We load this program manually right now; shipping
and autoloading it with the kernel will be easier.
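The identity check discussed above (offload functionality only when the device reports the same program hash as the kernel's eBPF build) boils down to digest comparison. A toy sketch, with all names invented, not an actual kernel or driver interface:

```python
# Toy sketch of "offload by identity": the kernel-side eBPF object and
# the device firmware are built from the same tagged source; offload is
# allowed only when the device reports a matching digest.
# All names are invented for illustration.

import hashlib

def tag(source: bytes) -> str:
    """SHA-1 over the shared source artifact, as the thread suggests."""
    return hashlib.sha1(source).hexdigest()

def offloadable(kernel_prog_sha1: str, device_catalogue: set) -> bool:
    # Device reports a catalogue of program hashes it implements
    # (parser, TSO, LRO, ...); match means identical functionality.
    return kernel_prog_sha1 in device_catalogue

parser_src = b"parser program source"   # same input to both toolchains
kernel_hash = tag(parser_src)
device_reports = {tag(parser_src), tag(b"tso program v2")}

print(offloadable(kernel_hash, device_reports))   # True: hashes match
```

The interesting engineering is in making both toolchains consume the identical tagged source, not in the comparison itself.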

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-03 19:04                           ` Jamal Hadi Salim
  2024-03-04 20:18                             ` Jakub Kicinski
@ 2024-03-04 21:23                             ` Stanislav Fomichev
  2024-03-04 21:44                               ` Jamal Hadi Salim
  1 sibling, 1 reply; 71+ messages in thread
From: Stanislav Fomichev @ 2024-03-04 21:23 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Jakub Kicinski, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, Marcelo Ricardo Leitner, Mahesh.Shirshyad,
	Vipin.Jain, tomasz.osinski, Jiri Pirko, Cong Wang, davem,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, pctammela, dan.daly, Andy Fingerhut,
	chris.sommers, Matty Kadosh, bpf

On 03/03, Jamal Hadi Salim wrote:
> On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> >
> > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > This part i am probably going to mumble on. I am going to consider
> > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > discussion.
> > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > changes, user space code changes etc justifiably so because most of
> > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > do add features that one or two users or a single vendor has need for
> > > > > at the cost of kernel and user/control extension. Given our work
> > > > > process, any features added this way take a long time to make it to
> > > > > the end user.
> > > >
> > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > graph, and tables. The kernel exposes those tables to user space.
> > > > All dynamic, no need to change the kernel for each new protocol.
> > > >
> > > > But that's different in two ways:
> > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > >  2. you don't need the SW side, the only use of the API is to interact
> > > >     with the device
> > > >
> > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > but call them from cls_bpf.
> > > >
> > >
> > > This is not far off from what is envisioned today in the discussions.
> > > The main issue is who loads the binary? We went from devlink to the
> > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > device/tc block to the "program" so we can do skip_sw and it works.
> > > Meaning a device that is capable of handling multiple programs can
> > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > MAT control works the same way as it does today (netlink/tc ndo).
> > >
> > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > identity (or a signature of some kind should be generated by the
> > > compiler). So the upward propagation could be tied to discovering
> > > these 3 tuples from the driver. Then the control plane targets a
> > > program via those tuples via netlink (as we do currently).
> > >
> > > I do note, using the DDP sample space, currently whatever gets loaded
> > > is "trusted" and really you need to have human knowledge of what the
> > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > visible/programmable by the end user (i am not a proponent of vendors
> > > "shipping" things or calling them for support) - so should be
> > > sufficient to just discover what is in the binary and send the correct
> > > control messages down.
> > >
> > > > I think in P4 terms that may be something more akin to only providing
> > > > the runtime API? I seem to recall they had some distinction...
> > >
> > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > is netlink and those could be written on top of netlink, there's no
> > > controversy there.
> > > So the starting point is defining the datapath using P4, generating
> > > the binary blob and whatever constraints needed using the vendor
> > > backend and for s/w equivalent generating the eBPF datapath.
> > >
> > > > > At the cost of this sounding controversial, i am going
> > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > >
> > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > protocols is that it's hard to make money on them :)
> > >
> > > That's a big motivation without a doubt, but also there are people
> > > that want to experiment with things. One of the craziest examples we
> > > have is someone who created a P4 program for "in network calculator",
> > > essentially a calculator in the datapath. You send it two operands and
> > > an operator using custom headers, it does the math and responds with a
> > > result in a new header. By itself this program is a toy but it
> > > demonstrates that if one wanted to, they could have something custom
> > > in hardware and/or kernel datapath.
> >
> > Jamal,
> >
> > Given how long P4 has been around it's surprising that the best
> > publicly available code example is "the network calculator" toy.
> 
> Come on Tom ;-> That was just an example of something "crazy" to
> demonstrate freedom. I can run that in any of the P4-friendly NICs
> today. You are probably being facetious - there are some serious
> publicly available projects out there, some of which I quote in the
> cover letter (like DASH).

Shameless plug. I have a more crazy example with bpf:

https://github.com/fomichev/xdp-btc-miner

A good way to ensure all those SmartNIC cycles are not wasted :-D
I wish we had more NICs with XDP BPF offloads :-(

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 21:23                             ` Stanislav Fomichev
@ 2024-03-04 21:44                               ` Jamal Hadi Salim
  2024-03-04 22:23                                 ` Stanislav Fomichev
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-04 21:44 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Tom Herbert, Jakub Kicinski, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, Marcelo Ricardo Leitner, Mahesh.Shirshyad,
	Vipin.Jain, tomasz.osinski, Jiri Pirko, Cong Wang, davem,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, pctammela, dan.daly, Andy Fingerhut,
	chris.sommers, Matty Kadosh, bpf

On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/03, Jamal Hadi Salim wrote:
> > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > >
> > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > discussion.
> > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > do add features that one or two users or a single vendor has need for
> > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > process, any features added this way take a long time to make it to
> > > > > > the end user.
> > > > >
> > > > > What I had in mind was more of a DDP model. The device loads it binary
> > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > >
> > > > > But that's different in two ways:
> > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > >     with the device
> > > > >
> > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > but call them from cls_bpf.
> > > > >
> > > >
> > > > This is not far off from what is envisioned today in the discussions.
> > > > The main issue is who loads the binary? We went from devlink to the
> > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > Meaning a device that is capable of handling multiple programs can
> > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > >
> > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > identity (or a signature of some kind should be generated by the
> > > > compiler). So the upward propagation could be tied to discovering
> > > > these 3 tuples from the driver. Then the control plane targets a
> > > > program via those tuples via netlink (as we do currently).
> > > >
> > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > is "trusted" and really you need to have human knowledge of what the
> > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > "shipping" things or calling them for support) - so should be
> > > > sufficient to just discover what is in the binary and send the correct
> > > > control messages down.
> > > >
> > > > > I think in P4 terms that may be something more akin to only providing
> > > > > the runtime API? I seem to recall they had some distinction...
> > > >
> > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > is netlink and those could be written on top of netlink, there's no
> > > > controversy there.
> > > > So the starting point is defining the datapath using P4, generating
> > > > the binary blob and whatever constraints needed using the vendor
> > > > backend and for s/w equivalent generating the eBPF datapath.
> > > >
> > > > > > At the cost of this sounding controversial, i am going
> > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > >
> > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > protocols is that it's hard to make money on them :)
> > > >
> > > > That's a big motivation without a doubt, but also there are people
> > > > that want to experiment with things. One of the craziest examples we
> > > > have is someone who created a P4 program for "in network calculator",
> > > > essentially a calculator in the datapath. You send it two operands and
> > > > an operator using custom headers, it does the math and responds with a
> > > > result in a new header. By itself this program is a toy but it
> > > > demonstrates that if one wanted to, they could have something custom
> > > > in hardware and/or kernel datapath.
> > >
> > > Jamal,
> > >
> > > Given how long P4 has been around it's surprising that the best
> > > publicly available code example is "the network calculator" toy.
> >
> > Come on Tom ;-> That was just an example of something "crazy" to
> > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > today. You are probably being facetious - There are some serious
> > publicly available projects out there, some of which I quote on the
> > cover letter (like DASH).
>
> Shameless plug. I have a more crazy example with bpf:
>
> https://github.com/fomichev/xdp-btc-miner
>

Hrm - this looks crazy interesting;-> Tempting. I guess to port this
to P4 we'd need the sha256 in h/w (which most of these vendors have
already). Is there any other acceleration you would need? Would have
been more fun if you had invented your own headers too ;->

cheers,
jamal

> A good way to ensure all those smartnic cycles are not wasted :-D
> I wish we had more nics with xdp bpf offloads :-(

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 21:19                       ` Stanislav Fomichev
@ 2024-03-04 22:01                         ` Tom Herbert
  2024-03-04 23:24                           ` Stanislav Fomichev
  0 siblings, 1 reply; 71+ messages in thread
From: Tom Herbert @ 2024-03-04 22:01 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, Jamal Hadi Salim, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, mleitner, Mahesh.Shirshyad, Vipin.Jain,
	tomasz.osinski, Jiri Pirko, Cong Wang, davem, edumazet,
	Vlad Buslov, horms, khalidm, Toke Høiland-Jørgensen,
	Daniel Borkmann, Victor Nogueira, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/03, Tom Herbert wrote:
> > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > This is configurability versus programmability. The table driven
> > > > approach as input (configurability) might work fine for generic
> > > > match-action tables up to the point that tables are expressive enough
> > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > But the problem we quickly hit that eBPF is not offloadable to network
> > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > the declarative representation that parsers in the devices could
> > > > consume (they're not CPUs running eBPF).
> > > >
> > > > I think the key here is what we mean by kernel offload. When we do
> > > > kernel offload, is it the kernel implementation or the kernel
> > > > functionality that's being offloaded? If it's the latter then we have
> > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > synchronize with that offload device that precisely supports the
> > > > kernel functionality we'd like to offload. This can be done if both
> > > > the kernel bits and programmed offload are derived from the same
> > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > backend using independent tool chains and program download. At
> > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > parser to the device if it matches the hash to that reported by the
> > > > device
> > >
> > > Good points. If I understand you correctly you're saying that parsers
> > > are more complex than just a basic parsing tree a'la u32.
> >
> > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > isn't conducive to u32. We also want the advantages of compiler
> > optimizations to unroll loops, squash nodes in the parse graph, etc.
> >
> > > Then we can take this argument further. P4 has grown to encompass a lot
> > > of functionality of quite complex devices. How do we square that with
> > > the kernel functionality offload model. If the entire device is modeled,
> > > including f.e. TSO, an offload would mean that the user has to write
> > > a TSO implementation which they then load into TC? That seems odd.
> > >
> > > IOW I don't quite know how to square in my head the "total
> > > functionality" with being a TC-based "plugin".
> >
> > Hi Jakub,
> >
> > I believe the solution is to replace kernel code with eBPF in cases
> > where we need programmability. This effectively means that we would
> > ship eBPF code as part of the kernel. So in the case of TSO, the
> > kernel would include a standard implementation in eBPF that could be
> > compiled into the kernel by default. The restricted C source code is
> > tagged with a hash, so if someone wants to offload TSO they could
> > compile the source into their target and retain the hash. At runtime
> > it's a matter of querying the driver to see if the device supports the
> > TSO program the kernel is running by comparing hash values. Scaling
> > this, a device could support a catalogue of programs: TSO, LRO,
> > parser, IPtables, etc. If the kernel can match the hash of its eBPF
> > code to one reported by the driver then it can assume functionality is
> > offloadable. This is an elaboration of "device features", but instead
> > of the device telling us they think they support an adequate GRO
> > implementation by reporting NETIF_F_GRO, the device would tell the
> > kernel that they not only support GRO but they provide identical
> > functionality of the kernel GRO (which IMO is the first requirement of
> > kernel offload).
> >
> > Even before considering hardware offload, I think this approach
> > addresses a more fundamental problem to make the kernel programmable.
> > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > which could be controlled by TC. This allows local customization of
> > kernel features, but also is the simplest way to "patch" the kernel
> > with security and bug fixes (nobody is ever excited to do a kernel
>
> [..]
>
> > rebase in their datacenter!). Flow dissector is a prime candidate for
> > this, and I am still planning to replace it with an all eBPF program
> > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
>
> So you're suggesting to bundle (and extend)
> tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> similar lines here. We load this program manually right now; shipping
> and autoloading with the kernel will be easier.

Hi Stanislav,

Yes, I envision that we would have a standard implementation of
flow-dissector in eBPF that is shipped with the kernel and autoloaded.
However, for the front end source I want to move away from imperative
code. As I mentioned in the presentation, flow_dissector.c is spaghetti
code and has been prone to bugs over the years, especially whenever
someone adds support for a new fringe protocol (I take the liberty to
call it spaghetti code since I'm partially responsible for creating
this mess ;-) ).

The problem is that parsers are much better represented by a
declarative rather than an imperative representation. To that end, we
defined PANDA, which allows constructing a parser (parse graph) in data
structures in C. We use the "PANDA parser" to compile that C into
restricted C code which looks more like eBPF in imperative code. With
this method we abstract out all the bookkeeping that was often the
source of bugs (like pulling up skbufs, checking length limits, etc.).
The other advantage is that we're able to find a lot more
optimizations if we start with the right representation of the problem.
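To make the declarative idea concrete, here is a rough sketch in Python rather than C (the node names, field layouts, and protocol subset are illustrative, not the actual PANDA API): the parse graph is plain data, and a single generic engine does the walking, the offset arithmetic, and the length checks that imperative parsers keep reimplementing by hand.

```python
import struct

# Each node: header length, a function extracting the next-protocol key,
# and a successor table keyed by that value.
PARSE_GRAPH = {
    "ether": {
        "hdr_len": 14,
        # EtherType lives at bytes 12..14 of the Ethernet header.
        "next_key": lambda hdr: struct.unpack("!H", hdr[12:14])[0],
        "next": {0x0800: "ipv4"},
    },
    "ipv4": {
        # Real IPv4 has a variable IHL; a fixed 20 bytes keeps the sketch small.
        "hdr_len": 20,
        "next_key": lambda hdr: hdr[9],  # protocol field
        "next": {6: "tcp", 17: "udp"},
    },
    "tcp": {"hdr_len": 20, "next_key": None, "next": {}},
    "udp": {"hdr_len": 8, "next_key": None, "next": {}},
}

def walk(pkt: bytes, root: str = "ether"):
    """Generic engine: all protocol knowledge lives in PARSE_GRAPH."""
    node, off, path = root, 0, []
    while node is not None:
        spec = PARSE_GRAPH[node]
        if off + spec["hdr_len"] > len(pkt):
            break  # length check done once, in the engine, not per protocol
        hdr = pkt[off:off + spec["hdr_len"]]
        path.append(node)
        off += spec["hdr_len"]
        node = spec["next"].get(spec["next_key"](hdr)) if spec["next_key"] else None
    return path
```

Supporting a new protocol means adding a node to the graph, not editing the engine, and a compiler can consume the same graph to emit optimized restricted C.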

If you're interested, the video presentation on this is in
https://www.youtube.com/watch?v=zVnmVDSEoXc.

Tom


* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 21:44                               ` Jamal Hadi Salim
@ 2024-03-04 22:23                                 ` Stanislav Fomichev
  2024-03-04 22:59                                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Stanislav Fomichev @ 2024-03-04 22:23 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Jakub Kicinski, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, Marcelo Ricardo Leitner, Mahesh.Shirshyad,
	Vipin.Jain, tomasz.osinski, Jiri Pirko, Cong Wang, davem,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, pctammela, dan.daly, Andy Fingerhut,
	chris.sommers, Matty Kadosh, bpf

On 03/04, Jamal Hadi Salim wrote:
> On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/03, Jamal Hadi Salim wrote:
> > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > >
> > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > >
> > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > discussion.
> > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > process, any features added this way take a long time to make it to
> > > > > > > the end user.
> > > > > >
> > > > > > What I had in mind was more of a DDP model. The device loads its binary
> > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > >
> > > > > > But that's different in two ways:
> > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > >     with the device
> > > > > >
> > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > but call them from cls_bpf.
> > > > > >
> > > > >
> > > > > This is not far off from what is envisioned today in the discussions.
> > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > Meaning a device that is capable of handling multiple programs can
> > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > >
> > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > identity (or a signature of some kind should be generated by the
> > > > > compiler). So the upward propagation could be tied to discovering
> > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > program via those tuples via netlink (as we do currently).
> > > > >
> > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > "shipping" things or calling them for support) - so should be
> > > > > sufficient to just discover what is in the binary and send the correct
> > > > > control messages down.
> > > > >
> > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > the runtime API? I seem to recall they had some distinction...
> > > > >
> > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > is netlink and those could be written on top of netlink, there's no
> > > > > controversy there.
> > > > > So the starting point is defining the datapath using P4, generating
> > > > > the binary blob and whatever constraints needed using the vendor
> > > > > backend and for s/w equivalent generating the eBPF datapath.
> > > > >
> > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > >
> > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > protocols is that it's hard to make money on them :)
> > > > >
> > > > > That's a big motivation without a doubt, but also there are people
> > > > > that want to experiment with things. One of the craziest examples we
> > > > > have is someone who created a P4 program for "in network calculator",
> > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > an operator using custom headers, it does the math and responds with a
> > > > > result in a new header. By itself this program is a toy but it
> > > > > demonstrates that if one wanted to, they could have something custom
> > > > > in hardware and/or kernel datapath.
> > > >
> > > > Jamal,
> > > >
> > > > Given how long P4 has been around it's surprising that the best
> > > > publicly available code example is "the network calculator" toy.
> > >
> > > Come on Tom ;-> That was just an example of something "crazy" to
> > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > today. You are probably being facetious - There are some serious
> > > publicly available projects out there, some of which I quote on the
> > > cover letter (like DASH).
> >
> > Shameless plug. I have a more crazy example with bpf:
> >
> > https://github.com/fomichev/xdp-btc-miner
> >
> 
> Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> to P4 we'd need the sha256 in h/w (which most of these vendors have
> already). Is there any other acceleration you would need? Would have
> been more fun if you had invented your own headers too ;->

Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))
is one thing. And the other is some way to compare that sha256 vs some
hard-coded (difficulty) number (as a 256-bit uint). But I have no
clue how well that maps into the declarative p4 language. Most likely
possible, if you're saying that the calculator is possible?
I'm assuming that even sha256 can possibly be implemented in p4 without
any extra support from the vendor? It's just a bunch of xors and
rotations over a fixed-size input buffer.


* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 22:23                                 ` Stanislav Fomichev
@ 2024-03-04 22:59                                   ` Jamal Hadi Salim
  2024-03-04 23:14                                     ` Stanislav Fomichev
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-04 22:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Tom Herbert, Jakub Kicinski, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, Marcelo Ricardo Leitner, Mahesh.Shirshyad,
	Vipin.Jain, tomasz.osinski, Jiri Pirko, Cong Wang, davem,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, pctammela, dan.daly, Andy Fingerhut,
	chris.sommers, Matty Kadosh, bpf

On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/04, Jamal Hadi Salim wrote:
> > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/03, Jamal Hadi Salim wrote:
> > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > > >
> > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > >
> > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > > discussion.
> > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > > process, any features added this way take a long time to make it to
> > > > > > > > the end user.
> > > > > > >
> > > > > > > What I had in mind was more of a DDP model. The device loads its binary
> > > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > > >
> > > > > > > But that's different in two ways:
> > > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > > >     with the device
> > > > > > >
> > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > > but call them from cls_bpf.
> > > > > > >
> > > > > >
> > > > > > This is not far off from what is envisioned today in the discussions.
> > > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > > Meaning a device that is capable of handling multiple programs can
> > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > > >
> > > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > > identity (or a signature of some kind should be generated by the
> > > > > > compiler). So the upward propagation could be tied to discovering
> > > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > > program via those tuples via netlink (as we do currently).
> > > > > >
> > > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > > "shipping" things or calling them for support) - so should be
> > > > > > sufficient to just discover what is in the binary and send the correct
> > > > > > control messages down.
> > > > > >
> > > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > > the runtime API? I seem to recall they had some distinction...
> > > > > >
> > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > > is netlink and those could be written on top of netlink, there's no
> > > > > > controversy there.
> > > > > > So the starting point is defining the datapath using P4, generating
> > > > > > the binary blob and whatever constraints needed using the vendor
> > > > > > backend and for s/w equivalent generating the eBPF datapath.
> > > > > >
> > > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > > >
> > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > > protocols is that it's hard to make money on them :)
> > > > > >
> > > > > > That's a big motivation without a doubt, but also there are people
> > > > > > that want to experiment with things. One of the craziest examples we
> > > > > > have is someone who created a P4 program for "in network calculator",
> > > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > > an operator using custom headers, it does the math and responds with a
> > > > > > result in a new header. By itself this program is a toy but it
> > > > > > demonstrates that if one wanted to, they could have something custom
> > > > > > in hardware and/or kernel datapath.
> > > > >
> > > > > Jamal,
> > > > >
> > > > > Given how long P4 has been around it's surprising that the best
> > > > > publicly available code example is "the network calculator" toy.
> > > >
> > > > Come on Tom ;-> That was just an example of something "crazy" to
> > > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > > today. You are probably being facetious - There are some serious
> > > > publicly available projects out there, some of which I quote on the
> > > > cover letter (like DASH).
> > >
> > > Shameless plug. I have a more crazy example with bpf:
> > >
> > > https://github.com/fomichev/xdp-btc-miner
> > >
> >
> > Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> > to P4 we'd need the sha256 in h/w (which most of these vendors have
> > already). Is there any other acceleration you would need? Would have
> > been more fun if you had invented your own headers too ;->
>
> Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))

This part is straightforward.

> is one thing. And the other is some way to compare that sha256 vs some
> hard-coded (difficulty) number (as a 256-bit uint).

The compiler may have issues with this comparison - will have to look
(I am pretty sure it's fixable though).


>  But I have no
> clue how well that maps into declarative p4 language. Most likely
> possible if you're saying that the calculator is possible?

The calculator is basically written as a set of match-action tables.
You parse your header, construct a key based on the operator field of
the header (e.g. "+"), and invoke an action which takes the operands
from the header (e.g. "1" and "2"); the action returns the result
("3"). You stash the result in a new packet and send it back to the
source.

So my thinking is the computation you need would be modelled as an action.
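Sketched in Python (the one-byte opcode plus two 4-byte big-endian operands header layout is an assumption for illustration, not the actual program's wire format), the table-driven shape is:

```python
import struct

# Match-action table: key = opcode byte from the custom header,
# value = the action bound to that key.
CALC_TABLE = {
    ord("+"): lambda a, b: a + b,
    ord("-"): lambda a, b: a - b,
    ord("*"): lambda a, b: a * b,
}

def process(pkt: bytes) -> bytes:
    """Parse the header, look up the opcode key, invoke the bound action."""
    op = pkt[0]
    a, b = struct.unpack("!II", pkt[1:9])  # two 4-byte operands
    action = CALC_TABLE.get(op)
    if action is None:
        return pkt                         # no match: default pass-through
    result = action(a, b)
    # Stash the result in a reply header to send back to the source.
    return bytes([op]) + struct.pack("!i", result)
```

Parsing, key construction, and action invocation stay separate stages, which is what makes the same shape expressible as P4 parser + table + action.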

> I'm assuming that even sha256 can possibly be implemented in p4 without
> any extra support from the vendor? It's just a bunch of xors and
> rotations over a fixed-size input buffer.

True, and I think those would be fast. But if the h/w offers it as an
interface, why not use it? It's not that you are running out of
instruction space - and my memory is hazy - but IIRC there is sha256
support in the kernel Crypto API - does it not make sense to kfunc
into that?

cheers,
jamal


* Re: Hardware Offload discussion WAS(Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 22:59                                   ` Jamal Hadi Salim
@ 2024-03-04 23:14                                     ` Stanislav Fomichev
  0 siblings, 0 replies; 71+ messages in thread
From: Stanislav Fomichev @ 2024-03-04 23:14 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Jakub Kicinski, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, Marcelo Ricardo Leitner, Mahesh.Shirshyad,
	Vipin.Jain, tomasz.osinski, Jiri Pirko, Cong Wang, davem,
	Eric Dumazet, Vlad Buslov, Simon Horman, Khalid Manaa,
	Toke Høiland-Jørgensen, Daniel Borkmann,
	Victor Nogueira, pctammela, dan.daly, Andy Fingerhut,
	chris.sommers, Matty Kadosh, bpf

On 03/04, Jamal Hadi Salim wrote:
> On Mon, Mar 4, 2024 at 5:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/04, Jamal Hadi Salim wrote:
> > > On Mon, Mar 4, 2024 at 4:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 03/03, Jamal Hadi Salim wrote:
> > > > > On Sun, Mar 3, 2024 at 1:11 PM Tom Herbert <tom@sipanda.io> wrote:
> > > > > >
> > > > > > On Sun, Mar 3, 2024 at 9:00 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Sat, Mar 2, 2024 at 10:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Sat, 2 Mar 2024 09:36:53 -0500 Jamal Hadi Salim wrote:
> > > > > > > > > 2) Your point on:  "integrate later", or at least "fill in the gaps"
> > > > > > > > > This part i am probably going to mumble on. I am going to consider
> > > > > > > > > more than just doing ACLs/MAT via flower/u32 for the sake of
> > > > > > > > > discussion.
> > > > > > > > > True, "fill the gaps" has been our model so far. It requires kernel
> > > > > > > > > changes, user space code changes etc justifiably so because most of
> > > > > > > > > the time such datapaths are subject to standardization via IETF, IEEE,
> > > > > > > > > etc and new extensions come in on a regular basis.  And sometimes we
> > > > > > > > > do add features that one or two users or a single vendor has need for
> > > > > > > > > at the cost of kernel and user/control extension. Given our work
> > > > > > > > > process, any features added this way take a long time to make it to
> > > > > > > > > the end user.
> > > > > > > >
> > > > > > > > What I had in mind was more of a DDP model. The device loads its binary
> > > > > > > > blob FW in whatever way it does, then it tells the kernel its parser
> > > > > > > > graph, and tables. The kernel exposes those tables to user space.
> > > > > > > > All dynamic, no need to change the kernel for each new protocol.
> > > > > > > >
> > > > > > > > But that's different in two ways:
> > > > > > > >  1. the device tells kernel the tables, no "dynamic reprogramming"
> > > > > > > >  2. you don't need the SW side, the only use of the API is to interact
> > > > > > > >     with the device
> > > > > > > >
> > > > > > > > User can still do BPF kfuncs to look up in the tables (like in FIB),
> > > > > > > > but call them from cls_bpf.
> > > > > > > >
> > > > > > >
> > > > > > > This is not far off from what is envisioned today in the discussions.
> > > > > > > The main issue is who loads the binary? We went from devlink to the
> > > > > > > filter doing the loading. DDP is ethtool. We still need to tie a PCI
> > > > > > > device/tc block to the "program" so we can do skip_sw and it works.
> > > > > > > Meaning a device that is capable of handling multiple programs can
> > > > > > > have multiple blobs loaded. A "program" is mapped to a tc filter and
> > > > > > > MAT control works the same way as it does today (netlink/tc ndo).
> > > > > > >
> > > > > > > A program in P4 has a name, ID and people have been suggesting a sha1
> > > > > > > identity (or a signature of some kind should be generated by the
> > > > > > > compiler). So the upward propagation could be tied to discovering
> > > > > > > these 3 tuples from the driver. Then the control plane targets a
> > > > > > > program via those tuples via netlink (as we do currently).
> > > > > > >
> > > > > > > I do note, using the DDP sample space, currently whatever gets loaded
> > > > > > > is "trusted" and really you need to have human knowledge of what the
> > > > > > > NIC's parsing + MAT is to send the control. With P4 that is all
> > > > > > > visible/programmable by the end user (i am not a proponent of vendors
> > > > > > > "shipping" things or calling them for support) - so should be
> > > > > > > sufficient to just discover what is in the binary and send the correct
> > > > > > > control messages down.
> > > > > > >
> > > > > > > > I think in P4 terms that may be something more akin to only providing
> > > > > > > > the runtime API? I seem to recall they had some distinction...
> > > > > > >
> > > > > > > There are several solutions out there (ex: TDI, P4runtime) - our API
> > > > > > > is netlink and those could be written on top of netlink, there's no
> > > > > > > controversy there.
> > > > > > > So the starting point is defining the datapath using P4, generating
> > > > > > > the binary blob (and whatever constraints are needed) using the
> > > > > > > vendor backend and, for the s/w equivalent, generating the eBPF
> > > > > > > datapath.
> > > > > > >
> > > > > > > > > At the cost of this sounding controversial, i am going
> > > > > > > > > to call things like fdb, fib, etc which have fixed datapaths in the
> > > > > > > > > kernel "legacy". These "legacy" datapaths almost all the time have
> > > > > > > >
> > > > > > > > The cynic in me sometimes thinks that the biggest problem with "legacy"
> > > > > > > > protocols is that it's hard to make money on them :)
> > > > > > >
> > > > > > > That's a big motivation without a doubt, but also there are people
> > > > > > > that want to experiment with things. One of the craziest examples we
> > > > > > > have is someone who created a P4 program for "in network calculator",
> > > > > > > essentially a calculator in the datapath. You send it two operands and
> > > > > > > an operator using custom headers, it does the math and responds with a
> > > > > > > result in a new header. By itself this program is a toy but it
> > > > > > > demonstrates that if one wanted to, they could have something custom
> > > > > > > in hardware and/or kernel datapath.
> > > > > >
> > > > > > Jamal,
> > > > > >
> > > > > > Given how long P4 has been around it's surprising that the best
> > > > > > publicly available code example is "the network calculator" toy.
> > > > >
> > > > > Come on Tom ;-> That was just an example of something "crazy" to
> > > > > demonstrate freedom. I can run that in any of the P4 friendly NICs
> > > > > today. You are probably being facetious - There are some serious
> > > > > publicly available projects out there, some of which I quote on the
> > > > > cover letter (like DASH).
> > > >
> > > > Shameless plug. I have a more crazy example with bpf:
> > > >
> > > > https://github.com/fomichev/xdp-btc-miner
> > > >
> > >
> > > Hrm - this looks crazy interesting;-> Tempting. I guess to port this
> > > to P4 we'd need the sha256 in h/w (which most of these vendors have
> > > already). Is there any other acceleration you would need? Would have
> > > been more fun if you had invented your own headers too ;->
> >
> > Yeah, some way to do sha256(sha256(at_some_fixed_packet_offset + 80 bytes))
> 
> This part is straight forward.
> 
> > is one thing. And the other is some way to compare that sha256 vs some
> > hard-coded (difficulty) number (as a 256-bit uint).
> 
> The compiler may have issues with this comparison - will have to look
> (I am pretty sure it's fixable though).
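The comparison Stanislav describes, a 256-bit hash against a hard-coded difficulty target, reduces to a byte-wise compare when both values are stored big-endian; a minimal C sketch (helper names invented for illustration, not from any kernel or P4 API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Compare two 256-bit numbers stored as big-endian byte arrays.
 * Returns <0, 0, >0 like memcmp.
 */
static int u256_cmp_be(const uint8_t a[32], const uint8_t b[32])
{
	/* Big-endian layout means lexicographic order == numeric order. */
	return memcmp(a, b, 32);
}

/* A hash "meets" the difficulty target when hash <= target numerically. */
static int hash_meets_target(const uint8_t hash[32], const uint8_t target[32])
{
	return u256_cmp_be(hash, target) <= 0;
}
```

Since lexicographic order of big-endian byte arrays coincides with numeric order, a plain memcmp suffices; no 256-bit arithmetic is needed for the compare itself.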
> 
> 
> >  But I have no
> > clue how well that maps into declarative p4 language. Most likely
> > possible if you're saying that the calculator is possible?
> 
> The calculator is basically written as a set of match-action tables.
> You parse your header, construct a key based on the operator field of
> the header (eg "+"), and invoke an action which takes the operands from
> the headers (eg "1" and "2"); the action returns the result ("3"). You
> stash the result in a new packet and send it back to the source.
> 
> So my thinking is the computation you need would be modelled as an action.
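The match-action flow sketched above can be illustrated in plain C; the structs and function names below are invented for illustration and are not the P4-generated code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative match-action table for the "in-network calculator":
 * the parsed operator byte is the key; the action computes a result
 * from the two operands carried in the custom header.
 */
struct calc_hdr {
	uint8_t op;      /* '+', '-' or '*', parsed from the packet */
	int32_t a, b;    /* operands from the custom header */
	int32_t result;  /* filled in by the action */
};

typedef void (*calc_action)(struct calc_hdr *h);

static void act_add(struct calc_hdr *h) { h->result = h->a + h->b; }
static void act_sub(struct calc_hdr *h) { h->result = h->a - h->b; }
static void act_mul(struct calc_hdr *h) { h->result = h->a * h->b; }

struct calc_entry { uint8_t key; calc_action act; };

static const struct calc_entry calc_table[] = {
	{ '+', act_add }, { '-', act_sub }, { '*', act_mul },
};

/* Table lookup + action invocation: returns 0 on hit, -1 on miss. */
static int calc_apply(struct calc_hdr *h)
{
	for (unsigned i = 0; i < sizeof(calc_table)/sizeof(calc_table[0]); i++) {
		if (calc_table[i].key == h->op) {
			calc_table[i].act(h);
			return 0;
		}
	}
	return -1;  /* miss: a real pipeline would run a default action */
}
```

The toy dispatch loop stands in for whatever lookup the target provides (exact-match hardware, an eBPF map, etc.); the point is only that both the key construction and the action body are driven by the parsed header.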
> 
> > I'm assuming that even sha256 can possibly be implemented in p4 without
> > any extra support from the vendor? It's just a bunch of xors and
> > rotations over a fixed-size input buffer.

[..]

> True,  and I think those would be fast. But if the h/w offers it as an
> interface why not.
> It's not that you are running out of instruction space - and my memory
> is hazy - but iirc, there is sha256 support in the kernel Crypto API -
> does it not make sense to kfunc into that?

Oh yeah, that's definitely a better path if somebody were to do it
"properly". It's still fun, though, to see how far we can push
the bpf vm/verifier without using any extra helpers :-D

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 22:01                         ` Tom Herbert
@ 2024-03-04 23:24                           ` Stanislav Fomichev
  2024-03-04 23:50                             ` Tom Herbert
  0 siblings, 1 reply; 71+ messages in thread
From: Stanislav Fomichev @ 2024-03-04 23:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jakub Kicinski, Jamal Hadi Salim, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, mleitner, Mahesh.Shirshyad, Vipin.Jain,
	tomasz.osinski, Jiri Pirko, Cong Wang, davem, edumazet,
	Vlad Buslov, horms, khalidm, Toke Høiland-Jørgensen,
	Daniel Borkmann, Victor Nogueira, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

On 03/04, Tom Herbert wrote:
> On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/03, Tom Herbert wrote:
> > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > > This is configurability versus programmability. The table driven
> > > > > approach as input (configurability) might work fine for generic
> > > > > match-action tables up to the point that tables are expressive enough
> > > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > > But the problem we quickly hit is that eBPF is not offloadable to network
> > > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > > the declarative representation that parsers in the devices could
> > > > > consume (they're not CPUs running eBPF).
> > > > >
> > > > > I think the key here is what we mean by kernel offload. When we do
> > > > > kernel offload, is it the kernel implementation or the kernel
> > > > > functionality that's being offloaded? If it's the latter then we have
> > > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > > synchronize with that offload device that precisely supports the
> > > > > kernel functionality we'd like to offload. This can be done if both
> > > > > the kernel bits and programmed offload are derived from the same
> > > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > > backend using independent tool chains and program download. At
> > > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > > parser to the device if it matches the hash to that reported by the
> > > > > device
> > > >
> > > > Good points. If I understand you correctly you're saying that parsers
> > > > are more complex than just a basic parsing tree a'la u32.
> > >
> > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > > isn't conducive to u32. We also want the advantages of compiler
> > > optimizations to unroll loops, squash nodes in the parse graph, etc.
> > >
> > > > Then we can take this argument further. P4 has grown to encompass a lot
> > > > of functionality of quite complex devices. How do we square that with
> > > > the kernel functionality offload model. If the entire device is modeled,
> > > > including f.e. TSO, an offload would mean that the user has to write
> > > > a TSO implementation which they then load into TC? That seems odd.
> > > >
> > > > IOW I don't quite know how to square in my head the "total
> > > > functionality" with being a TC-based "plugin".
> > >
> > > Hi Jakub,
> > >
> > > I believe the solution is to replace kernel code with eBPF in cases
> > > where we need programmability. This effectively means that we would
> > > ship eBPF code as part of the kernel. So in the case of TSO, the
> > > kernel would include a standard implementation in eBPF that could be
> > > compiled into the kernel by default. The restricted C source code is
> > > tagged with a hash, so if someone wants to offload TSO they could
> > > compile the source into their target and retain the hash. At runtime
> > > it's a matter of querying the driver to see if the device supports the
> > > TSO program the kernel is running by comparing hash values. Scaling
> > > this, a device could support a catalogue of programs: TSO, LRO,
> > > parser, iptables, etc. If the kernel can match the hash of its eBPF
> > > code to one reported by the driver then it can assume functionality is
> > > offloadable. This is an elaboration of "device features", but instead
> > > of the device telling us they think they support an adequate GRO
> > > implementation by reporting NETIF_F_GRO, the device would tell the
> > > kernel that it not only supports GRO but provides functionality
> > > identical to the kernel's GRO (which IMO is the first requirement of
> > > kernel offload).
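The "catalogue of programs" idea above amounts to an exact-match lookup of (program name, source hash) in a device-reported table; a hypothetical sketch follows, where every name is invented for illustration and nothing here exists in the kernel:

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20  /* e.g. sha-1 of the restricted-C source */

/* Hypothetical per-device catalogue of offloadable programs. */
struct offload_entry {
	const char *name;            /* "tso", "gro", "flow_dissector", ... */
	unsigned char hash[HASH_LEN];
};

/* Offload is considered safe only on an exact functionality match:
 * the same source hash on both the kernel and device sides.
 */
static int can_offload(const struct offload_entry *catalogue, int n,
		       const char *name, const unsigned char *kernel_hash)
{
	for (int i = 0; i < n; i++) {
		if (strcmp(catalogue[i].name, name) == 0 &&
		    memcmp(catalogue[i].hash, kernel_hash, HASH_LEN) == 0)
			return 1;
	}
	return 0;  /* no match: keep running the eBPF implementation */
}
```

This is the elaboration of NETIF_F_* feature bits being discussed: instead of a boolean "we support GRO", the device reports which exact program sources it implements, and the kernel offloads only on a hash match.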
> > >
> > > Even before considering hardware offload, I think this approach
> > > addresses a more fundamental problem to make the kernel programmable.
> > > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > > which could be controlled by TC. This allows local customization of
> > > kernel features, but also is the simplest way to "patch" the kernel
> > > with security and bug fixes (nobody is ever excited to do a kernel
> >
> > [..]
> >
> > > rebase in their datacenter!). Flow dissector is a prime candidate for
> > > this, and I am still planning to replace it with an all eBPF program
> > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
> >
> > So you're suggesting to bundle (and extend)
> > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> > similar lines here. We load this program manually right now, shipping
> > and autoloading with the kernel will be easier.
> 
> Hi Stanislav,
> 
> Yes, I envision that we would have a standard implementation of
> flow-dissector in eBPF that is shipped with the kernel and autoloaded.
> However, for the front end source I want to move away from imperative
> code. As I mentioned in the presentation flow_dissector.c is spaghetti
> code and has been prone to bugs over the years especially whenever
> someone adds support for a new fringe protocol (I take the liberty to
> call it spaghetti code since I'm partially responsible for creating
> this mess ;-) ).
> 
> The problem is that parsers are much better represented by a
> declarative rather than an imperative representation. To that end, we
> defined PANDA which allows constructing a parser (parse graph) in data
> structures in C. We use the "PANDA parser" to compile C to restricted
> C code which looks more like eBPF in imperative code. With this method
> we abstract out all the bookkeeping that was often the source of bugs
> (like pulling up skbufs, checking length limits, etc.). The other
> advantage is that we're able to find a lot more optimizations if we
> start with the right representation of the problem.
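A declarative parse graph expressed as C data structures, in the spirit of what Tom describes, might look roughly like this; the node layout is invented for illustration and is not the actual PANDA API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A parse node declares where the next-protocol selector field lives
 * and how big the header is; edges map selector values to successors.
 */
struct parse_node;

struct parse_edge {
	uint16_t value;                  /* e.g. EtherType 0x0800 */
	const struct parse_node *next;
};

struct parse_node {
	const char *name;
	size_t hdr_len;                  /* fixed header length */
	size_t next_proto_off;           /* offset of the selector field */
	const struct parse_edge *edges;
	int n_edges;
};

/* Table-driven walk: a generic engine (or a compiler) consumes the
 * graph; protocol knowledge lives only in the data, not in branchy code.
 */
static const struct parse_node *
parse_step(const struct parse_node *node, const uint8_t *pkt)
{
	uint16_t sel = (uint16_t)(pkt[node->next_proto_off] << 8 |
				  pkt[node->next_proto_off + 1]);
	for (int i = 0; i < node->n_edges; i++)
		if (node->edges[i].value == sel)
			return node->edges[i].next;
	return NULL;  /* no edge: stop parsing */
}

static const struct parse_node ipv4_node = { "ipv4", 20, 9, NULL, 0 };
static const struct parse_edge eth_edges[] = { { 0x0800, &ipv4_node } };
static const struct parse_node eth_node = { "eth", 14, 12, eth_edges, 1 };
```

Because the graph is data, a compiler can squash nodes, unroll the walk, or emit a device-native parser from the same description, which is the declarative advantage being argued for in this thread.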
> 
> If you're interested, the video presentation on this is in
> https://www.youtube.com/watch?v=zVnmVDSEoXc.

Oh, yeah, I've seen this one. Agreed that the C implementation is not
pleasant and generating a parser from some declarative spec is a better
idea.

From my pov, the biggest win we get from making bpf flow dissector
pluggable is the fact that we can now actually write some tests for it
(and, maybe, fuzz it?). We should also probably spend more time properly
defining the behavior of the existing C implementation. We've seen
some interesting bugs like this one:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7


* Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
  2024-03-04 23:24                           ` Stanislav Fomichev
@ 2024-03-04 23:50                             ` Tom Herbert
  0 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2024-03-04 23:50 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, Jamal Hadi Salim, John Fastabend, anjali.singhai,
	Paolo Abeni, Linux Kernel Network Developers, deb.chatterjee,
	namrata.limaye, mleitner, Mahesh.Shirshyad, Vipin.Jain,
	tomasz.osinski, Jiri Pirko, Cong Wang, davem, edumazet,
	Vlad Buslov, horms, khalidm, Toke Høiland-Jørgensen,
	Daniel Borkmann, Victor Nogueira, pctammela, dan.daly,
	andy.fingerhut, chris.sommers, mattyk, bpf

On Mon, Mar 4, 2024 at 3:24 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/04, Tom Herbert wrote:
> > On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/03, Tom Herbert wrote:
> > > > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > >
> > > > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > > > This is configurability versus programmability. The table driven
> > > > > > approach as input (configurability) might work fine for generic
> > > > > > match-action tables up to the point that tables are expressive enough
> > > > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > > > But the problem we quickly hit is that eBPF is not offloadable to network
> > > > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > > > the declarative representation that parsers in the devices could
> > > > > > consume (they're not CPUs running eBPF).
> > > > > >
> > > > > > I think the key here is what we mean by kernel offload. When we do
> > > > > > kernel offload, is it the kernel implementation or the kernel
> > > > > > functionality that's being offloaded? If it's the latter then we have
> > > > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > > > synchronize with that offload device that precisely supports the
> > > > > > kernel functionality we'd like to offload. This can be done if both
> > > > > > the kernel bits and programmed offload are derived from the same
> > > > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > > > backend using independent tool chains and program download. At
> > > > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > > > parser to the device if it matches the hash to that reported by the
> > > > > > device
> > > > >
> > > > > Good points. If I understand you correctly you're saying that parsers
> > > > > are more complex than just a basic parsing tree a'la u32.
> > > >
> > > > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > > > isn't conducive to u32. We also want the advantages of compiler
> > > > optimizations to unroll loops, squash nodes in the parse graph, etc.
> > > >
> > > > > Then we can take this argument further. P4 has grown to encompass a lot
> > > > > of functionality of quite complex devices. How do we square that with
> > > > > the kernel functionality offload model. If the entire device is modeled,
> > > > > including f.e. TSO, an offload would mean that the user has to write
> > > > > a TSO implementation which they then load into TC? That seems odd.
> > > > >
> > > > > IOW I don't quite know how to square in my head the "total
> > > > > functionality" with being a TC-based "plugin".
> > > >
> > > > Hi Jakub,
> > > >
> > > > I believe the solution is to replace kernel code with eBPF in cases
> > > > where we need programmability. This effectively means that we would
> > > > ship eBPF code as part of the kernel. So in the case of TSO, the
> > > > kernel would include a standard implementation in eBPF that could be
> > > > compiled into the kernel by default. The restricted C source code is
> > > > tagged with a hash, so if someone wants to offload TSO they could
> > > > compile the source into their target and retain the hash. At runtime
> > > > it's a matter of querying the driver to see if the device supports the
> > > > TSO program the kernel is running by comparing hash values. Scaling
> > > > this, a device could support a catalogue of programs: TSO, LRO,
> > > > parser, iptables, etc. If the kernel can match the hash of its eBPF
> > > > code to one reported by the driver then it can assume functionality is
> > > > offloadable. This is an elaboration of "device features", but instead
> > > > of the device telling us they think they support an adequate GRO
> > > > implementation by reporting NETIF_F_GRO, the device would tell the
> > > > kernel that it not only supports GRO but provides functionality
> > > > identical to the kernel's GRO (which IMO is the first requirement of
> > > > kernel offload).
> > > >
> > > > Even before considering hardware offload, I think this approach
> > > > addresses a more fundamental problem to make the kernel programmable.
> > > > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > > > which could be controlled by TC. This allows local customization of
> > > > kernel features, but also is the simplest way to "patch" the kernel
> > > > with security and bug fixes (nobody is ever excited to do a kernel
> > >
> > > [..]
> > >
> > > > rebase in their datacenter!). Flow dissector is a prime candidate for
> > > > this, and I am still planning to replace it with an all eBPF program
> > > > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
> > >
> > > So you're suggesting to bundle (and extend)
> > > tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> > > similar lines here. We load this program manually right now, shipping
> > > and autoloading with the kernel will be easier.
> >
> > Hi Stanislav,
> >
> > Yes, I envision that we would have a standard implementation of
> > flow-dissector in eBPF that is shipped with the kernel and autoloaded.
> > However, for the front end source I want to move away from imperative
> > code. As I mentioned in the presentation flow_dissector.c is spaghetti
> > code and has been prone to bugs over the years especially whenever
> > someone adds support for a new fringe protocol (I take the liberty to
> > call it spaghetti code since I'm partially responsible for creating
> > this mess ;-) ).
> >
> > The problem is that parsers are much better represented by a
> > declarative rather than an imperative representation. To that end, we
> > defined PANDA which allows constructing a parser (parse graph) in data
> > structures in C. We use the "PANDA parser" to compile C to restricted
> > C code which looks more like eBPF in imperative code. With this method
> > we abstract out all the bookkeeping that was often the source of bugs
> > (like pulling up skbufs, checking length limits, etc.). The other
> > advantage is that we're able to find a lot more optimizations if we
> > start with the right representation of the problem.
> >
> > If you're interested, the video presentation on this is in
> > https://www.youtube.com/watch?v=zVnmVDSEoXc.
>
> Oh, yeah, I've seen this one. Agreed that the C implementation is not
> pleasant and generating a parser from some declarative spec is a better
> idea.
>
> From my pov, the biggest win we get from making bpf flow dissector
> pluggable is the fact that we can now actually write some tests for it

Yes, extracting out functions from the kernel allows them to be
independently unit tested. It's an even bigger win if the same source
code is used for offloading the functionality as I described. We can
call this "Test once, run anywhere!"

Tom

> (and, maybe, fuzz it?). We should also probably spend more time properly
> defining the behavior of the existing C implementation. We've seen
> some interesting bugs like this one:
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=9fa02892857ae2b3b699630e5ede28f72106e7e7


* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-03 17:20         ` Jamal Hadi Salim
@ 2024-03-05  7:40           ` Martin KaFai Lau
  2024-03-05 12:30             ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-05  7:40 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On 3/3/24 9:20 AM, Jamal Hadi Salim wrote:

>>>>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
>>>>> +
>>>>> +struct p4tc_table_entry_act_bpf {
>>>>> +     u32 act_id;
>>>>> +     u32 hit:1,
>>>>> +         is_default_miss_act:1,
>>>>> +         is_default_hit_act:1;
>>>>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
>>>>> +} __packed;
>>>>> +
>>>>> +struct p4tc_table_entry_act_bpf_kern {
>>>>> +     struct rcu_head rcu;
>>>>> +     struct p4tc_table_entry_act_bpf act_bpf;
>>>>> +};
>>>>> +
>>>>>     struct tcf_p4act {
>>>>>         struct tc_action common;
>>>>>         /* Params IDR reference passed during runtime */
>>>>>         struct tcf_p4act_params __rcu *params;
>>>>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
>>>>>         u32 p_id;
>>>>>         u32 act_id;
>>>>>         struct list_head node;
>>>>> @@ -24,4 +40,39 @@ struct tcf_p4act {
>>>>>
>>>>>     #define to_p4act(a) ((struct tcf_p4act *)a)
>>>>>
>>>>> +static inline struct p4tc_table_entry_act_bpf *
>>>>> +p4tc_table_entry_act_bpf(struct tc_action *action)
>>>>> +{
>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>>>> +
>>>>> +     act_bpf = rcu_dereference(p4act->act_bpf);
>>>>> +
>>>>> +     return &act_bpf->act_bpf;
>>>>> +}
>>>>> +
>>>>> +static inline int
>>>>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
>>>>> +                                   u32 dflt_miss, u32 dflt_hit)
>>>>> +{
>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>>>> +
>>>>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
>>>>
>>>>
>>>> [ ... ]


>>>>> +static int
>>>>> +__bpf_p4tc_entry_create(struct net *net,
>>>>> +                     struct p4tc_table_entry_create_bpf_params *params,
>>>>> +                     void *key, const u32 key__sz,
>>>>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
>>>>> +{
>>>>> +     struct p4tc_table_entry_key *entry_key = key;
>>>>> +     struct p4tc_pipeline *pipeline;
>>>>> +     struct p4tc_table *table;
>>>>> +
>>>>> +     if (!params || !key)
>>>>> +             return -EINVAL;
>>>>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
>>>>> +     if (!pipeline)
>>>>> +             return -ENOENT;
>>>>> +
>>>>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
>>>>> +     if (!table)
>>>>> +             return -ENOENT;
>>>>> +
>>>>> +     if (entry_key->keysz != table->tbl_keysz)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
>>>>> +                                        params->profile_id);
>>>>
>>>> My understanding is this kfunc will allocate a "struct
>>>> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
>>>> never called and the bpf prog is unloaded, how the act_bpf object will be
>>>> cleaned up?
>>>>
>>>
>>> The TC code takes care of this. Unloading the bpf prog does not affect
>>> the deletion, it is the TC control side that will take care of it. If
>>> we delete the pipeline otoh then not just this entry but all entries
>>> will be flushed.
>>
>> It looks like the "struct p4tc_table_entry_act_bpf_kern" object is allocated by
>> the bpf prog through kfunc and will only be useful for the bpf prog but not
>> other parts of the kernel. However, if the bpf prog is unloaded, these bpf
>> specific objects will be left over in the kernel until the tc pipeline (where
>> the act_bpf_kern object resided) is gone.
>>
>> It is the expectation for bpf progs (not only tc/xdp bpf progs), regarding
>> resource cleanup, that these bpf objects will be gone after unloading the
>> bpf prog and unpinning its bpf map.
>>
> 
> The table (residing on the TC side) could be shared by multiple bpf
> programs. Entries are allocated on the TC side of the fence.


> IOW, the memory is not owned by the bpf prog but rather by pipeline.

The struct p4tc_table_entry_act_bpf_kern object is allocated by the
bpf_p4tc_entry_create() kfunc and only the bpf prog can use it, no?
afaict, these are bpf objects.

> We do have a "whodunnit" field, i.e we keep track of which entity
> added an entry and we are capable of deleting all entries when we
> detect a bpf program being deleted (this would be via deleting the tc
> filter). But my thinking is we should make that a policy decision as
> opposed to something which is default.

afaik, this policy decision or cleanup upon tc filter delete has not been done 
yet. I will leave it to you to figure out how to track what was allocated by a 
particular bpf prog on the TC side. It is not immediately clear to me and I 
probably won't have a good idea either.

Just to be clear, it is almost certain to be unacceptable to extend and make
changes on the bpf side in the future to handle specific resource
cleanup/tracking/sharing of the bpf objects allocated by these kfuncs. This 
problem has already been solved and works for different bpf program types, 
tc/cgroup/tracing...etc. Adding a refcnted bpf prog pointer alongside the 
act_bpf_kern object will be a non-starter.

I think multiple people have already commented that these kfuncs 
(create/update/delete...) resemble the existing bpf map. If these kfuncs are 
replaced with the bpf map ops, this bpf resource management has already been 
handled and will be consistent with other bpf program types.

I expect the act_bpf_kern object probably will grow in size over time also.
Considering this new p4 pipeline and table are residing on the TC side, I will
leave it up to others to decide if it is acceptable to have some unused bpf 
objects left attached there.


* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-05  7:40           ` Martin KaFai Lau
@ 2024-03-05 12:30             ` Jamal Hadi Salim
  2024-03-06  7:58               ` Martin KaFai Lau
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-05 12:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On Tue, Mar 5, 2024 at 2:40 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/3/24 9:20 AM, Jamal Hadi Salim wrote:
>
> >>>>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
> >>>>> +
> >>>>> +struct p4tc_table_entry_act_bpf {
> >>>>> +     u32 act_id;
> >>>>> +     u32 hit:1,
> >>>>> +         is_default_miss_act:1,
> >>>>> +         is_default_hit_act:1;
> >>>>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
> >>>>> +} __packed;
> >>>>> +
> >>>>> +struct p4tc_table_entry_act_bpf_kern {
> >>>>> +     struct rcu_head rcu;
> >>>>> +     struct p4tc_table_entry_act_bpf act_bpf;
> >>>>> +};
> >>>>> +
> >>>>>     struct tcf_p4act {
> >>>>>         struct tc_action common;
> >>>>>         /* Params IDR reference passed during runtime */
> >>>>>         struct tcf_p4act_params __rcu *params;
> >>>>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
> >>>>>         u32 p_id;
> >>>>>         u32 act_id;
> >>>>>         struct list_head node;
> >>>>> @@ -24,4 +40,39 @@ struct tcf_p4act {
> >>>>>
> >>>>>     #define to_p4act(a) ((struct tcf_p4act *)a)
> >>>>>
> >>>>> +static inline struct p4tc_table_entry_act_bpf *
> >>>>> +p4tc_table_entry_act_bpf(struct tc_action *action)
> >>>>> +{
> >>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
> >>>>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>>>> +
> >>>>> +     act_bpf = rcu_dereference(p4act->act_bpf);
> >>>>> +
> >>>>> +     return &act_bpf->act_bpf;
> >>>>> +}
> >>>>> +
> >>>>> +static inline int
> >>>>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
> >>>>> +                                   u32 dflt_miss, u32 dflt_hit)
> >>>>> +{
> >>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
> >>>>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>>>> +
> >>>>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
> >>>>
> >>>>
> >>>> [ ... ]
>
>
> >>>>> +static int
> >>>>> +__bpf_p4tc_entry_create(struct net *net,
> >>>>> +                     struct p4tc_table_entry_create_bpf_params *params,
> >>>>> +                     void *key, const u32 key__sz,
> >>>>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
> >>>>> +{
> >>>>> +     struct p4tc_table_entry_key *entry_key = key;
> >>>>> +     struct p4tc_pipeline *pipeline;
> >>>>> +     struct p4tc_table *table;
> >>>>> +
> >>>>> +     if (!params || !key)
> >>>>> +             return -EINVAL;
> >>>>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
> >>>>> +             return -EINVAL;
> >>>>> +
> >>>>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
> >>>>> +     if (!pipeline)
> >>>>> +             return -ENOENT;
> >>>>> +
> >>>>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
> >>>>> +     if (!table)
> >>>>> +             return -ENOENT;
> >>>>> +
> >>>>> +     if (entry_key->keysz != table->tbl_keysz)
> >>>>> +             return -EINVAL;
> >>>>> +
> >>>>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
> >>>>> +                                        params->profile_id);
> >>>>
> >>>> My understanding is this kfunc will allocate a "struct
> >>>> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
> >>>> never called and the bpf prog is unloaded, how the act_bpf object will be
> >>>> cleaned up?
> >>>>
> >>>
> >>> The TC code takes care of this. Unloading the bpf prog does not affect
> >>> the deletion, it is the TC control side that will take care of it. If
> >>> we delete the pipeline otoh then not just this entry but all entries
> >>> will be flushed.
> >>
> >> It looks like the "struct p4tc_table_entry_act_bpf_kern" object is allocated by
> >> the bpf prog through kfunc and will only be useful for the bpf prog but not
> >> other parts of the kernel. However, if the bpf prog is unloaded, these bpf
> >> specific objects will be left over in the kernel until the tc pipeline (where
> >> the act_bpf_kern object resided) is gone.
> >>
> >> It is the expectation for bpf progs (not only tc/xdp bpf progs), regarding
> >> resource cleanup, that these bpf objects will be gone after unloading the
> >> bpf prog and unpinning its bpf map.
> >>
> >
> > The table (residing on the TC side) could be shared by multiple bpf
> > programs. Entries are allocated on the TC side of the fence.
>
>
> > IOW, the memory is not owned by the bpf prog but rather by pipeline.
>
> The struct p4tc_table_entry_act_bpf_kern object is allocated by the
> bpf_p4tc_entry_create() kfunc and only the bpf prog can use it, no?
> afaict, these are bpf objects.
>

Bear with me because i am not sure i am following.
When we looked at conntrack as guidance we noticed they do things
slightly differently. They have an allocate kfunc and an insert
function. If you have alloc then you need a complementary release. The
existence of the release in conntrack, correct me if i am wrong, seems
to be based on the need to free the object if an insert fails. In our
case the insert does first allocate then inserts all in one operation.
If either fails it's not the concern of the bpf side to worry about
it. IOW, i see the ownership as belonging to the P4TC side  (it is
both allocated, updated and freed by that side). Likely i am missing
something..

> > We do have a "whodunnit" field, i.e we keep track of which entity
> > added an entry and we are capable of deleting all entries when we
> > detect a bpf program being deleted (this would be via deleting the tc
> > filter). But my thinking is we should make that a policy decision as
> > opposed to something which is default.
>
> afaik, this policy decision or cleanup upon tc filter delete has not been done
> yet. I will leave it to you to figure out how to track what was allocated by a
> particular bpf prog on the TC side. It is not immediately clear to me and I
> probably won't have a good idea either.
>

I am looking at the conntrack code and i dont see how they release
entries from the conntrack table when the bpf prog goes away.

> Just to be clear that it is almost certain to be unacceptable to extend and make
> changes on the bpf side in the future to handle specific resource
> cleanup/tracking/sharing of the bpf objects allocated by these kfuncs. This
> problem has already been solved and works for different bpf program types,
> tc/cgroup/tracing...etc. Adding a refcnted bpf prog pointer alongside the
> act_bpf_kern object will be a non-starter.
>
> I think multiple people have already commented that these kfuncs
> (create/update/delete...) resemble the existing bpf map. If these kfuncs are
> replaced with the bpf map ops, this bpf resource management has already been
> handled and will be consistent with other bpf program types.
>
> I expect the act_bpf_kern object probably will grow in size over time also.
> Considering this new p4 pipeline and table is residing on the TC side, I will
> leave it up to others to decide if it is acceptable to have some unused bpf
> objects left attached there.

There should be no dangling things at all.
Probably not a very good example, but this would be analogous to
pinning a map which is shared by many bpf progs. Deleting one or all
the bpf progs doesnt delete the contents of the bpf map, you have to
explicitly remove it. Deleting the pipeline will be equivalent to
deleting the map. IOW, resource cleanup is tied to the pipeline...

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-05 12:30             ` Jamal Hadi Salim
@ 2024-03-06  7:58               ` Martin KaFai Lau
  2024-03-06 20:22                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-06  7:58 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On 3/5/24 4:30 AM, Jamal Hadi Salim wrote:
> On Tue, Mar 5, 2024 at 2:40 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/3/24 9:20 AM, Jamal Hadi Salim wrote:
>>
>>>>>>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
>>>>>>> +
>>>>>>> +struct p4tc_table_entry_act_bpf {
>>>>>>> +     u32 act_id;
>>>>>>> +     u32 hit:1,
>>>>>>> +         is_default_miss_act:1,
>>>>>>> +         is_default_hit_act:1;
>>>>>>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
>>>>>>> +} __packed;
>>>>>>> +
>>>>>>> +struct p4tc_table_entry_act_bpf_kern {
>>>>>>> +     struct rcu_head rcu;
>>>>>>> +     struct p4tc_table_entry_act_bpf act_bpf;
>>>>>>> +};
>>>>>>> +
>>>>>>>      struct tcf_p4act {
>>>>>>>          struct tc_action common;
>>>>>>>          /* Params IDR reference passed during runtime */
>>>>>>>          struct tcf_p4act_params __rcu *params;
>>>>>>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
>>>>>>>          u32 p_id;
>>>>>>>          u32 act_id;
>>>>>>>          struct list_head node;
>>>>>>> @@ -24,4 +40,39 @@ struct tcf_p4act {
>>>>>>>
>>>>>>>      #define to_p4act(a) ((struct tcf_p4act *)a)
>>>>>>>
>>>>>>> +static inline struct p4tc_table_entry_act_bpf *
>>>>>>> +p4tc_table_entry_act_bpf(struct tc_action *action)
>>>>>>> +{
>>>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
>>>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>>>>>> +
>>>>>>> +     act_bpf = rcu_dereference(p4act->act_bpf);
>>>>>>> +
>>>>>>> +     return &act_bpf->act_bpf;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline int
>>>>>>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
>>>>>>> +                                   u32 dflt_miss, u32 dflt_hit)
>>>>>>> +{
>>>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
>>>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
>>>>>>> +
>>>>>>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
>>>>>>
>>>>>>
>>>>>> [ ... ]
>>
>>
>>>>>>> +static int
>>>>>>> +__bpf_p4tc_entry_create(struct net *net,
>>>>>>> +                     struct p4tc_table_entry_create_bpf_params *params,
>>>>>>> +                     void *key, const u32 key__sz,
>>>>>>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
>>>>>>> +{
>>>>>>> +     struct p4tc_table_entry_key *entry_key = key;
>>>>>>> +     struct p4tc_pipeline *pipeline;
>>>>>>> +     struct p4tc_table *table;
>>>>>>> +
>>>>>>> +     if (!params || !key)
>>>>>>> +             return -EINVAL;
>>>>>>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
>>>>>>> +             return -EINVAL;
>>>>>>> +
>>>>>>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
>>>>>>> +     if (!pipeline)
>>>>>>> +             return -ENOENT;
>>>>>>> +
>>>>>>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
>>>>>>> +     if (!table)
>>>>>>> +             return -ENOENT;
>>>>>>> +
>>>>>>> +     if (entry_key->keysz != table->tbl_keysz)
>>>>>>> +             return -EINVAL;
>>>>>>> +
>>>>>>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
>>>>>>> +                                        params->profile_id);
>>>>>>
>>>>>> My understanding is this kfunc will allocate a "struct
>>>>>> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
>>>>>> never called and the bpf prog is unloaded, how the act_bpf object will be
>>>>>> cleaned up?
>>>>>>
>>>>>
>>>>> The TC code takes care of this. Unloading the bpf prog does not affect
>>>>> the deletion, it is the TC control side that will take care of it. If
>>>>> we delete the pipeline otoh then not just this entry but all entries
>>>>> will be flushed.
>>>>
>>>> It looks like the "struct p4tc_table_entry_act_bpf_kern" object is allocated by
>>>> the bpf prog through kfunc and will only be useful for the bpf prog but not
>>>> other parts of the kernel. However, if the bpf prog is unloaded, these bpf
>>>> specific objects will be left over in the kernel until the tc pipeline (where
>>>> the act_bpf_kern object resided) is gone.
>>>>
>>>> It is the expectation on bpf prog (not only tc/xdp bpf prog) about resources
>>>> clean up that these bpf objects will be gone after unloading the bpf prog and
>>>> unpinning its bpf map.
>>>>
>>>
>>> The table (residing on the TC side) could be shared by multiple bpf
>>> programs. Entries are allocated on the TC side of the fence.
>>
>>
>>> IOW, the memory is not owned by the bpf prog but rather by pipeline.
>>
>> The struct p4tc_table_entry_act_(bpf_kern) object is allocated by
>> bpf_p4tc_entry_create() kfunc and only bpf prog can use it, no?
>> afaict, this is bpf objects.
>>
> 
> Bear with me because i am not sure i am following.
> When we looked at conntrack as guidance we noticed they do things
> slightly differently. They have an allocate kfunc and an insert
> function. If you have alloc then you need a complementary release. The
> existence of the release in conntrack, correct me if i am wrong, seems
> to be based on the need to free the object if an insert fails. In our
> case the insert does first allocate then inserts all in one operation.
> If either fails it's not the concern of the bpf side to worry about
> it. IOW, i see the ownership as belonging to the P4TC side  (it is
> both allocated, updated and freed by that side). Likely i am missing
> something..

The concern is not that the kfuncs may leak objects.

I think my question was, who can use the act_bpf_kern object when all tc bpf 
prog is unloaded? If no one can use it, it should as well be cleaned up when the 
bpf prog is unloaded.

or the kernel p4 pipeline can use the act_bpf_kern object even when there is no 
bpf prog loaded?


> 
>>> We do have a "whodunnit" field, i.e we keep track of which entity
>>> added an entry and we are capable of deleting all entries when we
>>> detect a bpf program being deleted (this would be via deleting the tc
>>> filter). But my thinking is we should make that a policy decision as
>>> opposed to something which is default.
>>
>> afaik, this policy decision or cleanup upon tc filter delete has not been done
>> yet. I will leave it to you to figure out how to track what was allocated by a
>> particular bpf prog on the TC side. It is not immediately clear to me and I
>> probably won't have a good idea either.
>>
> 
> I am looking at the conntrack code and i dont see how they release
> entries from the conntrack table when the bpf prog goes away.
> 
>> Just to be clear that it is almost certain to be unacceptable to extend and make
>> changes on the bpf side in the future to handle specific resource
>> cleanup/tracking/sharing of the bpf objects allocated by these kfuncs. This
>> problem has already been solved and works for different bpf program types,
>> tc/cgroup/tracing...etc. Adding a refcnted bpf prog pointer alongside the
>> act_bpf_kern object will be a non-starter.
>>
>> I think multiple people have already commented that these kfuncs
>> (create/update/delete...) resemble the existing bpf map. If these kfuncs are
>> replaced with the bpf map ops, this bpf resource management has already been
>> handled and will be consistent with other bpf program types.
>>
>> I expect the act_bpf_kern object probably will grow in size over time also.
>> Considering this new p4 pipeline and table is residing on the TC side, I will
>> leave it up to others to decide if it is acceptable to have some unused bpf
>> objects left attached there.
> 
> There should be no dangling things at all.
> Probably not a very good example, but this would be analogous to
> pinning a map which is shared by many bpf progs. Deleting one or all
> the bpf progs doesnt delete the contents of the bpf map, you have to
> explicitly remove it. Deleting the pipeline will be equivalent to
> deleting the map. IOW, resource cleanup is tied to the pipeline..

bpf is also used by many subsystems (e.g. tracing/cgroup/...). The bpf users 
have a common expectation on how bpf resources will be cleaned up when writing 
bpf for different subsystems, i.e. map/link/pinned-file. Thus, p4 pipeline is 
not the same as a pinned bpf map here. The p4-tc bpf user cannot depend on the 
common bpf ecosystem to cleanup all resources.

It is going back to how link/fd and the map ops discussion by others in the 
earlier revisions which we probably don't want to redo here. I think I have been 
making enough noise such that we don't have to discuss potential future changes 
about how to release this resources when the bpf prog is unloaded.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-06  7:58               ` Martin KaFai Lau
@ 2024-03-06 20:22                 ` Jamal Hadi Salim
  2024-03-06 22:21                   ` Martin KaFai Lau
  0 siblings, 1 reply; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-06 20:22 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On Wed, Mar 6, 2024 at 2:58 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/5/24 4:30 AM, Jamal Hadi Salim wrote:
> > On Tue, Mar 5, 2024 at 2:40 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 3/3/24 9:20 AM, Jamal Hadi Salim wrote:
> >>
> >>>>>>> +#define P4TC_MAX_PARAM_DATA_SIZE 124
> >>>>>>> +
> >>>>>>> +struct p4tc_table_entry_act_bpf {
> >>>>>>> +     u32 act_id;
> >>>>>>> +     u32 hit:1,
> >>>>>>> +         is_default_miss_act:1,
> >>>>>>> +         is_default_hit_act:1;
> >>>>>>> +     u8 params[P4TC_MAX_PARAM_DATA_SIZE];
> >>>>>>> +} __packed;
> >>>>>>> +
> >>>>>>> +struct p4tc_table_entry_act_bpf_kern {
> >>>>>>> +     struct rcu_head rcu;
> >>>>>>> +     struct p4tc_table_entry_act_bpf act_bpf;
> >>>>>>> +};
> >>>>>>> +
> >>>>>>>      struct tcf_p4act {
> >>>>>>>          struct tc_action common;
> >>>>>>>          /* Params IDR reference passed during runtime */
> >>>>>>>          struct tcf_p4act_params __rcu *params;
> >>>>>>> +     struct p4tc_table_entry_act_bpf_kern __rcu *act_bpf;
> >>>>>>>          u32 p_id;
> >>>>>>>          u32 act_id;
> >>>>>>>          struct list_head node;
> >>>>>>> @@ -24,4 +40,39 @@ struct tcf_p4act {
> >>>>>>>
> >>>>>>>      #define to_p4act(a) ((struct tcf_p4act *)a)
> >>>>>>>
> >>>>>>> +static inline struct p4tc_table_entry_act_bpf *
> >>>>>>> +p4tc_table_entry_act_bpf(struct tc_action *action)
> >>>>>>> +{
> >>>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf;
> >>>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>>>>>> +
> >>>>>>> +     act_bpf = rcu_dereference(p4act->act_bpf);
> >>>>>>> +
> >>>>>>> +     return &act_bpf->act_bpf;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static inline int
> >>>>>>> +p4tc_table_entry_act_bpf_change_flags(struct tc_action *action, u32 hit,
> >>>>>>> +                                   u32 dflt_miss, u32 dflt_hit)
> >>>>>>> +{
> >>>>>>> +     struct p4tc_table_entry_act_bpf_kern *act_bpf, *act_bpf_old;
> >>>>>>> +     struct tcf_p4act *p4act = to_p4act(action);
> >>>>>>> +
> >>>>>>> +     act_bpf = kzalloc(sizeof(*act_bpf), GFP_KERNEL);
> >>>>>>
> >>>>>>
> >>>>>> [ ... ]
> >>
> >>
> >>>>>>> +static int
> >>>>>>> +__bpf_p4tc_entry_create(struct net *net,
> >>>>>>> +                     struct p4tc_table_entry_create_bpf_params *params,
> >>>>>>> +                     void *key, const u32 key__sz,
> >>>>>>> +                     struct p4tc_table_entry_act_bpf *act_bpf)
> >>>>>>> +{
> >>>>>>> +     struct p4tc_table_entry_key *entry_key = key;
> >>>>>>> +     struct p4tc_pipeline *pipeline;
> >>>>>>> +     struct p4tc_table *table;
> >>>>>>> +
> >>>>>>> +     if (!params || !key)
> >>>>>>> +             return -EINVAL;
> >>>>>>> +     if (key__sz != P4TC_ENTRY_KEY_SZ_BYTES(entry_key->keysz))
> >>>>>>> +             return -EINVAL;
> >>>>>>> +
> >>>>>>> +     pipeline = p4tc_pipeline_find_byid(net, params->pipeid);
> >>>>>>> +     if (!pipeline)
> >>>>>>> +             return -ENOENT;
> >>>>>>> +
> >>>>>>> +     table = p4tc_tbl_cache_lookup(net, params->pipeid, params->tblid);
> >>>>>>> +     if (!table)
> >>>>>>> +             return -ENOENT;
> >>>>>>> +
> >>>>>>> +     if (entry_key->keysz != table->tbl_keysz)
> >>>>>>> +             return -EINVAL;
> >>>>>>> +
> >>>>>>> +     return p4tc_table_entry_create_bpf(pipeline, table, entry_key, act_bpf,
> >>>>>>> +                                        params->profile_id);
> >>>>>>
> >>>>>> My understanding is this kfunc will allocate a "struct
> >>>>>> p4tc_table_entry_act_bpf_kern" object. If the bpf_p4tc_entry_delete() kfunc is
> >>>>>> never called and the bpf prog is unloaded, how the act_bpf object will be
> >>>>>> cleaned up?
> >>>>>>
> >>>>>
> >>>>> The TC code takes care of this. Unloading the bpf prog does not affect
> >>>>> the deletion, it is the TC control side that will take care of it. If
> >>>>> we delete the pipeline otoh then not just this entry but all entries
> >>>>> will be flushed.
> >>>>
> >>>> It looks like the "struct p4tc_table_entry_act_bpf_kern" object is allocated by
> >>>> the bpf prog through kfunc and will only be useful for the bpf prog but not
> >>>> other parts of the kernel. However, if the bpf prog is unloaded, these bpf
> >>>> specific objects will be left over in the kernel until the tc pipeline (where
> >>>> the act_bpf_kern object resided) is gone.
> >>>>
> >>>> It is the expectation on bpf prog (not only tc/xdp bpf prog) about resources
> >>>> clean up that these bpf objects will be gone after unloading the bpf prog and
> >>>> unpinning its bpf map.
> >>>>
> >>>
> >>> The table (residing on the TC side) could be shared by multiple bpf
> >>> programs. Entries are allocated on the TC side of the fence.
> >>
> >>
> >>> IOW, the memory is not owned by the bpf prog but rather by pipeline.
> >>
> >> The struct p4tc_table_entry_act_(bpf_kern) object is allocated by
> >> bpf_p4tc_entry_create() kfunc and only bpf prog can use it, no?
> >> afaict, this is bpf objects.
> >>
> >
> > Bear with me because i am not sure i am following.
> > When we looked at conntrack as guidance we noticed they do things
> > slightly differently. They have an allocate kfunc and an insert
> > function. If you have alloc then you need a complementary release. The
> > existence of the release in conntrack, correct me if i am wrong, seems
> > to be based on the need to free the object if an insert fails. In our
> > case the insert does first allocate then inserts all in one operation.
> > If either fails it's not the concern of the bpf side to worry about
> > it. IOW, i see the ownership as belonging to the P4TC side  (it is
> > both allocated, updated and freed by that side). Likely i am missing
> > something..
>
> The concern is not that the kfuncs may leak objects.
>
> I think my question was, who can use the act_bpf_kern object when all tc bpf
> prog is unloaded? If no one can use it, it should as well be cleaned up when the
> bpf prog is unloaded.
>
> or the kernel p4 pipeline can use the act_bpf_kern object even when there is no
> bpf prog loaded?
>
>
> >
> >>> We do have a "whodunnit" field, i.e we keep track of which entity
> >>> added an entry and we are capable of deleting all entries when we
> >>> detect a bpf program being deleted (this would be via deleting the tc
> >>> filter). But my thinking is we should make that a policy decision as
> >>> opposed to something which is default.
> >>
> >> afaik, this policy decision or cleanup upon tc filter delete has not been done
> >> yet. I will leave it to you to figure out how to track what was allocated by a
> >> particular bpf prog on the TC side. It is not immediately clear to me and I
> >> probably won't have a good idea either.
> >>
> >
> > I am looking at the conntrack code and i dont see how they release
> > entries from the conntrack table when the bpf prog goes away.
> >
> >> Just to be clear that it is almost certain to be unacceptable to extend and make
> >> changes on the bpf side in the future to handle specific resource
> >> cleanup/tracking/sharing of the bpf objects allocated by these kfuncs. This
> >> problem has already been solved and works for different bpf program types,
> >> tc/cgroup/tracing...etc. Adding a refcnted bpf prog pointer alongside the
> >> act_bpf_kern object will be a non-starter.
> >>
> >> I think multiple people have already commented that these kfuncs
> >> (create/update/delete...) resemble the existing bpf map. If these kfuncs are
> >> replaced with the bpf map ops, this bpf resource management has already been
> >> handled and will be consistent with other bpf program types.
> >>
> >> I expect the act_bpf_kern object probably will grow in size over time also.
> >> Considering this new p4 pipeline and table is residing on the TC side, I will
> >> leave it up to others to decide if it is acceptable to have some unused bpf
> >> objects left attached there.
> >
> > There should be no dangling things at all.
> > Probably not a very good example, but this would be analogous to
> > pinning a map which is shared by many bpf progs. Deleting one or all
> > the bpf progs doesnt delete the contents of the bpf map, you have to
> > explicitly remove it. Deleting the pipeline will be equivalent to
> > deleting the map. IOW, resource cleanup is tied to the pipeline..
>
> bpf is also used by many subsystems (e.g. tracing/cgroup/...). The bpf users
> have a common expectation on how bpf resources will be cleaned up when writing
> bpf for different subsystems, i.e. map/link/pinned-file. Thus, p4 pipeline is
> not the same as a pinned bpf map here. The p4-tc bpf user cannot depend on the
> common bpf ecosystem to cleanup all resources.
>

I am not trying to be difficult. Sincerely trying to understand and
very puzzled - and it is not that we cant do what you are suggesting
just trying to understand the reasoning to make sure it fits our
requirements.

I asked earlier about conntrack (where we took the inspiration from):
How is what we are doing different from conntrack? If you can help me
understand that i am more than willing to make the change.
Conntrack entries can be added via the kfunc (same for us). Conntrack
entries can also be added from the control plane and can be found by
ebpf lookups(same for us). They can be deleted by the control plane,
timers, entry evictions to make space for new entries, etc (same for
us). Not sure if they can be deleted by ebpf side (we can). Perusing
the conntrack code, I could not find anything  that indicated that
entries created from ebpf are deleted when the ebpf program goes away.

To re-emphasize: Maybe there's something subtle i am missing that we
are not doing that conntrack is doing?
Conntrack does one small thing we dont: It allocs and returns to ebpf
the memory for insertion. I dont see that as particularly useful for
our case (and more importantly how that results in the entries being
deleted when the ebpf prog goes away)

cheers,
jamal

> It is going back to how link/fd and the map ops discussion by others in the
> earlier revisions which we probably don't want to redo here. I think I have been
> making enough noise such that we don't have to discuss potential future changes
> about how to release this resources when the bpf prog is unloaded.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-06 20:22                 ` Jamal Hadi Salim
@ 2024-03-06 22:21                   ` Martin KaFai Lau
  2024-03-06 23:19                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 71+ messages in thread
From: Martin KaFai Lau @ 2024-03-06 22:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On 3/6/24 12:22 PM, Jamal Hadi Salim wrote:
>> I think my question was, who can use the act_bpf_kern object when all tc bpf
>> prog is unloaded? If no one can use it, it should as well be cleaned up when the
>> bpf prog is unloaded.
>>
>> or the kernel p4 pipeline can use the act_bpf_kern object even when there is no
>> bpf prog loaded?

[ ... ]

>>> I am looking at the conntrack code and i dont see how they release
>>> entries from the conntrack table when the bpf prog goes away.

[ ... ]

> I asked earlier about conntrack (where we took the inspiration from):
> How is what we are doing different from conntrack? If you can help me
> understand that i am more than willing to make the change.
> Conntrack entries can be added via the kfunc (same for us). Conntrack
> entries can also be added from the control plane and can be found by
> ebpf lookups(same for us). They can be deleted by the control plane,
> timers, entry evictions to make space for new entries, etc (same for
> us). Not sure if they can be deleted by ebpf side (we can). Perusing
> the conntrack code, I could not find anything  that indicated that
> entries created from ebpf are deleted when the ebpf program goes away.
> 
> To re-emphasize: Maybe there's something subtle i am missing that we
> are not doing that conntrack is doing?
> Conntrack does one small thing we dont: It allocs and returns to ebpf
> the memory for insertion. I dont see that as particularly useful for
> our case (and more importantly how that results in the entries being
> deleted when the ebpf prog goes away)

afaik, the conntrack kfunc inserts "struct nf_conn" that can also be used by 
other kernel parts, so it is reasonable to go through the kernel existing 
eviction logic. It is why my earlier question on "is the act_bpf_kern object 
only useful for the bpf prog alone but not other kernel parts". From reading 
patch 14, it seems to be only usable by bpf prog. When all bpf program is 
unloaded, who will still read it and do something useful? If I mis-understood 
it, this will be useful to capture in the commit message to explain how it could 
be used by other kernel parts without bpf prog running.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs
  2024-03-06 22:21                   ` Martin KaFai Lau
@ 2024-03-06 23:19                     ` Jamal Hadi Salim
  0 siblings, 0 replies; 71+ messages in thread
From: Jamal Hadi Salim @ 2024-03-06 23:19 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: deb.chatterjee, anjali.singhai, namrata.limaye, tom, mleitner,
	Mahesh.Shirshyad, Vipin.Jain, tomasz.osinski, jiri,
	xiyou.wangcong, davem, edumazet, kuba, pabeni, vladbu, horms,
	khalidm, toke, daniel, victor, pctammela, bpf, netdev

On Wed, Mar 6, 2024 at 5:21 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 3/6/24 12:22 PM, Jamal Hadi Salim wrote:
> >> I think my question was, who can use the act_bpf_kern object when all tc bpf
> >> prog is unloaded? If no one can use it, it should as well be cleaned up when the
> >> bpf prog is unloaded.
> >>
> >> or the kernel p4 pipeline can use the act_bpf_kern object even when there is no
> >> bpf prog loaded?
>
> [ ... ]
>
> >>> I am looking at the conntrack code and i dont see how they release
> >>> entries from the conntrack table when the bpf prog goes away.
>
> [ ... ]
>
> > I asked earlier about conntrack (where we took the inspiration from):
> > How is what we are doing different from conntrack? If you can help me
> > understand that i am more than willing to make the change.
> > Conntrack entries can be added via the kfunc (same for us). Conntrack
> > entries can also be added from the control plane and can be found by
> > ebpf lookups(same for us). They can be deleted by the control plane,
> > timers, entry evictions to make space for new entries, etc (same for
> > us). Not sure if they can be deleted by ebpf side (we can). Perusing
> > the conntrack code, I could not find anything  that indicated that
> > entries created from ebpf are deleted when the ebpf program goes away.
> >
> > To re-emphasize: Maybe there's something subtle i am missing that we
> > are not doing that conntrack is doing?
> > Conntrack does one small thing we dont: It allocs and returns to ebpf
> > the memory for insertion. I dont see that as particularly useful for
> > our case (and more importantly how that results in the entries being
> > deleted when the ebpf prog goes away)
>
> afaik, the conntrack kfunc inserts "struct nf_conn" that can also be used by
> other kernel parts, so it is reasonable to go through the kernel existing
> eviction logic. It is why my earlier question on "is the act_bpf_kern object
> only useful for the bpf prog alone but not other kernel parts". From reading
> patch 14, it seems to be only usable by bpf prog. When all bpf program is
> unloaded, who will still read it and do something useful? If I mis-understood
> it, this will be useful to capture in the commit message to explain how it could
> be used by other kernel parts without bpf prog running.

Ok, I think i may have got the issue. Sigh. I didnt do a good job
explaining p4tc_table_entry_act_bpf_kern which has been the crux of
our back and forth. Sorry I know you said this several times and i was
busy describing things around it instead.
A number of these p4tc_table_entry_act_bpf_kern structures are
preallocated (to match the P4 architecture; patch #9 describes some of
the subtleties involved) by the p4tc control plane and put in a kernel
pool. Their purpose is to hold the action parameters that are returned
to ebpf when there is a successful table lookup. When the table entry
is deleted the act_bpf_kern is recycled to the pool to be reused for
the next table entry. The only time the pool memory is released is
when the pipeline is deleted. So it is not allocated via the kfunc at
all.

I am not sure if that helps, if it does and you feel it should go in
the commit message we can do that. If not, please a little more
patience with me..

cheers,
jamal

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2024-03-06 23:20 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-25 16:54 [PATCH net-next v12 00/15] Introducing P4TC (series 1) Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 01/15] net: sched: act_api: Introduce P4 actions list Jamal Hadi Salim
2024-02-29 15:05   ` Paolo Abeni
2024-02-29 18:21     ` Jamal Hadi Salim
2024-03-01  7:30       ` Paolo Abeni
2024-03-01 12:39         ` Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 02/15] net/sched: act_api: increase action kind string length Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 03/15] net/sched: act_api: Update tc_action_ops to account for P4 actions Jamal Hadi Salim
2024-02-29 16:19   ` Paolo Abeni
2024-02-29 18:30     ` Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 04/15] net/sched: act_api: add struct p4tc_action_ops as a parameter to lookup callback Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 05/15] net: sched: act_api: Add support for preallocated P4 action instances Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 06/15] p4tc: add P4 data types Jamal Hadi Salim
2024-02-29 15:09   ` Paolo Abeni
2024-02-29 18:31     ` Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 07/15] p4tc: add template API Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 08/15] p4tc: add template pipeline create, get, update, delete Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 09/15] p4tc: add template action create, update, delete, get, flush and dump Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 10/15] p4tc: add runtime action support Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 11/15] p4tc: add template table create, update, delete, get, flush and dump Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 12/15] p4tc: add runtime table entry create and update Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 13/15] p4tc: add runtime table entry get, delete, flush and dump Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 14/15] p4tc: add set of P4TC table kfuncs Jamal Hadi Salim
2024-03-01  6:53   ` Martin KaFai Lau
2024-03-01 12:31     ` Jamal Hadi Salim
2024-03-03  1:32       ` Martin KaFai Lau
2024-03-03 17:20         ` Jamal Hadi Salim
2024-03-05  7:40           ` Martin KaFai Lau
2024-03-05 12:30             ` Jamal Hadi Salim
2024-03-06  7:58               ` Martin KaFai Lau
2024-03-06 20:22                 ` Jamal Hadi Salim
2024-03-06 22:21                   ` Martin KaFai Lau
2024-03-06 23:19                     ` Jamal Hadi Salim
2024-02-25 16:54 ` [PATCH net-next v12 15/15] p4tc: add P4 classifier Jamal Hadi Salim
2024-02-28 17:11 ` [PATCH net-next v12 00/15] Introducing P4TC (series 1) John Fastabend
2024-02-28 18:23   ` Jamal Hadi Salim
2024-02-28 21:13     ` John Fastabend
2024-03-01  7:02   ` Martin KaFai Lau
2024-03-01 12:36     ` Jamal Hadi Salim
2024-02-29 17:13 ` Paolo Abeni
2024-02-29 18:49   ` Jamal Hadi Salim
2024-02-29 20:52     ` John Fastabend
2024-02-29 21:49   ` Singhai, Anjali
2024-02-29 22:33     ` John Fastabend
2024-02-29 22:48       ` Jamal Hadi Salim
     [not found]         ` <CAOuuhY8qbsYCjdUYUZv8J3jz8HGXmtxLmTDP6LKgN5uRVZwMnQ@mail.gmail.com>
2024-03-01 17:00           ` Jakub Kicinski
2024-03-01 17:39             ` Jamal Hadi Salim
2024-03-02  1:32               ` Jakub Kicinski
2024-03-02  2:20                 ` Tom Herbert
2024-03-03  3:15                   ` Jakub Kicinski
2024-03-03 16:31                     ` Tom Herbert
2024-03-04 20:07                       ` Jakub Kicinski
2024-03-04 20:58                         ` eBPF to implement core functionility WAS " Tom Herbert
2024-03-04 21:19                       ` Stanislav Fomichev
2024-03-04 22:01                         ` Tom Herbert
2024-03-04 23:24                           ` Stanislav Fomichev
2024-03-04 23:50                             ` Tom Herbert
2024-03-02  2:59                 ` Hardware Offload discussion WAS(Re: " Jamal Hadi Salim
2024-03-02 14:36                   ` Jamal Hadi Salim
2024-03-03  3:27                     ` Jakub Kicinski
2024-03-03 17:00                       ` Jamal Hadi Salim
2024-03-03 18:10                         ` Tom Herbert
2024-03-03 19:04                           ` Jamal Hadi Salim
2024-03-04 20:18                             ` Jakub Kicinski
2024-03-04 21:02                               ` Jamal Hadi Salim
2024-03-04 21:23                             ` Stanislav Fomichev
2024-03-04 21:44                               ` Jamal Hadi Salim
2024-03-04 22:23                                 ` Stanislav Fomichev
2024-03-04 22:59                                   ` Jamal Hadi Salim
2024-03-04 23:14                                     ` Stanislav Fomichev
2024-03-01 18:53   ` Chris Sommers