Re: [RFC PATCH] bpf: introduce new bpf map type BPF_MAP_TYPE_WILDCARD

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sdf@google.com
To: Anton Protopopov <aspsk@isovalent.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	Martynas Pumputis <m@lambda.lt>,
	Nikolay Aleksandrov <razor@blackwall.org>,
	Eric Torng <torng@msu.edu>
Subject: Re: [RFC PATCH] bpf: introduce new bpf map type BPF_MAP_TYPE_WILDCARD
Date: Wed, 7 Sep 2022 16:09:17 -0700	[thread overview]
Message-ID: <YxkknQJC1vWmU/o9@google.com> (raw)
In-Reply-To: <20220907080140.290413-1-aspsk@isovalent.com>

[sorry for the noise, got a bounce from the list, resend with the
message trimmed]

On 09/07, Anton Protopopov wrote:
> Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to implement  
> generic
> online packet classification. Here "online" stands for "fast lookups and  
> fast
> updates", and "generic" means that a user can create maps with custom  
> lookup
> schemes—different numbers of fields and different interpretation of  
> individual
> fields (prefix bitmasks, ranges, and direct matches).

> In particular, in Cilium we have several use cases for such a map:

>    * XDP Prefilter is a simple XDP-based DDoS mitigation system provided by
>      Cilium. At the moment it only supports filtering by source CIDRs. It  
> would
>      benefit from using this new map, as it allows to utilize wildcard  
> rules
>      without big penalty comparing to one hash or LPM lookup utilized now.

>    * XDP Packet Recorder (see  
> https://lpc.events/event/11/contributions/953/)

>    * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are  
> considered
>      to be a stable feature, and this map allows to implement this easily  
> (and
>      also to provide more sophisticated filtering for Cilium Network  
> Policies)

> Keys for wildcard maps are defined using the struct wildcard_key  
> structure.
> Besides the type field, it contains the data field of an arbitrary size.  
> To
> educate map about what's contained inside the data field, two additional
> structures are used. The first one, struct wildcard_desc, also of  
> arbitrary
> size, tells how many fields are contained inside data, and the struct
> wildcard_rule_desc structure defines how individual fields look like.

> Fields (rules) can be of three types:  
> BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}.

[..]

> The PREFIX rule means that inside data we have a binary value and a binary
> (prefix) mask:

>                 size             u32
>          <----------------> <----------->
>     ... |    rule value    |   prefix   | ...

> Here rule value is a binary value, e.g., 123.324.128.0, and prefix is a  
> u32 bit
> variable; we only use lower 8 bits of it. We allow 8, 16, 32, 64, and 128  
> bit
> values for PREFIX rules.

I haven't looked at the code, so pardon stupid questions. This sounds
like optimized LPM-trie?

If not, what's the difference?

If yes, can this be special cased/optimized in the existing LPM-trie
optimization? I think we've tried in the past, mostly gave up and
"fixed" by caching the state in the socket local storage.

Also, fixed 8/16/32/64 prefixes seems like a big limitation? At least if
I were to store ipv4 from the classless (cidr) world..

> The RANGE rule is determined by two binary values: minimum and maximum,  
> treated
> as unsigned integers of appropriate size:

>                 size               size
>          <----------------> <---------------->
>     ... |  min rule value  |  max rule value  | ...

> We only allow the 8, 16, 32, and 64-bit for RANGE rules.

That seems useful. I was thinking about similar 'rangemap' where
we can effectively store and lookup [a,b] ranges. Might be useful
in firewalls for storing lists of ports efficiently. So why not
a separate map instead? Are you able to share a lot of code with
the general matching map?

> The MATCH rule is determined by one binary value, and is basically the  
> same as
> (X,sizeof(X)*8) PREFIX rule, but can be processed a bit faster:

>                 size
>          <---------------->
>     ... |    rule value    | ...

> To speed up processing all the rules, including the prefix field, should  
> be
> stored in host byte order, and all elements in network byte order. 16-byte
> fields are stored as {lo,hi}—lower eight bytes, then higher eight bytes.

> For elements only values are stored.

Can these optimization be auto-magically derived whenever PREFIX map
with the right values is created? Why put this decision on the users?

> To simplify definition of key structures, the  
> BPF_WILDCARD_DESC_{1,2,3,4,5}
> macros should be used. For example, one can define an IPv4 4-tuple keys as
> follows:

>     BPF_WILDCARD_DESC_4(
>          capture4_wcard,
>          BPF_WILDCARD_RULE_PREFIX, __u32, saddr,
>          BPF_WILDCARD_RULE_PREFIX, __u32, daddr,
>          BPF_WILDCARD_RULE_RANGE, __u16, sport,
>          BPF_WILDCARD_RULE_RANGE, __u16, dport
>     );

> This macro will define the following structure:

>     struct capture4_wcard_key {
>          __u32 type;
>          __u32 priority;
>          union {
>              struct {
>                      __u32 saddr;
>                      __u32 saddr_prefix;
>                      __u32 daddr;
>                      __u32 daddr_prefix;
>                      __u16 sport_min;
>                      __u16 sport_max;
>                      __u16 dport_min;
>                      __u16 dport_max;
>              } __packed rule;
>              struct {
>                      __u32 saddr;
>                      __u32 daddr;
>                      __u16 sport;
>                      __u16 dport;
>              } __packed;
>          };
>     } __packed;

> Here type field should contain either BPF_WILDCARD_KEY_RULE or
> BPF_WILDCARD_KEY_ELEM so that kernel can differentiate between rules and
> elements. The rule structure is used to define (and lookup) rules, the  
> unnamed
> structure can be used to specify elements when matching them with rules.

> In order to simplify definition of a corresponding struct wildcard_desc,  
> the
> BPF_WILDCARD_DESC_* macros will create yet another structure:

>     struct capture4_wcard_desc {
>          __uint(n_rules, 4);
>          struct {
>                  __uint(type, BPF_WILDCARD_RULE_PREFIX);
>                  __uint(size, sizeof(__u32));
>          } saddr;
>          struct {
>                  __uint(type, BPF_WILDCARD_RULE_PREFIX);
>                  __uint(size, sizeof(__u32));
>          } daddr;
>          struct {
>                  __uint(type, BPF_WILDCARD_RULE_RANGE);
>                  __uint(size, sizeof(__u16));
>          } sport;
>          struct {
>                  __uint(type, BPF_WILDCARD_RULE_RANGE);
>                  __uint(size, sizeof(__u16));
>          } dport;
>     };

> This structure can be used in a (BTF) map definition as follows:

>      __type(wildcard_desc, struct capture4_wcard_desc);

> Then libbpf will create a corresponding struct wildcard_desc and pass it  
> to
> kernel in bpf_attr using new map_extra_data/map_extra_data_size fields.

[..]

> The map implementation allows users to specify which algorithm to use to  
> store
> rules and lookup packets. Currently, three algorithms are supported:

>    * Brute Force (suitable for map sizes of about 32 or below elements)

>    * Tuple Merge (a variant of the Tuple Merge algorithm described in the
>      "TupleMerge: Fast Software Packet Processing for Online Packet
>      Classification" white paper, see  
> https://nonsns.github.io/paper/rossi19ton.pdf.
>      The Tuple Merge algorithm is not protected by any patents.)

>    * Static Tuple Merge (a variant of Tuple Merge where a set of lookup  
> tables
>      is directly provided by a user)

As a user that has no clue how this map works, how do I decide which
algorithm to use? What I like with the current maps is that they don't
leak too much of their inner state. These controls seems a bit low
level?

next prev parent reply	other threads:[~2022-09-07 23:09 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-07  8:01 [RFC PATCH] bpf: introduce new bpf map type BPF_MAP_TYPE_WILDCARD Anton Protopopov
2022-09-07 23:09 ` sdf [this message]
2022-09-08  8:48   ` Anton Protopopov
2022-09-08 22:37     ` sdf
2022-09-09  9:37       ` Anton Protopopov
2022-09-09 18:55         ` sdf
2022-09-12  8:21           ` Anton Protopopov
2022-09-08  1:22 ` Daniel Xu
2022-09-08  8:03   ` Anton Protopopov
2022-09-20  7:02 ` [bpf] 692616731b: kernel-selftests.bpf.make_fail kernel test robot
2022-09-20  7:02   ` kernel test robot
  -- strict thread matches above, loose matches on Subject: below --
2022-09-08  7:30 [RFC PATCH] bpf: introduce new bpf map type BPF_MAP_TYPE_WILDCARD kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YxkknQJC1vWmU/o9@google.com \
    --to=sdf@google.com \
    --cc=andrii@kernel.org \
    --cc=aspsk@isovalent.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=john.fastabend@gmail.com \
    --cc=m@lambda.lt \
    --cc=martin.lau@linux.dev \
    --cc=razor@blackwall.org \
    --cc=torng@msu.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.