bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>
Cc: BPF-dev-list <bpf@vger.kernel.org>,
	"Alexander Lobakin" <alexandr.lobakin@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Magnus Karlsson" <magnus.karlsson@gmail.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Björn Töpel" <bjorn@kernel.org>,
	"Saeed Mahameed" <saeed@kernel.org>,
	"kurt@linutronix.de" <kurt@linutronix.de>,
	"Raczynski, Piotr" <piotr.raczynski@intel.com>,
	"Zhang, Jessica" <jessica.zhang@intel.com>,
	"Maloor, Kishen" <kishen.maloor@intel.com>,
	"Gomes, Vinicius" <vinicius.gomes@intel.com>,
	"Brandeburg, Jesse" <jesse.brandeburg@intel.com>,
	"Swiatkowski, Michal" <michal.swiatkowski@intel.com>,
	"Plantykow, Marta A" <marta.a.plantykow@intel.com>,
	"Desouza, Ederson" <ederson.desouza@intel.com>,
	"Song, Yoong Siang" <yoong.siang.song@intel.com>,
	"Czapnik, Lukasz" <lukasz.czapnik@intel.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Joseph, Jithu" <jithu.joseph@intel.com>,
	"William Tu" <u9012063@gmail.com>,
	"Ong Boon Leong" <boon.leong.ong@intel.com>
Subject: Re: XDP-hints: Howto support multiple BTF types per packet basis?
Date: Wed, 26 May 2021 13:31:26 -0700	[thread overview]
Message-ID: <60aeb01ebcd10_fe49208b8@john-XPS-13-9370.notmuch> (raw)
In-Reply-To: <CAEf4BzYXUDyQaBjZmb_Q5-z3jw1-Uvdgxm+cfcQjSwb9oRoXnQ@mail.gmail.com>

Andrii Nakryiko wrote:
> On Wed, May 26, 2021 at 3:59 AM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > Hi All,
> >
> > I see a need for a driver to use different XDP metadata layout on a per
> > packet basis. E.g. PTP packets contains a hardware timestamp. E.g. VLAN
> > offloading and associated metadata as only relevant for packets using
> > VLANs. (Reserving room for every possible HW-hint is against the idea
> > of BTF).
> >
> > The question is how to support multiple BTF types on per packet basis?
> > (I need input from BTF experts, to tell me if I'm going in the wrong
> > direction with below ideas).
> 
> I'm trying to follow all three threads, but still, can someone dumb it

Thanks ;)

> down for me and use few very specific examples to show how all this is
> supposed to work end-to-end. I.e., how the C definition for those
> custom BTF layouts might look like and how they are used in BPF
> programs, etc. I'm struggling to put all the pieces together, even
> ignoring all the netdev-specific configuration questions.

Best to start with the simplest possible usable thing and get more
complex over time.

For a C definition I would expect drivers to do something like this,

 struct mynic_rx_descriptor {
	__u64 len;
        __u64 head;
	__u64 tail;
	__u64 foobar;
 }

 struct mynic_metadata {
	__u64 timestamp;
	__u64 hash;
	__u64 pkt_type;
	struct mynic_rx_descriptor *ptr_to_rx;
	/* other things */
 }

It doesn't really matter how the driver folks generate their metadata
though. They might use some non-C thing that is more natural for
writing parser/action/tcam codes.

Anyways given some C block like above we generate BTF from above
using normal method, quick hack just `pahole -J` the thing. Now we
have a BTF file.

Next up write some XDP program to do something with it,

 void myxdp_prog(struct xdp_md *ctx) {
	struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta;	

	// now I can get data using normal CO-RE
	// I usually have this _(&) to put CO-RE attributes in I
        // believe that is standard? Or use the other macros
	__u64 pkt_type = _(&m->pkt_type)

        // we can even walk into structs if we have probe read
        // around.
        struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx)

        // now do whatever I like with above metadata
 }

Run above program through normal CO-RE pass and as long as it has
access to the BTF from above it will work. I have some logic
sitting around to stitch two BTF blocks together but we have
that now done properly for linking.

probe_read from XDP should be added regardless of above. I've
found it super handy in skmsg programs to dig out kernel info
inline. With probe_read we can also start to walk net_device
struct for more detailed info as needed. Or into sock structs
for process level conntrack (other thread). Even without
probe_read above would be useful but fields would need to fit
into the metadata where we know we can read/write data.

Having drivers export their BTF over a /sys/fs/ interface
so that BTF can change with fimware/parser updates is possible
as well, but I would want to see above working in real world
before committing to a /sys/fs interface. Anyways the
interface is just a convienence.

> 
> As for BTF on a per-packet basis. This means that BTF itself is not
> known at the BPF program verification time, so there will be some sort
> of if/else if/else conditions to handle all recognized BTF IDs? Is
> that right? Fake but specific code would help (at least me) to
> actually join the discussion. Thanks.

I don't think we actually want per-packet data that sounds a bit
clumsy for me. Lets use a union and define it so that we have a
single BTF.

 struct mynic_metadata {
  __u64 pkt_type
  union {
      struct ipv6_meta meta; 
      struct ipv4_meta meta;
      struct arp_meta meta;
  }
 };

Then program has to swivel on pkt_type but that is most natural
C thing to do IMO.

Honestly we have about 90% of the necessary bits to do this now.
Typed that up a bit fast hope its legible. Got a lot going on today.

Andrii, make sense?

Thanks,
John
> 
> >
> > Let me describe a possible/proposed packet flow (feel free to disagree):
> >
> >  When driver RX e.g. a PTP packet it knows HW is configured for PTP-TS and
> >  when it sees a TS is available, then it chooses a code path that use the
> >  BTF layout that contains RX-TS. To communicate what BTF-type the
> >  XDP-metadata contains, it simply store the BTF-ID in xdp_buff->btf_id.
> >
> >  When redirecting the xdp_buff is converted to xdp_frame, and also contains
> >  the btf_id member. When converting xdp_frame to SKB, then netcore-code
> >  checks if this BTF-ID have been registered, if so there is a (callback or
> >  BPF-hook) registered to handle this BTF-type that transfer the fields from
> >  XDP-metadata area into SKB fields.
> >
> >  The XDP-prog also have access to this ctx->btf_id and can multiplex on
> >  this in the BPF-code itself. Or use other methods like parsing PTP packet
> >  and extract TS as expected BTF offset in XDP metadata (perhaps add a
> >  sanity check if metadata-size match).
> >
> >
> > I talked to AF_XDP people (Magnus, Bjørn and William) about this idea,
> > and they pointed out that AF_XDP also need to know what BTF-layout is
> > used. As Magnus wrote in other thread; there is only 32-bit left in
> > AF_XDP descriptor option. We could store the BTF-ID in this field, but
> > it would block for other use-cases. Bjørn came up with the idea of
> > storing the BTF-ID in the BTF-layout itself, but as the last-member (to
> > have fixed offset to check in userspace AF_XDP program). Then we only
> > need to use a single bit in AF_XDP descriptor option to say
> > XDP-metadata is BTF described.
> >
> > In the AF_XDP userspace program, the programmers can have a similar
> > callback system per known BTF-ID. This way they can compile efficient
> > code per ID via requesting the BTF layout from the kernel. (Hint:
> > `bpftool btf dump id 42 format c`).
> >
> > Please let me know if this it the right or wrong direction?
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >



  parent reply	other threads:[~2021-05-26 20:31 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-26 10:58 XDP-hints: Howto support multiple BTF types per packet basis? Jesper Dangaard Brouer
2021-05-26 19:12 ` Andrii Nakryiko
2021-05-26 20:20   ` Jesper Dangaard Brouer
2021-05-26 20:37     ` John Fastabend
2021-05-26 22:39     ` Andrii Nakryiko
2021-05-28 11:16       ` Jesper Dangaard Brouer
2021-05-30  3:24         ` Andrii Nakryiko
2021-05-26 20:31   ` John Fastabend [this message]
2021-05-26 22:54     ` Andrii Nakryiko
2021-05-27  0:44       ` John Fastabend
2021-05-27 17:44         ` Andrii Nakryiko
2021-05-28  5:48           ` John Fastabend
2021-05-28  9:16             ` Toke Høiland-Jørgensen
2021-05-28 10:38               ` Alexander Lobakin
2021-05-28 14:35               ` John Fastabend
2021-05-28 15:33                 ` Toke Høiland-Jørgensen
2021-05-28 16:02                 ` Jesper Dangaard Brouer
2021-05-28 17:29                   ` John Fastabend
2021-05-30  3:27                     ` Andrii Nakryiko
2021-05-31 11:03                     ` Toke Høiland-Jørgensen
2021-05-31 13:17                       ` Jesper Dangaard Brouer
2021-06-02  0:22                       ` John Fastabend
2021-06-02 16:18                         ` Jakub Kicinski
2021-06-22  7:44                           ` Michal Swiatkowski
2021-06-22 11:53                             ` Toke Høiland-Jørgensen
2021-06-23  8:32                               ` Michal Swiatkowski
2021-06-24 12:23                                 ` Toke Høiland-Jørgensen
2021-06-24 13:07                                   ` Magnus Karlsson
2021-06-24 14:58                                     ` Alexei Starovoitov
2021-06-24 15:11                                   ` Zvi Effron
2021-06-24 16:04                                     ` Toke Høiland-Jørgensen
2021-06-24 16:32                                       ` Zvi Effron
2021-06-24 16:45                                       ` Jesper Dangaard Brouer
2021-07-08  8:32                                   ` Michal Swiatkowski
2021-07-09 10:57                                     ` Toke Høiland-Jørgensen
2021-09-02  2:49                                       ` Michal Swiatkowski
2021-09-02  9:17                                         ` Jesper Dangaard Brouer
2021-09-07  6:27                                           ` Michal Swiatkowski
2021-09-08 13:28                                             ` Jesper Dangaard Brouer
2021-09-09 18:19                                               ` Andrii Nakryiko
2021-09-10 11:16                                                 ` Jesper Dangaard Brouer
2021-07-28  9:54 ` XDP-hints: how to inform driver about hints Michal Swiatkowski
2021-07-28 18:40   ` John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=60aeb01ebcd10_fe49208b8@john-XPS-13-9370.notmuch \
    --to=john.fastabend@gmail.com \
    --cc=alexandr.lobakin@intel.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=bjorn@kernel.org \
    --cc=boon.leong.ong@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=dsahern@kernel.org \
    --cc=ederson.desouza@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=jessica.zhang@intel.com \
    --cc=jithu.joseph@intel.com \
    --cc=kishen.maloor@intel.com \
    --cc=kurt@linutronix.de \
    --cc=lukasz.czapnik@intel.com \
    --cc=magnus.karlsson@gmail.com \
    --cc=magnus.karlsson@intel.com \
    --cc=marta.a.plantykow@intel.com \
    --cc=michal.swiatkowski@intel.com \
    --cc=piotr.raczynski@intel.com \
    --cc=saeed@kernel.org \
    --cc=u9012063@gmail.com \
    --cc=vinicius.gomes@intel.com \
    --cc=yoong.siang.song@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).