Re: [PATCHv4 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Hangbin Liu <liuhangbin@gmail.com>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
	"Jiri Benc" <jbenc@redhat.com>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Eelco Chaudron" <echaudro@redhat.com>,
	ast@kernel.org, "Daniel Borkmann" <daniel@iogearbox.net>,
	"Lorenzo Bianconi" <lorenzo.bianconi@redhat.com>,
	"David Ahern" <dsahern@gmail.com>,
	"Andrii Nakryiko" <andrii.nakryiko@gmail.com>,
	"Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
	"Maciej Fijalkowski" <maciej.fijalkowski@intel.com>,
	"Björn Töpel" <bjorn.topel@gmail.com>
Subject: Re: [PATCHv4 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support
Date: Wed, 07 Apr 2021 01:10:04 +0200	[thread overview]
Message-ID: <87h7kj2enn.fsf@toke.dk> (raw)
In-Reply-To: <606ce0db7cd40_2865920845@john-XPS-13-9370.notmuch>

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > Toke Høiland-Jørgensen wrote:
>> >> Hangbin Liu <liuhangbin@gmail.com> writes:
>> >> 
>> >> > On Mon, Apr 05, 2021 at 05:24:48PM -0700, John Fastabend wrote:
>> >> >> Hangbin Liu wrote:
>> >> >> > This patch add two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to extend
>> >> >> > xdp_redirect_map for broadcast support.
>> >> >> > 
>> >> >> > Keep the general data path in net/core/filter.c and the native data
>> >> >> > path in kernel/bpf/devmap.c so we can use direct calls to get better
>> >> >> > performace.
>> >> >> > 
>> >> >> > Here is the performance result by using xdp_redirect_{map, map_multi} in
>> >> >> > sample/bpf and send pkts via pktgen cmd:
>> >> >> > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
>> >> >> > 
>> >> >> > There are some drop back as we need to loop the map and get each interface.
>> >> >> > 
>> >> >> > Version          | Test                                | Generic | Native
>> >> >> > 5.12 rc2         | redirect_map        i40e->i40e      |    2.0M |  9.8M
>> >> >> > 5.12 rc2         | redirect_map        i40e->veth      |    1.8M | 12.0M
>> >> >> 
>> >> >> Are these are 10gbps i40e ports? Sorry if I asked this earlier, maybe
>> >> >> add a note in the commit if another respin is needed.
>> >> >
>> >> > Yes, I will add it if there is an update.
>> >> >
>> >> >> > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> >> >> > index 3980fb3bfb09..c8452c5f40f8 100644
>> >> >> > --- a/kernel/bpf/devmap.c
>> >> >> > +++ b/kernel/bpf/devmap.c
>> >> >> > @@ -198,6 +198,7 @@ static void dev_map_free(struct bpf_map *map)
>> >> >> >  	list_del_rcu(&dtab->list);
>> >> >> >  	spin_unlock(&dev_map_lock);
>> >> >> >  
>> >> >> > +	bpf_clear_redirect_map(map);
>> >> >> 
>> >> >> Is this a bugfix? If its needed here wouldn't we also need it in the
>> >> >> devmap case.
>> >> >
>> >> > No, in ee75aef23afe ("bpf, xdp: Restructure redirect actions") this function
>> >> > was removed. I added it back as we use ri->map again.
>> >> >
>> >> > What devmap case you mean?
>> >> >
>> >> >> 
>> >> >> >  	synchronize_rcu();
>> >> >> >  
>> >> >> >  	/* Make sure prior __dev_map_entry_free() have completed. */
>> >> >> 
>> >> >> [...]
>> >> >> 
>> >> >> > +
>> >> >> > +static struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp,
>> >> >> > +						   struct bpf_map *map,
>> >> >> > +						   u32 *key, u32 *next_key,
>> >> >> > +						   int ex_ifindex)
>> >> >> > +{
>> >> >> > +	struct bpf_dtab_netdev *obj;
>> >> >> > +	struct net_device *dev;
>> >> >> > +	u32 index;
>> >> >> > +	int err;
>> >> >> > +
>> >> >> > +	err = devmap_get_next_key(map, key, next_key);
>> >> >> > +	if (err)
>> >> >> > +		return NULL;
>> >> >> > +
>> >> >> > +	/* When using dev map hash, we could restart the hashtab traversal
>> >> >> > +	 * in case the key has been updated/removed in the mean time.
>> >> >> > +	 * So we may end up potentially looping due to traversal restarts
>> >> >> > +	 * from first elem.
>> >> >> > +	 *
>> >> >> > +	 * Let's use map's max_entries to limit the loop number.
>> >> >> > +	 */
>> >> >> > +	for (index = 0; index < map->max_entries; index++) {
>> >> >> > +		obj = devmap_lookup_elem(map, *next_key);
>> >> >> > +		if (!obj || dst_dev_is_ingress(obj, ex_ifindex))
>> >> >> > +			goto find_next;
>> >> >> > +
>> >> >> > +		dev = obj->dev;
>> >> >> > +
>> >> >> > +		if (!dev->netdev_ops->ndo_xdp_xmit)
>> >> >> > +			goto find_next;
>> >> >> > +
>> >> >> > +		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
>> >> >> > +		if (unlikely(err))
>> >> >> > +			goto find_next;
>> >> >> > +
>> >> >> > +		return obj;
>> >> >> > +
>> >> >> > +find_next:
>> >> >> > +		key = next_key;
>> >> >> > +		err = devmap_get_next_key(map, key, next_key);
>> >> >> > +		if (err)
>> >> >> > +			break;
>> >> >> > +	}
>> >> >> 
>> >> >> I'm missing something. Either an elaborated commit message or comment
>> >> >> is probably needed. I've been looking at this block for 30 minutes and
>> >> >> can't see how we avoid sending duplicate frames on a single interface?
>> >> >> Can you check this code flow, 
>> >> >> 
>> >> >>   dev_map_enqueue_multi()
>> >> >>    for (;;) {
>> >> >>      next_obj = devmap_get_next_obj(...)
>> >> >>         for (index = 0; index < map->max_entries; index++) {
>> >> >>            obj = devmap_lookup_elem();
>> >> >>            if (!obj) goto find_next
>> >> >>            key = next_key;
>> >> >>            err = devmap_get_next_key() 
>> >> >>                   if (!key) goto find_first
>> >> >>                   for (i = 0; i < dtab->n_buckets; i++)
>> >> >>                      return *next <- now *next_key is point back
>> >> >>                                      at first entry
>> >> >>            // loop back through and find first obj and return that
>> >> >
>> >> > 	 devmap_get_next_key() will loop to find the first one if there is no
>> >> > 	 key or dev. In normal time it will stop after the latest one.
>> >> >>         }
>> >> >>       bq_enqueue(...) // enqueue original obj
>> >> >>       obj = next_obj;
>> >> >>       key = next_key; 
>> >> >>       ...  // we are going to enqueue first obj, but how do we know
>> >> >>            // this hasn't already been sent? Presumably if we have
>> >> >>            // a delete in the hash table in the middle of a multicast
>> >> >>            // operation this might happen?
>> >> >>    }
>> >> >
>> >> > And yes, there is an corner case that if we removed a dev during multicast,
>> >> > there is an possibility that restart from the first key. But given that
>> >> > this is an unlikely case, and in normal internet there is also a possibility
>> >> > of duplicate/lost packet. This should also be acceptable?
>> >> 
>> >> In my mind this falls under "acceptable corner cases". I.e., if you're
>> >> going to use the map for redirect and you expect to be updating it while
>> >> you're doing so, don't use a hashmap. But if you will not be updating
>> >> the map (or find the possible duplication acceptable), you can use the
>> >> hashmap and gain the benefit of being able to index by ifindex.
>> >
>> > In a Kubernetes setup its going to be hard, if possible at all, to restrict
>> > the map from moving as interfaces/IPs are going to be dynamic. Using a
>> > hash map has nice benefits of not having to figure out how to put ifindex's
>> > into the array. Although on some early implementations I wrote a small
>> > hashing algorithm over the top of array, so that could work.
>> >
>> > I don't know how well multicast applications might handle duplicate packets.
>> > I wouldn't be too surprised if it was problematic. On the other hand missing
>> > an entry that was just added is likely OK. There is no way to know from
>> > network/user side if the entry was actually added before multicast op and
>> > skipped or insert happened just after multicast op. And vice versa for a
>> > delete dev, no way to know the multicast op happened before/after the
>> > delete.
>> >
>> > Have we consider doing something like the batch lookup ops over hashtab?
>> > I don't mind "missing" values so if we just walk the list?
>> >
>> >      head = dev_map_index_hash(dtab, i)
>> >      // collect all my devs and get ready to send multicast
>> >      hlist_nulls_for_each_entry_safe(dev, next, head, index_hlist) {
>> > 		enqueue(dev, skb)
>> >      }
>> >      // submit the queue of entries and do all the work to actually xmit
>> >      submit_enqueued();
>> >
>> > We don't have to care about keys just walk the hash list?
>> 
>> So you'd wrap that in a loop like:
>> 
>> for (i = 0; i < dtab->n_buckets; i++) {
>> 	head = dev_map_index_hash(dtab, i);
>> 	hlist_nulls_for_each_entry_safe(dev, next, head, index_hlist) {
>> 		bq_enqueue(dev, xdpf, dev_rx, obj->xdp_prog);
>> 	}
>> }
>> 
>> or? Yeah, I guess that would work!
>
> Nice. Thanks for sticking with this Hangbin its taking us a bit, but
> I think above works on my side at least.
>
>> 
>> It would mean that dev_map_enqueue_multi() would need more in-depth
>> knowledge into the map type, so would likely need to be two different
>> functions for the two different map types, living in devmap.c - but
>> that's probably acceptable.
>
> Yeah, I think thats fine.
>
>> 
>> And while we're doing that, the array-map version can also loop over all
>> indexes up to max_entries, instead of stopping at the first index that
>> doesn't have an entry like it does now (right now, it looks like if you
>> populate entries 0 and 2 in an array-map only one copy of the packet
>> will be sent, to index 0).
>
> Right, this is likely needed anyways. At least when I was doing prototypes
> of using array maps I often ended up with holes in the map. Just imagine
> adding a set of devs and then removing one, its not likely to be the
> last one you insert.

Yeah, totally. Would have pointed it out if I'd noticed before, but I
was too trusting in the abstraction of get_next_key() etc :)

-Toke

next prev parent reply	other threads:[~2021-04-06 23:10 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-02 12:19 [PATCHv4 bpf-next 0/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
2021-04-02 12:19 ` [PATCHv4 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue Hangbin Liu
2021-04-02 12:19 ` [PATCHv4 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support Hangbin Liu
2021-04-06  0:24   ` John Fastabend
2021-04-06  6:38     ` Hangbin Liu
2021-04-06 10:19       ` Toke Høiland-Jørgensen
2021-04-06 21:49         ` John Fastabend
2021-04-06 22:19           ` Toke Høiland-Jørgensen
2021-04-06 22:29             ` John Fastabend
2021-04-06 23:10               ` Toke Høiland-Jørgensen [this message]
2021-04-08 23:30                 ` John Fastabend
2021-04-09  2:45                   ` Hangbin Liu
2021-04-06 11:17   ` Toke Høiland-Jørgensen
2021-04-02 12:19 ` [PATCHv4 bpf-next 3/4] sample/bpf: add xdp_redirect_map_multi for redirect_map broadcast test Hangbin Liu
2021-04-02 12:19 ` [PATCHv4 bpf-next 4/4] selftests/bpf: add xdp_redirect_multi test Hangbin Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h7kj2enn.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=ast@kernel.org \
    --cc=bjorn.topel@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=daniel@iogearbox.net \
    --cc=dsahern@gmail.com \
    --cc=echaudro@redhat.com \
    --cc=jbenc@redhat.com \
    --cc=john.fastabend@gmail.com \
    --cc=liuhangbin@gmail.com \
    --cc=lorenzo.bianconi@redhat.com \
    --cc=maciej.fijalkowski@intel.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.