From: Stanislav Fomichev <sdf@fomichev.me>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Yonghong Song <yhs@fb.com>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Brian Vazquez <brianvv@google.com>,
	Alexei Starovoitov <ast@fb.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
Date: Tue, 3 Sep 2019 15:30:43 -0700
Message-ID: <20190903223043.GC2101@mini-arch>
In-Reply-To: <20190903210127.z6mhkryqg6qz62dq@ast-mbp.dhcp.thefacebook.com>

On 09/03, Alexei Starovoitov wrote:
> On Fri, Aug 30, 2019 at 02:18:09PM -0700, Stanislav Fomichev wrote:
> > > > 
> > > > I personally like Jakub's/Quentin's proposal more. So if I get to choose
> > > > between this series and Jakub's filter+dump in BPF, I'd pick filter+dump
> > > > (pending the per-cpu issue, which we actually care about).
> > > > 
> > > > But if we can have both, I don't have any objections; this patch
> 
> I think we need to have both.
> imo Jakub's and Yonghong's approaches are solving slightly different cases.
> 
> filter+dump via a program is better suited for LRU map walks where the
> filter prog would do some non-trivial logic.
> Whereas a plain 'delete all' or 'dump all' is much simpler to use without
> loading yet another prog just to do the dump.
> bpf infra today isn't quite ready for these very short-lived auxiliary progs.
> At prog load, pages get a read-only mapping, tlbs get flushed across cpus,
> kallsyms get populated, FDs get allocated, etc.
> Loading a prog is a heavy operation. There was chatter before about having
> built-in progs. This filter+dump could benefit from builtin 'allow all'
> or 'delete all' progs, but imo that complicates the design and raises even
> more questions than it answers. Should these builtin progs show up
> in 'bpftool prog show'? When do they load/unload? Do they have the same
> safety requirements as normal progs? etc.
> imo it's fine to have a little bit of overlap between apis.
> So I think we should proceed with both batching apis.
We don't need to load the filter+dump prog every time we need a dump, right?
Internally, we want this 'batch dump' only for long-running daemons (I think
the same applies to bcc); we can load the filter+dump prog once and then
have a sys_bpf() command to trigger it.
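
Roughly the following; BPF_MAP_DUMP and the attr 'dump' fields below are
hypothetical, just sketching what such a trigger could look like (only
syscall(__NR_bpf, ...) and union bpf_attr exist today):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical trigger for a pre-loaded filter+dump prog: the
 * BPF_MAP_DUMP command and the attr 'dump' fields are made up,
 * just to illustrate that the per-dump hot path is one syscall,
 * with no prog load on it.
 */
static int trigger_dump(int map_fd, int filter_prog_fd,
			void *buf, __u32 buf_len)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.dump.map_fd  = map_fd;
	attr.dump.prog_fd = filter_prog_fd;	/* loaded once at startup */
	attr.dump.buf     = (__u64)(unsigned long)buf;
	attr.dump.buf_len = buf_len;

	return syscall(__NR_bpf, BPF_MAP_DUMP, &attr, sizeof(attr));
}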

Also, related: if we add this batch dump, it doesn't mean that
everything should switch to it. For example, I feel like we are
perfectly fine with bpftool still doing the plain get_next_key+lookup
walk (sketched below), since we use it only for debugging.
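
i.e. something like this toy walk, using only existing libbpf calls
(assuming a map with 4-byte keys and values for brevity):

#include <stdio.h>
#include <bpf/bpf.h>	/* bpf_map_get_next_key(), bpf_map_lookup_elem() */

/* One get_next_key+lookup syscall pair per element: slow, but
 * perfectly fine for occasional debugging.
 */
static void walk_map(int map_fd)
{
	__u32 key, next_key, value;
	__u32 *prev = NULL;	/* NULL means "start from the first key" */

	while (!bpf_map_get_next_key(map_fd, prev, &next_key)) {
		if (!bpf_map_lookup_elem(map_fd, &next_key, &value))
			printf("key %u -> value %u\n", next_key, value);
		key = next_key;
		prev = &key;	/* resume from the last returned key */
	}
}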

> Having said that, I think both suffer from the important issue pointed out
> by Brian: when the kernel deletes an element, a get_next_key iteration over
> a hash/lru map will produce duplicates.
> The amount of duplicates can be huge. When the batched iterator is slow and
> a bpf prog is doing a lot of update/delete, there could be 10x worth of
> duplicates, since the walk will resume from the beginning.
> User space cannot be tasked to deal with it.
> I think this issue has to be solved in the kernel first, and it may require
> a different batching api.
> 
> One idea is to take the bucket spin_lock and batch process the map
> bucket-at-a-time. From the api pov, user space will tell the kernel:
> - here is a buffer for N elements; start the dump from the beginning.
> - the kernel will return <= N elements and an iterator.
> - user space will pass this opaque iterator back to get another batch.
> For a well behaved hash/lru map there will be zero or one elements per bucket.
> When there are 2+, the batching logic can process them together.
> If 'lookup' is requested, the kernel can check whether user space provided
> enough space for these 2 elements. If not, abort the batch earlier.
> get_next_key won't be used. Instead some sort of opaque iterator
> will be returned to user space, so the next batch lookup can start from it.
> This iterator could be the index of the last dumped bucket.
> This idea won't work for pathological hash tables though.
> A single bucket may hold more elements than there is room for in a single
> batch. In such a case the iterator will get stuck, since
> num_of_elements_in_bucket > batch_buf_size.
> Maybe a special error code can be used to solve that?
This all requires new per-map implementations unfortunately :-(
We were trying to see if we could somehow improve the existing bpf_map_ops
to be more friendly towards batching.

You also bring up a valid point with regard to well-behaved hash/lru maps:
we might be optimizing for the wrong case :-)
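
To make your iterator idea concrete, I'd imagine user space looking roughly
like this; map_lookup_batch(), the opaque iterator and the -ENOSPC convention
are all made up here, nothing below is existing uapi:

#define BATCH 64

__u32 keys[BATCH], vals[BATCH];	/* again assuming 4-byte keys/values */
__u64 iter = 0;			/* opaque; e.g. last dumped bucket index */
__u32 count;
int err;

do {
	count = BATCH;
	/* kernel fills <= BATCH elements, whole buckets at a time,
	 * under the bucket spin_lock, and advances iter */
	err = map_lookup_batch(map_fd, &iter, keys, vals, &count);
	for (__u32 i = 0; i < count; i++)
		handle(keys[i], vals[i]);
	/* a pathological bucket with more than BATCH elements could
	 * fail with e.g. -ENOSPC so the caller can retry with a
	 * bigger buffer instead of getting stuck */
} while (!err && count == BATCH);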

> I hope we can come up with other ideas to have a stable iterator over a
> hash table.
> Let's use email to describe the ideas and the upcoming LPC conference to
> sort out the details and finalize the one to use.
