From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jakub Kicinski Subject: Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates Date: Fri, 12 Oct 2018 08:39:41 -0700 Message-ID: <20181012083941.1904f59c@cakuba.netronome.com> References: <20181011140207.27602-1-daniel@iogearbox.net> <20181011140207.27602-3-daniel@iogearbox.net> <20181011200405.0af4a48b@cakuba.netronome.com> <4258c8c5-f18e-65ba-cb14-4a2f2e3187cc@iogearbox.net> <531acdb6-aef3-fe19-64aa-4255daab9cc1@iogearbox.net> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: alexei.starovoitov@gmail.com, netdev@vger.kernel.org To: Daniel Borkmann Return-path: Received: from mail-qt1-f194.google.com ([209.85.160.194]:39787 "EHLO mail-qt1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728900AbeJLXMs (ORCPT ); Fri, 12 Oct 2018 19:12:48 -0400 Received: by mail-qt1-f194.google.com with SMTP id e22-v6so14293421qto.6 for ; Fri, 12 Oct 2018 08:39:45 -0700 (PDT) In-Reply-To: <531acdb6-aef3-fe19-64aa-4255daab9cc1@iogearbox.net> Sender: netdev-owner@vger.kernel.org List-ID: On Fri, 12 Oct 2018 15:30:57 +0200, Daniel Borkmann wrote: > On 10/12/2018 10:39 AM, Daniel Borkmann wrote: > > On 10/12/2018 05:04 AM, Jakub Kicinski wrote: > >> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote: > >>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit > >>> to use similar tail update scheme as in perf and bcc allowing the > >>> kernel to make forward progress not only after full timely walk. > >> > >> The extra memory barriers won't impact performance? If I read the code > >> correctly we now have: > >> > >> while (bla) { > >> head = HEAD > >> rmb() > >> > >> ... > >> > >> mb() > >> TAIL = tail > >> } > >> > >> Would it make sense to try to piggy back on the mb() for head re-read > >> at least? Perhaps that's a non-issue, just wondering. > > > > From the scheme specified in the comment in prior patch my understanding > > would be that they don't pair (see B and C) so there would be no guarantee > > that load of head would happen before load of data. Fwiw, I've been using > > the exact same semantics as user space perf tool walks the perf mmap'ed > > ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop > > On that note, I'll also respin, after some clarification with PeterZ on > why perf is using {rmb,mb}() barriers today as opposed to more lightweight > smp_{rmb,mb}() ones it turns out there is no real reason other than > historic one and perf can be changed and made more efficient as well. ;) Cool, to be clear I meant something like: head = HEAD rmb() while (bla) { ... head = HEAD mb() TAIL = tail } > > pushing into ring buffer while user space walks it and indicates how far > > it has consumed data via tail update, it would allow for making room > > successively and not only after full run has complete, so we don't make > > any assumptions in the generic libbpf library helper on how slow/quick > > the callback would be processing resp. how full ring is, etc, and kernel > > pushing new data can be processed in the same run if necessary. One thing > > we could consider is to batch tail updates, say, every 8 elements and a > > final update once we break out walking the ring; probably okay'ish as a > > heuristic.. Makes sense, I would be tempted to do some batching, but since I'm mostly working with I/O devices I may think that memory barriers cost more than they do for main memory-only accesses :) > >>> Also few other improvements to use realloc() instead of free() and > >>> malloc() combination and for the callback use proper perf_event_header > >>> instead of void pointer, so that real applications can use container_of() > >>> macro with proper type checking. > >> > >> FWIW the free() + malloc() was to avoid the the needless copy of the > >> previous event realloc() may do. It makes sense to use realloc() > >> especially if you want to put extra info in front of the buffer, just > >> sayin' it wasn't a complete braino ;) > > > > No strong preference from my side, I'd think that it might be sensible in > > any case from applications to call the bpf_perf_event_read_simple() with a > > already preallocated buffer, depending on the expected max element size from > > BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from > > the BPF prog and MTU 1500 this would more than suffice to avoid new > > allocations altogether. Anyway, given we only grow the new memory area I > > did some testing on realloc() with an array of pointers to prior malloc()'ed > > buffers, running randomly 10M realloc()s to increase size over them and > > saw <1% where area had to be moved, so we're hitting corner case of a corner > > case, I'm also ok to leave the combination, though. :) Thanks for doing the experiment, not strong preference here either, so realloc() seems good!