From: Anton Protopopov <aspsk@isovalent.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Martin KaFai Lau <martin.lau@kernel.org>,
bpf <bpf@vger.kernel.org>, Joe Stringer <joe@isovalent.com>
Subject: Re: [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure
Date: Mon, 26 Jun 2023 16:29:16 +0000 [thread overview]
Message-ID: <ZJm83NTmkogNd3JP@zh-lab-node-5> (raw)
In-Reply-To: <6496320f40706_7b3662086f@john.notmuch>
On Fri, Jun 23, 2023 at 05:00:15PM -0700, John Fastabend wrote:
> Alexei Starovoitov wrote:
> > On Fri, Jun 2, 2023 at 7:20 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > >
> > > On Thu, Jun 01, 2023 at 05:40:10PM -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jun 1, 2023 at 11:24 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 1, 2023 at 11:17 AM Anton Protopopov <aspsk@isovalent.com> wrote:
> > > > > > >
> > > > > > > LRU logic doesn't kick in until the map is full.
> > > > > >
> > > > > > In fact, it can: a reproducable example is in the self-test from this patch
> > > > > > series. In the test N threads try to insert random values for keys 1..3000
> > > > > > simultaneously. As the result, the map may contain any number of elements,
> > > > > > typically 100 to 1000 (never full 3000, which is also less than the map size).
> > > > > > So a user can't really even closely estimate the number of elements in the LRU
> > > > > > map based on the number of updates (with unique keys). A per-cpu counter
> > > > > > inc/dec'ed from the kernel side would solve this.
> > > > >
> > > > > That's odd and unexpected.
> > > > > Definitely something to investigate and fix in the LRU map.
> > > > >
> > > > > Pls cc Martin in the future.
> > > > >
> > > > > > > If your LRU map is not full you shouldn't be using LRU in the first place.
> > > > > >
> > > > > > This makes sense, yes, especially that LRU evictions may happen randomly,
> > > > > > without a map being full. I will step back with this patch until we investigate
> > > > > > if we can replace LRUs with hashes.
> > > > > >
> > > > > > Thanks for the comments!
> > > >
> > > > Thinking about it more...
> > > > since you're proposing to use percpu counter unconditionally for prealloc
> > > > and percpu_counter_add_batch() logic is batched,
> > > > it could actually be acceptable if it's paired with non-api access.
> > > > Like another patch can add generic kfunc to do __percpu_counter_sum()
> > > > and in the 3rd patch kernel/bpf/preload/iterators/iterators.bpf.c
> > > > for maps can be extended to print the element count, so the user can have
> > > > convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.
> > > >
> > > > But additional logic of percpu_counter_add_batch() might get in the way
> > > > of debugging eventually.
> > > > If we want to have stats then we can have normal per-cpu u32 in basic
> > > > struct bpf_map that most maps, except array, will inc/dec on update/delete.
> > > > kfunc to iterate over percpu is still necessary.
> > > > This way we will be able to see not only number of elements, but detect
> > > > bad usage when one cpu is only adding and another cpu is deleting elements.
> > > > And other cpu misbalance.
> > >
> > > This looks for me like two different things: one is a kfunc to get the current
> > > counter (e.g., bpf_map_elements_count), the other is a kfunc to dump some more
> > > detailed stats (e.g., per-cpu values or more).
> > >
> > > My patch, slightly modified, addresses the first goal: most maps of interest
> > > already have a counter in some form (sometimes just atomic_t or u64+lock). If
> > > we add a percpu (non-batch) counter for pre-allocated hashmaps, then it's done:
> > > the new kfunc can get the counter based on the map type.
> > >
> > > If/when there's need to provide per-cpu statistics of elements or some more
> > > sophisticated statistics, this can be done without changing the api of the
> > > bpf_map_elements_count() kfunc.
> > >
> > > Would this work?
> >
> > No, because bpf_map_elements_count() as a building block is too big
> > and too specific. Nothing else can be made out of it, but counting
> > elements.
> > "for_each_cpu in per-cpu variable" would be generic that is usable beyond
> > this particular use case of stats collection.
>
> Without much thought, could you hook the eviction logic in LRU to know
> when the evict happens and even more details about what was evicted so
> we could debug the random case where we evict something in a conntrack
> table and then later it comes back to life and sends some data like a
> long living UDP session.
>
> For example in the cases where you build an LRU map because in 99%
> cases no evictions happen and the LRU is just there as a backstop
> you might even generate events to userspace to let it know evicts
> are in progress and they should do something about it.
Yes, one can trace evictions, as the destructor function for lru_list is
noinline. An example here: https://github.com/aspsk/bcc/tree/aspsk/lrusnoop
One problem with evictions and LRU is that evictions can [currently] happen at
random times even if map is nearly emptee, see a new map test from this series
and also https://lore.kernel.org/bpf/ZHjhBFLLnUcSy9Tt@zh-lab-node-5/ So for LRU
we really need to invest more time. For hashtabs and other maps the
bpf_map_sum_elem_count is in any case quite useful.
> Thanks,
> John
next prev parent reply other threads:[~2023-06-26 16:28 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-31 11:05 [PATCH bpf-next 0/2] add mechanism to report map pressure Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 1/2] bpf: add new map ops ->map_pressure Anton Protopopov
2023-05-31 18:24 ` Alexei Starovoitov
2023-06-01 7:31 ` Anton Protopopov
2023-06-01 16:39 ` Alexei Starovoitov
2023-06-01 18:18 ` Anton Protopopov
2023-06-01 18:24 ` Alexei Starovoitov
2023-06-02 0:40 ` Alexei Starovoitov
2023-06-02 14:21 ` Anton Protopopov
2023-06-02 16:23 ` Alexei Starovoitov
2023-06-06 7:49 ` Anton Protopopov
2023-06-24 0:00 ` John Fastabend
2023-06-26 16:29 ` Anton Protopopov [this message]
2023-06-28 13:17 ` Anton Protopopov
2023-06-01 0:44 ` kernel test robot
2023-06-01 7:50 ` Anton Protopopov
2023-06-13 8:23 ` Yujie Liu
2023-06-13 8:30 ` Anton Protopopov
2023-05-31 11:05 ` [PATCH bpf-next 2/2] selftests/bpf: test map pressure Anton Protopopov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZJm83NTmkogNd3JP@zh-lab-node-5 \
--to=aspsk@isovalent.com \
--cc=alexei.starovoitov@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=joe@isovalent.com \
--cc=john.fastabend@gmail.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox