From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Roman Gushchin <guro@fb.com>, Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org, ast@kernel.org, netdev@vger.kernel.org,
andrii@kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, kernel-team@fb.com,
hannes@cmpxchg.org, tj@kernel.org
Subject: Re: [PATCH bpf-next v8 06/34] bpf: prepare for memcg-based memory accounting for bpf maps
Date: Thu, 26 Nov 2020 10:56:12 +0100 [thread overview]
Message-ID: <87lfeol9vn.fsf@toke.dk> (raw)
In-Reply-To: <20201126023000.GB840171@carbon.dhcp.thefacebook.com>
Roman Gushchin <guro@fb.com> writes:
> On Thu, Nov 26, 2020 at 01:21:41AM +0100, Daniel Borkmann wrote:
>> On 11/25/20 4:00 AM, Roman Gushchin wrote:
>> > In the absolute majority of cases if a process is making a kernel
>> > allocation, it's memory cgroup is getting charged.
>> >
>> > Bpf maps can be updated from an interrupt context and in such
>> > case there is no process which can be charged. It makes the memory
>> > accounting of bpf maps non-trivial.
>> >
>> > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
>> > memcg accounting from interrupt contexts") and b87d8cefe43c
>> > ("mm, memcg: rework remote charging API to support nesting")
>> > it's finally possible.
>> >
>> > To do it, a pointer to the memory cgroup of the process, which created
>> > the map, is saved, and this cgroup can be charged for all allocations
>> > made from an interrupt context. This commit introduces 2 helpers:
>> > bpf_map_kmalloc_node() and bpf_map_alloc_percpu(). They can be used in
>> > the bpf code for accounted memory allocations, both in the process and
>> > interrupt contexts. In the interrupt context they're using the saved
>> > memory cgroup, otherwise the current cgroup is getting charged.
>> >
>> > Signed-off-by: Roman Gushchin <guro@fb.com>
>>
>> Thanks for updating the cover letter; replying in this series instead
>> on one more item that came to mind:
>>
>> [...]
>> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > index f3fe9f53f93c..4154c616788c 100644
>> > --- a/kernel/bpf/syscall.c
>> > +++ b/kernel/bpf/syscall.c
>> > @@ -31,6 +31,8 @@
>> > #include <linux/poll.h>
>> > #include <linux/bpf-netns.h>
>> > #include <linux/rcupdate_trace.h>
>> > +#include <linux/memcontrol.h>
>> > +#include <linux/sched/mm.h>
>> > #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>> > (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
>> > @@ -456,6 +458,77 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
>> > __release(&map_idr_lock);
>> > }
>> > +#ifdef CONFIG_MEMCG_KMEM
>> > +static void bpf_map_save_memcg(struct bpf_map *map)
>> > +{
>> > + map->memcg = get_mem_cgroup_from_mm(current->mm);
>> > +}
>> > +
>> > +static void bpf_map_release_memcg(struct bpf_map *map)
>> > +{
>> > + mem_cgroup_put(map->memcg);
>> > +}
>> > +
>> > +void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
>> > + int node)
>> > +{
>> > + struct mem_cgroup *old_memcg;
>> > + bool in_interrupt;
>> > + void *ptr;
>> > +
>> > + /*
>> > + * If the memory allocation is performed from an interrupt context,
>> > + * the memory cgroup to charge can't be determined from the context
>> > + * of the current task. Instead, we charge the memory cgroup, which
>> > + * contained the process created the map.
>> > + */
>> > + in_interrupt = in_interrupt();
>> > + if (in_interrupt)
>> > + old_memcg = set_active_memcg(map->memcg);
>> > +
>> > + ptr = kmalloc_node(size, flags, node);
>> > +
>> > + if (in_interrupt)
>> > + set_active_memcg(old_memcg);
>> > +
>> > + return ptr;
>> > +}
>> > +
>> > +void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
>> > + size_t align, gfp_t gfp)
>> > +{
>> > + struct mem_cgroup *old_memcg;
>> > + bool in_interrupt;
>> > + void *ptr;
>> > +
>> > + /*
>> > + * If the memory allocation is performed from an interrupt context,
>> > + * the memory cgroup to charge can't be determined from the context
>> > + * of the current task. Instead, we charge the memory cgroup, which
>> > + * contained the process created the map.
>> > + */
>> > + in_interrupt = in_interrupt();
>> > + if (in_interrupt)
>> > + old_memcg = set_active_memcg(map->memcg);
>> > +
>> > + ptr = __alloc_percpu_gfp(size, align, gfp);
>> > +
>> > + if (in_interrupt)
>> > + set_active_memcg(old_memcg);
>>
>> For this and above bpf_map_kmalloc_node() one, wouldn't it make more sense to
>> perform the temporary memcg unconditionally?
>>
>> old_memcg = set_active_memcg(map->memcg);
>> ptr = kmalloc_node(size, flags, node);
>> set_active_memcg(old_memcg);
>>
>> I think the semantics are otherwise a bit weird and the charging unpredictable;
>> this way it would /always/ be accounted against the prog in the memcg that
>> originally created the map.
>>
>> E.g. maps could be shared between progs attached to, say, XDP/tc where in_interrupt()
>> holds true with progs attached to skb-cgroup/egress where we're still in process
>> context. So some part of the memory is charged against the original map's memcg and
>> some other part against the current process' memcg which seems odd, no? Or, for example,
>> if we start to run a tracing BPF prog which updates state in a BPF map ... that tracing
>> prog now interferes with processes in other memcgs which may not be intentional & could
>> lead to potential failures there as opposed when the tracing prog is not run. My concern
>> is that the semantics are not quite clear and behavior unpredictable compared to always
>> charging against map->memcg.
>>
>> Similarly, what if an orchestration prog creates dedicated memcg(s) for maps with
>> individual limits ... the assumed behavior (imho) would be that whatever memory is
>> accounted on the map it can be accurately retrieved from there & similarly limits
>> enforced, no? It seems that would not be the case currently.
>>
>> Thoughts?
>
> I did consider this option. There are pros and cons. In general we
> tend to charge the cgroup which actually allocates the memory, and I
> decided to stick with this rule. I agree, it's fairly easy to come
> with arguments why always charging the map creator is better. The
> opposite is also true: it's not clear why bpf is different here. So
> I'm fine with both options, if there is a wide consensus, I'm happy to
> switch to the other option. In general, I believe that the current
> scheme is more flexible: if someone want to pay in advance, they are
> free to preallocate the map. Otherwise it's up to whoever wants to
> populate it.
I think I agree with Daniel here: conceptually the memory used by a map
ought to belong to that map's memcg. I can see how the other scheme can
be more flexible, but as Daniel points out it seems like it can lead to
hard-to-debug errors...
(Side note: I'm really excited about this work in general! The ulimit
thing has been a major pain...)
-Toke
next prev parent reply other threads:[~2020-11-26 9:56 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-25 3:00 [PATCH bpf-next v8 00/34] bpf: switch to memcg-based memory accounting Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 01/34] mm: memcontrol: use helpers to read page's memcg data Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 02/34] mm: memcontrol/slab: use helpers to access slab page's memcg_data Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 03/34] mm: introduce page memcg flags Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 04/34] mm: convert page kmemcg type to a page memcg flag Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 05/34] bpf: memcg-based memory accounting for bpf progs Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 06/34] bpf: prepare for memcg-based memory accounting for bpf maps Roman Gushchin
2020-11-26 0:21 ` Daniel Borkmann
2020-11-26 2:30 ` Roman Gushchin
2020-11-26 9:56 ` Toke Høiland-Jørgensen [this message]
2020-11-26 16:06 ` Roman Gushchin
2020-11-26 17:12 ` Alexei Starovoitov
2020-11-25 3:00 ` [PATCH bpf-next v8 07/34] bpf: " Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 08/34] bpf: refine memcg-based memory accounting for arraymap maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 09/34] bpf: refine memcg-based memory accounting for cpumap maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 10/34] bpf: memcg-based memory accounting for cgroup storage maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 11/34] bpf: refine memcg-based memory accounting for devmap maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 12/34] bpf: refine memcg-based memory accounting for hashtab maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 13/34] bpf: memcg-based memory accounting for lpm_trie maps Roman Gushchin
2020-11-25 3:00 ` [PATCH bpf-next v8 14/34] bpf: memcg-based memory accounting for bpf ringbuffer Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 15/34] bpf: memcg-based memory accounting for bpf local storage maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 16/34] bpf: refine memcg-based memory accounting for sockmap and sockhash maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 17/34] bpf: refine memcg-based memory accounting for xskmap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 18/34] bpf: eliminate rlimit-based memory accounting for arraymap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 19/34] bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 20/34] bpf: eliminate rlimit-based memory accounting for cpumap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 21/34] bpf: eliminate rlimit-based memory accounting for cgroup storage maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 22/34] bpf: eliminate rlimit-based memory accounting for devmap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 23/34] bpf: eliminate rlimit-based memory accounting for hashtab maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 24/34] bpf: eliminate rlimit-based memory accounting for lpm_trie maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 25/34] bpf: eliminate rlimit-based memory accounting for queue_stack_maps maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 26/34] bpf: eliminate rlimit-based memory accounting for reuseport_array maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 27/34] bpf: eliminate rlimit-based memory accounting for bpf ringbuffer Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 28/34] bpf: eliminate rlimit-based memory accounting for sockmap and sockhash maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 29/34] bpf: eliminate rlimit-based memory accounting for stackmap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 30/34] bpf: eliminate rlimit-based memory accounting for xskmap maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 31/34] bpf: eliminate rlimit-based memory accounting for bpf local storage maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 32/34] bpf: eliminate rlimit-based memory accounting infra for bpf maps Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 33/34] bpf: eliminate rlimit-based memory accounting for bpf progs Roman Gushchin
2020-11-25 3:01 ` [PATCH bpf-next v8 34/34] bpf: samples: do not touch RLIMIT_MEMLOCK Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87lfeol9vn.fsf@toke.dk \
--to=toke@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=guro@fb.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@fb.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=netdev@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.