All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>, davem@davemloft.net
Cc: daniel@iogearbox.net, andrii@kernel.org, tj@kernel.org,
	kafai@fb.com, bpf@vger.kernel.org, kernel-team@fb.com
Subject: RE: [PATCH bpf-next 1/5] bpf: Introduce any context BPF specific memory allocator.
Date: Fri, 24 Jun 2022 18:23:26 -0700	[thread overview]
Message-ID: <62b6638e4e0d1_347af208e3@john.notmuch> (raw)
In-Reply-To: <20220623003230.37497-2-alexei.starovoitov@gmail.com>

Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> 
> Front-end kmalloc() with per-cpu per-bucket cache of free elements.
> Refill this cache asynchronously from irq_work.
> 
> BPF programs always run with migration disabled.
> It's safe to allocate from cache of the current cpu with irqs disabled.
> Free-ing is always done into bucket of the current cpu as well.
> irq_work trims extra free elements from buckets with kfree
> and refills them with kmalloc, so global kmalloc logic takes care
> of freeing objects allocated by one cpu and freed on another.
> 
> struct bpf_mem_alloc supports two modes:
> - When size != 0 create kmem_cache and bpf_mem_cache for each cpu.
>   This is typical bpf hash map use case when all elements have equal size.
> - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
>   kmalloc/kfree. Max allocation size is 4096 in this case.
>   This is bpf_dynptr and bpf_kptr use case.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---

Some initial feedback but still looking over it. Figured it made more
sense to dump current thoughts then drop it this evening for Monday.

[...]

> +static int bpf_mem_cache_idx(size_t size)
    [...]

> +#define NUM_CACHES 11
> +
> +struct bpf_mem_cache {
> +	/* per-cpu list of free objects of size 'unit_size'.
> +	 * All accesses are done with preemption disabled
> +	 * with __llist_add() and __llist_del_first().
> +	 */
> +	struct llist_head free_llist;
> +
> +	/* NMI only free list.
> +	 * All accesses are NMI-safe llist_add() and llist_del_first().
> +	 *
> +	 * Each allocated object is either on free_llist or on free_llist_nmi.
> +	 * One cpu can allocate it from NMI by doing llist_del_first() from
> +	 * free_llist_nmi, while another might free it back from non-NMI by
> +	 * doing llist_add() into free_llist.
> +	 */
> +	struct llist_head free_llist_nmi;

stupid nit but newline here helps me read this.

> +	/* kmem_cache != NULL when bpf_mem_alloc was created for specific
> +	 * element size.
> +	 */
> +	struct kmem_cache *kmem_cache;

> +	struct irq_work refill_work;
> +	struct mem_cgroup *memcg;
> +	int unit_size;
> +	/* count of objects in free_llist */
> +	int free_cnt;
> +	/* count of objects in free_llist_nmi */
> +	atomic_t free_cnt_nmi;
> +	/* flag to refill nmi list too */
> +	bool refill_nmi_list;
> +};

What about having two types one for fixed size cache and one for buckets?
The logic below gets a bunch of if cases with just the single type. OTOH
I messed around with it for a bit and then had to duplicate most of the
codes so I'm not sure its entirely a good idea, but the __alloc() with
the 'if this else that' sort of made me think of it.

> +
> +static struct llist_node notrace *__llist_del_first(struct llist_head *head)
     [...]

> +
> +#define BATCH 48
> +#define LOW_WATERMARK 32
> +#define HIGH_WATERMARK 96
> +/* Assuming the average number of elements per bucket is 64, when all buckets
> + * are used the total memory will be: 64*16*32 + 64*32*32 + 64*64*32 + ... +
> + * 64*4096*32 ~ 20Mbyte
> + */
> +
> +/* extra macro useful for testing by randomizing in_nmi condition */
> +#define bpf_in_nmi() in_nmi()
> +
> +static void *__alloc(struct bpf_mem_cache *c, int node)

For example with two types this mostly drops out. Of course then the callers
have to know the type so not sure. And you get two alloc_bulks and
so on. Its not obviously this works out well.

[...]

> +static void free_bulk_nmi(struct bpf_mem_cache *c)
> +{
> +	struct llist_node *llnode;
> +	int cnt;
> +
> +	do {
> +		llnode = llist_del_first(&c->free_llist_nmi);
> +		if (llnode)
> +			cnt = atomic_dec_return(&c->free_cnt_nmi);
> +		else
> +			cnt = 0;
> +		__free(c, llnode);
> +	} while (cnt > (HIGH_WATERMARK + LOW_WATERMARK) / 2);
> +}

Comment from irq_work_run_list,

	/*
	 * On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
	 * in a per-CPU thread in preemptible context. Only the items which are
	 * marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
	 */

Not an RT expert but I read this to mean in PREEMPT_RT case we can't assume
this is !preemptible? If I read correctly then is there a risk we get
two runners here? And by extension would need to worry about free_cnt
and friends getting corrupted.

> +
> +static void bpf_mem_refill(struct irq_work *work)
> +{
> +	struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache, refill_work);
> +	int cnt;
> +
> +	cnt = c->free_cnt;
> +	if (cnt < LOW_WATERMARK)
> +		/* irq_work runs on this cpu and kmalloc will allocate
> +		 * from the current numa node which is what we want here.
> +		 */
> +		alloc_bulk(c, BATCH, NUMA_NO_NODE);
> +	else if (cnt > HIGH_WATERMARK)
> +		free_bulk(c);
> +
> +	if (!c->refill_nmi_list)
> +		/* don't refill NMI specific freelist
> +		 * until alloc/free from NMI.
> +		 */
> +		return;
> +	cnt = atomic_read(&c->free_cnt_nmi);
> +	if (cnt < LOW_WATERMARK)
> +		alloc_bulk_nmi(c, BATCH, NUMA_NO_NODE);
> +	else if (cnt > HIGH_WATERMARK)
> +		free_bulk_nmi(c);
> +	c->refill_nmi_list = false;
> +}
> +
> +static void notrace irq_work_raise(struct bpf_mem_cache *c, bool in_nmi)
> +{
> +	c->refill_nmi_list = in_nmi;

Should this be,

	c->refill_nmi_list |= in_nmi;

this would resolve comment in unit_alloc? We don't want to clear it if
we end up calling irq_work_raise from in_nmi and then in another
context. It would be really hard to debug if the case is possible and
a busy box just doesn't refill nmi enough.

> +	irq_work_queue(&c->refill_work);
> +}
> +
> +static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu)
    [...]

> +
> +/* notrace is necessary here and in other functions to make sure
> + * bpf programs cannot attach to them and cause llist corruptions.
> + */

Thanks for the comment.

> +static void notrace *unit_alloc(struct bpf_mem_cache *c)
> +{
> +	bool in_nmi = bpf_in_nmi();
> +	struct llist_node *llnode;
> +	unsigned long flags;
> +	int cnt = 0;
> +
> +	if (unlikely(in_nmi)) {
> +		llnode = llist_del_first(&c->free_llist_nmi);
> +		if (llnode)
> +			cnt = atomic_dec_return(&c->free_cnt_nmi);

Dumb question maybe its Friday afternoon. If we are in_nmi() and preempt
disabled why do we need the atomic_dec_return()?

> +	} else {
> +		/* Disable irqs to prevent the following race:
> +		 * bpf_prog_A
> +		 *   bpf_mem_alloc
> +		 *      preemption or irq -> bpf_prog_B
> +		 *        bpf_mem_alloc
> +		 */
> +		local_irq_save(flags);
> +		llnode = __llist_del_first(&c->free_llist);
> +		if (llnode)
> +			cnt = --c->free_cnt;
> +		local_irq_restore(flags);
> +	}
> +	WARN_ON(cnt < 0);
> +

Is this a problem?

  in_nmi = false
  bpf_prog_A
   bpf_mem_alloc
   irq_restore
   irq -> bpf_prog_B
            bpf_mem_alloc
               in_nmi = true
               irq_work_raise(c, true)
   irq_work_raise(c, false)

At somepoint later
 
   bpf_mem_refill()
    refill_nmi_list <- false

The timing is tight but possible I suspect. See above simple fix would
be to just | the refill_nim_list bool? We shouldn't be clearing it
from a raise op.

> +	if (cnt < LOW_WATERMARK)
> +		irq_work_raise(c, in_nmi);
> +	return llnode;
> +}
>

OK need to drop for now. Will pick up reviewing the rest later.

  reply	other threads:[~2022-06-25  1:23 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-23  0:32 [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 1/5] bpf: Introduce any context " Alexei Starovoitov
2022-06-25  1:23   ` John Fastabend [this message]
2022-06-26 17:19     ` Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 2/5] bpf: Convert hash map to bpf_mem_alloc Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 3/5] selftests/bpf: Improve test coverage of test_maps Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 4/5] samples/bpf: Reduce syscall overhead in map_perf_test Alexei Starovoitov
2022-06-23  0:32 ` [PATCH bpf-next 5/5] bpf: Relax the requirement to use preallocated hash maps in tracing progs Alexei Starovoitov
2022-06-27  7:03 ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Christoph Hellwig
2022-06-28  0:17   ` Christoph Lameter
2022-06-28  5:01     ` Alexei Starovoitov
2022-06-28 13:57       ` Christoph Lameter
2022-06-28 17:03         ` Alexei Starovoitov
2022-06-29  2:35           ` Christoph Lameter
2022-06-29  2:49             ` Alexei Starovoitov
2022-07-04 16:13               ` Vlastimil Babka
2022-07-06 17:43                 ` Alexei Starovoitov
2022-07-19 11:52                   ` Vlastimil Babka
2022-07-04 20:34   ` Matthew Wilcox
2022-07-06 17:50     ` Alexei Starovoitov
2022-07-06 17:55       ` Matthew Wilcox
2022-07-06 18:05         ` Alexei Starovoitov
2022-07-06 18:21           ` Matthew Wilcox
2022-07-06 18:26             ` Alexei Starovoitov
2022-07-06 18:31               ` Matthew Wilcox
2022-07-06 18:36                 ` Alexei Starovoitov
2022-07-06 18:40                   ` Matthew Wilcox
2022-07-06 18:51                     ` Alexei Starovoitov
2022-07-06 18:55                       ` Matthew Wilcox
2022-07-08 13:41           ` Michal Hocko
2022-07-08 17:48             ` Alexei Starovoitov
2022-07-08 20:13               ` Yosry Ahmed
2022-07-08 21:55               ` Shakeel Butt
2022-07-10  5:26                 ` Alexei Starovoitov
2022-07-10  7:32                   ` Shakeel Butt
2022-07-11 12:15                     ` Michal Hocko
2022-07-12  4:39                       ` Alexei Starovoitov
2022-07-12  7:40                         ` Michal Hocko
2022-07-12  8:39                           ` Yafang Shao
2022-07-12  9:52                             ` Michal Hocko
2022-07-12 15:25                               ` Shakeel Butt
2022-07-12 16:32                                 ` Tejun Heo
2022-07-12 17:26                                   ` Shakeel Butt
2022-07-12 17:36                                     ` Tejun Heo
2022-07-12 18:11                                       ` Shakeel Butt
2022-07-12 18:43                                         ` Alexei Starovoitov
2022-07-13 13:56                                           ` Yafang Shao
2022-07-12 19:11                                         ` Mina Almasry
2022-07-12 16:24                               ` Tejun Heo
2022-07-18 14:13                                 ` Michal Hocko
2022-07-13  2:39                               ` Roman Gushchin
2022-07-13 14:24                                 ` Yafang Shao
2022-07-13 16:24                                   ` Tejun Heo
2022-07-14  6:15                                     ` Yafang Shao
2022-07-18 17:55                                 ` Yosry Ahmed
2022-07-19 11:30                                   ` cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.) Michal Hocko
2022-07-19 18:00                                     ` Yosry Ahmed
2022-07-19 18:01                                       ` Yosry Ahmed
2022-07-19 18:46                                       ` Mina Almasry
2022-07-19 19:16                                         ` Tejun Heo
2022-07-19 19:30                                           ` Yosry Ahmed
2022-07-19 19:38                                             ` Tejun Heo
2022-07-19 19:40                                               ` Yosry Ahmed
2022-07-19 19:47                                               ` Mina Almasry
2022-07-19 19:54                                                 ` Tejun Heo
2022-07-19 20:16                                                   ` Mina Almasry
2022-07-19 20:29                                                     ` Tejun Heo
2022-07-20 12:26                                         ` Michal Hocko
2022-07-12 18:40                           ` [PATCH bpf-next 0/5] bpf: BPF specific memory allocator Alexei Starovoitov
2022-07-18 12:27                             ` Michal Hocko
2022-07-13  2:27                           ` Roman Gushchin
2022-07-11 12:22               ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=62b6638e4e0d1_347af208e3@john.notmuch \
    --to=john.fastabend@gmail.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.