All of lore.kernel.org
 help / color / mirror / Atom feed
From: sdf@google.com
To: Hou Tao <houtao@huaweicloud.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Hao Luo <haoluo@google.com>, Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	houtao1@huawei.com
Subject: Re: [PATCH bpf 1/2] bpf: Wait for busy refill_work when destorying bpf memory allocator
Date: Wed, 19 Oct 2022 11:38:44 -0700	[thread overview]
Message-ID: <Y1BENCpam1I+anXF@google.com> (raw)
In-Reply-To: <20221019115539.983394-2-houtao@huaweicloud.com>

On 10/19, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>

> A busy irq work is an unfinished irq work and it can be either in the
> pending state or in the running state. When destroying bpf memory
> allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
> work is invoked in a per-CPU RT-kthread. It is also possible for kernel
> with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host)
> and irq work is inovked in timer interrupt.

> The busy refill_work leads to various issues. The obvious one is that
> there will be concurrent operations on free_by_rcu and free_list between
> irq work and memory draining. Another one is call_rcu_in_progress will
> not be reliable for the checking of pending RCU callback because
> do_call_rcu() may has not been invoked by irq work. The other is there
> will be use-after-free if irq work is freed before the callback of
> irq work is invoked as shown below:

>   BUG: kernel NULL pointer dereference, address: 0000000000000000
>   #PF: supervisor instruction fetch in kernel mode
>   #PF: error_code(0x0010) - not-present page
>   PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
>   Oops: 0010 [#1] PREEMPT_RT SMP
>   CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
>   RIP: 0010:0x0
>   Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>   RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
>   RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
>   RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
>   ......
>   Call Trace:
>    <TASK>
>    irq_work_single+0x24/0x60
>    irq_work_run_list+0x24/0x30
>    run_irq_workd+0x23/0x30
>    smpboot_thread_fn+0x203/0x300
>    kthread+0x126/0x150
>    ret_from_fork+0x1f/0x30
>    </TASK>

> Considering the ease of concurrency handling and the short wait time
> used for irq_work_sync() under PREEMPT_RT (When running two test_maps on
> PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and
> the 99th percentile is 10us), just waiting for busy refill_work to
> complete before memory draining and memory freeing.

> Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory  
> allocator.")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   kernel/bpf/memalloc.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)

> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 94f0f63443a6..48e606aaacf0 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -497,6 +497,16 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>   		rcu_in_progress = 0;
>   		for_each_possible_cpu(cpu) {
>   			c = per_cpu_ptr(ma->cache, cpu);
> +			/*
> +			 * refill_work may be unfinished for PREEMPT_RT kernel
> +			 * in which irq work is invoked in a per-CPU RT thread.
> +			 * It is also possible for kernel with
> +			 * arch_irq_work_has_interrupt() being false and irq
> +			 * work is inovked in timer interrupt. So wait for the
> +			 * completion of irq work to ease the handling of
> +			 * concurrency.
> +			 */
> +			irq_work_sync(&c->refill_work);

Does it make sense to guard these with "IS_ENABLED(CONFIG_PREEMPT_RT)" ?
We do have a bunch of them sprinkled already to run alloc/free with
irqs disabled.

I was also trying to see if adding local_irq_save inside drain_mem_cache
to pair with the ones from refill might work, but waiting for irq to
finish seems easier...

Maybe also move both of these in some new "static void irq_work_wait"
to make it clear that the PREEMT_RT comment applies to both of them?

Or maybe that helper should do 'for_each_possible_cpu(cpu)  
irq_work_sync(&c->refill_work);'
in the PREEMPT_RT case so we don't have to call it twice?

>   			drain_mem_cache(c);
>   			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>   		}
> @@ -511,6 +521,7 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>   			cc = per_cpu_ptr(ma->caches, cpu);
>   			for (i = 0; i < NUM_CACHES; i++) {
>   				c = &cc->cache[i];
> +				irq_work_sync(&c->refill_work);
>   				drain_mem_cache(c);
>   				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>   			}
> --
> 2.29.2


  reply	other threads:[~2022-10-19 18:38 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-19 11:55 [PATCH bpf 0/2] Wait for busy refill_work when destorying bpf memory allocator Hou Tao
2022-10-19 11:55 ` [PATCH bpf 1/2] bpf: " Hou Tao
2022-10-19 18:38   ` sdf [this message]
2022-10-20  1:07     ` Hou Tao
2022-10-20 17:49       ` Stanislav Fomichev
2022-10-21  1:06         ` Hou Tao
2022-10-19 11:55 ` [PATCH bpf 2/2] bpf: Use __llist_del_all() whenever possbile during memory draining Hou Tao
2022-10-19 19:00   ` sdf
2022-10-20  1:17     ` Hou Tao
2022-10-20 17:52       ` Stanislav Fomichev
2022-10-21  1:09         ` Hou Tao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1BENCpam1I+anXF@google.com \
    --to=sdf@google.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=houtao@huaweicloud.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=song@kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.