From: Uladzislau Rezki <urezki@gmail.com>
To: Adrian Huang <adrianhuang0701@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>,
Christoph Hellwig <hch@infradead.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Adrian Huang <ahuang12@lenovo.com>,
Jiwei Sun <sunjw10@lenovo.com>
Subject: Re: [PATCH 1/1] mm/vmalloc: Add preempt point in purge_vmap_node() when enabling kasan
Date: Fri, 5 Jul 2024 17:36:31 +0200 [thread overview]
Message-ID: <ZogS_04dP5LlRlXN@pc636> (raw)
In-Reply-To: <20240705130808.1581-1-ahuang12@lenovo.com>
On Fri, Jul 05, 2024 at 09:08:08PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
>
> When compiling kernel source 'make -j $(nproc)' with the up-and-running
> KASAN-enabled kernel on a 256-core machine, the following soft lockup
> is shown:
>
> watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
> CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
> Workqueue: events drain_vmap_area_work
> RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
> Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75
> RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202
> RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949
> RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50
> RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800
> R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39
> R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003
> FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0
> Call Trace:
> <IRQ>
> ? watchdog_timer_fn+0x2cd/0x390
> ? __pfx_watchdog_timer_fn+0x10/0x10
> ? __hrtimer_run_queues+0x300/0x6d0
> ? sched_clock_cpu+0x69/0x4e0
> ? __pfx___hrtimer_run_queues+0x10/0x10
> ? srso_return_thunk+0x5/0x5f
> ? ktime_get_update_offsets_now+0x7f/0x2a0
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> ? hrtimer_interrupt+0x2ca/0x760
> ? __sysvec_apic_timer_interrupt+0x8c/0x2b0
> ? sysvec_apic_timer_interrupt+0x6a/0x90
> </IRQ>
> <TASK>
> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> ? smp_call_function_many_cond+0x1d8/0xbb0
> ? __pfx_do_kernel_range_flush+0x10/0x10
> on_each_cpu_cond_mask+0x20/0x40
> flush_tlb_kernel_range+0x19b/0x250
> ? srso_return_thunk+0x5/0x5f
> ? kasan_release_vmalloc+0xa7/0xc0
> purge_vmap_node+0x357/0x820
> ? __pfx_purge_vmap_node+0x10/0x10
> __purge_vmap_area_lazy+0x5b8/0xa10
> drain_vmap_area_work+0x21/0x30
> process_one_work+0x661/0x10b0
> worker_thread+0x844/0x10e0
> ? srso_return_thunk+0x5/0x5f
> ? __kthread_parkme+0x82/0x140
> ? __pfx_worker_thread+0x10/0x10
> kthread+0x2a5/0x370
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x30/0x70
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> Debugging Analysis:
> 1. [Crash] The call trace indicates CPU #28 is waiting for other CPUs'
> responses by sending an IPI message in order to flush tlb.
> However, crash indicates the target CPU has responded.
>
> A. CPU #28 is waiting for CPU #2' response.
>
> crash> cfd_data | grep -w 28
> [28]: ffff8883bc4469c0
> crash> struct call_function_data ffff8883bc4469c0
> struct call_function_data {
> csd = 0x4ca40,
> cpumask = 0xffff888103585e40,
> cpumask_ipi = 0xffff8881035858c0
> }
>
> crash> struct __call_single_data 0x4ca40:a | grep ffff8883bb74ca40
> [2]: ffff8883bb74ca40
>
> B. CPU #2 has responded because the bit 'CSD_FLAG_LOCK' of u_flags
> is cleared.
>
> crash> struct __call_single_data 0xffff8883bb74ca40
> struct __call_single_data {
> node = {
> ...
> {
> u_flags = 0,
> a_flags = {
> counter = 0
> }
> },
> ...
> },
> func = 0xffffffff8117b080 <do_kernel_range_flush>,
> info = 0xffff8883bc444940
> }
>
> C. CPU #2 is running userspace application 'nm'.
>
> crash> bt -c 2
> PID: 52035 TASK: ffff888194c21ac0 CPU: 2 COMMAND: "nm"
> #0 [ffffc90043157ea8] crash_nmi_callback at ffffffff81122f42
> #1 [ffffc90043157eb0] nmi_handle at ffffffff8108c988
> #2 [ffffc90043157f10] default_do_nmi at ffffffff835b3460
> #3 [ffffc90043157f30] exc_nmi at ffffffff835b3630
> #4 [ffffc90043157f50] asm_exc_nmi at ffffffff83601573
> RIP: 00007f6261b90d38 RSP: 00007ffe4d1cb180 RFLAGS: 00000202
> RAX: 0000000000000001 RBX: 6e6f2d7865646e69 RCX: 00007f626281f634
> RDX: 00007f6262ba7f67 RSI: 00007f626265f65e RDI: 00007f62648f8a30
> RBP: 2d6f746c6e696874 R8: 00007f62618a4cc0 R9: 0000000000000001
> R10: 00007f627233e488 R11: 00007f627233d768 R12: 000055bee0ff4b90
> R13: 000055bee0fac650 R14: 00007ffe4d1cb280 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
>
> D. The soft lockup CPU iteratively traverses 128 vmap_nodes (128 bits
> are set) and flushes tlb. This might be the time-consuming work.
>
> crash> p /x purge_nodes
> $1 = {
> bits = {0xffffffffffffffff, 0xffffffffffffffff, 0x0, ...}
>
> 2. [Ftrace] In order to prove that the soft lockup CPU spends too much
> time iterating vmap_nodes and flushing tlb when purging vm_area
> structures, the ftrace confirms the speculation (Some info is trimmed).
>
> kworker: funcgraph_entry: | drain_vmap_area_work() {
> kworker: funcgraph_entry: | mutex_lock() {
> kworker: funcgraph_entry: 1.092 us | __cond_resched();
> kworker: funcgraph_exit: 3.306 us | }
> ... ...
> kworker: funcgraph_entry: | flush_tlb_kernel_range() {
> ... ...
> kworker: funcgraph_exit: # 7533.649 us | }
> ... ...
> kworker: funcgraph_entry: 2.344 us | mutex_unlock();
> kworker: funcgraph_exit: $ 23871554 us | }
>
> The drain_vmap_area_work() spends over 23 seconds.
>
> There are 2805 flush_tlb_kernel_range() calls in this ftrace log.
> * One is called in __purge_vmap_area_lazy().
> * Others are called in kasan_release_vmalloc().
>
> 3. Extending the soft lockup time can work around the issue (For example,
> # echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the
> above-mentioned speculation: drain_vmap_area_work() spends too much
> time.
>
> Commit 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
> employs an effective vmap node logic. However, current design iterates
> 128 vmap_nodes and flushes tlb by a single CPU if
> vmap_lazy_nr < 2 * lazy_max_pages(). When enabling kasan, this might
> trigger the soft lockup because additional tlb flushes of kasan vmalloc
> spend much more time if 128 vmap nodes have the available purge list.
>
> Fix the issue by adding preempt point in purge_vmap_node() when
> enabling kasan.
>
> Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
> Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> Reviewed-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
> ---
> mm/vmalloc.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d0cbdd7c1e5b..380bdc317c8d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2193,6 +2193,15 @@ static void purge_vmap_node(struct work_struct *work)
> struct vmap_area *va, *n_va;
> LIST_HEAD(local_list);
>
> + /*
> + * Add the preemption point when enabling KASAN. Each vmap_area of
> + * vmap nodes has to flush tlb when releasing vmalloc, which might
> + * be the time-consuming work if lots of vamp nodes have the
> + * available purge list.
> + */
> + if (kasan_enabled())
> + cond_resched();
> +
> vn->nr_purged = 0;
>
> list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
> --
> 2.34.1
>
Thank you for highlighting this. We had a preempt point during purging
process. But it has been removed by the following commit:
<snip>
commit 282631cb2447318e2a55b41a665dbe8571c46d70
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date: Tue Jan 2 19:46:28 2024 +0100
mm: vmalloc: remove global purge_vmap_area_root rb-tree
<snip>
and it looks like a decision was wrong. We should restore this.
Could you please test it:
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 03b82fb8ecd3..6dc204b8495a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2190,10 +2190,12 @@ static void purge_vmap_node(struct work_struct *work)
{
struct vmap_node *vn = container_of(work,
struct vmap_node, purge_work);
+ unsigned long resched_threshold;
struct vmap_area *va, *n_va;
LIST_HEAD(local_list);
vn->nr_purged = 0;
+ resched_threshold = lazy_max_pages() << 1;
list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
@@ -2210,6 +2212,9 @@ static void purge_vmap_node(struct work_struct *work)
atomic_long_sub(nr, &vmap_lazy_nr);
vn->nr_purged++;
+ if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
+ cond_resched();
+
if (is_vn_id_valid(vn_id) && !vn->skip_populate)
if (node_pool_add_va(vn, va))
continue;
<snip>
Thank you in advance!
--
Uladzislau Rezki
next prev parent reply other threads:[~2024-07-05 15:36 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-05 13:08 [PATCH 1/1] mm/vmalloc: Add preempt point in purge_vmap_node() when enabling kasan Adrian Huang
2024-07-05 15:36 ` Uladzislau Rezki [this message]
2024-07-08 13:39 ` [External] " Adrian Huang12
2024-07-08 16:06 ` Uladzislau Rezki
2024-07-19 11:40 ` Uladzislau Rezki
2024-07-22 11:50 ` Adrian Huang
2024-07-22 12:40 ` Uladzislau Rezki
2024-07-23 10:50 ` Uladzislau Rezki
2024-07-24 12:46 ` Adrian Huang
2024-07-24 14:32 ` Uladzislau Rezki
2024-07-24 15:09 ` Uladzislau Rezki
2024-07-24 19:27 ` Uladzislau Rezki
2024-07-24 20:34 ` Uladzislau Rezki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZogS_04dP5LlRlXN@pc636 \
--to=urezki@gmail.com \
--cc=adrianhuang0701@gmail.com \
--cc=ahuang12@lenovo.com \
--cc=akpm@linux-foundation.org \
--cc=hch@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=sunjw10@lenovo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.