* CPU_DOWN_FAILED hits ASSERTs in scheduling logic @ 2024-05-28 11:22 Roger Pau Monné 2024-05-29 11:47 ` Jürgen Groß 0 siblings, 1 reply; 7+ messages in thread From: Roger Pau Monné @ 2024-05-28 11:22 UTC (permalink / raw) To: xen-devel; +Cc: Juergen Gross, Dario Faggioli, George Dunlap Hello, When the stop_machine_run() call in cpu_down() fails and calls the CPU notifier CPU_DOWN_FAILED hook the following assert triggers in the scheduling code: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- CPU: 0 RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 0000000000010093 CONTEXT: hypervisor rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 cr3: 00000000574c2000 cr2: 0000000000000000 fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e Xen stack trace from rsp=ffff83203ffffd30: ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffdf0 ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 ffff82d0402054f0 
ffff82d0404c5860 0000000000000001 ffff83202ec75000 ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Xen call trace: [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 [<ffff82d0402054f0>] F cpu_down+0xa7/0x143 [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27 [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf [<ffff82d040234631>] F do_tasklet+0x5b/0x8d [<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6 [<ffff82d040324391>] F continue_running+0x5b/0x5d **************************************** Panic on CPU 0: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111 **************************************** The issue seems to be that since the CPU hasn't been removed, it's still part of prv->initialized and the assert in csched2_free_pdata() called as part of free_cpu_rm_data() triggers. 
It's easy to reproduce by substituting the stop_machine_run() call in cpu_down() with an error.

Thanks, Roger.
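To make the reported invariant concrete, here is a minimal standalone C sketch; it is not Xen code. It assumes a plain 64-bit mask stands in for prv->initialized, and every model_* name is hypothetical and exists only for illustration. It shows that as long as the CPU's bit is still set (the CPU_DOWN_FAILED case, where the CPU was never actually removed), the condition checked by the ASSERT in csched2_free_pdata() cannot hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the scheduler private data: only the
 * "initialized" CPU mask is modelled, as a plain 64-bit bitmask. */
struct model_prv {
    uint64_t initialized;
};

/* Bit for a given CPU number; this model only supports CPUs 0..63. */
static uint64_t model_cpu_bit(unsigned int cpu)
{
    assert(cpu < 64);
    return UINT64_C(1) << cpu;
}

/* Models bringing a CPU under scheduler control. */
static void model_init_pdata(struct model_prv *prv, unsigned int cpu)
{
    prv->initialized |= model_cpu_bit(cpu);
}

/* Models the teardown that only happens when the CPU really goes down. */
static void model_deinit_pdata(struct model_prv *prv, unsigned int cpu)
{
    prv->initialized &= ~model_cpu_bit(cpu);
}

/* True when the condition checked by the reported ASSERT would hold,
 * i.e. the CPU is no longer part of the initialized mask. */
static bool model_free_pdata_allowed(const struct model_prv *prv,
                                     unsigned int cpu)
{
    return !(prv->initialized & model_cpu_bit(cpu));
}
```

In this model, model_free_pdata_allowed() stays false for a CPU after model_init_pdata() until model_deinit_pdata() has run, which mirrors the failed-offline path described in the report.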
* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic 2024-05-28 11:22 CPU_DOWN_FAILED hits ASSERTs in scheduling logic Roger Pau Monné @ 2024-05-29 11:47 ` Jürgen Groß 2024-05-29 12:46 ` Roger Pau Monné 0 siblings, 1 reply; 7+ messages in thread From: Jürgen Groß @ 2024-05-29 11:47 UTC (permalink / raw) To: Roger Pau Monné, xen-devel; +Cc: Dario Faggioli, George Dunlap [-- Attachment #1: Type: text/plain, Size: 4231 bytes --] On 28.05.24 13:22, Roger Pau Monné wrote: > Hello, > > When the stop_machine_run() call in cpu_down() fails and calls the CPU > notifier CPU_DOWN_FAILED hook the following assert triggers in the > scheduling code: > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 > ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- > CPU: 0 > RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 > RFLAGS: 0000000000010093 CONTEXT: hypervisor > rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 > rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 > rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 > r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f > r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 > r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 > cr3: 00000000574c2000 cr2: 0000000000000000 > fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 > ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 > Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): > fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e > Xen stack trace from rsp=ffff83203ffffd30: > ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 > 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 > 0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 > ffff82d0404c4620 ffff82d0404c4430 0000000000000006 
ffff83203ffffdf0 > ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 > 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 > ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 > ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000 > ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 > ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 > ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 > 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 > ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 > 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 > ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > Xen call trace: > [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 > [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 > [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 > [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 > [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 > [<ffff82d0402054f0>] F cpu_down+0xa7/0x143 > [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27 > [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd > [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf > [<ffff82d040234631>] F do_tasklet+0x5b/0x8d > [<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6 > [<ffff82d040324391>] F continue_running+0x5b/0x5d > > > **************************************** > Panic on CPU 0: > Assertion '!cpumask_test_cpu(cpu, 
&prv->initialized)' failed at common/sched/credit2.c:4111 > **************************************** > > The issue seems to be that since the CPU hasn't been removed, it's > still part of prv->initialized and the assert in csched2_free_pdata() > called as part of free_cpu_rm_data() triggers. > > It's easy to reproduce by substituting the stop_machine_run() call in > cpu_down() with an error. Could you please give the attached patch a try? Only compile tested, though... Juergen [-- Attachment #2: 0001-xen-sched-fix-error-path-of-cpu-removal.patch --] [-- Type: text/x-patch, Size: 5977 bytes --] From 5925f15ace8be186c9f41c0cda2d4a675b0f7bb9 Mon Sep 17 00:00:00 2001 From: Juergen Gross <jgross@suse.com> Date: Wed, 29 May 2024 13:24:36 +0200 Subject: [PATCH] xen/sched: fix error path of cpu removal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in the credit2 scheduler might trigger: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- CPU: 0 RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 0000000000010093 CONTEXT: hypervisor rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 cr3: 00000000574c2000 cr2: 0000000000000000 fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e 
Xen stack trace from rsp=ffff83203ffffd30: ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffdf0 ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000 ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Xen call trace: [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 [<ffff82d0402054f0>] F cpu_down+0xa7/0x143 [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27 [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf 
[<ffff82d040234631>] F do_tasklet+0x5b/0x8d
[<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6
[<ffff82d040324391>] F continue_running+0x5b/0x5d

This can be fixed by not freeing the cpu's scheduling data in case the
CPU_DYING step hasn't been performed. Instead the not used struct
sched_resource instances need to be freed in order to avoid memory
leaking.

Fixes: d84473689611 ("xen/sched: fix cpu hotplug")
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched/core.c    | 18 ++++++++++++++++--
 xen/common/sched/private.h |  1 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index d84b65f197..cbbeeaf0ca 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3264,6 +3264,8 @@ struct cpu_rm_data *alloc_cpu_rm_data(unsigned int cpu, bool aff_alloc)
         data->sr[idx]->schedule_lock = sr->schedule_lock;
     }
 
+    data->n_sr_unused = idx;
+
  out:
     rcu_read_unlock(&sched_res_rculock);
 
@@ -3272,8 +3274,18 @@ struct cpu_rm_data *alloc_cpu_rm_data(unsigned int cpu, bool aff_alloc)
 
 void free_cpu_rm_data(struct cpu_rm_data *mem, unsigned int cpu)
 {
-    sched_free_udata(mem->old_ops, mem->vpriv_old);
-    sched_free_pdata(mem->old_ops, mem->ppriv_old, cpu);
+    unsigned int idx;
+
+    if ( !mem->n_sr_unused )
+    {
+        sched_free_udata(mem->old_ops, mem->vpriv_old);
+        sched_free_pdata(mem->old_ops, mem->ppriv_old, cpu);
+    }
+    else
+    {
+        for ( idx = 0; idx < mem->n_sr_unused; idx++ )
+            sched_res_free(&mem->sr[idx]->rcu);
+    }
 
     free_affinity_masks(&mem->affinity);
     xfree(mem);
@@ -3312,6 +3324,8 @@ int schedule_cpu_rm(unsigned int cpu, struct cpu_rm_data *data)
     /* See comment in schedule_cpu_add() regarding lock switching. */
     old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
 
+    data->n_sr_unused = 0;
+
     for_each_cpu ( cpu_iter, sr->cpus )
     {
         per_cpu(sched_res_idx, cpu_iter) = 0;
diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
index c0e7c96d24..f0129fd9cc 100644
--- a/xen/common/sched/private.h
+++ b/xen/common/sched/private.h
@@ -645,6 +645,7 @@ struct cpu_rm_data {
     const struct scheduler *old_ops;
     void *ppriv_old;
     void *vpriv_old;
+    unsigned int n_sr_unused;
     struct sched_resource *sr[];
 };
 
-- 
2.35.3
* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic 2024-05-29 11:47 ` Jürgen Groß @ 2024-05-29 12:46 ` Roger Pau Monné 2024-05-29 13:08 ` Jürgen Groß 0 siblings, 1 reply; 7+ messages in thread From: Roger Pau Monné @ 2024-05-29 12:46 UTC (permalink / raw) To: Jürgen Groß; +Cc: xen-devel, Dario Faggioli, George Dunlap On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote: > On 28.05.24 13:22, Roger Pau Monné wrote: > > Hello, > > > > When the stop_machine_run() call in cpu_down() fails and calls the CPU > > notifier CPU_DOWN_FAILED hook the following assert triggers in the > > scheduling code: > > > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 > > ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- > > CPU: 0 > > RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 > > RFLAGS: 0000000000010093 CONTEXT: hypervisor > > rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 > > rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 > > rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 > > r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f > > r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 > > r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 > > cr3: 00000000574c2000 cr2: 0000000000000000 > > fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 > > ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 > > Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): > > fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e > > Xen stack trace from rsp=ffff83203ffffd30: > > ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 > > 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 > > 0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 > > ffff82d0404c4620 ffff82d0404c4430 0000000000000006 
ffff83203ffffdf0 > > ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 > > 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 > > ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 > > ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000 > > ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 > > ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 > > ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 > > 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 > > ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 > > 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 > > ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > Xen call trace: > > [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 > > [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 > > [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 > > [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 > > [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 > > [<ffff82d0402054f0>] F cpu_down+0xa7/0x143 > > [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27 > > [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd > > [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf > > [<ffff82d040234631>] F do_tasklet+0x5b/0x8d > > [<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6 > > [<ffff82d040324391>] F continue_running+0x5b/0x5d > > > > > > 
**************************************** > > Panic on CPU 0: > > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111 > > **************************************** > > > > The issue seems to be that since the CPU hasn't been removed, it's > > still part of prv->initialized and the assert in csched2_free_pdata() > > called as part of free_cpu_rm_data() triggers. > > > > It's easy to reproduce by substituting the stop_machine_run() call in > > cpu_down() with an error. > > Could you please give the attached patch a try? I still get the following assert: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111 ----[ Xen-4.19-unstable x86_64 debug=y Not tainted ]---- CPU: 0 RIP: e008:[<ffff82d040248173>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 0000000000010093 CONTEXT: hypervisor rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 rbp: ffff83203ffffd50 rsp: ffff83203ffffd28 r8: 0000000000000000 r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 cr3: 00000000574c2000 cr2: 0000000000000000 fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 Xen code around <ffff82d040248173> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e Xen stack trace from rsp=ffff83203ffffd28: 0000000000000001 ffff83202e029100 ffff82d0404c4430 0000000000000006 0000000000000000 ffff83203ffffd70 ffff82d040257338 0000000000000000 0000000000000001 ffff83203ffffda0 ffff82d04021f1e6 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffde8 ffff82d04022bb2f ffff83203ffffe10 0000000000000001 0000000000000001 
0000000000000000 ffff83203ffffe10 0000000000000000 ffff82d0405e6500 ffff83203ffffe00 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 ffff82d040205464 ffff82d0404c5860 0000000000000001 ffff83202ec5d000 0000000000000000 ffff83203ffffe48 ffff82d040348bd0 ffff83202e0290d0 ffff83203ffffe68 ffff82d04020708d ffff83202ec5d1d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402342a3 ffff82d0405ce200 ffff83203ffffeb0 ffff82d04023450b 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8 ffff82d040321363 ffff82d0403212eb ffff83202f3a9000 0000000000000000 00000014b1552bff ffff82d0405e6500 ffff83203ffffde0 ffff82d0403242e3 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Xen call trace: [<ffff82d040248173>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [<ffff82d040257338>] F free_cpu_rm_data+0x48/0x80 [<ffff82d04021f1e6>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [<ffff82d04022bb2f>] F notifier_call_chain+0x6c/0x96 [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 [<ffff82d040205464>] F cpu_down+0x60/0x83 [<ffff82d040348bd0>] F cpu_down_helper+0x11/0x27 [<ffff82d04020708d>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd [<ffff82d0402342a3>] F common/tasklet.c#do_tasklet_work+0x76/0xaf [<ffff82d04023450b>] F do_tasklet+0x5b/0x8d [<ffff82d040321363>] F arch/x86/domain.c#idle_loop+0x78/0xe6 [<ffff82d0403242e3>] F continue_running+0x5b/0x5d **************************************** Panic on CPU 0: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111 **************************************** Reboot in five seconds... 
Resetting with ACPI MEMORY or I/O RESET_REG.

I have the following bodge to trigger the failure:

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index d76f80fe2e99..b38796038e31 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -126,6 +126,7 @@ int cpu_down(unsigned int cpu)
     if ( err )
         goto fail;
 
+    goto fail;
     if ( system_state < SYS_STATE_active || system_state == SYS_STATE_resume )
         on_selected_cpus(cpumask_of(cpu), _take_cpu_down, NULL, true);
     else if ( (err = stop_machine_run(take_cpu_down, NULL, cpu)) < 0 )

Thanks, Roger.
* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic 2024-05-29 12:46 ` Roger Pau Monné @ 2024-05-29 13:08 ` Jürgen Groß 2024-05-29 16:03 ` Roger Pau Monné 0 siblings, 1 reply; 7+ messages in thread From: Jürgen Groß @ 2024-05-29 13:08 UTC (permalink / raw) To: Roger Pau Monné; +Cc: xen-devel, Dario Faggioli, George Dunlap [-- Attachment #1: Type: text/plain, Size: 4617 bytes --] On 29.05.24 14:46, Roger Pau Monné wrote: > On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote: >> On 28.05.24 13:22, Roger Pau Monné wrote: >>> Hello, >>> >>> When the stop_machine_run() call in cpu_down() fails and calls the CPU >>> notifier CPU_DOWN_FAILED hook the following assert triggers in the >>> scheduling code: >>> >>> Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 >>> ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- >>> CPU: 0 >>> RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 >>> RFLAGS: 0000000000010093 CONTEXT: hypervisor >>> rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 >>> rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 >>> rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 >>> r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f >>> r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 >>> r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 >>> cr3: 00000000574c2000 cr2: 0000000000000000 >>> fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 >>> ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 >>> Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): >>> fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e >>> Xen stack trace from rsp=ffff83203ffffd30: >>> ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 >>> 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 >>> 0000000000000001 
ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 >>> ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffdf0 >>> ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 >>> 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 >>> ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 >>> ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000 >>> ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 >>> ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 >>> ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 >>> 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 >>> ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 >>> 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 >>> ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 >>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >>> 0000000000000000 0000000000000000 0000000000000000 0000000000000000 >>> Xen call trace: >>> [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 >>> [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 >>> [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 >>> [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 >>> [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 >>> [<ffff82d0402054f0>] F cpu_down+0xa7/0x143 >>> [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27 >>> [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd >>> [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf >>> [<ffff82d040234631>] F do_tasklet+0x5b/0x8d >>> [<ffff82d040321411>] F 
arch/x86/domain.c#idle_loop+0x78/0xe6 >>> [<ffff82d040324391>] F continue_running+0x5b/0x5d >>> >>> >>> **************************************** >>> Panic on CPU 0: >>> Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:4111 >>> **************************************** >>> >>> The issue seems to be that since the CPU hasn't been removed, it's >>> still part of prv->initialized and the assert in csched2_free_pdata() >>> called as part of free_cpu_rm_data() triggers. >>> >>> It's easy to reproduce by substituting the stop_machine_run() call in >>> cpu_down() with an error. >> >> Could you please give the attached patch a try? > > I still get the following assert: Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all the time. :-( Next try. Juergen [-- Attachment #2: 0001-xen-sched-fix-error-path-of-cpu-removal.patch --] [-- Type: text/x-patch, Size: 5985 bytes --] From 8b707b0e4d68b1636f1c5ff0e2ef54bea5d9921a Mon Sep 17 00:00:00 2001 From: Juergen Gross <jgross@suse.com> Date: Wed, 29 May 2024 13:24:36 +0200 Subject: [PATCH] xen/sched: fix error path of cpu removal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in the credit2 scheduler might trigger: Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/cred1 ----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]---- CPU: 0 RIP: e008:[<ffff82d040248299>] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 0000000000010093 CONTEXT: hypervisor rax: 0000000000000000 rbx: ffff83202ecc2f80 rcx: ffff83202f3e64c0 rdx: 0000000000000001 rsi: 0000000000000002 rdi: ffff83202ecc2f88 rbp: ffff83203ffffd58 rsp: ffff83203ffffd30 r8: 0000000000000000 r9: ffff83202f3e6e01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f r12: ffff83202ecb80b0 r13: 0000000000000001 r14: 0000000000000282 r15: ffff83202ecbbf00 cr0: 000000008005003b cr4: 00000000007526e0 
cr3: 00000000574c2000 cr2: 0000000000000000 fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 Xen code around <ffff82d040248299> (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e Xen stack trace from rsp=ffff83203ffffd30: ffff83202d74d100 0000000000000001 ffff82d0404c4430 0000000000000006 0000000000000000 ffff83203ffffd78 ffff82d040257454 0000000000000000 0000000000000001 ffff83203ffffda8 ffff82d04021f303 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430 0000000000000006 ffff83203ffffdf0 ffff82d04022bc4c ffff83203ffffe18 0000000000000001 0000000000000001 00000000fffffff0 0000000000000000 0000000000000000 ffff82d0405e6500 ffff83203ffffe08 ffff82d040204fd5 0000000000000001 ffff83203ffffe30 ffff82d0402054f0 ffff82d0404c5860 0000000000000001 ffff83202ec75000 ffff83203ffffe48 ffff82d040348c25 ffff83202d74d0d0 ffff83203ffffe68 ffff82d0402071aa ffff83202ec751d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402343c9 ffff82d0405ce200 ffff83203ffffeb0 ffff82d040234631 0000000000000000 0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8 ffff82d040321411 ffff82d040321399 ffff83202f3a9000 0000000000000000 0000001d91a6fa2d ffff82d0405e6500 ffff83203ffffde0 ffff82d040324391 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Xen call trace: [<ffff82d040248299>] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [<ffff82d040257454>] F free_cpu_rm_data+0x41/0x58 [<ffff82d04021f303>] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [<ffff82d04022bc4c>] F notifier_call_chain+0x6c/0x96 
   [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
   [<ffff82d0402054f0>] F cpu_down+0xa7/0x143
   [<ffff82d040348c25>] F cpu_down_helper+0x11/0x27
   [<ffff82d0402071aa>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
   [<ffff82d0402343c9>] F common/tasklet.c#do_tasklet_work+0x76/0xaf
   [<ffff82d040234631>] F do_tasklet+0x5b/0x8d
   [<ffff82d040321411>] F arch/x86/domain.c#idle_loop+0x78/0xe6
   [<ffff82d040324391>] F continue_running+0x5b/0x5d

This can be fixed by not freeing the cpu's scheduling data in case the
CPU_DYING step hasn't been performed. Instead the not used struct
sched_resource instances need to be freed in order to avoid memory
leaking.

Fixes: d84473689611 ("xen/sched: fix cpu hotplug")
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/sched/core.c    | 18 ++++++++++++++++--
 xen/common/sched/private.h |  1 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index d84b65f197..199793fc9e 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3264,6 +3264,8 @@ struct cpu_rm_data *alloc_cpu_rm_data(unsigned int cpu, bool aff_alloc)
         data->sr[idx]->schedule_lock = sr->schedule_lock;
     }
 
+    data->n_sr_unused = idx + 1;
+
  out:
     rcu_read_unlock(&sched_res_rculock);
 
@@ -3272,8 +3274,18 @@ struct cpu_rm_data *alloc_cpu_rm_data(unsigned int cpu, bool aff_alloc)
 
 void free_cpu_rm_data(struct cpu_rm_data *mem, unsigned int cpu)
 {
-    sched_free_udata(mem->old_ops, mem->vpriv_old);
-    sched_free_pdata(mem->old_ops, mem->ppriv_old, cpu);
+    unsigned int idx;
+
+    if ( !mem->n_sr_unused )
+    {
+        sched_free_udata(mem->old_ops, mem->vpriv_old);
+        sched_free_pdata(mem->old_ops, mem->ppriv_old, cpu);
+    }
+    else
+    {
+        for ( idx = 0; idx < mem->n_sr_unused - 1; idx++ )
+            sched_res_free(&mem->sr[idx]->rcu);
+    }
 
     free_affinity_masks(&mem->affinity);
     xfree(mem);
@@ -3312,6 +3324,8 @@ int schedule_cpu_rm(unsigned int cpu, struct cpu_rm_data *data)
     /* See comment in schedule_cpu_add() regarding lock switching. */
     old_lock = pcpu_schedule_lock_irqsave(cpu, &flags);
 
+    data->n_sr_unused = 0;
+
     for_each_cpu ( cpu_iter, sr->cpus )
     {
         per_cpu(sched_res_idx, cpu_iter) = 0;
diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
index c0e7c96d24..f0129fd9cc 100644
--- a/xen/common/sched/private.h
+++ b/xen/common/sched/private.h
@@ -645,6 +645,7 @@ struct cpu_rm_data {
     const struct scheduler *old_ops;
     void *ppriv_old;
     void *vpriv_old;
+    unsigned int n_sr_unused;
     struct sched_resource *sr[];
 };
 
-- 
2.35.3

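The bookkeeping this patch introduces can be followed outside the hypervisor with a small stand-alone model. The sketch below is only a reading aid, not Xen code: struct rm_data_model and the model_*() helpers are hypothetical names, and the two "freed" fields merely record which branch of free_cpu_rm_data() would have run. It assumes, as the hunks suggest, that idx counts the pre-allocated spare sched_resource instances and that the "+ 1" exists so the counter stays non-zero even when there are none.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct cpu_rm_data; only what the model needs. */
struct rm_data_model {
    unsigned int n_sr_unused;  /* non-zero until the removal has been committed */
    unsigned int freed_sr;     /* spare resources the free step would release */
    bool freed_sched_data;     /* whether old per-CPU scheduler data is released */
};

/* Models alloc_cpu_rm_data(): idx spare resources were pre-allocated. */
void model_alloc(struct rm_data_model *d, unsigned int idx)
{
    d->n_sr_unused = idx + 1;  /* stays non-zero even when idx == 0 */
    d->freed_sr = 0;
    d->freed_sched_data = false;
}

/* Models schedule_cpu_rm(): the spare resources are consumed by the removal. */
void model_commit(struct rm_data_model *d)
{
    d->n_sr_unused = 0;
}

/* Models the branch taken in free_cpu_rm_data(). */
void model_free(struct rm_data_model *d)
{
    if ( !d->n_sr_unused )
        d->freed_sched_data = true;        /* removal happened: drop old data */
    else
        d->freed_sr = d->n_sr_unused - 1;  /* removal aborted: drop the spares */
}
```

In this model an aborted removal (model_alloc() followed directly by model_free()) never touches the scheduler data of a CPU that is still in use, which is the situation the reported assertion guards against.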
* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic
  2024-05-29 13:08       ` Jürgen Groß
@ 2024-05-29 16:03         ` Roger Pau Monné
  2024-05-30 12:45           ` Jürgen Groß
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Pau Monné @ 2024-05-29 16:03 UTC
To: Jürgen Groß; +Cc: xen-devel, Dario Faggioli, George Dunlap

On Wed, May 29, 2024 at 03:08:49PM +0200, Jürgen Groß wrote:
> On 29.05.24 14:46, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:
> > > On 28.05.24 13:22, Roger Pau Monné wrote:
> > > > Hello,
> > > >
> > > > When the stop_machine_run() call in cpu_down() fails and calls the CPU
> > > > notifier CPU_DOWN_FAILED hook the following assert triggers in the
> > > > scheduling code:
> > > >
> > > > [...]
> > > >
> > > > The issue seems to be that since the CPU hasn't been removed, it's
> > > > still part of prv->initialized and the assert in csched2_free_pdata()
> > > > called as part of free_cpu_rm_data() triggers.
> > > >
> > > > It's easy to reproduce by substituting the stop_machine_run() call in
> > > > cpu_down() with an error.
> > >
> > > Could you please give the attached patch a try?
> >
> > I still get the following assert:
>
> Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all
> the time. :-(
>
> Next try.

I'm afraid I have a new trace for you:

Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
----[ Xen-4.19-unstable x86_64 debug=y Not tainted ]----
CPU: 0
RIP: e008:[<ffff82d040247d27>] common/sched/credit2.c#csched2_switch_sched+0x115/0x339
RFLAGS: 0000000000010093 CONTEXT: hypervisor
rax: 000000000000c000 rbx: 0000000000000001 rcx: ffff82d0405e6500
rdx: 0000004feee13000 rsi: 0000000000000004 rdi: ffff83202ecc2f88
rbp: ffff83203ffffc80 rsp: ffff83203ffffc38 r8: 0000000000000000
r9: ffff83202ecbbf01 r10: 0000000000000000 r11: 0f0f0f0f0f0f0f0f
r12: ffff83202ecc2f80 r13: ffff83402ca50100 r14: ffff83402ca50140
r15: ffff83202ecc2f88 cr0: 000000008005003b cr4: 00000000007526e0
cr3: 00000000574c2000 cr2: 0000000000000000
fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000
ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
Xen code around <ffff82d040247d27> (common/sched/credit2.c#csched2_switch_sched+0x115/0x339):
 7c ff ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 41 8b 56 30 89 de 48 8d 3d e8 00 1a
Xen stack trace from rsp=ffff83203ffffc38:
   ffff83203ffffc48 ffff82d0402332ba ffff83203ffffc68 ffff82d04023343d
   0000000000000001 ffff82d0405cf398 ffff83402ca50100 ffff82d0405e6500
   ffff83202ecbbdb0 ffff83203ffffd18 ffff82d040256e1a ffff83203fff386c
   ffff83203fff2000 0000000000000005 ffff83202ecbbf00 ffff83402ca50140
   ffff83203fff3868 0000000000000282 0000000040233509 ffff83202ecbbdb0
   ffff83402ca50100 ffff83202f3e6d80 ffff83202ecc2ec0 ffff83202ecc2ec0
   0000000000000001 ffff82d0403da460 0000000000000048 0000000000000000
   ffff83203ffffd48 ffff82d0402414b7 0000000000000001 0000000000000000
   ffff82d0403da460 0000000000000006 ffff83203ffffd70 ffff82d04024173d
   0000000000000000 0000000000000001 ffff82d0404c4430 ffff83203ffffda0
   ffff82d04021f1f9 ffff82d0404c4628 ffff82d0404c4620 ffff82d0404c4430
   0000000000000006 ffff83203ffffde8 ffff82d04022bb2f ffff83203ffffe10
   0000000000000001 0000000000000001 0000000000000000 ffff83203ffffe10
   0000000000000000 ffff82d0405e6500 ffff83203ffffe00 ffff82d040204fd5
   0000000000000001 ffff83203ffffe30 ffff82d040205464 ffff82d0404c5860
   0000000000000001 ffff83202ec86000 0000000000000000 ffff83203ffffe48
   ffff82d040348c32 ffff83402ca500d0 ffff83203ffffe68 ffff82d04020708d
   ffff83202ec861d0 ffff82d0405ce210 ffff83203ffffe80 ffff82d0402342a3
   ffff82d0405ce200 ffff83203ffffeb0 ffff82d04023450b 0000000000000000
   0000000000007fff ffff82d0405d5080 ffff82d0405ce210 ffff83203ffffee8
Xen call trace:
   [<ffff82d040247d27>] R common/sched/credit2.c#csched2_switch_sched+0x115/0x339
   [<ffff82d040256e1a>] F schedule_cpu_add+0x1a4/0x463
   [<ffff82d0402414b7>] F common/sched/cpupool.c#cpupool_assign_cpu_locked+0x5a/0x17e
   [<ffff82d04024173d>] F common/sched/cpupool.c#cpupool_cpu_add+0x162/0x16c
   [<ffff82d04021f1f9>] F common/sched/cpupool.c#cpu_callback+0x10e/0x466
   [<ffff82d04022bb2f>] F notifier_call_chain+0x6c/0x96
   [<ffff82d040204fd5>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
   [<ffff82d040205464>] F cpu_down+0x60/0x83
   [<ffff82d040348c32>] F cpu_down_helper+0x11/0x27
   [<ffff82d04020708d>] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
   [<ffff82d0402342a3>] F common/tasklet.c#do_tasklet_work+0x76/0xaf
   [<ffff82d04023450b>] F do_tasklet+0x5b/0x8d
   [<ffff82d040321372>] F arch/x86/domain.c#idle_loop+0x78/0xe6
   [<ffff82d0403242f2>] F continue_running+0x5b/0x5d


****************************************
Panic on CPU 0:
Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
****************************************

This time is one of the asserts in init_pdata().

Roger.

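Going only by the call trace above (cpu_callback() reaching schedule_cpu_add() from cpu_down()), the second failure can be pictured with a toy model of the per-scheduler "initialized" mask. This is a sketch under stated assumptions, not the credit2 implementation: struct sched_priv_model and every model_*() function are hypothetical, a single unsigned long replaces the cpumask, and a false return value marks the point where the real code would hit its ASSERT.

```c
#include <stdbool.h>

/* Hypothetical model: one bit per CPU, standing in for prv->initialized. */
struct sched_priv_model {
    unsigned long initialized;
};

bool model_cpu_initialized(const struct sched_priv_model *prv, unsigned int cpu)
{
    return (prv->initialized & (1UL << cpu)) != 0;
}

/* Add path: returns false where the real code would trip its ASSERT. */
bool model_init_pdata(struct sched_priv_model *prv, unsigned int cpu)
{
    if ( model_cpu_initialized(prv, cpu) )
        return false;              /* CPU is already known to the scheduler */
    prv->initialized |= 1UL << cpu;
    return true;
}

/* Teardown that only runs once a removal has really taken place. */
void model_deinit_pdata(struct sched_priv_model *prv, unsigned int cpu)
{
    prv->initialized &= ~(1UL << cpu);
}

/*
 * One cpu_down() attempt. If the stop-machine step fails, nothing was torn
 * down, yet the CPU_DOWN_FAILED rollback re-adds the CPU to the scheduler.
 * Returns whether the add-path precondition held.
 */
bool model_cpu_down(struct sched_priv_model *prv, unsigned int cpu,
                    bool stop_machine_ok)
{
    if ( stop_machine_ok )
    {
        model_deinit_pdata(prv, cpu);
        return true;
    }

    return model_init_pdata(prv, cpu);  /* rollback: add a CPU never removed */
}
```

A failed attempt leaves the bit set and then tries to set it again, which matches the symptom reported here; how the rollback should be restructured in Xen itself is not something this model tries to answer.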
* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic
  2024-05-29 16:03         ` Roger Pau Monné
@ 2024-05-30 12:45           ` Jürgen Groß
  2024-05-30 13:15             ` Roger Pau Monné
  0 siblings, 1 reply; 7+ messages in thread
From: Jürgen Groß @ 2024-05-30 12:45 UTC
To: Roger Pau Monné; +Cc: xen-devel, Dario Faggioli, George Dunlap

On 29.05.24 18:03, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:08:49PM +0200, Jürgen Groß wrote:
>> Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all
>> the time. :-(
>>
>> Next try.
>
> I'm afraid I have a new trace for you:
>
> Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
> [...]
>
> This time is one of the asserts in init_pdata().

Yeah, the reason is similar, but fixing this is a little bit more work
than the other patch.

Not sure I'll manage to do this before Xen Summit.


Juergen

* Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic
  2024-05-30 12:45           ` Jürgen Groß
@ 2024-05-30 13:15             ` Roger Pau Monné
  0 siblings, 0 replies; 7+ messages in thread
From: Roger Pau Monné @ 2024-05-30 13:15 UTC
To: Jürgen Groß; +Cc: xen-devel, Dario Faggioli, George Dunlap

On Thu, May 30, 2024 at 02:45:18PM +0200, Jürgen Groß wrote:
> On 29.05.24 18:03, Roger Pau Monné wrote:
> > I'm afraid I have a new trace for you:
> >
> > Assertion '!cpumask_test_cpu(cpu, &prv->initialized)' failed at common/sched/credit2.c:3987
> > [...]
> >
> > This time is one of the asserts in init_pdata().
>
> Yeah, the reason is similar, but fixing this is a little bit more work
> than the other patch.
>
> Not sure I'll manage to do this before Xen Summit.

No worries, I'm not in a rush. I'm happy as long as it's on your
plate and not mine :).

Thanks, Roger.

end of thread, other threads:[~2024-05-30 13:15 UTC]

Thread overview: 7+ messages
2024-05-28 11:22 CPU_DOWN_FAILED hits ASSERTs in scheduling logic Roger Pau Monné
2024-05-29 11:47 ` Jürgen Groß
2024-05-29 12:46   ` Roger Pau Monné
2024-05-29 13:08     ` Jürgen Groß
2024-05-29 16:03       ` Roger Pau Monné
2024-05-30 12:45         ` Jürgen Groß
2024-05-30 13:15           ` Roger Pau Monné