* 5.10-dovetail regression?
From: Jan Kiszka @ 2022-04-07 14:12 UTC
To: Philippe Gerum; +Cc: Xenomai

Hi Philippe,

does this already ring some bell?

https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210

Only triggers with qemu-amd64, not on real HW and not with 5.15.

Jan

--
Siemens AG, Technology Competence Center Embedded Linux
* Re: 5.10-dovetail regression?
From: Philippe Gerum @ 2022-04-07 14:25 UTC
To: Jan Kiszka; +Cc: Xenomai

Jan Kiszka <jan.kiszka@siemens.com> writes:

> Only triggers with qemu-amd64, not on real HW and not with 5.15.

8e2c09ee5323 is most likely causing this. It's a backport of the fix
developed for 5.15. I have a kvm-aarch64 setup which I routinely use
too, I'll reproduce and fix this.

--
Philippe.
* Re: 5.10-dovetail regression?
From: Philippe Gerum @ 2022-04-07 14:37 UTC
To: Jan Kiszka; +Cc: Xenomai

Philippe Gerum <rpm@xenomai.org> writes:

> 8e2c09ee5323 is most likely causing this. It's a backport of the fix
> developed for 5.15. I have a kvm-aarch64 setup which I routinely use
> too, I'll reproduce and fix this.

Sorry, I meant x86_64, not aarch64.

--
Philippe.
* Re: 5.10-dovetail regression?
From: Philippe Gerum @ 2022-04-07 15:24 UTC
To: Jan Kiszka; +Cc: Xenomai

Jan Kiszka <jan.kiszka@siemens.com> writes:

> Only triggers with qemu-amd64, not on real HW and not with 5.15.

I could not reproduce locally, but visual inspection revealed something
fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 2651c6cfd034..da6735d45a8a 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -644,8 +644,8 @@ void clockevents_exchange_device(struct clock_event_device *old,
 		 * to the release list, keep it around but mark it as
 		 * reserved.
 		 */
+		list_del(&old->list);
 		if (tick_check_is_proxy(new)) {
-			list_del(&old->list);
 			clockevents_switch_state(old, CLOCK_EVT_STATE_RESERVED);
 		} else {
 			clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED);

--
Philippe.
* Re: 5.10-dovetail regression?
From: Jan Kiszka @ 2022-04-07 19:33 UTC
To: Philippe Gerum; +Cc: Xenomai

On 07.04.22 17:24, Philippe Gerum wrote:
> I could not reproduce locally, but visual inspection revealed something
> fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,

Didn't reproduce locally for me either, though I used the same image.
But the patch helped on the CI system.

Thanks,
Jan
* Re: 5.10-dovetail regression?
From: Philippe Gerum @ 2022-04-09 9:16 UTC
To: Jan Kiszka; +Cc: Xenomai

Jan Kiszka <jan.kiszka@siemens.com> writes:

> But the patch helped on the CI system.

It does not seem to be enough though; that patch actually fixes a
different bug. So there are two of them:

1. a lockup when running "corectl --stop" on 5.10/kvm_x86
   configurations, not reproducible here on any other setup;

2. list poisoning, which triggers an assertion at boot on "some" x86
   configurations.

The patch above definitely fixes #1, which makes sense. I managed to
reproduce #2 on real hw, with kernel 5.15 this time.
Same gremlin:

[    2.052096] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1152
[    2.052273] ------------[ cut here ]------------
[    2.053250] list_del corruption, ffff8881001ce0b8->next is LIST_POISON1 (dead000000000100)
[    2.053250] WARNING: CPU: 0 PID: 1 at lib/list_debug.c:45 __list_del_entry_valid+0x81/0xe0
[    2.053250] Modules linked in:
[    2.053250] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.32+ #100
[    2.053250] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.09.16.05 07/26/2017
[    2.053250] IRQ stage: Linux
[    2.053250] RIP: 0010:__list_del_entry_valid+0x81/0xe0
[    2.053250] Code: 85 c5 ff 49 8b 55 08 4c 39 e2 75 5b b8 01 00 00 00 5d 41 5c 41 5d c3 4c 89 ea 48 8d 75 00 48 c7 c7 80 99 80 ad e8 ea fb 83 00 <0f> 0b 5d 41 5c 31 c0 41 5d c3 49 8d 14 24 48 8d 75 00 48 c7 c7 e0
[    2.053250] RSP: 0000:ffff888100287dc0 EFLAGS: 00010246
[    2.053250] RAX: 0000000000000000 RBX: ffff8881001ce000 RCX: 0000000000000000
[    2.053250] RDX: 0000000000000002 RSI: 0000000000000008 RDI: ffffed1020050fae
[    2.053250] RBP: ffff8881001ce0b8 R08: ffffffffac22b384 R09: ffffffffac279120
[    2.053250] R10: ffff888100287aaf R11: ffffed1020050f55 R12: dead000000000122
[    2.053250] R13: dead000000000100 R14: 0000000000000002 R15: ffffffffadff62a0
[    2.053250] FS:  0000000000000000(0000) GS:ffff88815c800000(0000) knlGS:0000000000000000
[    2.053250] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.053250] CR2: ffff888104e01000 CR3: 0000000103e10000 CR4: 00000000003506f0
[    2.053250] Call Trace:
[    2.053250]  <TASK>
[    2.053250]  clockevents_exchange_device+0x16c/0x2a0
[    2.053250]  tick_check_new_device+0x1c3/0x230
[    2.053250]  clockevents_register_device+0xc3/0x170
[    2.053250]  setup_boot_APIC_clock+0x526/0x553
[    2.053250]  ? default_ioapic_phys_id_map+0x40/0x40
[    2.053250]  native_smp_prepare_cpus+0x2cd/0x3ef
[    2.053250]  kernel_init_freeable+0xc0/0x290
[    2.053250]  ? rest_init+0xe0/0xe0
[    2.053250]  kernel_init+0x19/0x130
[    2.053250]  ret_from_fork+0x22/0x30
[    2.053250]  </TASK>

I'm on it.

--
Philippe.
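The first line of that warning can be read with a simplified userspace model of the kernel's list debugging. This is a sketch, not kernel code: list_del_entry_valid() is a reduced stand-in for __list_del_entry_valid() in lib/list_debug.c, the poison constants are the x86_64 values that also show up in R12/R13 above, and making list_del() return bool instead of WARNing is an assumption made so the behaviour can be checked. A `->next` equal to LIST_POISON1 means the entry was already unlinked by an earlier list_del() and is now being unlinked again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* x86_64 poison values as they appear in the oops (userspace model). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000122ULL)

struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Reduced model of the CONFIG_DEBUG_LIST sanity checks on deletion. */
static bool list_del_entry_valid(const struct list_head *entry)
{
    const struct list_head *prev = entry->prev, *next = entry->next;

    /* Poison checks come first so poisoned pointers are never followed. */
    if (next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (const void *)entry);
        return false;
    }
    if (prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption, %p->prev is LIST_POISON2\n",
                (const void *)entry);
        return false;
    }
    /* Neighbours must still point back at the entry. */
    return prev->next == entry && next->prev == entry;
}

/* Unlink and poison; returns false (and leaves the entry alone)
 * where the kernel would WARN. */
static bool list_del(struct list_head *entry)
{
    if (!list_del_entry_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return true;
}
```

Calling list_del() twice on the same entry succeeds the first time and fails the second time with a message shaped like the one in the oops.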
* Re: 5.10-dovetail regression?
From: Philippe Gerum @ 2022-04-09 9:32 UTC
To: Jan Kiszka; +Cc: Xenomai

Philippe Gerum <rpm@xenomai.org> writes:

> The patch above definitely fixes #1, makes sense. I managed to reproduce
> #2 on real hw, with kernel 5.15 this time. Same gremlin:
>
> [    2.053250] list_del corruption, ffff8881001ce0b8->next is LIST_POISON1 (dead000000000100)
> [...]
>
> I'm on it.

OK, so the first patch is not a fix; it's plain nonsense, and it is
responsible for the second issue in my test case. Back to square #1.
Still on it.

--
Philippe.
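How hoisting list_del() above the branch can lead to exactly this warning can be sketched with a small userspace model. It assumes, as the "to the release list" comment in the diff and upstream clockevents_exchange_device() suggest, that the non-proxy branch goes on to queue the old device on a release list with list_move(). The lists `active` and `released` and the functions release_old_before() and release_old_patched() are hypothetical names for this illustration, and the clockevents state switches are left out. Because list_move() unlinks the entry before re-adding it, it trips over the poison that the hoisted list_del() has just written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-ins for the x86_64 list poison values. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000122ULL)

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical lists standing in for the registered/released devices. */
static struct list_head active, released;

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink after a poison check, modelling __list_del_entry() with DEBUG_LIST. */
static bool checked_unlink(struct list_head *entry)
{
    if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption at %p\n", (void *)entry);
        return false;
    }
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    return true;
}

static bool list_del(struct list_head *entry)
{
    if (!checked_unlink(entry))
        return false;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return true;
}

/* list_move() unlinks first, so it runs into the same check. */
static bool list_move(struct list_head *entry, struct list_head *head)
{
    if (!checked_unlink(entry))
        return false;
    list_add(entry, head);
    return true;
}

/* Before the patch: only the proxy path calls list_del(). */
static bool release_old_before(struct list_head *old, bool new_is_proxy)
{
    if (new_is_proxy)
        return list_del(old);           /* kept aside as "reserved" */
    return list_move(old, &released);   /* queued for release */
}

/* With the patch: list_del() is hoisted above the branch. */
static bool release_old_patched(struct list_head *old, bool new_is_proxy)
{
    if (!list_del(old))
        return false;
    if (new_is_proxy)
        return true;
    return list_move(old, &released);   /* old is already poisoned here */
}
```

With new_is_proxy set to false, the pre-patch variant moves the entry onto the release list cleanly, while the patched variant fails inside list_move(). That is consistent with clockevents_exchange_device() being the caller of __list_del_entry_valid() in the boot trace, where a newly registered APIC clock replaces the current device.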