* [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")
@ 2018-11-13 13:54 Ville Syrjälä
2018-11-13 15:10 ` Paul E. McKenney
0 siblings, 1 reply; 11+ messages in thread
From: Ville Syrjälä @ 2018-11-13 13:54 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar,
Ingo Molnar, Borislav Petkov, H. Peter Anvin
Hi Paul,
After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown.
I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched
API in terms of RCU for Tree RCU PREEMPT builds").
I traced the hang into
-> cpufreq_suspend()
-> cpufreq_stop_governor()
-> cpufreq_dbs_governor_stop()
-> gov_clear_update_util()
-> synchronize_sched()
-> synchronize_rcu()
Only PREEMPT=y is affected for obvious reasons, but that couldn't
explain why the same UP kernel booted on an SMP machine worked fine.
Eventually I realized that the difference between working and
non-working machine was IOAPIC vs. PIC. With initcall_debug I saw
that we mask everything in the PIC before cpufreq is shut down,
and came up with the following fix:
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7aa3dcad2175..f88bf3c77fc0 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void)
return 0;
}
module_param(off, int, 0444);
-core_initcall(cpufreq_core_init);
+late_initcall(cpufreq_core_init);
Here's the resulting change in inutcall_debug:
pci 0000:00:00.1: shutdown
hub 4-0:1.0: hub_ext_port_status failed (err = -110)
agpgart-intel 0000:00:00.0: shutdown
+ PM: Calling cpufreq_suspend+0x0/0x100
PM: Calling mce_syscore_shutdown+0x0/0x10
PM: Calling i8259A_shutdown+0x0/0x10
- PM: Calling cpufreq_suspend+0x0/0x100
+ reboot: Restarting system
+ reboot: machine restart
I didn't really look into what other ramifications the cpufreq
initcall change might have. cpufreq_global_kobject worries
me a bit. Maybe that one has to remain in core_initcall() and
we could just move the suspend to late_initcall()? Anyways,
I figured I'd leave this for someone more familiar with the
code to figure out ;)
--
Ville Syrjälä
Intel
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2018-11-13 13:54 [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") Ville Syrjälä @ 2018-11-13 15:10 ` Paul E. McKenney 2018-11-14 20:20 ` Paul E. McKenney 0 siblings, 1 reply; 11+ messages in thread From: Paul E. McKenney @ 2018-11-13 15:10 UTC (permalink / raw) To: Ville Syrjälä Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Borislav Petkov, H. Peter Anvin On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > Hi Paul, > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > API in terms of RCU for Tree RCU PREEMPT builds"). > > I traced the hang into > -> cpufreq_suspend() > -> cpufreq_stop_governor() > -> cpufreq_dbs_governor_stop() > -> gov_clear_update_util() > -> synchronize_sched() > -> synchronize_rcu() > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > explain why the same UP kernel booted on an SMP machine worked fine. > Eventually I realized that the difference between working and > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > that we mask everything in the PIC before cpufreq is shut down, > and came up with the following fix: > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > index 7aa3dcad2175..f88bf3c77fc0 100644 > --- a/drivers/cpufreq/cpufreq.c > +++ b/drivers/cpufreq/cpufreq.c > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void) > return 0; > } > module_param(off, int, 0444); > -core_initcall(cpufreq_core_init); > +late_initcall(cpufreq_core_init); Thank you for testing this and tracking it down! I am glad that you have a fix, but I hope that we can arrive at a less constraining one. > Here's the resulting change in inutcall_debug: > pci 0000:00:00.1: shutdown > hub 4-0:1.0: hub_ext_port_status failed (err = -110) > agpgart-intel 0000:00:00.0: shutdown > + PM: Calling cpufreq_suspend+0x0/0x100 > PM: Calling mce_syscore_shutdown+0x0/0x10 > PM: Calling i8259A_shutdown+0x0/0x10 > - PM: Calling cpufreq_suspend+0x0/0x100 > + reboot: Restarting system > + reboot: machine restart > > I didn't really look into what other ramifications the cpufreq > initcall change might have. cpufreq_global_kobject worries > me a bit. Maybe that one has to remain in core_initcall() and > we could just move the suspend to late_initcall()? Anyways, > I figured I'd leave this for someone more familiar with the > code to figure out ;) Let me guess... When the system suspends or shuts down, there comes a point after which there is only a single CPU that is running with preemption and interrupts are disabled. At this point, RCU must change the way that it works, and the commit you bisected to would make the change more necessary. But if I am guessing correctly, we have just been getting lucky in the past. It looks like RCU needs to create a struct syscore_ops with a shutdown function and pass this to register_syscore_ops(). Maybe a suspend function as well. And RCU needs to invoke register_syscore_ops() at a time that causes RCU's shutdown function to be invoked in the right order with respect to the other work in flight. The hope would be that RCU's suspend function gets called just as the system transitions into a mode where the scheduler is no longer active, give or take. Does this make sense, or am I confused? Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2018-11-13 15:10 ` Paul E. McKenney @ 2018-11-14 20:20 ` Paul E. McKenney 2018-11-26 22:01 ` Paul E. McKenney 0 siblings, 1 reply; 11+ messages in thread From: Paul E. McKenney @ 2018-11-14 20:20 UTC (permalink / raw) To: Ville Syrjälä Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Borislav Petkov, H. Peter Anvin On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote: > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > > Hi Paul, > > > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > > API in terms of RCU for Tree RCU PREEMPT builds"). > > > > I traced the hang into > > -> cpufreq_suspend() > > -> cpufreq_stop_governor() > > -> cpufreq_dbs_governor_stop() > > -> gov_clear_update_util() > > -> synchronize_sched() > > -> synchronize_rcu() > > > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > > explain why the same UP kernel booted on an SMP machine worked fine. > > Eventually I realized that the difference between working and > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > > that we mask everything in the PIC before cpufreq is shut down, > > and came up with the following fix: > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > index 7aa3dcad2175..f88bf3c77fc0 100644 > > --- a/drivers/cpufreq/cpufreq.c > > +++ b/drivers/cpufreq/cpufreq.c > > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void) > > return 0; > > } > > module_param(off, int, 0444); > > -core_initcall(cpufreq_core_init); > > +late_initcall(cpufreq_core_init); > > Thank you for testing this and tracking it down! > > I am glad that you have a fix, but I hope that we can arrive at a less > constraining one. > > > Here's the resulting change in inutcall_debug: > > pci 0000:00:00.1: shutdown > > hub 4-0:1.0: hub_ext_port_status failed (err = -110) > > agpgart-intel 0000:00:00.0: shutdown > > + PM: Calling cpufreq_suspend+0x0/0x100 > > PM: Calling mce_syscore_shutdown+0x0/0x10 > > PM: Calling i8259A_shutdown+0x0/0x10 > > - PM: Calling cpufreq_suspend+0x0/0x100 > > + reboot: Restarting system > > + reboot: machine restart > > > > I didn't really look into what other ramifications the cpufreq > > initcall change might have. cpufreq_global_kobject worries > > me a bit. Maybe that one has to remain in core_initcall() and > > we could just move the suspend to late_initcall()? Anyways, > > I figured I'd leave this for someone more familiar with the > > code to figure out ;) > > Let me guess... > > When the system suspends or shuts down, there comes a point after which > there is only a single CPU that is running with preemption and interrupts > are disabled. At this point, RCU must change the way that it works, and > the commit you bisected to would make the change more necessary. But if > I am guessing correctly, we have just been getting lucky in the past. > > It looks like RCU needs to create a struct syscore_ops with a shutdown > function and pass this to register_syscore_ops(). Maybe a suspend > function as well. And RCU needs to invoke register_syscore_ops() at > a time that causes RCU's shutdown function to be invoked in the right > order with respect to the other work in flight. The hope would be that > RCU's suspend function gets called just as the system transitions into > a mode where the scheduler is no longer active, give or take. > > Does this make sense, or am I confused? Well, it certainly does not make sense in that blocking is still legal at .shutdown() invocation time, which means that RCU cannot revert to its boot-time approach at that point. Looks like I need hooks in a bunch of arch-dependent functions. Which is certainly doable, but will take a bit more digging. Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2018-11-14 20:20 ` Paul E. McKenney @ 2018-11-26 22:01 ` Paul E. McKenney 2019-02-05 5:07 ` Tom Li 2019-04-15 13:35 ` Ville Syrjälä 0 siblings, 2 replies; 11+ messages in thread From: Paul E. McKenney @ 2018-11-26 22:01 UTC (permalink / raw) To: Ville Syrjälä Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Borislav Petkov, H. Peter Anvin On Wed, Nov 14, 2018 at 12:20:13PM -0800, Paul E. McKenney wrote: > On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote: > > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > > > Hi Paul, > > > > > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > > > API in terms of RCU for Tree RCU PREEMPT builds"). > > > > > > I traced the hang into > > > -> cpufreq_suspend() > > > -> cpufreq_stop_governor() > > > -> cpufreq_dbs_governor_stop() > > > -> gov_clear_update_util() > > > -> synchronize_sched() > > > -> synchronize_rcu() > > > > > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > > > explain why the same UP kernel booted on an SMP machine worked fine. > > > Eventually I realized that the difference between working and > > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > > > that we mask everything in the PIC before cpufreq is shut down, > > > and came up with the following fix: > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > > index 7aa3dcad2175..f88bf3c77fc0 100644 > > > --- a/drivers/cpufreq/cpufreq.c > > > +++ b/drivers/cpufreq/cpufreq.c > > > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void) > > > return 0; > > > } > > > module_param(off, int, 0444); > > > -core_initcall(cpufreq_core_init); > > > +late_initcall(cpufreq_core_init); > > > > Thank you for testing this and tracking it down! > > > > I am glad that you have a fix, but I hope that we can arrive at a less > > constraining one. > > > > > Here's the resulting change in inutcall_debug: > > > pci 0000:00:00.1: shutdown > > > hub 4-0:1.0: hub_ext_port_status failed (err = -110) > > > agpgart-intel 0000:00:00.0: shutdown > > > + PM: Calling cpufreq_suspend+0x0/0x100 > > > PM: Calling mce_syscore_shutdown+0x0/0x10 > > > PM: Calling i8259A_shutdown+0x0/0x10 > > > - PM: Calling cpufreq_suspend+0x0/0x100 > > > + reboot: Restarting system > > > + reboot: machine restart > > > > > > I didn't really look into what other ramifications the cpufreq > > > initcall change might have. cpufreq_global_kobject worries > > > me a bit. Maybe that one has to remain in core_initcall() and > > > we could just move the suspend to late_initcall()? Anyways, > > > I figured I'd leave this for someone more familiar with the > > > code to figure out ;) > > > > Let me guess... > > > > When the system suspends or shuts down, there comes a point after which > > there is only a single CPU that is running with preemption and interrupts > > are disabled. At this point, RCU must change the way that it works, and > > the commit you bisected to would make the change more necessary. But if > > I am guessing correctly, we have just been getting lucky in the past. > > > > It looks like RCU needs to create a struct syscore_ops with a shutdown > > function and pass this to register_syscore_ops(). Maybe a suspend > > function as well. And RCU needs to invoke register_syscore_ops() at > > a time that causes RCU's shutdown function to be invoked in the right > > order with respect to the other work in flight. The hope would be that > > RCU's suspend function gets called just as the system transitions into > > a mode where the scheduler is no longer active, give or take. > > > > Does this make sense, or am I confused? > > Well, it certainly does not make sense in that blocking is still legal > at .shutdown() invocation time, which means that RCU cannot revert to > its boot-time approach at that point. Looks like I need hooks in a > bunch of arch-dependent functions. Which is certainly doable, but will > take a bit more digging. A bit more detail, after some additional discussion at Linux Plumbers conference... The preferred approach is to hook into syscore_suspend(), syscore_resume(), and syscore_shutdown(). This can be done easily by creating an appropriately initialized struct syscore_ops and passing a pointer to it to register_syscore_ops() during boot. Taking these three functions in turn: syscore_suspend(): o arch/x86/kernel/apm_32.c suspend(), standby() These calls to syscore_suspend() has interrupts disabled, which is very good, but they are immediately re-enabled, and only then is the call to set_system_power_state(). Unless both interrupts and preemption are prevented somehow, it is not safe for CONFIG_PREEMPT=y RCU implementations to revert back to boot-time behavior at this point. o drivers/xen/manage.c xen_suspend() This looks to have interrupts disabled throughout. It is also invoked within stop_machine(), which means that the other CPUs, though online, are quiescent. This allows RCU to safely switch back to early boot operating mode. That is, this is safe only if there is no interaction with RCU-preempt read-side critical sections that might well be underway in the other CPUs. This assumption is likely violated in CONFIG_PREEMPT=y kernels. One alternative that would work with RCU in CONFIG_PREEMPT=y kernels is CPU-hotplug removing all but one CPU, but that might have some other disadvantages. o kernel/kexec_core.c kernel_kexec() Before we get here, disable_nonboot_cpus() has been invoked, which in turn invokes freeze_secondary_cpus(), which offlines all but the boot CPU. Prior to that, all user-space tasks are frozen. So in this case, it would be safe for RCU to revert back to its boot-time behavior. Aside from the possibility of unfreezable kthreads being preempted within RCU-preempt read-side critical sections, anyway... :-/ However, one can argue that as long as the kthreads preempted within an RCU-preempt read-side critical section are guaranteed to never ever run again, we might be OK. And this guarantee seems consistent with the kernel_kexec() operation. At least when there are no errors that cause the kernel_kexec() to return control to the initial kernel image... Of course, this line of reasoning does not apply when the kernel is to resume on the same hardware, as in some of the cases above. o kernel/power/hibernate.c create_image() Same as for kernel_kexec(), except that freeze_kernel_threads() is invoked, which hopefully gets all tasks out of RCU read-side critical sections. So this one might actually permit RCU to revert back to boot-time behavior. Except for the possibility of an error condition forcing an abort back into the original kernel image, which again could have trouble with kthreads that were preempted within an RCU read-side critical section throughout. o kernel/power/hibernate.c resume_target_kernel() kernel/power/hibernate.c hibernation_platform_enter() kernel/power/suspend.c suspend_enter() Same as for kernel_kexec(), but no obvious pretense of freezing any tasks. syscore_resume(): o arch/x86/kernel/apm_32.c suspend(), standby() Resume-time counterparts to the calls to syscore_suspend() called out above, with the same interrupt-enabling problem, as well as issues with tasks being preempted throughout within RCU-preempt read-side critical sections. o drivers/xen/manage.c xen_suspend() Resume-time counterpart to the calls to xen_suspend() called out above, with the same issues with tasks being preempted throughout within RCU-preempt read-side critical sections. o kernel/kexec_core.c kernel_kexec() Resume-time counterpart to the calls to kernel_kexec() called out above. This is the error case that causes trouble due to the possibility of preempted RCU read-side critical sections. o kernel/power/hibernate.c create_image() kernel/power/hibernate.c resume_target_kernel() kernel/power/hibernate.c hibernation_platform_enter() kernel/power/hibernate.c suspend_enter() Resume-time counterparts to calls within kernel/power/hibernate.c and kernel/power/suspend.c called out above. This is the error case that causes trouble due to the possibility of preempted RCU read-side critical sections. syscore_shutdown(): o kernel/reboot.c kernel_restart() kernel/reboot.c kernel_halt() kernel/reboot.c kernel_power_off() These appears to leave all CPUs online, which prevents RCU from safely reverting back to boot-time mode. So what is to be done? Here are the options I can see: 1. Status quo, which means that synchronize_rcu() and friends cannot be used in syscore_suspend(), syscore_resume(), and syscore_shutdown() callbacks. At the moment, this appears to be the only workable approach, though ideas and suggestions are quite welcome. 2. Make each code path to syscore_suspend(), syscore_resume(), and syscore_shutdown() offline all but the boot CPU, ensure that all tasks exit any RCU read-side critical sections that they might be in, then run the remainder of the code path on the boot CPU with interrupts disabled. Making all tasks exit any RCU read-side critical sections is easy when CONFIG_PREEMPT=n via things like stop-machine, but it is difficult and potentially time-consuming for CONFIG_PREEMPT=y kernels. 3. Do error checking so that there cannot possibly be failures beyond the time that syscore_suspend(), syscore_resume(), and syscore_shutdown() are invoked. This is fine for syscore_shutdown() and syscore_resume(), but syscore_suspend()'s callbacks are permitted to return errors that force suspend failures. And there are syscore_suspend() callbacks that actually do return errors, for example, fsl_lbc_syscore_suspend() in arch/powerpc/sysdev/fsl_lbc.c can return -ENOMEM. As can save_ioapic_entries() in arch/x86/kernel/apic/io_apic.c and arch/x86/include/asm/io_apic.h. And mvebu_mbus_suspend() in drivers/bus/mvebu-mbus.c. And iommu_suspend() in drivers/iommu/intel-iommu.c. And its_save_disable() in drivers/irqchip/irq-gic-v3-its.c can return -EBUSY. Perhaps these can be remedied somehow, but unless that can be done, this approach cannot work. 4. Your idea here!!! Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2018-11-26 22:01 ` Paul E. McKenney @ 2019-02-05 5:07 ` Tom Li 2019-02-05 9:58 ` Aaro Koskinen 2019-04-15 13:35 ` Ville Syrjälä 1 sibling, 1 reply; 11+ messages in thread From: Tom Li @ 2019-02-05 5:07 UTC (permalink / raw) To: Paul E. McKenney Cc: paulmck, ak, bp, hpa, linux-kernel, mingo, rjw, ville.syrjala, viresh.kumar, tomli, linux-mips On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > Hi Paul, > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > API in terms of RCU for Tree RCU PREEMPT builds"). > > I traced the hang into > -> cpufreq_suspend() > -> cpufreq_stop_governor() > -> cpufreq_dbs_governor_stop() > -> gov_clear_update_util() > -> synchronize_sched() > -> synchronize_rcu() > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > explain why the same UP kernel booted on an SMP machine worked fine. > Eventually I realized that the difference between working and > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > that we mask everything in the PIC before cpufreq is shut down, Hello Paul & Ville. I'm not a kernel hacker, just a n00b playing with various non-x86 systems, but I've been forced getting into kernel hacking due to the same issue. Since February, I'm working on porting some trivial out-of-tree drivers to the upstream, and noticed my Yeeloong 8089D, a machine running the Loongson 2F 64-bit MIPS-III processor, will completely hang during reboot or shutdown. I tried bisecting it for 24 hours without sleep, but the attempt had failed. I managed to narrow it in-between 4.19 and 4.20, most unusual thing I've observed was its probabilistic nature. The chance of triggering it seems getting progressively higher since 4.20, making pinpointing the specific commit difficult, finally with a 100% chance since Linux 4.19. I initially suspected a bug in the platform driver, later I also tried to disable various kernel hardening options, all without success. At this point I've realized, because it has shown to be probabilistic, it must be a race condition, and since it's an uniprocessor system, it must be the CPU scheduler getting preempted somehow. Disabling CONFIG_PREEMPT makes the issue go away. I tried to add various preempt_disable() in the platform driver but didn't work. Finally I've hooked up a netconsole and started adding printk()s. [ 23.652000] loongson2_cpufreq loongson2_cpufreq: shutdown [ 23.656000] loongson-gpio loongson-gpio: shutdown [ 23.660000] migrate_to_reboot_cpu() [ 23.664000] syscore_shutdown() [ 23.668000] PM: Calling i8259A_shutdown+0x0/0xa8 [ 23.672000] PM: Calling irq_gc_shutdown+0x0/0x88 [ 23.672000] PM: Calling cpufreq_suspend+0x0/0x1a0 [ 23.672000] cpufreq_suspend() [ 23.672000] cpufreq_suspend: Suspending Governors [ 23.672000] cpufreq_stop_governor() [ 23.672000] cpufreq_stop_governor: for CPU 0 Looks like something in the core cpufreq code? So I tried searching "cpufreq_stop_governor()" at LKML... Oops! So it must be the same issue. To summary, the issue exists on all Linux kernels since 4.20-rc1, and the chance of triggering it, is now 100%. What is the current progress of purposed solutions? If a complete solution is still work-in-progress, could we simply submit a hotfix into the linux-stable trees, so at least the issue can be temporarily solved? What can I do to help testing? Thanks! Tom Li ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2019-02-05 5:07 ` Tom Li @ 2019-02-05 9:58 ` Aaro Koskinen 2019-02-05 13:05 ` Tom Li 0 siblings, 1 reply; 11+ messages in thread From: Aaro Koskinen @ 2019-02-05 9:58 UTC (permalink / raw) To: Tom Li Cc: Paul E. McKenney, ak, bp, hpa, linux-kernel, mingo, rjw, ville.syrjala, viresh.kumar, linux-mips Hi, On Tue, Feb 05, 2019 at 01:07:00PM +0800, Tom Li wrote: > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > > Hi Paul, > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > > API in terms of RCU for Tree RCU PREEMPT builds"). > > > > I traced the hang into > > -> cpufreq_suspend() > > -> cpufreq_stop_governor() > > -> cpufreq_dbs_governor_stop() > > -> gov_clear_update_util() > > -> synchronize_sched() > > -> synchronize_rcu() > > > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > > explain why the same UP kernel booted on an SMP machine worked fine. > > Eventually I realized that the difference between working and > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > > that we mask everything in the PIC before cpufreq is shut down, > > Hello Paul & Ville. > > I'm not a kernel hacker, just a n00b playing with various non-x86 > systems, but I've been forced getting into kernel hacking due to > the same issue. > > Since February, I'm working on porting some trivial out-of-tree > drivers to the upstream, and noticed my Yeeloong 8089D, a machine running > the Loongson 2F 64-bit MIPS-III processor, will completely hang during > reboot or shutdown. I tried bisecting it for 24 hours without sleep, but > the attempt had failed. > > I managed to narrow it in-between 4.19 and 4.20, most unusual thing I've > observed was its probabilistic nature. The chance of triggering it seems > getting progressively higher since 4.20, making pinpointing the specific > commit difficult, finally with a 100% chance since Linux 4.19. > > I initially suspected a bug in the platform driver, later I also tried > to disable various kernel hardening options, all without success. At this > point I've realized, because it has shown to be probabilistic, it must be > a race condition, and since it's an uniprocessor system, it must be the > CPU scheduler getting preempted somehow. Disabling CONFIG_PREEMPT makes > the issue go away. I tried to add various preempt_disable() in the platform > driver but didn't work. > > Finally I've hooked up a netconsole and started adding printk()s. > > [ 23.652000] loongson2_cpufreq loongson2_cpufreq: shutdown > [ 23.656000] loongson-gpio loongson-gpio: shutdown > [ 23.660000] migrate_to_reboot_cpu() > [ 23.664000] syscore_shutdown() > [ 23.668000] PM: Calling i8259A_shutdown+0x0/0xa8 > [ 23.672000] PM: Calling irq_gc_shutdown+0x0/0x88 > [ 23.672000] PM: Calling cpufreq_suspend+0x0/0x1a0 > [ 23.672000] cpufreq_suspend() > [ 23.672000] cpufreq_suspend: Suspending Governors > [ 23.672000] cpufreq_stop_governor() > [ 23.672000] cpufreq_stop_governor: for CPU 0 > > Looks like something in the core cpufreq code? So I tried searching > "cpufreq_stop_governor()" at LKML... Oops! Can you try below fix? It works on my Loongson. A. --- From: Aaro Koskinen <aaro.koskinen@iki.fi> Date: Wed, 2 Jan 2019 22:58:52 +0200 Subject: [PATCH] irqchip/i8259: fix shutdown order by moving syscore_ops registration When using cpufreq on Loongson 2F MIPS platform, "poweroff" command gets frequently stuck in syscore_shutdown(). The reason is that i8259A_shutdown() gets called before cpufreq_suspend(), and if we have pending work then irq_work_sync() in cpufreq_dbs_governor_stop() gets stuck forever as we have all interrupts masked already. irq-i8259 is registering syscore_ops using device_initcall(), while cpufreq uses core_initcall(). Fix the shutdown order simply by registering the irq syscore_ops during the early IRQ init instead of using a separate initcall at later stage. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> --- drivers/irqchip/irq-i8259.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/drivers/irqchip/irq-i8259.c b/drivers/irqchip/irq-i8259.c index b0d4aab1a58c..d000870d9b6b 100644 --- a/drivers/irqchip/irq-i8259.c +++ b/drivers/irqchip/irq-i8259.c @@ -225,14 +225,6 @@ static struct syscore_ops i8259_syscore_ops = { .shutdown = i8259A_shutdown, }; -static int __init i8259A_init_sysfs(void) -{ - register_syscore_ops(&i8259_syscore_ops); - return 0; -} - -device_initcall(i8259A_init_sysfs); - static void init_8259A(int auto_eoi) { unsigned long flags; @@ -332,6 +324,7 @@ struct irq_domain * __init __init_i8259_irqs(struct device_node *node) panic("Failed to add i8259 IRQ domain"); setup_irq(I8259A_IRQ_BASE + PIC_CASCADE_IR, &irq2); + register_syscore_ops(&i8259_syscore_ops); return domain; } -- 2.17.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2019-02-05 9:58 ` Aaro Koskinen @ 2019-02-05 13:05 ` Tom Li 2019-02-05 18:59 ` Aaro Koskinen 0 siblings, 1 reply; 11+ messages in thread From: Tom Li @ 2019-02-05 13:05 UTC (permalink / raw) To: Aaro Koskinen Cc: paulmck, ak, bp, hpa, linux-kernel, mingo, rjw, ville.syrjala, viresh.kumar, linux-mips On Tue, Feb 05, 2019 at 11:58:09AM +0200, Aaro Koskinen wrote: > Can you try below fix? It works on my Loongson. Hello Aaro, thanks for your response. But in case you've missed the original thread, please check it at: https://lkml.org/lkml/2018/11/13/857 My problem is NOT about how to fix the problem on Loongson (or x86): the patch in the original thread (only has one-line-of-code, simply changes timing of cpufreq_core_init), or your patch, is indeed working. But they are workarounds, the real issue is the race condition in cpufreq. My question is 1. What's the progress and current consensus about how the underlying problem can be fixed. 2. If there's no consensus since November, is it possible that we land a hotfix patch to linux-stable as a temporary workaround? Cheers, Tom Li. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2019-02-05 13:05 ` Tom Li @ 2019-02-05 18:59 ` Aaro Koskinen 2019-02-06 1:22 ` tomli 0 siblings, 1 reply; 11+ messages in thread From: Aaro Koskinen @ 2019-02-05 18:59 UTC (permalink / raw) To: Tom Li, Paul E. McKenney, Rafael J. Wysocki, Ville Syrjälä Cc: ak, bp, hpa, linux-kernel, mingo, rjw, viresh.kumar, linux-mips Hi, On Tue, Feb 05, 2019 at 09:05:59PM +0800, Tom Li wrote: > On Tue, Feb 05, 2019 at 11:58:09AM +0200, Aaro Koskinen wrote: > > Can you try below fix? It works on my Loongson. > > Hello Aaro, thanks for your response. But in case you've missed > the original thread, please check it at: > > https://lkml.org/lkml/2018/11/13/857 OK, thanks. This looks slightly different from the Loongson problem: - In Loongson, we don't get stuck in RCU, but in cpufreq_dbs_governor_stop -> irq_work_sync(). - I run non-preemptible kernel, and my system still gets stuck. What is common is that it's UP with i8259 PIC. > My problem is NOT about how to fix the problem on Loongson (or > x86): the patch in the original thread (only has one-line-of-code, > simply changes timing of cpufreq_core_init), or your patch, is > indeed working. But they are workarounds, the real issue is the race > condition in cpufreq. Looking at irq_work_sync(), I cannot think how it could work on UP machine with interrupts disabled. A. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2019-02-05 18:59 ` Aaro Koskinen @ 2019-02-06 1:22 ` tomli 0 siblings, 0 replies; 11+ messages in thread From: tomli @ 2019-02-06 1:22 UTC (permalink / raw) To: Aaro Koskinen, Paul E. McKenney, Rafael J. Wysocki, Ville Syrjälä Cc: ak, bp, hpa, linux-kernel, mingo, rjw, viresh.kumar, linux-mips > OK, thanks. This looks slightly different from the Loongson problem: > > - In Loongson, we don't get stuck in RCU, but in > cpufreq_dbs_governor_stop -> irq_work_sync(). > > - I run non-preemptible kernel, and my system still gets stuck. > > What is common is that it's UP with i8259 PIC. > > A. Now it's an interesting case. Because on my machine, the problem I encountered seems to be the identical one of the original thread, disabling preempting can effectively solve the lockup. Also, my issue is not only occuring on 4.20-rc1, but also on earlier kernels, with a lower probability. But on your machine, you have another non-identical, but closely- related issue. It seems the timing-dependent lockup of i8259 PIC can be triggered in different ways. The conclusion is clear though, there's a real lockup condition in i8259 PIC driver, and it's causing real issues. Aaro, have you tried submitting your i8259 patch to the mainline? Despite your concerns about its underlying cause, I think a fix should be submitted. If there are no objections from the maintainers, I suggest submitting it to the mainline upstream, and send it to linux-stable, requesting it to be applied on 3.16, 4.4, 4.9, 4.14, 4.19, 4.20 stable branches. If you are busy, I can help submitting. Tom Li ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2018-11-26 22:01 ` Paul E. McKenney 2019-02-05 5:07 ` Tom Li @ 2019-04-15 13:35 ` Ville Syrjälä 2019-04-15 14:07 ` Paul E. McKenney 1 sibling, 1 reply; 11+ messages in thread From: Ville Syrjälä @ 2019-04-15 13:35 UTC (permalink / raw) To: Paul E. McKenney Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Borislav Petkov, H. Peter Anvin On Mon, Nov 26, 2018 at 02:01:22PM -0800, Paul E. McKenney wrote: > On Wed, Nov 14, 2018 at 12:20:13PM -0800, Paul E. McKenney wrote: > > On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote: > > > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > > > > Hi Paul, > > > > > > > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > > > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > > > > API in terms of RCU for Tree RCU PREEMPT builds"). > > > > > > > > I traced the hang into > > > > -> cpufreq_suspend() > > > > -> cpufreq_stop_governor() > > > > -> cpufreq_dbs_governor_stop() > > > > -> gov_clear_update_util() > > > > -> synchronize_sched() > > > > -> synchronize_rcu() > > > > > > > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > > > > explain why the same UP kernel booted on an SMP machine worked fine. > > > > Eventually I realized that the difference between working and > > > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > > > > that we mask everything in the PIC before cpufreq is shut down, > > > > and came up with the following fix: > > > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > > > index 7aa3dcad2175..f88bf3c77fc0 100644 > > > > --- a/drivers/cpufreq/cpufreq.c > > > > +++ b/drivers/cpufreq/cpufreq.c > > > > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void) > > > > return 0; > > > > } > > > > module_param(off, int, 0444); > > > > -core_initcall(cpufreq_core_init); > > > > +late_initcall(cpufreq_core_init); > > > > > > Thank you for testing this and tracking it down! > > > > > > I am glad that you have a fix, but I hope that we can arrive at a less > > > constraining one. > > > > > > > Here's the resulting change in inutcall_debug: > > > > pci 0000:00:00.1: shutdown > > > > hub 4-0:1.0: hub_ext_port_status failed (err = -110) > > > > agpgart-intel 0000:00:00.0: shutdown > > > > + PM: Calling cpufreq_suspend+0x0/0x100 > > > > PM: Calling mce_syscore_shutdown+0x0/0x10 > > > > PM: Calling i8259A_shutdown+0x0/0x10 > > > > - PM: Calling cpufreq_suspend+0x0/0x100 > > > > + reboot: Restarting system > > > > + reboot: machine restart > > > > > > > > I didn't really look into what other ramifications the cpufreq > > > > initcall change might have. cpufreq_global_kobject worries > > > > me a bit. Maybe that one has to remain in core_initcall() and > > > > we could just move the suspend to late_initcall()? Anyways, > > > > I figured I'd leave this for someone more familiar with the > > > > code to figure out ;) > > > > > > Let me guess... > > > > > > When the system suspends or shuts down, there comes a point after which > > > there is only a single CPU that is running with preemption and interrupts > > > are disabled. At this point, RCU must change the way that it works, and > > > the commit you bisected to would make the change more necessary. But if > > > I am guessing correctly, we have just been getting lucky in the past. > > > > > > It looks like RCU needs to create a struct syscore_ops with a shutdown > > > function and pass this to register_syscore_ops(). Maybe a suspend > > > function as well. And RCU needs to invoke register_syscore_ops() at > > > a time that causes RCU's shutdown function to be invoked in the right > > > order with respect to the other work in flight. The hope would be that > > > RCU's suspend function gets called just as the system transitions into > > > a mode where the scheduler is no longer active, give or take. > > > > > > Does this make sense, or am I confused? > > > > Well, it certainly does not make sense in that blocking is still legal > > at .shutdown() invocation time, which means that RCU cannot revert to > > its boot-time approach at that point. Looks like I need hooks in a > > bunch of arch-dependent functions. Which is certainly doable, but will > > take a bit more digging. > > A bit more detail, after some additional discussion at Linux Plumbers > conference... > > The preferred approach is to hook into syscore_suspend(), > syscore_resume(), and syscore_shutdown(). This can be done easily by > creating an appropriately initialized struct syscore_ops and passing a > pointer to it to register_syscore_ops() during boot. Taking these three > functions in turn: > > syscore_suspend(): > > o arch/x86/kernel/apm_32.c suspend(), standby() > > These calls to syscore_suspend() has interrupts disabled, which > is very good, but they are immediately re-enabled, and only then > is the call to set_system_power_state(). Unless both interrupts > and preemption are prevented somehow, it is not safe for > CONFIG_PREEMPT=y RCU implementations to revert back to boot-time > behavior at this point. > > o drivers/xen/manage.c xen_suspend() > > This looks to have interrupts disabled throughout. It is also > invoked within stop_machine(), which means that the other CPUs, > though online, are quiescent. This allows RCU to safely switch > back to early boot operating mode. That is, this is safe only > if there is no interaction with RCU-preempt read-side critical > sections that might well be underway in the other CPUs. This > assumption is likely violated in CONFIG_PREEMPT=y kernels. One > alternative that would work with RCU in CONFIG_PREEMPT=y kernels > is CPU-hotplug removing all but one CPU, but that might have > some other disadvantages. > > o kernel/kexec_core.c kernel_kexec() > > Before we get here, disable_nonboot_cpus() has been invoked, which > in turn invokes freeze_secondary_cpus(), which offlines all but > the boot CPU. Prior to that, all user-space tasks are frozen. > So in this case, it would be safe for RCU to revert back to its > boot-time behavior. Aside from the possibility of unfreezable > kthreads being preempted within RCU-preempt read-side critical > sections, anyway... :-/ > > However, one can argue that as long as the kthreads preempted > within an RCU-preempt read-side critical section are guaranteed > to never ever run again, we might be OK. And this guarantee > seems consistent with the kernel_kexec() operation. At least > when there are no errors that cause the kernel_kexec() to return > control to the initial kernel image... > > Of course, this line of reasoning does not apply when the > kernel is to resume on the same hardware, as in some of the > cases above. > > o kernel/power/hibernate.c create_image() > > Same as for kernel_kexec(), except that freeze_kernel_threads() > is invoked, which hopefully gets all tasks out of RCU read-side > critical sections. So this one might actually permit RCU to > revert back to boot-time behavior. Except for the possibility of > an error condition forcing an abort back into the original kernel > image, which again could have trouble with kthreads that were > preempted within an RCU read-side critical section throughout. > > o kernel/power/hibernate.c resume_target_kernel() > kernel/power/hibernate.c hibernation_platform_enter() > kernel/power/suspend.c suspend_enter() > > Same as for kernel_kexec(), but no obvious pretense of freezing > any tasks. > > > syscore_resume(): > > o arch/x86/kernel/apm_32.c suspend(), standby() > > Resume-time counterparts to the calls to syscore_suspend() called > out above, with the same interrupt-enabling problem, as well as > issues with tasks being preempted throughout within RCU-preempt > read-side critical sections. > > o drivers/xen/manage.c xen_suspend() > > Resume-time counterpart to the calls to xen_suspend() called out > above, with the same issues with tasks being preempted throughout > within RCU-preempt read-side critical sections. > > o kernel/kexec_core.c kernel_kexec() > > Resume-time counterpart to the calls to kernel_kexec() called out > above. This is the error case that causes trouble due to the > possibility of preempted RCU read-side critical sections. > > o kernel/power/hibernate.c create_image() > kernel/power/hibernate.c resume_target_kernel() > kernel/power/hibernate.c hibernation_platform_enter() > kernel/power/hibernate.c suspend_enter() > > Resume-time counterparts to calls within kernel/power/hibernate.c > and kernel/power/suspend.c called out above. This is the error > case that causes trouble due to the possibility of preempted > RCU read-side critical sections. > > > syscore_shutdown(): > > o kernel/reboot.c kernel_restart() > kernel/reboot.c kernel_halt() > kernel/reboot.c kernel_power_off() > > These appears to leave all CPUs online, which prevents RCU from > safely reverting back to boot-time mode. > > > So what is to be done? > > Here are the options I can see: > > 1. Status quo, which means that synchronize_rcu() and friends > cannot be used in syscore_suspend(), syscore_resume(), and > syscore_shutdown() callbacks. At the moment, this appears to > be the only workable approach, though ideas and suggestions are > quite welcome. > > 2. Make each code path to syscore_suspend(), syscore_resume(), and > syscore_shutdown() offline all but the boot CPU, ensure that > all tasks exit any RCU read-side critical sections that they > might be in, then run the remainder of the code path on the > boot CPU with interrupts disabled. > > Making all tasks exit any RCU read-side critical sections is > easy when CONFIG_PREEMPT=n via things like stop-machine, but > it is difficult and potentially time-consuming for > CONFIG_PREEMPT=y kernels. > > 3. Do error checking so that there cannot possibly be failures > beyond the time that syscore_suspend(), syscore_resume(), > and syscore_shutdown() are invoked. This is fine for > syscore_shutdown() and syscore_resume(), but syscore_suspend()'s > callbacks are permitted to return errors that force suspend > failures. > > And there are syscore_suspend() callbacks that actually do > return errors, for example, fsl_lbc_syscore_suspend() > in arch/powerpc/sysdev/fsl_lbc.c can return -ENOMEM. > As can save_ioapic_entries() in arch/x86/kernel/apic/io_apic.c > and arch/x86/include/asm/io_apic.h. And mvebu_mbus_suspend() > in drivers/bus/mvebu-mbus.c. And iommu_suspend() in > drivers/iommu/intel-iommu.c. > > And its_save_disable() in drivers/irqchip/irq-gic-v3-its.c > can return -EBUSY. > > Perhaps these can be remedied somehow, but unless that can > be done, this approach cannot work. > > 4. Your idea here!!! Paul, are we any closer to fixing this regression? It's been around for far too long, and I'd like to stop carrying my original hack around. -- Ville Syrjälä Intel ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") 2019-04-15 13:35 ` Ville Syrjälä @ 2019-04-15 14:07 ` Paul E. McKenney 0 siblings, 0 replies; 11+ messages in thread From: Paul E. McKenney @ 2019-04-15 14:07 UTC (permalink / raw) To: Ville Syrjälä Cc: linux-kernel, Andi Kleen, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Borislav Petkov, H. Peter Anvin On Mon, Apr 15, 2019 at 04:35:24PM +0300, Ville Syrjälä wrote: > On Mon, Nov 26, 2018 at 02:01:22PM -0800, Paul E. McKenney wrote: > > On Wed, Nov 14, 2018 at 12:20:13PM -0800, Paul E. McKenney wrote: > > > On Tue, Nov 13, 2018 at 07:10:37AM -0800, Paul E. McKenney wrote: > > > > On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote: > > > > > Hi Paul, > > > > > > > > > > After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown. > > > > > I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched > > > > > API in terms of RCU for Tree RCU PREEMPT builds"). > > > > > > > > > > I traced the hang into > > > > > -> cpufreq_suspend() > > > > > -> cpufreq_stop_governor() > > > > > -> cpufreq_dbs_governor_stop() > > > > > -> gov_clear_update_util() > > > > > -> synchronize_sched() > > > > > -> synchronize_rcu() > > > > > > > > > > Only PREEMPT=y is affected for obvious reasons, but that couldn't > > > > > explain why the same UP kernel booted on an SMP machine worked fine. > > > > > Eventually I realized that the difference between working and > > > > > non-working machine was IOAPIC vs. PIC. With initcall_debug I saw > > > > > that we mask everything in the PIC before cpufreq is shut down, > > > > > and came up with the following fix: > > > > > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > > > > index 7aa3dcad2175..f88bf3c77fc0 100644 > > > > > --- a/drivers/cpufreq/cpufreq.c > > > > > +++ b/drivers/cpufreq/cpufreq.c > > > > > @@ -2605,4 +2605,4 @@ static int __init cpufreq_core_init(void) > > > > > return 0; > > > > > } > > > > > module_param(off, int, 0444); > > > > > -core_initcall(cpufreq_core_init); > > > > > +late_initcall(cpufreq_core_init); > > > > > > > > Thank you for testing this and tracking it down! > > > > > > > > I am glad that you have a fix, but I hope that we can arrive at a less > > > > constraining one. > > > > > > > > > Here's the resulting change in inutcall_debug: > > > > > pci 0000:00:00.1: shutdown > > > > > hub 4-0:1.0: hub_ext_port_status failed (err = -110) > > > > > agpgart-intel 0000:00:00.0: shutdown > > > > > + PM: Calling cpufreq_suspend+0x0/0x100 > > > > > PM: Calling mce_syscore_shutdown+0x0/0x10 > > > > > PM: Calling i8259A_shutdown+0x0/0x10 > > > > > - PM: Calling cpufreq_suspend+0x0/0x100 > > > > > + reboot: Restarting system > > > > > + reboot: machine restart > > > > > > > > > > I didn't really look into what other ramifications the cpufreq > > > > > initcall change might have. cpufreq_global_kobject worries > > > > > me a bit. Maybe that one has to remain in core_initcall() and > > > > > we could just move the suspend to late_initcall()? Anyways, > > > > > I figured I'd leave this for someone more familiar with the > > > > > code to figure out ;) > > > > > > > > Let me guess... > > > > > > > > When the system suspends or shuts down, there comes a point after which > > > > there is only a single CPU that is running with preemption and interrupts > > > > are disabled. At this point, RCU must change the way that it works, and > > > > the commit you bisected to would make the change more necessary. But if > > > > I am guessing correctly, we have just been getting lucky in the past. > > > > > > > > It looks like RCU needs to create a struct syscore_ops with a shutdown > > > > function and pass this to register_syscore_ops(). Maybe a suspend > > > > function as well. And RCU needs to invoke register_syscore_ops() at > > > > a time that causes RCU's shutdown function to be invoked in the right > > > > order with respect to the other work in flight. The hope would be that > > > > RCU's suspend function gets called just as the system transitions into > > > > a mode where the scheduler is no longer active, give or take. > > > > > > > > Does this make sense, or am I confused? > > > > > > Well, it certainly does not make sense in that blocking is still legal > > > at .shutdown() invocation time, which means that RCU cannot revert to > > > its boot-time approach at that point. Looks like I need hooks in a > > > bunch of arch-dependent functions. Which is certainly doable, but will > > > take a bit more digging. > > > > A bit more detail, after some additional discussion at Linux Plumbers > > conference... > > > > The preferred approach is to hook into syscore_suspend(), > > syscore_resume(), and syscore_shutdown(). This can be done easily by > > creating an appropriately initialized struct syscore_ops and passing a > > pointer to it to register_syscore_ops() during boot. Taking these three > > functions in turn: > > > > syscore_suspend(): > > > > o arch/x86/kernel/apm_32.c suspend(), standby() > > > > These calls to syscore_suspend() has interrupts disabled, which > > is very good, but they are immediately re-enabled, and only then > > is the call to set_system_power_state(). Unless both interrupts > > and preemption are prevented somehow, it is not safe for > > CONFIG_PREEMPT=y RCU implementations to revert back to boot-time > > behavior at this point. > > > > o drivers/xen/manage.c xen_suspend() > > > > This looks to have interrupts disabled throughout. It is also > > invoked within stop_machine(), which means that the other CPUs, > > though online, are quiescent. This allows RCU to safely switch > > back to early boot operating mode. That is, this is safe only > > if there is no interaction with RCU-preempt read-side critical > > sections that might well be underway in the other CPUs. This > > assumption is likely violated in CONFIG_PREEMPT=y kernels. One > > alternative that would work with RCU in CONFIG_PREEMPT=y kernels > > is CPU-hotplug removing all but one CPU, but that might have > > some other disadvantages. > > > > o kernel/kexec_core.c kernel_kexec() > > > > Before we get here, disable_nonboot_cpus() has been invoked, which > > in turn invokes freeze_secondary_cpus(), which offlines all but > > the boot CPU. Prior to that, all user-space tasks are frozen. > > So in this case, it would be safe for RCU to revert back to its > > boot-time behavior. Aside from the possibility of unfreezable > > kthreads being preempted within RCU-preempt read-side critical > > sections, anyway... :-/ > > > > However, one can argue that as long as the kthreads preempted > > within an RCU-preempt read-side critical section are guaranteed > > to never ever run again, we might be OK. And this guarantee > > seems consistent with the kernel_kexec() operation. At least > > when there are no errors that cause the kernel_kexec() to return > > control to the initial kernel image... > > > > Of course, this line of reasoning does not apply when the > > kernel is to resume on the same hardware, as in some of the > > cases above. > > > > o kernel/power/hibernate.c create_image() > > > > Same as for kernel_kexec(), except that freeze_kernel_threads() > > is invoked, which hopefully gets all tasks out of RCU read-side > > critical sections. So this one might actually permit RCU to > > revert back to boot-time behavior. Except for the possibility of > > an error condition forcing an abort back into the original kernel > > image, which again could have trouble with kthreads that were > > preempted within an RCU read-side critical section throughout. > > > > o kernel/power/hibernate.c resume_target_kernel() > > kernel/power/hibernate.c hibernation_platform_enter() > > kernel/power/suspend.c suspend_enter() > > > > Same as for kernel_kexec(), but no obvious pretense of freezing > > any tasks. > > > > > > syscore_resume(): > > > > o arch/x86/kernel/apm_32.c suspend(), standby() > > > > Resume-time counterparts to the calls to syscore_suspend() called > > out above, with the same interrupt-enabling problem, as well as > > issues with tasks being preempted throughout within RCU-preempt > > read-side critical sections. > > > > o drivers/xen/manage.c xen_suspend() > > > > Resume-time counterpart to the calls to xen_suspend() called out > > above, with the same issues with tasks being preempted throughout > > within RCU-preempt read-side critical sections. > > > > o kernel/kexec_core.c kernel_kexec() > > > > Resume-time counterpart to the calls to kernel_kexec() called out > > above. This is the error case that causes trouble due to the > > possibility of preempted RCU read-side critical sections. > > > > o kernel/power/hibernate.c create_image() > > kernel/power/hibernate.c resume_target_kernel() > > kernel/power/hibernate.c hibernation_platform_enter() > > kernel/power/hibernate.c suspend_enter() > > > > Resume-time counterparts to calls within kernel/power/hibernate.c > > and kernel/power/suspend.c called out above. This is the error > > case that causes trouble due to the possibility of preempted > > RCU read-side critical sections. > > > > > > syscore_shutdown(): > > > > o kernel/reboot.c kernel_restart() > > kernel/reboot.c kernel_halt() > > kernel/reboot.c kernel_power_off() > > > > These appears to leave all CPUs online, which prevents RCU from > > safely reverting back to boot-time mode. > > > > > > So what is to be done? > > > > Here are the options I can see: > > > > 1. Status quo, which means that synchronize_rcu() and friends > > cannot be used in syscore_suspend(), syscore_resume(), and > > syscore_shutdown() callbacks. At the moment, this appears to > > be the only workable approach, though ideas and suggestions are > > quite welcome. > > > > 2. Make each code path to syscore_suspend(), syscore_resume(), and > > syscore_shutdown() offline all but the boot CPU, ensure that > > all tasks exit any RCU read-side critical sections that they > > might be in, then run the remainder of the code path on the > > boot CPU with interrupts disabled. > > > > Making all tasks exit any RCU read-side critical sections is > > easy when CONFIG_PREEMPT=n via things like stop-machine, but > > it is difficult and potentially time-consuming for > > CONFIG_PREEMPT=y kernels. > > > > 3. Do error checking so that there cannot possibly be failures > > beyond the time that syscore_suspend(), syscore_resume(), > > and syscore_shutdown() are invoked. This is fine for > > syscore_shutdown() and syscore_resume(), but syscore_suspend()'s > > callbacks are permitted to return errors that force suspend > > failures. > > > > And there are syscore_suspend() callbacks that actually do > > return errors, for example, fsl_lbc_syscore_suspend() > > in arch/powerpc/sysdev/fsl_lbc.c can return -ENOMEM. > > As can save_ioapic_entries() in arch/x86/kernel/apic/io_apic.c > > and arch/x86/include/asm/io_apic.h. And mvebu_mbus_suspend() > > in drivers/bus/mvebu-mbus.c. And iommu_suspend() in > > drivers/iommu/intel-iommu.c. > > > > And its_save_disable() in drivers/irqchip/irq-gic-v3-its.c > > can return -EBUSY. > > > > Perhaps these can be remedied somehow, but unless that can > > be done, this approach cannot work. > > > > 4. Your idea here!!! > > Paul, are we any closer to fixing this regression? It's been around > for far too long, and I'd like to stop carrying my original hack > around. Actually, no. If I got any response before now, I fat-fingered it. Seeing no responses, I assumed that nobody cared, and that option #1 (status quo) was preferred. Any feedback from anyone on the various options? Or, better yet, some better options? Thanx, Paul ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-04-15 14:07 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-11-13 13:54 [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") Ville Syrjälä
2018-11-13 15:10 ` Paul E. McKenney
2018-11-14 20:20 ` Paul E. McKenney
2018-11-26 22:01 ` Paul E. McKenney
2019-02-05 5:07 ` Tom Li
2019-02-05 9:58 ` Aaro Koskinen
2019-02-05 13:05 ` Tom Li
2019-02-05 18:59 ` Aaro Koskinen
2019-02-06 1:22 ` tomli
2019-04-15 13:35 ` Ville Syrjälä
2019-04-15 14:07 ` Paul E. McKenney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox