* Re: Re: Hibernation considerations
[not found] ` <200707172320.16279.rjw@sisk.pl>
@ 2007-07-20 14:01 ` Milton Miller
2007-07-20 14:48 ` Huang, Ying
[not found] ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
1 sibling, 1 reply; 81+ messages in thread
From: Milton Miller @ 2007-07-20 14:01 UTC (permalink / raw)
To: linux-pm, LKML; +Cc: David Lang, Jeremy Maitin-Shepard, Ying Huang
If you are wondering why you got this for the third time:
I didn't get the list right the first time, so when I didn't see it on
the archive I made some small changes and sent it again. Then I
retrieved my incoming mail and found it had gone out, just incorrectly.
I hope this one makes it to the list. I'll reply to the comments
shortly.
Hi. I've found this thread from the kjump thread on the kexec mailing
list. I'll respond to that one later, but I wanted to respond to
several messages in this thread. [Actually, there is a brief outline
of a response near the bottom of this note]. I downloaded the archive
to get message-ids and references, hopefully I don't break the
threading badly.
First, my background is hardware and low level software. I wrote the
initial ppc64 support for kexec. I try not to learn too much about x86
details, but have read this thread and the tutorial article mentioned
in it. I've studied the suspend code at various times in the past but
not in the last year. Several years ago I mused that kexec could be used
to restore the suspend image, but I was told that would not be swsuspend.
Next: let's define what we are trying to solve with the kexec approach.
We can solve the case of getting the drivers to quiesce the hardware
with requests from userspace suspended in queues. This is what the
powermac suspend has been doing for years, and I think it's agreed that
it will be something similar if and when we remove the freezer. In the
powermac code there are two notifications to drivers; in the first
stage they can allocate memory and interrupts are enabled, in the
second they only copy enough device state to memory to restart the device.
The problem with all the current hibernate methods is how we select
what needs to be re-enabled to write the image to stable storage (be it
disk, network, or other medium).
The kjump kexec proposal says after we have the io quiesced, we can
jump to a totally new kernel, let it initialize the io it needs, and
write the image. This has the advantage that there is no confusion as
far as which requests should be serviced or not serviced by the driver
and subsystem stacks. It also reuses all the drivers, which means we
don't get untested code paths. It also has the advantage that we can
use any complicated user stack to access a file system and run any
desired access methods (eg encryption, raid, etc).
The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image
I'll now start with quotes from several articles in this thread and my
responses.
Message-ID: <200707172217.01890.rjw@sisk.pl>
On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
> (1) Upon entering the sleep state, which IMO can be done _after_ the
> image
> has been saved:
> * figure out which devices can wake up
> * put devices into low power states (wake-up devices are placed in
> the Dx
> states compatible with the wake capability, the others are powered
> off)
> * execute the _PTS global control method
> * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
> * execute the _GTS global control method
> * set the GPE enable registers corresponding to the wake-up devices)
> * make the platform enter S4 (there's a well defined procedure for
> that)
> I think that this should be done by the image-saving kernel.
Message-ID: <87odiag45q.fsf@jbms.ath.cx>
On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
expressed his agreement with this block but also confusion on the other
blocks.
I strongly disagree.
(1) as has been pointed out, this requires the new kernel to understand
all io devices in the first kernel.
(2) it requires both kernels to talk to ACPI. This is doomed to
failure. How can the second kernel initialize ACPI? The platform
thinks it has already been initialized. Do we plan to always undo all
acpi initialization?
> (2) Upon start-up (by which I mean what happens after the user has
> pressed
> the power button or something like that):
> * check if the image is present (and valid) _without_ enabling ACPI
> (we don't
> do that now, but I see no reason for not doing it in the new
> framework)
> * if the image is present (and valid), load it
> * turn on ACPI (unless already turned on by the BIOS, that is)
> * execute the _BFS global control method
> * execute the _WAK global control method
> * continue
> Here, the first two things should be done by the image-loading
> kernel, but
> the remaining operations have to be carried out by the restored
> kernel.
Here I agree. The kernel that puts the machine to sleep is the one
that executes the wakeup.
Here is my proposal. Instead of trying to both write the image and
suspend, this all becomes much simpler if we limit the scope of the work
of the second kernel. Its purpose is to write the image. After that
it's done. The platform can be powered off if we are going to S5.
However, to support suspend to ram and suspend to disk, we return to
the first kernel.
This means that the first kernel will need to know why it got resumed.
Was the system powered off, and this is the resume from the user
wakeup? Or was it restarted because the image has been saved, and it's
now time to actually suspend until woken up? If you look at it, this
is the same interface we have with the magic arch_suspend hook: did
we just suspend and it's time to write the image, or did we just resume
and it's time to wake everything up?
I think this can be implemented easily by giving the image saving
kernel two resume points: one for the image has been written, and one
for we rebooted and have restored the image. I'm not familiar with
ACPI. Perhaps we need a third to differentiate we read the image from
S4 instead of from S5, but that information must be available to the OS
because it needs that to know if it should resume from hibernate.
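A rough sketch of the resume-point idea, in Python purely for illustration. The enum members and the action strings are hypothetical names invented here, not anything in the kjump patch; the S4 case is included only because of the "perhaps we need a third" remark above.

```python
from enum import Enum


class ResumeReason(Enum):
    """Why the first kernel was re-entered (hypothetical names)."""
    IMAGE_WRITTEN = 1      # save kernel finished writing the image
    RESTORED_FROM_S5 = 2   # machine was powered off, image was read back
    RESTORED_FROM_S4 = 3   # optional third entry point for an S4 wakeup


def next_action(reason):
    """Map a resume reason to what the first kernel should do next."""
    if reason is ResumeReason.IMAGE_WRITTEN:
        # The image is safe on stable storage; now actually go to sleep
        # (or power off), since that was deferred until after the save.
        return "enter sleep state"
    if reason is ResumeReason.RESTORED_FROM_S4:
        # Assumption: platform wake methods run only on the S4 path.
        return "run platform wake methods, then wake devices"
    return "wake devices"
```

The point of the sketch is only that the choice is made by which entry point was used, so the first kernel never has to guess.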
By making the split at image save and restore we have several
advantages:
(1) the kernel always initializes with devices in the init or quiesced
but active state.
(2) the kernel always resumes with devices in the init or quiesced but
active state.
(3) the kjump save and restore kernel does not need to know how to
suspend all devices in the platform.
(4) we have a merged path for suspend to disk, suspend to ram, and
suspend to both.
(5) because of (4), we can implement sleep policies where we save the
image to disk but try to stay in ram based on expected remaining
battery life.
(6) we confine all platform (acpi) interaction to the main kernel
(7) we limit the knowledge needed in the second kernel. It needs to
know how to do its job and then put the hardware back how it found it.
Nothing more.
For the suspend to both and then woke up case, we simply need to
invalidate the image before resuming normal kernel operation (so that a
later power off and then boot doesn't resume from this stale point).
People have worried about how to boot and restore the kernel, and what
to do if reading the image fails. They worry about needing memory
hotplug or delayed acpi parsing. They are forgetting one thing. This
kernel has support for kexec.
The restore or new boot is easily solved by having the bootloader from
the bios always boot the restore kernel. It will boot with the same
limited usable memory and no acpi support that it had when saving to disk;
it just runs a different initramfs. If the restore kernel userspace
detects that there is no restore image, it simply loads the normal main
kernel and initrd / initramfs and calls the normal kexec. The cost is
the time to init the restore kernel, read the kernel with full drivers
(vs reading it from the bootloader). If you want a boot menu, kboot
(on sourceforge) has already been written.
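The boot flow described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `read_image` callable supplied by the restore kernel's userspace; the returned strings stand in for the real actions.

```python
def boot_plan(read_image):
    """Decide what the always-booted restore kernel does next.

    read_image is a hypothetical callable that returns the image bytes,
    returns None when no image exists, or raises OSError on a read failure.
    """
    try:
        image = read_image()
    except OSError:
        # Reading the image failed: fall back to a normal boot.
        image = None
    if image is None:
        return "kexec main kernel"
    return "restore image"


def failing_reader():
    """Stand-in for a reader that hits a media error."""
    raise OSError("media error")
```

Because the fallback is just another kexec, a missing or unreadable image needs no special handling in the bootloader.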
On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
> On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
>> On Tue, 17 Jul 2007, Alan Stern wrote:
>>> But what about the freezer? The original reason for using kexec was
>>> to
>>> avoid the need for the freezer. With no freezer, while the original
>>> kernel is busy powering down its devices, user tasks will be free to
>>> carry out I/O -- which will make the memory snapshot inconsistent
>>> with
>>> the on-disk data structures.
>>
>> no, user tasks just don't get scheduled during shutdown.
>>
>> the big problem with the freezer isn't stopping anything from
>> happening,
>> it's _selectivly_ stopping things.
Agreed. Or rather, selectively not stopping and resuming things.
> It's selectively stopping kernel threads, which is just about right.
> If you think
> that _this_ is a main problem with the freezer, then think again.
>
>> with kexec you don't need to let any portion of the origional kernel
>> or
>> userspace operate so you don't have a problem.
>
> In fact, the main problem with the freezer is that it is a
> coarse-grained
> solution. Therefore, what I believe we should do is to evolve in the
> directoin
> of more fine-grained solutions and gradually phase out the freezer.
>
> The kexec-based approach is an attempt to replace one coarse-grained
> solution
> (the freezer) with even more coarse-grained solution (stopping the
> entire
> kernel with everything), which IMO doesn't address the main problem.
>
I think this addresses the problem. We do the same stop that suspend
does, and don't try to figure out what fine pieces are needed.
Instead we declare no pieces are re-enabled. It's a bit harder than
powermac because we have to fully quiesce devices; we can't cheat by
leaving interrupts off. But once the drivers save the state of their
devices and stop their queues, it should be easy to audit the paths to
powerdown devices and call the platform suspend and ram wakeup paths.
Going back to the requirements document that started this thread:
Message-ID: <200707151433.34625.rjw@sisk.pl>
On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> (1) Filesystems mounted before the hibernation are untouchable
This is because some file systems do a fsck or other activity even when
mounted read only. For the kexec case, however, this should be "media
open (including file systems mounted) by the hibernated system must not
be written". As has been mentioned in the past, we should be able to
use something like dm snapshot to allow fsck and the file system to see
the cleaned copy while not actually writing the media.
In fact, I claim we can even overwrite a file using its block numbers.
The suspended kernel won't know about it and its buffers will be
stale. But if the meta data doesn't change, it just means that you
can't read the suspend image file.
The kjump kernel must not retain any buffers if we reuse it.
> (2) Swap space in use before the hibernation must be handled with care
Yes. Actually, even though they have been used by the in-kernel image
writers, they will be among the most difficult devices to use for
snapshots by a userspace second kernel. We have to communicate a list
of blocks.
> (3) There are memory regions that must not be saved or restored
because they may not exist. This means that we must identify the
memory to be saved and restored in a format to be passed between the
kernel.
I'm worried about regions that are marked nosave because they hold data
that someone thought should be preserved across restore. If it's just
the variables to read the image then we are ok.
> (4) The user should be able to limit the size of a hibernation image
This means the suspending kernel must arrange to reduce its active
memory. The list of memory would be provided to the new kernel.
> (5) Hibernation should be transparent from the applications' point of
> view
People have pointed out they may want userspace to be aware of the
suspend. I believe this can be done with /proc/apm emulation today
or by other means; it seems that should be hooked up to dbus in some
fashion.
> (6) State of devices from before hibernation should be restored, if
> possible
related to suspend should be transparent ... yes. This is a driver
problem.
> (7) On ACPI systems special platform-related actions have to be
> carried out at
> the right points, so that the platform works correctly after the
> restore
I believe I have explained my suggestion.
> (8) Hibernation and restore should not be too slow
We control the added code. We are using full runtime drivers and will
run at hardware speeds.
> (9) Hibernation framework should not be too difficult to set up
OK, the current patch is presently too difficult to set up. But I think it will
be much simpler with a few small changes.
As noted in another thread
Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
Subject: [linux-pm] Re: hibernation/snapshot design
on Mon Jul 9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
>> Both would work. One would eat 8-64MB of your RAM, permanently;
>
> As I have stated in other messages, the kdump approach would not waste
> any RAM permanently.
...
> Immediately before jumping to the new kernel, the first X bytes (where
> X
> is the amount of memory the new kernel will get, typically 16MB or
> 64MB)
> of physical memory are backed up into the arbitrary discontiguous pages
> that are made available. This will not take very long, because copying
> even 64MB of memory is extremely fast. Then the new kernel is free to
> use the first X bytes of contiguous physical memory. Problem solved.
Ok, now let's look at my list again:
> (1) how to interact with acpi to enter into S4.
This was discussed.
> (2) how to identify which memory needs to be saved
We need to generate a list. We need it to fit in a computable size so
that we can free extra pages and allocate the list before suspending IO
in the first kernel.
One possibility is to use something like the kexec copy list. If we
are imaging a small fraction of ram this is appropriate, but if we are
doing dense saves we need something extent based. We may enhance the
list with page counts.
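To illustrate the difference between a per-page list and an extent-based one, here is a small sketch that collapses page frame numbers into (start, count) runs. The function name and the tuple layout are assumptions made for this example, not the kexec copy list format.

```python
def pfns_to_extents(pfns):
    """Collapse page frame numbers into (start_pfn, page_count) extents."""
    extents = []
    for pfn in sorted(set(pfns)):
        if extents and extents[-1][0] + extents[-1][1] == pfn:
            # pfn directly follows the last extent: grow it by one page.
            start, count = extents[-1]
            extents[-1] = (start, count + 1)
        else:
            # Gap before this pfn: start a new extent.
            extents.append((pfn, 1))
    return extents
```

For a dense save the extent list stays short no matter how many pages are covered, which is what makes its size predictable before IO is suspended.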
> (3) how to communicate where to save the memory
This is an interesting topic. The suspended kernel has most IO and disk
space. It also knows how much space is to be occupied by the kernel.
So communicating a block map to the second kernel would be the obvious
choice. But the second kernel must be able to find the image to
restore it, and it must have drivers for the media. Also, this is not
feasible for storing to nfs.
I think we will end up with several methods.
One would be to supply a list of blocks, and write to those blocks the
list of blocks followed by the image contents. The restore kernel then
only needs to know the starting block. It would read the anchor, and
can build upon that until the image is read into memory. We could
implement a file system, or do this in userspace.
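A toy model of this first method, with a dict standing in for the disk. The on-disk layout here (image length, block count, then block numbers, all in the anchor block) is an assumption made up for the sketch, as are the function names and the tiny block size; a real format would need to chain further list blocks when the list outgrows the anchor.

```python
import struct

BLOCK = 64  # tiny block size, for illustration only


def save_image(disk, blocks, image):
    """Write the block list into the first block, then the image data."""
    anchor, data_blocks = blocks[0], blocks[1:]
    header = struct.pack("<II", len(image), len(data_blocks))
    header += struct.pack("<%dI" % len(data_blocks), *data_blocks)
    if len(header) > BLOCK:
        raise ValueError("block list does not fit in the anchor block")
    if len(image) > len(data_blocks) * BLOCK:
        raise ValueError("not enough blocks for the image")
    disk[anchor] = header.ljust(BLOCK, b"\0")
    for i, blk in enumerate(data_blocks):
        chunk = image[i * BLOCK:(i + 1) * BLOCK]
        disk[blk] = chunk.ljust(BLOCK, b"\0")
    return anchor


def load_image(disk, anchor):
    """Rebuild the image knowing only the starting (anchor) block."""
    length, count = struct.unpack_from("<II", disk[anchor], 0)
    data_blocks = struct.unpack_from("<%dI" % count, disk[anchor], 8)
    raw = b"".join(disk[blk] for blk in data_blocks)
    return raw[:length]
```

The restore side never sees the original block list; everything it needs is reachable from the anchor.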
I don't know how this compares to the current restore path. I wasn't
able to identify the code that creates the on disk structure in my 10
minute perusal of kernel/power/.
A second method will be to supply a device and file that will be
mounted by the save kernel, then unmounted and restored. This would
require a partition that is not mounted or open by the suspended kernel
(or use nfs or a similar protocol that is designed for multiple client
concurrent access).
A third method would be to allocate a file with the first kernel, and
make sure the blocks are flushed to disk. The save and restore kernels
map the file system using a snapshot device. Writing would map the
blocks and use the block offset to write to the real device using the
method from the first option; reading could be done directly from the
snapshot device.
The first and third option are dead on log based file systems (where
the data is stored in the log).
> (4) what state should devices be in when switching kernels
My proposal is either initialized and untouched or quiesced. Not low
power unless that is normal for the platform.
> (5) the complicated setup required with the current patch
I think a few simple changes to kjump will make this much simpler. See
below.
> (6) what code restores the image
The save kernel, loaded at boot. People have suggested booting the
first kernel, and using current restore code. However, I think that
ignores that (1) we saved from a different kernel, so the backed up
region will be restored to its backed up location in random pages, (2)
the code was written to restore the same kernel, so the text and data
will be replaced by identical text. It's much simpler conceptually to
use the same kernel to save and restore the image. If we decide not to
restore we can load and kexec the main kernel.
Simplifying kjump: the proposal for v3.
The current code is trying to use crash dump area as a safe, reserved
area to run the second kernel. However, that means that the kernel
has to be linked specially to run in the reserved area. I think we
need to finish separating kexec_jump from the other code paths.
(1) add a new command line argument that specifies the kexec_jump
target area (or just size?)
(2) add a kjump flag to the flags parameter, used by kexec_load. When
loading a jump kernel, it is loaded like a normal kernel, however,
additional control pages are allocated to (a) save this kernel's use of
the kexec_jump target area (b) save the backed up region that is used
by all kernels like crash dump, and (c) space for invoking
relocate_new_kernel that will get its args from the execution entry
point and will restore the kernel then call resume and suspend.
(3) replace jump_buf_pfn with two command line addresses that specify
the (a) return point for after resume, and (b) the return point for
after image save. Actually these can be done in userspace; the second
restore kernel can just specify the null copy list and the entry points
supplied by the suspended kernel. To do resume we also need (c) where
to store resume address for the save kernel.
The separation should be whoever builds a scatter copy list builds the
inverse list. This is why I propose simple jump entry points. I
expect just a few instructions to establish arguments for the call to
the existing relocate_new_kernel code.
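A small sketch of what "builds the inverse list" could look like, with memory modeled as a dict of pages. The helper names and the idea of separate backup pages are assumptions for illustration; this is not the relocate_new_kernel interface.

```python
def build_copy_lists(image_pages, target_pages, backup_pages):
    """Return (forward, inverse) lists of (src, dst) page copies.

    forward first saves the target area into backup pages, then copies
    the new kernel's pages into the target area; inverse puts the saved
    target contents back.
    """
    if not (len(image_pages) == len(target_pages) == len(backup_pages)):
        raise ValueError("page lists must have equal length")
    save = list(zip(target_pages, backup_pages))
    load = list(zip(image_pages, target_pages))
    forward = save + load
    inverse = [(dst, src) for src, dst in save]
    return forward, inverse


def scatter_copy(mem, copy_list):
    """Apply a list of (src, dst) page copies in order."""
    for src, dst in copy_list:
        mem[dst] = mem[src]
```

Since both lists come from the same builder, the jump entry points only have to pass the right list to the copy routine.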
As a first stage of suspend and resume, we can save to dedicated
partitions all memory (as supplied to crash_dump) that is not marked
nosave and not part of the save kernel's image. The fancy block lists
and memory lists can be added later.
Making these changes will allow us to use a normal kernel invoked with
acpi=off apm=off mem=xxk as the save and restore kernel.
If we want to keep the second kernel booted, then we need to add a save
area for the booted jump target. Note that the save and restore lists
to relocate_new_kernel can be computed once and saved. Longer term we
could implement sys_kexec_load(UNLOAD) that would retrieve the saved
list back to application space to save to disk in a file. This means
you could save the booted save kernel, it just couldn't have any shared
storage open.
I'll try to expand on this in the jump v2 thread, but it may be days
until I do.
milton
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
2007-07-20 14:01 ` Milton Miller
@ 2007-07-20 14:48 ` Huang, Ying
2007-07-20 15:48 ` david
2007-07-20 21:34 ` Rafael J. Wysocki
0 siblings, 2 replies; 81+ messages in thread
From: Huang, Ying @ 2007-07-20 14:48 UTC (permalink / raw)
To: Milton Miller; +Cc: David Lang, LKML, linux-pm, Jeremy Maitin-Shepard
On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
> Simplifying kjump: the proposal for v3.
>
> The current code is trying to use crash dump area as a safe, reserved
> area to run the second kernel. However, that means that the kernel
> has to be linked specially to run in the reserved area. I think we
> need to finish separating kexec_jump from the other code paths.
>
> (1) add a new command line argument that specifies the kexec_jump
> target area (or just size?)
>
> (2) add a kjump flag to the flags parameter, used by kexec_load. When
> loading a jump kernel, it is loaded like a normal kernel, however,
> additional control pages are allocated to (a) save this kenrel's use of
> the kexec_jump target area (b) save the backed up region that is used
> by all kernels like crash dump, and (c) space for invoking
> relocate_new_kernel that will get its args from the execution entry
> point and will restore the kernel then call resume and suspend.
Backing up target memory before kexec and restoring it after kexec is
a planned feature for kexec jump. But I will work on image writing/reading
first.
> (3) replace jump_huf_pfn with two command line addresses that specify
> the (a) return point for after resume, and (b) the return point for
> after image save. Actually these can be done in userspace; the second
> restore kernel can just specify the null copy list and the entry points
> supplied by the suspended kernel. To do resume we also need (c) where
> to store resume address for the save kernel.
There is a lot of free space in the jump_buf_pfn page now. I think passing the
needed information through jump_buf_pfn is more convenient than through the
kernel command line. That is, jump_buf_pfn can be seen as a meta
interface, which is passed to the kexeced kernel through the command line, while
other information can be passed through jump_buf_pfn.
> The seperation should be whoever builds a scatter copy list builds the
> inverse list. This is why I propose simple jump entry points. I
> expect just a few instructions to establish arguments for the call to
> the exstinging relocate_new_kernel code.
If the "scatter copy" is replaced by "scatter swap", we do not need the
inverse list, and the state of the kexeced kernel can be backed up too. There
is "scatter copy" support in the normal kexec implementation in
"relocate_kernel".
> As a first stage of suspend and resume, we can save to dedicated
> partitions all memory (as supplied to crash_dump) that is not marked
> nosave and not part of the save kernel's image. The fancy block lists
> and memory lists can be added later.
>
> Mmaking these changes will allow us to use a normal kernel invoked
> with
> acpi=off apm=off mem=xxk as the save and restore kernel.
Yes, I am working on this.
> If we want to keep the second kernel booted, then we need to add a save
> area for the booted jump target. Note that the save and restore lists
> to relocate_new_kernel can be computed once and saved. Longer term we
> could implement sys_kexec_load(UNLOAD) that would retrieve the saved
> list back to application space to save to disk in a file. This means
> you could save the booted save kernel, it just couldn't have any shared
> storage open.
Yes, this is also planned, but with lower priority, and it will only be
added if necessary.
Best Regards,
Huang Ying
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
2007-07-20 14:48 ` Huang, Ying
@ 2007-07-20 15:48 ` david
2007-07-22 2:17 ` Huang, Ying
[not found] ` <1185070634.3517.11.camel@caritas-dev.intel.com>
2007-07-20 21:34 ` Rafael J. Wysocki
1 sibling, 2 replies; 81+ messages in thread
From: david @ 2007-07-20 15:48 UTC (permalink / raw)
To: Huang, Ying; +Cc: LKML, Milton Miller, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Huang, Ying wrote:
>
> On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
>> Simplifying kjump: the proposal for v3.
>>
>> The current code is trying to use crash dump area as a safe, reserved
>> area to run the second kernel. However, that means that the kernel
>> has to be linked specially to run in the reserved area. I think we
>> need to finish separating kexec_jump from the other code paths.
>>
>> (1) add a new command line argument that specifies the kexec_jump
>> target area (or just size?)
>>
>> (2) add a kjump flag to the flags parameter, used by kexec_load. When
>> loading a jump kernel, it is loaded like a normal kernel, however,
>> additional control pages are allocated to (a) save this kenrel's use of
>> the kexec_jump target area (b) save the backed up region that is used
>> by all kernels like crash dump, and (c) space for invoking
>> relocate_new_kernel that will get its args from the execution entry
>> point and will restore the kernel then call resume and suspend.
>
> Backuping target memory before kexec and restoring it after kexec is
> planed feature for kexec jump. But I will work on image writing/reading
> first.
if we can get a list of what memory is safe to backup/restore then the
reading/writing of the image should be able to be done in userspace.
>> (3) replace jump_huf_pfn with two command line addresses that specify
>> the (a) return point for after resume, and (b) the return point for
>> after image save. Actually these can be done in userspace; the second
>> restore kernel can just specify the null copy list and the entry points
>> supplied by the suspended kernel. To do resume we also need (c) where
>> to store resume address for the save kernel.
>
> There is many free spaces in jump_buf_pfn page now. I think passing the
> needed information through jump_buf_pfn is more convenient than through
> kernel command line. That is, the jump_buf_pfn can be seen as a meta
> interface, which is passed to kexeced kernel though command line, while
> other information can be passed though jmp_buf_pfn.
>> The seperation should be whoever builds a scatter copy list builds the
>> inverse list. This is why I propose simple jump entry points. I
>> expect just a few instructions to establish arguments for the call to
>> the exstinging relocate_new_kernel code.
>
> If the "scatter copy" is replaced by "scatter swap", we need not the
> inverse list, and the state of kexeced kernel can be backuped too. There
> are "scatter copy" support in normal kexec implementation in
> "relocate_kernel".
what do you mean by "scatter swap"?
David Lang
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <200707192228.05136.rjw@sisk.pl>
@ 2007-07-20 16:08 ` Milton Miller
2007-07-20 16:20 ` Alan Stern
2007-07-20 21:02 ` Rafael J. Wysocki
[not found] ` <Pine.LNX.4.64.0707191542430.28721@asgard.lang.hm>
1 sibling, 2 replies; 81+ messages in thread
From: Milton Miller @ 2007-07-20 16:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:
> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>> The currently identified problems under discussion include:
>> (1) how to interact with acpi to enter into S4.
>> (2) how to identify which memory needs to be saved
>> (3) how to communicate where to save the memory
>> (4) what state should devices be in when switching kernels
>> (5) the complicated setup required with the current patch
>> (6) what code restores the image
>
> (7) how to avoid corrupting filesystems mounted by the hibernated
> kernel
>
OK, I talked about this too.
>> I'll now start with quotes from several articles in this thread and my
>> responses.
>>
>> Message-ID: <200707172217.01890.rjw@sisk.pl>
>> On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
>>> (1) Upon entering the sleep state, which IMO can be done _after_ the
>>> image
>>> has been saved:
>>> * figure out which devices can wake up
>>> * put devices into low power states (wake-up devices are placed in
>>> the Dx
>>> states compatible with the wake capability, the others are
>>> powered
>>> off)
>>> * execute the _PTS global control method
>>> * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
>>> * execute the _GTS global control method
>>> * set the GPE enable registers corresponding to the wake-up
>>> devices)
>>> * make the platform enter S4 (there's a well defined procedure for
>>> that)
>>> I think that this should be done by the image-saving kernel.
>>
>> Message-ID: <87odiag45q.fsf@jbms.ath.cx>
>> On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
>> expressed his agreement with this block but also confusion on the
>> other
>> blocks.
>>
>>
>> I strongly disagree.
>>
>> (1) as has been pointed out, this requires the new kernel to
>> understand
>> all io devices in the first kernel.
>> (2) it requires both kernels to talk to ACPI. This is doomed to
>> failure. How can the second kernel initialize ACPI? The platform
>> thinks it has already been initialized. Do we plan to always undo all
>> acpi initialization?
>
> Good question. I don't know.
>>> (2) Upon start-up (by which I mean what happens after the user has
>>> pressed
>>> the power button or something like that):
>>> * check if the image is present (and valid) _without_ enabling ACPI
>>> (we don't
>>> do that now, but I see no reason for not doing it in the new
>>> framework)
>>> * if the image is present (and valid), load it
>>> * turn on ACPI (unless already turned on by the BIOS, that is)
>>> * execute the _BFS global control method
>>> * execute the _WAK global control method
>>> * continue
>>> Here, the first two things should be done by the image-loading
>>> kernel, but
>>> the remaining operations have to be carried out by the restored
>>> kernel.
>>
>> Here I agree.
>>
>> Here is my proposal. Instead of trying to both write the image and
>> suspend, I think this all becomes much simpler if we limit the scope
>> the work of the second kernel. Its purpose is to write the image.
>> After that its done. The platform can be powered off if we are going
>> to S5. However, to support suspend to ram and suspend to disk, we
>> return to the first kernel.
>
> We can't do this unless we have frozen tasks (this way, or another)
> before
> carrying out the entire operation.
What can't we do? We've already worked with the drivers to quiesce the
hardware and put any information to resume the device in ram. Now we
ask them to put their device in low power mode so we can go to sleep.
Even if we schedule, the only thing userspace could touch is memory.
If we resume, they just run those computations again.
> In that case, however, the kexec-based
> approach would have only one advantage over the current one. Namely,
> it
> would allow us to create bigger images.
The advantage is we don't have to come up with a way to teach drivers
"wake up to run these requests, but no other requests". We don't have
to figure out what we need to resume to allow them to process a
request.
>> This means that the first kernel will need to know why it got resumed.
>> Was the system powered off, and this is the resume from the user? Or
>> was it restarted because the image has been saved, and its now time to
>> actually suspend until woken up? If you look at it, this is the same
>> interface we have with the magic arch_suspend hook -- did we just
>> suspend and its time to write the image, or did we just resume and its
>> time to wake everything up.
>>
>> I think this can be easily solved by giving the image saving kernel
>> two
>> resume points: one for the image has been written, and one for we
>> rebooted and have restored the image. I'm not familiar with ACPI.
>> Perhaps we need a third to differentiate we read the image from S4
>> instead of from S5, but that information must be available to the OS
>> because it needs that to know if it should resume from hibernate.
>>
>> By making the split at image save and restore we have several
>> advantages:
>>
>> (1) the kernel always initializes with devices in the init or quiesced
>> but active state.
>>
>> (2) the kernel always resumes with devices in the init or quiesced but
>> active state.
>>
>> (3) the kjump save and restore kernel does not need to know how to
>> suspend all devices in the platform.
>>
>> (4) we have a merged path for suspend to disk, suspend to ram, and
>> suspend to both.
>>
>> (5) because of (4), we can implement sleep policies where we save the
>> image to disk but try to stay in ram based on expected remaining
>> battery life.
>>
>> (6) we confine all platform (acpi) interaction to the main kernel
>>
>> (7) we limit the knowledge needed in the second kernel. It needs to
>> know how to do its job and then put the hardware back how it found it.
>> Nothing more.
>
> This would have been nice if we had been able to do it.
I don't understand this comment. "if we had been able"? I don't
think we have tried yet.
>
>> For the suspend to ram and then woken up case, we simply need to
>> invalidate the image before restarting normal kernel operation.
>>
>> People have worried about how to boot and restore the kernel, and what
>> to do if reading the image fails. They worry about needing memory
>> hotplug or delayed acpi parsing. They are forgetting one thing. This
>> kernel has support for kexec.
>>
>> This is all easily solved by having the bootloader from the bios
>> always
>> boot the restore kernel.
>
> Well, I think this is not generally acceptable, although I agree that
> it would
> be simpler.
Those who don't find it acceptable can teach their bootloader to detect
when there may be an image to resume.
>> It will boot with limited useable memory and
>> no acpi support. If the restore kernel userspace detects that there
>> is
>> no restore image, it simply loads the normal main kernel and initrd /
>> initramfs and calls the normal kexec. The cost is the time to init
>> the
>> restore kernel, read the kernel with full drivers (vs reading it from
>> the bootloader). If you want a boot menu, use kboot (on sourceforge).
>
> Well, I'm afraid of adding more and more infrastructure to the mix.
Requiring the hibernated kernel to be able to start from kexec should
not be bad. If you were referring to adding kboot, that is just an
option.
One can still use bootloader menus to select alternate kernels.
However, as you said, you want to boot differently for resume (no ACPI
until after the image is loaded) than for a full boot.
>> On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
>>> On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
>>>> On Tue, 17 Jul 2007, Alan Stern wrote:
>>>>> But what about the freezer? The original reason for using kexec
>>>>> was
>>>>> to
>>>>> avoid the need for the freezer. With no freezer, while the
>>>>> original
>>>>> kernel is busy powering down its devices, user tasks will be free
>>>>> to
>>>>> carry out I/O -- which will make the memory snapshot inconsistent
>>>>> with
>>>>> the on-disk data structures.
>>>>
>>>> no, user tasks just don't get scheduled during shutdown.
>>>>
>>>> the big problem with the freezer isn't stopping anything from
>>>> happening,
>>>> it's _selectivly_ stopping things.
>>
>> Agreed. Or rather, selectively not stopping and resuming things.
>
> I don't quite understand this statement. Can you please elaborate?
Feel free to list other problems with the freezer, but I'm saying that
the problems are stemming from trying to freeze most of userspace and
some selection of kernel threads so that new requests to the outside
are not made, but then turning around and saying "ok now do some io,
but only what this thread of execution originates". It's "originates", not
"generates", so we are trying to teach the whole stack these limits,
including going back to userspace for FUSE.
>>> It's selectively stopping kernel threads, which is just about right.
>>> If you think
>>> that _this_ is a main problem with the freezer, then think again.
>>>
>>>> with kexec you don't need to let any portion of the original kernel
>>>> or userspace operate so you don't have a problem.
>>>
>>> In fact, the main problem with the freezer is that it is a
>>> coarse-grained
>>> solution. Therefore, what I believe we should do is to evolve in the
>>> direction
>>> of more fine-grained solutions and gradually phase out the freezer.
>>>
>>> The kexec-based approach is an attempt to replace one coarse-grained
>>> solution
>>> (the freezer) with even more coarse-grained solution (stopping the
>>> entire
>>> kernel with everything), which IMO doesn't address the main problem.
>>>
>>
>> I think this addresses the problem. It's probably a bit harder than
>> powermac because we have to fully quiesce devices; we can't cheat by
>> leaving interrupts off. But once the drivers save the state of their
>> devices and stop their queues, it should be easy to audit the paths to
>> powerdown devices and call the platform suspend and ram wakeup paths.
In other words, I'm replacing a coarse-grained solution with an
absolute solution. "From this point on you can only write to ram."
>> Going back to the requirements document that started this thread:
>>
>> Message-ID: <200707151433.34625.rjw@sisk.pl>
>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>>> (1) Filesystems mounted before the hibernation are untouchable
>>
>> This is because some file systems do a fsck or other activity even
>> when
>> mounted read only. For the kexec case, however, this should be "file
>> systems mounted by the hibernated system must not be written". As
>> has
>> been mentioned in the past, we should be able to use something like dm
>> snapshot to allow fsck and the file system to see the cleaned copy
>> while not actually writing the media.
>
> We can't _require_ users to use the dm snapshot in order for the
> hibernation
> to work, sorry.
I actually listed three ways to start. Not all of them required
dm-snapshot. I was proposing "if you need to read ext3, then use
dm-snapshot".
> And by _reading_ from a filesystem you generally update metadata.
not on ones mounted read-only. I'll reply more later in the thread.
>> The kjump kernel must not have any knowledge retained if we reuse it.
>>
>>> (2) Swap space in use before the hibernation must be handled with
>>> care
>>
>> Yes. Actually, even though they have been used by the write-in-the
>> kernel users, they will be among the most difficult devices to use for
>> snapshots by a userspace second kernel.
>>
>>> (3) There are memory regions that must not be saved or restored
>>
>> because they may not exist. This means that we must identify the
>> memory to be saved and restored in a format to be passed between the
>> kernel.
>>
>>> (4) The user should be able to limit the size of a hibernation image
>>
>> This means the suspending kernel must arrange to reduce its active
>> memory. The limited save can be done by providing a limited list in
>> (3).
>
> It seems to me that you don't understand the problem here.
>
> Assume you have 90% of RAM allocated before the hibernation and the
> user has
> requested the image to be not greater than 50% of RAM. In that case
> you have
> to free some memory _before_ identifying memory to save and you must
> not
> race with applications that attempt to allocate memory while you're
> doing it.
Hmm... I didn't say how to reduce the memory or identify it, did I?
Ok fine. I'll allocate a bunch of memory and put it on a list.
Normal memory pressure will swap things out or drop filesystem pages.
When I build the list of memory to backup, I filter out this list.
After resume, I'll free it back.
We can arrange for this "task" to be preferred by the oom killer, in
case the user is trying to suspend into less memory than can be
freed.
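As a rough illustration of the allocate-and-filter idea just described, here is a minimal Python sketch. It is only a model under stated assumptions: the function names and the use of plain integer page frame numbers are hypothetical and do not correspond to any existing kernel interface.

```python
def balloon_pages_needed(in_use_count, image_limit):
    """How many pages the 'balloon' task must hold so that the
    remaining in-use pages fit within the requested image limit."""
    return max(0, in_use_count - image_limit)


def build_save_list(in_use_pfns, balloon_pfns, nosave_pfns=()):
    """Return the sorted page frame numbers that go into the image.

    Pages held by the balloon task and pages marked nosave are
    filtered out; everything else that is in use is saved.
    """
    skip = set(balloon_pfns) | set(nosave_pfns)
    return sorted(pfn for pfn in in_use_pfns if pfn not in skip)
```

After resume the balloon pages would simply be freed back; the model does not try to show how memory pressure evicts other pages while the balloon grows.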
>>> (5) Hibernation should be transparent from the applications' point of
>>> view
>>
>> People have pointed out they may want userspace to be aware of the
>> suspend. I believed this can be done with /proc/apm emulation today
>> or by other means; it seems that should be hooked up to dbus in some
>> fashion.
>
> Not a solution, because there still will be programs not needing to
> know
> anything about hibernation. After all, we don't require all
> applications to
> know anything about SMP, even if they are executed on an SMP system.
How do any of those methods require userspace to know anything about
hibernation? I was talking about a general framework consistent with
today's kernel to user communication for those parts of userspace that
*want* to know about suspend and hibernation.
>>> (6) State of devices from before hibernation should be restored, if
>>> possible
>>
>> related to suspend should be transparent ... yes.
>>
>>> (7) On ACPI systems special platform-related actions have to be
>>> carried out at
>>> the right points, so that the platform works correctly after the
>>> restore
>>
>> I believe I have explained my suggestion.
>>
>>> (8) Hibernation and restore should not be too slow
>>
>> We control the added code. We are using full runtime drivers and
>> will
>> run at hardware speeds.
>
> That may not be enough. If you're going to save, say, 80% of RAM on a
> 2 GB
> machine, then you'll have to be using image compression.
Yea, so? We have a full kernel and userspace, adding compression
before writing should be easy. There is no struct page for memory in the
old kernel, so we likely need to be copying them in userspace anyways.
Adding compression should be easy.
>>> (9) Hibernation framework should not be too difficult to set up
>>
>> Ok the current patch is presently too difficult. But I think it will
>> be much simpler with a few small changes.
>>
>> As noted in the thread
>>
>> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
>> Subject: [linux-pm] Re: hibernation/snapshot design
>> on Mon Jul 9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
>>>> Both would work. One would eat 8-64MB of your RAM, permanently;
>>>
>>> As I have stated in other messages, the kdump approach would not
>>> waste
>>> any RAM permanently.
>> ...
>>> Immediately before jumping to the new kernel, the first X bytes
>>> (where
>>> X
>>> is the amount of memory the new kernel will get, typically 16MB or
>>> 64MB)
>>> of physical memory are backed up into the arbitrary discontiguous
>>> pages
>>> that are made available. This will not take very long, because
>>> copying
>>> even 64MB of memory is extremely fast. Then the new kernel is free
>>> to
>>> use the first X bytes of contiguous physical memory. Problem solved.
>>
>>
>> Ok, now let's look at my list again:
>>
>>> (1) how to interact with acpi to enter into S4.
>>
>> This was discussed.
>>
>>> (2) how to identify which memory needs to be saved
>>
>> We need to generate a list. We need it to fit in a computable size
>> so
>> that we can free and allocate the pages before suspending IO in the
>> first kernel.
>>
>> One possibility is to use something like the kexec copy list. If we
>> are imaging a small fraction of ram this is appropriate, but if we are
>> doing dense saves we need something extent based. We should be able
>> to
>> extend the list.
>>
>>> (3) how to communicate where to save the memory
>>
>> This is an interesting topic. The suspended kernel has most IO and
>> disk
>> space. It also knows how much space is to be occupied by the kernel.
>> So communicating a block map to the second kernel would be the obvious
>> choice. But the second kernel must be able to find the image to
>> restore it, and it must have drivers for the media. Also, this is not
>> feasible for storing to nfs.
>>
>> I think we will end up with several methods.
>>
>> One would be supply a list of blocks, and implement a file system that
>> reads the file by reading the scatter list from media. The restore
>> kernel then only needs to read an anchor, and can build upon that
>> until
>> the image is read into memory. Or do this in userspace.
>>
>> I don't know how this compares to the current restore path. I wasn't
>> able to identify the code that creates the on disk structure in my 10
>> minute perusal of kernel/power/.
>
> The structure is created at two levels.
>
> First, the code in snapshot.c makes the image available to the code in
> swap.c
> as a stream of pages. The first page is the header, followed by some
> pages
> containing the PFNs of the page frames to which the image data pages
> are to be
> restored, followed by the image data pages themselves (the ordering of
> the PFNs
> must be the same as the ordering of data pages that correspond to
> them).
> Still, the low-level image format only needs to be known by the
> restore code in
> snapshot.c .
Ok sounds like this code could be reused. I'll look into it.
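To make the stream layout described above concrete (a header page, then pages of PFNs, then the data pages in the same order as the PFNs), here is a small Python model. It is a sketch, not the actual format used by snapshot.c: the dictionary fields, the `PFNS_PER_PAGE` value, and the function names are all assumptions chosen for illustration.

```python
PFNS_PER_PAGE = 4  # deliberately tiny, illustrative value only


def pack_image(pages):
    """pages maps pfn -> page contents. Returns the stream as a list:
    one header page, then PFN pages, then data pages in PFN order."""
    pfns = sorted(pages)
    header = {"type": "header", "nr_pages": len(pfns)}
    pfn_pages = [
        {"type": "pfns", "pfns": pfns[i:i + PFNS_PER_PAGE]}
        for i in range(0, len(pfns), PFNS_PER_PAGE)
    ]
    data_pages = [{"type": "data", "data": pages[pfn]} for pfn in pfns]
    return [header] + pfn_pages + data_pages


def unpack_image(stream):
    """Rebuild the pfn -> contents mapping from a packed stream."""
    nr_pages = stream[0]["nr_pages"]
    nr_pfn_pages = -(-nr_pages // PFNS_PER_PAGE)  # ceiling division
    pfns = [p for page in stream[1:1 + nr_pfn_pages] for p in page["pfns"]]
    data = stream[1 + nr_pfn_pages:]
    # The ordering of data pages must match the ordering of the PFNs.
    return {pfn: page["data"] for pfn, page in zip(pfns, data)}
```

The point of the model is only that the reader of the stream needs nothing beyond the header and the PFN pages to place each data page.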
> Second, the code in swap.c writes the image pages to a storage adding
> some
> metadata making it possible to reproduce their original ordering
> during the
> restore.
So you are allocating the blocks as you go ... and adding meta data
along the way?
> The fact that we use swap spaces as the storage is related to
> implementation
> simplicity rather than anything else.
Ok ... this only supports uncompressed hibernation?
The first kernel is going to specify (1) what to backup. It can
specify (2) where to backup, although we have to be careful to identify
the device in a persistent way.
>> A second method will be to supply a device and file that will be
>> mounted by the save kernel, then unmounted and restored. This would
>> require a partition that is not mounted or open by the suspended
>> kernel
>> (or use nfs or a similar protocol that is designed for multiple client
>> concurrent access).
>>
>> A third method would be to allocate a file with the first kernel, and
>> make sure the blocks are flushed to disk. The save and restore
>> kernels
>> map the file system using a snapshot device. Writing would map the
>> blocks and use the block offset to write to the real device using the
>> method from the first option; reading could be done directly from the
>> snapshot device.
>>
>> The first and third option are dead on log based file systems (where
>> the data is stored in the log).
>
> All in all, we have three different and working implementation of the
> image-writing and image-reading code at our disposal. Why would you
> want to
> break the open doors?
The problem I'm saying kexec solves is how to get the data to the
device while most of the kernel is trying not to do anything permanent.
If we can reuse existing code, great.
>>> (4) what state should devices be in when switching kernels
>>
>> My proposal is either initialized and untouched or quiesced.
>
> This is reasonable, but in general we also need to save some
> information
> about the pre-hibernation state of devices, so that we can put them
> into the
> same state, if reasonably possible, during the restore.
What state are you referring to?
Yes, there is state that the drivers have to store to ram, but this is the
same state they need to store when suspending to ram if the device can
be powered off.
Maybe we need to teach drivers to store more state, like remember that
a hard drive was spun down.
So we may need a flag saying "we powered off", "we resumed from
suspend".
>>> (5) the complicated setup required with the current patch
>>
>> I think a few simple changes to kjump will make this much simpler.
>> See
>> below.
>>
>>> (6) what code restores the image
>>
>> The save kernel, loaded at boot. People have suggested booting the
>> first kernel, and using current restore code. However, I think that
>> ignores that (1) we saved from a different kernel, so the backed up
>> region will be restored to its backed up random pages,
>
> This problem has already been solved.
>
>> (2) the code was written to restore the same kernel,
>
> Not exactly. In fact, the current implementation only relies on the
> tiny
> portion of the restore code being in the same place in both kernels,
> but
> we can change the code not to make this assumption (it'll be more
> complicated,
> but that's perfectly doable).
If the save kernel is different from the run kernel (to make it
smaller), it's likely the image saving code will move. I view restoring
from a different kernel than saving as an advanced feature.
Let's get resuming from the save kernel working first.
>> so the text and data will be replaced by identical text. Its much
>> simpler
>> conceptually to use the same kernel to save and restore the image.
>
> Here I agree. :-)
>
>> Simplifying kjump: the proposal for v3.
>>
>> The current code is trying to use crash dump area as a safe, reserved
>> area to run the second kernel. However, that means that the kernel
>> has to be linked specially to run in the reserved area. I think we
>> need to finish separating kexec_jump from the other code paths.
>>
>> (1) add a new command line argument that specifies the kexec_jump
>> target area.
>>
>> (2) add a kjump flag to the flags parameter, used by kexec_load.
>> When
>> loading a jump kernel, it is loaded like a normal kernel, however,
>> additional control pages are allocated to (a) save the kexec_jump
>> target area (b) save the backed up region that is used by all kernels
>> like crash dump, and (c) space for invoking relocate_new_kernel that
>> will get its args from the execution entry point and will restore the
>> kernel then call resume and suspend.
>>
>> (3) replace jump_huf_pfn with two command line addresses that specify
>> the (a) return point for after resume, and (b) the return point for
>> after image save. Actually these can be done in userspace; the
>> second
>> restore kernel can just specify the null copy list and the entry
>> points
>> supplied by the suspended kernel. To do resume we also need (c) where
>> to store resume address for the save kernel.
>>
>>
>> As a first stage of suspend and resume, we can save to dedicated
>> partitions all memory (as supplied to crash_dump) that is not marked
>> nosave and not part of the save kernel's image.
>
> A little problem here: there are "nosave" areas that are not marked as
> nosave.
If crash_dump is going to work, the memory must exist.
>> The fancy block lists and memory lists can be added later.
>
> On the majority of systems that will work. On some of them it won't.
Ok .... well, my point is we can get started while we work out what the
list format is. If we decide to reuse the pfn lists above that may
come quickly.
>> Making these changes will allow us to use a normal kernel invoked
>> with
>> acpi=off apm=off mem=xxk as the save and restore kernel.
>>
>> If we want to keep the second kernel booted, then we need to add a
>> save
>> area for the booted jump target. Note that the save and restore
>> lists
>> to relocate_new_kernel can be computed once and saved. Longer term
>> we
>> could implement sys_kexec_load(UNLOAD) that would retrieve the saved
>> list back to application space to save to disk in a file. This means
>> you could save the booted save kernel, it just couldn't have any
>> shared
>> storage open.
>>
>> I'll try to expand on this in the jump v2 thread, but it may be 36+
>> hours before I do so.
>
> Well, I have no experience with kexec, so I really can't comment your
> kexec-related suggestions.
>
> Greetings,
> Rafael
Thanks,
milton
* Re: Re: Hibernation considerations
2007-07-20 16:08 ` Milton Miller
@ 2007-07-20 16:20 ` Alan Stern
2007-07-20 17:32 ` Milton Miller
2007-07-20 20:31 ` david
2007-07-20 21:02 ` Rafael J. Wysocki
1 sibling, 2 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 16:20 UTC (permalink / raw)
To: Milton Miller
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Milton Miller wrote:
> > We can't do this unless we have frozen tasks (this way, or another)
> > before
> > carrying out the entire operation.
>
> > What can't we do? We've already worked with the drivers to quiesce the
> hardware and put any information to resume the device in ram. Now we
> ask them to put their device in low power mode so we can go to sleep.
> Even if we schedule, the only thing userspace could touch is memory.
Userspace can submit I/O requests. Someone will have to audit every
driver to make sure that such I/O requests don't cause a quiesced
device to become active. If the device is active, it will make the
memory snapshot inconsistent with the on-device data.
Alan Stern
* Re: Re: Hibernation considerations
[not found] ` <200707201317.58025.rjw@sisk.pl>
@ 2007-07-20 16:56 ` Milton Miller
[not found] ` <f29402c6050f9c3ff5d83a59cea2de58@bga.com>
1 sibling, 0 replies; 81+ messages in thread
From: Milton Miller @ 2007-07-20 16:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
> On Friday, 20 July 2007 01:07, david@lang.hm wrote:
>> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
>>> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>>>> The currently identified problems under discussion include:
>>>> (1) how to interact with acpi to enter into S4.
>>>> (2) how to identify which memory needs to be saved
>>>> (3) how to communicate where to save the memory
>>>> (4) what state should devices be in when switching kernels
>>>> (5) the complicated setup required with the current patch
>>>> (6) what code restores the image
>>>
>>> (7) how to avoid corrupting filesystems mounted by the hibernated
>>> kernel
>>
>> I didn't realize this was a discussion item. I thought the options
>> were
>> clear, for some filesystem types you can mount them read-only, but for
>> ext3 (and possibly other less common ones) you just plain cannot touch
>> them.
>
> That's correct. And since you cannot touch ext3, you need either to
> assume
> that you won't touch filesystems at all, or to have a code to
> recognize the
> filesystem you're dealing with.
Or add a small bit of infrastructure that errors writes at make_request
if you don't have a magic "i am a direct block device write from
userspace" flag on the bio.
The hibernate may fail, but you don't corrupt the media.
If you don't get the image out, resume back through the "this is resume"
path instead of the power-down path.
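The write gate suggested above can be modelled in a few lines of Python. This is purely a sketch of the policy: the class, the exception, and the `direct_from_userspace` parameter are hypothetical stand-ins for a flag on the bio checked at make_request, not an existing block-layer interface.

```python
class WriteRefused(Exception):
    """Raised when a write arrives without the direct-write flag."""


class GatedDevice:
    """Toy block device that only accepts explicitly flagged writes."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks

    def submit_write(self, block, data, direct_from_userspace=False):
        # Anything that is not a flagged direct write (for example a
        # filesystem replaying its journal) is failed, not performed.
        if not direct_from_userspace:
            raise WriteRefused(f"unflagged write to block {block}")
        self.blocks[block] = data

    def read(self, block):
        return self.blocks[block]
```

With such a gate a stray write makes the image save fail, but the media keeps the contents the hibernated kernel expects.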
>>>>> (2) Upon start-up (by which I mean what happens after the user has
>>>>> pressed
>>>>> the power button or something like that):
>>>>> * check if the image is present (and valid) _without_ enabling
>>>>> ACPI
>>>>> (we don't
>>>>> do that now, but I see no reason for not doing it in the new
>>>>> framework)
>>>>> * if the image is present (and valid), load it
>>>>> * turn on ACPI (unless already turned on by the BIOS, that is)
>>>>> * execute the _BFS global control method
>>>>> * execute the _WAK global control method
>>>>> * continue
>>>>> Here, the first two things should be done by the image-loading
>>>>> kernel, but
>>>>> the remaining operations have to be carried out by the restored
>>>>> kernel.
>>>>
>>>> Here I agree.
>>>>
>>>> Here is my proposal. Instead of trying to both write the image and
>>>> suspend, I think this all becomes much simpler if we limit the scope
>>>> the work of the second kernel. Its purpose is to write the image.
>>>> After that its done. The platform can be powered off if we are
>>>> going
>>>> to S5. However, to support suspend to ram and suspend to disk, we
>>>> return to the first kernel.
>>>
>>> We can't do this unless we have frozen tasks (this way, or another)
>>> before
>>> carrying out the entire operation. In that case, however, the
>>> kexec-based
>>> approach would have only one advantage over the current one.
>>> Namely, it
>>> would allow us to create bigger images.
>>
>> we all agree that tasks cannot run during the suspend-to-ram state,
>> but
>> the disagreement is over what this means
>>
>> at one extreme it could mean that you would need the full freezer as
>> per
>> the current suspend projects.
>>
>> at the other extreme it could mean that all that's needed is to
>> invoke the
>> suspend-to-ram routine before anything else on the suspended kernel
>> on the
>> return from the save and restore kernel.
>>
>> we just need to figure out which it is (or if it's somewhere in
>> between).
>
> Well, I think that the "invoke the suspend-to-ram routine before
> anything else
> on the suspended kernel" thing won't be easy to implement in practice.
Why? You don't expect suspend-to-ram in drivers to be implemented? Do we
need more separation of quiescing drivers from powering down devices?
Note that we are just talking about "suspend devices and put their
state in ram", not actually invoking the platform to suspend to ram.
And I'm actually saying we free memory and maybe allocate disk blocks
for the save before we suspend (see below).
>>>> Message-ID: <200707151433.34625.rjw@sisk.pl>
>>>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>>>>> (1) Filesystems mounted before the hibernation are untouchable
>>>>
>>>> This is because some file systems do a fsck or other activity even
>>>> when
>>>> mounted read only. For the kexec case, however, this should be
>>>> "file
>>>> systems mounted by the hibernated system must not be written". As
>>>> has
>>>> been mentioned in the past, we should be able to use something like
>>>> dm
>>>> snapshot to allow fsck and the file system to see the cleaned copy
>>>> while not actually writing the media.
>>>
>>> We can't _require_ users to use the dm snapshot in order for the
>>> hibernation
>>> to work, sorry.
>>>
>>> And by _reading_ from a filesystem you generally update metadata.
>>
>> not if the filesystem is mounted read-only (except on ext3)
>
> Well, if the filesystem in question is a journaling one and the
> hibernated
> kernel has mounted this fs read-write, this seems to be tricky anyway.
Yes. I would argue writing to existing blocks of a file (not through
the filesystem, just getting their blocks from the file system) should
be safe, but it occurs to me that may not be the case if your fsck and
bmap move data blocks from some update log to the file system.
But we know the (maximum) image size. So we could allocate the blocks
in the first image before suspending the drivers and memory
allocations, and supply the list to the second kernel. We could
even write to the first block with a signature "suspend to here", or
even the whole block list to the beginning (it will have to be saved to
disk for restore anyways).
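A minimal Python model of the preallocated block list with a "suspend to here" signature might look like the following. Everything here is an assumption made for illustration: the signature value, the choice to use the first preallocated block as the anchor, and the dictionary standing in for the disk.

```python
SIGNATURE = b"SUSPEND-HERE"  # hypothetical marker value


def write_image(disk, blocks, chunks):
    """Save kernel side: write chunks into blocks preallocated by the
    first kernel. blocks[0] is the anchor holding the signature and
    the list of data blocks actually used."""
    anchor, data_blocks = blocks[0], blocks[1:]
    if len(chunks) > len(data_blocks):
        raise ValueError("image larger than the preallocated space")
    used = data_blocks[:len(chunks)]
    disk[anchor] = (SIGNATURE, tuple(used))
    for blk, chunk in zip(used, chunks):
        disk[blk] = chunk


def read_image(disk, anchor):
    """Restore kernel side: return the chunks in order, or None if the
    anchor block does not carry a valid signature."""
    entry = disk.get(anchor)
    if not isinstance(entry, tuple) or entry[0] != SIGNATURE:
        return None
    return [disk[blk] for blk in entry[1]]
```

The restore side only needs to know where the anchor is; the scattered data blocks are found from the list stored there.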
>>>> The kjump kernel must not have any knowledge retained if we reuse
>>>> it.
>>>>
>>>>> (2) Swap space in use before the hibernation must be handled with
>>>>> care
>>>>
>>>> Yes. Actually, even though they have been used by the write-in-the
>>>> kernel users, they will be among the most difficult devices to use
>>>> for
>>>> snapshots by a userspace second kernel.
If we use the "write to these blocks" then this is as easy as writing
to a file in a mounted filesystem.
>>>>> (4) The user should be able to limit the size of a hibernation
>>>>> image
>>>>
>>>> This means the suspending kernel must arrange to reduce its active
>>>> memory. The limited save can be done by providing a limited list in
>>>> (3).
>>>
>>> It seems to me that you don't understand the problem here.
>>>
>>> Assume you have 90% of RAM allocated before the hibernation and the
>>> user has
>>> requested the image to be not greater than 50% of RAM. In that case
>>> you have
>>> to free some memory _before_ identifying memory to save and you must
>>> not
>>> race with applications that attempt to allocate memory while you're
>>> doing it.
>>
>> I disagree a little bit.
>>
>> first off, only the suspending kernel can know what can be freed and
>> what
>> is needed to do so (remember this is kernel internals, it can change
>> from
>> patch to patch, let alone version to version)
>>
>> second, if you have a lot of memory to free, and you can't just throw
>> away
>> caches to do so, you don't know what is going to be involved in
>> freeing
>> the memory, it's very possible that it is going to involve userspace,
>> so
>> you can't freeze any significant portion of the system, so you can't
>> eliminate all chance of races
>>
>> what you can do is
>>
>> 1. try to free stuff
>> 2. stop the system and account for memory, is enough free
>> if not goto 1
>>
>> if userspace is dirtying memory fast enough, or is just using enough
>> memory that you can't meet your limit you just won't be able to
>> suspend.
>
> This means unreliable hibernation for some workloads. While I agree
> that
> shouldn't be a problem in a common case, there are users who will
> complain. ;-)
With my allocate memory as a task and don't save that task's memory
approach, we can get to this point while userspace is running. It
could be controlled by userspace, or even be userspace
(sys_do_not_save_me() waits for resume, and dies as the kernel
resumes).
>> but under any other conditions you will eventually get enough memory
>> free.
>>
>> so try several times and if you still fail tell the user they have too
>> much stuff running and they need to kill something.
>
> Well, with the freezer that's much simpler (and more reliable, I'd
> say): you
> freeze tasks and _then_ you shrink memory.
It means you are committed to suspend before you try to shrink memory.
What happens when the user requested a smaller image than the memory in
use?
>>>>> (8) Hibernation and restore should not be too slow
>>>>
>>>> We control the added code. We are using full runtime drivers and
>>>> will
>>>> run at hardware speeds.
>>>
>>> That may not be enough. If you're going to save, say, 80% of RAM on
>>> a 2 GB
>>> machine, then you'll have to be using image compression.
>>
>> this doesn't make sense, 20% of 2G is 400M, if you can't make a
>> kernel and
>> userspace that can run in 400M you have a serious problem.
>
> I was talking about the _speed_ of writing and reading.
Yes. As I said, adding compression as we copy the pages into the saving
kernel for writeout should be easy.
>> even if you wanted to save 99% of RAM on a 2G system, you have 20M of
>> ram
>> to play with, which should easily be enough.
>>
>> remember, linux runs on really small systems as well, and while you do
>> have to load some drivers for the big system, there are a lot of other
>> things that aren't needed.
>>
>>> All in all, we have three different and working implementation of the
>>> image-writing and image-reading code at our disposal. Why would you
>>> want to
>>> break the open doors?
>>
>> because you say that the current methods won't work without ACPI
>> support.
>
> I didn't say that. [Or if I did, please point me to this message.]
>
> Anyway, this wouldn't be true even if I did.
>
> What I've been trying to say from the very beginning is that the
> current
> frameworks _support_ hibernation a la ACPI S4 (although that's not
> exactly
> ACPI S4) and if we are going to introduce a new framework, then it
> should
> be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
>
> This DOESN'T mean that the non-ACPI hibernation should be unsupported
> and
> it DOESN'T mean that the non-ACPI hibernation is not supported
> currently.
> IT IS SUPPORTED.
>
As I said, I see kjump as a way to solve the "ok i am at a save point,
now how do I write this image to media without allowing any other io".
As you know by now, my solution for ACPI support is after the image is
written we go back to the kernel that started the suspend and it puts
the machine in S4.
If this works, we get down to 1 hibernate implementation in the kernel
:-).
milton
* Re: Re: Hibernation considerations
[not found] ` <f29402c6050f9c3ff5d83a59cea2de58@bga.com>
@ 2007-07-20 17:31 ` Jeremy Maitin-Shepard
2007-07-20 21:30 ` Rafael J. Wysocki
2007-07-20 19:26 ` david
2007-07-20 21:28 ` Rafael J. Wysocki
2 siblings, 1 reply; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 17:31 UTC (permalink / raw)
To: Milton Miller; +Cc: David Lang, LKML, Ying Huang, linux-pm
Milton Miller <miltonm@bga.com> writes:
[snip]
>>>> (7) how to avoid corrupting filesystems mounted by the hibernated kernel
>>>
>>> I didn't realize this was a discussion item. I thought the options were
>>> clear, for some filesystem types you can mount them read-only, but for
>>> ext3 (and possibly other less common ones) you just plain cannot touch
>>> them.
>>
>> That's correct. And since you cannot touch ext3, you need either to assume
>> that you won't touch filesystems at all, or to have a code to recognize the
>> filesystem you're dealing with.
> Or add a small bit of infrastructure that errors writes at make_request if you
> don't have a magic "i am a direct block device write from userspace" flag on the
> bio.
I still don't understand why there is this fixation on accessing dirty
filesystems in use by the hibernated system. Even if you avoid
corrupting the filesystem by avoiding writing to the block device, there
isn't any real guarantee about the state of the data, except for a
filesystem that specifically makes guarantees about such data (and I
don't believe any of the existing ones do).
It isn't necessary to be able to access such filesystems: everything can
be done from an initramfs/initrd.
[snip]
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
2007-07-20 16:20 ` Alan Stern
@ 2007-07-20 17:32 ` Milton Miller
2007-07-20 18:17 ` Alan Stern
2007-07-20 20:31 ` david
1 sibling, 1 reply; 81+ messages in thread
From: Milton Miller @ 2007-07-20 17:32 UTC (permalink / raw)
To: Alan Stern; +Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
> On Fri, 20 Jul 2007, Milton Miller wrote:
>>> We can't do this unless we have frozen tasks (this way, or another)
>>> before
>>> carrying out the entire operation.
>>
>> What can't we do? We've already worked with the drivers to quiesce
>> the
>> hardware and put any information to resume the device in ram. Now we
>> ask them to put their device in low power mode so we can go to sleep.
>> Even if we schedule, the only thing userspace could touch is memory.
>
> Userspace can submit I/O requests. Someone will have to audit every
> driver to make sure that such I/O requests don't cause a quiesced
> device to become active. If the device is active, it will make the
> memory snapshot inconsistent with the on-device data.
If a driver is waking a device between the time it was told by
hibernation "suspend all operations and save your device state to ram"
and "resume your device" then it is a buggy driver.
I argue the process can make the I/O request after we write to disk; we
just can't service it. If we are suspended it will go to the request
queue, and eventually the process will wait on the normal throttling
mechanisms until the driver is woken up.
It may mean the driver has to set a flag so that it knows an I/O
request arrived while it was suspended and that it needs to wake the
queue during its resume function.
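A toy model of that flag, written in Python rather than kernel C purely
as an illustration. Every name here (SketchDriver, submit, needs_wake)
is hypothetical and not an existing kernel interface; the only
behaviour it is meant to show is "park the request while suspended,
remember that one arrived, drain the queue on resume".

```python
from collections import deque


class SketchDriver:
    """Toy model of a driver that parks requests while it is suspended."""

    def __init__(self):
        self.suspended = False
        self.needs_wake = False   # set when a request arrives while suspended
        self.pending = deque()    # requests parked until resume
        self.completed = []       # requests that reached the "hardware"

    def suspend(self):
        self.suspended = True

    def submit(self, request):
        """Return True if serviced now, False if parked for later."""
        if self.suspended:
            # Do not touch the device; only remember the request.
            self.pending.append(request)
            self.needs_wake = True
            return False
        self.completed.append(request)
        return True

    def resume(self):
        self.suspended = False
        if self.needs_wake:
            # Something arrived while we slept: wake the queue and drain it.
            self.needs_wake = False
            while self.pending:
                self.completed.append(self.pending.popleft())
```

The sketch only covers paths that go through a queue; it says nothing
about requests that reach the device some other way.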
Actually, my point was more "what kernel services do the drivers need
to transition from quiesced to low power for acpi S4 or
suspend-to-ram"? We can't give them allocate-memory (but we give them
a call "we are going to suspend" when they can), but does "run this
tasklet" help? What timer facilities are needed?
Do we need to differentiate init (POR by BIOS) and resume from quiesced
(for reboot, kexec start/resume)? I hope not.
milton
* Re: Re: Hibernation considerations
2007-07-20 17:32 ` Milton Miller
@ 2007-07-20 18:17 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 18:17 UTC (permalink / raw)
To: Milton Miller
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Milton Miller wrote:
> On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
> > On Fri, 20 Jul 2007, Milton Miller wrote:
> >>> We can't do this unless we have frozen tasks (this way, or another)
> >>> before
> >>> carrying out the entire operation.
> >>
> >> What can't we do? We've already worked with the drivers to quiesce
> >> the
> >> hardware and put any information to resume the device in ram. Now we
> >> ask them to put their device in low power mode so we can go to sleep.
> >> Even if we schedule, the only thing userspace could touch is memory.
> >
> > Userspace can submit I/O requests. Someone will have to audit every
> > driver to make sure that such I/O requests don't cause a quiesced
> > device to become active. If the device is active, it will make the
> > memory snapshot inconsistent with the on-device data.
>
> If a driver is waking a device between the time it was told by
> hibernation "suspend all operations and save your device state to ram"
> and "resume your device" then it is a buggy driver.
That's exactly my point. As far as I know nobody has done a survey,
but I bet you'd find _many_ drivers are buggy either in this way or the
converse (forcing an I/O request to fail immediately instead of waiting
until the suspend is over when it could succeed). They have this bug
because they were written -- those which include any suspend/resume
support at all -- under the assumption that they could rely on the
freezer.
And that's why Rafael said "We can't do this unless we have frozen
tasks (this way, or another) before carrying out the entire operation."
Until the drivers are fixed -- which seems like a tremendous job --
none of this will work.
> I argue the process can make the io request after we write to disk, we
> just can't service it. If we are suspended it will go to the request
> queue, and eventually the process will wait for normal throttling
> mechanisms until the driver is woken up.
Many drivers don't have request queues. Even for the ones that do,
there are I/O pathways that bypass the queue (think of ioctl or sysfs).
> Actually, my point was more "what kernel services do the drivers need
> to transition from quiesced to low power for acpi S4 or
> suspend-to-ram"? We can't give them allocate-memory (but we give them
> a call "we are going to suspend" when they can), but does "run this
> tasklet" help? What timer facilities are needed?
Some drivers need the ability to schedule. Some will need the ability
to allocate memory (although GFP_ATOMIC is probably sufficient). Some
will need timers to run.
> Do we need to differentiate init (POR by BIOS) and resume from quiesced
> (for reboot, kexec start/resume)? I hope not.
Yes we do.
Alan Stern
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707201408060.2546-100000@iolanthe.rowland.org>
@ 2007-07-20 19:08 ` Milton Miller
2007-07-20 20:03 ` Oliver Neukum
1 sibling, 0 replies; 81+ messages in thread
From: Milton Miller @ 2007-07-20 19:08 UTC (permalink / raw)
To: Alan Stern; +Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Jul 20, 2007, at 1:17 PM, Alan Stern wrote:
> On Fri, 20 Jul 2007, Milton Miller wrote:
>
>> On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:
>>> On Fri, 20 Jul 2007, Milton Miller wrote:
>>>>> We can't do this unless we have frozen tasks (this way, or another)
>>>>> before
>>>>> carrying out the entire operation.
>>>>
>>>> What can't we do? We've already worked with the drivers to quiesce
>>>> the
>>>> hardware and put any information to resume the device in ram. Now
>>>> we
>>>> ask them to put their device in low power mode so we can go to
>>>> sleep.
>>>> Even if we schedule, the only thing userspace could touch is memory.
>>>
>>> Userspace can submit I/O requests. Someone will have to audit every
>>> driver to make sure that such I/O requests don't cause a quiesced
>>> device to become active. If the device is active, it will make the
>>> memory snapshot inconsistent with the on-device data.
>>
>> If a driver is waking a device between the time it was told by
>> hibernation "suspend all operations and save your device state to ram"
>> and "resume your device" then it is a buggy driver.
>
> That's exactly my point. As far as I know nobody has done a survey,
> but I bet you'd find _many_ drivers are buggy either in this way or the
> converse (forcing an I/O request to fail immediately instead of waiting
> until the suspend is over when it could succeed). They have this bug
> because they were written -- those which include any suspend/resume
> support at all -- under the assumption that they could rely on the
> freezer.
>
> And that's why Rafael said "We can't do this unless we have frozen
> tasks (this way, or another) before carrying out the entire operation."
> Until the drivers are fixed -- which seems like a tremendous job --
> none of this will work.
So this is in the way of removing the freezer ... but we are not
relying on doing any I/O other than: suspend device operation, save
state to RAM, then later put the device in low power mode for S3 and/or
S4, and finally restore and resume to running.
>> I argue the process can make the io request after we write to disk, we
>> just can't service it. If we are suspended it will go to the request
>> queue, and eventually the process will wait for normal throttling
>> mechanisms until the driver is woken up.
>
> Many drivers don't have request queues. Even for the ones that do,
> there are I/O pathways that bypass the queue (think of ioctl or sysfs).
So it's not a flag in make_request, fine.
>> Actually, my point was more "what kernel services do the drivers need
>> to transition from quiesced to low power for acpi S4 or
>> suspend-to-ram"? We can't give them allocate-memory (but we give them
>> a call "we are going to suspend" when they can), but does "run this
>> tasklet" help? What timer facilities are needed?
>
> Some drivers need the ability to schedule. Some will need the ability
> to allocate memory (although GFP_ATOMIC is probably sufficient). Some
> will need timers to run.
Can they allocate the memory in advance? (Call them when we know we
want to suspend, they make the allocations they will need; we later
call them again to release the allocations).
If you need timers, you probably want some scheduling?
>> Do we need to differentiate init (POR by BIOS) and resume from quiesced
>> (for reboot, kexec start/resume)? I hope not.
>
> Yes we do.
Can you elaborate? Note I was not asking about resume-from-low-power vs
init-from-POR. We still get that distinction.
How do these drivers work today when we kexec?
The reason I'm asking is that it's hard to tell the first kernel what
happened. We can say "we powered off, and we were restarted", but it
becomes much harder when each device may or may not have a driver in
the save kernel, if we have to differentiate for each device whether it
was initialized and later quiesced by the jump kernel during save or
never touched. And we need to tell the resume-from-hibernate code "I
touched it" / "no I didn't" and "we resumed from S4" / "no, it was from S5".
This is why I've been proposing that we don't create the suspend image
with devices in the low power state, but only in a quiesced state
similar to the initial state.
I'm proposing a sequence like:
(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save maximum image (we know how much RAM
is not in 1, so the max size)
(3) tell drivers we are going to suspend. Userspace is still running,
swapping still active, etc. Now is the time to allocate memory to save
device state.
(4) do what we want to slow down userspace making requests (i.e. run
the freezer today)
(5) call drivers while still scheduling with interrupts on ("save
opportunity"). From this point, any new request should be queued or
the process put on a wait queue.
(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operations to platform suspend on hibernate
(11) call drivers to resume, telling them if from suspend-to-ram or
suspend-to-disk, possibly in two stages (interrupts off no scheduling
and interrupts on scheduling allowed)
(12) unfreeze processes, kill the thread holding the extra memory
used to reserve
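To make the ordering constraints in that list concrete, here is a
rough Python model of the twelve steps. It is only a sketch: the phase
names, may_allocate() and new_requests_are_queued() are made up for
this note, and the two predicates encode one reading of the list
(drivers allocate no later than step 3; requests are queued from step 5
until the resume call in step 11), not settled requirements.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The twelve proposed steps, numbered as in the list above."""
    RESERVE_MEMORY = 1
    MAP_IMAGE_BLOCKS = 2
    NOTIFY_DRIVERS = 3
    SLOW_USERSPACE = 4
    SAVE_IRQS_ON = 5
    STOP_TIMERS_IRQS = 6
    SAVE_IRQS_OFF = 7
    JUMP_AND_WRITE_IMAGE = 8
    DRIVERS_LOW_POWER = 9
    PLATFORM_SLEEP = 10
    DRIVERS_RESUME = 11
    THAW_AND_RELEASE = 12


def may_allocate(phase):
    """Assumption: drivers allocate no later than the step 3 notification."""
    return phase <= Phase.NOTIFY_DRIVERS


def new_requests_are_queued(phase):
    """Assumption: requests are queued from step 5 until resume in step 11."""
    return Phase.SAVE_IRQS_ON <= phase < Phase.DRIVERS_RESUME


def run(callbacks):
    """Call callbacks[phase]() for each phase in order; return phases run."""
    ran = []
    for phase in Phase:
        callbacks.get(phase, lambda: None)()
        ran.append(phase)
    return ran
```

A failure path (unwinding from step N back to running) is deliberately
left out; it would need its own discussion.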
So I'm asking what needs to happen in 9. If we have to turn interrupts
on and schedule, that's ok. If the low power state is the initial
state then fine.
Note that in 11, we could further differentiate "from image restore in
S4" and "from image restore in S5", and "from failed image save", but
what needs to happen differently?
I'm guessing that the work that will take some time is separating the
go to low power from quiesce operations for snapshot, as it sounds like
this is done with one driver call today? Making this separation will
give us our driver audit :-), but only if we decide on the requirements
before the start.
milton
* Re: Re: Hibernation considerations
[not found] <f29402c6050f9c3ff5d83a59cea2de58%40bga.com>
@ 2007-07-20 19:09 ` Milton Miller
2007-07-20 20:23 ` Jeremy Maitin-Shepard
0 siblings, 1 reply; 81+ messages in thread
From: Milton Miller @ 2007-07-20 19:09 UTC (permalink / raw)
To: Jeremy Maitin-Shepard; +Cc: linux-pm
On Fri Jul 20 10:31:41 PDT 2007, Jeremy Maitin-Shepard wrote:
> Milton Miller <miltonm at bga.com> writes:
>>>>> (7) how to avoid corrupting filesystems mounted by the hibernated
>>>>> kernel
>>>>
>>>> I didn't realize this was a discussion item. I thought the options
>>>> were
>>>> clear, for some filesystem types you can mount them read-only, but
>>>> for
>>>> ext3 (and possibly other less common ones) you just plain cannot
>>>> touch
>>>> them.
>>>
>>> That's correct. And since you cannot touch ext3, you need either
>>> to assume
>>> that you won't touch filesystems at all, or to have a code to
>>> recognize the
>>> filesystem you're dealing with.
>>
>> Or add a small bit of infrastructure that errors writes at
>> make_request if you
>> don't have a magic "i am a direct block device write from userspace"
>> flag on the
>> bio.
>
> I still don't understand why there is this fixation on accessing dirty
> filesystems in use by the hibernated system. Even if you avoid
> corrupting the filesystem by avoiding writing to the block device,
> there
> isn't any real guarantee about the state of the data, except for a
> filesystem that specifically makes guarantees about such data (and I
> don't believe any of the existing ones do).
>
> It isn't necessary to be able to access such filesystems: everything
> can
> be done from an initramfs/initrd.
There is a requirement to hibernate without dedicating a partition to
save the data. Two reasons have been given: upgrading RAM in a machine
should not require repartitioning the hard drive, and the Intel Macs
can only have 4 partitions total and people want to dual boot.
Not having a separate partition means the userspace in the initramfs
needs to obtain a list of blocks upon which to write the data. My
first proposal was to do it all from userspace of the new kernel by
calling bmap on the filesystem mounted read only. My later proposal
is to allocate the blocks in the suspending kernel and pass the block
list to userspace.
There is still the question of how the restore kernel finds the
block list to restore from. It does help the first kernel invalidate
the image in the suspend-to-both resume-from-ram case.
milton
PS: I'm not subscribed to the list, and only saw your reply on the
archives ... which give me my message id as the message to reply to.
I'd appreciate being cc'd in the future. Thanks.
* Re: Re: Hibernation considerations
[not found] ` <f29402c6050f9c3ff5d83a59cea2de58@bga.com>
2007-07-20 17:31 ` Jeremy Maitin-Shepard
@ 2007-07-20 19:26 ` david
2007-07-20 21:28 ` Rafael J. Wysocki
2 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-20 19:26 UTC (permalink / raw)
To: Milton Miller; +Cc: LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Milton Miller wrote:
> On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
>> On Friday, 20 July 2007 01:07, david@lang.hm wrote:
>> > On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
>> > > On Thursday, 19 July 2007 17:46, Milton Miller wrote:
>> > > > The currently identified problems under discussion include:
>> > > > (1) how to interact with acpi to enter into S4.
>> > > > (2) how to identify which memory needs to be saved
>> > > > (3) how to communicate where to save the memory
>> > > > (4) what state should devices be in when switching kernels
>> > > > (5) the complicated setup required with the current patch
>> > > > (6) what code restores the image
>> > >
>> > > (7) how to avoid corrupting filesystems mounted by the hibernated
>> > > kernel
>> >
>> > I didn't realize this was a discussion item. I thought the options were
>> > clear, for some filesystem types you can mount them read-only, but for
>> > ext3 (and possibly other less common ones) you just plain cannot touch
>> > them.
>>
>> That's correct. And since you cannot touch ext3, you need either to
>> assume
>> that you won't touch filesystems at all, or to have a code to recognize
>> the
>> filesystem you're dealing with.
>
> Or add a small bit of infrastructure that errors writes at make_request if
> you don't have a magic "i am a direct block device write from userspace" flag
> on the bio.
the problem is that the filesystem code will replay the journal when you
mount the partition, even if you mount it read-only (I seem to remember
that you could avoid this if you put the entire block device into
read-only mode, but that doesn't help in this case)
> The hibernate may fail, but you don't corrupt the media.
>
> If you don't get the image out, resume back to the "this is resume" instead
> of the power-down path.
>
>> > > > > (2) Upon start-up (by which I mean what happens after the user has
>> > > > > pressed
>> > > > > the power button or something like that):
>> > > > > * check if the image is present (and valid) _without_ enabling
>> > > > > ACPI
>> > > > > (we don't
>> > > > > do that now, but I see no reason for not doing it in the new
>> > > > > framework)
>> > > > > * if the image is present (and valid), load it
>> > > > > * turn on ACPI (unless already turned on by the BIOS, that is)
>> > > > > * execute the _BFS global control method
>> > > > > * execute the _WAK global control method
>> > > > > * continue
>> > > > > Here, the first two things should be done by the image-loading
>> > > > > kernel, but
>> > > > > the remaining operations have to be carried out by the restored
>> > > > > kernel.
>> > > >
>> > > > Here I agree.
>> > > >
>> > > > Here is my proposal. Instead of trying to both write the image and
>> > > > suspend, I think this all becomes much simpler if we limit the scope
>> > > > the work of the second kernel. Its purpose is to write the image.
>> > > > After that its done. The platform can be powered off if we are
>> > > > going
>> > > > to S5. However, to support suspend to ram and suspend to disk, we
>> > > > return to the first kernel.
>> > >
>> > > We can't do this unless we have frozen tasks (this way, or another)
>> > > before
>> > > carrying out the entire operation. In that case, however, the
>> > > kexec-based
>> > > approach would have only one advantage over the current one. Namely,
>> > > it
>> > > would allow us to create bigger images.
>> >
>> > we all agree that tasks cannot run during the suspend-to-ram state, but
>> > the disagreement is over what this means
>> >
>> > at one extreme it could mean that you would need the full freezer as per
>> > the current suspend projects.
>> >
>> > at the other extreme it could mean that all that's needed is to invoke
>> > the
>> > suspend-to-ram routine before anything else on the suspended kernel on
>> > the
>> > return from the save and restore kernel.
>> >
>> > we just need to figure out which it is (or if it's somewhere in
>> > between).
>>
>> Well, I think that the "invoke the suspend-to-ram routine before anything
>> else
>> on the suspended kernel" thing won't be easy to implement in practice.
>
> Why? You don't expect suspend-to-ram in drivers to be implemented? We need
> more separation of the quiesce drivers from power-down devices?
>
> Note that we are just talking about "suspend devices and put their state in
> ram", not actually invoking the platform to suspend to ram.
I thought we were talking about actually invoking the suspend-to-ram
>> > > > Message-ID: <200707151433.34625.rjw@sisk.pl>
>> > > > On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
>> > > > > (1) Filesystems mounted before the hibernation are untouchable
>> > > >
>> > > > This is because some file systems do a fsck or other activity even
>> > > > when
>> > > > mounted read only. For the kexec case, however, this should be
>> > > > "file
>> > > > systems mounted by the hibernated system must not be written". As
>> > > > has
>> > > > been mentioned in the past, we should be able to use something like
>> > > > dm
>> > > > snapshot to allow fsck and the file system to see the cleaned copy
>> > > > while not actually writing the media.
>> > >
>> > > We can't _require_ users to use the dm snapshot in order for the
>> > > hibernation
>> > > to work, sorry.
>> > >
>> > > And by _reading_ from a filesystem you generally update metadata.
>> >
>> > not if the filesystem is mounted read-only (except on ext3)
>>
>> Well, if the filesystem in question is a journaling one and the hibernated
>> kernel has mounted this fs read-write, this seems to be tricky anyway.
>
> Yes. I would argue writing to existing blocks of a file (not through the
> filesystem, just getting their blocks from the file system) should be safe,
> but it occurs to me that may not be the case if your fsck and bmap move data
> blocks from some update log to the file system.
right, and the answer is that the filesystem blocks allocated for the
suspend image are not allowed to be accessed in any way from the main
system.
this is a good argument for saving the data somewhere else ;-)
> But we know the (maximum) image size. So we could allocate the blocks in
> the first image before suspending the drivers and memory allocations, and
> supplying the list to the second kernel. We could even write to the first
> block with a signature "suspend to here", or even the whole block list to the
> beginning (it will have to be saved to disk for restore anyways).
no, you want to make the blocks that are allocated for the suspend image
be like the blocks allocated to the journal, allocate them once and never
touch them again
you especially do not want to try and write something to them from the
main system just before suspending, you don't know enough about what it
takes to get the data to the media to be absolutely sure that it's there
when the save-and-restore kernel goes to look.
>> > > > The kjump kernel must not have any knowledge retained if we reuse
>> > > > it.
>> > > >
>> > > > > (2) Swap space in use before the hibernation must be handled with
>> > > > > care
>> > > >
>> > > > Yes. Actually, even though they have been used by the write-in-the
>> > > > kernel users, they will be among the most difficult devices to use
>> > > > for
>> > > > snapshots by a userspace second kernel.
>
> If we use the "write to these blocks" then this is as easy as writing to a
> file in a mounted filesystem.
and keep in mind that "write to these blocks" can be done in userspace, it
doesn't require the kernel to do this.
David Lang
* Re: Re: Hibernation considerations
[not found] <fe998950b9d5ad317d5c1f5ff4e21ac9@bga.com>
@ 2007-07-20 19:37 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 19:37 UTC (permalink / raw)
To: Milton Miller
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Milton Miller wrote:
> > That's exactly my point. As far as I know nobody has done a survey,
> > but I bet you'd find _many_ drivers are buggy either in this way or the
> > converse (forcing an I/O request to fail immediately instead of waiting
> > until the suspend is over when it could succeed). They have this bug
> > because they were written -- those which include any suspend/resume
> > support at all -- under the assumption that they could rely on the
> > freezer.
> >
> > And that's why Rafael said "We can't do this unless we have frozen
> > tasks (this way, or another) before carrying out the entire operation."
> > Until the drivers are fixed -- which seems like a tremendous job --
> > none of this will work.
>
> So this is in the way of removing the freezer ... but as we are not
> relying on doing any io other than suspend device operation, save state
> to ram, then later put device in low power mode for s3 and/or s4, and
> finally restore and resume to running.
We aren't relying on doing any other I/O... and we have to prevent any
other I/O from taking place. That's the hard part.
> > Some drivers need the ability to schedule. Some will need the ability
> > to allocate memory (although GFP_ATOMIC is probably sufficient). Some
> > will need timers to run.
>
> Can they allocate the memory in advance? (Call them when we know we
> want to suspend, they make the allocations they will need; we later
> call them again to release the allocations).
Some yes, some no. The ones that can't generally don't need very much.
> If you need timers, you probably want some scheduling?
Yes, scheduling was one of the items I listed above.
> >> Do we need to differentiate init (POR by BIOS) and resume from quiesced
> >> (for reboot, kexec start/resume)? I hope not.
> >
> > Yes we do.
>
> can you elaborate? Note I was not asking resume-from-low power vs
> init-from-por. We still get that distinction.
To be more precise, drivers need to know whether they are doing a
complete initialization, a resume from low-power, or a resume from
hibernate. Currently there's no way to distinguish the last two (they
both involve calling the resume() method), but that's going to change.
The first can be told apart because it involves probe() rather than
resume().
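As a rough illustration of those three cases (Python, not kernel code):
the from_hibernate argument below is hypothetical and stands in for
whatever mechanism ends up separating the two resume cases. Nothing
like it is passed to resume() today, which is exactly the gap described
above.

```python
from enum import Enum


class StartReason(Enum):
    PROBE = "complete initialization"
    RESUME_FROM_SUSPEND = "resume from low power"
    RESUME_FROM_HIBERNATE = "resume after image restore"


def start_reason(entry_point, from_hibernate=False):
    """Map an entry point name plus a (hypothetical) flag to a start reason."""
    if entry_point == "probe":
        # Complete initialization is already distinguishable by entry point.
        return StartReason.PROBE
    if entry_point == "resume":
        # Today both resume cases arrive here and look the same.
        if from_hibernate:
            return StartReason.RESUME_FROM_HIBERNATE
        return StartReason.RESUME_FROM_SUSPEND
    raise ValueError(f"unknown entry point: {entry_point!r}")
```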
> How do these drivers work today when we kexec?
We don't kexec during a resume from hibernation. When kexec does run,
drivers in the new kernel do a complete reinitialization.
> The reason I'm asking is its hard to tell the first kernel what
> happened. We can say "we powered off, and we were restarted", but it
> becomes much harder when each device may or may not have a driver in
> the save kernel if we have to differentiate for each device if it was
> initialized and later quiesced by the jump kernel during save or never
> touched. And we need to tell the resume from hibernate code "i touched
> it" "no i didn't" and "we resumed from s4" "no it was from s5".
You merely have to distinguish between suspend and hibernate.
> I'm guessing that the work that will take some time is separating the
> go to low power from quiesce operations for snapshot, as it sounds like
> this is done with one driver call today?
There's a single call with different arguments. How much do you know
about the way the Power Management core actually works now? Have you
read the files in Documentation/power?
> Making this separation will
> give us our driver audit :-), but only if we decide on the requirements
> before the start.
No it won't, although it will be a good start.
Alan Stern
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707201408060.2546-100000@iolanthe.rowland.org>
2007-07-20 19:08 ` Milton Miller
@ 2007-07-20 20:03 ` Oliver Neukum
1 sibling, 0 replies; 81+ messages in thread
From: Oliver Neukum @ 2007-07-20 20:03 UTC (permalink / raw)
To: Alan Stern
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
Am Freitag 20 Juli 2007 schrieb Alan Stern:
> Some drivers need the ability to schedule. Some will need the ability
> to allocate memory (although GFP_ATOMIC is probably sufficient). Some
> will need timers to run.
Some will have to request firmware. It can add up to some megabytes.
In addition, if we don't freeze, some drivers, eg. video drivers, can
do allocations in the megabyte range.
It seems to me that without the freezer we will end up with many drivers
needing a two step notification process. Furthermore there are requirements
on the order of shutting down system facilities, eg. device addition must
be stopped before drivers allocate firmware.
Regards
Oliver
* Re: Re: Hibernation considerations
[not found] <200707202203.27849.oliver@neukum.org>
@ 2007-07-20 20:12 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 20:12 UTC (permalink / raw)
To: Oliver Neukum
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Oliver Neukum wrote:
> Am Freitag 20 Juli 2007 schrieb Alan Stern:
> > Some drivers need the ability to schedule. Some will need the ability
> > to allocate memory (although GFP_ATOMIC is probably sufficient). Some
> > will need timers to run.
>
> Some will have to request firmware. It can add up to some megabytes.
> In addition, if we don't freeze, some drivers, eg. video drivers, can
> do allocations in the megabyte range.
>
> It seems to me that without the freezer we will end up with many drivers
> needing a two step notification process. Furthermore there are requirements
> on the order of shutting down system facilities, eg. device addition must
> be stopped before drivers allocate firmware.
These are really separate issues, since they refer to things that have
to happen well before the memory snapshot is captured.
We already have a pre-suspend notification available for drivers that
need to allocate large amounts of memory.
You are correct about the need to delay/stop device addition. I don't
know how this can be done in general; each code path calling
device_add() may have to be treated individually.
Alan Stern
* Re: Re: Hibernation considerations
2007-07-20 19:09 ` Milton Miller
@ 2007-07-20 20:23 ` Jeremy Maitin-Shepard
0 siblings, 0 replies; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 20:23 UTC (permalink / raw)
To: Milton Miller; +Cc: linux-pm
Milton Miller <miltonm@bga.com> writes:
[snip]
> There is a requirement to hibernate without dedicating a partition to save the
> data. Two reasons given have been upgrading ram in a machine should not require
> a repartition of the hard drive, and the intel macs can only have 4 partitions
> total and people want to dual boot.
> Not having a separate partition means the userspace in the initramfs needs to
> obtain a list of blocks upon which to write the data. My first proposal was to
> do it all from userspace of the new kernel by calling bmap on the filesystem
> mounted read only. My later proposal is to allocate the blocks in the
> suspending kernel and pass the block list to userspace.
I see. The later proposal is needed anyway in order to support
still-in-use swap partitions and swap files. It is unfortunate that
special support will be needed to tell the "save image" kernel about the
storage device in order to support in-use swap partitions and files
(both swap files and regular). It seems that the kexec approach will
need to support two different "storage methods": getting a block list
from the hibernated kernel, or having the "save image" kernel handle
itself (e.g. in order to support complicated storage methods like over
the network, FUSE, etc.). There is the additional complication for the
"hibernated kernel tells save image kernel where to save the image"
method that the relevant block device might not even have the same
major/minor number under the save-to-disk kernel due to dynamic device
numbering, etc. The same problem applies anyway to finding the block
device on resume (and is a problem that exists with the existing
hibernate implementations as well); in theory it can be identified by
attributes like filesystem label or UUID, or physical device path.
There doesn't seem to be an existing general solution to this problem.
Note: I have noticed that swap partitions can have a filesystem UUID,
but the e2fsprogs blkid program does not find it while the swap
partition is marked as a hibernate image.
> There is still the question of how does the restore kernel find the block list
> to restore from. It does help the first kernel invalidate the image in the
> suspend-to-both resume-from-ram case.
Suspend2/TuxOnIce currently supports writing to regular files as well as
swap files. To resume, it is necessary to specify the device and the
start block. The first block must contain sufficient information to
find all of the other blocks; presumably the same technique can be used
for the kexec approach.
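
For illustration, the "device plus start block, and the first block finds
everything else" idea can be sketched as follows. This is a minimal sketch
in Python for brevity; the header layout (payload length, block count, then
the block numbers) and the names write_image/read_image are assumptions made
for the example, not the Suspend2/TuxOnIce on-disk format.

```python
import struct

BLOCK_SIZE = 4096

def write_image(dev, start_block, data_blocks, payload):
    """Store payload in data_blocks and an index of them in start_block.

    dev is a dict {block number: bytes} standing in for a block device;
    data_blocks is the (possibly discontiguous) list of usable blocks.
    """
    chunks = [payload[i:i + BLOCK_SIZE]
              for i in range(0, len(payload), BLOCK_SIZE)]
    if len(chunks) > len(data_blocks):
        raise ValueError("not enough blocks for the image")
    used = data_blocks[:len(chunks)]
    # Hypothetical header: payload length, block count, block numbers.
    header = struct.pack("<QI", len(payload), len(used))
    header += struct.pack("<%dQ" % len(used), *used)
    if len(header) > BLOCK_SIZE:
        raise ValueError("index does not fit in the start block")
    dev[start_block] = header.ljust(BLOCK_SIZE, b"\0")
    for blk, chunk in zip(used, chunks):
        dev[blk] = chunk.ljust(BLOCK_SIZE, b"\0")

def read_image(dev, start_block):
    """Recover the payload knowing only the device and the start block."""
    header = dev[start_block]
    length, count = struct.unpack_from("<QI", header, 0)
    blocks = struct.unpack_from("<%dQ" % count, header, 12)
    data = b"".join(dev[b] for b in blocks)
    return data[:length]
```

A real format would chain further index blocks once the list outgrows one
block; the sketch simply refuses in that case.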
> PS: I'm not subscribed to the list, and only saw your reply on the archives
> ... which give me my message id as the message to reply to. I'd appreciate
> being cc'd in the future.
The message was sent directly to you, so you should have received it.
(Your e-mail address was in the To: header).
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
2007-07-20 16:20 ` Alan Stern
2007-07-20 17:32 ` Milton Miller
@ 2007-07-20 20:31 ` david
2007-07-20 21:24 ` Alan Stern
1 sibling, 1 reply; 81+ messages in thread
From: david @ 2007-07-20 20:31 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Alan Stern wrote:
> On Fri, 20 Jul 2007, Milton Miller wrote:
>
>>> We can't do this unless we have frozen tasks (this way, or another) before
>>> carrying out the entire operation.
>>
>> What can't we do? We've already worked with the drivers to quiesce the
>> hardware and put any information to resume the device in ram. Now we
>> ask them to put their device in low power mode so we can go to sleep.
>> Even if we schedule, the only thing userspace could touch is memory.
>
> Userspace can submit I/O requests. Someone will have to audit every
> driver to make sure that such I/O requests don't cause a quiesced
> device to become active. If the device is active, it will make the
> memory snapshot inconsistent with the on-device data.
assuming this is the suspend-to-ram after a kexec back from the
write-to-disk kernel, I don't think you are correct.
when doing a suspend-to-ram you get to a point where you just don't use
any userspace. from that point on you are just walking the device tree
putting things into low-power mode. This is the point where we are talking
about jumping to.
david Lang
* Re: Re: Hibernation considerations
2007-07-20 16:08 ` Milton Miller
2007-07-20 16:20 ` Alan Stern
@ 2007-07-20 21:02 ` Rafael J. Wysocki
2007-07-21 11:44 ` Miklos Szeredi
[not found] ` <E1ICDNw-0008HC-00@dorka.pomaz.szeredi.hu>
1 sibling, 2 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:02 UTC (permalink / raw)
To: Milton Miller
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Friday, 20 July 2007 18:08, Milton Miller wrote:
> On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:
> > On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> >> The currently identified problems under discussion include:
> >> (1) how to interact with acpi to enter into S4.
> >> (2) how to identify which memory needs to be saved
> >> (3) how to communicate where to save the memory
> >> (4) what state should devices be in when switching kernels
> >> (5) the complicated setup required with the current patch
> >> (6) what code restores the image
> >
> > (7) how to avoid corrupting filesystems mounted by the hibernated
> > kernel
> >
>
> Ok I talked on this too.
>
> >> I'll now start with quotes from several articles in this thread and my
> >> responses.
> >>
> >> Message-ID: <200707172217.01890.rjw@sisk.pl>
> >> On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:
> >>> (1) Upon entering the sleep state, which IMO can be done _after_ the image
> >>> has been saved:
> >>> * figure out which devices can wake up
> >>> * put devices into low power states (wake-up devices are placed in the Dx
> >>> states compatible with the wake capability, the others are powered off)
> >>> * execute the _PTS global control method
> >>> * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
> >>> * execute the _GTS global control method
> >>> * set the GPE enable registers corresponding to the wake-up devices
> >>> * make the platform enter S4 (there's a well defined procedure for that)
> >>> I think that this should be done by the image-saving kernel.
> >>
> >> Message-ID: <87odiag45q.fsf@jbms.ath.cx>
> >> On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
> >> expressed his agreement with this block but also confusion on the
> >> other
> >> blocks.
> >>
> >>
> >> I strongly disagree.
> >>
> >> (1) as has been pointed out, this requires the new kernel to
> >> understand
> >> all io devices in the first kernel.
> >> (2) it requires both kernels to talk to ACPI. This is doomed to
> >> failure. How can the second kernel initialize ACPI? The platform
> >> thinks it has already been initialized. Do we plan to always undo all
> >> acpi initialization?
> >
> > Good question. I don't know.
>
>
> >>> (2) Upon start-up (by which I mean what happens after the user has pressed
> >>> the power button or something like that):
> >>> * check if the image is present (and valid) _without_ enabling ACPI (we
> >>> don't do that now, but I see no reason for not doing it in the new
> >>> framework)
> >>> * if the image is present (and valid), load it
> >>> * turn on ACPI (unless already turned on by the BIOS, that is)
> >>> * execute the _BFS global control method
> >>> * execute the _WAK global control method
> >>> * continue
> >>> Here, the first two things should be done by the image-loading kernel, but
> >>> the remaining operations have to be carried out by the restored kernel.
> >>
> >> Here I agree.
> >>
> >> Here is my proposal. Instead of trying to both write the image and
> >> suspend, I think this all becomes much simpler if we limit the scope of
> >> the work of the second kernel. Its purpose is to write the image.
> >> After that it's done. The platform can be powered off if we are going
> >> to S5. However, to support suspend to ram and suspend to disk, we
> >> return to the first kernel.
> >
> > We can't do this unless we have frozen tasks (this way, or another) before
> > carrying out the entire operation.
>
> What can't we do? We've already worked with the drivers to quiesce the
> hardware and put any information to resume the device in ram. Now we
> ask them to put their device in low power mode so we can go to sleep.
For that to work, we have to require the image-saving kernel to leave devices
in the same state they were in when it got control, or in a state compatible
with it.
> Even if we schedule, the only thing userspace could touch is memory.
> If we resume, they just run those computations again.
>
> > In that case, however, the kexec-based
> > approach would have only one advantage over the current one. Namely,
> > it
> > would allow us to create bigger images.
>
> The advantage is we don't have to come up with a way to teach drivers
> "wake up to run these requests, but no other requests". We don't have
> to figure out what we need to resume to allow them to process a
> request.
I'm not sure what you mean here. Please explain.
> >> This means that the first kernel will need to know why it got resumed.
> >> Was the system powered off, and this is the resume from the user? Or
> >> was it restarted because the image has been saved, and it's now time to
> >> actually suspend until woken up? If you look at it, this is the same
> >> interface we have with the magic arch_suspend hook -- did we just
> >> suspend and it's time to write the image, or did we just resume and it's
> >> time to wake everything up?
> >>
> >> I think this can be easily solved by giving the image saving kernel two
> >> resume points: one for "the image has been written", and one for "we
> >> rebooted and have restored the image". I'm not familiar with ACPI.
> >> Perhaps we need a third to differentiate "we read the image from S4"
> >> from "we read it from S5", but that information must be available to the
> >> OS because it needs that to know if it should resume from hibernate.
> >>
> >> By making the split at image save and restore we have several
> >> advantages:
> >>
> >> (1) the kernel always initializes with devices in the init or quiesced
> >> but active state.
> >>
> >> (2) the kernel always resumes with devices in the init or quiesced but
> >> active state.
> >>
> >> (3) the kjump save and restore kernel does not need to know how to
> >> suspend all devices in the platform.
> >>
> >> (4) we have a merged path for suspend to disk, suspend to ram, and
> >> suspend to both.
> >>
> >> (5) because of (4), we can implement sleep policies where we save the
> >> image to disk but try to stay in ram based on expected remaining
> >> battery life.
> >>
> >> (6) we confine all platform (acpi) interaction to the main kernel
> >>
> >> (7) we limit the knowledge needed in the second kernel. It needs to
> >> know how to do its job and then put the hardware back how it found it.
> >> Nothing more.
> >
> > This would have been nice if we had been able to do it.
>
> I don't understand this comment. "if we had been able"? I don't
> think we have tried yet.
That's related to the discussion above. If we are unable to do (3) and (6)
without freezing tasks (and I'm not sure that we can), the entire scheme
won't be viable.
Well, we might be able to do it provided that drivers will block the tasks
on I/O effectively, but I see a big 'if' here ...
> >> For the suspend to ram and then woken up case, we simply need to
> >> invalidate the image before restarting normal kernel operation.
> >>
> >> People have worried about how to boot and restore the kernel, and what
> >> to do if reading the image fails. They worry about needing memory
> >> hotplug or delayed acpi parsing. They are forgetting one thing. This
> >> kernel has support for kexec.
> >>
> >> This is all easily solved by having the bootloader from the bios
> >> always
> >> boot the restore kernel.
> >
> > Well, I think this is not generally acceptable, although I agree that
> > it would
> > be simpler.
>
> For those that don't find it acceptable, they can teach their bootloader
> when they may have an image to resume.
Yes, and I think we need to seriously consider this possibility.
> >> It will boot with limited useable memory and
> >> no acpi support. If the restore kernel userspace detects that there
> >> is
> >> no restore image, it simply loads the normal main kernel and initrd /
> >> initramfs and calls the normal kexec. The cost is the time to init
> >> the
> >> restore kernel, read the kernel with full drivers (vs reading it from
> >> the bootloader). If you want a boot menu, use kboot (on sourceforge).
> >
> > Well, I'm afraid of adding more and more infrastructure to the mix.
>
> Requiring the hibernated kernel to be able to start from kexec should
> not be bad. If you were referring to adding kboot, that is just an
> option.
Yes, I was.
> One can still use bootloader menus to select alternate kernels.
> However, as you said, you want to boot differently for resume (no acpi
> until after image loaded) from full boot.
That's correct and I think some kind of cooperation with the bootloader is
needed for that.
> >> On Jul 17, 2007, at 2:13 PM, Rafael J. Wysocki wrote:
> >>> On Tuesday, 17 July 2007 22:27, david@lang.hm wrote:
> >>>> On Tue, 17 Jul 2007, Alan Stern wrote:
> >>>>> But what about the freezer? The original reason for using kexec
> >>>>> was
> >>>>> to
> >>>>> avoid the need for the freezer. With no freezer, while the
> >>>>> original
> >>>>> kernel is busy powering down its devices, user tasks will be free
> >>>>> to
> >>>>> carry out I/O -- which will make the memory snapshot inconsistent
> >>>>> with
> >>>>> the on-disk data structures.
> >>>>
> >>>> no, user tasks just don't get scheduled during shutdown.
> >>>>
> >>>> the big problem with the freezer isn't stopping anything from happening,
> >>>> it's _selectively_ stopping things.
> >>
> >> Agreed. Or rather, selectively not stopping and resuming things.
> >
> > I don't quite understand this statement. Can you please elaborate?
>
> Feel free to list other problems with the freezer, but I'm saying that
> the problems are stemming from trying to freeze most of userspace and
> some selection of kernel threads so that new requests to the outside
> are not made, but then turning around and saying "ok now do some io,
> but only what this thread of execution originates".
We're _not_ doing anything like this.
> It's "originates", not "generates", so we are trying to teach the whole stack
> these limits, including going back to userspace for FUSE.
Again, I don't understand what you're talking about. This is not how things
work right now, that's for sure. :-)
The problem with FUSE is related to the fact that the freezer can't freeze
uninterruptible tasks and we said that perhaps we might avoid it if FUSE
was made freezing-aware. Still, no one has gone in this direction and I don't
know of any plans to do that.
Please, stop trying to blame the freezer for all evil.
Also, it's better if you know how the things that you want to improve really
work.
> >>> It's selectively stopping kernel threads, which is just about right.
> >>> If you think that _this_ is a main problem with the freezer, then think
> >>> again.
> >>>
> >>>> with kexec you don't need to let any portion of the original kernel
> >>>> or userspace operate so you don't have a problem.
> >>>
> >>> In fact, the main problem with the freezer is that it is a coarse-grained
> >>> solution. Therefore, what I believe we should do is to evolve in the
> >>> direction of more fine-grained solutions and gradually phase out the
> >>> freezer.
> >>>
> >>> The kexec-based approach is an attempt to replace one coarse-grained
> >>> solution (the freezer) with an even more coarse-grained solution (stopping
> >>> the entire kernel with everything), which IMO doesn't address the main
> >>> problem.
> >>>
> >>
> >> I think this addresses the problem. It's probably a bit harder than
> >> powermac because we have to fully quiesce devices; we can't cheat by
> >> leaving interrupts off. But once the drivers save the state of their
> >> devices and stop their queues, it should be easy to audit the paths to
> >> powerdown devices and call the platform suspend and ram wakeup paths.
>
> In other words, I'm replacing a coarse-grained solution with an
> absolute solution. "From this point on you can only write to ram."
Which means that we need to take care of the drivers _before_ doing anything
else.
I agree with that, of course. :-)
> >> Going back to the requirements document that started this thread:
> >>
> >> Message-ID: <200707151433.34625.rjw@sisk.pl>
> >> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> >>> (1) Filesystems mounted before the hibernation are untouchable
> >>
> >> This is because some file systems do a fsck or other activity even
> >> when
> >> mounted read only. For the kexec case, however, this should be "file
> >> systems mounted by the hibernated system must not be written". As
> >> has
> >> been mentioned in the past, we should be able to use something like dm
> >> snapshot to allow fsck and the file system to see the cleaned copy
> >> while not actually writing the media.
> >
> > We can't _require_ users to use the dm snapshot in order for the
> > hibernation
> > to work, sorry.
>
> I actually listed three ways to start. Not all of them required
> dm-snapshot. I was proposing "if you need to read ext3, then use
> dm-snapshot".
I don't think we should differentiate filesystems this way. We should just do
the same thing with all of them.
> > And by _reading_ from a filesystem you generally update metadata.
>
> not on ones mounted read-only. I'll reply more later in the thread.
OK
> >> The kjump kernel must not have any knowledge retained if we reuse it.
> >>
> >>> (2) Swap space in use before the hibernation must be handled with
> >>> care
> >>
> >> Yes. Actually, even though they have been used by the write-in-the-kernel
> >> users, they will be among the most difficult devices to use for
> >> snapshots by a userspace second kernel.
> >>
> >>> (3) There are memory regions that must not be saved or restored
> >>
> >> because they may not exist. This means that we must identify the
> >> memory to be saved and restored in a format to be passed between the
> >> kernels.
> >>
> >>> (4) The user should be able to limit the size of a hibernation image
> >>
> >> This means the suspending kernel must arrange to reduce its active
> >> memory. The limited save can be done by providing a limited list in
> >> (3).
> >
> > It seems to me that you don't understand the problem here.
> >
> > Assume you have 90% of RAM allocated before the hibernation and the
> > user has
> > requested the image to be not greater than 50% of RAM. In that case
> > you have
> > to free some memory _before_ identifying memory to save and you must
> > not
> > race with applications that attempt to allocate memory while you're
> > doing it.
>
> Hmm... I didn't say how to reduce the memory or identify it, did I?
>
> Ok fine. I'll allocate a bunch of memory and put it on a list.
And you cause the OOM killer to show up. Not good.
> Normal memory pressure will swap things out or drop filesystem pages.
> When I build the list of memory to backup, I filter out this list.
> After resume, I'll free it back.
>
> We can arrange for this "task" to be preferred by the oom killer, in
> case the user is trying to suspend into less memory than can be
> freed.
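
To make the numbers in the 90%/50% example concrete, here is a small sketch
of the "allocate pages, keep them on a list, filter that list out of the save
set" scheme described above. It is a toy model only: pages_to_free and
build_save_list are hypothetical names, and nothing here addresses the race
with allocating applications or the OOM-killer concern.

```python
def pages_to_free(used_pages, image_limit_pages):
    """Pages that must be released before the save list is built."""
    return max(0, used_pages - image_limit_pages)

def build_save_list(in_use_pfns, balloon_pfns):
    """Drop the pages we allocated ourselves from the set to be saved."""
    balloon = set(balloon_pfns)
    return [pfn for pfn in in_use_pfns if pfn not in balloon]

# Hypothetical machine size, in pages; only the ratios come from the thread.
total = 1000000
used = total * 90 // 100      # 90% of RAM allocated before hibernation
limit = total * 50 // 100     # image must not exceed 50% of RAM
need = pages_to_free(used, limit)   # 40% of RAM has to go first
```

With these figures, 400000 of the 1000000 pages have to be reclaimed before
the list of pages to save can be fixed.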
>
> >>> (5) Hibernation should be transparent from the applications' point of
> >>> view
> >>
> >> People have pointed out they may want userspace to be aware of the
> >> suspend. I believe this can be done with /proc/apm emulation today
> >> or by other means; it seems that should be hooked up to dbus in some
> >> fashion.
> >
> > Not a solution, because there still will be programs not needing to
> > know
> > anything about hibernation. After all, we don't require all
> > applications to
> > know anything about SMP, even if they are executed on an SMP system.
>
> How do any of those methods require userspace to know anything about
> hibernation? I was talking about a general framework consistent with
> today's kernel to user communication for those parts of userspace that
> *want* to know about suspend and hibernation.
OK, I didn't understand, then.
> >>> (6) State of devices from before hibernation should be restored, if
> >>> possible
> >>
> >> related to suspend should be transparent ... yes.
> >>
> >>> (7) On ACPI systems special platform-related actions have to be
> >>> carried out at
> >>> the right points, so that the platform works correctly after the
> >>> restore
> >>
> >> I believe I have explained my suggestion.
> >>
> >>> (8) Hibernation and restore should not be too slow
> >>
> >> We control the added code. We are using full runtime drivers and
> >> will
> >> run at hardware speeds.
> >
> > That may not be enough. If you're going to save, say, 80% of RAM on a
> > 2 GB
> > machine, then you'll have to be using image compression.
>
> Yea, so? We have a full kernel and userspace, adding compression
> before writing should be easy. There is no struct page for memory in the
> old kernel, so we likely need to be copying them in userspace anyways.
> Adding compression should be easy.
Yes, it's not that difficult.
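
As a rough illustration of compressing in userspace before writing, assuming
the image is handed over as (PFN, page) pairs: the sketch below uses zlib
from the Python standard library; compress_pages and decompress_pages are
hypothetical helpers, not part of any existing hibernation tool.

```python
import zlib

PAGE_SIZE = 4096

def compress_pages(pages, level=6):
    """Yield (pfn, compressed bytes) for each (pfn, page) pair."""
    for pfn, page in pages:
        if len(page) != PAGE_SIZE:
            raise ValueError("page %d has the wrong size" % pfn)
        yield pfn, zlib.compress(page, level)

def decompress_pages(records):
    """Inverse of compress_pages; checks that each page has full size."""
    for pfn, blob in records:
        page = zlib.decompress(blob)
        if len(page) != PAGE_SIZE:
            raise ValueError("page %d decompressed to the wrong size" % pfn)
        yield pfn, page
```

Compressing page by page keeps each record independently restorable, at the
cost of a worse ratio than compressing the stream as a whole.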
> >>> (9) Hibernation framework should not be too difficult to set up
> >>
> >> Ok the current patch is presently too difficult. But I think it will
> >> be much simpler with a few small changes.
> >>
> >> As noted in the thread
> >>
> >> Message-ID: <873azxwqhr.fsf@jbms.ath.cx>
> >> Subject: [linux-pm] Re: hibernation/snapshot design
> >> on Mon Jul 9 08:23:53 2007, Jeremy Maitin-Shepard wrote:
> >>>> Both would work. One would eat 8-64MB of your RAM, permanently;
> >>>
> >>> As I have stated in other messages, the kdump approach would not
> >>> waste
> >>> any RAM permanently.
> >> ...
> >>> Immediately before jumping to the new kernel, the first X bytes
> >>> (where
> >>> X
> >>> is the amount of memory the new kernel will get, typically 16MB or
> >>> 64MB)
> >>> of physical memory are backed up into the arbitrary discontiguous
> >>> pages
> >>> that are made available. This will not take very long, because
> >>> copying
> >>> even 64MB of memory is extremely fast. Then the new kernel is free
> >>> to
> >>> use the first X bytes of contiguous physical memory. Problem solved.
> >>
> >>
> >> Ok, now let's look at my list again:
> >>
> >>> (1) how to interact with acpi to enter into S4.
> >>
> >> This was discussed.
> >>
> >>> (2) how to identify which memory needs to be saved
> >>
> >> We need to generate a list. We need it to fit in a computable size so
> >> that we can free and allocate the pages before suspending IO in the
> >> first kernel.
> >>
> >> One possibility is to use something like the kexec copy list. If we
> >> are imaging a small fraction of ram this is appropriate, but if we are
> >> doing dense saves we need something extent based. We should be able
> >> to
> >> extend the list.
> >>
> >>> (3) how to communicate where to save the memory
> >>
> >> This is an interesting topic. The suspended kernel has most IO and disk
> >> space. It also knows how much space is to be occupied by the kernel.
> >> So communicating a block map to the second kernel would be the obvious
> >> choice. But the second kernel must be able to find the image to
> >> restore it, and it must have drivers for the media. Also, this is not
> >> feasible for storing to nfs.
> >>
> >> I think we will end up with several methods.
> >>
> >> One would be supply a list of blocks, and implement a file system that
> >> reads the file by reading the scatter list from media. The restore
> >> kernel then only needs to read an anchor, and can build upon that
> >> until
> >> the image is read into memory. Or do this in userspace.
> >>
> >> I don't know how this compares to the current restore path. I wasn't
> >> able to identify the code that creates the on disk structure in my 10
> >> minute perusal of kernel/power/.
> >
> > The structure is created at two levels.
> >
> > First, the code in snapshot.c makes the image available to the code in
> > swap.c
> > as a stream of pages. The first page is the header, followed by some
> > pages
> > containing the PFNs of the page frames to which the image data pages
> > are to be
> > restored, followed by the image data pages themselves (the ordering of
> > the PFNs
> > must be the same as the ordering of data pages that correspond to
> > them).
> > Still, the low-level image format only needs to be known by the
> > restore code in
> > snapshot.c .
>
> Ok sounds like this code could be reused. I'll look into it.
>
> > Second, the code in swap.c writes the image pages to a storage adding
> > some
> > metadata making it possible to reproduce their original ordering
> > during the
> > restore.
>
> So you are allocating the blocks as you go ... and adding meta data
> along the way?
Something like this. I can't say how Nigel does it, though.
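
The stream layout described above (a header page, then pages of PFNs, then
the data pages in the same order as the PFNs) can be sketched like this.
The header contents and the 8-byte little-endian PFN encoding are assumptions
for the example; the real format lives in snapshot.c and is not reproduced
here.

```python
import struct

PAGE_SIZE = 4096
PFNS_PER_PAGE = PAGE_SIZE // 8   # assumed: one 8-byte PFN per slot

def pack_stream(pages):
    """pages: list of (pfn, PAGE_SIZE bytes). Returns the stream pages."""
    pfns = [pfn for pfn, _ in pages]
    # Hypothetical header: just the number of data pages.
    stream = [struct.pack("<Q", len(pfns)).ljust(PAGE_SIZE, b"\0")]
    for i in range(0, len(pfns), PFNS_PER_PAGE):
        group = pfns[i:i + PFNS_PER_PAGE]
        packed = struct.pack("<%dQ" % len(group), *group)
        stream.append(packed.ljust(PAGE_SIZE, b"\0"))
    # Data pages follow in exactly the order of the PFNs above.
    stream.extend(data for _, data in pages)
    return stream

def unpack_stream(stream):
    """Rebuild the (pfn, data) pairs from a stream made by pack_stream."""
    (count,) = struct.unpack_from("<Q", stream[0], 0)
    n_pfn_pages = (count + PFNS_PER_PAGE - 1) // PFNS_PER_PAGE
    pfns = []
    for page in stream[1:1 + n_pfn_pages]:
        take = min(PFNS_PER_PAGE, count - len(pfns))
        pfns.extend(struct.unpack_from("<%dQ" % take, page, 0))
    return list(zip(pfns, stream[1 + n_pfn_pages:]))
```

The storage layer then only has to preserve the order of the stream pages,
which matches the split between the two levels described above.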
> > The fact that we use swap spaces as the storage is related to
> > implementation
> > simplicity rather than anything else.
>
> Ok ... this only supports uncompressed hibernation?
The in-kernel version doesn't support compression (again, this was a choice
made to keep the code relatively simple), but the userland version supports
compression (and image encryption).
> The first kernel is going to specify (1) what to backup. It can
> specify (2) where to backup, although we have to be careful to identify
> the device in a persistent way.
Yes, that seems doable.
> >> A second method will be to supply a device and file that will be
> >> mounted by the save kernel, then unmounted and restored. This would
> >> require a partition that is not mounted or open by the suspended
> >> kernel
> >> (or use nfs or a similar protocol that is designed for multiple client
> >> concurrent access).
> >>
> >> A third method would be to allocate a file with the first kernel, and
> >> make sure the blocks are flushed to disk. The save and restore
> >> kernels
> >> map the file system using a snapshot device. Writing would map the
> >> blocks and use the block offset to write to the real device using the
> >> method from the first option; reading could be done directly from the
> >> snapshot device.
> >>
> >> The first and third option are dead on log based file systems (where
> >> the data is stored in the log).
> >
> > All in all, we have three different and working implementations of the
> > image-writing and image-reading code at our disposal. Why would you want
> > to break the open doors?
>
> The problem I'm saying kexec solves is how to get the data to the
> device while most of the kernel is trying not to do anything permanent.
>
> If we can reuse existing code, great.
I think we can.
> >>> (4) what state should devices be in when switching kernels
> >>
> >> My proposal is either initialized and untouched or quiesced.
> >
> > This is reasonable, but in general we also need to save some
> > information
> > about the pre-hibernation state of devices, so that we can put them
> > into the
> > same state, if reasonably possible, during the restore.
>
> What state are you referring to?
>
> Yes, there is state that the drivers have to store to ram, but this is the
> same state they need to store when suspending to ram if the device can
> be powered off.
Yes.
> Maybe we need to teach drivers to store more state, like remember that
> a hard drive was spun down.
I'm not sure about that.
> So we may need a flag saying "we powered off", "we resumed from
> suspend".
There already is something like this.
Generally, we're going to have a special callback that will be used by the core
after the restore, so the driver will always know what it's supposed to do.
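
A toy model of "the kernel (and each driver) is told why it is being
resumed", combining the two resume points proposed earlier with a per-device
callback. All names here (ResumeReason, on_resume, the step labels) are
hypothetical and only illustrate the dispatch, not any existing kernel
interface.

```python
from enum import Enum

class ResumeReason(Enum):
    IMAGE_WRITTEN = 1    # save kernel finished; now really go to sleep
    IMAGE_RESTORED = 2   # booted, image loaded; wake everything up

def on_resume(reason, devices):
    """Return the ordered steps the resumed kernel would take."""
    steps = []
    if reason is ResumeReason.IMAGE_WRITTEN:
        for dev in devices:
            steps.append(("low_power", dev))
        steps.append(("enter_sleep_state", None))
    elif reason is ResumeReason.IMAGE_RESTORED:
        for dev in devices:
            steps.append(("restore_state", dev))
        steps.append(("resume_normal_operation", None))
    else:
        raise ValueError("unknown resume reason: %r" % (reason,))
    return steps
```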
> >>> (5) the complicated setup required with the current patch
> >>
> >> I think a few simple changes to kjump will make this much simpler.
> >> See
> >> below.
> >>
> >>> (6) what code restores the image
> >>
> >> The save kernel, loaded at boot. People have suggested booting the
> >> first kernel, and using current restore code. However, I think that
> >> ignores that (1) we saved from a different kernel, so the backed up
> >> region will be restored to its backed up random pages,
> >
> > This problem has already been solved.
> >
> >> (2) the code was written to restore the same kernel,
> >
> > Not exactly. In fact, the current implementation only relies on the
> > tiny
> > portion of the restore code being in the same place in both kernels,
> > but
> > we can change the code not to make this assumption (it'll be more
> > complicated,
> > but that's perfectly doable).
>
> If the save kernel is different from the run kernel (to make it
> smaller), it's likely the image saving code will move. I view restoring
> from a different kernel than saving as an advanced feature.
>
> Let's get resuming from the save kernel working first.
>
> >> so the text and data will be replaced by identical text. It's much simpler
> >> conceptually to use the same kernel to save and restore the image.
> >
> > Here I agree. :-)
> >
> >> Simplifying kjump: the proposal for v3.
> >>
> >> The current code is trying to use crash dump area as a safe, reserved
> >> area to run the second kernel. However, that means that the kernel
> >> has to be linked specially to run in the reserved area. I think we
> >> need to finish separating kexec_jump from the other code paths.
> >>
> >> (1) add a new command line argument that specifies the kexec_jump
> >> target area.
> >>
> >> (2) add a kjump flag to the flags parameter, used by kexec_load.
> >> When
> >> loading a jump kernel, it is loaded like a normal kernel, however,
> >> additional control pages are allocated to (a) save the kexec_jump
> >> target area (b) save the backed up region that is used by all kernels
> >> like crash dump, and (c) space for invoking relocate_new_kernel that
> >> will get its args from the execution entry point and will restore the
> >> kernel then call resume and suspend.
> >>
> >> (3) replace jump_huf_pfn with two command line addresses that specify
> >> the (a) return point for after resume, and (b) the return point for
> >> after image save. Actually these can be done in userspace; the
> >> second
> >> restore kernel can just specify the null copy list and the entry
> >> points
> >> supplied by the suspended kernel. To do resume we also need (c) where
> >> to store resume address for the save kernel.
> >>
> >>
> >> As a first stage of suspend and resume, we can save to dedicated
> >> partitions all memory (as supplied to crash_dump) that is not marked
> >> nosave and not part of the save kernel's image.
> >
> > A little problem here: there are "nosave" areas that are not marked as
> > nosave.
>
> If crash_dump is going to work, the memory must exist.
>
> >> The fancy block lists and memory lists can be added later.
> >
> > On the majority of systems that will work. On some of them it won't.
>
> Ok .... well, my point is we can get started while we work out what the
> list format is. If we decide to reuse the pfn lists above that may
> come quickly.
I think it's generally reasonable (a) not to save the entire memory (like free
RAM areas etc.) and (b) to include information about the original location of
each data page in the image, one way or another.
This doesn't complicate things all that much.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
2007-07-20 20:31 ` david
@ 2007-07-20 21:24 ` Alan Stern
2007-07-20 21:34 ` david
` (2 more replies)
0 siblings, 3 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 21:24 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007 david@lang.hm wrote:
> > Userspace can submit I/O requests. Someone will have to audit every
> > driver to make sure that such I/O requests don't cause a quiesced
> > device to become active. If the device is active, it will make the
> > memory snapshot inconsistent with the on-device data.
>
> assuming this is the suspend-from-ram after a kexec back from the
> write-to-disk kernel I don't think you are correct.
>
> when doing a suspend-to-ram you get to a point where you just don't use
> any userspace.
What do you mean? How can you prevent user tasks from running? That's
basically what the freezer does, and the whole point of this approach
is to eliminate the freezer. Right?
> from that point on you are just walking the device tree
> putting things into low-power mode. This is the point where we are talking
> about jumping to.
Yes. And putting things into low-power mode requires the ability to
run the scheduler, which means that user tasks can be scheduled, which
means that they can run.
Alan Stern
* Re: Re: Hibernation considerations
[not found] ` <f29402c6050f9c3ff5d83a59cea2de58@bga.com>
2007-07-20 17:31 ` Jeremy Maitin-Shepard
2007-07-20 19:26 ` david
@ 2007-07-20 21:28 ` Rafael J. Wysocki
2007-07-20 21:33 ` Jeremy Maitin-Shepard
[not found] ` <87ejj2pxoc.fsf@jbms.ath.cx>
2 siblings, 2 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:28 UTC (permalink / raw)
To: Milton Miller
Cc: David Lang, LKML, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Friday, 20 July 2007 18:56, Milton Miller wrote:
> On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:
> > On Friday, 20 July 2007 01:07, david@lang.hm wrote:
> >> On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:
> >>> On Thursday, 19 July 2007 17:46, Milton Miller wrote:
> >>>> The currently identified problems under discussion include:
> >>>> (1) how to interact with acpi to enter into S4.
> >>>> (2) how to identify which memory needs to be saved
> >>>> (3) how to communicate where to save the memory
> >>>> (4) what state should devices be in when switching kernels
> >>>> (5) the complicated setup required with the current patch
> >>>> (6) what code restores the image
> >>>
> >>> (7) how to avoid corrupting filesystems mounted by the hibernated
> >>> kernel
> >>
> >> I didn't realize this was a discussion item. I thought the options
> >> were
> >> clear, for some filesystem types you can mount them read-only, but for
> >> ext3 (and possibly other less common ones) you just plain cannot touch
> >> them.
> >
> > That's correct. And since you cannot touch ext3, you need either to
> > assume
> > that you won't touch filesystems at all, or to have code to
> > recognize the
> > filesystem you're dealing with.
>
> Or add a small bit of infrastructure that errors writes at make_request
> if you don't have a magic "i am a direct block device write from
> userspace" flag on the bio.
>
> The hibernate may fail, but you don't corrupt the media.
>
> If you don't get the image out, resume back to the "this is resume"
> instead of the power-down path.
Well, I don't think that is much prettier than the freezer ...
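A minimal sketch of the write gate proposed above, as a toy model in Python rather than kernel code. `Bio`, `DIRECT_USER_WRITE`, and this `make_request` are hypothetical stand-ins chosen for illustration; they are not the block layer's real types or flags.

```python
from dataclasses import dataclass

# Hypothetical marker for "i am a direct block device write from
# userspace"; an assumption for this sketch, not a real kernel flag.
DIRECT_USER_WRITE = 0x1


@dataclass
class Bio:
    """Toy stand-in for a block I/O request."""
    sector: int
    is_write: bool
    flags: int = 0


def make_request(bio, written):
    """Return True if the request was accepted, False if it was errored.

    Reads always pass. Writes pass only when they carry the magic flag;
    accepted writes record their sector in `written`.
    """
    if not bio.is_write:
        return True
    if not (bio.flags & DIRECT_USER_WRITE):
        return False  # error the write, leave the media untouched
    written.add(bio.sector)
    return True
```

In this model a filesystem write that lacks the flag is refused, so the hibernate attempt may fail, but nothing reaches the media.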
> >>>>> (2) Upon start-up (by which I mean what happens after the user has
> >>>>> pressed
> >>>>> the power button or something like that):
> >>>>> * check if the image is present (and valid) _without_ enabling
> >>>>> ACPI
> >>>>> (we don't
> >>>>> do that now, but I see no reason for not doing it in the new
> >>>>> framework)
> >>>>> * if the image is present (and valid), load it
> >>>>> * turn on ACPI (unless already turned on by the BIOS, that is)
> >>>>> * execute the _BFS global control method
> >>>>> * execute the _WAK global control method
> >>>>> * continue
> >>>>> Here, the first two things should be done by the image-loading
> >>>>> kernel, but
> >>>>> the remaining operations have to be carried out by the restored
> >>>>> kernel.
> >>>>
> >>>> Here I agree.
> >>>>
> >>>> Here is my proposal. Instead of trying to both write the image and
> >>>> suspend, I think this all becomes much simpler if we limit the scope
> >>>> of the work of the second kernel. Its purpose is to write the image.
> >>>> After that it's done. The platform can be powered off if we are
> >>>> going
> >>>> to S5. However, to support suspend to ram and suspend to disk, we
> >>>> return to the first kernel.
> >>>
> >>> We can't do this unless we have frozen tasks (this way, or another)
> >>> before
> >>> carrying out the entire operation. In that case, however, the
> >>> kexec-based
> >>> approach would have only one advantage over the current one.
> >>> Namely, it
> >>> would allow us to create bigger images.
> >>
> >> we all agree that tasks cannot run during the suspend-to-ram state,
> >> but
> >> the disagreement is over what this means
> >>
> >> at one extreme it could mean that you would need the full freezer as
> >> per
> >> the current suspend projects.
> >>
> >> at the other extreme it could mean that all that's needed is to
> >> invoke the
> >> suspend-to-ram routine before anything else on the suspended kernel
> >> on the
> >> return from the save and restore kernel.
> >>
> >> we just need to figure out which it is (or if it's somewhere in
> >> between).
> >
> > Well, I think that the "invoke the suspend-to-ram routine before
> > anything else
> > on the suspended kernel" thing won't be easy to implement in practice.
>
> Why? You don't expect suspend-to-ram in drivers to be implemented? We
> need more separation of the quiesce drivers from power-down devices?
No. I'm saying that when you go back from the image-saving kernel to the
hibernated kernel, you need to make sure that no task will cause any
filesystem's on-disk state to be actually updated. If you can't make such
a guarantee, you just can't do that.
With the current state of the drivers, it's not doable without the freezer.
> Note that we are just talking about "suspend devices and put their
> state in ram", not actually invoking the platform to suspend to ram.
>
> And I'm actually saying we free memory and maybe allocate disk blocks
> for the save before we suspend (see below).
Well, I've already written about the OOM killer ...
> >>>> Message-ID: <200707151433.34625.rjw@sisk.pl>
> >>>> On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:
> >>>>> (1) Filesystems mounted before the hibernation are untouchable
> >>>>
> >>>> This is because some file systems do a fsck or other activity even
> >>>> when
> >>>> mounted read only. For the kexec case, however, this should be
> >>>> "file
> >>>> systems mounted by the hibernated system must not be written". As
> >>>> has
> >>>> been mentioned in the past, we should be able to use something like
> >>>> dm
> >>>> snapshot to allow fsck and the file system to see the cleaned copy
> >>>> while not actually writing the media.
> >>>
> >>> We can't _require_ users to use the dm snapshot in order for the
> >>> hibernation
> >>> to work, sorry.
> >>>
> >>> And by _reading_ from a filesystem you generally update metadata.
> >>
> >> not if the filesystem is mounted read-only (except on ext3)
> >
> > Well, if the filesystem in question is a journaling one and the
> > hibernated
> > kernel has mounted this fs read-write, this seems to be tricky anyway.
>
> Yes. I would argue writing to existing blocks of a file (not through
> the filesystem, just getting their blocks from the file system) should
> be safe, but it occurs to me that may not be the case if your fsck and
> bmap move data blocks from some update log to the file system.
>
> But we know the (maximum) image size. So we could allocate the blocks
> in the first image before suspending the drivers and memory
> allocations, and supplying the list to the second kernel. We could
> even write to the first block with a signature "suspend to here", or
> even the whole block list to the beginning (it will have to be saved to
> disk for restore anyways).
The writing is easy (we're doing that already, just fine).
The tricky part would be if the image-saving kernel tried to mount a journaling
filesystem in use by the hibernated kernel.
> >>>> The kjump kernel must not have any knowledge retained if we reuse
> >>>> it.
> >>>>
> >>>>> (2) Swap space in use before the hibernation must be handled with
> >>>>> care
> >>>>
> >>>> Yes. Actually, even though they have been used by the write-in-the
> >>>> kernel users, they will be among the most difficult devices to use
> >>>> for
> >>>> snapshots by a userspace second kernel.
>
> If we use the "write to these blocks" then this is as easy as writing
> to a file in a mounted filesystem.
>
> >>>>> (4) The user should be able to limit the size of a hibernation
> >>>>> image
> >>>>
> >>>> This means the suspending kernel must arrange to reduce its active
> >>>> memory. The limited save can be done by providing a limited list in
> >>>> (3).
> >>>
> >>> It seems to me that you don't understand the problem here.
> >>>
> >>> Assume you have 90% of RAM allocated before the hibernation and the
> >>> user has
> >>> requested the image to be not greater than 50% of RAM. In that case
> >>> you have
> >>> to free some memory _before_ identifying memory to save and you must
> >>> not
> >>> race with applications that attempt to allocate memory while you're
> >>> doing it.
> >>
> >> I disagree a little bit.
> >>
> >> first off, only the suspending kernel can know what can be freed and
> >> what
> >> is needed to do so (remember this is kernel internals, it can change
> >> from
> >> patch to patch, let alone version to version)
> >>
> >> second, if you have a lot of memory to free, and you can't just throw
> >> away
> >> caches to do so, you don't know what is going to be involved in
> >> freeing
> >> the memory, it's very possible that it is going to involve userspace,
> >> so
> >> you can't freeze any significant portion of the system, so you can't
> >> eliminate all chance of races
> >>
> >> what you can do is
> >>
> >> 1. try to free stuff
> >> 2. stop the system and account for memory, is enough free
> >> if not goto 1
> >>
> >> if userspace is dirtying memory fast enough, or is just using enough
> >> memory that you can't meet your limit you just won't be able to
> >> suspend.
> >
> > This means unreliable hibernation for some workloads. While I agree
> > that
> > shouldn't be a problem in a common case, there are users who will
> > complain. ;-)
>
> With my allocate memory as a task and don't save that task's memory
> approach, we can get to this point while userspace is running. It
> could be controlled by userspace, or even be userspace
> (sys_do_not_save_me() waits for resume, and dies as the kernel
> resumes).
>
> >> but under any other conditions you will eventually get enough memory
> >> free.
> >>
> >> so try several times and if you still fail tell the user they have too
> >> much stuff running and they need to kill something.
> >
> > Well, with the freezer that's much simpler (and more reliable, I'd
> > say): you
> > freeze tasks and _then_ you shrink memory.
>
> It means you are committed to suspend before you try to shrink memory.
> What happens when the user requested a smaller image than memory in
> use?
You mean the user wanted the image to be so small that we can't create it?
Well, we don't. :-)
In fact, we cheat a little. Namely, we check if there's enough storage space
and if so, we create an image that's bigger than requested by the user.
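The fallback described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual swsusp logic: `choose_image_limit` and its three parameters are hypothetical names, and all sizes are taken to be in the same unit.

```python
def choose_image_limit(requested, smallest_possible, storage_free):
    """Return the size limit to use for the image, or None to give up.

    requested:         limit the user asked for
    smallest_possible: smallest image reachable after shrinking memory
    storage_free:      storage space available for the image
    """
    # Never aim below what memory shrinking can actually reach.
    limit = max(requested, smallest_possible)
    # The image still has to fit in the available storage space.
    if limit > storage_free:
        return None
    return limit
```

When the requested limit cannot be met but storage allows it, the sketch returns a limit larger than the one requested, which is the "cheat" mentioned above.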
> >>>>> (8) Hibernation and restore should not be too slow
> >>>>
> >>>> We control the added code. We are using full runtime drivers and
> >>>> will
> >>>> run at hardware speeds.
> >>>
> >>> That may not be enough. If you're going to save, say, 80% of RAM on
> >>> a 2 GB
> >>> machine, then you'll have to be using image compression.
> >>
> >> this doesn't make sense, 20% of 2G is 400M, if you can't make a
> >> kernel and
> >> userspace that can run in 400M you have a serious problem.
> >
> > I was talking about the _speed_ of writing and reading.
>
> Yes. As I said, adding a compress as we copy the pages into the saving
> kernel for writeout should be easy.
Easy or not, that's one more thing you should remember about.
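As a rough illustration of "compress as we copy the pages", here is a user-space sketch using Python's standard `zlib` module. The 4096-byte page size and the helper names `save_pages` and `restore_pages` are assumptions made for the example; none of the existing implementations is claimed to work this way.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def save_pages(pages):
    """Compress each page as it is copied out; return the compressed blobs."""
    return [zlib.compress(page) for page in pages]


def restore_pages(blobs):
    """Inverse of save_pages: decompress every blob back into a page."""
    return [zlib.decompress(blob) for blob in blobs]
```

Pages that are mostly zeros or repetitive shrink considerably, so fewer bytes have to be written and read back. How much that helps depends on the page contents and on how fast the CPU is relative to the disk.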
> >> even if you wanted to save 99% of RAM on a 2G system, you have 20M of
> >> ram
> >> to play with, which should easily be enough.
> >>
> >> remember, linux runs on really small systems as well, and while you do
> >> have to load some drivers for the big system, there are a lot of other
> >> things that aren't needed.
> >>
> >>> All in all, we have three different and working implementations of the
> >>> image-writing and image-reading code at our disposal. Why would you
> >>> want to
> >>> break the open doors?
> >>
> >> because you say that the current methods won't work without ACPI
> >> support.
> >
> > I didn't say that. [Or if I did, please point me to this message.]
> >
> > Anyway, this wouldn't be true even if I did.
> >
> > What I've been trying to say from the very beginning is that the
> > current
> > frameworks _support_ hibernation a la ACPI S4 (although that's not
> > exactly
> > ACPI S4) and if we are going to introduce a new framework, then it
> > should
> > be designed to _support_ ACPI S4 fully _from_ _the_ _start_.
> >
> > This DOESN'T mean that the non-ACPI hibernation should be unsupported
> > and
> > it DOESN'T mean that the non-ACPI hibernation is not supported
> > currently.
> > IT IS SUPPORTED.
> >
>
> As I said, I see kjump as a way to solve the "ok i am at a save point,
> now how do I write this image to media without allowing any other io".
> As you know by now, my solution for ACPI support is after the image is
> written we go back to the kernel that started the suspend and it puts
> the machine in S4.
>
> If this works, we get down to 1 hibernate implementation in the kernel
> :-).
For now, we have one implementation in the kernel (swsusp) that may be used
with some external (user space) tools (and is called uswsusp in that case,
quite confusingly), the other complete one that's waiting for merging with
the first one, at least in part (tuxonice, formerly known as suspend2), and
your proposed _third_ one (in a number of variants, perhaps).
I _think_ we can get down to one, but not by creating something entirely new
from scratch. If you can think of introducing the kexec-based approach
in such a way that it uses _as_ _much_ _of_ _existing_ _code_ _as_ _reasonably_
_possible_, then the result might be a candidate for the one common
implementation, as far as I'm concerned.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
2007-07-20 17:31 ` Jeremy Maitin-Shepard
@ 2007-07-20 21:30 ` Rafael J. Wysocki
0 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:30 UTC (permalink / raw)
To: Jeremy Maitin-Shepard
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm
On Friday, 20 July 2007 19:31, Jeremy Maitin-Shepard wrote:
> Milton Miller <miltonm@bga.com> writes:
>
> [snip]
>
> >>>> (7) how to avoid corrupting filesystems mounted by the hibernated kernel
> >>>
> >>> I didn't realize this was a discussion item. I thought the options were
> >>> clear, for some filesystem types you can mount them read-only, but for
> >>> ext3 (and possibly other less common ones) you just plain cannot touch
> >>> them.
> >>
> >> That's correct. And since you cannot touch ext3, you need either to assume
> >> that you won't touch filesystems at all, or to have code to recognize the
> >> filesystem you're dealing with.
>
> > Or add a small bit of infrastructure that errors writes at make_request if you
> > don't have a magic "i am a direct block device write from userspace" flag on the
> > bio.
>
> I still don't understand why there is this fixation on accessing dirty
> filesystems in use by the hibernated system. Even if you avoid
> corrupting the filesystem by avoiding writing to the block device, there
> isn't any real guarantee about the state of the data, except for a
> filesystem that specifically makes guarantees about such data (and I
> don't believe any of the existing ones do).
>
> It isn't necessary to be able to access such filesystems: everything can
> be done from an initramfs/initrd.
That's correct, but you need an additional ramdisk for that (yet another
complication).
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
2007-07-20 21:28 ` Rafael J. Wysocki
@ 2007-07-20 21:33 ` Jeremy Maitin-Shepard
[not found] ` <87ejj2pxoc.fsf@jbms.ath.cx>
1 sibling, 0 replies; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 21:33 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm
"Rafael J. Wysocki" <rjw@sisk.pl> writes:
[snip]
>> Or add a small bit of infrastructure that errors writes at make_request
>> if you don't have a magic "i am a direct block device write from
>> userspace" flag on the bio.
>>
>> The hibernate may fail, but you don't corrupt the media.
>>
>> If you don't get the image out, resume back to the "this is resume"
>> instead of the power-down path.
> Well, I don't think that is much prettier than the freezer ...
It seems that a better solution to the "how do we write to a file on an
in-use partition" has been suggested, which also handles swap partitions
and swap files, and does not require mounting filesystems, so it seems
that the filesystem issue need not be considered.
[snip]
> No. I'm saying that when you go back from the image-saving kernel to the
> hibernated kernel, you need to make sure that no task will cause any
> filesystem's on-disk state to be actually updated. If you can't make such
> a guarantee, you just can't do that.
> With the current state of the drivers, it's not doable without the
> freezer.
It seems that it should be feasible to fix the drivers so that
1. they can be taken from normal state to quiesced state without
requiring the freezer;
2. they can be taken from normal state to low power state without
requiring the freezer;
3. they can be taken from quiesced state to low power state without
requiring the freezer.
In particular, it seems that it should be possible to do (3) without
needing to schedule tasks.
It seems likely that (2) may in fact be almost exactly the same as, or
at least similar to, (1) followed by (3), at least for many drivers.
(1) is required by the kexec hibernate approach even ignoring suspend to
both or S4. (2) is required for suspend to ram without the freezer,
which seems to be desired anyway.
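The three transitions can be modelled as a small state machine. This is a toy sketch in Python, not a proposed driver interface: the `State` values and the method names `quiesce`, `power_down`, and `suspend` are hypothetical, chosen only to show (2) built from (1) followed by (3).

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    QUIESCED = "quiesced"
    LOW_POWER = "low power"


class Device:
    """Toy device; the method names are hypothetical, not a driver API."""

    def __init__(self):
        self.state = State.NORMAL
        self.trace = []

    def quiesce(self):
        """(1) normal -> quiesced: stop I/O, keep the device powered."""
        if self.state is not State.NORMAL:
            raise RuntimeError("quiesce requires the normal state")
        self.state = State.QUIESCED
        self.trace.append("quiesce")

    def power_down(self):
        """(3) quiesced -> low power: no new I/O can arrive here."""
        if self.state is not State.QUIESCED:
            raise RuntimeError("power_down requires the quiesced state")
        self.state = State.LOW_POWER
        self.trace.append("power_down")

    def suspend(self):
        """(2) normal -> low power, composed as (1) followed by (3)."""
        self.quiesce()
        self.power_down()
```

For drivers where this composition holds, only (1) and (3) need to be written and audited separately.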
[snip]
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
2007-07-20 21:24 ` Alan Stern
@ 2007-07-20 21:34 ` david
2007-07-20 21:37 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201428080.5166@asgard.lang.hm>
2 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-20 21:34 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Alan Stern wrote:
> On Fri, 20 Jul 2007 david@lang.hm wrote:
>
>>> Userspace can submit I/O requests. Someone will have to audit every
>>> driver to make sure that such I/O requests don't cause a quiesced
>>> device to become active. If the device is active, it will make the
>>> memory snapshot inconsistent with the on-device data.
>>
>> assuming this is the suspend-from-ram after a kexec back from the
>> write-to-disk kernel I don't think you are correct.
>>
>> when doing a suspend-to-ram you get to a point where you just don't use
>> any userspace.
>
> What do you mean? How can you prevent user tasks from running? That's
> basically what the freezer does, and the whole point of this approach
> is to eliminate the freezer. Right?
>
>> from that point on you are just walking the device tree
>> putting things into low-power mode. This is the point where we are talking
>> about jumping to.
>
> Yes. And putting things into low-power mode requires the ability to
> run the scheduler, which means that user tasks can be scheduled, which
> means that they can run.
I did not know that getting into low-power mode required scheduling.
does it require userspace?
if so this is a problem and I say punt on suspend-to-disk-and-ram until
suspend-to-ram is working independently ;-)
if not, then can you schedule but not consider non-kernel tasks runnable?
freezing all of userspace is easy (see above)
freezing all of kernelspace is easy (unplug all non-boot CPUs and don't
schedule)
where freezing gets hard is when you need to partially freeze either one
of these.
David Lang
* Re: Re: Hibernation considerations
2007-07-20 14:48 ` Huang, Ying
2007-07-20 15:48 ` david
@ 2007-07-20 21:34 ` Rafael J. Wysocki
1 sibling, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 21:34 UTC (permalink / raw)
To: Huang, Ying
Cc: David Lang, LKML, Milton Miller, linux-pm, Jeremy Maitin-Shepard
On Friday, 20 July 2007 16:48, Huang, Ying wrote:
> On Fri, 2007-07-20 at 09:01 -0500, Milton Miller wrote:
> > Simplifying kjump: the proposal for v3.
> >
> > The current code is trying to use crash dump area as a safe, reserved
> > area to run the second kernel. However, that means that the kernel
> > has to be linked specially to run in the reserved area. I think we
> > need to finish separating kexec_jump from the other code paths.
> >
> > (1) add a new command line argument that specifies the kexec_jump
> > target area (or just size?)
> >
> > (2) add a kjump flag to the flags parameter, used by kexec_load. When
> > loading a jump kernel, it is loaded like a normal kernel, however,
> > additional control pages are allocated to (a) save this kernel's use of
> > the kexec_jump target area (b) save the backed up region that is used
> > by all kernels like crash dump, and (c) space for invoking
> > relocate_new_kernel that will get its args from the execution entry
> > point and will restore the kernel then call resume and suspend.
>
> Backing up target memory before kexec and restoring it after kexec is
> a planned feature for kexec jump. But I will work on image writing/reading
> first.
Have you thought about using any existing code, when you're at it?
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707201608120.2546-100000@iolanthe.rowland.org>
@ 2007-07-20 21:35 ` Oliver Neukum
0 siblings, 0 replies; 81+ messages in thread
From: Oliver Neukum @ 2007-07-20 21:35 UTC (permalink / raw)
To: Alan Stern
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
Am Freitag 20 Juli 2007 schrieb Alan Stern:
> On Fri, 20 Jul 2007, Oliver Neukum wrote:
>
> > Am Freitag 20 Juli 2007 schrieb Alan Stern:
> > > Some drivers need the ability to schedule. Some will need the ability
> > > to allocate memory (although GFP_ATOMIC is probably sufficient). Some
> > > will need timers to run.
> >
> > Some will have to request firmware. It can add up to some megabytes.
> > In addition, if we don't freeze, some drivers, eg. video drivers, can
> > do allocations in the megabyte range.
> >
> > It seems to me that without the freezer we will end up with many drivers
> > needing a two step notification process. Furthermore there are requirements
> > on the order of shutting down system facilities, eg. device addition must
> > be stopped before drivers allocate firmware.
>
> These are really separate issues, since they refer to things that have
> to happen well before the memory snapshot is captured.
>
> We already have a pre-suspend notification available for drivers that
> need to allocate large amounts of memory.
Is that facility fine grained enough?
> You are correct about the need to delay/stop device addition. I don't
> know how this can be done in general; each code path calling
> device_add() may have to be treated individually.
What about the old API? Do we have to block module loading?
What happens if a scsi error handler is woken? If it cannot be woken,
how are errors handled?
Regards
Oliver
* Re: Re: Hibernation considerations
2007-07-20 21:24 ` Alan Stern
2007-07-20 21:34 ` david
@ 2007-07-20 21:37 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201428080.5166@asgard.lang.hm>
2 siblings, 0 replies; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 21:37 UTC (permalink / raw)
To: Alan Stern; +Cc: david, LKML, Milton Miller, Ying Huang, linux-pm
Alan Stern <stern@rowland.harvard.edu> writes:
> On Fri, 20 Jul 2007 david@lang.hm wrote:
>> > Userspace can submit I/O requests. Someone will have to audit every
>> > driver to make sure that such I/O requests don't cause a quiesced
>> > device to become active. If the device is active, it will make the
>> > memory snapshot inconsistent with the on-device data.
>>
>> assuming this is the suspend-from-ram after a kexec back from the
>> write-to-disk kernel I don't think you are correct.
>>
>> when doing a suspend-to-ram you get to a point where you just don't use
>> any userspace.
> What do you mean? How can you prevent user tasks from running? That's
> basically what the freezer does, and the whole point of this approach
> is to eliminate the freezer. Right?
Presumably no tasks at all would be scheduled.
>> from that point on you are just walking the device tree
>> putting things into low-power mode. This is the point where we are talking
>> about jumping to.
> Yes. And putting things into low-power mode requires the ability to
> run the scheduler, which means that user tasks can be scheduled, which
> means that they can run.
Does it really (fundamentally) require scheduling tasks, particularly in
the case that the devices have already been put in the "quiesced" state?
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
[not found] ` <Pine.LNX.4.64.0707201428080.5166@asgard.lang.hm>
@ 2007-07-20 22:15 ` Rafael J. Wysocki
0 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 22:15 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Friday, 20 July 2007 23:34, david@lang.hm wrote:
> On Fri, 20 Jul 2007, Alan Stern wrote:
>
> > On Fri, 20 Jul 2007 david@lang.hm wrote:
> >
> >>> Userspace can submit I/O requests. Someone will have to audit every
> >>> driver to make sure that such I/O requests don't cause a quiesced
> >>> device to become active. If the device is active, it will make the
> >>> memory snapshot inconsistent with the on-device data.
> >>
> >> assuming this is the suspend-from-ram after a kexec back from the
> >> write-to-disk kernel I don't think you are correct.
> >>
> >> when doing a suspend-to-ram you get to a point where you just don't use
> >> any userspace.
> >
> > What do you mean? How can you prevent user tasks from running? That's
> > basically what the freezer does, and the whole point of this approach
> > is to eliminate the freezer. Right?
> >
> >> from that point on you are just walking the device tree
> >> putting things into low-power mode. This is the point where we are talking
> >> about jumping to.
> >
> > Yes. And putting things into low-power mode requires the ability to
> > run the scheduler, which means that user tasks can be scheduled, which
> > means that they can run.
>
> I did not know that getting into low-power mode required scheduling.
>
> does it require userspace?
>
> if so this is a problem and I say punt on suspend-to-disk-and-ram until
> suspend-to-ram is working independently ;-)
>
> if not, then can you schedule but not consider non-kernel tasks runnable?
>
> freezing all of userspace is easy (see above)
>
> freezing all of kernelspace is easy (unplug all non-boot CPUs and don't
> schedule)
>
> where freezing gets hard is when you need to partially freeze either one
> of these.
If you use the scheduler to "freeze" tasks, you never know where they are
stopped and what locks they may hold.
We would have done that already if that was so easy, because we really want
to freeze _all_ user space tasks (even if not all kernel threads).
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] ` <87ejj2pxoc.fsf@jbms.ath.cx>
@ 2007-07-20 22:19 ` Rafael J. Wysocki
0 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-20 22:19 UTC (permalink / raw)
To: Jeremy Maitin-Shepard
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm
On Friday, 20 July 2007 23:33, Jeremy Maitin-Shepard wrote:
> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
>
> [snip]
>
> >> Or add a small bit of infrastructure that errors writes at make_request
> >> if you don't have a magic "i am a direct block device write from
> >> userspace" flag on the bio.
> >>
> >> The hibernate may fail, but you don't corrupt the media.
> >>
> >> If you don't get the image out, resume back to the "this is resume"
> >> instead of the power-down path.
>
> > Well, I don't think that is much prettier than the freezer ...
>
> It seems that a better solution to the "how do we write to a file on an
> in-use partition" has been suggested, which also handles swap partitions
> and swap files, and does not require mounting filesystems, so it seems
> that the filesystem issue need not be considered.
>
> [snip]
>
> > No. I'm saying that when you go back from the image-saving kernel to the
> > hibernated kernel, you need to make sure that no task will cause any
> > filesystem's on-disk state to be actually updated. If you can't make such
> > a guarantee, you just can't do that.
>
> > With the current state of the drivers, it's not doable without the
> > freezer.
>
> It seems that it should be feasible to fix the drivers so that
>
> 1. they can be taken from normal state to quiesced state without
> requiring the freezer;
>
> 2. they can be taken from normal state to low power state without
> requiring the freezer;
Yes, that's correct.
> 3. they can be taken from quiesced state to low power state without
> requiring the freezer.
>
> In particular, it seems that it should be possible to do (3) without
> needing to schedule tasks.
For that, you'd have to forbid the drivers to call schedule() from the relevant
callbacks, which means, eg. no timeouts in there.
> It seems likely that (2) may in fact be almost exactly the same as, or
> at least similar to, (1) followed by (3), at least for many drivers.
> (1) is required by the kexec hibernate approach even ignoring suspend to
> both or S4. (2) is required for suspend to ram without the freezer,
> which seems to be desired anyway.
Yes, (2) is needed anyway.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] <200707202335.05519.oliver@neukum.org>
@ 2007-07-20 22:25 ` Alan Stern
[not found] ` <Pine.LNX.4.44L0.0707201820050.5241-100000@iolanthe.rowland.org>
1 sibling, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 22:25 UTC (permalink / raw)
To: Oliver Neukum
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Oliver Neukum wrote:
> > We already have a pre-suspend notification available for drivers that
> > need to allocate large amounts of memory.
>
> Is that facility fine grained enough?
It's a notifier chain that gets called at several points during the
suspend transition. One of those points is right at the start, while
userspace is still running and reasonably large amounts of memory can
be allocated.
Is it fine-grained enough? I don't know -- hard to tell, since nothing
much is using it yet.
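To make the shape of that facility concrete, here is a toy model of a notifier chain in Python. The event names "prepare" and "post" and the `NotifierChain` and `Driver` classes are hypothetical; the sketch only shows a driver grabbing a large buffer at the first notification point and releasing it at the last.

```python
class NotifierChain:
    """Toy notifier chain: callbacks run in registration order."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call(self, event):
        for callback in self._callbacks:
            callback(event)


class Driver:
    """Driver that needs a large buffer before tasks stop running."""

    def __init__(self, size):
        self.size = size
        self.buffer = None

    def notify(self, event):
        if event == "prepare":
            # Start of the transition: userspace still runs, so a
            # large allocation is still reasonable here.
            self.buffer = bytearray(self.size)
        elif event == "post":
            # Transition finished: release the buffer again.
            self.buffer = None
```

Whether a single early notification is fine-grained enough for every driver is exactly the open question raised above.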
> > You are correct about the need to delay/stop device addition. I don't
> > know how this can be done in general; each code path calling
> > device_add() may have to be treated individually.
>
> What about the old API?
What old API do you mean?
> Do we have to block module loading?
No. Registering new drivers is okay, registering new devices is bad.
Of course, some modules do want to register a new device in their init
method. I don't know what we should do about them. Force the
registration to fail, I suppose. How often will people suspend while a
module is loading?
> What happens if a scsi error handler is woken? If it cannot be woken,
> how are errors handled?
Why should the error handler wake up? There isn't supposed to be any
I/O going on, hence no errors to handle.
Alan Stern
* Re: Re: Hibernation considerations
[not found] <877ioupxi8.fsf@jbms.ath.cx>
@ 2007-07-20 22:35 ` Alan Stern
2007-07-20 22:43 ` david
` (2 more replies)
0 siblings, 3 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-20 22:35 UTC (permalink / raw)
To: Jeremy Maitin-Shepard; +Cc: david, LKML, Milton Miller, Ying Huang, linux-pm
On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
> >> when doing a suspend-to-ram you get to a point where you just don't use
> >> any userspace.
>
> > What do you mean? How can you prevent user tasks from running? That's
> > basically what the freezer does, and the whole point of this approach
> > is to eliminate the freezer. Right?
>
> Presumably no tasks at all would be scheduled.
How would you prevent tasks from being scheduled? How would you
prevent drivers from deadlocking because in order to put their device
in a low-power state they need to acquire a lock which is held by a
user task?
> >> from that point on you are just walking the device tree
> >> putting things into low-power mode. This is the point where we are talking
> >> about jumping to.
>
> > Yes. And putting things into low-power mode requires the ability to
> > run the scheduler, which means that user tasks can be scheduled, which
> > means that they can run.
>
> Does it really (fundamentally) require scheduling tasks, particularly in
> the case that the devices have already been put in the "quiesced" state?
I can't say for sure. That's the way we have been doing it. It
wouldn't be easy to change, because the driver would have to busy-wait
during delays -- which would mean it would need to use different code
for system-wide suspend and runtime suspend.
Alan Stern
* Re: Re: Hibernation considerations
2007-07-20 22:35 ` Re: Hibernation considerations Alan Stern
@ 2007-07-20 22:43 ` david
2007-07-20 22:48 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201540260.5166@asgard.lang.hm>
2 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-20 22:43 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007, Alan Stern wrote:
> On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
>
>>>> when doing a suspend-to-ram you get to a point where you just don't use
>>>> any userspace.
>>
>>> What do you mean? How can you prevent user tasks from running? That's
>>> basically what the freezer does, and the whole point of this approach
>>> is to eliminate the freezer. Right?
>>
>> Presumably no tasks at all would be scheduled.
>
> How would you prevent tasks from being scheduled? How would you
> prevent drivers from deadlocking because in order to put their device
> in a low-power state they need to acquire a lock which is held by a
> user task?
you give up on the suspend because you have no way of getting the user
task to give up the lock.
however, kernel locks should not be held by user tasks. user tasks are not
expected to behave in rational ways; allowing them to compete with kernel
tasks for locks is a sure way to get a deadlock or indefinite stall.
what locks are accessed this way?
>>>> from that point on you are just walking the device tree
>>>> putting things into low-power mode. This is the point where we are talking
>>>> about jumping to.
>>
>>> Yes. And putting things into low-power mode requires the ability to
>>> run the scheduler, which means that user tasks can be scheduled, which
>>> means that they can run.
>>
>> Does it really (fundamentally) require scheduling tasks, particularly in
>> the case that the devices have already been put in the "quiesced" state?
>
> I can't say for sure. That's the way we have been doing it. It
> wouldn't be easy to change, because the driver would have to busy-wait
> during delays -- which would mean it would need to use different code
> for system-wide suspend and runtime suspend.
please define terms so that we are all on the same page
what do you mean by
system-wide suspend
runtime suspend
David Lang
* Re: Re: Hibernation considerations
2007-07-20 22:35 ` Re: Hibernation considerations Alan Stern
2007-07-20 22:43 ` david
@ 2007-07-20 22:48 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201540260.5166@asgard.lang.hm>
2 siblings, 0 replies; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-20 22:48 UTC (permalink / raw)
To: Alan Stern; +Cc: david, LKML, Milton Miller, Ying Huang, linux-pm
Alan Stern <stern@rowland.harvard.edu> writes:
> On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
>> >> when doing a suspend-to-ram you get to a point where you just don't use
>> >> any userspace.
>>
>> > What do you mean? How can you prevent user tasks from running? That's
>> > basically what the freezer does, and the whole point of this approach
>> > is to eliminate the freezer. Right?
>>
>> Presumably no tasks at all would be scheduled.
> How would you prevent tasks from being scheduled? How would you
> prevent drivers from deadlocking because in order to put their device
> in a low-power state they need to acquire a lock which is held by a
> user task?
Perhaps this isn't an issue once the device is already quiesced. I'm
just conjecturing.
[snip]
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
[not found] ` <Pine.LNX.4.64.0707201540260.5166@asgard.lang.hm>
@ 2007-07-21 5:21 ` Nigel Cunningham
2007-07-21 14:10 ` Alan Stern
1 sibling, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-21 5:21 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
Hi.
On Saturday 21 July 2007 08:43:20 david@lang.hm wrote:
> On Fri, 20 Jul 2007, Alan Stern wrote:
>
> > On Fri, 20 Jul 2007, Jeremy Maitin-Shepard wrote:
> >
> >>>> when doing a suspend-to-ram you get to a point where you just don't use
> >>>> any userspace.
> >>
> >>> What do you mean? How can you prevent user tasks from running? That's
> >>> basically what the freezer does, and the whole point of this approach
> >>> is to eliminate the freezer. Right?
> >>
> >> Presumably no tasks at all would be scheduled.
> >
> > How would you prevent tasks from being scheduled? How would you
> > prevent drivers from deadlocking because in order to put their device
> > in a low-power state they need to acquire a lock which is held by a
> > user task?
>
> you give up on the suspend because you have no way of getting the user
> task to give up the lock.
>
> however, kernel locks should not be held by user tasks. user tasks are not
> expected to behave in rational ways; allowing them to compete with kernel
> tasks for locks is a sure way to get a deadlock or indefinite stall.
>
> what locks are accessed this way?
Any userspace process can do a syscall. In the process of the syscall, it can
take kernel locks, and it can schedule (eg, while seeking to take a second
lock).
Regards,
Nigel
* Re: Re: Hibernation considerations
2007-07-20 21:02 ` Rafael J. Wysocki
@ 2007-07-21 11:44 ` Miklos Szeredi
[not found] ` <E1ICDNw-0008HC-00@dorka.pomaz.szeredi.hu>
1 sibling, 0 replies; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-21 11:44 UTC (permalink / raw)
To: rjw; +Cc: david, linux-kernel, miltonm, ying.huang, linux-pm, jbms
> The problem with FUSE is related to the fact that the freezer can't
> freeze uninterruptible tasks and we said that perhaps we might avoid
> it if FUSE was made freezing-aware. Still, no one has gone in this
> direction and I don't know of any plans to do that.
I thought we had fully explored this direction. Lots of emails, and
an IRC session with Pavel. Conclusion:
- It can't be done without VFS surgery + adding various hacks to fuse
- VFS surgery for the sake of a working suspend is not realistic
Although removing the freezer seems the cleanest solution, I'm not
saying the freezer can't be fixed up in the mean time.
Allowing tasks to remain in uninterruptible sleep seemed a nice way to
get around the fuse issues. What was the problem with that patch? It
was something that was supposed to have been tested in suspend2,
wasn't it?
The other one (trying to wake up a task so that other tasks may become
freezable) didn't seem such a good approach to me.
The theory is quite simple: while and after suspending devices, no
tasks must be touching said devices.
The very cleanest way to do this is in the drivers. The very simplest
way is the current freezer. But maybe there are possibilities
between these two extremes.
But I can almost guarantee you that any attempt at fixing the issues
through fuse will just result in an even bigger mess than what we
currently have.
Miklos
* Re: Re: Hibernation considerations
[not found] ` <E1ICDNw-0008HC-00@dorka.pomaz.szeredi.hu>
@ 2007-07-21 12:43 ` Nigel Cunningham
[not found] ` <200707212243.35602.nigel@nigel.suspend2.net>
1 sibling, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-21 12:43 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: david, linux-kernel, miltonm, ying.huang, linux-pm, jbms
Hi.
On Saturday 21 July 2007 21:44:32 Miklos Szeredi wrote:
> > The problem with FUSE is related to the fact that the freezer can't
> > freeze uninterruptible tasks and we said that perhaps we might avoid
> > it if FUSE was made freezing-aware. Still, no one has gone in this
> > direction and I don't know of any plans to do that.
>
> I thought we have fully explored this direction. Lots of emails, and
> an IRC session with Pavel. Conclusion:
What am I missing in the following suggested solution?
1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which,
when set, causes a userspace process to be frozen with kernel threads
instead of with userspace ones. When freezing, we freeze !TIF_LATEFREEZE
processes, sync, and then freeze TIF_LATEFREEZE processes and freezable
kernel threads.
2) In the fuse code, the PID of the process that will do the work gets passed
to the fuse kernel code when the mount is done. The kernel code sets the
TIF_LATEFREEZE flag, and resets it on umount.
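The ordering proposed in 1) and 2) can be sketched as follows. This is only a userspace model in Python under stated assumptions: the Task class and freeze_all() are hypothetical names, the late_freeze field stands in for the proposed TIF_LATEFREEZE flag, and every kernel thread is treated as freezable.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kernel_thread: bool = False
    late_freeze: bool = False   # models the proposed TIF_LATEFREEZE flag
    frozen: bool = False


def freeze_all(tasks, sync):
    """Freeze in two stages and return the order in which tasks froze."""
    order = []
    # Stage 1: ordinary userspace tasks (not kernel threads, flag clear).
    for t in tasks:
        if not t.kernel_thread and not t.late_freeze:
            t.frozen = True
            order.append(t.name)
    # The filesystem servers are still running, so the sync can complete.
    sync()
    # Stage 2: late-freeze userspace tasks and kernel threads.
    for t in tasks:
        if not t.frozen:
            t.frozen = True
            order.append(t.name)
    return order


events = []
tasks = [
    Task("sshfs", late_freeze=True),        # FUSE server, flagged at mount
    Task("bash"),
    Task("kthread", kernel_thread=True),
]
order = freeze_all(tasks, lambda: events.append("sync"))
```

The sketch says nothing about what happens when one flagged process depends on another; it only shows the intended ordering.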
Sorry, but this is a hit-and-run email - I'm off to bed now.
Regards,
Nigel
* Re: Re: Hibernation considerations
[not found] ` <200707212243.35602.nigel@nigel.suspend2.net>
@ 2007-07-21 13:56 ` Alan Stern
2007-07-21 16:13 ` Jeremy Maitin-Shepard
` (2 subsequent siblings)
3 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-21 13:56 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
On Sat, 21 Jul 2007, Nigel Cunningham wrote:
> What am I missing in the following suggested solution?
>
> 1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which,
> when set, causes a userspace process to be frozen with kernel threads
> instead of with userspace ones. When freezing, we freeze !TIF_LATEFREEZE
> processes, sync, and then freeze TIF_LATEFREEZE processes and freezable
> kernel threads.
>
> 2) In the fuse code, the PID of the process that will do the work gets passed
> to the fuse kernel code when the mount is done. The kernel code sets the
> TIF_LATEFREEZE flag, and resets it on umount.
What happens when one FUSE filesystem makes use of another? You'll
still end up with unfreezable processes, except that now you won't
detect them until the LATEFREEZE stage.
Alan Stern
* Re: Re: Hibernation considerations
[not found] ` <Pine.LNX.4.64.0707201540260.5166@asgard.lang.hm>
2007-07-21 5:21 ` Nigel Cunningham
@ 2007-07-21 14:10 ` Alan Stern
1 sibling, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-21 14:10 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Fri, 20 Jul 2007 david@lang.hm wrote:
> > How would you prevent tasks from being scheduled? How would you
> > prevent drivers from deadlocking because in order to put their device
> > in a low-power state they need to acquire a lock which is held by a
> > user task?
>
> you give up on the suspend because you have no way of getting the user
> task to give up the lock.
Once the deadlock has occurred it's too late. You can't give up; in
fact you can't do anything at all. The system has hung.
> however, kernel locks should not be held by user tasks, user tasks are not
> expected to behave in rational ways, allowing them to compete with kernel
> tasks for locks is a sure way to get a deadlock or indefinite stall.
What on Earth are you talking about? "Kernel locks should not be held
by user tasks"? Then who _should_ hold them? You are aware, I hope,
that down() and mutex_lock() can be called only in process context?
> what locks are accessed this way?
Lots of them. For example, most drivers won't want a suspend to occur
right in the middle of an I/O transfer. To prevent this, the driver
might use a mutex. The task doing the I/O (which will be a user task)
acquires the mutex during a transfer and the suspend routine acquires
the mutex while quiescing the device.
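A minimal sketch of that arrangement, as a userspace model in Python rather than driver code. The names io_mutex, transfer() and try_suspend() are hypothetical. The suspend path uses a non-blocking acquire only so that the contention is visible in a single thread; a real suspend routine would typically block on the mutex, which is where the deadlock arises if the holder can never run again.

```python
import threading

io_mutex = threading.Lock()   # stands in for a per-device driver mutex


def try_suspend():
    """Suspend path: quiesce the device only if no transfer holds the mutex."""
    if not io_mutex.acquire(blocking=False):
        return False          # a transfer is in flight; cannot quiesce now
    try:
        return True           # the device would be quiesced here
    finally:
        io_mutex.release()


def transfer(during):
    """I/O path, run on behalf of a user task; holds the mutex throughout."""
    with io_mutex:
        # 'during' is called in the middle of the transfer.
        return during()


suspend_during_io = transfer(try_suspend)   # attempted mid-transfer
suspend_when_idle = try_suspend()           # attempted with no I/O pending
```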
> >> Does it really (fundamentally) require scheduling tasks, particularly in
> >> the case that the devices have already been put in the "quiesced" state?
> >
> > I can't say for sure. That's the way we have been doing it. It
> > wouldn't be easy to change, because the driver would have to busy-wait
> > during delays -- which would mean it would need to use different code
> > for system-wide suspend and runtime suspend.
>
> please define terms so that we are all on the same page
Please read Documentation/power/devices.txt.
> what do you mean by
> system-wide suspend
That's what you would call standby, suspend-to-RAM, or hibernate. The
entire system goes to sleep.
> runtime suspend
That's when an individual device is placed in a low-power state to
save energy while it isn't being used. The system as a whole remains
awake and the device will be resumed the next time it is needed for
anything.
Alan Stern
* Re: Re: Hibernation considerations
[not found] ` <200707212243.35602.nigel@nigel.suspend2.net>
2007-07-21 13:56 ` Alan Stern
@ 2007-07-21 16:13 ` Jeremy Maitin-Shepard
[not found] ` <87lkd9ohtn.fsf@jbms.ath.cx>
2007-08-01 9:19 ` Pavel Machek
3 siblings, 0 replies; 81+ messages in thread
From: Jeremy Maitin-Shepard @ 2007-07-21 16:13 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm
It seems that you could still potentially get a failure to freeze if one
FUSE process depends on another, and the one that is frozen second just
happens to be waiting on the one that is frozen first when it is frozen.
I admit that this situation is unlikely, and perhaps acceptable.
A larger concern is that it seems that freezing FUSE processes at all
_will_ generate deadlocks if a non-synchronous or memory-map-supporting
filesystem is loopback mounted from a FUSE filesystem. In that case, if
you attempt to sync or free memory once FUSE is frozen, you are sure to
get a deadlock.
--
Jeremy Maitin-Shepard
* Re: Re: Hibernation considerations
[not found] ` <87lkd9ohtn.fsf@jbms.ath.cx>
@ 2007-07-21 18:12 ` Miklos Szeredi
2007-07-21 19:20 ` Rafael J. Wysocki
` (2 more replies)
2007-07-21 22:16 ` Nigel Cunningham
1 sibling, 3 replies; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-21 18:12 UTC (permalink / raw)
To: jbms; +Cc: david, miklos, linux-kernel, miltonm, ying.huang, linux-pm
> It seems that you could still potentially get a failure to freeze if one
> FUSE process depends on another, and the one that is frozen second just
> happens to be waiting on the one that is frozen first when it is frozen.
> I admit that this situation is unlikely, and perhaps acceptable.
It isn't all that unlikely. There's sshfs for example, that depends
on a separate ssh process for transport.
Oh, there are also userspace network transports, like tun/tap,
nfqueue, etc. They could block any network filesystem (not just fuse)
if frozen first, making the freezer fail.
Hmm, I wonder why this isn't affecting people with VPNs? Probably
network mounts over VPN are rare, and it is even rarer to have fs activity
on them during suspend.
Anyway, I think it's long overdue to stop thinking about how to "fix"
fuse, and concentrate on fixing the underlying problem instead ;)
> A larger concern is that it seems that freezing FUSE processes at all
> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> filesystem is loopback mounted from a FUSE filesystem. In that case, if
> you attempt to sync or free memory once FUSE is frozen, you are sure to
> get a deadlock.
Well, it would deadlock, if
a) memory reclaim was synchronous, or
b) a large part of the memory was used for dirty file data
I can't remember if (a) was ever true. And now the dirty ratio is 10%
by default, so if we go OOM because that 10% can't be reclaimed, there
is a more serious problem.
Swap over loop over fuse would be problematic, but that won't work for
some time yet ;)
Miklos
* Re: Re: Hibernation considerations
2007-07-21 18:12 ` Miklos Szeredi
@ 2007-07-21 19:20 ` Rafael J. Wysocki
2007-07-21 22:21 ` Nigel Cunningham
[not found] ` <200707212120.04645.rjw@sisk.pl>
2 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-21 19:20 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: david, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Saturday, 21 July 2007 20:12, Miklos Szeredi wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
>
> It isn't all that unlikely. There's sshfs for example, that depends
> on a separate ssh process for transport.
>
> Oh, there are also userspace network transports, like tun/tap,
> nfqueue, etc. They could block any network filesystem (not just fuse)
> if frozen first, making the freezer fail.
>
> Hmm, I wonder why this isn't affecting people with VPNs? Probably
> network mounts over VPN are rare, and it is even rarer to have fs activity
> on them during suspend.
>
> Anyway, I think it's long overdue to stop thinking about how to "fix"
> fuse, and concentrate on fixing the underlying problem instead ;)
To conclude this branch of the thread, I have a patch in the works that may
help a bit with unfreezable FUSE filesystems and it only affects the freezer.
I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
that need to go first.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] ` <87lkd9ohtn.fsf@jbms.ath.cx>
2007-07-21 18:12 ` Miklos Szeredi
@ 2007-07-21 22:16 ` Nigel Cunningham
2007-07-22 15:26 ` Alan Stern
1 sibling, 1 reply; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-21 22:16 UTC (permalink / raw)
To: Jeremy Maitin-Shepard
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm
Hi.
On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> It seems that you could still potentially get a failure to freeze if one
> FUSE process depends on another, and the one that is frozen second just
> happens to be waiting on the one that is frozen first when it is frozen.
> I admit that this situation is unlikely, and perhaps acceptable.
>
> A larger concern is that it seems that freezing FUSE processes at all
> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> filesystem is loopback mounted from a FUSE filesystem. In that case, if
> you attempt to sync or free memory once FUSE is frozen, you are sure to
> get a deadlock.
Ok. So then (in response to Alan too), how about keeping a tree of mounts,
akin to the device tree, and working from the deepest nodes up? (In
conjunction with what I already suggested)?
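As a rough illustration of "working from the deepest nodes up", here is a post-order walk over a mount tree. It is a sketch in Python with made-up mount points; freeze_order() and the mounts_on mapping are hypothetical and do not correspond to any existing kernel structure.

```python
def freeze_order(mounts_on, root):
    """Return mounts in post-order: each mount after everything stacked on it."""
    order = []

    def visit(node):
        for child in mounts_on.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


# Hypothetical stack: a loopback filesystem lives on an sshfs mount,
# which in turn lives on the root filesystem.
mounts_on = {
    "/": ["/mnt/sshfs", "/home"],
    "/mnt/sshfs": ["/mnt/loop"],
}
order = freeze_order(mounts_on, "/")
```

A tree like this only captures mount-on-mount dependencies. It does not capture a filesystem server that depends on another process (sshfs on its ssh transport, for instance), which is the other case raised in this thread.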
Regards,
Nigel
--
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.
* Re: Re: Hibernation considerations
2007-07-21 18:12 ` Miklos Szeredi
2007-07-21 19:20 ` Rafael J. Wysocki
@ 2007-07-21 22:21 ` Nigel Cunningham
[not found] ` <200707212120.04645.rjw@sisk.pl>
2 siblings, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-21 22:21 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: david, linux-kernel, miltonm, ying.huang, linux-pm, jbms
Hi.
On Sunday 22 July 2007 04:12:22 Miklos Szeredi wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
>
> It isn't all that unlikely. There's sshfs for example, that depends
> on a separate ssh process for transport.
>
> Oh, there are also userspace network transports, like tun/tap,
> nfqueue, etc. They could block any network filesystem (not just fuse)
> if frozen first, making the freezer fail.
>
> Hmm, I wonder why this isn't affecting people with VPNs? Probably
> network mounts over VPN are rare, and it is even rarer to have fs activity
> on them during suspend.
>
> Anyway, I think it's long overdue to stop thinking about how to "fix"
> fuse, and concentrate on fixing the underlying problem instead ;)
That's what I'm seeking to do :)
> > A larger concern is that it seems that freezing FUSE processes at all
> > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > filesystem is loopback mounted from a FUSE filesystem. In that case, if
> > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > get a deadlock.
>
> Well, it would deadlock, if
>
> a) memory reclaim was synchronous, or
> b) a large part of the memory was used for dirty file data
These are problems in normal operation, aren't they?
> I can't remember if (a) was ever true. And now the dirty ratio is 10%
> by default, so if we go OOM because that 10% can't be reclaimed, there
> is a more serious problem.
>
> Swap over loop over fuse would be problematic, but that won't work for
> some time yet ;)
Hopefully people will wake up to the problems with Fuse and get rid of it
before then :|. Of course I don't really expect that to happen.
Nigel
--
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.
* Re: Re: Hibernation considerations
2007-07-20 15:48 ` david
@ 2007-07-22 2:17 ` Huang, Ying
[not found] ` <1185070634.3517.11.camel@caritas-dev.intel.com>
1 sibling, 0 replies; 81+ messages in thread
From: Huang, Ying @ 2007-07-22 2:17 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, linux-pm, Jeremy Maitin-Shepard
On Fri, 2007-07-20 at 08:48 -0700, david@lang.hm wrote:
> > Backing up target memory before kexec and restoring it after kexec is
> > a planned feature for kexec jump. But I will work on image writing/reading
> > first.
>
> if we can get a list of what memory is safe to backup/restore then the
> reading/writing of the image should be able to be done in userspace.
The backup/restore here has nothing to do with the read/write of the
image. It means that, instead of reserving memory for the new kernel as
crash-dump does, the memory for the new kernel is backed up before kexec
and restored after kexec by the kexec kernel.
> > If the "scatter copy" is replaced by "scatter swap", we do not need the
> > inverse list, and the state of the kexeced kernel can be backed up too.
> > There is "scatter copy" support in the normal kexec implementation, in
> > "relocate_kernel".
>
> what do you mean by "scatter swap"
copy: dest=src
swap: tmp=dest; dest=src; src=tmp
If memory is swapped, no information is lost, both that of kexec kernel
and kexeced kernel.
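The difference can be sketched as follows. This is a toy model in Python, not the actual relocate_kernel code: memory is a dict of page number to contents, and scatter_copy() and scatter_swap() are hypothetical helper names.

```python
def scatter_copy(mem, pairs):
    """dest = src for each (dest, src) page pair; dest's old contents are lost."""
    for dest, src in pairs:
        mem[dest] = mem[src]


def scatter_swap(mem, pairs):
    """Exchange each (dest, src) page pair; nothing is lost."""
    for dest, src in pairs:
        mem[dest], mem[src] = mem[src], mem[dest]


# Pages 0 and 1 hold the running kernel's data ("A", "B"); pages 8 and 9
# hold the staged pages of the other kernel ("X", "Y").
pairs = [(0, 8), (1, 9)]

swapped = {0: "A", 1: "B", 8: "X", 9: "Y"}
scatter_swap(swapped, pairs)

restored = dict(swapped)
scatter_swap(restored, list(reversed(pairs)))   # undo with the same list

copied = {0: "A", 1: "B", 8: "X", 9: "Y"}
scatter_copy(copied, pairs)
```

After the swap the original pages still exist at the source locations, and replaying the same list in reverse order undoes it, so no separate inverse list is required. After the copy, "A" and "B" are gone.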
Best Regards,
Huang, Ying
* Re: Re: Hibernation considerations
[not found] ` <1185070634.3517.11.camel@caritas-dev.intel.com>
@ 2007-07-22 2:32 ` david
0 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-22 2:32 UTC (permalink / raw)
To: Huang, Ying; +Cc: LKML, Milton Miller, linux-pm, Jeremy Maitin-Shepard
On Sun, 22 Jul 2007, Huang, Ying wrote:
> On Fri, 2007-07-20 at 08:48 -0700, david@lang.hm wrote:
>>> Backing up target memory before kexec and restoring it after kexec is
>>> a planned feature for kexec jump. But I will work on image writing/reading
>>> first.
>>
>> if we can get a list of what memory is safe to backup/restore then the
>> reading/writing of the image should be able to be done in userspace.
>
> The backup/restore here has nothing to do with the read/write of the
> image. It means that, instead of reserving memory for the new kernel as
> crash-dump does, the memory for the new kernel is backed up before kexec
> and restored after kexec by the kexec kernel.
Ok, I see the miscommunication here. you are talking about freeing up
memory for the second kernel instead of reserving it from boot time.
I'm talking about getting the second kernel a list of what memory pages it
should write to the image
if we can get the info for the list I'm looking for we should be able to
demonstrate the kexec based hibernate.
the change you are talking about is an enhancement that is useful after
that point to save some memory.
>>> If the "scatter copy" is replaced by "scatter swap", we do not need the
>>> inverse list, and the state of the kexeced kernel can be backed up too.
>>> There is "scatter copy" support in the normal kexec implementation, in
>>> "relocate_kernel".
>>
>> what do you mean by "scatter swap"
>
> copy: dest=src
> swap: tmp=dest; dest=src; src=tmp
>
> If memory is swapped, no information is lost, both that of kexec kernel
> and kexeced kernel.
I'm missing why you need to preserve this memory
if you are talking about memory that will be used by the second kernel
when you kexec to it then you don't need to preserve it (since it will be
overwritten by the second kernel). if you aren't talking about memory that
will be used by the second kernel why do you need to move it?
David Lang
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707210956320.8201-100000@netrider.rowland.org>
@ 2007-07-22 3:43 ` david
0 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-22 3:43 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Sat, 21 Jul 2007, Alan Stern wrote:
> On Fri, 20 Jul 2007 david@lang.hm wrote:
>
>>> How would you prevent tasks from being scheduled? How would you
>>> prevent drivers from deadlocking because in order to put their device
>>> in a low-power state they need to acquire a lock which is held by a
>>> user task?
>>
>> you give up on the suspend because you have no way of getting the user
>> task to give up the lock.
>
> Once the deadlock has occurred it's too late. You can't give up; in
> fact you can't do anything at all. The system has hung.
>
>> however, kernel locks should not be held by user tasks, user tasks are not
>> expected to behave in rational ways, allowing them to compete with kernel
>> tasks for locks is a sure way to get a deadlock or indefinite stall.
>
> What on Earth are you talking about? "Kernel locks should not be held
> by user tasks"? Then who _should_ hold them? You are aware, I hope,
> that down() and mutex_lock() can be called only in process context?
>
>> what locks are accessed this way?
>
> Lots of them. For example, most drivers won't want a suspend to occur
> right in the middle of an I/O transfer. To prevent this, the driver
> might use a mutex. The task doing the I/O (which will be a user task)
> acquires the mutex during a transfer and the suspend routine acquires
> the mutex while quiescing the device.
wait a minute here, it's possible we are misunderstanding each other.
as I see it:
if userspace can acquire locks that prevent the kernel from shutting off
(or doing anything else in particular) then it's possible for misbehaving
userspace code to stop the kernel by simply choosing to never release the
lock.
this would be a trivial DoS from userspace.
now, if you are talking instead about the fact that when userspace makes a
system call, the execution of that system call involves acquiring locks
that are released before the system call completes, you have a very
different situation.
if you have locks that are held across system calls then you should
already have problems, because you can't count on userspace ever taking
whatever action is appropriate to release the lock.
what am I missing that concerns you so much?
>>>> Does it really (fundamentally) require scheduling tasks, particularly in
>>>> the case that the devices have already been put in the "quiesced" state?
>>>
>>> I can't say for sure. That's the way we have been doing it. It
>>> wouldn't be easy to change, because the driver would have to busy-wait
>>> during delays -- which would mean it would need to use different code
>>> for system-wide suspend and runtime suspend.
>>
>> please define terms so that we are all on the same page
>
> Please read Documentation/power/devices.txt.
I have done so.
>> what do you mean by
>> system-wide suspend
>
> That's what you would call standby, suspend-to-RAM, or hibernate. The
> entire system goes to sleep.
>
>> runtime suspend
>
> That's when an individual device is placed in a low-power state to
> save energy while it isn't being used. The system as a whole remains
> awake and the device will be resumed the next time it is needed for
> anything.
thanks for the definitions.
having read through Documentation/power/devices.txt I remain convinced
that you are making a fundamental mistake.
you are designing a system that will only work if everything (every
driver, every state transition) participates fully in the process at all
times. You started with the facts 'this is the info that ACPI provides and
this is how it is designed to be used' and worked from there instead of
looking to see what the kernel really needed and figuring how to provide a
good interface for that that happens to be implemented (today) with ACPI.
(a proper power management framework shouldn't care if you have ACPI, APM,
or some other method of controlling the devices)
this leads to resume functions that can only work if the proper suspend
function was called rather than making 'resume' just mean 'go to full
operation', which is the same thing that gets called when the device is
first initialized. internally it can examine the hardware and follow
different paths depending on what it finds the current state of the
hardware is, but the outside world (including the rest of the kernel)
should not care. the fact that the rest of the kernel needs to know if it
should call 'resume' or 'initialize' is a failure in the abstraction.
in fact, a better abstraction would be something like
report_power_modes
which would return a series of modes (sorted only by modeID)
modeID, %power_used_in_this_mode, %capability_in_this_mode
(I would make mode 0 always be complete power off, and mode 1 always be
full capacity)
report_power_mode_speed
which would return a matrix giving how long it takes to transition from
any mode to any other mode. this should be a relative number, not an
absolute number since it will be different at different clock speeds.
set_operational_mode(modeID)
which would take you from whatever mode you are in now to the requested
mode.
most devices would report the simple list of modes
0,0,0
1,100,100
with a mode_speed matrix of
  | 0 1
--+----
0 | 0 1
1 | 1 0
it may be that there is more info needed for the power management engine to
decide what modes it wants to put things into, if so identify what type of
info you need and add another column to the modes list.
for example:
you may want to add a flag for 'does this mode allow downstream devices
to operate?'
you may want to make a mode for 'this mode doesn't allow any new
requests, but continues to process pending requests' and have a flag that
indicates this
currently it looks like there's no way to find out what modes are
available, and you have to know what mode something is in currently before
you can request it change to a different mode. both of these prevent
effective power management without encoding intimate knowledge of the
capability of the particular hardware in your management tool.
some of this may be discoverable via the ACPI interface (it's not talked
about much in the devices.txt file), but the mode setting is still wrong.
note that in the example above it's acceptable for a driver to cache what
mode it thinks the device is in, but it needs to properly set the new
mode even if its cached data is incorrect.
this approach would allow the transition of ALL drivers to the new mode of
operation in one fell swoop, and then adding additional power management
features is just adding to the existing list rather than implementing new
functions.
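
A minimal sketch of the interface described above, written in Python purely
for illustration (a real driver would be C). Only the three operation names
come from the text; the class names, field names and the [from][to] indexing
of the speed matrix are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    mode_id: int          # 0 = complete power off, 1 = full capacity
    power_pct: int        # % of full power used in this mode
    capability_pct: int   # % of full capability available in this mode


class SimpleDevice:
    """Hypothetical device exposing only the 'off' and 'full' modes."""

    def __init__(self):
        self._modes = [PowerMode(0, 0, 0), PowerMode(1, 100, 100)]
        self._cached_mode = 0      # what the driver thinks the mode is
        self._hardware_mode = 0    # stand-in for the real hardware state

    def report_power_modes(self):
        # Sorted only by mode ID, as in the description.
        return sorted(self._modes, key=lambda m: m.mode_id)

    def report_power_mode_speed(self):
        # Relative transition cost, indexed [from_mode][to_mode] (assumed).
        return [[0, 1],
                [1, 0]]

    def set_operational_mode(self, mode_id):
        if mode_id not in {m.mode_id for m in self._modes}:
            raise ValueError(f"unknown mode {mode_id}")
        # Always program the hardware; the cache is never used to skip work.
        self._hardware_mode = mode_id
        self._cached_mode = mode_id


dev = SimpleDevice()
dev._cached_mode = 1          # simulate a stale cache: hardware is really off
dev.set_operational_mode(1)   # must still bring the hardware to full operation
```

The caller never needs to know whether this was a 'resume' or an
'initialize'; it only names the mode it wants.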
David Lang
* Re: Re: Hibernation considerations
2007-07-21 22:16 ` Nigel Cunningham
@ 2007-07-22 15:26 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-22 15:26 UTC (permalink / raw)
To: nigel
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> Hi.
>
> On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > It seems that you could still potentially get a failure to freeze if one
> > FUSE process depends on another, and the one that is frozen second just
> > happens to be waiting on the one that is frozen first when it is frozen.
> > I admit that this situation is unlikely, and perhaps acceptable.
> >
> > A larger concern is that it seems that freezing FUSE processes at all
> > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > filesystem is loopback mounted from a FUSE filesystem. In that case, if
> > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > get a deadlock.
>
> Ok. So then (in response to Alan too), how about keeping a tree of mounts,
> akin to the device tree, and working from the deepest nodes up? (In
> conjunction with what I already suggested)?
Face it, Nigel, this is a losing battle. You can try to come up with
ever-more complex schemes to try and force FUSE into the freezer's
framework, but it just won't fit. Or if it does, the next filesystem
to come along will require an even more baroque type of special-case
handling.
The general problem is that task A may be in an unfreezable state,
waiting for task B to do something, while task B is already frozen.
Since there's no reasonable way to determine that A really is waiting
for B, you're just stuck. (To make matters worse, A may not even
realize which task it is waiting for; it may know only that it's
waiting for somebody to do something!) A and B could be user tasks,
kernel threads, or one of each.
The only thing to do is what Rafael has been working on: unfreeze
things, hope the tasks sort themselves out, and try again.
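
The ordering problem can be shown with a toy model. This is only a sketch:
the kernel has no waits_on table like the one below (that is exactly the
difficulty), and the function and task names are hypothetical.

```python
def try_freeze(order, waits_on):
    """Freeze tasks in the given order; return the tasks that got stuck.

    waits_on maps a task to the task it is blocked on (absent if the task
    is not blocked). A blocked task can only reach its freeze point after
    the task it waits on has run, which a frozen task never does.
    """
    frozen = set()
    stuck = set()
    for task in order:
        blocker = waits_on.get(task)
        if blocker in frozen:
            stuck.add(task)      # blocker is frozen, so task cannot progress
        else:
            frozen.add(task)     # blocker (if any) can still run and release us
    return stuck


waits_on = {"A": "B"}                            # A waits for B to do something
stuck_b_first = try_freeze(["B", "A"], waits_on)  # B frozen first
stuck_a_first = try_freeze(["A", "B"], waits_on)  # A frozen first
```

In this model, freezing B first leaves A stuck, while freezing A first
succeeds. The freezer cannot pick the good order on purpose because it
cannot see the dependency.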
Alan Stern
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.64.0707212000380.6747@asgard.lang.hm>
@ 2007-07-22 16:00 ` Alan Stern
2007-07-22 21:50 ` david
0 siblings, 1 reply; 81+ messages in thread
From: Alan Stern @ 2007-07-22 16:00 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Sat, 21 Jul 2007 david@lang.hm wrote:
> wait a min here, it's possible we are misunderstanding each other.
I'd describe it as: You are misunderstanding me. :-)
> as I see it.
>
> if userspace can acquire locks that prevent the kernel from shutting off
> (or doing anything else in particular) then it's possible for misbehaving
> userspace code to stop the kernel by simply choosing to never release the
> lock.
>
> this would be a trivial DOS from userspace.
You are confusing "userspace" with "user tasks". And not only that,
you often use the term "userspace" when you should say "user mode".
If you want I can explain the differences.
> now, if you are talking instead about the fact that when userspace makes a
> system call, the execution of that system call involves acquiring locks
> that are released before the system call completes you have a very
> different situation.
That is exactly what I have been talking about. It may be different
from what you _thought_, but it's not different from what I actually
_said_.
> if you have locks that are held across system calls then you should
> already have problems, because you can't count on userspace ever taking
> whatever action is appropriate to release the lock.
>
> what am I missing that concerns you so much?
Here's what you are missing:
The new kexec approach eliminates the freezer and relies instead on the
fact that none of the tasks in the original kernel can execute while
the new kexec'd kernel is running. This means the new kernel can write
out a memory image with no fear of interference or corruption.
But it also means that tasks which otherwise would have been frozen are
actually free to run before the kexec call is made (and after the call
returns, if the kexec'd kernel returns back to the original kernel).
Any driver which was written with the assumption that tasks would be
frozen at those times will need to be changed.
For example, drivers know that they have to quiesce their device in
preparation for creating the memory snapshot. But they assume that no
I/O requests will be made while the device is quiesced (because no user
task is capable of generating an I/O request if they are all frozen),
so the driver doesn't try to prevent such requests from reactivating
the device.
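
A rough sketch of the kind of change this implies for a driver, in Python
for illustration only. The class and method names are hypothetical and do
not correspond to any real kernel interface: the point is simply that a
request arriving while the device is quiesced is held back instead of
reactivating the hardware.

```python
from collections import deque


class Driver:
    """Hypothetical driver that defers I/O submitted while quiesced."""

    def __init__(self):
        self.quiesced = False
        self.pending = deque()   # requests held back during quiesce
        self.completed = []      # requests that reached the device

    def quiesce(self):
        self.quiesced = True

    def resume(self):
        self.quiesced = False
        # Replay deferred requests in arrival order.
        while self.pending:
            self.completed.append(self.pending.popleft())

    def submit(self, request):
        if self.quiesced:
            self.pending.append(request)    # do not touch the hardware
        else:
            self.completed.append(request)


drv = Driver()
drv.submit("read-1")
drv.quiesce()
drv.submit("write-2")   # from a task that was not frozen; must not wake the device
drv.resume()
```

With the freezer in place the "write-2" case never arises, which is why
existing drivers generally do not handle it.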
The situation as regards locking is harder to discuss since I don't
know of any code examples to use as a guide. The fact remains that if
user tasks aren't frozen then they can make system calls, and while
running in kernel mode they can acquire locks, which might cause
problems -- even though I can't identify any definite examples.
Because of these problems, it's too early to start trying to use kexec
to avoid the need for the freezer.
Of course, exactly the same possible problems exist when one tries to
remove the freezer from suspend-to-RAM. It has nothing to do with
kexec in particular (and certainly nothing to do with ACPI).
> having read through Documentation/power/devices.txt I remain convinced
> that you are making a fundamental mistake.
>
> you are designing a system
I'm not designing anything! _You_ are. I'm merely pointing out
problems in your design which you haven't considered.
> that will only work if everything (every
> driver, every state transition) participates fully in the process at all
> times. You started with the facts 'this is the info that ACPI provides
Look again; I wasn't talking about ACPI. You have mixed up the issues
in this email thread. (Not hard to do, since it has been a very long
and complicated thread.)
> and
> this is how it is designed to be used' and worked from there instead of
> looking to see what the kernel really needed and figuring how to provide a
> good interface for that that happens to be implemented (today) with ACPI.
> (a proper power management framework shouldn't care if you have ACPI, APM,
> or some other method of controlling the devices)
This and the rest of your email have no bearing on what I was talking
about, so I have snipped out the remainder.
Alan Stern
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707221116060.15224-100000@netrider.rowland.org>
@ 2007-07-22 16:27 ` Miklos Szeredi
2007-07-22 20:09 ` Alan Stern
2007-07-22 22:42 ` Nigel Cunningham
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
2 siblings, 1 reply; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-22 16:27 UTC (permalink / raw)
To: stern
Cc: david, miklos, nigel, linux-kernel, miltonm, ying.huang, linux-pm,
jbms
> The only thing to do is what Rafael has been working on: unfreeze
> things, hope the tasks sort themselves out, and try again.
Do we have any proof that this will untangle the freezing tasks in a
limited time? Or will it just make the problem harder to trigger?
Miklos
* Re: Re: Hibernation considerations
2007-07-22 16:27 ` Miklos Szeredi
@ 2007-07-22 20:09 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-22 20:09 UTC (permalink / raw)
To: Miklos Szeredi
Cc: david, nigel, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Sun, 22 Jul 2007, Miklos Szeredi wrote:
> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
>
> Do we have any proof that this will untangle the freezing tasks in a
> limited time? Or will it just make the problem harder to trigger?
Of course there's no proof. Just the opposite -- if things get hung up
the first time, they might get hung up the second time. And the
third...
But it ought to make the problem harder to trigger. For the present
that's a worthwhile improvement.
Alan Stern
* Re: Re: Hibernation considerations
2007-07-22 16:00 ` Alan Stern
@ 2007-07-22 21:50 ` david
2007-07-23 15:19 ` Alan Stern
0 siblings, 1 reply; 81+ messages in thread
From: david @ 2007-07-22 21:50 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Sun, 22 Jul 2007, Alan Stern wrote:
> On Sat, 21 Jul 2007 david@lang.hm wrote:
>
>> wait a min here, it's possible we are misunderstanding each other.
>
> I'd describe it as: You are misunderstanding me. :-)
very possibly :-)
>> as I see it.
>>
>> if userspace can acquire locks that prevent the kernel from shutting off
>> (or doing anything else in particular) then it's possible for misbehaving
>> userspace code to stop the kernel by simply choosing to never release the
>> lock.
>>
>> this would be a trivial DOS from userspace.
>
> You are confusing "userspace" with "user tasks". And not only that,
> you often use the term "userspace" when you should say "user mode".
>
> If you want I can explain the differences.
please do, I have been treating all three as the same category.
>> now, if you are talking instead about the fact that when userspace makes a
>> system call, the execution of that system call involves acquiring locks
>> that are released before the system call completes you have a very
>> different situation.
>
> That is exactly what I have been talking about. It may be different
> from what you _thought_, but it's not different from what I actually
> _said_.
Ok, I did misunderstand you. it sounds like all you need to do to make
sure that locks are not held is to allow system calls to return before
trying to do the suspend/kexec/etc. that sounds like not only a trivial
thing to do, but something that would probably be done anyway.
the exception is syscalls that call out to userspace tasks before they can
complete; those can cause deadlocks. (without that issue you can just wait
until all syscalls have returned, and not allow anything to issue new
syscalls.) is this the issue that's killing FUSE+suspend?
>> if you have locks that are held across system calls then you should
>> already have problems, because you can't count on userspace ever taking
>> whatever action is appropriate to release the lock.
>>
>> what am I missing that concerns you so much?
>
> Here's what you are missing:
>
> The new kexec approach eliminates the freezer and relies instead on the
> fact that none of the tasks in the original kernel can execute while
> the new kexec'd kernel is running. This means the new kernel can write
> out a memory image with no fear of interference or corruption.
correct
> But it also means that tasks which otherwise would have been frozen are
> actually free to run before the kexec call is made (and after the call
> returns, if the kexec'd kernel returns back to the original kernel).
> Any driver which was written with the assumption that tasks would be
> frozen at those times will need to be changed.
here is where you lose me.
why should jumping back to the original kernel immediately start running
these processes? the process of doing a kexec requires things to happen in
the drivers before normal activity can happen, so there is a phase in
there where the kernel being jumped to has drivers initializing, but still
does not allow anything else to run. why can't this phase be extended to
allow for the possibility of transitioning these drivers to a sleep mode
instead of to full operation?
> For example, drivers know that they have to quiesce their device in
> preparation for creating the memory snapshot. But they assume that no
> I/O requests will be made while the device is quiesced (because no user
> task is capable of generating an I/O request if they are all frozen),
> so the driver doesn't try to prevent such requests from reactivating
> the device.
>
> The situation as regards locking is harder to discuss since I don't
> know of any code examples to use as a guide. The fact remains that if
> user tasks aren't frozen then they can make system calls, and while
> running in kernel mode they can acquire locks, which might cause
> problems -- even though I can't identify any definite examples.
yes, if userspace is running jobs and submitting I/O and system calls
while drivers are trying to initialize there is a big problem, but I am
missing the reason this must be the case.
> Because of these problems, it's too early to start trying to use kexec
> to avoid the need for the freezer.
>
> Of course, exactly the same possible problems exist when one tries to
> remove the freezer from suspend-to-RAM. It has nothing to do with
> kexec in particular (and certainly nothing to do with ACPI).
the part of the freezer that everyone is trying to eliminate is the
exceptions (freeze everything except X,Y,Z because we will need to use
those later for A)
>> having read through Documentation/power/devices.txt I remain convinced
>> that you are making a fundamental mistake.
>>
>> you are designing a system
>
> I'm not designing anything! _You_ are. I'm merely pointing out
> problems in your design which you haven't considered.
a better way of phrasing what I meant goes more along the lines of 'the
current design of the system...'
>> that will only work if everything (every
>> driver, every state transition) participates fully in the process at all
>> times. You started with the facts 'this is the info that ACPI provides
>
> Look again; I wasn't talking about ACPI. You have mixed up the issues
> in this email thread. (Not hard to do, since it has been a very long
> and complicated thread.)
very possibly. there are so many different sub-threads that part of my
answer was to you, and part is addressing other things brought up during
the thread
>> and
>> this is how it is designed to be used' and worked from there instead of
>> looking to see what the kernel really needed and figuring how to provide a
>> good interface for that that happens to be implemented (today) with ACPI.
>> (a proper power management framework shouldn't care if you have ACPI, APM,
>> or some other method of controlling the devices)
>
> This and the rest of your email have no bearing on what I was talking
> about, so I have snipped out the remainder.
this was in reaction to reading the power/devices.txt. my first thought
was along the lines of "no wonder device driver authors don't implement
all this, it's obviously evolved from the needs of the people doing the
suspend, one call at a time" and from there I started thinking about what
would make sense to driver authors and provide the capability that's
needed for the job. I broke that off into a separate thread anyway.
David Lang
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707221608140.16031-100000@netrider.rowland.org>
@ 2007-07-22 21:54 ` david
0 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-22 21:54 UTC (permalink / raw)
To: Alan Stern
Cc: Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
On Sun, 22 Jul 2007, Alan Stern wrote:
> On Sun, 22 Jul 2007, Miklos Szeredi wrote:
>
>>> The only thing to do is what Rafael has been working on: unfreeze
>>> things, hope the tasks sort themselves out, and try again.
>>
>> Do we have any proof that this will untangle the freezing tasks in a
>> limited time? Or will it just make the problem harder to trigger?
>
> Of course there's no proof. Just the opposite -- if things get hung up
> the first time, they might get hung up the second time. And the
> third...
>
> But it ought to make the problem harder to trigger. For the present
> that's a worthwhile improvement.
it gives the system more tries to find a spot in time where the deadlock
doesn't happen, if you find one you can continue.
but even if things keep getting hung up, at least you are backing out of
each try safely and can eventually tell the user "I give up, try shutting
some things down and suspending again"
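
The back-out-and-retry behaviour described above could look roughly like
this. It is a sketch only: try_freeze_all and thaw_all are hypothetical
callables standing in for the real freezer, and the attempt limit of 3 is
an arbitrary choice for the example.

```python
def suspend_with_retries(try_freeze_all, thaw_all, max_attempts=3):
    """Attempt to freeze; on failure thaw everything and try again.

    Returns True as soon as one freeze attempt succeeds, False once
    max_attempts attempts have all failed.
    """
    for _ in range(max_attempts):
        if try_freeze_all():
            return True
        thaw_all()               # back out safely before the next attempt
    return False


# Toy freezer that gets hung up on the first two attempts, then succeeds.
outcomes = iter([False, False, True])
thaws = []
ok = suspend_with_retries(lambda: next(outcomes), lambda: thaws.append("thaw"))
if not ok:
    print("I give up, try shutting some things down and suspending again")
```

Nothing here guarantees success; it only bounds the number of attempts and
makes sure each failed attempt is undone before the next one.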
David Lang
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707221116060.15224-100000@netrider.rowland.org>
2007-07-22 16:27 ` Miklos Szeredi
@ 2007-07-22 22:42 ` Nigel Cunningham
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
2 siblings, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-22 22:42 UTC (permalink / raw)
To: Alan Stern
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
Hi Alan.
On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> On Sun, 22 Jul 2007, Nigel Cunningham wrote:
>
> > Hi.
> >
> > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > It seems that you could still potentially get a failure to freeze if one
> > > FUSE process depends on another, and the one that is frozen second just
> > > happens to be waiting on the one that is frozen first when it is frozen.
> > > I admit that this situation is unlikely, and perhaps acceptable.
> > >
> > > A larger concern is that it seems that freezing FUSE processes at all
> > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > filesystem is loopback mounted from a FUSE filesystem. In that case, if
> > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > get a deadlock.
> >
> > Ok. So then (in response to Alan too), how about keeping a tree of mounts,
> > akin to the device tree, and working from the deepest nodes up? (In
> > conjunction with what I already suggested)?
>
> Face it, Nigel, this is a losing battle. You can try to come up with
> ever-more complex schemes to try and force FUSE into the freezer's
> framework, but it just won't fit. Or if it does, the next filesystem
> to come along will require an even more baroque type of special-case
> handling.
It does seem to be a losing battle, but I'm wondering whether that's really
because it's an intractable problem, or because people have given up on it
before its time. We are talking about a computer system, so things should be
predictable.
> The general problem is that task A may be in an unfreezable state,
> waiting for task B to do something, while task B is already frozen.
> Since there's no reasonable way to determine that A really is waiting
> for B, you're just stuck. (To make matters worse, A may not even
> realize which task it is waiting for; it may know only that it's
> waiting for somebody to do something!) A and B could be user tasks,
> kernel threads, or one of each.
I guess I want to persist because all of these issues aren't utterly
unsolvable. It's just that we don't have the infrastructure yet to figure out
the solutions to these issues trivially. Take, for example, the locking
issue. If we could call some function to say "What process holds this lock?",
then task A could know that it's waiting on task B and put that information
somewhere. We could then use the information to freeze task B before task A.
> The only thing to do is what Rafael has been working on: unfreeze
> things, hope the tasks sort themselves out, and try again.
That's what I'm questioning. Is there a more reliable way and we've just given
up too quickly?
Regards,
Nigel
--
Nigel Cunningham
Christian Reformed Church of Cobden
103 Curdie Street, Cobden 3266, Victoria, Australia
Ph. +61 3 5595 1185 / +61 417 100 574
Communal Worship: 11 am Sunday.
* Re: Re: Hibernation considerations
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
@ 2007-07-22 23:09 ` Rafael J. Wysocki
[not found] ` <200707230109.23071.rjw@sisk.pl>
` (4 subsequent siblings)
5 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-22 23:09 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
Hi,
On Monday, 23 July 2007 00:42, Nigel Cunningham wrote:
> Hi Alan.
>
> On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> > On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> >
> > > Hi.
> > >
> > > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > > It seems that you could still potentially get a failure to freeze if one
> > > > FUSE process depends on another, and the one that is frozen second just
> > > > happens to be waiting on the one that is frozen first when it is frozen.
> > > > I admit that this situation is unlikely, and perhaps acceptable.
> > > >
> > > > A larger concern is that it seems that freezing FUSE processes at all
> > > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > > filesystem is loopback mounted from a FUSE filesystem. In that case, if
> > > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > > get a deadlock.
> > >
> > > Ok. So then (in response to Alan too), how about keeping a tree of mounts,
> > > akin to the device tree, and working from the deepest nodes up? (In
> > > conjunction with what I already suggested)?
> >
> > Face it, Nigel, this is a losing battle. You can try to come up with
> > ever-more complex schemes to try and force FUSE into the freezer's
> > framework, but it just won't fit. Or if it does, the next filesystem
> > to come along will require an even more baroque type of special-case
> > handling.
>
> It does seem to be a losing battle, but I'm wondering whether that's really
> because it's an intractable problem, or because people have given up on it
> before its time. We are talking about a computer system, so things should be
> predictable.
>
> > The general problem is that task A may be in an unfreezable state,
> > waiting for task B to do something, while task B is already frozen.
> > Since there's no reasonable way to determine that A really is waiting
> > for B, you're just stuck. (To make matters worse, A may not even
> > realize which task it is waiting for; it may know only that it's
> > waiting for somebody to do something!) A and B could be user tasks,
> > kernel threads, or one of each.
>
> I guess I want to persist because all of these issues aren't utterly
> unsolvable. It's just that we don't have the infrastructure yet to figure out
> the solutions to these issues trivially. Take, for example, the locking
> issue. If we could call some function to say "What process holds this lock?",
> then task A could know that it's waiting on task B and put that information
> somewhere. We could then use the information to freeze task B before task A.
>
>
> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
>
> That's what I'm questioning. Is there a more reliable way and we've just given
> up too quickly?
Well, there probably is one, but it likely would require us to make changes
that wouldn't be accepted by some people and thus would never be merged.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] ` <200707230109.23071.rjw@sisk.pl>
@ 2007-07-22 23:18 ` Nigel Cunningham
0 siblings, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-22 23:18 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
On Monday 23 July 2007 09:09:21 Rafael J. Wysocki wrote:
> Hi,
>
> On Monday, 23 July 2007 00:42, Nigel Cunningham wrote:
> > Hi Alan.
> >
> > On Monday 23 July 2007 01:26:23 Alan Stern wrote:
> > > On Sun, 22 Jul 2007, Nigel Cunningham wrote:
> > >
> > > > Hi.
> > > >
> > > > On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
> > > > > It seems that you could still potentially get a failure to freeze if one
> > > > > FUSE process depends on another, and the one that is frozen second just
> > > > > happens to be waiting on the one that is frozen first when it is frozen.
> > > > > I admit that this situation is unlikely, and perhaps acceptable.
> > > > >
> > > > > A larger concern is that it seems that freezing FUSE processes at all
> > > > > _will_ generate deadlocks if a non-synchronous or memory-map-supporting
> > > > > filesystem is loopback mounted from a FUSE filesystem. In that case, if
> > > > > you attempt to sync or free memory once FUSE is frozen, you are sure to
> > > > > get a deadlock.
> > > >
> > > > Ok. So then (in response to Alan too), how about keeping a tree of mounts,
> > > > akin to the device tree, and working from the deepest nodes up? (In
> > > > conjunction with what I already suggested)?
> > >
> > > Face it, Nigel, this is a losing battle. You can try to come up with
> > > ever-more complex schemes to try and force FUSE into the freezer's
> > > framework, but it just won't fit. Or if it does, the next filesystem
> > > to come along will require an even more baroque type of special-case
> > > handling.
> >
> > It does seem to be a losing battle, but I'm wondering whether that's really
> > because it's an intractable problem, or because people have given up on it
> > before its time. We are talking about a computer system, so things should be
> > predictable.
> >
> > > The general problem is that task A may be in an unfreezable state,
> > > waiting for task B to do something, while task B is already frozen.
> > > Since there's no reasonable way to determine that A really is waiting
> > > for B, you're just stuck. (To make matters worse, A may not even
> > > realize which task it is waiting for; it may know only that it's
> > > waiting for somebody to do something!) A and B could be user tasks,
> > > kernel threads, or one of each.
> >
> > I guess I want to persist because all of these issues aren't utterly
> > unsolvable. It's just that we don't have the infrastructure yet to figure out
> > the solutions to these issues trivially. Take, for example, the locking
> > issue. If we could call some function to say "What process holds this lock?",
> > then task A could know that it's waiting on task B and put that information
> > somewhere. We could then use the information to freeze task B before task A.
> >
> >
> > > The only thing to do is what Rafael has been working on: unfreeze
> > > things, hope the tasks sort themselves out, and try again.
> >
> > That's what I'm questioning. Is there a more reliable way and we've just given
> > up too quickly?
>
> Well, there probably is one, but it likely would require us to make changes
> that wouldn't be accepted by some people and thus would never be merged.
Well, doesn't that imply that we should at least look into what changes would
be needed? If they wouldn't be accepted by some people, then either the
objections would be reasonable or they wouldn't (and would hopefully be
overridden). But we can't know if we don't try.
Regards,
Nigel
--
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.
* Re: Re: Hibernation considerations
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
2007-07-22 23:09 ` Rafael J. Wysocki
[not found] ` <200707230109.23071.rjw@sisk.pl>
@ 2007-07-23 0:04 ` Paul Mackerras
[not found] ` <18083.61595.217126.824924@cargo.ozlabs.ibm.com>
` (2 subsequent siblings)
5 siblings, 0 replies; 81+ messages in thread
From: Paul Mackerras @ 2007-07-23 0:04 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
Nigel Cunningham writes:
> I guess I want to persist because all of these issues aren't utterly
> unsolvable. It's just that we don't have the infrastructure yet to
> figure out the solutions to these issues trivially. Take, for example,
Ever heard of the halting problem? :) It's not just a matter of
infrastructure. You very quickly get into questions that are
mathematically undecidable.
> the locking issue. If we could call some function to say "What process
> holds this lock?", then task A could know that it's waiting on task B
> and put that information somewhere. We could then use the information
> to freeze task B before task A.
But how would that help? If task B holds the lock, then we can't
freeze it until it's released the lock. Then the question is, what
does task B need in order to get to the point where it releases the
lock? And so on. It rapidly gets not just extremely messy, but
actually impossible to compute in general.
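
To make the exchange concrete, here is a sketch of what the proposed "what
process holds this lock?" bookkeeping could report. Both tables
(waiting_for and holder_of) are hypothetical; the kernel does not maintain
them in this form. Note what the sketch does not do: it only follows lock
ownership, and says nothing about what the last task in the chain needs in
order to make progress, which is the objection raised above.

```python
def wait_chain(task, holder_of, waiting_for):
    """Follow 'task waits for lock, lock is held by task' links.

    waiting_for: task -> lock it is blocked on (hypothetical bookkeeping)
    holder_of:   lock -> task currently holding it
    Returns (chain, cycle): the tasks visited in order, and whether the
    chain loops back on a task already seen (a lock deadlock).
    """
    chain = [task]
    seen = {task}
    while True:
        lock = waiting_for.get(chain[-1])
        if lock is None or lock not in holder_of:
            return chain, False      # end of the known chain
        holder = holder_of[lock]
        if holder in seen:
            return chain, True       # looped back: deadlock on locks
        chain.append(holder)
        seen.add(holder)


waiting_for = {"A": "lock1", "B": "lock2"}
holder_of = {"lock1": "B", "lock2": "C"}
chain, cycle = wait_chain("A", holder_of, waiting_for)
```

Here the chain ends at task C with no cycle, but the tables cannot say
whether C is runnable, waiting for I/O, or waiting on a task that is
already frozen.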
Paul.
* Re: Re: Hibernation considerations
[not found] ` <18083.61595.217126.824924@cargo.ozlabs.ibm.com>
@ 2007-07-23 3:11 ` Nigel Cunningham
0 siblings, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-23 3:11 UTC (permalink / raw)
To: Paul Mackerras
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
Hi.
On Monday 23 July 2007 10:04:43 Paul Mackerras wrote:
> Nigel Cunningham writes:
>
> > I guess I want to persist because all of these issues aren't utterly
> > unsolvable. It's just that we don't have the infrastructure yet to
> > figure out the solutions to these issues trivially. Take, for example,
>
> Ever heard of the halting problem? :) It's not just a matter of
> infrastructure. You very quickly get into questions that are
> mathematically undecidable.
Is this the halting problem, though?
> > the locking issue. If we could call some function to say "What process
> > holds this lock?", then task A could know that it's waiting on task B
> > and put that information somewhere. We could then use the information
> > to freeze task B before task A.
>
> But how would that help? If task B holds the lock, then we can't
> freeze it until it's released the lock. Then the question is, what
> does task B need in order to get to the point where it releases the
> lock? And so on. It rapidly gets not just extremely messy, but
> actually impossible to compute in general.
Take a step back for a second.
The problem we're facing now is that we're getting some userspace threads,
used in processing I/O, that are functioning as exceptions to the "freeze
userspace, then freezeable kernel threads" rule. They are only exceptions
because of that role in processing I/O - because they're de facto kernel
threads. So, if we orient our thinking more in terms of I/O processing and
less in terms of the userspace/kernelspace distinction, we'll have a
solution:
1) Freeze processes that aren't fs related (ie stop them generating I/O).
2) Flush pending I/O.
3) Freeze filesystems in reverse order of dependency, the primary purpose
being to stop them generating further I/O on their metadata.
Locks that are being held are only being held because work is being done. If
we progressively focus on threads in terms of their create/process work
dependencies, we'll see that the problem isn't at all intractable.
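The ordering in step 3 can be sketched as a topological sort. This is only an illustration, not kernel code: the mount names and the depends_on mapping are hypothetical, and it assumes the dependencies between filesystems are already known, which is exactly the part under dispute in this thread. It uses graphlib from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

def freeze_order(depends_on):
    """Return filesystems ordered so that each one is frozen before
    anything it depends on (dependents first).

    depends_on maps a filesystem name to the set of filesystems whose
    I/O it needs in order to make progress.
    """
    # static_order() yields dependencies before dependents; reverse it.
    order = list(TopologicalSorter(depends_on).static_order())
    order.reverse()
    return order

# Hypothetical stack: an ext3 image loop-mounted from a file on a FUSE
# filesystem, whose daemon keeps its data on the root filesystem.
mounts = {
    "loop-ext3": {"fuse-fs"},
    "fuse-fs": {"rootfs"},
    "rootfs": set(),
}

print(freeze_order(mounts))

# A cyclic dependency has no valid freeze order at all.
try:
    freeze_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("no valid freeze order")
```

The sketch only works when the dependency graph is complete and acyclic. If any process can end up serving a filesystem, the graph is neither known in advance nor guaranteed to be free of cycles.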
Regards,
Nigel
--
See http://www.tuxonice.net for Howtos, FAQs, mailing
lists, wiki and bugzilla info.
* Re: Re: Hibernation considerations
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
` (3 preceding siblings ...)
[not found] ` <18083.61595.217126.824924@cargo.ozlabs.ibm.com>
@ 2007-07-23 5:31 ` david
2007-07-23 10:24 ` Miklos Szeredi
5 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-23 5:31 UTC (permalink / raw)
To: Nigel Cunningham
Cc: Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
On Mon, 23 Jul 2007, Nigel Cunningham wrote:
> Hi Alan.
>
> On Monday 23 July 2007 01:26:23 Alan Stern wrote:
>> On Sun, 22 Jul 2007, Nigel Cunningham wrote:
>>
>>> Hi.
>>>
>>> On Sunday 22 July 2007 02:13:56 Jeremy Maitin-Shepard wrote:
>>>> It seems that you could still potentially get a failure to freeze if one
>>>> FUSE process depends on another, and the one that is frozen second just
>>>> happens to be waiting on the one that is frozen first when it is frozen.
>>>> I admit that this situation is unlikely, and perhaps acceptable.
>>>>
>>>> A larger concern is that it seems that freezing FUSE processes at all
>>>> _will_ generate deadlocks if a non-synchronous or memory-map-supporting
>>>> filesystem is loopback mounted from a FUSE filesystem. In that case, if
>>>> you attempt to sync or free memory once FUSE is frozen, you are sure to
>>>> get a deadlock.
>>>
>>> Ok. So then (in response to Alan too), how about keeping a tree of mounts,
>>> akin to the device tree, and working from the deepest nodes up? (In
>>> conjunction with what I already suggested)?
>>
>> Face it, Nigel, this is a losing battle. You can try to come up with
>> ever-more complex schemes to try and force FUSE into the freezer's
>> framework, but it just won't fit. Or if it does, the next filesystem
>> to come along will require an even more baroque type of special-case
>> handling.
>
> It does seem to be a losing battle, but I'm wondering whether that's really
> because it's an intractable problem, or because people have given up on it
> before its time. We are talking about a computer system, so things should be
> predictable.
>
>> The general problem is that task A may be in an unfreezable state,
>> waiting for task B to do something, while task B is already frozen.
>> Since there's no reasonable way to determine that A really is waiting
>> for B, you're just stuck. (To make matters worse, A may not even
>> realize which task it is waiting for; it may know only that it's
>> waiting for somebody to do something!) A and B could be user tasks,
>> kernel threads, or one of each.
>
> I guess I want to persist because all of these issues aren't utterly
> unsolvable. It's just that we don't have the infrastructure yet to figure out
> the solutions to these issues trivially. Take, for example, the locking
> issue. If we could call some function to say "What process holds this lock?",
> then task A could know that it's waiting on task B and put that information
> somewhere. We could then use the information to freeze task B before task A.
>
this sounds like the standard priority inversion problem taken to
extremes. Ingo has been working on this issue, but IIRC the problem is that
tracking what owns the lock, so that you can get the owner to run, adds
enough overhead that it's not acceptable in the general case.
David Lang
>> The only thing to do is what Rafael has been working on: unfreeze
>> things, hope the tasks sort themselves out, and try again.
>
> That's what I'm questioning. Is there a more reliable way and we've just given
> up too quickly?
>
> Regards,
>
> Nigel
>
* Re: Re: Hibernation considerations
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
` (4 preceding siblings ...)
2007-07-23 5:31 ` david
@ 2007-07-23 10:24 ` Miklos Szeredi
2007-07-23 12:08 ` Rafael J. Wysocki
5 siblings, 1 reply; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-23 10:24 UTC (permalink / raw)
To: nigel
Cc: david, miklos, nigel, linux-kernel, miltonm, ying.huang, linux-pm,
jbms
> > The only thing to do is what Rafael has been working on: unfreeze
> > things, hope the tasks sort themselves out, and try again.
>
> That's what I'm questioning. Is there a more reliable way and we've
> just given up too quickly?
There obviously _are_ more reliable ways. A trivial one seems to be
to just not require user tasks to finish syscalls.
Yeah, stopping user processes outside the kernel is convenient, but
there's no fundamental reason why it is the only place where those
tasks can be stopped.
And there are very fundamental reasons to _not_ require this. Not
just in the fuse case, but in any case where a syscall requires
another user task to run before it can be finished (e.g. NFS over
OpenVPN).
Miklos
* Re: Re: Hibernation considerations
2007-07-23 10:24 ` Miklos Szeredi
@ 2007-07-23 12:08 ` Rafael J. Wysocki
2007-07-23 12:14 ` Miklos Szeredi
[not found] ` <E1ICwoE-0004q8-00@dorka.pomaz.szeredi.hu>
0 siblings, 2 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 12:08 UTC (permalink / raw)
To: Miklos Szeredi
Cc: david, nigel, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > The only thing to do is what Rafael has been working on: unfreeze
> > > things, hope the tasks sort themselves out, and try again.
> >
> > That's what I'm questioning. Is there a more reliable way and we've
> > just given up too quickly?
>
> There obviously _are_ more reliable ways. A trivial one seems to be
> to just not require user tasks to finish syscalls.
>
> Yeah, stopping user processes outside the kernel is convenient, but
> there's no fundamental reason why it is the only place where those
> tasks can be stopped.
The reason is that we want them to "park" in safe places, ie. where there
are no locks held etc. Thus, these safe places need to be chosen somehow
and since they are not marked throughout the code, we choose the obvious
one. :-)
> And there are very fundamental reasons to _not_ require this. Not
> just in the fuse case, but in any case where a syscall requires
> another user task to run before it can be finished (e.g. NFS over
> OpenVPN).
Yeah. Mark the safe places for us and we'll use them.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
2007-07-23 12:08 ` Rafael J. Wysocki
@ 2007-07-23 12:14 ` Miklos Szeredi
[not found] ` <E1ICwoE-0004q8-00@dorka.pomaz.szeredi.hu>
1 sibling, 0 replies; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-23 12:14 UTC (permalink / raw)
To: rjw; +Cc: david, miklos, nigel, linux-kernel, miltonm, ying.huang, linux-pm,
jbms
> On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > > The only thing to do is what Rafael has been working on: unfreeze
> > > > things, hope the tasks sort themselves out, and try again.
> > >
> > > That's what I'm questioning. Is there a more reliable way and we've
> > > just given up too quickly?
> >
> > There obviously _are_ more reliable ways. A trivial one seems to be
> > to just not require user tasks to finish syscalls.
> >
> > Yeah, stopping user processes outside the kernel is convenient, but
> > there's no fundamental reason why it is the only place where those
> > tasks can be stopped.
>
> The reason is that we want them to "park" in safe places, ie. where there
> are no locks held etc. Thus, these safe places need to be chosen somehow
> and since they are not marked throughout the code, we choose the obvious
> one. :-)
Why shouldn't locks be held?
No locks which are required for suspend must be held, sure. But
otherwise holding locks doesn't matter at all.
And I'm not saying that is trivial to do, but it might not be too hard
either.
Rafael, can you please tell me what happened to that patch that did not
wait for tasks in uninterruptible sleep to be frozen?
That seemed like a magnificent approach compared to anything that has
been proposed since.
Miklos
* Re: Re: Hibernation considerations
[not found] ` <E1ICwoE-0004q8-00@dorka.pomaz.szeredi.hu>
@ 2007-07-23 12:27 ` Rafael J. Wysocki
2007-07-23 12:31 ` Oliver Neukum
[not found] ` <200707231431.30372.oliver@neukum.org>
2 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 12:27 UTC (permalink / raw)
To: Miklos Szeredi
Cc: david, nigel, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Monday, 23 July 2007 14:14, Miklos Szeredi wrote:
> > On Monday, 23 July 2007 12:24, Miklos Szeredi wrote:
> > > > > The only thing to do is what Rafael has been working on: unfreeze
> > > > > things, hope the tasks sort themselves out, and try again.
> > > >
> > > > That's what I'm questioning. Is there a more reliable way and we've
> > > > just given up too quickly?
> > >
> > > There obviously _are_ more reliable ways. A trivial one seems to be
> > > to just not require user tasks to finish syscalls.
> > >
> > > Yeah, stopping user processes outside the kernel is convenient, but
> > > there's no fundamental reason why it is the only place where those
> > > tasks can be stopped.
> >
> > The reason is that we want them to "park" in safe places, ie. where there
> > are no locks held etc. Thus, these safe places need to be chosen somehow
> > and since they are not marked throughout the code, we choose the obvious
> > one. :-)
>
> Why shouldn't locks be held?
>
> No locks which are required for suspend must be held, sure. But
> otherwise holding locks doesn't matter at all.
>
> And I'm not saying that is trivial to do, but it might not be too hard
> either.
>
> Rafael, can you please tell me what happened to that patch that did not
> wait for tasks in uninterruptible sleep to be frozen?
>
> That seemed like a magnificent approach compared to anything that has
> been proposed since.
Well, the freezer has failed to freeze tasks a couple of times in my
test setup and I've had a couple of hangs.
I have an idea how to improve it, but that still requires some pending freezer
patches to go first.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] ` <E1ICwoE-0004q8-00@dorka.pomaz.szeredi.hu>
2007-07-23 12:27 ` Rafael J. Wysocki
@ 2007-07-23 12:31 ` Oliver Neukum
[not found] ` <200707231431.30372.oliver@neukum.org>
2 siblings, 0 replies; 81+ messages in thread
From: Oliver Neukum @ 2007-07-23 12:31 UTC (permalink / raw)
To: Miklos Szeredi
Cc: david, nigel, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Monday, 23 July 2007, Miklos Szeredi wrote:
> > The reason is that we want them to "park" in safe places, ie. where there
> > are no locks held etc. Thus, these safe places need to be chosen somehow
> > and since they are not marked throughout the code, we choose the obvious
> > one. :-)
>
> Why shouldn't locks be held?
>
> No locks which are required for suspend must be held, sure. But
> otherwise holding locks doesn't matter at all.
If you can provide a way to tell them apart, this would work.
Regards
Oliver
* Re: Re: Hibernation considerations
[not found] ` <200707231431.30372.oliver@neukum.org>
@ 2007-07-23 13:08 ` Miklos Szeredi
[not found] ` <200707231601.09541.rjw@sisk.pl>
2007-07-23 14:01 ` Rafael J. Wysocki
2007-07-23 19:08 ` david
1 sibling, 2 replies; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-23 13:08 UTC (permalink / raw)
To: oliver
Cc: david, miklos, nigel, linux-kernel, miltonm, ying.huang, linux-pm,
jbms
> > > The reason is that we want them to "park" in safe places, ie. where there
> > > are no locks held etc. Thus, these safe places need to be chosen somehow
> > > and since they are not marked throughout the code, we choose the obvious
> > > one. :-)
> >
> > Why shouldn't locks be held?
> >
> > No locks which are required for suspend must be held, sure. But
> > otherwise holding locks doesn't matter at all.
>
> If you can provide a way to tell them apart, this would work.
Without some marking we can't tell obviously.
Are there many such locks? We can easily check by adding some
debugging code to the lock primitives, to make them yell if they are
used during suspend.
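A rough userspace model of that debugging idea, for illustration only: SuspendTracker and DebugLock are hypothetical names, not kernel interfaces, and "yelling" is reduced to recording the lock's name in a list.

```python
import threading

class SuspendTracker:
    """Records which locks are taken while a suspend is in progress."""
    def __init__(self):
        self.suspending = threading.Event()  # set for the duration of suspend
        self.offenders = []                  # names of locks used meanwhile

class DebugLock:
    """A lock that reports itself when acquired during suspend."""
    def __init__(self, name, tracker):
        self.name = name
        self.tracker = tracker
        self._lock = threading.Lock()

    def __enter__(self):
        if self.tracker.suspending.is_set():
            # The "yell": note that this lock is needed by the suspend path.
            self.tracker.offenders.append(self.name)
        self._lock.acquire()
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False

def demo():
    tracker = SuspendTracker()
    fs_lock = DebugLock("fs_lock", tracker)
    driver_lock = DebugLock("driver_lock", tracker)
    with fs_lock:              # normal operation: not recorded
        pass
    tracker.suspending.set()   # suspend begins
    with driver_lock:          # taken by the suspend path: recorded
        pass
    return tracker.offenders

print(demo())
```

As with any runtime instrumentation, this only reports the locks that happen to be exercised on the machines where it runs.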
Miklos
* Re: Re: Hibernation considerations
[not found] ` <200707231601.09541.rjw@sisk.pl>
@ 2007-07-23 14:01 ` Miklos Szeredi
0 siblings, 0 replies; 81+ messages in thread
From: Miklos Szeredi @ 2007-07-23 14:01 UTC (permalink / raw)
To: rjw; +Cc: david, miklos, nigel, linux-kernel, miltonm, ying.huang, linux-pm,
jbms
> Alan has recently proposed introducing "suspend locks", to be acquired during
> a suspend/hibernation, so that we can leave alone uninterruptible tasks that
> don't hold any of them.
Sounds sane. A global rwsem could be acquired for read by drivers,
and for write by suspend/hibernate. Just need to add it to all
drivers that have PM, but that shouldn't need a heroic effort.
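A minimal userspace model of that scheme, as a sketch only. The method names mirror the kernel's rwsem calls (down_read, up_read, down_write, up_write), but SuspendRWSem itself is a hypothetical class built on threading.Condition, and it makes no attempt at writer fairness.

```python
import threading

class SuspendRWSem:
    """Many readers (drivers doing I/O) or one writer (suspend)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cond:
            while self._writer:            # suspend in progress: wait
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # let a waiting writer in

    def down_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()          # wait for drivers to drain
            self._writer = True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

def demo():
    sem = SuspendRWSem()
    events = []
    sem.down_read()                        # a driver starts an operation

    def suspend():
        sem.down_write()                   # blocks until the driver is done
        events.append("suspended")
        sem.up_write()

    t = threading.Thread(target=suspend)
    t.start()
    events.append("driver I/O done")
    sem.up_read()                          # only now can suspend proceed
    t.join()
    return events

print(demo())
```

The suspend path cannot take the semaphore for write until every driver has dropped its read side, so a task that holds no read side can be left where it is.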
Miklos
* Re: Re: Hibernation considerations
2007-07-23 13:08 ` Miklos Szeredi
[not found] ` <200707231601.09541.rjw@sisk.pl>
@ 2007-07-23 14:01 ` Rafael J. Wysocki
1 sibling, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 14:01 UTC (permalink / raw)
To: Miklos Szeredi, stern
Cc: david, nigel, linux-kernel, miltonm, ying.huang, linux-pm, jbms
On Monday, 23 July 2007 15:08, Miklos Szeredi wrote:
> > > > The reason is that we want them to "park" in safe places, ie. where there
> > > > are no locks held etc. Thus, these safe places need to be chosen somehow
> > > > and since they are not marked throughout the code, we choose the obvious
> > > > one. :-)
> > >
> > > Why shouldn't locks be held?
> > >
> > > No locks which are required for suspend must be held, sure. But
> > > otherwise holding locks doesn't matter at all.
> >
> > If you can provide a way to tell them apart, this would work.
>
> Without some marking we can't tell obviously.
>
> Are there many such locks? We can easily check by adding some
> debugging code to the lock primitives, to make them yell if they are
> used during suspend.
This way we can only obtain information from systems that use hibernation
quite often.
Alan has recently proposed introducing "suspend locks", to be acquired during
a suspend/hibernation, so that we can leave alone uninterruptible tasks that
don't hold any of them.
Unfortunately, I have no link to his original message at hand.
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
* Re: Re: Hibernation considerations
[not found] ` <Pine.LNX.4.44L0.0707201820050.5241-100000@iolanthe.rowland.org>
@ 2007-07-23 14:23 ` Oliver Neukum
2007-08-01 9:34 ` Pavel Machek
[not found] ` <20070801093437.GC4808@ucw.cz>
2 siblings, 0 replies; 81+ messages in thread
From: Oliver Neukum @ 2007-07-23 14:23 UTC (permalink / raw)
To: Alan Stern
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
On Saturday, 21 July 2007, Alan Stern wrote:
> On Fri, 20 Jul 2007, Oliver Neukum wrote:
>
> > > We already have a pre-suspend notification available for drivers that
> > > need to allocate large amounts of memory.
> >
> > Is that facility fine grained enough?
>
> It's a notifier chain that gets called at several points during the
> suspend transition. One of those points is right at the start, while
> userspace is still running and reasonably large amounts of memory can
> be allocated.
>
> Is it fine-grained enough? I don't know -- hard to tell, since nothing
> much is using it yet.
>
> > > You are correct about the need to delay/stop device addition. I don't
> > > know how this can be done in general; each code path calling
> > > device_add() may have to be treated individually.
> >
> > What about the old API?
>
> What old API do you mean?
The find_device() stuff.
> > Do we have to block module loading?
>
> No. Registering new drivers is okay, registering new devices is bad.
What if it is a driver for virtual devices that don't need probe()
for actual hardware?
> Of course, some modules do want to register a new device in their init
> method. I don't know what we should do about them. Force the
> registration to fail, I suppose. How often will people suspend while a
> module is loading?
>
> > What happens if a scsi error handler is woken? If it cannot be woken,
> > how are errors handled?
>
> Why should the error handler wake up? There isn't supposed to be any
> I/O going on, hence no errors to handle.
What about shared busses? Firewire, FibreChannel? They can get external
resets, etc ...
Regards
Oliver
* Re: Re: Hibernation considerations
2007-07-22 21:50 ` david
@ 2007-07-23 15:19 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-23 15:19 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Sun, 22 Jul 2007 david@lang.hm wrote:
> > You are confusing "userspace" with "user tasks". And not only that,
> > you often use the term "userspace" when you should say "user mode".
> >
> > If you want I can explain the differences.
>
> please do, I have been treating all three as the same category.
Very briefly then: "User mode" and "kernel mode" refer to the CPU's
hardware privilege level. A process makes the transition from user
mode to kernel mode by executing a system call. Interrupt and
exception handlers also run in kernel mode, but they generally are not
considered to be part of any process. The reverse transition occurs
when a process returns from a system call, or when an interrupt which
occurred while the CPU was in user mode completes. (It's interesting
to note that system calls are somewhat similar to interrupts; in fact
sometimes they are implemented by a "software interrupt".)
"Kernel threads" are processes that run entirely in kernel mode. They
usually don't have a memory mapping for any user-owned memory and they
never go into user mode. All other processes are "user threads".
"Userspace" is a rather general term referring to things not in the
kernel. It comprises both user tasks (while running in user mode) and
user memory.
> Ok, I did misunderstand you. it sounds like all you need to do to make
> sure that locks are not held is to allow system calls to return before
> trying to do the suspend/kexec/etc. that sounds like not only a trivial
> thing to do, but something that would probably be done anyway.
If you could actually do it, it would work. But you can't do it. If
it were feasible, the freezer would have used that approach in the
first place.
For one thing, checking for a suspend-in-progress at the beginning of
each and every system call would add overhead to a hot path in the
kernel, one which is already very heavily optimized. People wouldn't
stand for it.
> although syscalls that then call out to userspace tasks before they can
> complete cause potential deadlocks (without that issue you can just wait
> until all syscalls have returned, and not allow anything to issue new
> syscalls) is this the issue that's killing FUSE+suspend?
You get similar problems from system calls that wait in kernel mode
until something has happened. For example, a read() call for the
console device will wait until somebody types on the keyboard. At any
point in time, many (or even most) user threads are blocked in a system
call.
> > Here's what you are missing:
> >
> > The new kexec approach eliminates the freezer and relies instead on the
> > fact that none of the tasks in the original kernel can execute while
> > the new kexec'd kernel is running. This means the new kernel can write
> > out a memory image with no fear of interference or corruption.
>
> correct
>
> > But it also means that tasks which otherwise would have been frozen are
> > actually free to run before the kexec call is made (and after the call
> > returns, if the kexec'd kernel returns back to the original kernel).
> > Any driver which was written with the assumption that tasks would be
> > frozen at those times will need to be changed.
>
> here is where you lose me.
>
> why should jumping back to the original kernel immediately start running
> these processes?
Let's let kernel K1 be the original kernel, the one which is going into
hibernation. Kernel K2 is the one started by kexec to write out the
memory image.
Your question becomes: Why should K2 jumping back to K1 cause K1
immediately to start running user tasks? Answer: Because K1 has been
running user tasks all along (except while K2 was active) and nothing
has told it to stop. In fact, about the only things which _can_ cause
K1 to stop running user threads are the freezer (which you want to
eliminate) and disabling interrupts (not possible since some drivers
require interrupts to be enabled when putting devices in low-power
mode).
> the process of doing a kexec requires things to happen in
> the drivers before normal activity can happen, so there is a phase in
> there where the kernel being jumped to has drivers initializing, but still
> does not allow anything else to run.
So when K2 starts up, it will have a phase in which user threads don't
run. That doesn't affect K1. When K2 returns to K1, K1 does not go
through this sort of phase. It simply picks up from where it left off.
> why can't this phase be extended to
> allow for the possibility of transitioning these drivers to a sleep mode
> instead of to full operation?
Indeed, Rafael has suggested that K2 be responsible for putting devices
in low-power mode. This has the disadvantage of requiring K2 to
include drivers for every device used by K1, but otherwise it would
work.
However there still remains the problem of user tasks running after
devices are supposed to be quiescent and before K1 starts. There's
currently nothing to stop such tasks from making I/O requests and
thereby causing a quiescent device to become active again.
> > The situation as regards locking is harder to discuss since I don't
> > know of any code examples to use as a guide. The fact remains that if
> > user tasks aren't frozen then they can make system calls, and while
> > running in kernel mode they can acquire locks, which might cause
> > problems -- even though I can't identify any definite examples.
>
> yes, if userspace is running jobs and submitting I/O and system calls
> while drivers are trying to initialize there is a big problem, but I am
> missing the reason this must be the case.
We aren't talking about drivers initializing devices. We are talking
about what happens during the time when drivers are trying to quiesce
devices (i.e., before K1 has started up K2) or power them down (after
K2 has returned to K1).
> the part of the freezer that everyone is trying to eliminate is the
> exceptions (freeze everything except X,Y,Z because we will need to use
> those later for A)
Wrong. People are trying to eliminate the freezer entirely. Go back
and reread some of the postings at the beginning of this long thread,
especially those from Paul Mackerras and Ben Herrenschmidt.
Alan Stern
* Re: Re: Hibernation considerations
[not found] <200707231311.56398.nigel@nigel.suspend2.net>
@ 2007-07-23 15:23 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-23 15:23 UTC (permalink / raw)
To: nigel
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
On Mon, 23 Jul 2007, Nigel Cunningham wrote:
> Take a step back for a second.
>
> The problem we're facing now is that we're getting some userspace threads,
> used in processing I/O, that are functioning as exceptions to the "freeze
> userspace, then freezeable kernel threads" rule. They are only exceptions
> because of that role in processing I/O - because they're de facto kernel
> threads. So, if we orient our thinking more in terms of I/O processing and
> less in terms of the userspace/kernelspace distinction, we'll have a
> solution:
>
> 1) Freeze processes that aren't fs related (ie stop them generating I/O).
The problem here is that with things like FUSE, _every_ process is
potentially fs related. Nothing prevents a FUSE thread from doing IPC
with any other thread.
> 2) Flush pending I/O.
> 3) Freeze filesystems in reverse order of dependency, the primary purpose
> being to stop them generating further I/O on their metadata.
>
> Locks that are being held are only being held because work is being done. If
> we progressively focus on threads in terms of their create/process work
> dependencies, we'll see that the problem isn't at all intractable.
As has been mentioned before, keeping track of all that dependency
information would be very fragile and time-consuming.
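To make the dependency problem concrete, here is a toy model of it. The task names and the waits_for mapping are hypothetical, and it assumes the very "who is waiting for whom" information that is not reasonably available in a real kernel.

```python
def stuck_tasks(waits_for, frozen):
    """Return the unfrozen tasks that can never reach a freezing point
    because they wait, directly or transitively, on a frozen task.

    waits_for maps a task to the single task it is currently waiting on.
    """
    stuck = set()
    changed = True
    while changed:                       # repeat until nothing new is found
        changed = False
        for task, target in waits_for.items():
            if task in frozen or task in stuck:
                continue
            if target in frozen or target in stuck:
                stuck.add(task)
                changed = True
    return stuck

# Hypothetical chain: "cat" reads from a FUSE mount, the FUSE daemon
# talks over IPC to a helper, and the helper was frozen first.
waits_for = {"cat": "fuse-daemon", "fuse-daemon": "helper"}
print(sorted(stuck_tasks(waits_for, frozen={"helper"})))
```

Freezing one task early is enough to strand every task upstream of it, which is why the order matters and why the mapping would have to be both complete and current.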
Alan Stern
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707231035150.3545-100000@iolanthe.rowland.org>
@ 2007-07-23 19:01 ` david
0 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-07-23 19:01 UTC (permalink / raw)
To: Alan Stern
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Mon, 23 Jul 2007, Alan Stern wrote:
> On Sun, 22 Jul 2007 david@lang.hm wrote:
>
>> Ok, I did misunderstand you. it sounds like all you need to do to make
>> sure that locks are not held is to allow system calls to return before
>> trying to do the suspend/kexec/etc. that sounds like not only a trivial
>> thing to do, but something that would probably be done anyway.
>
> If you could actually do it, it would work. But you can't do it. If
> it were feasible, the freezer would have used that approach in the
> first place.
>
> For one thing, checking for a suspend-in-progress at the beginning of
> each and every system call would add overhead to a hot path in the
> kernel, one which is already very heavily optimized. People wouldn't
> stand for it.
I thought that the suspend stuff did this easily, but the freezer really
starts running into trouble when it wants to freeze some things, but not
other things. this seems to be the biggest area of churn and problems.
>> although syscalls that then call out to userspace tasks before they can
>> complete cause potential deadlocks (without that issue you can just wait
>> until all syscalls have returned, and not allow anything to issue new
>> syscalls) is this the issue that's killing FUSE+suspend?
>
> You get similar problems from system calls that wait in kernel mode
> until something has happened. For example, a read() call for the
> console device will wait until somebody types on the keyboard. At any
> point in time, many (or even most) user threads are blocked in a system
> call.
but are locks held while they are blocked like this?
>>> But it also means that tasks which otherwise would have been frozen are
>>> actually free to run before the kexec call is made (and after the call
>>> returns, if the kexec'd kernel returns back to the original kernel).
>>> Any driver which was written with the assumption that tasks would be
>>> frozen at those times will need to be changed.
>>
>> here is where you lose me.
>>
>> why should jumping back to the original kernel immediately start running
>> these processes?
>
> Let's let kernel K1 be the original kernel, the one which is going into
> hibernation. Kernel K2 is the one started by kexec to write out the
> memory image.
>
> Your question becomes: Why should K2 jumping back to K1 cause K1
> immediately to start running user tasks? Answer: Because K1 has been
> running user tasks all along (except while K2 was active) and nothing
> has told it to stop. In fact, about the only things which _can_ cause
> K1 to stop running user threads are the freezer (which you want to
> eliminate) and disabling interrupts (not possible since some drivers
> require interrupts to be enabled when putting devices in low-power
> mode).
when you jump to a body of code you jump to a specific point in the code,
not to some nebulous 'everything running' state.
>> the process of doing a kexec requires things to happen in
>> the drivers before normal activity can happen, so there is a phase in
>> there where the kernel being jumped to has drivers initializing, but still
>> does not allow anything else to run.
>
> So when K2 starts up, it will have a phase in which user threads don't
> run. That doesn't affect K1. When K2 returns to K1, K1 does not go
> through this sort of phase. It simply picks up from where it left off.
then how can it restart drivers before the user threads need them?
>> why can't this phase be extended to
>> allow for the possibility of transitioning these drivers to a sleep mode
>> instead of to full operation?
>
> Indeed, Rafael has suggested that K2 be responsible for putting devices
> in low-power mode. This has the disadvantage of requiring K2 to
> include drivers for every device used by K1, but otherwise it would
> work.
>
> However there still remains the problem of user tasks running after
> devices are supposed to be quiescent and before K1 starts. There's
> currently nothing to stop such tasks from making I/O requests and
> thereby causing a quiescent device to become active again.
but if the devices are in low power mode then K1 needs to get them out of
low power mode before user tasks try to access them.
>>> The situation as regards locking is harder to discuss since I don't
>>> know of any code examples to use as a guide. The fact remains that if
>>> user tasks aren't frozen then they can make system calls, and while
>>> running in kernel mode they can acquire locks, which might cause
>>> problems -- even though I can't identify any definite examples.
>>
>> yes, if userspace is running jobs and submitting I/O and system calls
>> while drivers are trying to initialize there is a big problem, but I am
>> missing the reason this must be the case.
>
> We aren't talking about drivers initializing devices. We are talking
> about what happens during the time when drivers are trying to quiesce
> devices (i.e., before K1 has started up K2) or power them down (after
> K2 has returned to K1).
or if you are doing a resume instead of a suspend to ram the drivers need
to initialize or otherwise move to full power on K1 before user tasks hit
them.
>> the part of the freezer that everyone is trying to eliminate is the
>> exceptions (freeze everything except X,Y,Z because we will need to use
>> those later for A)
>
> Wrong. People are trying to eliminate the freezer entirely. Go back
> and reread some of the postings at the beginning of this long thread,
> especially those from Paul Mackerras and Ben Herrenschmidt.
>
> Alan Stern
>
>
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <200707231431.30372.oliver@neukum.org>
2007-07-23 13:08 ` Miklos Szeredi
@ 2007-07-23 19:08 ` david
1 sibling, 0 replies; 81+ messages in thread
From: david @ 2007-07-23 19:08 UTC (permalink / raw)
To: Oliver Neukum
Cc: Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
On Mon, 23 Jul 2007, Oliver Neukum wrote:
> Am Montag 23 Juli 2007 schrieb Miklos Szeredi:
>>> The reason is that we want them to "park" in safe places, ie. where there
>>> are no locks held etc. Thus, these safe places need to be chosen somehow
>>> and since they are not marked throughout the code, we choose the obvious
>>> one. :-)
>>
>> Why shouldn't locks be held?
>>
>> No locks which are required for suspend must be held, sure. But
>> otherwise holding locks doesn't matter at all.
>
> If you can provide a way to tell them apart, this would work.
Can you just tell the driver to try to suspend and, if it reports a
failure, back out of the suspend? Or will the driver deadlock instead of
reporting a failure if a lock is held?
David Lang
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.64.0707231121220.3828@asgard.lang.hm>
@ 2007-07-23 20:22 ` Alan Stern
2007-07-24 13:26 ` Huang, Ying
0 siblings, 1 reply; 81+ messages in thread
From: Alan Stern @ 2007-07-23 20:22 UTC (permalink / raw)
To: david; +Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Mon, 23 Jul 2007 david@lang.hm wrote:
> > For one thing, checking for a suspend-in-progress at the beginning of
> > each and every system call would add overhead to a hot path in the
> > kernel, one which is already very heavily optimized. People wouldn't
> > stand for it.
>
> I thought that the suspend stuff did this easily,
It does not do it at all. Do you know how the freezer works?
> but the freezer really
> starts running into trouble when it wants to freeze some things, but not
> other things. this seems to be the biggest area of churn and problems.
No. The freezer starts running into trouble when it wants to freeze a
thread but can't, because that thread is waiting for some event to
occur and the only thread which can cause the event is already frozen.
Or is itself waiting for a third thread which is already frozen...
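To make that failure concrete, here is a toy model in plain C. It is a
sketch only: struct toy_task, the waits_on field and freeze_pass() are
invented for illustration and are not the kernel's freezer. The assumption
is that a task can be frozen only if the chain of tasks it is waiting on
ends at a task that is still running.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not kernel code. */
struct toy_task {
    int frozen;     /* 1 once the task has been frozen */
    int waits_on;   /* index of the task whose event we sleep on, or -1 */
};

/*
 * Follow the chain of waits starting at task i.  The task can reach a
 * freezing point only if the chain ends at a task that is neither frozen
 * nor blocked itself.
 */
int can_make_progress(const struct toy_task *t, size_t n, size_t i)
{
    for (size_t steps = 0; steps <= n; steps++) {
        if (t[i].frozen)
            return 0;           /* chain ends at a frozen task */
        if (t[i].waits_on < 0)
            return 1;           /* chain ends at a runnable task */
        i = (size_t)t[i].waits_on;
    }
    return 0;                   /* cycle: nobody can deliver the event */
}

/* One pass in index order.  Returns the number of tasks left unfrozen. */
int freeze_pass(struct toy_task *t, size_t n)
{
    int todo = 0;

    for (size_t i = 0; i < n; i++) {
        if (t[i].frozen)
            continue;
        if (can_make_progress(t, n, i))
            t[i].frozen = 1;
        else
            todo++;             /* stuck behind an already-frozen task */
    }
    return todo;
}
```

With two tasks where task 1 waits on task 0, freezing task 0 first leaves
task 1 stuck; if the roles are swapped, the same pass freezes both. The
outcome depends purely on the order, which is the problem.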
> > You get similar problems from system calls that wait in kernel mode
> > until something has happened. For example, a read() call for the
> > console device will wait until somebody types on the keyboard. At any
> > point in time, many (or even most) user threads are blocked in a system
> > call.
>
> but are locks held while they are blocked like this?
Sometimes they are, sometimes they aren't.
> > Let's let kernel K1 be the original kernel, the one which is going into
> > hibernation. Kernel K2 is the one started by kexec to write out the
> > memory image.
> >
> > Your question becomes: Why should K2 jumping back to K1 cause K1
> > immediately to start running user tasks? Answer: Because K1 has been
> > running user tasks all along (except while K2 was active) and nothing
> > has told it to stop. In fact, about the only things which _can_ cause
> > K1 to stop running user threads are the freezer (which you want to
> > eliminate) and disabling interrupts (not possible since some drivers
> > require interrupts to be enabled when putting devices in low-power
> > mode).
>
> when you jump to a body of code you jump to a specific point in the code,
> not to some nebulous 'everything running' state.
How is that relevant? When K2 jumps back to K1, it jumps to some
designated location in K1. It might be just after the place where K1
called K2; I'm not familiar with the details of kexec. In any event,
K1 will still be in the same state as it was when it called K2.
> > So when K2 starts up, it will have a phase in which user threads don't
> > run. That doesn't affect K1. When K2 returns to K1, K1 does not go
> > through this sort of phase. It simply picks up from where it left off.
>
> then how can it restart drivers before the user threads need them?
It can't. Indeed, in the absence of a freezer, user threads will need
devices (more accurately, will submit I/O requests for devices) that
have to be kept quiescent or low-power. Drivers will need to delay
those requests until the devices are returned to full operation.
That's exactly what I've been saying all along: Drivers will need to
be changed to delay I/O requests, if there is no freezer.
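Roughly what "delay I/O requests" means, as a user-space sketch in C. All
names here (struct toy_dev, toy_submit(), toy_suspend(), toy_resume()) are
hypothetical, and a real driver would need locking and would block or
complete requests asynchronously rather than log them in an array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16

/* Hypothetical device state, for illustration only. */
struct toy_dev {
    int quiescent;                  /* 1 while the device must stay idle */
    int pending[TOY_QUEUE_MAX];     /* requests held back, in arrival order */
    size_t npending;
    int done[TOY_QUEUE_MAX];        /* requests that reached the hardware */
    size_t ndone;
};

/* Returns 0 if the request was accepted (run or queued), -1 if no room. */
int toy_submit(struct toy_dev *d, int req)
{
    if (d->quiescent) {
        /* Device is quiescent: hold the request instead of waking it. */
        if (d->npending == TOY_QUEUE_MAX)
            return -1;
        d->pending[d->npending++] = req;
        return 0;
    }
    if (d->ndone == TOY_QUEUE_MAX)
        return -1;
    d->done[d->ndone++] = req;
    return 0;
}

void toy_suspend(struct toy_dev *d)
{
    d->quiescent = 1;
}

/* Return to full operation, then replay the held requests in order. */
void toy_resume(struct toy_dev *d)
{
    d->quiescent = 0;
    /* Toy limitation: the bounded log drops requests once it is full. */
    for (size_t i = 0; i < d->npending && d->ndone < TOY_QUEUE_MAX; i++)
        d->done[d->ndone++] = d->pending[i];
    d->npending = 0;
}
```

User tasks keep calling toy_submit() the whole time; nothing touches the
device between toy_suspend() and toy_resume().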
> > However there still remains the problem of user tasks running after
> > devices are supposed to be quiescent and before K1 starts. There's
> > currently nothing to stop such tasks from making I/O requests and
> > thereby causing a quiescent device to become active again.
>
> but if the devices are in low power mode then K1 needs to get them out of
> low power mode before user tasks try to access them.
No -- which is good because it can't. If a user task is running
there's no way to stop it from submitting I/O requests. K1 needs to
delay these requests until after the device has returned to full
operation.
> > We aren't talking about drivers initializing devices. We are talking
> > about what happens during the time when drivers are trying to quiesce
> > devices (i.e., before K1 has started up K2) or power them down (after
> > K2 has returned to K1).
>
> or if you are doing a resume instead of a suspend to ram the drivers need
> to initialize or otherwise move to full power on K1 before user tasks hit
> them.
Correct. User tasks are allowed to submit requests, but the requests
can't be carried out until the device returns to full operation.
Alan Stern
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] <Pine.LNX.4.44L0.0707231120380.3545-100000@iolanthe.rowland.org>
@ 2007-07-23 21:55 ` Nigel Cunningham
[not found] ` <200707240755.01820.nigel@nigel.suspend2.net>
1 sibling, 0 replies; 81+ messages in thread
From: Nigel Cunningham @ 2007-07-23 21:55 UTC (permalink / raw)
To: Alan Stern
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
Hi.
On Tuesday 24 July 2007 01:23:15 Alan Stern wrote:
> On Mon, 23 Jul 2007, Nigel Cunningham wrote:
>
> > Take a step back for a second.
> >
> > The problem we're facing now is that we're getting some userspace threads,
> > used in processing I/O, that are functioning as exceptions to the "freeze
> > userspace, then freezeable kernel threads" rule. They are only exceptions
> > because of that role in processing I/O - because they're de facto kernel
> > threads. So, if we orient our thinking more in terms of I/O processing and
> > less in terms of the userspace/kernelspace distinction, we'll have a
> > solution:
> >
> > 1) Freeze processes that aren't fs related (ie stop them generating I/O).
>
> The problem here is that with things like FUSE, _every_ process is
> potentially fs related. Nothing prevents a FUSE thread from doing IPC
> with any other thread.
Yes, but the fuse thread is going to know what other thread it's doing IPC
with, so it can get that thread flagged too.
> > 2) Flush pending I/O.
> > 3) Freeze filesystems in reverse order of dependency, the primary purpose
> > being to stop them generating further I/O on their metadata.
> >
> > Locks that are being held are only being held because work is being done.
> > If we progressively focus on threads in terms of their create/process work
> > dependencies, we'll see that the problem isn't at all intractable.
>
> As has been mentioned before, keeping track of all that dependency
> information would be very fragile and time-consuming.
I disagree. It's at least going to be less fragile and time-consuming than
maintaining new/extra code for kexec.
Nigel
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <200707240755.01820.nigel@nigel.suspend2.net>
@ 2007-07-23 22:10 ` Rafael J. Wysocki
0 siblings, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-07-23 22:10 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, nigel, linux-kernel, miltonm, ying.huang,
linux-pm, Jeremy Maitin-Shepard
On Monday, 23 July 2007 23:55, Nigel Cunningham wrote:
> Hi.
>
> On Tuesday 24 July 2007 01:23:15 Alan Stern wrote:
> > On Mon, 23 Jul 2007, Nigel Cunningham wrote:
> >
> > > Take a step back for a second.
> > >
> > > The problem we're facing now is that we're getting some userspace threads,
> > > used in processing I/O, that are functioning as exceptions to the "freeze
> > > userspace, then freezeable kernel threads" rule. They are only exceptions
> > > because of that role in processing I/O - because they're de facto kernel
> > > threads. So, if we orient our thinking more in terms of I/O processing and
> > > less in terms of the userspace/kernelspace distinction, we'll have a
> > > solution:
> > >
> > > 1) Freeze processes that aren't fs related (ie stop them generating I/O).
> >
> > The problem here is that with things like FUSE, _every_ process is
> > potentially fs related. Nothing prevents a FUSE thread from doing IPC
> > with any other thread.
>
> Yes, but the fuse thread is going to know what other thread it's doing IPC
> with, so it can get that thread flagged too.
Yes, but that thread may do IPC with yet another one and so on.
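In other words, what would have to be flagged is the transitive closure of
the "does IPC with" relation. A small C sketch, assuming (and this is a big
assumption) that the relation is known at all; the talks[][] matrix and
toy_flag_late_freeze() are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NTASKS 5

/*
 * talks[i][j] != 0 means task i does IPC with task j (hypothetical data;
 * nothing in the kernel records this today).  Starting from the FUSE
 * server `start`, flag every task reachable through such links.
 */
void toy_flag_late_freeze(int talks[TOY_NTASKS][TOY_NTASKS], size_t start,
                          int flagged[TOY_NTASKS])
{
    size_t stack[TOY_NTASKS];   /* each task is pushed at most once */
    size_t top = 0;

    for (size_t i = 0; i < TOY_NTASKS; i++)
        flagged[i] = 0;

    flagged[start] = 1;
    stack[top++] = start;

    while (top > 0) {
        size_t cur = stack[--top];

        for (size_t j = 0; j < TOY_NTASKS; j++) {
            if (talks[cur][j] && !flagged[j]) {
                flagged[j] = 1;
                stack[top++] = j;
            }
        }
    }
}
```

With a chain 0 -> 1 -> 2 -> 3, flagging task 0 ends up flagging all four,
and on a real system the set can grow to nearly everything.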
> > > 2) Flush pending I/O.
> > > 3) Freeze filesystems in reverse order of dependency, the primary purpose
> > > being to stop them generating further I/O on their metadata.
> > >
> > > > Locks that are being held are only being held because work is being done.
> > > > If we progressively focus on threads in terms of their create/process work
> > > > dependencies, we'll see that the problem isn't at all intractable.
> >
> > As has been mentioned before, keeping track of all that dependency
> > information would be very fragile and time-consuming.
>
> I disagree. It's at least going to be less fragile and time-consuming than
> maintaining new/extra code for kexec.
Well, I think the issue is real, so we need to find a solution (the simpler,
the better) and that need not be related to kexec. ;-)
Greetings,
Rafael
--
"Premature optimization is the root of all evil." - Donald Knuth
^ permalink raw reply [flat|nested] 81+ messages in thread
* RE: Re: Hibernation considerations
2007-07-23 20:22 ` Alan Stern
@ 2007-07-24 13:26 ` Huang, Ying
0 siblings, 0 replies; 81+ messages in thread
From: Huang, Ying @ 2007-07-24 13:26 UTC (permalink / raw)
To: Alan Stern, david
Cc: LKML, Milton Miller, Pavel Machek, linux-pm,
Jeremy Maitin-Shepard
>From: Alan Stern [mailto:stern@rowland.harvard.edu]
>It can't. Indeed, in the absence of a freezer, user threads will need
>devices (more accurately, will submit I/O requests for devices) that
>have to be kept quiescent or low-power. Drivers will need to delay
>those requests until the devices are returned to full operation.
>
>That's exactly what I've been saying all along: Drivers will need to
>be changed to delay I/O requests, if there is no freezer.
If implementing "delaying I/O requests" in every driver is too much work,
is it possible to implement it as follows:
1. Suspend to RAM/DISK is triggered.
2. Replace the driver related syscall entries (such as sys_read,
sys_write, sys_ioctl, etc) in sys_call_table with special wrapper
entries provided by "suspend to RAM/DISK" subsystem, which will delay
I/O requests if appropriate.
3. When devices are quiesced, they are put into "low power" state and
system is put into suspend state; or the image is written to disk
(through snapshot/uswsusp or kexeced kernel).
4. After resuming from RAM/DISK, devices are put into "normal" state and
the syscall entries replaced in step 2 are restored.
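The steps above can be sketched in user-space C as follows. This is only a
model of the idea: toy_call_table, the wrapper and the TOY_EAGAIN return
value are hypothetical, and a real wrapper would put the caller to sleep
until resume rather than return an error.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*toy_syscall_fn)(long arg);

enum { TOY_NR_READ, TOY_NR_WRITE, TOY_NR_MAX };

#define TOY_EAGAIN (-11L)       /* stand-in for "try again after resume" */

/* Stand-ins for the real handlers (sys_read, sys_write, ...). */
long toy_read(long arg)  { return arg + 1; }
long toy_write(long arg) { return arg + 2; }

toy_syscall_fn toy_call_table[TOY_NR_MAX] = { toy_read, toy_write };
toy_syscall_fn toy_saved[TOY_NR_MAX];
int toy_deferred;               /* how many calls the wrapper held back */

/* Step 2: the wrapper installed while the system is suspending. */
long toy_deferring_wrapper(long arg)
{
    (void)arg;
    toy_deferred++;
    return TOY_EAGAIN;
}

/* Steps 1-2: save the original entries and install the wrapper. */
void toy_suspend_begin(void)
{
    for (size_t i = 0; i < TOY_NR_MAX; i++) {
        toy_saved[i] = toy_call_table[i];
        toy_call_table[i] = toy_deferring_wrapper;
    }
}

/* Step 4: put the original entries back after resume. */
void toy_resume_end(void)
{
    for (size_t i = 0; i < TOY_NR_MAX; i++)
        toy_call_table[i] = toy_saved[i];
}

/* Dispatch through the table, as the syscall entry path would. */
long toy_syscall(int nr, long arg)
{
    return toy_call_table[nr](arg);
}
```

Note that the swap only catches calls that enter through the table after
toy_suspend_begin(); it does nothing for a caller that is already inside a
handler.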
Best Regards,
Huang Ying
^ permalink raw reply [flat|nested] 81+ messages in thread
* RE: Re: Hibernation considerations
[not found] <9D7649D18729DE4BB2BD7B494F7FEDC236CF5C@pdsmsx415.ccr.corp.intel.com>
@ 2007-07-24 14:50 ` Alan Stern
0 siblings, 0 replies; 81+ messages in thread
From: Alan Stern @ 2007-07-24 14:50 UTC (permalink / raw)
To: Huang, Ying
Cc: david, LKML, Milton Miller, Pavel Machek, linux-pm,
Jeremy Maitin-Shepard
On Tue, 24 Jul 2007, Huang, Ying wrote:
> >From: Alan Stern [mailto:stern@rowland.harvard.edu]
> >It can't. Indeed, in the absence of a freezer, user threads will need
> >devices (more accurately, will submit I/O requests for devices) that
> >have to be kept quiescent or low-power. Drivers will need to delay
> >those requests until the devices are returned to full operation.
> >
> >That's exactly what I've been saying all along: Drivers will need to
> >be changed to delay I/O requests, if there is no freezer.
>
> If implementing "delaying I/O requests" in every driver is too much work,
> is it possible to implement it as follows:
>
> 1. Suspend to RAM/DISK is triggered.
> 2. Replace the driver related syscall entries (such as sys_read,
> sys_write, sys_ioctl, etc) in sys_call_table with special wrapper
> entries provided by "suspend to RAM/DISK" subsystem, which will delay
> I/O requests if appropriate.
> 3. When devices are quiesced, they are put into "low power" state and
> system is put into suspend state; or the image is written to disk
> (through snapshot/uswsusp or kexeced kernel).
> 4. After resuming from RAM/DISK, devices are put into "normal" state and
> the syscall entries replaced in step 2 are restored.
Ha! I made exactly this same suggestion (URL lost in the mists of
time), except that I proposed changing the syscall entries for every
system call, not just the driver-related ones.
Nobody seemed to think it would work very well.
It leaves a few loose ends. For example, suppose a user thread is
already in the middle of a system call and is about to start doing some
I/O (maybe it's waiting for a timer to expire).
In the end, this doesn't seem to be very different from freezing all
user threads.
Alan Stern
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <200707212243.35602.nigel@nigel.suspend2.net>
` (2 preceding siblings ...)
[not found] ` <87lkd9ohtn.fsf@jbms.ath.cx>
@ 2007-08-01 9:19 ` Pavel Machek
3 siblings, 0 replies; 81+ messages in thread
From: Pavel Machek @ 2007-08-01 9:19 UTC (permalink / raw)
To: Nigel Cunningham
Cc: david, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
Hi!
> > > The problem with FUSE is related to the fact that the freezer can't
> > > freeze uninterruptible tasks and we said that perhaps we might avoid
> > > it if FUSE was made freezing-aware. Still, no one has gone in this
> > > direction and I don't know of any plans to do that.
> >
> > I thought we have fully explored this direction. Lots of emails, and
> > an IRC session with Pavel. Conclusion:
>
> What am I missing in the following suggested solution?
>
> 1) In the freezer code, we implement a new TIF_LATEFREEZE process flag, which,
> when set, causes a userspace process to be frozen with kernel threads
> instead of with userspace ones. When freezing, we freeze !TIF_LATEFREEZE,
> sync and then freeze TIF_LATEFREEZE and freezable kernel threads.
>
> 2) In the fuse code, the PID of the process that will do the work gets passed
The list of necessary PIDs is not known to the kernel. FUSE servers
may depend on other parts of userland.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <200707212120.04645.rjw@sisk.pl>
@ 2007-08-01 9:22 ` Pavel Machek
[not found] ` <20070801092227.GB4808@ucw.cz>
1 sibling, 0 replies; 81+ messages in thread
From: Pavel Machek @ 2007-08-01 9:22 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: david, nigel, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
Hi!
> > Hmm, wonder why this isn't affecting people with VPNs? Probably
> > network mounts over VPN are rare, and even rarer to have fs activity
> > on them during suspend.
> >
> > Anyway, I think it's long overdue to stop thinking about how to "fix"
> > fuse, and concentrate on fixing the underlying problem instead ;)
>
> To conclude this branch of the thread, I have a patch in the works that may
> help a bit with unfreezable FUSE filesystems and it only affects the freezer.
> I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
> that need to go first.
I'm interested... which one is that?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <Pine.LNX.4.44L0.0707201820050.5241-100000@iolanthe.rowland.org>
2007-07-23 14:23 ` Oliver Neukum
@ 2007-08-01 9:34 ` Pavel Machek
[not found] ` <20070801093437.GC4808@ucw.cz>
2 siblings, 0 replies; 81+ messages in thread
From: Pavel Machek @ 2007-08-01 9:34 UTC (permalink / raw)
To: Alan Stern
Cc: David Lang, LKML, Milton Miller, Ying Huang, linux-pm,
Jeremy Maitin-Shepard
Hi!
> > Do we have to block module loading?
>
> No. Registering new drivers is okay, registering new devices is bad.
>
> Of course, some modules do want to register a new device in their init
> method. I don't know what we should do about them. Force the
> registration to fail, I suppose. How often will people suspend while a
> module is loading?
Well... plug this pcmcia card into the slot so that I do not have to
carry it separately, close the lid and go?
...not that impossible to imagine...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <20070801092227.GB4808@ucw.cz>
2007-08-02 17:02 ` Rafael J. Wysocki
@ 2007-08-02 17:02 ` Rafael J. Wysocki
1 sibling, 0 replies; 81+ messages in thread
From: Rafael J. Wysocki @ 2007-08-02 17:02 UTC (permalink / raw)
To: Pavel Machek
Cc: david, nigel, Miklos Szeredi, linux-kernel, miltonm, ying.huang,
linux-pm, jbms
On Wednesday, 1 August 2007 11:22, Pavel Machek wrote:
> Hi!
>
> > > Hmm, wonder why this isn't affecting people with VPNs? Probably
> > > network mounts over VPN are rare, and even rarer to have fs activity
> > > on them during suspend.
> > >
> > > Anyway, I think it's long overdue to stop thinking about how to "fix"
> > > fuse, and concentrate on fixing the underlying problem instead ;)
> >
> > To conclude this branch of the thread, I have a patch in the works that may
> > help a bit with unfreezable FUSE filesystems and it only affects the freezer.
> > I'll post it when 2.6.23-rc1 is out, because it's on top of some other patches
> > that need to go first.
>
> I'm interested... which one is that?
Appended, on top of this:
https://lists.linux-foundation.org/pipermail/linux-pm/2007-July/014521.html
Greetings,
Rafael
---
kernel/power/process.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
Index: linux-2.6.23-rc1/kernel/power/process.c
===================================================================
--- linux-2.6.23-rc1.orig/kernel/power/process.c 2007-07-24 00:14:07.000000000 +0200
+++ linux-2.6.23-rc1/kernel/power/process.c 2007-07-24 00:14:17.000000000 +0200
@@ -30,6 +30,14 @@
*/
#define MAX_WAITS 5
+/*
+ * If the freezing of tasks fails, we attempt to thaw tasks that have already
+ * been frozen to give the other tasks a chance to freeze, in case one or more
+ * of them are blocked by the frozen ones. If this fails MAX_ATTEMPTS times
+ * in a row, we give up.
+ */
+#define MAX_ATTEMPTS 10
+
#define FREEZER_KERNEL_THREADS 0
#define FREEZER_USER_SPACE 1
@@ -192,14 +200,21 @@ static void cancel_freezing(struct task_
static int try_to_freeze_tasks(int freeze_user_space)
{
struct task_struct *g, *p;
- unsigned int todo, waits;
+ unsigned int todo, waits, attempts;
unsigned long ret;
struct timeval start, end;
s64 elapsed_csecs64;
unsigned int elapsed_csecs;
+ char *tick = "-\\|/";
+
+ printk(" ");
+ attempts = 0;
do_gettimeofday(&start);
+ Repeat:
+ printk("\b%c", tick[attempts++ % 4]);
+
refrigerator_called = 0;
waits = 0;
do {
@@ -235,11 +250,43 @@ static int try_to_freeze_tasks(int freez
}
} while (todo);
+ if (todo && attempts <= MAX_ATTEMPTS) {
+ /*
+ * Some tasks have not been able to freeze. They might be stuck
+ * in TASK_UNINTERRUPTIBLE waiting for the frozen tasks. Try to
+ * thaw the tasks that have frozen without clearing the freeze
+ * requests of the remaining tasks and repeat.
+ */
+ read_lock(&tasklist_lock);
+ do_each_thread(g, p) {
+ if (frozen(p)) {
+ p->flags &= ~PF_FROZEN;
+ wake_up_process(p);
+ }
+ } while_each_thread(g, p);
+ read_unlock(&tasklist_lock);
+
+ ret = wait_event_timeout(refrigerator_waitq,
+ refrigerator_called, TIMEOUT);
+ if (!ret) {
+ /*
+ * There is little hope that we will succeed, but at
+ * least we want to know which tasks have not been
+ * frozen. Thus, we are going to repeat once.
+ */
+ attempts = MAX_ATTEMPTS;
+ }
+
+ goto Repeat;
+ }
+
do_gettimeofday(&end);
elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
do_div(elapsed_csecs64, NSEC_PER_SEC / 100);
elapsed_csecs = elapsed_csecs64;
+ printk("\b");
+
if (todo) {
/* This does not unfreeze processes that are already frozen
* (we have slightly ugly calling convention in that respect,
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Hibernation considerations
[not found] ` <20070801093437.GC4808@ucw.cz>
@ 2007-08-03 3:50 ` david
0 siblings, 0 replies; 81+ messages in thread
From: david @ 2007-08-03 3:50 UTC (permalink / raw)
To: Pavel Machek
Cc: LKML, Milton Miller, Ying Huang, linux-pm, Jeremy Maitin-Shepard
On Wed, 1 Aug 2007, Pavel Machek wrote:
> Hi!
>
>>> Do we have to block module loading?
>>
>> No. Registering new drivers is okay, registering new devices is bad.
>>
>> Of course, some modules do want to register a new device in their init
>> method. I don't know what we should do about them. Force the
>> registration to fail, I suppose. How often will people suspend while a
>> module is loading?
>
> Well... plug this pcmcia card into the slot so that I do not have to
> carry it separately, close the lid and go?
>
> ...not that impossible to imagine...
I usually leave my broadband card in the slot, but not seated. I wouldn't
bet against it getting pushed in enough to be detected while putting the
laptop in the bag.
David Lang
^ permalink raw reply [flat|nested] 81+ messages in thread
end of thread, other threads: [~2007-08-03 3:50 UTC]
Thread overview: 81+ messages
[not found] <877ioupxi8.fsf@jbms.ath.cx>
2007-07-20 22:35 ` Re: Hibernation considerations Alan Stern
2007-07-20 22:43 ` david
2007-07-20 22:48 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201540260.5166@asgard.lang.hm>
2007-07-21 5:21 ` Nigel Cunningham
2007-07-21 14:10 ` Alan Stern
[not found] <9D7649D18729DE4BB2BD7B494F7FEDC236CF5C@pdsmsx415.ccr.corp.intel.com>
2007-07-24 14:50 ` Alan Stern
[not found] <Pine.LNX.4.44L0.0707231120380.3545-100000@iolanthe.rowland.org>
2007-07-23 21:55 ` Nigel Cunningham
[not found] ` <200707240755.01820.nigel@nigel.suspend2.net>
2007-07-23 22:10 ` Rafael J. Wysocki
[not found] <Pine.LNX.4.64.0707231121220.3828@asgard.lang.hm>
2007-07-23 20:22 ` Alan Stern
2007-07-24 13:26 ` Huang, Ying
[not found] <Pine.LNX.4.44L0.0707231035150.3545-100000@iolanthe.rowland.org>
2007-07-23 19:01 ` david
[not found] <200707231311.56398.nigel@nigel.suspend2.net>
2007-07-23 15:23 ` Alan Stern
[not found] <Pine.LNX.4.44L0.0707221116060.15224-100000@netrider.rowland.org>
2007-07-22 16:27 ` Miklos Szeredi
2007-07-22 20:09 ` Alan Stern
2007-07-22 22:42 ` Nigel Cunningham
[not found] ` <200707230842.22121.nigel@nigel.suspend2.net>
2007-07-22 23:09 ` Rafael J. Wysocki
[not found] ` <200707230109.23071.rjw@sisk.pl>
2007-07-22 23:18 ` Nigel Cunningham
2007-07-23 0:04 ` Paul Mackerras
[not found] ` <18083.61595.217126.824924@cargo.ozlabs.ibm.com>
2007-07-23 3:11 ` Nigel Cunningham
2007-07-23 5:31 ` david
2007-07-23 10:24 ` Miklos Szeredi
2007-07-23 12:08 ` Rafael J. Wysocki
2007-07-23 12:14 ` Miklos Szeredi
[not found] ` <E1ICwoE-0004q8-00@dorka.pomaz.szeredi.hu>
2007-07-23 12:27 ` Rafael J. Wysocki
2007-07-23 12:31 ` Oliver Neukum
[not found] ` <200707231431.30372.oliver@neukum.org>
2007-07-23 13:08 ` Miklos Szeredi
[not found] ` <200707231601.09541.rjw@sisk.pl>
2007-07-23 14:01 ` Miklos Szeredi
2007-07-23 14:01 ` Rafael J. Wysocki
2007-07-23 19:08 ` david
[not found] <Pine.LNX.4.44L0.0707221608140.16031-100000@netrider.rowland.org>
2007-07-22 21:54 ` david
[not found] <Pine.LNX.4.64.0707212000380.6747@asgard.lang.hm>
2007-07-22 16:00 ` Alan Stern
2007-07-22 21:50 ` david
2007-07-23 15:19 ` Alan Stern
[not found] <Pine.LNX.4.44L0.0707210956320.8201-100000@netrider.rowland.org>
2007-07-22 3:43 ` david
[not found] <200707202335.05519.oliver@neukum.org>
2007-07-20 22:25 ` Alan Stern
[not found] ` <Pine.LNX.4.44L0.0707201820050.5241-100000@iolanthe.rowland.org>
2007-07-23 14:23 ` Oliver Neukum
2007-08-01 9:34 ` Pavel Machek
[not found] ` <20070801093437.GC4808@ucw.cz>
2007-08-03 3:50 ` david
[not found] <Pine.LNX.4.44L0.0707201608120.2546-100000@iolanthe.rowland.org>
2007-07-20 21:35 ` Oliver Neukum
[not found] <200707202203.27849.oliver@neukum.org>
2007-07-20 20:12 ` Alan Stern
[not found] <fe998950b9d5ad317d5c1f5ff4e21ac9@bga.com>
2007-07-20 19:37 ` Alan Stern
[not found] <f29402c6050f9c3ff5d83a59cea2de58%40bga.com>
2007-07-20 19:09 ` Milton Miller
2007-07-20 20:23 ` Jeremy Maitin-Shepard
[not found] <Pine.LNX.4.44L0.0707201408060.2546-100000@iolanthe.rowland.org>
2007-07-20 19:08 ` Milton Miller
2007-07-20 20:03 ` Oliver Neukum
2007-07-17 20:27 david
[not found] ` <Pine.LNX.4.44L0.0707171416120.3728-100000@iolanthe.rowland.org>
[not found] ` <200707172217.01890.rjw@sisk.pl>
[not found] ` <87odiag45q.fsf@jbms.ath.cx>
[not found] ` <200707151433.34625.rjw@sisk.pl>
[not found] ` <200707172320.16279.rjw@sisk.pl>
2007-07-20 14:01 ` Milton Miller
2007-07-20 14:48 ` Huang, Ying
2007-07-20 15:48 ` david
2007-07-22 2:17 ` Huang, Ying
[not found] ` <1185070634.3517.11.camel@caritas-dev.intel.com>
2007-07-22 2:32 ` david
2007-07-20 21:34 ` Rafael J. Wysocki
[not found] ` <ea7a437ca4038d408ac544bbc3c2434a@bga.com>
[not found] ` <200707192228.05136.rjw@sisk.pl>
2007-07-20 16:08 ` Milton Miller
2007-07-20 16:20 ` Alan Stern
2007-07-20 17:32 ` Milton Miller
2007-07-20 18:17 ` Alan Stern
2007-07-20 20:31 ` david
2007-07-20 21:24 ` Alan Stern
2007-07-20 21:34 ` david
2007-07-20 21:37 ` Jeremy Maitin-Shepard
[not found] ` <Pine.LNX.4.64.0707201428080.5166@asgard.lang.hm>
2007-07-20 22:15 ` Rafael J. Wysocki
2007-07-20 21:02 ` Rafael J. Wysocki
2007-07-21 11:44 ` Miklos Szeredi
[not found] ` <E1ICDNw-0008HC-00@dorka.pomaz.szeredi.hu>
2007-07-21 12:43 ` Nigel Cunningham
[not found] ` <200707212243.35602.nigel@nigel.suspend2.net>
2007-07-21 13:56 ` Alan Stern
2007-07-21 16:13 ` Jeremy Maitin-Shepard
[not found] ` <87lkd9ohtn.fsf@jbms.ath.cx>
2007-07-21 18:12 ` Miklos Szeredi
2007-07-21 19:20 ` Rafael J. Wysocki
2007-07-21 22:21 ` Nigel Cunningham
[not found] ` <200707212120.04645.rjw@sisk.pl>
2007-08-01 9:22 ` Pavel Machek
[not found] ` <20070801092227.GB4808@ucw.cz>
2007-08-02 17:02 ` Rafael J. Wysocki
2007-08-02 17:02 ` Rafael J. Wysocki
2007-07-21 22:16 ` Nigel Cunningham
2007-07-22 15:26 ` Alan Stern
2007-08-01 9:19 ` Pavel Machek
[not found] ` <Pine.LNX.4.64.0707191542430.28721@asgard.lang.hm>
[not found] ` <200707201317.58025.rjw@sisk.pl>
2007-07-20 16:56 ` Milton Miller
[not found] ` <f29402c6050f9c3ff5d83a59cea2de58@bga.com>
2007-07-20 17:31 ` Jeremy Maitin-Shepard
2007-07-20 21:30 ` Rafael J. Wysocki
2007-07-20 19:26 ` david
2007-07-20 21:28 ` Rafael J. Wysocki
2007-07-20 21:33 ` Jeremy Maitin-Shepard
[not found] ` <87ejj2pxoc.fsf@jbms.ath.cx>
2007-07-20 22:19 ` Rafael J. Wysocki