* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] <1190266447.21818.17.camel@caritas-dev.intel.com>
@ 2007-09-20 10:09 ` Pavel Machek
       [not found]   ` <20070920100941.GA12157@atrey.karlin.mff.cuni.cz>
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Pavel Machek @ 2007-09-20 10:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman,
      Andrew Morton, linux-pm, Jeremy Maitin-Shepard

Hi!

> This patch implements the functionality of jumping between the kexeced
> kernel and the original kernel.
>
> A new reboot command named LINUX_REBOOT_CMD_KJUMP is defined to
> trigger the jumping to (executing) the new kernel and jumping back to
> the original kernel.
>
> To support jumping between two kernels, before jumping to (executing)
> the new kernel and jumping back to the original kernel, the devices
> are put into quiescent state (to be fully implemented), and the state
> of devices and CPU is saved. After jumping back from kexeced kernel
> and jumping to the new kernel, the state of devices and CPU are
> restored accordingly. The devices/CPU state save/restore code of
> software suspend is called to implement corresponding function.
>
> To support jumping without preserving memory. One shadow backup page
> is allocated for each page used by new (kexeced) kernel. When do
> kexec_load, the image of new kernel is loaded into shadow pages, and
> before executing, the original pages and the shadow pages are swapped,
> so the contents of original pages are backuped. Before jumping to the
> new (kexeced) kernel and after jumping back to the original kernel,
> the original pages and the shadow pages are swapped too.
>
> A jump back protocol is defined and documented.
> Signed-off-by: Huang Ying <ying.huang@intel.com>

Seems like good enough for -mm to me.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 42+ messages in thread
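[Editorial sketch] The shadow-page scheme in the quoted patch description can be modelled in a few lines. This is only a toy userspace model under stated assumptions, not the patch's code: pages are byte arrays, PAGE_SIZE is taken to be 4096, and the names ShadowedImage, load and swap_pages are hypothetical.

```python
# Toy model of the shadow-page swap from the quoted patch description.
# Assumptions (not from the patch): pages are bytearrays, PAGE_SIZE is 4096,
# and ShadowedImage / load / swap_pages are hypothetical names.

PAGE_SIZE = 4096


class ShadowedImage:
    def __init__(self, memory, page_numbers):
        self.memory = memory  # dict: page number -> bytearray (the "original" pages)
        # One shadow backup page per page the new kernel will use.
        self.shadow = {pfn: bytearray(PAGE_SIZE) for pfn in page_numbers}

    def load(self, pfn, data):
        """kexec_load step: the new kernel's image goes into the shadow page."""
        self.shadow[pfn][:len(data)] = data

    def swap_pages(self):
        """Exchange every original page with its shadow page."""
        for pfn in self.shadow:
            self.memory[pfn], self.shadow[pfn] = self.shadow[pfn], self.memory[pfn]


memory = {7: bytearray(b"A" * PAGE_SIZE)}
image = ShadowedImage(memory, [7])
image.load(7, b"NEWKERNEL")

image.swap_pages()  # before the jump: new kernel in place, original backed up
in_place = bytes(memory[7][:9])
backed_up = bytes(image.shadow[7][:1])

image.swap_pages()  # after jumping back: original contents restored
restored = bytes(memory[7][:1])
```

Because the operation is a symmetric exchange, the same routine serves both directions of the jump, which is what the description means by the pages being "swapped too".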
[parent not found: <20070920100941.GA12157@atrey.karlin.mff.cuni.cz>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] ` <20070920100941.GA12157@atrey.karlin.mff.cuni.cz>
@ 2007-09-21  0:24   ` Nigel Cunningham
       [not found]   ` <200709211024.35991.nigel@nigel.suspend2.net>
  1 sibling, 0 replies; 42+ messages in thread
From: Nigel Cunningham @ 2007-09-21  0:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying,
      Andrew Morton, linux-pm, Jeremy Maitin-Shepard

Hi Andrew.

On Thursday 20 September 2007 20:09:41 Pavel Machek wrote:
> Seems like good enough for -mm to me.
>
> Pavel

Andrew, if I recall correctly, you said a while ago that you didn't want
another hibernation implementation in the vanilla kernel. If you're going to
consider merging this kexec code, will you also please consider merging
TuxOnIce?

Regards,

Nigel
-- 
See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla
info.
[parent not found: <200709211024.35991.nigel@nigel.suspend2.net>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found]   ` <200709211024.35991.nigel@nigel.suspend2.net>
@ 2007-09-21  1:06     ` Andrew Morton
  2007-09-21  1:19       ` Nigel Cunningham
                         ` (2 more replies)
  2007-09-21  9:49     ` Pavel Machek
       [not found]     ` <20070921094908.GB20149@elf.ucw.cz>
  2 siblings, 3 replies; 42+ messages in thread
From: Andrew Morton @ 2007-09-21  1:06 UTC (permalink / raw)
  To: nigel
  Cc: Nigel Cunningham, Kexec Mailing List, linux-kernel, Eric W. Biederman,
      Huang, Ying, linux-pm, Jeremy Maitin-Shepard

On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham <nigel@nigel.suspend2.net> wrote:

> Hi Andrew.
>
> On Thursday 20 September 2007 20:09:41 Pavel Machek wrote:
> > Seems like good enough for -mm to me.
> >
> > Pavel
>
> Andrew, if I recall correctly, you said a while ago that you didn't want
> another hibernation implementation in the vanilla kernel. If you're going to
> consider merging this kexec code, will you also please consider merging
> TuxOnIce?
>

The theory is that kexec-based hibernation will mainly use preexisting
kexec code and will permit us to delete the existing hibernation
implementation.

That's different from replacing it.
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  2007-09-21  1:06     ` Andrew Morton
@ 2007-09-21  1:19       ` Nigel Cunningham
       [not found]       ` <200709211120.00448.ncunningham@crca.org.au>
  2007-09-24 17:37       ` Thomas Meyer
  2 siblings, 0 replies; 42+ messages in thread
From: Nigel Cunningham @ 2007-09-21  1:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman,
      Huang, Ying, linux-pm, Jeremy Maitin-Shepard

Hi.

On Friday 21 September 2007 11:06:23 Andrew Morton wrote:
> On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
>
> > Hi Andrew.
> >
> > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote:
> > > Seems like good enough for -mm to me.
> > >
> > > Pavel
> >
> > Andrew, if I recall correctly, you said a while ago that you didn't want
> > another hibernation implementation in the vanilla kernel. If you're going to
> > consider merging this kexec code, will you also please consider merging
> > TuxOnIce?
> >
>
> The theory is that kexec-based hibernation will mainly use preexisting
> kexec code and will permit us to delete the existing hibernation
> implementation.
>
> That's different from replacing it.

TuxOnIce doesn't remove the existing implementation either. It can
transparently replace it, but you can enable/disable that at compile time.

Regards,

Nigel
-- 
Nigel Cunningham
Christian Reformed Church of Cobden
103 Curdie Street, Cobden 3266, Victoria, Australia
Ph. +61 3 5595 1185 / +61 417 100 574
Communal Worship: 11 am Sunday.
[parent not found: <200709211120.00448.ncunningham@crca.org.au>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found]       ` <200709211120.00448.ncunningham@crca.org.au>
@ 2007-09-21  1:41         ` Andrew Morton
       [not found]         ` <20070920184106.79e1858a.akpm@linux-foundation.org>
  1 sibling, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2007-09-21  1:41 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman,
      Huang, Ying, linux-pm, Jeremy Maitin-Shepard

On Fri, 21 Sep 2007 11:19:59 +1000 Nigel Cunningham <ncunningham@crca.org.au> wrote:

> Hi.
>
> On Friday 21 September 2007 11:06:23 Andrew Morton wrote:
> > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> >
> > > Hi Andrew.
> > >
> > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote:
> > > > Seems like good enough for -mm to me.
> > > >
> > > > Pavel
> > >
> > > Andrew, if I recall correctly, you said a while ago that you didn't want
> > > another hibernation implementation in the vanilla kernel. If you're going to
> > > consider merging this kexec code, will you also please consider merging
> > > TuxOnIce?
> > >
> >
> > The theory is that kexec-based hibernation will mainly use preexisting
> > kexec code and will permit us to delete the existing hibernation
> > implementation.
> >
> > That's different from replacing it.
>
> TuxOnIce doesn't remove the existing implementation either. It can
> transparently replace it, but you can enable/disable that at compile time.

Right. So we end up with two implementations in-tree. Whereas
kexec-based-hibernation leads us to having zero implementations in-tree.

See, it's different.
[parent not found: <20070920184106.79e1858a.akpm@linux-foundation.org>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <20070920184106.79e1858a.akpm@linux-foundation.org> @ 2007-09-21 1:57 ` Nigel Cunningham [not found] ` <200709211157.28622.nigel@nigel.suspend2.net> ` (2 subsequent siblings) 3 siblings, 0 replies; 42+ messages in thread From: Nigel Cunningham @ 2007-09-21 1:57 UTC (permalink / raw) To: Andrew Morton Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Jeremy Maitin-Shepard Hi. On Friday 21 September 2007 11:41:06 Andrew Morton wrote: > > On Friday 21 September 2007 11:06:23 Andrew Morton wrote: > > > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham > > <nigel@nigel.suspend2.net> wrote: > > > > > > > Hi Andrew. > > > > > > > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote: > > > > > Seems like good enough for -mm to me. > > > > > > > > > > Pavel > > > > > > > > Andrew, if I recall correctly, you said a while ago that you didn't want > > > > another hibernation implementation in the vanilla kernel. If you're going > > to > > > > consider merging this kexec code, will you also please consider merging > > > > TuxOnIce? > > > > > > > > > > The theory is that kexec-based hibernation will mainly use preexisting > > > kexec code and will permit us to delete the existing hibernation > > > implementation. > > > > > > That's different from replacing it. > > > > TuxOnIce doesn't remove the existing implementation either. It can > > transparently replace it, but you can enable/disable that at compile time. > > Right. So we end up with two implementations in-tree. Whereas > kexec-based-hibernation leads us to having zero implementations in-tree. > > See, it's different. That's not true. Kexec will itself be an implementation, otherwise you'd end up with people screaming about no hibernation support. And it won't result in the complete removal of the existing hibernation code from the kernel. 
At the very least, it's going to want the kernel being hibernated to have an interface by which it can find out which pages need to be saved. I wouldn't be surprised if it also ends up with an interface in which the kernel being hibernated tells it what bdev/sectors in which to save the image as well (otherwise you're going to need a dedicated, otherwise untouched partition exclusively for the kexec'd kernel to use), or what network settings to use if it wants to try to save the image to a network storage device. On top of that, there are all the issues related to device reinitialisation and so on, and it looks like there's greatly increased pain for users wanting to configure this new implementation. Kexec is by no means proven to be the panacea for all the issues. Regards, Nigel -- Nigel Cunningham Pastor Christian Reformed Church of Cobden Victoria, Australia +61 3 5595 1185 ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <200709211157.28622.nigel@nigel.suspend2.net>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211157.28622.nigel@nigel.suspend2.net> @ 2007-09-21 2:18 ` Huang, Ying [not found] ` <1190341137.21818.52.camel@caritas-dev.intel.com> ` (3 subsequent siblings) 4 siblings, 0 replies; 42+ messages in thread From: Huang, Ying @ 2007-09-21 2:18 UTC (permalink / raw) To: Nigel Cunningham Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On Fri, 2007-09-21 at 11:57 +1000, Nigel Cunningham wrote: > Hi. > > On Friday 21 September 2007 11:41:06 Andrew Morton wrote: > > > On Friday 21 September 2007 11:06:23 Andrew Morton wrote: > > > > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham > > > <nigel@nigel.suspend2.net> wrote: > > > > > > > > > Hi Andrew. > > > > > > > > > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote: > > > > > > Seems like good enough for -mm to me. > > > > > > > > > > > > Pavel > > > > > > > > > > Andrew, if I recall correctly, you said a while ago that you didn't > want > > > > > another hibernation implementation in the vanilla kernel. If you're > going > > > to > > > > > consider merging this kexec code, will you also please consider > merging > > > > > TuxOnIce? > > > > > > > > > > > > > The theory is that kexec-based hibernation will mainly use preexisting > > > > kexec code and will permit us to delete the existing hibernation > > > > implementation. > > > > > > > > That's different from replacing it. > > > > > > TuxOnIce doesn't remove the existing implementation either. It can > > > transparently replace it, but you can enable/disable that at compile time. > > > > Right. So we end up with two implementations in-tree. Whereas > > kexec-based-hibernation leads us to having zero implementations in-tree. > > > > See, it's different. > > That's not true. Kexec will itself be an implementation, otherwise you'd end > up with people screaming about no hibernation support. 
And it won't result in > the complete removal of the existing hibernation code from the kernel. At the > very least, it's going to want the kernel being hibernated to have an > interface by which it can find out which pages need to be saved. I wouldn't This has been done by kexec/kdump guys. There is a makedumpfile utility and vmcoreinfo kernel mechanism to implement this. We can just reuse the work of kexec/kdump. > be surprised if it also ends up with an interface in which the kernel being > hibernated tells it what bdev/sectors in which to save the image as well > (otherwise you're going to need a dedicated, otherwise untouched partition > exclusively for the kexec'd kernel to use), or what network settings to use > if it wants to try to save the image to a network storage device. On top of These can be done in user space. The image writing will be done in user space for kexec base hibernation. > that, there are all the issues related to device reinitialisation and so on, Yes. Device reinitialisation is needed. But all in all, kexec based hibernation can be much simpler on the kernel side. > and it looks like there's greatly increased pain for users wanting to > configure this new implementation. Kexec is by no means proven to be the > panacea for all the issues. Configuration is a problem, we will work on it. But, because it is based on kexec/kdump instead of starting from scratch, the duplicated part between hibernation and kexec/kdump can be eliminated. Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <1190341137.21818.52.camel@caritas-dev.intel.com>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <1190341137.21818.52.camel@caritas-dev.intel.com> @ 2007-09-21 2:25 ` Nigel Cunningham [not found] ` <200709211225.25874.nigel@nigel.suspend2.net> 1 sibling, 0 replies; 42+ messages in thread From: Nigel Cunningham @ 2007-09-21 2:25 UTC (permalink / raw) To: Huang, Ying Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard Hi. On Friday 21 September 2007 12:18:57 Huang, Ying wrote: > > That's not true. Kexec will itself be an implementation, otherwise you'd end > > up with people screaming about no hibernation support. And it won't result in > > the complete removal of the existing hibernation code from the kernel. At the > > very least, it's going to want the kernel being hibernated to have an > > interface by which it can find out which pages need to be saved. I wouldn't > > This has been done by kexec/kdump guys. There is a makedumpfile utility > and vmcoreinfo kernel mechanism to implement this. We can just reuse the > work of kexec/kdump. You've already said that you are currently saving all pages. How are you going to avoid saving free pages if you don't get the information from the kernel being saved? This will require more than just code reuse. > > be surprised if it also ends up with an interface in which the kernel being > > hibernated tells it what bdev/sectors in which to save the image as well > > (otherwise you're going to need a dedicated, otherwise untouched partition > > exclusively for the kexec'd kernel to use), or what network settings to use > > if it wants to try to save the image to a network storage device. On top of > > These can be done in user space. The image writing will be done in user > space for kexec base hibernation. That only complicates things more. 
Now you need to get the information on where to save the image from the kernel being saved, then transfer it to userspace after switching to the kexec kernel. That's more kernel code, not less. > > that, there are all the issues related to device reinitialisation and so on, > > Yes. Device reinitialisation is needed. But all in all, kexec based > hibernation can be much simpler on the kernel side. Sorry, but I'm yet to be convinced. I'm not unwilling, I'm just not there yet. > > and it looks like there's greatly increased pain for users wanting to > > configure this new implementation. Kexec is by no means proven to be the > > panacea for all the issues. > > Configuration is a problem, we will work on it. > > But, because it is based on kexec/kdump instead of starting from > scratch, the duplicated part between hibernation and kexec/kdump can be > eliminated. Regards, Nigel -- Nigel, Michelle and Alisdair Cunningham 5 Mitchell Street Cobden 3266 Victoria, Australia ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <200709211225.25874.nigel@nigel.suspend2.net>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211225.25874.nigel@nigel.suspend2.net> @ 2007-09-21 2:45 ` Huang, Ying [not found] ` <1190342757.21818.75.camel@caritas-dev.intel.com> 1 sibling, 0 replies; 42+ messages in thread From: Huang, Ying @ 2007-09-21 2:45 UTC (permalink / raw) To: Nigel Cunningham Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On Fri, 2007-09-21 at 12:25 +1000, Nigel Cunningham wrote: > Hi. > > On Friday 21 September 2007 12:18:57 Huang, Ying wrote: > > > That's not true. Kexec will itself be an implementation, otherwise you'd > end > > > up with people screaming about no hibernation support. And it won't result > in > > > the complete removal of the existing hibernation code from the kernel. At > the > > > very least, it's going to want the kernel being hibernated to have an > > > interface by which it can find out which pages need to be saved. I > wouldn't > > > > This has been done by kexec/kdump guys. There is a makedumpfile utility > > and vmcoreinfo kernel mechanism to implement this. We can just reuse the > > work of kexec/kdump. > > You've already said that you are currently saving all pages. How are you going > to avoid saving free pages if you don't get the information from the kernel > being saved? This will require more than just code reuse. I have not tried "makedumpfile". The "makedumpfile" avoids saving free pages through checking the "mem_map" of the original kernel. I think there is nothing prevent it been used for kexec based hibernation image writing. This is an example of duplicated effort between kexec/kdump and original hibernation implementation. Both kexec/kdump and hibernation need to save memory image without saving the free pages. This can be done once instead of twice. 
> > > be surprised if it also ends up with an interface in which the kernel > being > > > hibernated tells it what bdev/sectors in which to save the image as well > > > (otherwise you're going to need a dedicated, otherwise untouched partition > > > exclusively for the kexec'd kernel to use), or what network settings to > use > > > if it wants to try to save the image to a network storage device. On top > of > > > > These can be done in user space. The image writing will be done in user > > space for kexec base hibernation. > > That only complicates things more. Now you need to get the information on > where to save the image from the kernel being saved, then transfer it to > userspace after switching to the kexec kernel. That's more kernel code, not > less. This is fairly simple in fact. For example, you can specify the bdev/sectors in kernel command line when do kexec load "kexec -l <...> --append='...'", then the image writing system can get it through "cat /proc/cmdline". > > > that, there are all the issues related to device reinitialisation and so > on, > > > > Yes. Device reinitialisation is needed. But all in all, kexec based > > hibernation can be much simpler on the kernel side. > > Sorry, but I'm yet to be convinced. I'm not unwilling, I'm just not there yet. > > > > and it looks like there's greatly increased pain for users wanting to > > > configure this new implementation. Kexec is by no means proven to be the > > > panacea for all the issues. > > > > Configuration is a problem, we will work on it. > > > > But, because it is based on kexec/kdump instead of starting from > > scratch, the duplicated part between hibernation and kexec/kdump can be > > eliminated. > Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
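[Editorial sketch] The command-line hand-off Huang Ying describes above (pass the target via "kexec -l <...> --append='...'", then read /proc/cmdline in the image-writing system) could look roughly like this on the reading side. The thread names no parameters, so hib_dev= and hib_extents= are invented for illustration, as is the "start-end,start-end" sector-range syntax. The parser also assumes values contain no spaces or quoting.

```python
# Sketch of reading an image target from the kernel command line.
# Assumptions (not from the thread): the parameter names "hib_dev" and
# "hib_extents" and the "start-end,start-end" sector syntax are hypothetical.


def parse_image_target(cmdline):
    """Return (device, [(first_sector, last_sector), ...]) from a cmdline string."""
    params = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
    device = params.get("hib_dev")
    extents = []
    for item in params.get("hib_extents", "").split(","):
        if item:
            start, end = item.split("-")
            extents.append((int(start), int(end)))
    return device, extents


example = "root=/dev/sda1 ro hib_dev=/dev/sda2 hib_extents=2048-4095,8192-8199"
target = parse_image_target(example)
```

In the kexec'd kernel the input string would come from reading /proc/cmdline instead of the literal above. A heavily fragmented swap file would produce a long extent list, which is where command-line length limits could start to matter.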
[parent not found: <1190342757.21818.75.camel@caritas-dev.intel.com>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <1190342757.21818.75.camel@caritas-dev.intel.com> @ 2007-09-21 2:58 ` Nigel Cunningham [not found] ` <200709211259.00195.nigel@nigel.suspend2.net> 2007-09-22 22:02 ` Alon Bar-Lev 2 siblings, 0 replies; 42+ messages in thread From: Nigel Cunningham @ 2007-09-21 2:58 UTC (permalink / raw) To: Huang, Ying Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard Hi. On Friday 21 September 2007 12:45:57 Huang, Ying wrote: > On Fri, 2007-09-21 at 12:25 +1000, Nigel Cunningham wrote: > > Hi. > > > > On Friday 21 September 2007 12:18:57 Huang, Ying wrote: > > > > That's not true. Kexec will itself be an implementation, otherwise you'd > > end > > > > up with people screaming about no hibernation support. And it won't result > > in > > > > the complete removal of the existing hibernation code from the kernel. At > > the > > > > very least, it's going to want the kernel being hibernated to have an > > > > interface by which it can find out which pages need to be saved. I > > wouldn't > > > > > > This has been done by kexec/kdump guys. There is a makedumpfile utility > > > and vmcoreinfo kernel mechanism to implement this. We can just reuse the > > > work of kexec/kdump. > > > > You've already said that you are currently saving all pages. How are you going > > to avoid saving free pages if you don't get the information from the kernel > > being saved? This will require more than just code reuse. > > I have not tried "makedumpfile". The "makedumpfile" avoids saving free > pages through checking the "mem_map" of the original kernel. I think > there is nothing prevent it been used for kexec based hibernation image > writing. > > This is an example of duplicated effort between kexec/kdump and original > hibernation implementation. Both kexec/kdump and hibernation need to > save memory image without saving the free pages. 
This can be done once > instead of twice. Ok. > > > > be surprised if it also ends up with an interface in which the kernel > > being > > > > hibernated tells it what bdev/sectors in which to save the image as well > > > > (otherwise you're going to need a dedicated, otherwise untouched partition > > > > exclusively for the kexec'd kernel to use), or what network settings to > > use > > > > if it wants to try to save the image to a network storage device. On top > > of > > > > > > These can be done in user space. The image writing will be done in user > > > space for kexec base hibernation. > > > > That only complicates things more. Now you need to get the information on > > where to save the image from the kernel being saved, then transfer it to > > userspace after switching to the kexec kernel. That's more kernel code, not > > less. > > This is fairly simple in fact. For example, you can specify the > bdev/sectors in kernel command line when do kexec load "kexec -l <...> > --append='...'", then the image writing system can get it through > "cat /proc/cmdline". Sounds doable, as long as you can cope with long command lines (which shouldn't be a biggie). (If you've got a swapfile or parts of a swap partition already in use, it can be quite fragmented). Andrew, you're seeing that it really doesn't mean the removal of all hibernation code from the kernel being suspended, aren't you? (And if the kexec'd kernel is the same binary, then there's more code again). Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <200709211259.00195.nigel@nigel.suspend2.net>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] ` <200709211259.00195.nigel@nigel.suspend2.net>
@ 2007-09-21  4:46   ` Eric W. Biederman
       [not found]   ` <m1r6ksiq27.fsf@ebiederm.dsl.xmission.com>
  1 sibling, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2007-09-21  4:46 UTC (permalink / raw)
  To: nigel
  Cc: Kexec Mailing List, linux-kernel, Huang, Ying, Andrew Morton,
      linux-pm, Jeremy Maitin-Shepard

Nigel Cunningham <nigel@nigel.suspend2.net> writes:
>
> Sounds doable, as long as you can cope with long command lines (which
> shouldn't be a biggie). (If you've got a swapfile or parts of a swap
> partition already in use, it can be quite fragmented).

Hmm. This is an interesting problem. Sharing a swap file or a swap
partition with the actual swap of user space pages does seem to be
a limitation of this approach.

Although the fact that it is simple to write to a separate file may
be a reasonable compensation.

> Andrew, you're seeing that it really doesn't mean the removal of all
> hibernation code from the kernel being suspended, aren't you? (And if the
> kexec'd kernel is the same binary, then there's more code again).

More binary size yes not more code to maintain.

As for the rest the current implementation is small enough and allows
for enough beyond hibernation I think it makes sense to eventually merge
assuming a good clean implementation can be achieved.

Eric
[parent not found: <m1r6ksiq27.fsf@ebiederm.dsl.xmission.com>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] ` <m1r6ksiq27.fsf@ebiederm.dsl.xmission.com>
@ 2007-09-21  9:45   ` Pavel Machek
       [not found]   ` <20070921094512.GA20149@elf.ucw.cz>
  1 sibling, 0 replies; 42+ messages in thread
From: Pavel Machek @ 2007-09-21  9:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: nigel, Kexec Mailing List, linux-kernel, Huang, Ying, Andrew Morton,
      linux-pm, Jeremy Maitin-Shepard

Hi!

> >
> > Sounds doable, as long as you can cope with long command lines (which
> > shouldn't be a biggie). (If you've got a swapfile or parts of a swap
> > partition already in use, it can be quite fragmented).
>
> Hmm. This is an interesting problem. Sharing a swap file or a swap
> partition with the actual swap of user space pages does seem to be
> a limitation of this approach.
>
> Although the fact that it is simple to write to a separate file may
> be a reasonable compensation.

I'm not sure how you'd write it to a separate file. Notice that kjump
kernel may not mount journalling filesystems, not even
read-only. (Ext3 replays journal in that case). You could pass block
numbers from the original kernel...

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
[parent not found: <20070921094512.GA20149@elf.ucw.cz>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <20070921094512.GA20149@elf.ucw.cz> @ 2007-09-26 20:30 ` Joseph Fannin [not found] ` <20070926203036.GF31759@nineveh.local> 1 sibling, 0 replies; 42+ messages in thread From: Joseph Fannin @ 2007-09-26 20:30 UTC (permalink / raw) To: Pavel Machek Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On Fri, Sep 21, 2007 at 11:45:12AM +0200, Pavel Machek wrote: > Hi! > > > > > > Sounds doable, as long as you can cope with long command lines (which > > > shouldn't be a biggie). (If you've got a swapfile or parts of a swap > > > partition already in use, it can be quite fragmented). > > > > Hmm. This is an interesting problem. Sharing a swap file or a swap > > partition with the actual swap of user space pages does seem to be > > a limitation of this approach. > > > > Although the fact that it is simple to write to a separate file may > > be a reasonable compensation. > > I'm not sure how you'd write it to a separate file. Notice that kjump > kernel may not mount journalling filesystems, not even > read-only. (Ext3 replays journal in that case). You could pass block > numbers from the original kernel... The ext3 thing is a bug, the case for which I don't think has been adequately explained to the ext[34] folks. There should be at least a no_replay mount flag available, or something. It has ramifications for more than just hibernation. And yeah, I'm gonna bring up the swap files thing again. If you can hibernate to a swap file, you can hibernate to a dedicated hibernation file, and vice versa. If you can't hibernate to a swap file, then swap files are effectively unsupported for any system you might want to hibernate. <handwave> I wonder what embedded folks would think about that </handwave>. 
But, in my ignorance, I'm not sure even fixing the ext3 bug will guarantee you consistent metadata so that you can handle a swap/hibernate file. You can do a sync(), but how do you make that not race against running processes without the freezer, or blkdev snapshots? I guess uswsusp and the-patch-previously-known-as-suspend2 handle this somehow, though. (It's that same ignorance that has me waiting for someone with established credit with kernel people to make that argument for the ext3 bug, so I can hang my own reasons for thinking that it's bad off of theirs). -- Joseph Fannin jfannin@gmail.com ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <20070926203036.GF31759@nineveh.local>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <20070926203036.GF31759@nineveh.local> @ 2007-09-26 20:52 ` Nigel Cunningham 2007-09-27 6:33 ` Huang, Ying [not found] ` <1190874834.21818.300.camel@caritas-dev.intel.com> 2 siblings, 0 replies; 42+ messages in thread From: Nigel Cunningham @ 2007-09-26 20:52 UTC (permalink / raw) To: Joseph Fannin Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard Hi. On Thursday 27 September 2007 06:30:36 Joseph Fannin wrote: > On Fri, Sep 21, 2007 at 11:45:12AM +0200, Pavel Machek wrote: > > Hi! > > > > > > > > Sounds doable, as long as you can cope with long command lines (which > > > > shouldn't be a biggie). (If you've got a swapfile or parts of a swap > > > > partition already in use, it can be quite fragmented). > > > > > > Hmm. This is an interesting problem. Sharing a swap file or a swap > > > partition with the actual swap of user space pages does seem to be > > > a limitation of this approach. > > > > > > Although the fact that it is simple to write to a separate file may > > > be a reasonable compensation. > > > > I'm not sure how you'd write it to a separate file. Notice that kjump > > kernel may not mount journalling filesystems, not even > > read-only. (Ext3 replays journal in that case). You could pass block > > numbers from the original kernel... > > The ext3 thing is a bug, the case for which I don't think has been > adequately explained to the ext[34] folks. There should be at least a > no_replay mount flag available, or something. It has ramifications > for more than just hibernation. > > And yeah, I'm gonna bring up the swap files thing again. If you > can hibernate to a swap file, you can hibernate to a dedicated > hibernation file, and vice versa. > > If you can't hibernate to a swap file, then swap files are > effectively unsupported for any system you might want to hibernate. 
> <handwave> I wonder what embedded folks would think about that > </handwave>. > > But, in my ignorance, I'm not sure even fixing the ext3 bug will > guarantee you consistent metadata so that you can handle a > swap/hibernate file. You can do a sync(), but how do you make that > not race against running processes without the freezer, or blkdev > snapshots? > > I guess uswsusp and the-patch-previously-known-as-suspend2 handle > this somehow, though. > > (It's that same ignorance that has me waiting for someone with > established credit with kernel people to make that argument for the > ext3 bug, so I can hang my own reasons for thinking that it's bad off > of theirs). I haven't looked at swsusp support, but TuxOnIce handles all storage (swap partitions, swap files and ordinary files) by first allocating swap (if we're using swap), then bmapping the storage we're going to use. After that, we can freeze filesystems and processes with impunity. The allocated storage is then viewed as just a collection of bdevs, each with an ordered chain of extents defining which blocks we're going to read/write - a series of tapes if you like. In the image header, we store dev_ts and the block chains, together with the configuration information. As long as the same bdevs are configured at boot time prior to the echo > /sys/power/resume, we're in business. Filesystems don't need to be mounted because we don't use filesystem code anyway. (LVM etc does though in so far as it's needed to make the dev_t match the device again). This matches with what you said above about hibernating to swap files and dedicated hibernation files - TuxOnIce uses exactly the same code to do the i/o to both; the variation is in the code to recognise the image header and allocate/free/bmap storage. <not a filesystem expert> Personally, I don't think ext[34] is broken. If there's data being left in the journal that will need replaying, then mounting without replaying the journal sounds wrong. 
Perhaps you should instead be arguing that nothing should be left in the journal after a filesystem freeze. But, of course, current code isn't doing a filesystem freeze (just a process freeze) and the kexec guys want to take even that away. </not a filesystem expert> In short, I agree. AFAICS, you need both the process freezer and filesystem freezing to make this thing fly properly. Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. ^ permalink raw reply [flat|nested] 42+ messages in thread
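The "series of tapes" picture above can be made concrete with a small sketch. This is only an illustration of the idea as described in the message, not TuxOnIce code: the names `Extent`, `Tape` and `blocks_in_order` are hypothetical, and the device numbers and block ranges are made up.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Extent:
    """A contiguous run of blocks [first, last] on one block device."""
    first: int
    last: int


@dataclass
class Tape:
    """One block device plus the ordered extents allocated on it."""
    dev_t: Tuple[int, int]  # (major, minor), as it might be stored in the image header
    extents: List[Extent]


def blocks_in_order(tapes: List[Tape]) -> Iterator[Tuple[Tuple[int, int], int]]:
    """Yield (dev_t, block) pairs in the order the image would be read or written."""
    for tape in tapes:
        for extent in tape.extents:
            for block in range(extent.first, extent.last + 1):
                yield tape.dev_t, block


# Assumed example: a fragmented swap file on 8:3 followed by a dedicated file on 8:5.
tapes = [
    Tape(dev_t=(8, 3), extents=[Extent(100, 102), Extent(500, 500)]),
    Tape(dev_t=(8, 5), extents=[Extent(40, 41)]),
]
layout = list(blocks_in_order(tapes))
```

Because the reader and writer only walk (dev_t, block) pairs, nothing in this model depends on a mounted filesystem, which matches the point that swap files and dedicated hibernation files can share the same I/O path.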
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <20070926203036.GF31759@nineveh.local> 2007-09-26 20:52 ` Nigel Cunningham @ 2007-09-27 6:33 ` Huang, Ying [not found] ` <1190874834.21818.300.camel@caritas-dev.intel.com> 2 siblings, 0 replies; 42+ messages in thread From: Huang, Ying @ 2007-09-27 6:33 UTC (permalink / raw) To: Joseph Fannin Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On Wed, 2007-09-26 at 16:30 -0400, Joseph Fannin wrote: > But, in my ignorance, I'm not sure even fixing the ext3 bug will > guarantee you consistent metadata so that you can handle a > swap/hibernate file. You can do a sync(), but how do you make that > not race against running processes without the freezer, or blkdev > snapshots? > > I guess uswsusp and the-patch-previously-known-as-suspend2 handle > this somehow, though. The image-writing kernel of kexec-based hibernation runs in a controlled way. It is not used by normal users, so only the processes that are really necessary need to run. For example, it is possible that there is only one user process, the image-writing process, running in the image-writing kernel. So no freezer or blkdev snapshot is needed. Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <1190874834.21818.300.camel@caritas-dev.intel.com>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <1190874834.21818.300.camel@caritas-dev.intel.com> @ 2007-09-27 6:35 ` Nigel Cunningham 0 siblings, 0 replies; 42+ messages in thread From: Nigel Cunningham @ 2007-09-27 6:35 UTC (permalink / raw) To: Huang, Ying Cc: nigel, Kexec Mailing List, linux-kernel, Joseph Fannin, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard Hi. On Thursday 27 September 2007 16:33:54 Huang, Ying wrote: > On Wed, 2007-09-26 at 16:30 -0400, Joseph Fannin wrote: > > But, in my ignorance, I'm not sure even fixing the ext3 bug will > > guarantee you consistent metadata so that you can handle a > > swap/hibernate file. You can do a sync(), but how do you make that > > not race against running processes without the freezer, or blkdev > > snapshots? > > > > I guess uswsusp and the-patch-previously-known-as-suspend2 handle > > this somehow, though. > > The image-writing kernel of kexec based hibernation run in a controlled > way. It is not used by normal user, so only really necessary process > need to be run. For example, it is possible that there is only one user > process -- the image-writing process running in image-writing kernel. > So, no freezer or blkdev snapshot is needed. You're thinking of the wrong kernel - we were talking about prior to switching to the kexec'd kernel while suspending. Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <1190342757.21818.75.camel@caritas-dev.intel.com> 2007-09-21 2:58 ` Nigel Cunningham [not found] ` <200709211259.00195.nigel@nigel.suspend2.net> @ 2007-09-22 22:02 ` Alon Bar-Lev 2 siblings, 0 replies; 42+ messages in thread From: Alon Bar-Lev @ 2007-09-22 22:02 UTC (permalink / raw) To: Huang, Ying Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On 9/21/07, Huang, Ying <ying.huang@intel.com> wrote: > This is fairly simple in fact. For example, you can specify the > bdev/sectors in kernel command line when do kexec load "kexec -l <...> > --append='...'", then the image writing system can get it through > "cat /proc/cmdline". I hope you take into account encrypted swap configuration. Currently all three suspend implementations support using encrypted swap in order to suspend/resume. A configuration which forces the user to remap encryption on the kexec kernel during suspend is not valid. Best Regards, Alon Bar-Lev. ^ permalink raw reply [flat|nested] 42+ messages in thread
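For reference, the command-line approach quoted above could look roughly like the following on the image-writing side. This is a sketch under stated assumptions: the parameter name `hib_image=` and its `<major>:<minor>@<start>+<count>,...` format are invented here for illustration and are not defined by the patch or by kexec-tools.

```python
from typing import List, Optional, Tuple

PARAM = "hib_image="  # hypothetical parameter name, not defined by the patch


def parse_image_location(cmdline: str) -> Optional[Tuple[str, List[Tuple[int, int]]]]:
    """Return (device, [(start_sector, sector_count), ...]) or None if absent.

    Assumed format: hib_image=<major>:<minor>@<start>+<count>[,<start>+<count>...]
    """
    for word in cmdline.split():
        if word.startswith(PARAM):
            device, _, ranges = word[len(PARAM):].partition("@")
            extents = []
            for item in ranges.split(","):
                start, _, count = item.partition("+")
                extents.append((int(start), int(count)))
            return device, extents
    return None


# Stand-in for the contents of /proc/cmdline in the kexec'd kernel.
example = "root=/dev/sda1 ro hib_image=8:3@2048+4096,9000+128 quiet"
location = parse_image_location(example)
```

In the image-writing kernel the input would come from reading `/proc/cmdline` rather than a literal string. Note that a raw sector list like this says nothing about how the target device is decrypted, so it does not by itself address the encrypted swap concern raised here, and a heavily fragmented swap file would make the parameter long.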
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211157.28622.nigel@nigel.suspend2.net> 2007-09-21 2:18 ` Huang, Ying [not found] ` <1190341137.21818.52.camel@caritas-dev.intel.com> @ 2007-09-21 3:33 ` Eric W. Biederman 2007-09-21 4:16 ` Andrew Morton [not found] ` <m18x70ofp3.fsf@ebiederm.dsl.xmission.com> 4 siblings, 0 replies; 42+ messages in thread From: Eric W. Biederman @ 2007-09-21 3:33 UTC (permalink / raw) To: Nigel Cunningham Cc: nigel, Kexec Mailing List, linux-kernel, Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard Nigel Cunningham <nigel@nigel.suspend2.net> writes: > > That's not true. Kexec will itself be an implementation, otherwise you'd end > up with people screaming about no hibernation support. There needs to be an implementation of hibernation based on kexec with return yes. > And it won't result in > the complete removal of the existing hibernation code from the kernel. At the > very least, it's going to want the kernel being hibernated to have an > interface by which it can find out which pages need to be saved. That interface should be running kernel -> user space -> target kernel. Not direct kernel to kernel. > I wouldn't > be surprised if it also ends up with an interface in which the kernel being > hibernated tells it what bdev/sectors in which to save the image as well > (otherwise you're going to need a dedicated, otherwise untouched partition > exclusively for the kexec'd kernel to use), or what network settings to use > if it wants to try to save the image to a network storage device. initramfs. We already seem to have that interface. And distros seems to do a pretty decent job of using it to configure systems. > On top of > that, there are all the issues related to device reinitialisation and so on, Yes. > and it looks like there's greatly increased pain for users wanting to > configure this new implementation. Not to be callous but that really is a user space and distro issue. 
> Kexec is by no means proven to be the panacea for all the issues. I agree. I'm still not quite convinced it will do a satisfactory job. But I think it does make sense to implement a general kexec with return and see if that can reasonably be used for handling hibernation issues. If done cleanly and with care the implementation won't be hibernation specific. Frankly this looks like the best way I can see to implement a general mechanism for calling silly firmware/BIOS/EFI services after we have a kernel up and running. It's a little bit like allowing X to call iopl(3) and do inb/outb directly. The configuration issues you raise pretty much exist for kexec on panic, and they seem to be being resolved for that case in a reasonable way. I do agree that the current kexec+return effort seems to be one of those unfortunate cases where we give every mechanism in the kernel to do something in user space and then no one actually implements the user space. That doesn't do anyone any good. For hibernation we don't have the absolute need to step outside of the current kernel that we do in the kexec on panic approach. However we have this practical fight about mechanism and policy, and kexec with return has this seductive allure that it appears to be the minimal necessary mechanism in the kernel. No one has yet attacked the hard problem of coming up with separate hibernate methods for drivers. This should be the hard part of the puzzle, and the recurring work from a kernel maintenance point of view. There is some reason to hope that maintenance will be a little simpler because you can get at all of the distinct pieces of the puzzle. Currently kexec with return appears to require the minimal amount of mechanism in the kernel and leaves the policy to someplace else, plus the code is not hibernation specific. We could use it to make runtime EFI calls, or to implement cooperative multitasking between kernels.
My current opinion is that the patches are starting to get close enough that it isn't a waste of my time reviewing them. But there is still a fair amount to be done before this code is in shape for us to merge it into the kernel. At 500 or so lines I don't feel bad about pushing back until all of the core user interface issues are resolved, and we have the code calling the proper driver methods. Eric ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211157.28622.nigel@nigel.suspend2.net> ` (2 preceding siblings ...) 2007-09-21 3:33 ` Eric W. Biederman @ 2007-09-21 4:16 ` Andrew Morton [not found] ` <m18x70ofp3.fsf@ebiederm.dsl.xmission.com> 4 siblings, 0 replies; 42+ messages in thread From: Andrew Morton @ 2007-09-21 4:16 UTC (permalink / raw) To: Nigel Cunningham Cc: Jeremy, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Maitin-Shepard On Fri, 21 Sep 2007 11:57:26 +1000 Nigel Cunningham <nigel@nigel.suspend2.net> wrote: > Hi. > > On Friday 21 September 2007 11:41:06 Andrew Morton wrote: > > > On Friday 21 September 2007 11:06:23 Andrew Morton wrote: > > > > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham > > > <nigel@nigel.suspend2.net> wrote: > > > > > > > > > Hi Andrew. > > > > > > > > > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote: > > > > > > Seems like good enough for -mm to me. > > > > > > > > > > > > Pavel > > > > > > > > > > Andrew, if I recall correctly, you said a while ago that you didn't > want > > > > > another hibernation implementation in the vanilla kernel. If you're > going > > > to > > > > > consider merging this kexec code, will you also please consider > merging > > > > > TuxOnIce? > > > > > > > > > > > > > The theory is that kexec-based hibernation will mainly use preexisting > > > > kexec code and will permit us to delete the existing hibernation > > > > implementation. > > > > > > > > That's different from replacing it. > > > > > > TuxOnIce doesn't remove the existing implementation either. It can > > > transparently replace it, but you can enable/disable that at compile time. > > > > Right. So we end up with two implementations in-tree. Whereas > > kexec-based-hibernation leads us to having zero implementations in-tree. > > > > See, it's different. > > That's not true. 
Kexec will itself be an implementation, otherwise you'd end > up with people screaming about no hibernation support. And it won't result in > the complete removal of the existing hibernation code from the kernel. At the > very least, it's going to want the kernel being hibernated to have an > interface by which it can find out which pages need to be saved. I wouldn't > be surprised if it also ends up with an interface in which the kernel being > hibernated tells it what bdev/sectors in which to save the image as well > (otherwise you're going to need a dedicated, otherwise untouched partition > exclusively for the kexec'd kernel to use), or what network settings to use > if it wants to try to save the image to a network storage device. On top of > that, there are all the issues related to device reinitialisation and so on, > and it looks like there's greatly increased pain for users wanting to > configure this new implementation. Kexec is by no means proven to be the > panacea for all the issues. > Maybe, maybe not, dunno. That's why we haven't merged it yet. If it ends up being no good, we won't merge it! ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <m18x70ofp3.fsf@ebiederm.dsl.xmission.com>]
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <m18x70ofp3.fsf@ebiederm.dsl.xmission.com> @ 2007-09-21 12:09 ` Rafael J. Wysocki [not found] ` <200709211409.25008.rjw@sisk.pl> 1 sibling, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 12:09 UTC (permalink / raw) To: linux-pm Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, Andrew Morton, Jeremy Maitin-Shepard On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > > > > That's not true. Kexec will itself be an implementation, otherwise you'd end > > up with people screaming about no hibernation support. > > There needs to be an implementation of hibernation based on kexec with > return yes. > > > And it won't result in > > the complete removal of the existing hibernation code from the kernel. At the > > very least, it's going to want the kernel being hibernated to have an > > interface by which it can find out which pages need to be saved. > > That interface should be running kernel -> user space -> target kernel. > Not direct kernel to kernel. > > > I wouldn't > > be surprised if it also ends up with an interface in which the kernel being > > hibernated tells it what bdev/sectors in which to save the image as well > > (otherwise you're going to need a dedicated, otherwise untouched partition > > exclusively for the kexec'd kernel to use), or what network settings to use > > if it wants to try to save the image to a network storage device. > > initramfs. We already seem to have that interface. And distros > seems to do a pretty decent job of using it to configure systems. > > > On top of > > that, there are all the issues related to device reinitialisation and so on, > > Yes. > > > and it looks like there's greatly increased pain for users wanting to > > configure this new implementation. 
> > Not to be callous but that really is a user space and distro issue. > > > Kexec is by no means proven to be the panacea for all the issues. > > I agree. I'm still not quite convinced it will do a satisfactory job. > But I think it does make sense to implement a general kexec with > return and see if that can reasonably be used for handling hibernation > issues. If done cleanly and with care the implementation won't be > hibernation specific. Yes, and that's worth doing anyway, IMO. > Frankly this looks like the best way I can see to implement a general > mechanism for calling silly firmware/BIOS/EFI services after we > have a kernel up and running. It's a little bit like allowing > X to call iopl(3) and do inb/outb directly. > > The configuration issues you raise pretty much exist for kexec on > panic, and they seem to be being resolved for that case in a > reasonable way. I do agree that the current kexec+return effort seems > to be one of those unfortunate cases where we give every mechanism in > the kernel to do something in user space and then no one actually > implements the user space. That doesn't do any one any good. > > For hibernation we don't have the absolute need to step outside of the > current kernel that we do in the kexec on panic approach. However we > have this practical fight about mechanism and policy, and kexec with > return has this seductive allure that it appears to be the minimal > necessary mechanism in the kernel. > > No one has yet attacked the hard problem of coming up with separate > hibernate methods for drivers. Well, I've been playing a bit with that for some time, but it's not easy by any means. In short, I'm seeing some problems related to the handling of ACPI that seem to shatter the entire idea of having separate hibernate methods, at least as far as ACPI systems are concerned. Greetings, Rafael ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <200709211409.25008.rjw@sisk.pl>]
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211409.25008.rjw@sisk.pl> @ 2007-09-21 13:14 ` huang ying [not found] ` <851fc09e0709210614q33cf3c81u1441fda17a66a6fd@mail.gmail.com> 1 sibling, 0 replies; 42+ messages in thread From: huang ying @ 2007-09-21 13:14 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Andrew Morton, Jeremy Maitin-Shepard On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > > > > > > That's not true. Kexec will itself be an implementation, otherwise you'd end > > > up with people screaming about no hibernation support. > > > > There needs to be an implementation of hibernation based on kexec with > > return yes. > > > > > And it won't result in > > > the complete removal of the existing hibernation code from the kernel. At the > > > very least, it's going to want the kernel being hibernated to have an > > > interface by which it can find out which pages need to be saved. > > > > That interface should be running kernel -> user space -> target kernel. > > Not direct kernel to kernel. > > > > > I wouldn't > > > be surprised if it also ends up with an interface in which the kernel being > > > hibernated tells it what bdev/sectors in which to save the image as well > > > (otherwise you're going to need a dedicated, otherwise untouched partition > > > exclusively for the kexec'd kernel to use), or what network settings to use > > > if it wants to try to save the image to a network storage device. > > > > initramfs. We already seem to have that interface. And distros > > seems to do a pretty decent job of using it to configure systems. > > > > > On top of > > > that, there are all the issues related to device reinitialisation and so on, > > > > Yes. 
> > > > > and it looks like there's greatly increased pain for users wanting to > > > configure this new implementation. > > > > Not to be callous but that really is a user space and distro issue. > > > > > Kexec is by no means proven to be the panacea for all the issues. > > > > I agree. I'm still not quite convinced it will do a satisfactory job. > > But I think it does make sense to implement a general kexec with > > return and see if that can reasonably be used for handling hibernation > > issues. If done cleanly and with care the implementation won't be > > hibernation specific. > > Yes, and that's worth doing anyway, IMO. > > > Frankly this looks like the best way I can see to implement a general > > mechanism for calling silly firmware/BIOS/EFI services after we > > have a kernel up and running. It's a little bit like allowing > > X to call iopl(3) and do inb/outb directly. > > > > The configuration issues you raise pretty much exist for kexec on > > panic, and they seem to be being resolved for that case in a > > reasonable way. I do agree that the current kexec+return effort seems > > to be one of those unfortunate cases where we give every mechanism in > > the kernel to do something in user space and then no one actually > > implements the user space. That doesn't do any one any good. > > > > For hibernation we don't have the absolute need to step outside of the > > current kernel that we do in the kexec on panic approach. However we > > have this practical fight about mechanism and policy, and kexec with > > return has this seductive allure that it appears to be the minimal > > necessary mechanism in the kernel. > > > > No one has yet attacked the hard problem of coming up with separate > > hibernate methods for drivers. > > Well, I've been playing a bit with that for some time, but it's not easy by any > means. 
> > In short, I'm seeing some problems related to the handling of ACPI that seem to > shatter the entire idea of having separate hibernate methods, at least as far > as ACPI systems are concerned. So sad to hear this. Can you give a few more details, or a link? Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <851fc09e0709210614q33cf3c81u1441fda17a66a6fd@mail.gmail.com>]
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <851fc09e0709210614q33cf3c81u1441fda17a66a6fd@mail.gmail.com> @ 2007-09-21 14:31 ` Rafael J. Wysocki [not found] ` <200709211631.19130.rjw@sisk.pl> 1 sibling, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 14:31 UTC (permalink / raw) To: huang ying Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Andrew Morton, Jeremy Maitin-Shepard On Friday, 21 September 2007 15:14, huang ying wrote: > On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: [--snip--] > > > > > > No one has yet attacked the hard problem of coming up with separate > > > hibernate methods for drivers. > > > > Well, I've been playing a bit with that for some time, but it's not easy by any > > means. > > > > In short, I'm seeing some problems related to the handling of ACPI that seem to > > shatter the entire idea of having separate hibernate methods, at least as far > > as ACPI systems are concerned. > > So sadly to hear this. Can you details it a little? Or a link? Well, the problem is that apparently some systems (eg. my HP nx6325) expect us to execute the _PTS ACPI global control method before creating the image _and_ to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the system into the sleep state. In particular, on nx6325, if we don't do that, then after the restore the status of the AC power will not be reported correctly (and if you replace the battery while in the sleep state, the battery status will not be updated correctly after the restore). Similar issues have been reported for other machines. Now, the ACPI specification requires us to put devices into low power states before executing _PTS and that's exactly what we're doing before a suspend to RAM. 
Thus, it seems that in general we need to do the same for hibernation on ACPI systems. Greetings, Rafael ^ permalink raw reply [flat|nested] 42+ messages in thread
[parent not found: <200709211631.19130.rjw@sisk.pl>]
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211631.19130.rjw@sisk.pl> @ 2007-09-21 14:45 ` Alan Stern 2007-09-21 15:27 ` Rafael J. Wysocki 2007-09-21 15:02 ` huang ying ` (3 subsequent siblings) 4 siblings, 1 reply; 42+ messages in thread From: Alan Stern @ 2007-09-21 14:45 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: Linux-pm mailing list, huang ying [CC: list trimmed] On Fri, 21 Sep 2007, Rafael J. Wysocki wrote: > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > to execute the _PTS ACPI global control method before creating the image _and_ > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > system into the sleep state. In particular, on nx6325, if we don't do that, > then after the restore the status of the AC power will not be reported > correctly (and if you replace the battery while in the sleep state, the > battery status will not be updated correctly after the restore). Similar > issues have been reported for other machines. > > Now, the ACPI specification requires us to put devices into low power states > before executing _PTS and that's exactly what we're doing before a suspend to > RAM. Thus, it seems that in general we need to do the same for hibernation on > ACPI systems. I'm confused. You seem to be saying that for hibernation the required sequence of steps is: 1. Put devices into low-power states 2. Execute _PTS method 3. Create and write out the image 4. Execute acpi_enter_sleep_state() Am I missing something -- a step to put devices back in their full-power states before writing out the image? After all, you can't write an image if the disk drive isn't at full power. Also, how exactly does this conflict with the requirements of the kexec-based approach? At what point in the above sequence would the kexec call be made? Alan Stern ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump 2007-09-21 14:45 ` Alan Stern @ 2007-09-21 15:27 ` Rafael J. Wysocki 0 siblings, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 15:27 UTC (permalink / raw) To: Alan Stern; +Cc: Linux-pm mailing list, huang ying On Friday, 21 September 2007 16:45, Alan Stern wrote: > [CC: list trimmed] > > On Fri, 21 Sep 2007, Rafael J. Wysocki wrote: > > > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > > to execute the _PTS ACPI global control method before creating the image _and_ > > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > > system into the sleep state. In particular, on nx6325, if we don't do that, > > then after the restore the status of the AC power will not be reported > > correctly (and if you replace the battery while in the sleep state, the > > battery status will not be updated correctly after the restore). Similar > > issues have been reported for other machines. > > > > Now, the ACPI specification requires us to put devices into low power states > > before executing _PTS and that's exactly what we're doing before a suspend to > > RAM. Thus, it seems that in general we need to do the same for hibernation on > > ACPI systems. > > I'm confused. You seem to be saying that for hibernation the required > sequence of steps is: > > 1. Put devices into low-power states > 2. Execute _PTS method > 3. Create and write out the image > 4. Execute acpi_enter_sleep_state() > > Am I missing something -- a step to put devices back in their > full-power states before writing out the image? After all, you can't > write an image if the disk drive isn't at full power. Well, of course we put devices into the full power states in order to create the image and we put them back into low power states (or switch them off, depending on the kernel version) before executing acpi_enter_sleep_state(ACPI_STATE_S4). 
However, all of that seems to be irrelevant for the above problem. Namely, it follows from my tests that if we don't execute _PTS before creating the image or we don't use acpi_enter_sleep_state(ACPI_STATE_S4) to finally go to sleep, the system will be semi-functional after the restore and the steps done in between don't actually matter. > Also, how exactly does this conflict with the requirements of the > kexec-based approach? I'm not sure whether or not this really conflicts with them, but the point is that if we have to put devices into low power states before executing _PTS, which is exactly what we do before a suspend to RAM, then separate hibernation methods for device drivers are not needed (*). > At what point in the above sequence would the kexec call be made? I think that kexec would have to be made after executing _PTS and acpi_enter_sleep_state(ACPI_STATE_S4) would have to be called by the image-saving kernel. However, I think that what we really should be doing is to: 1) put devices into low power states 2) execute _PTS 3) create the image 4) resume _only_ those devices that are needed to save the image 5) save the image 6) finalize the transition to the sleep state (please note that I've omitted some details, like the handling of the nonboot CPUs etc., which are difficult by themselves, but don't seem to be relevant here). (*) I think we'll need some hibernation-specific methods in order to carry out step 4) above, ie. to resume devices needed for saving the image. Greetings, Rafael ^ permalink raw reply [flat|nested] 42+ messages in thread
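The six-step ordering proposed above, and the two constraints the tests on the nx6325 point to, can be written down as a small ordering check. This is only a model of the sequence as described in the message: the step labels are hypothetical names chosen for this sketch, not kernel functions, and the rule list is an assumption distilled from the text.

```python
# Step labels are hypothetical; comments map them to the numbered steps in the message.
STEPS = [
    "suspend_devices",       # 1) put devices into low power states
    "execute_pts",           # 2) execute _PTS
    "create_image",          # 3) create the image
    "resume_image_devices",  # 4) resume only the devices needed to save the image
    "save_image",            # 5) save the image
    "enter_s4",              # 6) acpi_enter_sleep_state(ACPI_STATE_S4)
]

# (earlier step, later step, reason) triples taken from the discussion.
RULES = [
    ("suspend_devices", "execute_pts", "devices must be in low power states before _PTS"),
    ("execute_pts", "create_image", "_PTS must run before the image is created"),
    ("resume_image_devices", "save_image", "the disk must be usable before the image is saved"),
    ("save_image", "enter_s4", "the S4 transition comes last"),
]


def violations(sequence):
    """Return the reasons for every rule that the given step order breaks."""
    pos = {name: i for i, name in enumerate(sequence)}
    broken = []
    for before, after, why in RULES:
        if before not in pos or after not in pos or pos[before] >= pos[after]:
            broken.append(why)
    return broken


# An order that snapshots memory before _PTS, which the message reports as problematic.
bad_order = ["suspend_devices", "create_image", "execute_pts",
             "resume_image_devices", "save_image", "enter_s4"]
```

For a kexec-based flow, the message suggests the jump would sit after `execute_pts`, with `enter_s4` carried out by the image-saving kernel; the check above does not model which kernel runs each step.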
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211631.19130.rjw@sisk.pl> 2007-09-21 14:45 ` Alan Stern @ 2007-09-21 15:02 ` huang ying [not found] ` <851fc09e0709210802o3be2789s8e93410fa07f7066@mail.gmail.com> ` (2 subsequent siblings) 4 siblings, 0 replies; 42+ messages in thread From: huang ying @ 2007-09-21 15:02 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Andrew Morton, Jeremy Maitin-Shepard On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Friday, 21 September 2007 15:14, huang ying wrote: > > On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > > > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > [--snip--] > > > > > > > > No one has yet attacked the hard problem of coming up with separate > > > > hibernate methods for drivers. > > > > > > Well, I've been playing a bit with that for some time, but it's not easy by any > > > means. > > > > > > In short, I'm seeing some problems related to the handling of ACPI that seem to > > > shatter the entire idea of having separate hibernate methods, at least as far > > > as ACPI systems are concerned. > > > > So sadly to hear this. Can you details it a little? Or a link? > > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > to execute the _PTS ACPI global control method before creating the image _and_ > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > system into the sleep state. In particular, on nx6325, if we don't do that, > then after the restore the status of the AC power will not be reported > correctly (and if you replace the battery while in the sleep state, the > battery status will not be updated correctly after the restore). Similar > issues have been reported for other machines. 
> > Now, the ACPI specification requires us to put devices into low power states > before executing _PTS and that's exactly what we're doing before a suspend to > RAM. Thus, it seems that in general we need to do the same for hibernation on > ACPI systems. Then is it possible to separate device quiesce from device suspend? Perhaps not for swsusp, but for kexec-based hibernation? Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <851fc09e0709210802o3be2789s8e93410fa07f7066@mail.gmail.com> @ 2007-09-21 15:50 ` Rafael J. Wysocki 0 siblings, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 15:50 UTC (permalink / raw) To: huang ying Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Andrew Morton, Jeremy Maitin-Shepard On Friday, 21 September 2007 17:02, huang ying wrote: > On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Friday, 21 September 2007 15:14, huang ying wrote: > > > On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > > > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > > > > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > > [--snip--] > > > > > > > > > > No one has yet attacked the hard problem of coming up with separate > > > > > hibernate methods for drivers. > > > > > > > > Well, I've been playing a bit with that for some time, but it's not easy by any > > > > means. > > > > > > > > In short, I'm seeing some problems related to the handling of ACPI that seem to > > > > shatter the entire idea of having separate hibernate methods, at least as far > > > > as ACPI systems are concerned. > > > > > > So sadly to hear this. Can you details it a little? Or a link? > > > > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > > to execute the _PTS ACPI global control method before creating the image _and_ > > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > > system into the sleep state. In particular, on nx6325, if we don't do that, > > then after the restore the status of the AC power will not be reported > > correctly (and if you replace the battery while in the sleep state, the > > battery status will not be updated correctly after the restore). Similar > > issues have been reported for other machines. 
> > > > Now, the ACPI specification requires us to put devices into low power states > > before executing _PTS and that's exactly what we're doing before a suspend to > > RAM. Thus, it seems that in general we need to do the same for hibernation on > > ACPI systems. > > Then, is it possible to separate device quiesce from device suspend. It surely is possible, but I'm not sure if it's going to be useful. I mean, if we need to do exactly the same thing before a suspend to RAM and before a hibernation (ie. to put devices into low power states), why would we want to use different methods for that in both cases? > Perhaps not for swsusp, but for kexec based hibernation? Frankly, I don't know. Generally, changing the way in which device drivers handle suspend (to RAM) and hibernation is a huge task. After considering this issue for some time I think that we really should start from hardening suspend (to RAM) so that it doesn't need the freezer any more, because _that_ would require us to change the suspend-related drivers' callbacks anyway. When we are sure how we are going to eliminate the freezer from suspend (to RAM), we'll know how that affects hibernation and what to do about it. Greetings, Rafael ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211631.19130.rjw@sisk.pl> ` (2 preceding siblings ...) [not found] ` <851fc09e0709210802o3be2789s8e93410fa07f7066@mail.gmail.com> @ 2007-09-21 18:11 ` Jeremy Maitin-Shepard [not found] ` <87sl576g8q.fsf@jbms.ath.cx> 4 siblings, 0 replies; 42+ messages in thread From: Jeremy Maitin-Shepard @ 2007-09-21 18:11 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, huang ying, Andrew Morton "Rafael J. Wysocki" <rjw@sisk.pl> writes: > On Friday, 21 September 2007 15:14, huang ying wrote: >> On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: >> > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: >> > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > [--snip--] >> > > >> > > No one has yet attacked the hard problem of coming up with separate >> > > hibernate methods for drivers. >> > >> > Well, I've been playing a bit with that for some time, but it's not easy by > any >> > means. >> > >> > In short, I'm seeing some problems related to the handling of ACPI that seem > to >> > shatter the entire idea of having separate hibernate methods, at least as > far >> > as ACPI systems are concerned. >> >> So sadly to hear this. Can you details it a little? Or a link? > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > to execute the _PTS ACPI global control method before creating the image _and_ > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > system into the sleep state. In particular, on nx6325, if we don't do that, > then after the restore the status of the AC power will not be reported > correctly (and if you replace the battery while in the sleep state, the > battery status will not be updated correctly after the restore). Similar > issues have been reported for other machines. 
Suppose that instead of using ACPI S4 state at all, you instead just power off. Yes, you'll lose wakeup event functionality, and flashy LEDs, but doesn't this take care of the problem? The firmware shouldn't see the hibernate as anything other than a shutdown and reboot. ACPI should be initialized normally when resuming, which should take care of getting AC power status reported properly. This should be the behavior, anyway, on the many systems that do not support S4. > Now, the ACPI specification requires us to put devices into low power states > before executing _PTS and that's exactly what we're doing before a suspend to > RAM. Thus, it seems that in general we need to do the same for hibernation on > ACPI systems. It seems that if ACPI S4 is going to be used, Switching to low power state is something that should be done only immediately before entering that state (i.e. after the image has already been saved). In particular, it should not be done just before the atomic copy. It is true that (during resume) after the atomic copy snapshot is restored, drivers will need to be prepared (i.e. have saved whatever information is necessary) to _resume_ devices from the low power state, but that does not mean they have to actually be put into that low power state before the copy is made. I agree that for the kexec implementation there may be additional issues. For swsusp, uswsusp, and tuxonice, though, I don't see why there should be a problem. I think that, as was recognized before, all of the issues are resolved by properly considering exactly what each callback should do and when it should be called. The problems stem from ambiguous specifications, or trying to use the same callback for two different purposes or in two different cases. Let me know if I'm mistaken. -- Jeremy Maitin-Shepard ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <87sl576g8q.fsf@jbms.ath.cx> @ 2007-09-21 19:00 ` Rafael J. Wysocki 2007-09-21 19:45 ` Alan Stern 0 siblings, 1 reply; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 19:00 UTC (permalink / raw) To: Jeremy Maitin-Shepard Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, huang ying, Andrew Morton On Friday, 21 September 2007 20:11, Jeremy Maitin-Shepard wrote: > "Rafael J. Wysocki" <rjw@sisk.pl> writes: > > > On Friday, 21 September 2007 15:14, huang ying wrote: > >> On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > >> > On Friday, 21 September 2007 05:33, Eric W. Biederman wrote: > >> > > Nigel Cunningham <nigel@nigel.suspend2.net> writes: > > [--snip--] > >> > > > >> > > No one has yet attacked the hard problem of coming up with separate > >> > > hibernate methods for drivers. > >> > > >> > Well, I've been playing a bit with that for some time, but it's not easy by > > any > >> > means. > >> > > >> > In short, I'm seeing some problems related to the handling of ACPI that seem > > to > >> > shatter the entire idea of having separate hibernate methods, at least as > > far > >> > as ACPI systems are concerned. > >> > >> So sadly to hear this. Can you details it a little? Or a link? > > > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us > > to execute the _PTS ACPI global control method before creating the image _and_ > > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the > > system into the sleep state. In particular, on nx6325, if we don't do that, > > then after the restore the status of the AC power will not be reported > > correctly (and if you replace the battery while in the sleep state, the > > battery status will not be updated correctly after the restore). Similar > > issues have been reported for other machines. 
> > Suppose that instead of using ACPI S4 state at all, you instead just > power off. Yes, you'll lose wakeup event functionality, and flashy > LEDs, but doesn't this take care of the problem? Nope. > The firmware shouldn't see the hibernate as anything other than a shutdown > and reboot. Actually, this assumption is apparently wrong. > ACPI should be initialized normally when resuming, which should take care of > getting AC power status reported properly. Well, that doesn't work. I've tested it, really. :-) > This should be the behavior, anyway, on the many systems that do not > support S4. > > > Now, the ACPI specification requires us to put devices into low power states > > before executing _PTS and that's exactly what we're doing before a suspend to > > RAM. Thus, it seems that in general we need to do the same for hibernation on > > ACPI systems. > > It seems that if ACPI S4 is going to be used, Switching to low power > state is something that should be done only immediately before entering > that state (i.e. after the image has already been saved). Doesn't. Work. > In particular, it should not be done just before the atomic copy. It is > true that (during resume) after the atomic copy snapshot is restored, > drivers will need to be prepared (i.e. have saved whatever information > is necessary) to _resume_ devices from the low power state, but that > does not mean they have to actually be put into that low power state > before the copy is made. > > I agree that for the kexec implementation there may be additional > issues. For swsusp, uswsusp, and tuxonice, though, I don't see why > there should be a problem. I think that, as was recognized before, all > of the issues are resolved by properly considering exactly what each > callback should do and when it should be called. The problems stem from > ambiguous specifications, or trying to use the same callback for two > different purposes or in two different cases. > > Let me know if I'm mistaken. See above. 
:-)

Greetings,
Rafael
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  2007-09-21 19:00 ` Rafael J. Wysocki
@ 2007-09-21 19:45 ` Alan Stern
  0 siblings, 0 replies; 42+ messages in thread
From: Alan Stern @ 2007-09-21 19:45 UTC
To: Rafael J. Wysocki
Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, huang ying, Andrew Morton, Jeremy Maitin-Shepard

On Fri, 21 Sep 2007, Rafael J. Wysocki wrote:

> > > Well, the problem is that apparently some systems (eg. my HP nx6325) expect us
> > > to execute the _PTS ACPI global control method before creating the image _and_
> > > to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the
> > > system into the sleep state. In particular, on nx6325, if we don't do that,
> > > then after the restore the status of the AC power will not be reported
> > > correctly (and if you replace the battery while in the sleep state, the
> > > battery status will not be updated correctly after the restore). Similar
> > > issues have been reported for other machines.
> >
> > Suppose that instead of using ACPI S4 state at all, you instead just
> > power off. Yes, you'll lose wakeup event functionality, and flashy
> > LEDs, but doesn't this take care of the problem?
>
> Nope.
>
> > The firmware shouldn't see the hibernate as anything other than a shutdown
> > and reboot.
>
> Actually, this assumption is apparently wrong.

One gets the impression that the hibernation image includes a memory
area used by the firmware. That could explain why devices need to be in
a low-power state when the image is created -- so that when the image is
restored, the firmware doesn't get confused about the device states. It
would also explain why the firmware sees resume-from-power-off-hibernation
as different from a regular reboot: because its data area gets
overwritten as part of the resume.
In reality it's probably more complicated than this, with weird
interactions between the firmware and the various ACPI methods.
Nevertheless, the main idea seems valid.

Alan Stern
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <20070920184106.79e1858a.akpm@linux-foundation.org> 2007-09-21 1:57 ` Nigel Cunningham [not found] ` <200709211157.28622.nigel@nigel.suspend2.net> @ 2007-09-21 11:56 ` Rafael J. Wysocki [not found] ` <200709211356.30291.rjw@sisk.pl> 3 siblings, 0 replies; 42+ messages in thread From: Rafael J. Wysocki @ 2007-09-21 11:56 UTC (permalink / raw) To: Andrew Morton Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, linux-pm, Jeremy Maitin-Shepard Hi Andrew, On Friday, 21 September 2007 03:41, Andrew Morton wrote: > On Fri, 21 Sep 2007 11:19:59 +1000 Nigel Cunningham <ncunningham@crca.org.au> wrote: > > > Hi. > > > > On Friday 21 September 2007 11:06:23 Andrew Morton wrote: > > > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham > > <nigel@nigel.suspend2.net> wrote: > > > > > > > Hi Andrew. > > > > > > > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote: > > > > > Seems like good enough for -mm to me. > > > > > > > > > > Pavel > > > > > > > > Andrew, if I recall correctly, you said a while ago that you didn't want > > > > another hibernation implementation in the vanilla kernel. If you're going > > to > > > > consider merging this kexec code, will you also please consider merging > > > > TuxOnIce? > > > > > > > > > > The theory is that kexec-based hibernation will mainly use preexisting > > > kexec code and will permit us to delete the existing hibernation > > > implementation. > > > > > > That's different from replacing it. > > > > TuxOnIce doesn't remove the existing implementation either. It can > > transparently replace it, but you can enable/disable that at compile time. > > Right. So we end up with two implementations in-tree. Whereas > kexec-based-hibernation leads us to having zero implementations in-tree. Well, I don't quite agree. For now, the kexec-based approach is missing the handling of devices, AFAICS. 
Namely, it's quite easy to snapshot memory with the help of kexec, but the
state of devices gets trashed in the process, so you need some additional code
saving the state of devices for you, executed before the kexec.

Moreover, on ACPI systems the transition to the S4 sleep state and back to S0
(working state) is more complicated than plain system checkpointing, because we
are supposed to take the platform firmware into consideration in that case.
The more I think about this, the more it seems to me that it just can't be done
on top of kexec in a reasonable fashion. Of course, we could avoid handling
the ACPI S4, but that would leave some people (including me ;-)) with
semi-working hardware after the "restore". I don't think that's generally
acceptable in the long run.

IMHO, for ACPI systems the way to go is to harden suspend to RAM (with s2ram
in place and the graphics adapter specifications from Intel and AMD released
we are in a good position to do that) and build the S4 transition mechanism
on top of that. It can be done easily by adapting the current hibernation
code, but not on top of kexec (I'm afraid).

[Besides, the current hibernation userland interface is used by default by
openSUSE and it's also used by quite a few Debian users, so we can't drop
it overnight and it can't be implemented in a compatible way on top of the
kexec-based solution.]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  [not found] ` <200709211356.30291.rjw@sisk.pl>
@ 2007-09-21 11:58 ` Nigel Cunningham
  2007-09-21 13:25 ` huang ying
  1 sibling, 0 replies; 42+ messages in thread
From: Nigel Cunningham @ 2007-09-21 11:58 UTC
To: Rafael J. Wysocki
Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard

Hi.

On Friday 21 September 2007 21:56:29 Rafael J. Wysocki wrote:
> [Besides, the current hibernation userland interface is used by default by
> openSUSE and it's also used by quite some Debian users, so we can't drop
> it overnight and it can't be implemented in a compatible way on top of the
> kexec-based solution.]

Could it be fudged by giving userland a null image and having (say) the
first ioctl be one that triggers all the real work (with other ioctls
being noops or such like, as appropriate)?

Regards,

Nigel
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <200709211356.30291.rjw@sisk.pl> 2007-09-21 11:58 ` Nigel Cunningham @ 2007-09-21 13:25 ` huang ying 1 sibling, 0 replies; 42+ messages in thread From: huang ying @ 2007-09-21 13:25 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On 9/21/07, Rafael J. Wysocki <rjw@sisk.pl> wrote: > Hi Andrew, > > On Friday, 21 September 2007 03:41, Andrew Morton wrote: > > On Fri, 21 Sep 2007 11:19:59 +1000 Nigel Cunningham <ncunningham@crca.org.au> wrote: > > > > > Hi. > > > > > > On Friday 21 September 2007 11:06:23 Andrew Morton wrote: > > > > On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham > > > <nigel@nigel.suspend2.net> wrote: > > > > > > > > > Hi Andrew. > > > > > > > > > > On Thursday 20 September 2007 20:09:41 Pavel Machek wrote: > > > > > > Seems like good enough for -mm to me. > > > > > > > > > > > > Pavel > > > > > > > > > > Andrew, if I recall correctly, you said a while ago that you didn't want > > > > > another hibernation implementation in the vanilla kernel. If you're going > > > to > > > > > consider merging this kexec code, will you also please consider merging > > > > > TuxOnIce? > > > > > > > > > > > > > The theory is that kexec-based hibernation will mainly use preexisting > > > > kexec code and will permit us to delete the existing hibernation > > > > implementation. > > > > > > > > That's different from replacing it. > > > > > > TuxOnIce doesn't remove the existing implementation either. It can > > > transparently replace it, but you can enable/disable that at compile time. > > > > Right. So we end up with two implementations in-tree. Whereas > > kexec-based-hibernation leads us to having zero implementations in-tree. > > Well, I don't quite agree. > > For now, the kexec-based approach is missing the handling of devices, AFAICS. 
> Namely, it's quite easy to snapshot memory with the help of kexec, but the > state of devices gets trashed in the process, so you need some additional code > saving the state of devices for you, executed before the kexec. > > Moreover, on ACPI systems the transition to the S4 sleep state and back to S0 > (working state) is more complicated than a system checkpointing, because we > are supposed to take the platform firmware into consideration in that case. > The more I think about this, the more it seems to me that it just can't be done > on top of kexec in a reasonable fashion. Of course, we could avoid handling > the ACPI S4, but that would leave some people (including me ;-)) with > semi-working hardware after the "restore". I don't think that's generally > acceptable in the long run. > > IMHO, for ACPI systems the way to go is to harden suspend to RAM (with s2ram > in place and the graphics adapters specifications from Intel and AMD released > we are in a good position to do that) and build the S4 transition mechanism > on top of that. It can be done easlily by adapting the current hibernation > code, but not on top of kexec (I'm afraid). Yes. ACPI is a biggest issue of kexec based hibernation now. I will try to work on that. At least I can prove whether kexec based hibernation is possible with ACPI. > [Besides, the current hibernation userland interface is used by default by > openSUSE and it's also used by quite some Debian users, so we can't drop > it overnight and it can't be implemented in a compatible way on top of the > kexec-based solution.] Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  2007-09-21 1:06 ` Andrew Morton
  2007-09-21 1:19 ` Nigel Cunningham
  [not found] ` <200709211120.00448.ncunningham@crca.org.au>
@ 2007-09-24 17:37 ` Thomas Meyer
  2 siblings, 0 replies; 42+ messages in thread
From: Thomas Meyer @ 2007-09-24 17:37 UTC
To: Andrew Morton
Cc: Nigel Cunningham, nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, linux-pm, Jeremy Maitin-Shepard

Andrew Morton wrote:
> On Fri, 21 Sep 2007 10:24:34 +1000 Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
>
>
>> Hi Andrew.
>>
>> On Thursday 20 September 2007 20:09:41 Pavel Machek wrote:
>>
>>> Seems like good enough for -mm to me.
>>>
>>> Pavel
>>>
>> Andrew, if I recall correctly, you said a while ago that you didn't want
>> another hibernation implementation in the vanilla kernel. If you're going to
>> consider merging this kexec code, will you also please consider merging
>> TuxOnIce?
>>
>>
>
> The theory is that kexec-based hibernation will mainly use preexisting
> kexec code and will permit us to delete the existing hibernation
> implementation.
>
> That's different from replacing it

Before replacing existing hibernation implementations, someone should
fix kexec for i386 (maybe others?) EFI systems...
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  [not found] ` <200709211024.35991.nigel@nigel.suspend2.net>
  2007-09-21 1:06 ` Andrew Morton
@ 2007-09-21 9:49 ` Pavel Machek
  [not found] ` <20070921094908.GB20149@elf.ucw.cz>
  2 siblings, 0 replies; 42+ messages in thread
From: Pavel Machek @ 2007-09-21 9:49 UTC
To: nigel
Cc: Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard

Hi!

> > Seems like good enough for -mm to me.

(For the record, I do not think this is going to be
hibernation-replacement any time soon. But it is functionality useful
for other stuff -- dump memory and continue -- and yes it may be able
to do hibernation in the long term.

It really comes from the other side of reliability:

* swsusp is "if your kernel is perfectly healthy, it will work"

while this, coming from kdump is

* "if your kernel is not completely trashed, it should work"

...which is why we can't use swsusp to do dump memory and continue -- you
want to do dumps on "slightly broken" systems. And yes, as a
side effect it may be able to do hibernation... why not, let's see how
it works out).

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
  [not found] ` <20070921094908.GB20149@elf.ucw.cz>
@ 2007-09-21 12:10 ` Rafael J. Wysocki
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael J. Wysocki @ 2007-09-21 12:10 UTC
To: linux-pm
Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman, Huang, Ying, Andrew Morton, Jeremy Maitin-Shepard

On Friday, 21 September 2007 11:49, Pavel Machek wrote:
> Hi!
>
> > > Seems like good enough for -mm to me.
>
> (For the record, I do not think this is going to be
> hibernation-replacement any time soon. But it is functionality useful
> for other stuff -- dump memory and continue -- and yes it may be able
> to do hibernation in the long term.
>
> It really comes from the other side of reliability:
>
> * swsusp is "if your kernel is perfectly healthy, it will work"
>
> while this, coming from kdump is
>
> * "if your kernel is not completely trashed, it should work"
>
> ...which is why can't use swsusp to do dump memory and continue -- you
> want to do dumps on "slightly broken" systems. And yes, as a
> sideeffect it may be able to do hibernation... why not, lets see how
> it works out).

I generally agree. :-)

Greetings,
Rafael
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] <1190266447.21818.17.camel@caritas-dev.intel.com> 2007-09-20 10:09 ` [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump Pavel Machek [not found] ` <20070920100941.GA12157@atrey.karlin.mff.cuni.cz> @ 2007-09-21 2:55 ` Eric W. Biederman 2007-09-21 4:01 ` Eric W. Biederman ` (3 subsequent siblings) 6 siblings, 0 replies; 42+ messages in thread From: Eric W. Biederman @ 2007-09-21 2:55 UTC (permalink / raw) To: Huang, Ying Cc: nigel, Kexec Mailing List, linux-kernel, Andrew Morton, linux-pm, Jeremy Maitin-Shepard "Huang, Ying" <ying.huang@intel.com> writes: > This patch implements the functionality of jumping between the kexeced > kernel and the original kernel. > > A new reboot command named LINUX_REBOOT_CMD_KJUMP is defined to > trigger the jumping to (executing) the new kernel and jumping back to > the original kernel. > > To support jumping between two kernels, before jumping to (executing) > the new kernel and jumping back to the original kernel, the devices > are put into quiescent state (to be fully implemented), Well this we have an implementation of (it's called shutdown) or does that method not do enough to meet the requirements of hibernation. If at all possible I would like to keep reboot, kexec and kexec+return all using the same device driver methods. > and the state of devices and CPU is saved. Makes a reasonable amount of sense. We do need to save whatever state we cannot recover just be reprogramming the hardware. As long as the drivers are built so this is a good place for a hot remove to happen we should be in good shape. > After jumping back from kexeced kernel > and jumping to the new kernel, the state of devices and CPU are > restored accordingly. The devices/CPU state save/restore code of > software suspend is called to implement corresponding function. At least for now that sounds like a reasonable work around. 
I don't think we want to merge this code until we have agreed upon how the new device_detach and device_reattach (or whatever we call the device methods for hibernate) are to be implemented. > To support jumping without preserving memory. One shadow backup page > is allocated for each page used by new (kexeced) kernel. That does not sound correct. The current implementation of kexec_load does allocate a source page and give it a destination page and usually those two pages are different. But if our memory allocations happen to return a destination page there we use it directly, making no copy necessary. I think we are talking about the same thing but I'm not certain you have thought about the case where your shadow backup page happens to be the same as current page. > When do > kexec_load, the image of new kernel is loaded into shadow pages, Ok. This sounds like the existing implementation. Except it depending on your destination it may force the address. > and > before executing, the original pages and the shadow pages are swapped, > so the contents of original pages are backuped. Yes. Unless we happen to have everything allocated on the same page. Does your code handle that case? I know the generic kexec code will pass lists like that in the proper circumstances. Especially for the kexec on panic case. > Before jumping to the > new (kexeced) kernel and after jumping back to the original kernel, > the original pages and the shadow pages are swapped too. Yes. That sounds right. > A jump back protocol is defined and documented. Bleh. We do need to document the requirements but we don't need a versioned monster. And we don't need to be exposing implementation details in that documentation. In the kexec world /sbin/kexec or another user space caller is responsible for passing information to our callers. To be polite we need to document more but the jump back protocol really should be as if the entry point kexec handed control to did a subroutine return. 
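The shadow-page scheme debated above can be illustrated outside the kernel. What follows is a minimal user-space sketch in C, not code from the patch: `struct page_pair`, `swap_page()` and `swap_all_pages()` are hypothetical names, and ordinary buffers stand in for physical pages. It assumes one scratch page is available for the exchange (roughly the role the patch appears to give its swap page), and it treats the case raised in the review, where the backup page happens to be the destination page itself, as a no-op.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/*
 * Hypothetical descriptor, one entry per page the other kernel occupies.
 * "orig" is the page at the destination address; "shadow" holds the
 * contents that should live there after the swap.
 */
struct page_pair {
	unsigned char *orig;
	unsigned char *shadow;
};

/* Exchange the contents of two pages through a scratch page. */
static void swap_page(unsigned char *a, unsigned char *b,
		      unsigned char *scratch)
{
	/* Source and destination are the same page: nothing to copy. */
	if (a == b)
		return;

	memcpy(scratch, a, PAGE_SIZE);
	memcpy(a, b, PAGE_SIZE);
	memcpy(b, scratch, PAGE_SIZE);
}

/* Swap every original page with its shadow backup page. */
static void swap_all_pages(struct page_pair *pairs, size_t n,
			   unsigned char *scratch)
{
	size_t i;

	for (i = 0; i < n; i++)
		swap_page(pairs[i].orig, pairs[i].shadow, scratch);
}
```

Because the exchange is symmetric, running the same loop a second time puts the original contents back, which seems to be the property the jump-back path depends on. The real code has to do this on physical addresses from the relocation stub, so this sketch only shows the bookkeeping, not the mechanism.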
> Known issues > > - A field is added to Linux kernel real-mode header. This is > temporary, and should be replaced after the 32-bit boot protocol and > setup data patches are accepted. It shouldn't be needed. > - The suspend method of device is used to put device in quiescent > state. But if the ACPI is enabled this will also put devices into > low power state, which prevent the new kernel from booting. So, the > ACPI must be disabled both in original kernel and kexeced > kernel. This is planed to be resolved after the suspend method and > hibernate method is separated for device as proposed earlier in the > LKML. Reasonable. > - The NX (none executable) bit should be turned off for the control > page if available. Why don't we have a problem with this in the normal kexec case? More comments below. > Signed-off-by: Huang Ying <ying.huang@intel.com> > > --- > > Documentation/i386/jump_back_protocol.txt | 81 ++++++++++++ > arch/i386/Kconfig | 7 + > arch/i386/boot/header.S | 2 > arch/i386/kernel/machine_kexec.c | 77 +++++++++--- > arch/i386/kernel/relocate_kernel.S | 187 ++++++++++++++++++++++++++---- > arch/i386/kernel/setup.c | 3 > include/asm-i386/bootparam.h | 3 > include/asm-i386/kexec.h | 48 ++++++- > include/linux/kexec.h | 9 + > include/linux/reboot.h | 2 > kernel/kexec.c | 59 +++++++++ > kernel/ksysfs.c | 17 ++ > kernel/power/Kconfig | 2 > kernel/sys.c | 8 + > 14 files changed, 463 insertions(+), 42 deletions(-) > > Index: linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c > =================================================================== > --- linux-2.6.23-rc6.orig/arch/i386/kernel/machine_kexec.c 2007-09-20 > 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c 2007-09-20 > 11:24:31.000000000 +0800 > @@ -20,6 +20,7 @@ > #include <asm/cpufeature.h> > #include <asm/desc.h> > #include <asm/system.h> > +#include <asm/setup.h> Please remove the unnecessary and incorrect include. 
> #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE))) > static u32 kexec_pgd[1024] PAGE_ALIGNED; > @@ -98,23 +99,23 @@ > { > } > > -/* > - * Do not allocate memory (or fail in any way) in machine_kexec(). > - * We are past the point of no return, committed to rebooting now. > - */ > -NORET_TYPE void machine_kexec(struct kimage *image) > +static NORET_TYPE void __machine_kexec(struct kimage *image, > + void *control_page) ATTRIB_NORET; > + > +static NORET_TYPE void __machine_kexec(struct kimage *image, > + void *control_page) > { It looks like we are doing swap_pages unconditionally so I don't see why we need to versions of this function. And it looks like the NORET_TYPE is no longer an accurate description. > unsigned long page_list[PAGES_NR]; > - void *control_page; > - > - /* Interrupts aren't acceptable while we reboot */ > - local_irq_disable(); > - > - control_page = page_address(image->control_code_page); > - memcpy(control_page, relocate_kernel, PAGE_SIZE); > + asmlinkage NORET_TYPE void > + (*relocate_kernel_ptr)(unsigned long indirection_page, > + unsigned long control_page, > + unsigned long start_address, > + unsigned int has_pae) ATTRIB_NORET; > > + relocate_kernel_ptr = control_page + > + ((void *)relocate_kernel - (void *)relocate_page); > page_list[PA_CONTROL_PAGE] = __pa(control_page); > - page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel; > + page_list[VA_CONTROL_PAGE] = (unsigned long)control_page; > page_list[PA_PGD] = __pa(kexec_pgd); > page_list[VA_PGD] = (unsigned long)kexec_pgd; > #ifdef CONFIG_X86_PAE > @@ -127,6 +128,7 @@ > page_list[VA_PTE_0] = (unsigned long)kexec_pte0; > page_list[PA_PTE_1] = __pa(kexec_pte1); > page_list[VA_PTE_1] = (unsigned long)kexec_pte1; > + page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT); > > /* The segment registers are funny things, they have both a > * visible and an invisible part. 
Whenever the visible part is > @@ -145,8 +147,26 @@ > set_idt(phys_to_virt(0),0); > > /* now call it */ > - relocate_kernel((unsigned long)image->head, (unsigned long)page_list, > - image->start, cpu_has_pae); > + relocate_kernel_ptr((unsigned long)image->head, > + (unsigned long)page_list, > + image->start, cpu_has_pae); > +} > + > +/* > + * Do not allocate memory (or fail in any way) in machine_kexec(). > + * We are past the point of no return, committed to rebooting now. > + */ > +NORET_TYPE void machine_kexec(struct kimage *image) > +{ > + void *control_page; > + > + /* Interrupts aren't acceptable while we reboot */ > + local_irq_disable(); > + > + control_page = page_address(image->control_code_page); > + memcpy(control_page, relocate_page, PAGE_SIZE); > + > + __machine_kexec(image, control_page); > } > > /* crashkernel=size@addr specifies the location to reserve for > @@ -182,3 +202,30 @@ > #endif > } > > +#ifdef CONFIG_KEXEC_JUMP > +int machine_kexec_jump(struct kimage *image) > +{ > + void *control_page; > + unsigned long pa_control_page; > + > + control_page = page_address(image->control_code_page); > + memcpy(control_page, relocate_page, PAGE_SIZE/2); > + pa_control_page = __pa(control_page); > + memcpy(control_page + 1, &pa_control_page, sizeof(pa_control_page)); > + > + KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER; > + KJUMP_VERSION(control_page) = KJUMP_VERSION_NUMBER; This bit really looks wrong. It is for /sbin/kexec to provide this kind of interface. 
> + > + if (!kexec_jump_save_cpu(control_page)) > + __machine_kexec(image, control_page); > + > + kexec_jump_back_entry = KJUMP_ENTRY(control_page); > + image->start = kexec_jump_back_entry; > + return 0; > +} > + > +void __init parse_kexec_jump_back_entry(void) > +{ > + kexec_jump_back_entry = boot_params.hdr.jump_back_entry; > +} > +#endif /* CONFIG_KEXEC_JUMP */ > Index: linux-2.6.23-rc6/include/asm-i386/kexec.h > =================================================================== > --- linux-2.6.23-rc6.orig/include/asm-i386/kexec.h 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/include/asm-i386/kexec.h 2007-09-20 11:24:31.000000000 > +0800 > @@ -9,16 +9,42 @@ > #define VA_PTE_0 5 > #define PA_PTE_1 6 > #define VA_PTE_1 7 > +#define PA_SWAP_PAGE 8 > #ifdef CONFIG_X86_PAE > -#define PA_PMD_0 8 > -#define VA_PMD_0 9 > -#define PA_PMD_1 10 > -#define VA_PMD_1 11 > -#define PAGES_NR 12 > +#define PA_PMD_0 9 > +#define VA_PMD_0 10 > +#define PA_PMD_1 11 > +#define VA_PMD_1 12 > +#define PAGES_NR 13 > #else > -#define PAGES_NR 8 > +#define PAGES_NR 9 > #endif > > +#define KJUMP_DATA_BASE 0x800 > + > +#define KJUMP_MAGIC_NUMBER 0x626a > +#define KJUMP_VERSION_NUMBER 0x0100 > + > +#define KJUMP_DATA(buf) ((unsigned char *)(buf)+KJUMP_DATA_BASE) > +#define KJUMP_OFF(off) (KJUMP_DATA_BASE+(off)) > + > +#define KJUMP_MAGIC_OFF KJUMP_OFF(0x0) > +#define KJUMP_MAGIC(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x0)) > +#define KJUMP_VERSION(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x2)) > +#define KJUMP_BACKUP_PAGES_MAP_OFF \ > + KJUMP_OFF(0x4) > +#define KJUMP_BACKUP_PAGES_MAP(buf) \ > + (*(unsigned long *)(KJUMP_DATA(buf)+0x4)) > + > +/* > + * The following are not a part of jump back protocol, for internal > + * use only > + */ > +#define KJUMP_ENTRY_OFF KJUMP_OFF(0x20) > +#define KJUMP_ENTRY(buf) (*(unsigned long *)(KJUMP_DATA(buf)+0x20)) > +/* Other internal data fields base */ > +#define KJUMP_OTHER_OFF KJUMP_OFF(0x24) > + > #ifndef __ASSEMBLY__ > > 
#include <asm/ptrace.h> > @@ -94,6 +120,16 @@ > unsigned long start_address, > unsigned int has_pae) ATTRIB_NORET; > > +extern char relocate_page[PAGE_SIZE]; > + > +extern asmlinkage int kexec_jump_save_cpu(void *buf); > + > +#ifdef CONFIG_KEXEC_JUMP > +void parse_kexec_jump_back_entry(void); > +#else > +static inline void parse_kexec_jump_back_entry(void) { } > +#endif > + > #endif /* __ASSEMBLY__ */ > > #endif /* _I386_KEXEC_H */ > Index: linux-2.6.23-rc6/include/linux/kexec.h > =================================================================== > --- linux-2.6.23-rc6.orig/include/linux/kexec.h 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/include/linux/kexec.h 2007-09-20 11:26:03.000000000 +0800 > @@ -83,6 +83,7 @@ > > unsigned long start; > struct page *control_code_page; > + struct page *swap_page; Ok. This looks reasonable. So we can reorder the pages in a non-destructive way. > unsigned long nr_segments; > struct kexec_segment segment[KEXEC_SEGMENT_MAX]; > @@ -194,4 +195,12 @@ > static inline void crash_kexec(struct pt_regs *regs) { } > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > #endif /* CONFIG_KEXEC */ > + > +#ifdef CONFIG_KEXEC_JUMP > +extern int machine_kexec_jump(struct kimage *image); > +extern unsigned long kexec_jump_back_entry; > +extern int kexec_jump(void); > +#else /* !CONFIG_KEXEC_JUMP */ > +static inline int kexec_jump(void) { return 0; } > +#endif /* CONFIG_KEXEC_JUMP */ > #endif /* LINUX_KEXEC_H */ > Index: linux-2.6.23-rc6/kernel/kexec.c > =================================================================== > --- linux-2.6.23-rc6.orig/kernel/kexec.c 2007-09-20 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/kernel/kexec.c 2007-09-20 11:24:31.000000000 +0800 > @@ -24,6 +24,10 @@ > #include <linux/utsrelease.h> > #include <linux/utsname.h> > #include <linux/numa.h> > +#include <linux/suspend.h> > +#include <linux/pm.h> > +#include <linux/cpu.h> > +#include <linux/console.h> > > #include 
<asm/page.h> > #include <asm/uaccess.h> > @@ -243,6 +247,12 @@ > goto out; > } > > + image->swap_page = kimage_alloc_control_pages(image, 0); > + if (!image->swap_page) { > + printk(KERN_ERR "Could not allocate swap buffer\n"); > + goto out; > + } > + > result = 0; > out: > if (result == 0) > @@ -1246,3 +1256,52 @@ > } > > module_init(crash_save_vmcoreinfo_init) > + > +#ifdef CONFIG_KEXEC_JUMP > +unsigned long kexec_jump_back_entry; > + > +int kexec_jump(void) > +{ > + int error; > + > + if (!kexec_image) > + return -EINVAL; > + > + pm_prepare_console(); > + suspend_console(); > + error = device_suspend(PMSG_FREEZE); > + if (error) > + goto Resume_console; > + error = disable_nonboot_cpus(); > + if (error) > + goto Resume_devices; > + local_irq_disable(); > + /* At this point, device_suspend() has been called, but *not* > + * device_power_down(). We *must* device_power_down() now. > + * Otherwise, drivers for some devices (e.g. interrupt controllers) > + * become desynchronized with the actual state of the hardware > + * at resume time, and evil weirdness ensues. > + */ > + error = device_power_down(PMSG_FREEZE); > + if (error) > + goto Enable_irqs; > + > + save_processor_state(); > + error = machine_kexec_jump(kexec_image); > + restore_processor_state(); > + > + /* NOTE: device_power_up() is just a resume() for devices > + * that suspended with irqs off ... no overall powerup. 
> + */ > + device_power_up(); > + Enable_irqs: > + local_irq_enable(); > + enable_nonboot_cpus(); > + Resume_devices: > + device_resume(); > + Resume_console: > + resume_console(); > + pm_restore_console(); > + return error; > +} > +#endif /* CONFIG_KEXEC_JUMP */ > Index: linux-2.6.23-rc6/kernel/ksysfs.c > =================================================================== > --- linux-2.6.23-rc6.orig/kernel/ksysfs.c 2007-09-20 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/kernel/ksysfs.c 2007-09-20 11:24:31.000000000 +0800 > @@ -69,6 +69,20 @@ > } > KERNEL_ATTR_RO(vmcoreinfo); > > +#ifdef CONFIG_KEXEC_JUMP > +static ssize_t kexec_jump_back_entry_show(struct kset *kset, char *page) > +{ > + return sprintf(page, "0x%lx\n", kexec_jump_back_entry); > +} > +static ssize_t kexec_jump_back_entry_store(struct kset *kset, const char *page, > + size_t count) > +{ > + kexec_jump_back_entry = simple_strtoul(page, NULL, 0); > + return count; > +} > + > +KERNEL_ATTR_RW(kexec_jump_back_entry); > +#endif /* CONFIG_KEXEC_JUMP */ > #endif /* CONFIG_KEXEC */ > > /* > @@ -105,6 +119,9 @@ > &kexec_loaded_attr.attr, > &kexec_crash_loaded_attr.attr, > &vmcoreinfo_attr.attr, > +#ifdef CONFIG_KEXEC_JUMP > + &kexec_jump_back_entry_attr.attr, > +#endif We should not need this. 
> #endif > NULL > }; > Index: linux-2.6.23-rc6/kernel/sys.c > =================================================================== > --- linux-2.6.23-rc6.orig/kernel/sys.c 2007-09-20 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/kernel/sys.c 2007-09-20 11:24:31.000000000 +0800 > @@ -424,6 +424,14 @@ > unlock_kernel(); > return -EINVAL; > > + case LINUX_REBOOT_CMD_KEXEC_JUMP: > + { > + int ret; > + ret = kexec_jump(); > + unlock_kernel(); > + return ret; > + } > + > #ifdef CONFIG_HIBERNATION > case LINUX_REBOOT_CMD_SW_SUSPEND: > { > Index: linux-2.6.23-rc6/include/linux/reboot.h > =================================================================== > --- linux-2.6.23-rc6.orig/include/linux/reboot.h 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/include/linux/reboot.h 2007-09-20 11:24:31.000000000 +0800 > @@ -23,6 +23,7 @@ > * RESTART2 Restart system using given command string. > * SW_SUSPEND Suspend system using software suspend if compiled in. > * KEXEC Restart system using a previously loaded Linux kernel > + * KEXEC_JUMP Jump between original kernel and kexeced kernel. > */ > > #define LINUX_REBOOT_CMD_RESTART 0x01234567 > @@ -33,6 +34,7 @@ > #define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4 > #define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2 > #define LINUX_REBOOT_CMD_KEXEC 0x45584543 > +#define LINUX_REBOOT_CMD_KEXEC_JUMP 0x3928A5FD I'm still not quite convinced that we need a separate entry point for this. But until we split out the hibernation methods, it seems reasonable. > > #ifdef __KERNEL__ > Index: linux-2.6.23-rc6/arch/i386/Kconfig > =================================================================== > --- linux-2.6.23-rc6.orig/arch/i386/Kconfig 2007-09-20 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/arch/i386/Kconfig 2007-09-20 11:24:31.000000000 +0800 > @@ -830,6 +830,13 @@ > (CONFIG_RELOCATABLE=y). 
> For more details see Documentation/kdump/kdump.txt > > +config KEXEC_JUMP > + bool "kexec jump (EXPERIMENTAL)" > + depends on EXPERIMENTAL > + depends on PM && X86_32 && KEXEC > + ---help--- > + Jump between the kexeced kernel and the orignal kernel. > + > config PHYSICAL_START > hex "Physical address where the kernel is loaded" if (EMBEDDED || > CRASH_DUMP) > default "0x1000000" if X86_NUMAQ > Index: linux-2.6.23-rc6/kernel/power/Kconfig > =================================================================== > --- linux-2.6.23-rc6.orig/kernel/power/Kconfig 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/kernel/power/Kconfig 2007-09-20 11:24:31.000000000 +0800 > @@ -70,7 +70,7 @@ > > config PM_SLEEP > bool > - depends on SUSPEND || HIBERNATION > + depends on SUSPEND || HIBERNATION || KEXEC_JUMP > default y > > config SUSPEND_UP_POSSIBLE > Index: linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S > =================================================================== > --- linux-2.6.23-rc6.orig/arch/i386/kernel/relocate_kernel.S 2007-09-20 > 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S 2007-09-20 > 11:24:31.000000000 +0800 > @@ -19,8 +19,87 @@ > #define PAGE_ATTR 0x63 /* _PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_DIRTY */ > #define PAE_PGD_ATTR 0x01 /* _PAGE_PRESENT */ > > +#define STACK_TOP 0x1000 > + > +#define DATA(offset) (KJUMP_OTHER_OFF+(offset)) > + > +/* Minimal CPU stat */ > +#define EBX DATA(0x0) > +#define ESI DATA(0x4) > +#define EDI DATA(0x8) > +#define EBP DATA(0xc) > +#define ESP DATA(0x10) > +#define CR0 DATA(0x14) > +#define CR3 DATA(0x18) > +#define CR4 DATA(0x1c) > +#define FLAG DATA(0x20) > +#define RET DATA(0x24) > + > +/* some information saved in control page (CP) for jumping back */ > +#define CP_VA_CONTROL_PAGE DATA(0x30) > +#define CP_PA_PGD DATA(0x34) > +#define CP_PA_SWAP_PAGE DATA(0x38) > + > .text > .align PAGE_ALIGNED > + .globl relocate_page > +relocate_page: > + > +/* > + * Entry 
point for jumping back from kexeced kernel, the paging is > + * turned off, the information needed is at relocate_page + > + * PAGE_SIZE/2 > + */ > +kexec_jump_back_entry: > + movl $relocate_page, %ebx > + movl %edi, KJUMP_ENTRY_OFF(%ebx) > + movl CP_VA_CONTROL_PAGE(%ebx), %edi > + > + lea STACK_TOP(%ebx), %esp > + > + movl CP_PA_SWAP_PAGE(%ebx), %eax > + movl KJUMP_BACKUP_PAGES_MAP_OFF(%ebx), %edx > + pushl %eax > + pushl %edx > + call swap_pages > + addl $8, %esp > + > + movl CP_PA_PGD(%ebx), %eax > + movl %eax, %cr3 > + > + movl %cr0, %eax > + orl $(1<<31), %eax > + movl %eax, %cr0 > + > + movl %edi, %esp > + addl $STACK_TOP, %esp > + > + movl %edi, %eax > + addl $(virtual_mapped - relocate_page), %eax > + pushl %eax > + ret > + > +virtual_mapped: > + movl %edi, %edx > + movl EBX(%edx), %ebx > + movl ESI(%edx), %esi > + movl EDI(%edx), %edi > + movl EBP(%edx), %ebp > + movl FLAG(%edx), %eax > + pushl %eax > + popf > + movl ESP(%edx), %esp > + movl CR4(%edx), %eax > + movl %eax, %cr4 > + movl CR3(%edx), %eax > + movl %eax, %cr3 > + movl CR0(%edx), %eax > + movl %eax, %cr0 > + movl RET(%edx), %eax > + movl %eax, (%esp) > + mov $1, %eax > + ret > + > .globl relocate_kernel > relocate_kernel: > movl 8(%esp), %ebp /* list of pages */ > @@ -146,6 +225,15 @@ > pushl $0 > popfl > > + /* save some information for jumping back */ > + movl PTR(VA_CONTROL_PAGE)(%ebp), %edi > + movl %edi, CP_VA_CONTROL_PAGE(%edi) > + movl PTR(PA_PGD)(%ebp), %eax > + movl %eax, CP_PA_PGD(%edi) > + movl PTR(PA_SWAP_PAGE)(%ebp), %eax > + movl %eax, CP_PA_SWAP_PAGE(%edi) > + movl %ebx, KJUMP_BACKUP_PAGES_MAP_OFF(%edi) > + > /* get physical address of control page now */ > /* this is impossible after page table switch */ > movl PTR(PA_CONTROL_PAGE)(%ebp), %edi > @@ -155,11 +243,11 @@ > movl %eax, %cr3 > > /* setup a new stack at the end of the physical control page */ > - lea 4096(%edi), %esp > + lea STACK_TOP(%edi), %esp > > /* jump to identity mapped page */ > movl %edi, %eax > - addl 
$(identity_mapped - relocate_kernel), %eax > + addl $(identity_mapped - relocate_page), %eax > pushl %eax > ret > > @@ -197,8 +285,44 @@ > xorl %eax, %eax > movl %eax, %cr3 > > + movl CP_PA_SWAP_PAGE(%edi), %eax > + pushl %eax > + pushl %ebx > + call swap_pages > + addl $8, %esp > + > + /* To be certain of avoiding problems with self-modifying code > + * I need to execute a serializing instruction here. > + * So I flush the TLB, it's handy, and not processor dependent. > + */ > + xorl %eax, %eax > + movl %eax, %cr3 > + > + /* set all of the registers to known values */ > + /* leave %esp alone */ > + > + movw KJUMP_MAGIC_OFF(%edi), %ax > + cmpw $KJUMP_MAGIC_NUMBER, %ax > + jz 1f > + xorl %edi, %edi > +1: > + xorl %eax, %eax > + xorl %ebx, %ebx > + xorl %ecx, %ecx > + xorl %edx, %edx > + xorl %esi, %esi > + xorl %ebp, %ebp > + ret > + > /* Do the copies */ > - movl %ebx, %ecx > +swap_pages: > + movl 8(%esp), %edx > + movl 4(%esp), %ecx > + pushl %ebp > + pushl %ebx > + pushl %edi > + pushl %esi > + movl %ecx, %ebx > jmp 1f > > 0: /* top, read another word from the indirection page */ > @@ -226,27 +350,50 @@ > movl %ecx, %esi /* For every source page do a copy */ > andl $0xfffff000, %esi > > + movl %edi, %eax > + movl %esi, %ebp > + > + movl %edx, %edi > movl $1024, %ecx > rep ; movsl > - jmp 0b > > -3: > + movl %ebp, %edi > + movl %eax, %esi > + movl $1024, %ecx > + rep ; movsl > > - /* To be certain of avoiding problems with self-modifying code > - * I need to execute a serializing instruction here. > - * So I flush the TLB, it's handy, and not processor dependent. 
> - */ > - xorl %eax, %eax > - movl %eax, %cr3 > + movl %eax, %edi > + movl %edx, %esi > + movl $1024, %ecx > + rep ; movsl > > - /* set all of the registers to known values */ > - /* leave %esp alone */ > + lea 4096(%ebp), %esi > + jmp 0b > +3: > + popl %esi > + popl %edi > + popl %ebx > + popl %ebp > + ret > > - xorl %eax, %eax > - xorl %ebx, %ebx > - xorl %ecx, %ecx > - xorl %edx, %edx > - xorl %esi, %esi > - xorl %edi, %edi > - xorl %ebp, %ebp > + .globl kexec_jump_save_cpu > +kexec_jump_save_cpu: > + movl 4(%esp), %edx > + movl %ebx, EBX(%edx) > + movl %esi, ESI(%edx) > + movl %edi, EDI(%edx) > + movl %ebp, EBP(%edx) > + movl %esp, ESP(%edx) > + movl %cr0, %eax > + movl %eax, CR0(%edx) > + movl %cr3, %eax > + movl %eax, CR3(%edx) > + movl %cr4, %eax > + movl %eax, CR4(%edx) > + pushf > + popl %eax > + movl %eax, FLAG(%edx) > + movl (%esp), %eax > + movl %eax, RET(%edx) > + mov $0, %eax > ret > Index: linux-2.6.23-rc6/include/asm-i386/bootparam.h > =================================================================== > --- linux-2.6.23-rc6.orig/include/asm-i386/bootparam.h 2007-09-20 > 11:24:25.000000000 +0800 > +++ linux-2.6.23-rc6/include/asm-i386/bootparam.h 2007-09-20 11:24:31.000000000 > +0800 > @@ -41,6 +41,9 @@ > u32 initrd_addr_max; > u32 kernel_alignment; > u8 relocatable_kernel; > + u8 _pad2[3]; > + u32 cmdline_size; > + u32 jump_back_entry; > } __attribute__((packed)); > > struct sys_desc_table { > Index: linux-2.6.23-rc6/arch/i386/boot/header.S > =================================================================== > --- linux-2.6.23-rc6.orig/arch/i386/boot/header.S 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/arch/i386/boot/header.S 2007-09-20 11:24:31.000000000 +0800 > @@ -214,6 +214,8 @@ > #added with boot protocol > #version 2.06 > > +jump_back_entry: .long 0 #jump back entry point Please no. 
> + > # End of setup header ##################################################### > > .section ".inittext", "ax" > Index: linux-2.6.23-rc6/arch/i386/kernel/setup.c > =================================================================== > --- linux-2.6.23-rc6.orig/arch/i386/kernel/setup.c 2007-09-20 11:24:25.000000000 > +0800 > +++ linux-2.6.23-rc6/arch/i386/kernel/setup.c 2007-09-20 13:14:19.000000000 > +0800 > @@ -60,6 +60,7 @@ > #include <asm/ist.h> > #include <asm/io.h> > #include <asm/vmi.h> > +#include <asm/kexec.h> > #include <setup_arch.h> > #include <bios_ebda.h> > > @@ -566,6 +567,8 @@ > data_resource.start = virt_to_phys(_etext); > data_resource.end = virt_to_phys(_edata)-1; > > + parse_kexec_jump_back_entry(); > + Why do we need to touch setup.c? > parse_early_param(); > > if (user_defined_memmap) { > Index: linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt 2007-09-20 > 11:24:31.000000000 +0800 > @@ -0,0 +1,81 @@ > + THE LINUX/I386 JUMP BACK PROTOCOL > + --------------------------------- > + > + Huang Ying <ying.huang@intel.com> > + Last update 2007-09-19 > + > +Currently, the following versions of the jump back protocol exist. > + > +Protocol 1.00: Jumping between original kernel and kexeced kernel > + support. > + > + > +**** LOAD THE JUMP BACK IMAGE > + > +Jump back image is an ordinary ELF 64 executable file, it can be > +loaded just as other ELF64 image. That is, the PT_LOAD segments should > +be loaded into their physical address. > + > +Before loading all segments of jump back image, the jump back header > +should be checked. Jump back header can be loaded from the 4K page at > +the jump back entry in jump back image. 
> + > +The header looks like: > + > +Offset Proto Name Meaning > +/Size > + > +C00/2 1.00+ magic Magic number: 0x626A > +C02/2 1.00+ version Jump back protocol version > +C04/4 1.00+ backup_pages_map Map from target page to backup page > + > +Note: unlike ordinary ELF 64 file, the jump back image may occupy most > +memory pages, so it is important for loader to verify there is no > +conflict between pages of loaded image and pages used by loader > +itself. > + > + > +**** DETAILS OF JUMP BACK HEADER > + > +For each field, some are information from the jump back image to > +loader ("read"), some are expected to be filled out by the loader > +("write"), and some are expected to be read and modified by the loader > +("modify"). > + > +All general purpose boot loaders should write the fields marked > +(obligatory). > + > +The byte order of all fields is little endian. > + > +Field name: magic > +Type: read > +Offset/size: 0xc00/2 > +Protocol: 1.00+ > + > + Contains the magic number "jb" (0x626A) > + > +Field name: version > +Type: read > +Offset/size: 0xc02/2 > +Protocol: 1.00+ > + > + Contains the version number, in (major << 8)+minor format, > + e.g. 0x0100 for version 1.00. > + > +Field name: backup_pages_map > +Type: read > +Offset/size: 0xc04/4 > +Protocol: 1.00+ > + > + The map from target address to backup address, it is kimage->head in > + fact. > + TODO: detailed description This is an implementation detail that we should not need to export. I do think having a specification of the requirements for returning to the kexec kernel makes sense. I don't however believe that we need to pass any information. The map of the pages that matter is simply the memory the kernel is using from /proc/iomem - the memory that your image you are loading with kexec is using. All of that information we can pass in /sbin/kexec and should not need to do directly from the kernel. The following piece of code should be always be a valid kexec on jump user and test case. 
test_kexec_jump_payload:
	ret

A standard subroutine call return on whichever architecture we are
working on should be sufficient.

For the registers and the segments we should define which ones should
be hard coded, and which ones should be saved.

Hard coded: %ss, %cs, %ds (4G flat with 0 base)
Callee save: %esp %edi (And whichever ones we care about)

It does make some sense to use the standard kernel argument passing
sequence for bootloaders and what not.  But the wrapper in /sbin/kexec
should do that not the kernel itself.

> +
> +**** JUMP BACK TO THE JUMP BACK IMAGE
> +
> +To jump back to the jump back image, just jump to the jump back
> +entry. At entry, the CPU must be in 32-bit protected mode with paging
> +disabled; the CS, DS and SS must be 4G flat segments; if jumping back
> +to loader is supported, %edi should be the jump back entry of loader,
> +otherwise it should be zero.

Eric
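As a rough illustration of the swap_pages loop in the quoted
relocate_kernel.S hunk, the following user-space C sketch exchanges two
pages through a single scratch page in the same copy order (source to
swap page, destination to source, swap page to destination). The
function name, the PAGE_SIZE constant and the early return for
identical pages are assumptions of this sketch, not code from the patch.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096	/* i386 page size, matching the 1024 x 4-byte copies */

/*
 * Exchange the contents of two pages through one scratch page.
 * Copy order follows the quoted swap_pages loop:
 *   source -> scratch, destination -> source, scratch -> destination.
 * Hypothetical helper for illustration only.
 */
static void swap_page_contents(unsigned char *dest, unsigned char *src,
			       unsigned char *scratch)
{
	/* The scratch page must be distinct from both pages being swapped. */
	assert(scratch != dest && scratch != src);

	/* Assumption of this sketch: a page that is already in place is
	 * skipped, which also avoids memcpy() on overlapping buffers. */
	if (dest == src)
		return;

	memcpy(scratch, src, PAGE_SIZE);	/* back up the source page */
	memcpy(src, dest, PAGE_SIZE);		/* old destination moves to source */
	memcpy(dest, scratch, PAGE_SIZE);	/* saved source lands in destination */
}
```

Calling the helper twice with the same arguments puts both pages back
as they were, which is the property the jump back path relies on when
it swaps the pages again after returning.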
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] <1190266447.21818.17.camel@caritas-dev.intel.com>
                   ` (2 preceding siblings ...)
  2007-09-21 2:55 ` Eric W. Biederman
@ 2007-09-21 4:01 ` Eric W. Biederman
       [not found] ` <m1fy18sp5c.fsf@ebiederm.dsl.xmission.com>
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2007-09-21 4:01 UTC
To: Huang, Ying
Cc: nigel, Kexec Mailing List, linux-kernel, Andrew Morton, linux-pm,
	Jeremy Maitin-Shepard

"Huang, Ying" <ying.huang@intel.com> writes:

> Index: linux-2.6.23-rc6/include/linux/kexec.h
> ===================================================================
> --- linux-2.6.23-rc6.orig/include/linux/kexec.h 2007-09-20 11:24:25.000000000
> +0800
> +++ linux-2.6.23-rc6/include/linux/kexec.h 2007-09-20 11:26:03.000000000 +0800
> @@ -83,6 +83,7 @@
>
>  	unsigned long start;
>  	struct page *control_code_page;
> +	struct page *swap_page;
>
>  	unsigned long nr_segments;
>  	struct kexec_segment segment[KEXEC_SEGMENT_MAX];
> @@ -194,4 +195,12 @@
>  static inline void crash_kexec(struct pt_regs *regs) { }
>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
>  #endif /* CONFIG_KEXEC */
> +
> +#ifdef CONFIG_KEXEC_JUMP
> +extern int machine_kexec_jump(struct kimage *image);
> +extern unsigned long kexec_jump_back_entry;
> +extern int kexec_jump(void);
> +#else /* !CONFIG_KEXEC_JUMP */
> +static inline int kexec_jump(void) { return 0; }
> +#endif /* CONFIG_KEXEC_JUMP */
>  #endif /* LINUX_KEXEC_H */

Please let the kexec_jump code just be triggered off of a flag in
struct kimage.  We just need to define an extra flag to sys_kexec_load,
say KEXEC_RETURNS.  Ideally in the long term we would not have to do
anything except to accept the flag.  Adding a flag makes a nice feature
test if you want to see if your kernel supports the extended version
of kexec.
Until we get the hibernation methods sorted out, storing the flag in
struct kimage and making the methods that we call conditional feels
like a more maintainable interface, especially since we have to know
at kexec image load time what we are going to do with the kexec image.

> +#ifdef CONFIG_KEXEC_JUMP
> +unsigned long kexec_jump_back_entry;
> +
> +int kexec_jump(void)
> +{
> +	int error;
> +
> +	if (!kexec_image)
> +		return -EINVAL;

I understand where you are coming from with this implementation of
kexec_jump, but it looks like this is one of the big parts of this
patch that have not reached their final form.

The line above is racy with sys_kexec_load.

> +	pm_prepare_console();
> +	suspend_console();
> +	error = device_suspend(PMSG_FREEZE);
> +	if (error)
> +		goto Resume_console;

This, as everyone knows, needs to be device_shutdown or a better
hibernation replacement.

> +	error = disable_nonboot_cpus();
> +	if (error)
> +		goto Resume_devices;

Can't we just catch the nonboot cpus in a mutex?  disable_nonboot_cpus
is actually impossible to implement 100% reliably with current
hardware.  But something like smp_call_function, so we trap them at a
specific location, and then the equivalent when we come back, should
be simple.

I guess the tricky part is bringing the cpus back up again.  Using
the broken by design version of cpu hotplug really annoys me here.

> +	local_irq_disable();
> +	/* At this point, device_suspend() has been called, but *not*
> +	 * device_power_down(). We *must* device_power_down() now.
> +	 * Otherwise, drivers for some devices (e.g. interrupt controllers)
> +	 * become desynchronized with the actual state of the hardware
> +	 * at resume time, and evil weirdness ensues.
> +	 */
> +	error = device_power_down(PMSG_FREEZE);
> +	if (error)
> +		goto Enable_irqs;

This of course should go away when we have the proper methods.

> +	save_processor_state();

This line might even be reasonable.
> +	error = machine_kexec_jump(kexec_image);
> +	restore_processor_state();
>
> +	/* NOTE: device_power_up() is just a resume() for devices
> +	 * that suspended with irqs off ... no overall powerup.
> +	 */
> +	device_power_up();

Yep this can go away.

> + Enable_irqs:
> +	local_irq_enable();
> +	enable_nonboot_cpus();

I haven't looked at the cpu start up code yet to see if it is
generally implementable.  I would think so, but I guess we need
to be careful with our data structures.

> + Resume_devices:
> +	device_resume();

This of course should change.

> + Resume_console:
> +	resume_console();
> +	pm_restore_console();

Odd.  I'm a little surprised that the console is the last thing
we restore.  But it does make sense to treat it specially.

> +	return error;
> +}
> +#endif /* CONFIG_KEXEC_JUMP */

Eric
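A minimal sketch of the flag idea from this reply, written as
user-space C rather than kernel code. KEXEC_RETURNS is only the name
suggested above; its value here, struct kimage_sketch, and both helper
functions are hypothetical stand-ins used to show how a flag recorded
at load time could select the path later and double as a feature test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical value; the thread proposes only the name. */
#define KEXEC_RETURNS 0x1UL

/* Illustrative stand-in for the kernel's struct kimage. */
struct kimage_sketch {
	unsigned long start;	/* entry point of the loaded image */
	unsigned long flags;	/* flags recorded at load time */
};

enum kexec_path { KEXEC_PATH_REBOOT, KEXEC_PATH_JUMP };

/* Load step: reject flags this "kernel" does not know, so that a
 * caller can probe for support by passing the flag and checking
 * for -EINVAL. */
static int kexec_load_sketch(struct kimage_sketch *image,
			     unsigned long entry, unsigned long flags)
{
	assert(image != NULL);
	if (flags & ~KEXEC_RETURNS)
		return -EINVAL;
	image->start = entry;
	image->flags = flags;
	return 0;
}

/* Execute step: the decision was made at load time, so the same
 * entry point can serve both the one-way and the returning case. */
static enum kexec_path kexec_select_path(const struct kimage_sketch *image)
{
	assert(image != NULL);
	return (image->flags & KEXEC_RETURNS) ? KEXEC_PATH_JUMP
					      : KEXEC_PATH_REBOOT;
}
```

With this shape no separate reboot command is needed in the sketch:
the caller loads an image with or without the flag, and the existing
execute path branches on what was stored.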
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
       [not found] ` <m1fy18sp5c.fsf@ebiederm.dsl.xmission.com>
@ 2007-09-21 7:27 ` Huang, Ying
  0 siblings, 0 replies; 42+ messages in thread
From: Huang, Ying @ 2007-09-21 7:27 UTC
To: Eric W. Biederman
Cc: nigel, Kexec Mailing List, linux-kernel, Andrew Morton, linux-pm,
	Jeremy Maitin-Shepard

On Thu, 2007-09-20 at 20:55 -0600, Eric W. Biederman wrote:
> "Huang, Ying" <ying.huang@intel.com> writes:
>
> > This patch implements the functionality of jumping between the kexeced
> > kernel and the original kernel.
> >
> > A new reboot command named LINUX_REBOOT_CMD_KJUMP is defined to
> > trigger the jumping to (executing) the new kernel and jumping back to
> > the original kernel.
> >
> > To support jumping between two kernels, before jumping to (executing)
> > the new kernel and jumping back to the original kernel, the devices
> > are put into quiescent state (to be fully implemented),
>
> Well this we have an implementation of (it's called shutdown) or does
> that method not do enough to meet the requirements of hibernation.

I think device_shutdown is not enough for hibernation, because the
current implementation of the device shutdown method does not consider
recovery. For example, for hibernation, a request that the device is
currently executing should be delayed or finished before shutdown, and
may need to be re-executed after recovery. So I think another pair of
callbacks may be needed for the purpose of hibernation.

> If at all possible I would like to keep reboot, kexec and kexec+return
> all using the same device driver methods.

I totally agree!

> > and the state of devices and CPU is saved.
>
> Makes a reasonable amount of sense. We do need to save whatever
> state we cannot recover just be reprogramming the hardware.
> As long as the drivers are built so this is a good place for a
> hot remove to happen we should be in good shape.
> > > After jumping back from kexeced kernel > > and jumping to the new kernel, the state of devices and CPU are > > restored accordingly. The devices/CPU state save/restore code of > > software suspend is called to implement corresponding function. > > At least for now that sounds like a reasonable work around. > > I don't think we want to merge this code until we have agreed upon > how the new device_detach and device_reattach (or whatever we call the > device methods for hibernate) are to be implemented. There is a thread on LKML about this: http://lkml.org/lkml/2007/4/27/129 Do you agree with the conclusion there? > > To support jumping without preserving memory. One shadow backup page > > is allocated for each page used by new (kexeced) kernel. > > That does not sound correct. The current implementation of kexec_load > does allocate a source page and give it a destination page and usually > those two pages are different. But if our memory allocations happen > to return a destination page there we use it directly, making no > copy necessary. > > I think we are talking about the same thing but I'm not certain > you have thought about the case where your shadow backup page happens > to be the same as current page. My description here has some problem. If the source page (shadow page) is same as the target page, there is no copy or swap. I have thought about that, and current implementation works in this situation too. In original kernel it is a allocated page for kexec, so it will not be used for other purpose; in kexeced, it can be used freely. > > When do > > kexec_load, the image of new kernel is loaded into shadow pages, > > Ok. This sounds like the existing implementation. Except it > depending on your destination it may force the address. Yes. This is the existing implementation, just a little usage changing. I load all memory area used by kexeced kernel in addition to kernel image. This is done in kexec-tools. 
So the shadow page is allocated for every pages used by kexeced kernel. > > and > > before executing, the original pages and the shadow pages are swapped, > > so the contents of original pages are backuped. > > Yes. Unless we happen to have everything allocated on the same page. > Does your code handle that case? I know the generic kexec code will > pass lists like that in the proper circumstances. Especially for > the kexec on panic case. My code can handle that case. If everything allocated on the same page, just do not swap or swap with itself. The same lists of generic kexec code is used for swap too. > > Before jumping to the > > new (kexeced) kernel and after jumping back to the original kernel, > > the original pages and the shadow pages are swapped too. > > Yes. That sounds right. > > > A jump back protocol is defined and documented. > > Bleh. We do need to document the requirements but we don't need a > versioned monster. And we don't need to be exposing implementation > details in that documentation. > > In the kexec world /sbin/kexec or another user space caller is > responsible for passing information to our callers. > > To be polite we need to document more but the jump back protocol > really should be as if the entry point kexec handed control to did > a subroutine return. This protocol is mainly for loading the hibernation image from the bootloader directly, not for kexec. An external protocol should be defined for the bootloader, because they are external code. > > Known issues > > > > - A field is added to Linux kernel real-mode header. This is > > temporary, and should be replaced after the 32-bit boot protocol and > > setup data patches are accepted. > > It shouldn't be needed. A mechanism should be provided to pass the jump back entry to the kexeced kernel. A kernel command line parameter constructed by "purgatory" of kexec-tools may be better. > > - The suspend method of device is used to put device in quiescent > > state. 
But if the ACPI is enabled this will also put devices into > > low power state, which prevent the new kernel from booting. So, the > > ACPI must be disabled both in original kernel and kexeced > > kernel. This is planed to be resolved after the suspend method and > > hibernate method is separated for device as proposed earlier in the > > LKML. > > Reasonable. > > > - The NX (none executable) bit should be turned off for the control > > page if available. > > Why don't we have a problem with this in the normal kexec case? Unlike normal kexec, some information need to be read/written from/to the control page both in virtual mode and real mode. The code copied to control page is run both in virtual mode and real mode, so the NX bit should be cleared for the control page. But in normal kexec, code copied to control page is run in real mode only. > > More comments below. > > > Signed-off-by: Huang Ying <ying.huang@intel.com> > > > > --- > > > > Documentation/i386/jump_back_protocol.txt | 81 ++++++++++++ > > arch/i386/Kconfig | 7 + > > arch/i386/boot/header.S | 2 > > arch/i386/kernel/machine_kexec.c | 77 +++++++++--- > > arch/i386/kernel/relocate_kernel.S | 187 ++++++++++++++++++++++++++---- > > arch/i386/kernel/setup.c | 3 > > include/asm-i386/bootparam.h | 3 > > include/asm-i386/kexec.h | 48 ++++++- > > include/linux/kexec.h | 9 + > > include/linux/reboot.h | 2 > > kernel/kexec.c | 59 +++++++++ > > kernel/ksysfs.c | 17 ++ > > kernel/power/Kconfig | 2 > > kernel/sys.c | 8 + > > 14 files changed, 463 insertions(+), 42 deletions(-) > > > > Index: linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c > > =================================================================== > > --- linux-2.6.23-rc6.orig/arch/i386/kernel/machine_kexec.c 2007-09-20 > > 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c 2007-09-20 > > 11:24:31.000000000 +0800 > > @@ -20,6 +20,7 @@ > > #include <asm/cpufeature.h> > > #include <asm/desc.h> > > #include <asm/system.h> > 
> +#include <asm/setup.h>
>
> Please remove the unnecessary and incorrect include.

OK. This should be removed.

> > #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> > static u32 kexec_pgd[1024] PAGE_ALIGNED;
> > @@ -98,23 +99,23 @@
> > {
> > }
> >
> > -/*
> > - * Do not allocate memory (or fail in any way) in machine_kexec().
> > - * We are past the point of no return, committed to rebooting now.
> > - */
> > -NORET_TYPE void machine_kexec(struct kimage *image)
> > +static NORET_TYPE void __machine_kexec(struct kimage *image,
> > +				       void *control_page) ATTRIB_NORET;
> > +
> > +static NORET_TYPE void __machine_kexec(struct kimage *image,
> > +				       void *control_page)
> > {
>
> It looks like we are doing swap_pages unconditionally so I don't
> see why we need to versions of this function.

Yes. It seems that the two versions can be merged.

> And it looks like the NORET_TYPE is no longer an accurate description.

NORET_TYPE is accurate here: __machine_kexec itself never returns. When
we jump back, execution resumes at the return point saved by
kexec_jump_save_cpu, not in __machine_kexec.
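
To illustrate the control flow behind this answer: it resembles
setjmp()/longjmp(). The function that transfers control never returns to
its caller; execution resumes where the state was saved, and that point
appears to return a second time with a different value. The following is
only a user-space analogy, assuming kexec_jump_save_cpu() plays the role
of setjmp() (0 after saving, 1 after the jump back, as the assembly in
this patch suggests). The names run_other_kernel, jump_and_return and
trace are made up for the sketch and are not part of the patch.

```c
#include <setjmp.h>

static jmp_buf saved_cpu;   /* stands in for the CPU state kept in the control page */
static int trace[4];        /* records the order in which the steps run */
static int trace_len;

/* Analogue of __machine_kexec(): it never returns to its caller. */
static void run_other_kernel(void)
{
    trace[trace_len++] = 2;     /* "the kexeced kernel runs" */
    longjmp(saved_cpu, 1);      /* "jump back": resumes inside setjmp() below */
}

/* Analogue of machine_kexec_jump(): the return point lives here. */
static int jump_and_return(void)
{
    trace_len = 0;
    trace[trace_len++] = 1;     /* before the jump */
    if (!setjmp(saved_cpu))     /* 0 on the first pass, 1 after the jump back */
        run_other_kernel();
    trace[trace_len++] = 3;     /* reached only after the jump back */
    return trace_len;
}
```

In this analogy the "no return" property belongs to run_other_kernel(),
while the caller still continues normally after the second return.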
> > unsigned long page_list[PAGES_NR]; > > - void *control_page; > > - > > - /* Interrupts aren't acceptable while we reboot */ > > - local_irq_disable(); > > - > > - control_page = page_address(image->control_code_page); > > - memcpy(control_page, relocate_kernel, PAGE_SIZE); > > + asmlinkage NORET_TYPE void > > + (*relocate_kernel_ptr)(unsigned long indirection_page, > > + unsigned long control_page, > > + unsigned long start_address, > > + unsigned int has_pae) ATTRIB_NORET; > > > > + relocate_kernel_ptr = control_page + > > + ((void *)relocate_kernel - (void *)relocate_page); > > page_list[PA_CONTROL_PAGE] = __pa(control_page); > > - page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel; > > + page_list[VA_CONTROL_PAGE] = (unsigned long)control_page; > > page_list[PA_PGD] = __pa(kexec_pgd); > > page_list[VA_PGD] = (unsigned long)kexec_pgd; > > #ifdef CONFIG_X86_PAE > > @@ -127,6 +128,7 @@ > > page_list[VA_PTE_0] = (unsigned long)kexec_pte0; > > page_list[PA_PTE_1] = __pa(kexec_pte1); > > page_list[VA_PTE_1] = (unsigned long)kexec_pte1; > > + page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT); > > > > /* The segment registers are funny things, they have both a > > * visible and an invisible part. Whenever the visible part is > > @@ -145,8 +147,26 @@ > > set_idt(phys_to_virt(0),0); > > > > /* now call it */ > > - relocate_kernel((unsigned long)image->head, (unsigned long)page_list, > > - image->start, cpu_has_pae); > > + relocate_kernel_ptr((unsigned long)image->head, > > + (unsigned long)page_list, > > + image->start, cpu_has_pae); > > +} > > + > > +/* > > + * Do not allocate memory (or fail in any way) in machine_kexec(). > > + * We are past the point of no return, committed to rebooting now. 
> > + */ > > +NORET_TYPE void machine_kexec(struct kimage *image) > > +{ > > + void *control_page; > > + > > + /* Interrupts aren't acceptable while we reboot */ > > + local_irq_disable(); > > + > > + control_page = page_address(image->control_code_page); > > + memcpy(control_page, relocate_page, PAGE_SIZE); > > + > > + __machine_kexec(image, control_page); > > } > > > > /* crashkernel=size@addr specifies the location to reserve for > > @@ -182,3 +202,30 @@ > > #endif > > } > > > > +#ifdef CONFIG_KEXEC_JUMP > > +int machine_kexec_jump(struct kimage *image) > > +{ > > + void *control_page; > > + unsigned long pa_control_page; > > > + > > + control_page = page_address(image->control_code_page); > > + memcpy(control_page, relocate_page, PAGE_SIZE/2); > > + pa_control_page = __pa(control_page); > > + memcpy(control_page + 1, &pa_control_page, sizeof(pa_control_page)); > > + > > + KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER; > > + KJUMP_VERSION(control_page) = KJUMP_VERSION_NUMBER; > > This bit really looks wrong. It is for /sbin/kexec to provide this > kind of interface. This is an interface for external bootloader, not for kexec. And if the version of the original kernel and the kexeced kernel is different, a kexec jump back version may be needed. 
> > + > > + if (!kexec_jump_save_cpu(control_page)) > > + __machine_kexec(image, control_page); > > + > > + kexec_jump_back_entry = KJUMP_ENTRY(control_page); > > + image->start = kexec_jump_back_entry; > > + return 0; > > +} > > + > > +void __init parse_kexec_jump_back_entry(void) > > +{ > > + kexec_jump_back_entry = boot_params.hdr.jump_back_entry; > > +} > > +#endif /* CONFIG_KEXEC_JUMP */ > > Index: linux-2.6.23-rc6/include/asm-i386/kexec.h > > =================================================================== > > --- linux-2.6.23-rc6.orig/include/asm-i386/kexec.h 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/include/asm-i386/kexec.h 2007-09-20 11:24:31.000000000 > > +0800 > > @@ -9,16 +9,42 @@ > > #define VA_PTE_0 5 > > #define PA_PTE_1 6 > > #define VA_PTE_1 7 > > +#define PA_SWAP_PAGE 8 > > #ifdef CONFIG_X86_PAE > > -#define PA_PMD_0 8 > > -#define VA_PMD_0 9 > > -#define PA_PMD_1 10 > > -#define VA_PMD_1 11 > > -#define PAGES_NR 12 > > +#define PA_PMD_0 9 > > +#define VA_PMD_0 10 > > +#define PA_PMD_1 11 > > +#define VA_PMD_1 12 > > +#define PAGES_NR 13 > > #else > > -#define PAGES_NR 8 > > +#define PAGES_NR 9 > > #endif > > > > +#define KJUMP_DATA_BASE 0x800 > > + > > +#define KJUMP_MAGIC_NUMBER 0x626a > > +#define KJUMP_VERSION_NUMBER 0x0100 > > + > > +#define KJUMP_DATA(buf) ((unsigned char *)(buf)+KJUMP_DATA_BASE) > > +#define KJUMP_OFF(off) (KJUMP_DATA_BASE+(off)) > > + > > +#define KJUMP_MAGIC_OFF KJUMP_OFF(0x0) > > +#define KJUMP_MAGIC(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x0)) > > +#define KJUMP_VERSION(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x2)) > > +#define KJUMP_BACKUP_PAGES_MAP_OFF \ > > + KJUMP_OFF(0x4) > > +#define KJUMP_BACKUP_PAGES_MAP(buf) \ > > + (*(unsigned long *)(KJUMP_DATA(buf)+0x4)) > > + > > +/* > > + * The following are not a part of jump back protocol, for internal > > + * use only > > + */ > > +#define KJUMP_ENTRY_OFF KJUMP_OFF(0x20) > > +#define KJUMP_ENTRY(buf) (*(unsigned long 
*)(KJUMP_DATA(buf)+0x20)) > > +/* Other internal data fields base */ > > +#define KJUMP_OTHER_OFF KJUMP_OFF(0x24) > > + > > #ifndef __ASSEMBLY__ > > > > #include <asm/ptrace.h> > > @@ -94,6 +120,16 @@ > > unsigned long start_address, > > unsigned int has_pae) ATTRIB_NORET; > > > > +extern char relocate_page[PAGE_SIZE]; > > + > > +extern asmlinkage int kexec_jump_save_cpu(void *buf); > > + > > +#ifdef CONFIG_KEXEC_JUMP > > +void parse_kexec_jump_back_entry(void); > > +#else > > +static inline void parse_kexec_jump_back_entry(void) { } > > +#endif > > + > > #endif /* __ASSEMBLY__ */ > > > > #endif /* _I386_KEXEC_H */ > > Index: linux-2.6.23-rc6/include/linux/kexec.h > > =================================================================== > > --- linux-2.6.23-rc6.orig/include/linux/kexec.h 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/include/linux/kexec.h 2007-09-20 11:26:03.000000000 +0800 > > @@ -83,6 +83,7 @@ > > > > unsigned long start; > > struct page *control_code_page; > > + struct page *swap_page; > > Ok. This looks reasonable. So we can reorder the pages in a non-destructive > way. 
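
A minimal C sketch of the non-destructive reordering discussed here,
assuming a single scratch page plays the role of image->swap_page. The
copy order follows one reading of swap_pages in relocate_kernel.S from
this patch (source to swap page, destination to source, swap page to
destination). swap_page_contents and SKETCH_PAGE_SIZE are illustrative
names, and the early return for identical pages is an assumption of the
sketch, not something the assembly does explicitly.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Exchange the contents of two pages through a scratch page. */
static void swap_page_contents(unsigned char *dest, unsigned char *src,
                               unsigned char *scratch)
{
    assert(dest && src && scratch);
    if (dest == src)            /* shadow page is the target page: nothing to do */
        return;
    memcpy(scratch, src, SKETCH_PAGE_SIZE);    /* source      -> swap page   */
    memcpy(src, dest, SKETCH_PAGE_SIZE);       /* destination -> source      */
    memcpy(dest, scratch, SKETCH_PAGE_SIZE);   /* swap page   -> destination */
}
```

Because nothing is overwritten without first being saved, running the
same swap a second time restores both pages, which is what jumping back
relies on.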
> > > unsigned long nr_segments; > > struct kexec_segment segment[KEXEC_SEGMENT_MAX]; > > @@ -194,4 +195,12 @@ > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > #endif /* CONFIG_KEXEC */ > > + > > +#ifdef CONFIG_KEXEC_JUMP > > +extern int machine_kexec_jump(struct kimage *image); > > +extern unsigned long kexec_jump_back_entry; > > +extern int kexec_jump(void); > > +#else /* !CONFIG_KEXEC_JUMP */ > > +static inline int kexec_jump(void) { return 0; } > > +#endif /* CONFIG_KEXEC_JUMP */ > > #endif /* LINUX_KEXEC_H */ > > Index: linux-2.6.23-rc6/kernel/kexec.c > > =================================================================== > > --- linux-2.6.23-rc6.orig/kernel/kexec.c 2007-09-20 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/kernel/kexec.c 2007-09-20 11:24:31.000000000 +0800 > > @@ -24,6 +24,10 @@ > > #include <linux/utsrelease.h> > > #include <linux/utsname.h> > > #include <linux/numa.h> > > +#include <linux/suspend.h> > > +#include <linux/pm.h> > > +#include <linux/cpu.h> > > +#include <linux/console.h> > > > > #include <asm/page.h> > > #include <asm/uaccess.h> > > @@ -243,6 +247,12 @@ > > goto out; > > } > > > > + image->swap_page = kimage_alloc_control_pages(image, 0); > > + if (!image->swap_page) { > > + printk(KERN_ERR "Could not allocate swap buffer\n"); > > + goto out; > > + } > > + > > result = 0; > > out: > > if (result == 0) > > @@ -1246,3 +1256,52 @@ > > } > > > > module_init(crash_save_vmcoreinfo_init) > > + > > +#ifdef CONFIG_KEXEC_JUMP > > +unsigned long kexec_jump_back_entry; > > + > > +int kexec_jump(void) > > +{ > > + int error; > > + > > + if (!kexec_image) > > + return -EINVAL; > > + > > + pm_prepare_console(); > > + suspend_console(); > > + error = device_suspend(PMSG_FREEZE); > > + if (error) > > + goto Resume_console; > > + error = disable_nonboot_cpus(); > > + if (error) > > + goto Resume_devices; > > + local_irq_disable(); > > + /* At this 
point, device_suspend() has been called, but *not* > > + * device_power_down(). We *must* device_power_down() now. > > + * Otherwise, drivers for some devices (e.g. interrupt controllers) > > + * become desynchronized with the actual state of the hardware > > + * at resume time, and evil weirdness ensues. > > + */ > > + error = device_power_down(PMSG_FREEZE); > > + if (error) > > + goto Enable_irqs; > > + > > + save_processor_state(); > > + error = machine_kexec_jump(kexec_image); > > + restore_processor_state(); > > + > > + /* NOTE: device_power_up() is just a resume() for devices > > + * that suspended with irqs off ... no overall powerup. > > + */ > > + device_power_up(); > > + Enable_irqs: > > + local_irq_enable(); > > + enable_nonboot_cpus(); > > + Resume_devices: > > + device_resume(); > > + Resume_console: > > + resume_console(); > > + pm_restore_console(); > > + return error; > > +} > > +#endif /* CONFIG_KEXEC_JUMP */ > > Index: linux-2.6.23-rc6/kernel/ksysfs.c > > =================================================================== > > --- linux-2.6.23-rc6.orig/kernel/ksysfs.c 2007-09-20 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/kernel/ksysfs.c 2007-09-20 11:24:31.000000000 +0800 > > @@ -69,6 +69,20 @@ > > } > > KERNEL_ATTR_RO(vmcoreinfo); > > > > +#ifdef CONFIG_KEXEC_JUMP > > +static ssize_t kexec_jump_back_entry_show(struct kset *kset, char *page) > > +{ > > + return sprintf(page, "0x%lx\n", kexec_jump_back_entry); > > +} > > +static ssize_t kexec_jump_back_entry_store(struct kset *kset, const char *page, > > + size_t count) > > +{ > > + kexec_jump_back_entry = simple_strtoul(page, NULL, 0); > > + return count; > > +} > > + > > +KERNEL_ATTR_RW(kexec_jump_back_entry); > > +#endif /* CONFIG_KEXEC_JUMP */ > > #endif /* CONFIG_KEXEC */ > > > > /* > > @@ -105,6 +119,9 @@ > > &kexec_loaded_attr.attr, > > &kexec_crash_loaded_attr.attr, > > &vmcoreinfo_attr.attr, > > +#ifdef CONFIG_KEXEC_JUMP > > + &kexec_jump_back_entry_attr.attr, > > +#endif > We should 
not need this. Maybe this is not needed if the kernel command line mechanism is used to pass jump back entry point. The user space tool can get that through "cat /proc/cmdline". > > #endif > > NULL > > }; > > Index: linux-2.6.23-rc6/kernel/sys.c > > =================================================================== > > --- linux-2.6.23-rc6.orig/kernel/sys.c 2007-09-20 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/kernel/sys.c 2007-09-20 11:24:31.000000000 +0800 > > @@ -424,6 +424,14 @@ > > unlock_kernel(); > > return -EINVAL; > > > > + case LINUX_REBOOT_CMD_KEXEC_JUMP: > > + { > > + int ret; > > + ret = kexec_jump(); > > + unlock_kernel(); > > + return ret; > > + } > > + > > #ifdef CONFIG_HIBERNATION > > case LINUX_REBOOT_CMD_SW_SUSPEND: > > { > > Index: linux-2.6.23-rc6/include/linux/reboot.h > > =================================================================== > > --- linux-2.6.23-rc6.orig/include/linux/reboot.h 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/include/linux/reboot.h 2007-09-20 11:24:31.000000000 +0800 > > @@ -23,6 +23,7 @@ > > * RESTART2 Restart system using given command string. > > * SW_SUSPEND Suspend system using software suspend if compiled in. > > * KEXEC Restart system using a previously loaded Linux kernel > > + * KEXEC_JUMP Jump between original kernel and kexeced kernel. > > */ > > > > #define LINUX_REBOOT_CMD_RESTART 0x01234567 > > @@ -33,6 +34,7 @@ > > #define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4 > > #define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2 > > #define LINUX_REBOOT_CMD_KEXEC 0x45584543 > > +#define LINUX_REBOOT_CMD_KEXEC_JUMP 0x3928A5FD > > I'm still not quite convinced that we need a separate entry point for this. > But until we split out the hibernation methods, it seems reasonable. 
> > > > > #ifdef __KERNEL__ > > Index: linux-2.6.23-rc6/arch/i386/Kconfig > > =================================================================== > > --- linux-2.6.23-rc6.orig/arch/i386/Kconfig 2007-09-20 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/arch/i386/Kconfig 2007-09-20 11:24:31.000000000 +0800 > > @@ -830,6 +830,13 @@ > > (CONFIG_RELOCATABLE=y). > > For more details see Documentation/kdump/kdump.txt > > > > +config KEXEC_JUMP > > + bool "kexec jump (EXPERIMENTAL)" > > + depends on EXPERIMENTAL > > + depends on PM && X86_32 && KEXEC > > + ---help--- > > + Jump between the kexeced kernel and the orignal kernel. > > + > > config PHYSICAL_START > > hex "Physical address where the kernel is loaded" if (EMBEDDED || > > CRASH_DUMP) > > default "0x1000000" if X86_NUMAQ > > Index: linux-2.6.23-rc6/kernel/power/Kconfig > > =================================================================== > > --- linux-2.6.23-rc6.orig/kernel/power/Kconfig 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/kernel/power/Kconfig 2007-09-20 11:24:31.000000000 +0800 > > @@ -70,7 +70,7 @@ > > > > config PM_SLEEP > > bool > > - depends on SUSPEND || HIBERNATION > > + depends on SUSPEND || HIBERNATION || KEXEC_JUMP > > default y > > > > config SUSPEND_UP_POSSIBLE > > Index: linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S > > =================================================================== > > --- linux-2.6.23-rc6.orig/arch/i386/kernel/relocate_kernel.S 2007-09-20 > > 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S 2007-09-20 > > 11:24:31.000000000 +0800 > > @@ -19,8 +19,87 @@ > > #define PAGE_ATTR 0x63 /* _PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_DIRTY */ > > #define PAE_PGD_ATTR 0x01 /* _PAGE_PRESENT */ > > > > +#define STACK_TOP 0x1000 > > + > > +#define DATA(offset) (KJUMP_OTHER_OFF+(offset)) > > + > > +/* Minimal CPU stat */ > > +#define EBX DATA(0x0) > > +#define ESI DATA(0x4) > > +#define EDI DATA(0x8) > > +#define EBP 
DATA(0xc) > > +#define ESP DATA(0x10) > > +#define CR0 DATA(0x14) > > +#define CR3 DATA(0x18) > > +#define CR4 DATA(0x1c) > > +#define FLAG DATA(0x20) > > +#define RET DATA(0x24) > > + > > +/* some information saved in control page (CP) for jumping back */ > > +#define CP_VA_CONTROL_PAGE DATA(0x30) > > +#define CP_PA_PGD DATA(0x34) > > +#define CP_PA_SWAP_PAGE DATA(0x38) > > + > > .text > > .align PAGE_ALIGNED > > + .globl relocate_page > > +relocate_page: > > + > > +/* > > + * Entry point for jumping back from kexeced kernel, the paging is > > + * turned off, the information needed is at relocate_page + > > + * PAGE_SIZE/2 > > + */ > > +kexec_jump_back_entry: > > + movl $relocate_page, %ebx > > + movl %edi, KJUMP_ENTRY_OFF(%ebx) > > + movl CP_VA_CONTROL_PAGE(%ebx), %edi > > + > > + lea STACK_TOP(%ebx), %esp > > + > > + movl CP_PA_SWAP_PAGE(%ebx), %eax > > + movl KJUMP_BACKUP_PAGES_MAP_OFF(%ebx), %edx > > + pushl %eax > > + pushl %edx > > + call swap_pages > > + addl $8, %esp > > + > > + movl CP_PA_PGD(%ebx), %eax > > + movl %eax, %cr3 > > + > > + movl %cr0, %eax > > + orl $(1<<31), %eax > > + movl %eax, %cr0 > > + > > + movl %edi, %esp > > + addl $STACK_TOP, %esp > > + > > + movl %edi, %eax > > + addl $(virtual_mapped - relocate_page), %eax > > + pushl %eax > > + ret > > + > > +virtual_mapped: > > + movl %edi, %edx > > + movl EBX(%edx), %ebx > > + movl ESI(%edx), %esi > > + movl EDI(%edx), %edi > > + movl EBP(%edx), %ebp > > + movl FLAG(%edx), %eax > > + pushl %eax > > + popf > > + movl ESP(%edx), %esp > > + movl CR4(%edx), %eax > > + movl %eax, %cr4 > > + movl CR3(%edx), %eax > > + movl %eax, %cr3 > > + movl CR0(%edx), %eax > > + movl %eax, %cr0 > > + movl RET(%edx), %eax > > + movl %eax, (%esp) > > + mov $1, %eax > > + ret > > + > > .globl relocate_kernel > > relocate_kernel: > > movl 8(%esp), %ebp /* list of pages */ > > @@ -146,6 +225,15 @@ > > pushl $0 > > popfl > > > > + /* save some information for jumping back */ > > + movl PTR(VA_CONTROL_PAGE)(%ebp), %edi 
> > + movl %edi, CP_VA_CONTROL_PAGE(%edi) > > + movl PTR(PA_PGD)(%ebp), %eax > > + movl %eax, CP_PA_PGD(%edi) > > + movl PTR(PA_SWAP_PAGE)(%ebp), %eax > > + movl %eax, CP_PA_SWAP_PAGE(%edi) > > + movl %ebx, KJUMP_BACKUP_PAGES_MAP_OFF(%edi) > > + > > /* get physical address of control page now */ > > /* this is impossible after page table switch */ > > movl PTR(PA_CONTROL_PAGE)(%ebp), %edi > > @@ -155,11 +243,11 @@ > > movl %eax, %cr3 > > > > /* setup a new stack at the end of the physical control page */ > > - lea 4096(%edi), %esp > > + lea STACK_TOP(%edi), %esp > > > > /* jump to identity mapped page */ > > movl %edi, %eax > > - addl $(identity_mapped - relocate_kernel), %eax > > + addl $(identity_mapped - relocate_page), %eax > > pushl %eax > > ret > > > > @@ -197,8 +285,44 @@ > > xorl %eax, %eax > > movl %eax, %cr3 > > > > + movl CP_PA_SWAP_PAGE(%edi), %eax > > + pushl %eax > > + pushl %ebx > > + call swap_pages > > + addl $8, %esp > > + > > + /* To be certain of avoiding problems with self-modifying code > > + * I need to execute a serializing instruction here. > > + * So I flush the TLB, it's handy, and not processor dependent. 
> > + */ > > + xorl %eax, %eax > > + movl %eax, %cr3 > > + > > + /* set all of the registers to known values */ > > + /* leave %esp alone */ > > + > > + movw KJUMP_MAGIC_OFF(%edi), %ax > > + cmpw $KJUMP_MAGIC_NUMBER, %ax > > + jz 1f > > + xorl %edi, %edi > > +1: > > + xorl %eax, %eax > > + xorl %ebx, %ebx > > + xorl %ecx, %ecx > > + xorl %edx, %edx > > + xorl %esi, %esi > > + xorl %ebp, %ebp > > + ret > > + > > /* Do the copies */ > > - movl %ebx, %ecx > > +swap_pages: > > + movl 8(%esp), %edx > > + movl 4(%esp), %ecx > > + pushl %ebp > > + pushl %ebx > > + pushl %edi > > + pushl %esi > > + movl %ecx, %ebx > > jmp 1f > > > > 0: /* top, read another word from the indirection page */ > > @@ -226,27 +350,50 @@ > > movl %ecx, %esi /* For every source page do a copy */ > > andl $0xfffff000, %esi > > > > + movl %edi, %eax > > + movl %esi, %ebp > > + > > + movl %edx, %edi > > movl $1024, %ecx > > rep ; movsl > > - jmp 0b > > > > -3: > > + movl %ebp, %edi > > + movl %eax, %esi > > + movl $1024, %ecx > > + rep ; movsl > > > > - /* To be certain of avoiding problems with self-modifying code > > - * I need to execute a serializing instruction here. > > - * So I flush the TLB, it's handy, and not processor dependent. 
> > - */ > > - xorl %eax, %eax > > - movl %eax, %cr3 > > + movl %eax, %edi > > + movl %edx, %esi > > + movl $1024, %ecx > > + rep ; movsl > > > > - /* set all of the registers to known values */ > > - /* leave %esp alone */ > > + lea 4096(%ebp), %esi > > + jmp 0b > > +3: > > + popl %esi > > + popl %edi > > + popl %ebx > > + popl %ebp > > + ret > > > > - xorl %eax, %eax > > - xorl %ebx, %ebx > > - xorl %ecx, %ecx > > - xorl %edx, %edx > > - xorl %esi, %esi > > - xorl %edi, %edi > > - xorl %ebp, %ebp > > + .globl kexec_jump_save_cpu > > +kexec_jump_save_cpu: > > + movl 4(%esp), %edx > > + movl %ebx, EBX(%edx) > > + movl %esi, ESI(%edx) > > + movl %edi, EDI(%edx) > > + movl %ebp, EBP(%edx) > > + movl %esp, ESP(%edx) > > + movl %cr0, %eax > > + movl %eax, CR0(%edx) > > + movl %cr3, %eax > > + movl %eax, CR3(%edx) > > + movl %cr4, %eax > > + movl %eax, CR4(%edx) > > + pushf > > + popl %eax > > + movl %eax, FLAG(%edx) > > + movl (%esp), %eax > > + movl %eax, RET(%edx) > > + mov $0, %eax > > ret > > Index: linux-2.6.23-rc6/include/asm-i386/bootparam.h > > =================================================================== > > --- linux-2.6.23-rc6.orig/include/asm-i386/bootparam.h 2007-09-20 > > 11:24:25.000000000 +0800 > > +++ linux-2.6.23-rc6/include/asm-i386/bootparam.h 2007-09-20 11:24:31.000000000 > > +0800 > > @@ -41,6 +41,9 @@ > > u32 initrd_addr_max; > > u32 kernel_alignment; > > u8 relocatable_kernel; > > + u8 _pad2[3]; > > + u32 cmdline_size; > > + u32 jump_back_entry; > > } __attribute__((packed)); > > > > struct sys_desc_table { > > Index: linux-2.6.23-rc6/arch/i386/boot/header.S > > =================================================================== > > --- linux-2.6.23-rc6.orig/arch/i386/boot/header.S 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/arch/i386/boot/header.S 2007-09-20 11:24:31.000000000 +0800 > > @@ -214,6 +214,8 @@ > > #added with boot protocol > > #version 2.06 > > > > +jump_back_entry: .long 0 #jump back entry point > > 
Please no. > > + > > # End of setup header ##################################################### > > > > .section ".inittext", "ax" > > Index: linux-2.6.23-rc6/arch/i386/kernel/setup.c > > =================================================================== > > --- linux-2.6.23-rc6.orig/arch/i386/kernel/setup.c 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/arch/i386/kernel/setup.c 2007-09-20 13:14:19.000000000 > > +0800 > > @@ -60,6 +60,7 @@ > > #include <asm/ist.h> > > #include <asm/io.h> > > #include <asm/vmi.h> > > +#include <asm/kexec.h> > > #include <setup_arch.h> > > #include <bios_ebda.h> > > > > @@ -566,6 +567,8 @@ > > data_resource.start = virt_to_phys(_etext); > > data_resource.end = virt_to_phys(_edata)-1; > > > > + parse_kexec_jump_back_entry(); > > + > > Why do we need to touch setup.c? > > > parse_early_param(); > > > > if (user_defined_memmap) { > > Index: linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt > > =================================================================== > > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > > +++ linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt 2007-09-20 > > 11:24:31.000000000 +0800 > > @@ -0,0 +1,81 @@ > > + THE LINUX/I386 JUMP BACK PROTOCOL > > + --------------------------------- > > + > > + Huang Ying <ying.huang@intel.com> > > + Last update 2007-09-19 > > + > > +Currently, the following versions of the jump back protocol exist. > > + > > +Protocol 1.00: Jumping between original kernel and kexeced kernel > > + support. > > + > > + > > +**** LOAD THE JUMP BACK IMAGE > > + > > +Jump back image is an ordinary ELF 64 executable file, it can be > > +loaded just as other ELF64 image. That is, the PT_LOAD segments should > > +be loaded into their physical address. > > + > > +Before loading all segments of jump back image, the jump back header > > +should be checked. Jump back header can be loaded from the 4K page at > > +the jump back entry in jump back image. 
> > + > > +The header looks like: > > + > > +Offset Proto Name Meaning > > +/Size > > + > > +C00/2 1.00+ magic Magic number: 0x626A > > +C02/2 1.00+ version Jump back protocol version > > +C04/4 1.00+ backup_pages_map Map from target page to backup page > > + > > +Note: unlike ordinary ELF 64 file, the jump back image may occupy most > > +memory pages, so it is important for loader to verify there is no > > +conflict between pages of loaded image and pages used by loader > > +itself. > > + > > + > > +**** DETAILS OF JUMP BACK HEADER > > + > > +For each field, some are information from the jump back image to > > +loader ("read"), some are expected to be filled out by the loader > > +("write"), and some are expected to be read and modified by the loader > > +("modify"). > > + > > +All general purpose boot loaders should write the fields marked > > +(obligatory). > > + > > +The byte order of all fields is little endian. > > + > > +Field name: magic > > +Type: read > > +Offset/size: 0xc00/2 > > +Protocol: 1.00+ > > + > > + Contains the magic number "jb" (0x626A) > > + > > +Field name: version > > +Type: read > > +Offset/size: 0xc02/2 > > +Protocol: 1.00+ > > + > > + Contains the version number, in (major << 8)+minor format, > > + e.g. 0x0100 for version 1.00. > > + > > +Field name: backup_pages_map > > +Type: read > > +Offset/size: 0xc04/4 > > +Protocol: 1.00+ > > + > > + The map from target address to backup address, it is kimage->head in > > + fact. > > + TODO: detailed description > > This is an implementation detail that we should not need to export. > I do think having a specification of the requirements for returning to > the kexec kernel makes sense. I don't however believe that we need to > pass any information. The map of the pages that matter is simply > the memory the kernel is using from /proc/iomem - the memory that > your image you are loading with kexec is using. I think the backup_pages_map can be used by kdump. 
Because some pages of the original kernel are swapped to shadow pages,
the map is needed if you want to examine the contents of those pages.
I also think the page swapping makes it possible to implement crash
dump without preserving memory: just load the crash dump kernel image
and all memory used by the crash dump kernel.

> All of that information we can pass in /sbin/kexec and should not
> need to do directly from the kernel.
>
> The following piece of code should be always be a valid kexec on
> jump user and test case.
>
> test_kexec_jump_payload:
> 	ret
>
> A standard subroutine call return on whichever architecture we are
> working on should be sufficient.
>
> For the registers and the segments we should define which ones
> should be hard coded, and which ones should be saved.
>
> Hard coded:
> %ss, %cs, %ds (4G flat with 0 base)
>
> Callee save:
> %esp
> %edi
> (And whichever ones we care about)
>
> It does make some sense to use the standard kernel argument passing sequence
> for bootloaders and what not. But the wrapper in /sbin/kexec should do that
> not the kernel itself.

If I understand correctly, you think a function-call style interface to
the "purgatory" of /sbin/kexec is better, and /sbin/kexec will convert
the information to the standard kernel parameter passing mechanism if
necessary. My current implementation is more like passing information
through global variables. Your method is more standard, but my method
has some advantages too:

- It is more convenient to pass information back from the kexeced
  kernel to the original kernel.
- If loaded directly from the bootloader, the bootloader can check the
  information of the hibernated kernel more easily.

>
> > +
> > +**** JUMP BACK TO THE JUMP BACK IMAGE
> > +
> > +To jump back to the jump back image, just jump to the jump back
> > +entry.
At entry, the CPU must be in 32-bit protected mode with paging
> > +disabled; the CS, DS and SS must be 4G flat segments; if jumping back
> > +to loader is supported, %edi should be the jump back entry of loader,
> > +otherwise it should be zero.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 42+ messages in thread
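
For readers following the protocol discussion in the message above, here
is a rough sketch of how a loader might validate the jump back header
described there: a little-endian magic (0x626A), a version in
(major << 8) + minor form, and the backup_pages_map field at offsets 0,
2 and 4 from the header base. Note that the quoted kexec.h uses
KJUMP_DATA_BASE 0x800 while the quoted document lists 0xC00, so the base
is a parameter here. The names jb_header and jb_parse_header, the return
codes, and the policy of rejecting a different major version are
assumptions of this sketch, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define JB_MAGIC   0x626Au   /* "jb", value taken from the patch */
#define JB_VERSION 0x0100u   /* protocol 1.00, value taken from the patch */

struct jb_header {           /* hypothetical in-memory form of the header */
    uint16_t magic;
    uint16_t version;
    uint32_t backup_pages_map;
};

/* Read little-endian fields byte by byte, independent of host byte order. */
static uint16_t jb_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | ((unsigned int)p[1] << 8));
}

static uint32_t jb_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the header found at page + base.
 * Returns 0 on success, -1 on a bad magic number, and -2 if the major
 * version differs (this compatibility rule is an assumption).
 */
static int jb_parse_header(const unsigned char *page, size_t base,
                           struct jb_header *out)
{
    const unsigned char *p = page + base;

    out->magic = jb_le16(p + 0x0);
    out->version = jb_le16(p + 0x2);
    out->backup_pages_map = jb_le32(p + 0x4);

    if (out->magic != JB_MAGIC)
        return -1;
    if ((out->version >> 8) != (JB_VERSION >> 8))
        return -2;
    return 0;
}
```

A loader would call this on the 4K page at the jump back entry before
loading any segments, and refuse to continue on a non-zero result.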
[parent not found: <m1y7f0k6p4.fsf@ebiederm.dsl.xmission.com>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump [not found] ` <m1y7f0k6p4.fsf@ebiederm.dsl.xmission.com> @ 2007-09-21 8:42 ` Huang, Ying 0 siblings, 0 replies; 42+ messages in thread From: Huang, Ying @ 2007-09-21 8:42 UTC (permalink / raw) To: Eric W. Biederman Cc: nigel, Kexec Mailing List, linux-kernel, Andrew Morton, linux-pm, Jeremy Maitin-Shepard On Thu, 2007-09-20 at 22:01 -0600, Eric W. Biederman wrote: > "Huang, Ying" <ying.huang@intel.com> writes: > > > Index: linux-2.6.23-rc6/include/linux/kexec.h > > =================================================================== > > --- linux-2.6.23-rc6.orig/include/linux/kexec.h 2007-09-20 11:24:25.000000000 > > +0800 > > +++ linux-2.6.23-rc6/include/linux/kexec.h 2007-09-20 11:26:03.000000000 +0800 > > @@ -83,6 +83,7 @@ > > > > unsigned long start; > > struct page *control_code_page; > > + struct page *swap_page; > > > > unsigned long nr_segments; > > struct kexec_segment segment[KEXEC_SEGMENT_MAX]; > > @@ -194,4 +195,12 @@ > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > #endif /* CONFIG_KEXEC */ > > + > > +#ifdef CONFIG_KEXEC_JUMP > > +extern int machine_kexec_jump(struct kimage *image); > > +extern unsigned long kexec_jump_back_entry; > > +extern int kexec_jump(void); > > +#else /* !CONFIG_KEXEC_JUMP */ > > +static inline int kexec_jump(void) { return 0; } > > +#endif /* CONFIG_KEXEC_JUMP */ > > #endif /* LINUX_KEXEC_H */ > > Please the kexec_jump code just be triggered off of a flag in > struct kimage. We just need to define an extra flag to sys_kexec_load > say KEXEC_RETURNS. Ideally in the long term we would not have to > do anything except to accept the flag. Adding a flag makes > a nice feature test if you want to see if your kernel supports > the extended version of kexec. 
>
> Until we get the hibernation methods sorted out storing the flag in
> struct kimage and making the methods that we call conditional feels
> like a more maintainable interface. Especially since we have to
> know at kexec image load time what we are going to do with the
> kexec image.

You mean we pass KEXEC_RETURNS to sys_kexec_load, then use the ordinary
reboot command LINUX_REBOOT_CMD_KEXEC, which calls kexec_jump
conditionally based on KEXEC_RETURNS? This is reasonable. I will
change it.

> > +#ifdef CONFIG_KEXEC_JUMP
> > +unsigned long kexec_jump_back_entry;
> > +
> > +int kexec_jump(void)
> > +{
> > +	int error;
> > +
> > +	if (!kexec_image)
> > +		return -EINVAL;
>
> I understand where you are coming from with this implementation of
> kexec_jump but it looks like this is one of the big parts of this
> patch that have not reached their final form.
>
> The line above is racy with sys_kexec_load.

Yes. I should use xchg(&kexec_image, NULL), as the other kexec related
functions do.

> > +	pm_prepare_console();
> > +	suspend_console();
> > +	error = device_suspend(PMSG_FREEZE);
> > +	if (error)
> > +		goto Resume_console;
>
> This as everyone knows needs to be device_shutdown or a better hibernation
> replacement.

Yes.

> > +	error = disable_nonboot_cpus();
> > +	if (error)
> > +		goto Resume_devices;
>
> Can't we just catch the noboot cpu's in a mutex.
> disable_nonboot_cpus is actually impossible to implement 100% reliably
> with current hardware. But something smp_call_function so we trap them
> at a specific location and then the equivalent when we come back should
> be simple. I guess the tricky part is bringing the cpus back up again.
>
> Using the broken by design version of cpu hotplug really annoys me here.

I think this is not very simple, given that we may jump back from a
kernel with SMP turned off, or directly from the bootloader. But CPU
hotplug is another topic; I think it should be solved in another patch.
> > + local_irq_disable(); > > + /* At this point, device_suspend() has been called, but *not* > > + * device_power_down(). We *must* device_power_down() now. > > + * Otherwise, drivers for some devices (e.g. interrupt controllers) > > + * become desynchronized with the actual state of the hardware > > + * at resume time, and evil weirdness ensues. > > + */ > > + error = device_power_down(PMSG_FREEZE); > > + if (error) > > + goto Enable_irqs; > > This of course should go away when we have the proper methods. Yes. > > + save_processor_state(); > This line might even be reasonable. > > + error = machine_kexec_jump(kexec_image); > > + restore_processor_state(); > > > > + /* NOTE: device_power_up() is just a resume() for devices > > + * that suspended with irqs off ... no overall powerup. > > + */ > > + device_power_up(); > Yep this can go away. Yes. > > + Enable_irqs: > > + local_irq_enable(); > > + enable_nonboot_cpus(); > > I haven't looked at the cpu start up code yet to see if it > is generally implementable. I would think so, but I guess > we need to be careful with our data structures. > > > + Resume_devices: > > + device_resume(); > This of course should change. Yes. > > + Resume_console: > > + resume_console(); > > + pm_restore_console(); > > Odd. I'm a little surprised that the console is the last > thing we restore. But it does make sense to treat it specially. > > > + return error; > > +} > > +#endif /* CONFIG_KEXEC_JUMP */ Best Regards, Huang Ying ^ permalink raw reply [flat|nested] 42+ messages in thread
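The race noted in this exchange (checking kexec_image and then using it while sys_kexec_load may replace it) is normally closed by atomically taking the pointer out of the shared slot. The following is only a user-space sketch of that idea using C11 atomics in place of the kernel's xchg(); struct image, loaded_image, claim_image() and release_image() are hypothetical names for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kimage; only its identity matters here. */
struct image {
    int id;
};

/* Shared slot playing the role of the global kexec_image pointer. */
static _Atomic(struct image *) loaded_image;

/*
 * Atomically take ownership of the loaded image and leave NULL behind.
 * A concurrent caller sees NULL and backs off, so there is no window
 * between "test the pointer" and "use the pointer".
 */
static struct image *claim_image(void)
{
    return atomic_exchange(&loaded_image, NULL);
}

/* Hand the image back once the caller is done with it. */
static void release_image(struct image *img)
{
    atomic_store(&loaded_image, img);
}
```

A caller in the style of kexec_jump() would return -EINVAL when claim_image() yields NULL and call release_image() on every exit path.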
[parent not found: <200709212158.50538.nigel@nigel.suspend2.net>]
[parent not found: <200709211418.20358.rjw@sisk.pl>]
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
[not found] ` <200709211418.20358.rjw@sisk.pl>
@ 2007-09-21 12:15 ` Nigel Cunningham
0 siblings, 0 replies; 42+ messages in thread
From: Nigel Cunningham @ 2007-09-21 12:15 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman,
Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard

Hi.

On Friday 21 September 2007 22:18:19 Rafael J. Wysocki wrote:
> On Friday, 21 September 2007 13:58, Nigel Cunningham wrote:
> > Hi.
> >
> > On Friday 21 September 2007 21:56:29 Rafael J. Wysocki wrote:
> > > [Besides, the current hibernation userland interface is used by default by
> > > openSUSE and it's also used by quite some Debian users, so we can't drop
> > > it overnight and it can't be implemented in a compatible way on top of the
> > > kexec-based solution.]
> >
> > Could it be fudged by giving userland a null image and having (say) the first
> > ioctl be one that triggers all the real work (with other ioctls being noops
> > or such like, as appropriate)?
>
> Well, the "suspend" part is probably doable, but I'm afraid of the "resume"
> one.

'k. I've occasionally thought about trying it, but haven't ever gotten around
to actually doing it yet. (I'd like to make TuxOnIce transparently replace both
swsusp and uswsusp if I could).

Regards,

Nigel
* Re: [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
[not found] ` <200709212158.50538.nigel@nigel.suspend2.net>
[not found] ` <200709211418.20358.rjw@sisk.pl>
@ 2007-09-21 12:18 ` Rafael J. Wysocki
1 sibling, 0 replies; 42+ messages in thread
From: Rafael J. Wysocki @ 2007-09-21 12:18 UTC (permalink / raw)
To: Nigel Cunningham
Cc: nigel, Kexec Mailing List, linux-kernel, Eric W. Biederman,
Huang, Ying, Andrew Morton, linux-pm, Jeremy Maitin-Shepard

On Friday, 21 September 2007 13:58, Nigel Cunningham wrote:
> Hi.
>
> On Friday 21 September 2007 21:56:29 Rafael J. Wysocki wrote:
> > [Besides, the current hibernation userland interface is used by default by
> > openSUSE and it's also used by quite some Debian users, so we can't drop
> > it overnight and it can't be implemented in a compatible way on top of the
> > kexec-based solution.]
>
> Could it be fudged by giving userland a null image and having (say) the first
> ioctl be one that triggers all the real work (with other ioctls being noops
> or such like, as appropriate)?

Well, the "suspend" part is probably doable, but I'm afraid of the "resume"
one.

Greetings,
Rafael
* [RFC][PATCH 1/2 -mm] kexec based hibernation -v3: kexec jump
@ 2007-09-20 5:34 Huang, Ying
0 siblings, 0 replies; 42+ messages in thread
From: Huang, Ying @ 2007-09-20 5:34 UTC (permalink / raw)
To: Eric W. Biederman, Pavel Machek, nigel, Andrew Morton,
Jeremy Maitin-Shepard
Cc: linux-pm, Kexec Mailing List, linux-kernel

This patch implements jumping between the kexeced kernel and the
original kernel.

A new reboot command named LINUX_REBOOT_CMD_KEXEC_JUMP is defined to
trigger jumping to (executing) the new kernel and jumping back to
the original kernel.

To support jumping between two kernels, the devices are put into a
quiescent state (to be fully implemented) and the state of the devices
and CPU is saved before jumping to (executing) the new kernel and
before jumping back to the original kernel. After jumping back from
the kexeced kernel and after jumping to the new kernel, the state of
the devices and CPU is restored accordingly. The device/CPU state
save/restore code of software suspend is called to implement this.

To support jumping without preserving memory, one shadow backup page
is allocated for each page used by the new (kexeced) kernel. During
kexec_load, the image of the new kernel is loaded into the shadow
pages, and before executing, the original pages and the shadow pages
are swapped, so the contents of the original pages are backed up.
Before jumping to the new (kexeced) kernel and after jumping back to
the original kernel, the original pages and the shadow pages are
swapped too.

A jump back protocol is defined and documented.
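
The shadow-page swap described above can be illustrated in plain C. This is only a user-space sketch of the three-copy exchange that swap_pages performs in relocate_kernel.S for each page pair (scratch takes the source, the source takes the destination, the destination takes the scratch copy). The names SKETCH_PAGE_SIZE and swap_page_contents() are made up for this illustration and do not appear in the patch.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* i386 page size, as assumed by the patch */

/*
 * Exchange the contents of two pages through a third scratch page.
 * The copy order follows the patch's assembly loop:
 *   scratch <- src, src <- dst, dst <- scratch
 * Afterwards dst holds the new image page and src holds the backup
 * of what used to be in dst.
 */
static void swap_page_contents(unsigned char *dst, unsigned char *src,
                               unsigned char *scratch)
{
    memcpy(scratch, src, SKETCH_PAGE_SIZE);
    memcpy(src, dst, SKETCH_PAGE_SIZE);
    memcpy(dst, scratch, SKETCH_PAGE_SIZE);
}
```

Because the operation is a pure exchange, running it a second time over the same pairs restores the original layout, which is why one routine can serve both jump directions.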

Known issues

- A field is added to the Linux kernel real-mode header. This is
  temporary, and should be replaced after the 32-bit boot protocol and
  setup data patches are accepted.

- The suspend method of a device is used to put the device into a
  quiescent state. But if ACPI is enabled, this also puts devices into
  a low power state, which prevents the new kernel from booting. So
  ACPI must be disabled in both the original kernel and the kexeced
  kernel. This is planned to be resolved after the suspend and
  hibernate methods are separated for devices, as proposed earlier on
  LKML.

- The NX (no-execute) bit should be turned off for the control
  page if available.

ChangeLog

-- 2007/9/19 --

1. The two reboot commands are merged back into one again because the
   underlying implementation is the same.

2. Jumping without preserving memory is implemented. As a side effect,
   two-direction jumping is implemented.

3. A jump back protocol is defined and documented. The original kernel
   and the kexeced kernel are more independent of each other.

4. The CPU state save/restore code is merged into relocate_kernel.S.

-- 2007/8/24 --

1. The reboot command LINUX_REBOOT_CMD_KJUMP is split into two
   reboot commands to reflect the different functions.

2. Documentation is added for the new kernel parameters.

3. /sys/kernel/kexec_jump_buf_pfn is made writable; it is used for
   memory image restoring.

4. Console restoring after jumping back is implemented.

-- 2007/7/15 --

1. The kexec jump implementation is put into the kexec/kdump framework
   instead of the software suspend framework. The device and CPU state
   save/restore code of software suspend is called when needed.

2. The same code path is used for both kexec-ing a new kernel and
   jumping back to the original kernel.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
Documentation/i386/jump_back_protocol.txt | 81 ++++++++++++
arch/i386/Kconfig | 7 +
arch/i386/boot/header.S | 2
arch/i386/kernel/machine_kexec.c | 77 +++++++++---
arch/i386/kernel/relocate_kernel.S | 187 ++++++++++++++++++++++++++----
arch/i386/kernel/setup.c | 3
include/asm-i386/bootparam.h | 3
include/asm-i386/kexec.h | 48 ++++++-
include/linux/kexec.h | 9 +
include/linux/reboot.h | 2
kernel/kexec.c | 59 +++++++++
kernel/ksysfs.c | 17 ++
kernel/power/Kconfig | 2
kernel/sys.c | 8 +
14 files changed, 463 insertions(+), 42 deletions(-)
Index: linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c
===================================================================
--- linux-2.6.23-rc6.orig/arch/i386/kernel/machine_kexec.c 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/arch/i386/kernel/machine_kexec.c 2007-09-20 11:24:31.000000000 +0800
@@ -20,6 +20,7 @@
#include <asm/cpufeature.h>
#include <asm/desc.h>
#include <asm/system.h>
+#include <asm/setup.h>
#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
static u32 kexec_pgd[1024] PAGE_ALIGNED;
@@ -98,23 +99,23 @@
{
}
-/*
- * Do not allocate memory (or fail in any way) in machine_kexec().
- * We are past the point of no return, committed to rebooting now.
- */
-NORET_TYPE void machine_kexec(struct kimage *image)
+static NORET_TYPE void __machine_kexec(struct kimage *image,
+ void *control_page) ATTRIB_NORET;
+
+static NORET_TYPE void __machine_kexec(struct kimage *image,
+ void *control_page)
{
unsigned long page_list[PAGES_NR];
- void *control_page;
-
- /* Interrupts aren't acceptable while we reboot */
- local_irq_disable();
-
- control_page = page_address(image->control_code_page);
- memcpy(control_page, relocate_kernel, PAGE_SIZE);
+ asmlinkage NORET_TYPE void
+ (*relocate_kernel_ptr)(unsigned long indirection_page,
+ unsigned long control_page,
+ unsigned long start_address,
+ unsigned int has_pae) ATTRIB_NORET;
+ relocate_kernel_ptr = control_page +
+ ((void *)relocate_kernel - (void *)relocate_page);
page_list[PA_CONTROL_PAGE] = __pa(control_page);
- page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
+ page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
page_list[PA_PGD] = __pa(kexec_pgd);
page_list[VA_PGD] = (unsigned long)kexec_pgd;
#ifdef CONFIG_X86_PAE
@@ -127,6 +128,7 @@
page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
page_list[PA_PTE_1] = __pa(kexec_pte1);
page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
+ page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);
/* The segment registers are funny things, they have both a
* visible and an invisible part. Whenever the visible part is
@@ -145,8 +147,26 @@
set_idt(phys_to_virt(0),0);
/* now call it */
- relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
- image->start, cpu_has_pae);
+ relocate_kernel_ptr((unsigned long)image->head,
+ (unsigned long)page_list,
+ image->start, cpu_has_pae);
+}
+
+/*
+ * Do not allocate memory (or fail in any way) in machine_kexec().
+ * We are past the point of no return, committed to rebooting now.
+ */
+NORET_TYPE void machine_kexec(struct kimage *image)
+{
+ void *control_page;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+
+ control_page = page_address(image->control_code_page);
+ memcpy(control_page, relocate_page, PAGE_SIZE);
+
+ __machine_kexec(image, control_page);
}
/* crashkernel=size@addr specifies the location to reserve for
@@ -182,3 +202,30 @@
#endif
}
+#ifdef CONFIG_KEXEC_JUMP
+int machine_kexec_jump(struct kimage *image)
+{
+ void *control_page;
+ unsigned long pa_control_page;
+
+ control_page = page_address(image->control_code_page);
+ memcpy(control_page, relocate_page, PAGE_SIZE/2);
+ pa_control_page = __pa(control_page);
+ memcpy(control_page + 1, &pa_control_page, sizeof(pa_control_page));
+
+ KJUMP_MAGIC(control_page) = KJUMP_MAGIC_NUMBER;
+ KJUMP_VERSION(control_page) = KJUMP_VERSION_NUMBER;
+
+ if (!kexec_jump_save_cpu(control_page))
+ __machine_kexec(image, control_page);
+
+ kexec_jump_back_entry = KJUMP_ENTRY(control_page);
+ image->start = kexec_jump_back_entry;
+ return 0;
+}
+
+void __init parse_kexec_jump_back_entry(void)
+{
+ kexec_jump_back_entry = boot_params.hdr.jump_back_entry;
+}
+#endif /* CONFIG_KEXEC_JUMP */
Index: linux-2.6.23-rc6/include/asm-i386/kexec.h
===================================================================
--- linux-2.6.23-rc6.orig/include/asm-i386/kexec.h 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/include/asm-i386/kexec.h 2007-09-20 11:24:31.000000000 +0800
@@ -9,16 +9,42 @@
#define VA_PTE_0 5
#define PA_PTE_1 6
#define VA_PTE_1 7
+#define PA_SWAP_PAGE 8
#ifdef CONFIG_X86_PAE
-#define PA_PMD_0 8
-#define VA_PMD_0 9
-#define PA_PMD_1 10
-#define VA_PMD_1 11
-#define PAGES_NR 12
+#define PA_PMD_0 9
+#define VA_PMD_0 10
+#define PA_PMD_1 11
+#define VA_PMD_1 12
+#define PAGES_NR 13
#else
-#define PAGES_NR 8
+#define PAGES_NR 9
#endif
+#define KJUMP_DATA_BASE 0x800
+
+#define KJUMP_MAGIC_NUMBER 0x626a
+#define KJUMP_VERSION_NUMBER 0x0100
+
+#define KJUMP_DATA(buf) ((unsigned char *)(buf)+KJUMP_DATA_BASE)
+#define KJUMP_OFF(off) (KJUMP_DATA_BASE+(off))
+
+#define KJUMP_MAGIC_OFF KJUMP_OFF(0x0)
+#define KJUMP_MAGIC(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x0))
+#define KJUMP_VERSION(buf) (*(unsigned short *)(KJUMP_DATA(buf)+0x2))
+#define KJUMP_BACKUP_PAGES_MAP_OFF \
+ KJUMP_OFF(0x4)
+#define KJUMP_BACKUP_PAGES_MAP(buf) \
+ (*(unsigned long *)(KJUMP_DATA(buf)+0x4))
+
+/*
+ * The following are not a part of jump back protocol, for internal
+ * use only
+ */
+#define KJUMP_ENTRY_OFF KJUMP_OFF(0x20)
+#define KJUMP_ENTRY(buf) (*(unsigned long *)(KJUMP_DATA(buf)+0x20))
+/* Other internal data fields base */
+#define KJUMP_OTHER_OFF KJUMP_OFF(0x24)
+
#ifndef __ASSEMBLY__
#include <asm/ptrace.h>
@@ -94,6 +120,16 @@
unsigned long start_address,
unsigned int has_pae) ATTRIB_NORET;
+extern char relocate_page[PAGE_SIZE];
+
+extern asmlinkage int kexec_jump_save_cpu(void *buf);
+
+#ifdef CONFIG_KEXEC_JUMP
+void parse_kexec_jump_back_entry(void);
+#else
+static inline void parse_kexec_jump_back_entry(void) { }
+#endif
+
#endif /* __ASSEMBLY__ */
#endif /* _I386_KEXEC_H */
Index: linux-2.6.23-rc6/include/linux/kexec.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/kexec.h 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/include/linux/kexec.h 2007-09-20 11:26:03.000000000 +0800
@@ -83,6 +83,7 @@
unsigned long start;
struct page *control_code_page;
+ struct page *swap_page;
unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -194,4 +195,12 @@
static inline void crash_kexec(struct pt_regs *regs) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
#endif /* CONFIG_KEXEC */
+
+#ifdef CONFIG_KEXEC_JUMP
+extern int machine_kexec_jump(struct kimage *image);
+extern unsigned long kexec_jump_back_entry;
+extern int kexec_jump(void);
+#else /* !CONFIG_KEXEC_JUMP */
+static inline int kexec_jump(void) { return 0; }
+#endif /* CONFIG_KEXEC_JUMP */
#endif /* LINUX_KEXEC_H */
Index: linux-2.6.23-rc6/kernel/kexec.c
===================================================================
--- linux-2.6.23-rc6.orig/kernel/kexec.c 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/kernel/kexec.c 2007-09-20 11:24:31.000000000 +0800
@@ -24,6 +24,10 @@
#include <linux/utsrelease.h>
#include <linux/utsname.h>
#include <linux/numa.h>
+#include <linux/suspend.h>
+#include <linux/pm.h>
+#include <linux/cpu.h>
+#include <linux/console.h>
#include <asm/page.h>
#include <asm/uaccess.h>
@@ -243,6 +247,12 @@
goto out;
}
+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ printk(KERN_ERR "Could not allocate swap buffer\n");
+ goto out;
+ }
+
result = 0;
out:
if (result == 0)
@@ -1246,3 +1256,52 @@
}
module_init(crash_save_vmcoreinfo_init)
+
+#ifdef CONFIG_KEXEC_JUMP
+unsigned long kexec_jump_back_entry;
+
+int kexec_jump(void)
+{
+ int error;
+
+ if (!kexec_image)
+ return -EINVAL;
+
+ pm_prepare_console();
+ suspend_console();
+ error = device_suspend(PMSG_FREEZE);
+ if (error)
+ goto Resume_console;
+ error = disable_nonboot_cpus();
+ if (error)
+ goto Resume_devices;
+ local_irq_disable();
+ /* At this point, device_suspend() has been called, but *not*
+ * device_power_down(). We *must* device_power_down() now.
+ * Otherwise, drivers for some devices (e.g. interrupt controllers)
+ * become desynchronized with the actual state of the hardware
+ * at resume time, and evil weirdness ensues.
+ */
+ error = device_power_down(PMSG_FREEZE);
+ if (error)
+ goto Enable_irqs;
+
+ save_processor_state();
+ error = machine_kexec_jump(kexec_image);
+ restore_processor_state();
+
+ /* NOTE: device_power_up() is just a resume() for devices
+ * that suspended with irqs off ... no overall powerup.
+ */
+ device_power_up();
+ Enable_irqs:
+ local_irq_enable();
+ enable_nonboot_cpus();
+ Resume_devices:
+ device_resume();
+ Resume_console:
+ resume_console();
+ pm_restore_console();
+ return error;
+}
+#endif /* CONFIG_KEXEC_JUMP */
Index: linux-2.6.23-rc6/kernel/ksysfs.c
===================================================================
--- linux-2.6.23-rc6.orig/kernel/ksysfs.c 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/kernel/ksysfs.c 2007-09-20 11:24:31.000000000 +0800
@@ -69,6 +69,20 @@
}
KERNEL_ATTR_RO(vmcoreinfo);
+#ifdef CONFIG_KEXEC_JUMP
+static ssize_t kexec_jump_back_entry_show(struct kset *kset, char *page)
+{
+ return sprintf(page, "0x%lx\n", kexec_jump_back_entry);
+}
+static ssize_t kexec_jump_back_entry_store(struct kset *kset, const char *page,
+ size_t count)
+{
+ kexec_jump_back_entry = simple_strtoul(page, NULL, 0);
+ return count;
+}
+
+KERNEL_ATTR_RW(kexec_jump_back_entry);
+#endif /* CONFIG_KEXEC_JUMP */
#endif /* CONFIG_KEXEC */
/*
@@ -105,6 +119,9 @@
&kexec_loaded_attr.attr,
&kexec_crash_loaded_attr.attr,
&vmcoreinfo_attr.attr,
+#ifdef CONFIG_KEXEC_JUMP
+ &kexec_jump_back_entry_attr.attr,
+#endif
#endif
NULL
};
Index: linux-2.6.23-rc6/kernel/sys.c
===================================================================
--- linux-2.6.23-rc6.orig/kernel/sys.c 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/kernel/sys.c 2007-09-20 11:24:31.000000000 +0800
@@ -424,6 +424,14 @@
unlock_kernel();
return -EINVAL;
+ case LINUX_REBOOT_CMD_KEXEC_JUMP:
+ {
+ int ret;
+ ret = kexec_jump();
+ unlock_kernel();
+ return ret;
+ }
+
#ifdef CONFIG_HIBERNATION
case LINUX_REBOOT_CMD_SW_SUSPEND:
{
Index: linux-2.6.23-rc6/include/linux/reboot.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/reboot.h 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/include/linux/reboot.h 2007-09-20 11:24:31.000000000 +0800
@@ -23,6 +23,7 @@
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using software suspend if compiled in.
* KEXEC Restart system using a previously loaded Linux kernel
+ * KEXEC_JUMP Jump between original kernel and kexeced kernel.
*/
#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -33,6 +34,7 @@
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
#define LINUX_REBOOT_CMD_KEXEC 0x45584543
+#define LINUX_REBOOT_CMD_KEXEC_JUMP 0x3928A5FD
#ifdef __KERNEL__
Index: linux-2.6.23-rc6/arch/i386/Kconfig
===================================================================
--- linux-2.6.23-rc6.orig/arch/i386/Kconfig 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/arch/i386/Kconfig 2007-09-20 11:24:31.000000000 +0800
@@ -830,6 +830,13 @@
(CONFIG_RELOCATABLE=y).
For more details see Documentation/kdump/kdump.txt
+config KEXEC_JUMP
+ bool "kexec jump (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ depends on PM && X86_32 && KEXEC
+ ---help---
+ Jump between the kexeced kernel and the orignal kernel.
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
default "0x1000000" if X86_NUMAQ
Index: linux-2.6.23-rc6/kernel/power/Kconfig
===================================================================
--- linux-2.6.23-rc6.orig/kernel/power/Kconfig 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/kernel/power/Kconfig 2007-09-20 11:24:31.000000000 +0800
@@ -70,7 +70,7 @@
config PM_SLEEP
bool
- depends on SUSPEND || HIBERNATION
+ depends on SUSPEND || HIBERNATION || KEXEC_JUMP
default y
config SUSPEND_UP_POSSIBLE
Index: linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S
===================================================================
--- linux-2.6.23-rc6.orig/arch/i386/kernel/relocate_kernel.S 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/arch/i386/kernel/relocate_kernel.S 2007-09-20 11:24:31.000000000 +0800
@@ -19,8 +19,87 @@
#define PAGE_ATTR 0x63 /* _PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_DIRTY */
#define PAE_PGD_ATTR 0x01 /* _PAGE_PRESENT */
+#define STACK_TOP 0x1000
+
+#define DATA(offset) (KJUMP_OTHER_OFF+(offset))
+
+/* Minimal CPU stat */
+#define EBX DATA(0x0)
+#define ESI DATA(0x4)
+#define EDI DATA(0x8)
+#define EBP DATA(0xc)
+#define ESP DATA(0x10)
+#define CR0 DATA(0x14)
+#define CR3 DATA(0x18)
+#define CR4 DATA(0x1c)
+#define FLAG DATA(0x20)
+#define RET DATA(0x24)
+
+/* some information saved in control page (CP) for jumping back */
+#define CP_VA_CONTROL_PAGE DATA(0x30)
+#define CP_PA_PGD DATA(0x34)
+#define CP_PA_SWAP_PAGE DATA(0x38)
+
.text
.align PAGE_ALIGNED
+ .globl relocate_page
+relocate_page:
+
+/*
+ * Entry point for jumping back from kexeced kernel, the paging is
+ * turned off, the information needed is at relocate_page +
+ * PAGE_SIZE/2
+ */
+kexec_jump_back_entry:
+ movl $relocate_page, %ebx
+ movl %edi, KJUMP_ENTRY_OFF(%ebx)
+ movl CP_VA_CONTROL_PAGE(%ebx), %edi
+
+ lea STACK_TOP(%ebx), %esp
+
+ movl CP_PA_SWAP_PAGE(%ebx), %eax
+ movl KJUMP_BACKUP_PAGES_MAP_OFF(%ebx), %edx
+ pushl %eax
+ pushl %edx
+ call swap_pages
+ addl $8, %esp
+
+ movl CP_PA_PGD(%ebx), %eax
+ movl %eax, %cr3
+
+ movl %cr0, %eax
+ orl $(1<<31), %eax
+ movl %eax, %cr0
+
+ movl %edi, %esp
+ addl $STACK_TOP, %esp
+
+ movl %edi, %eax
+ addl $(virtual_mapped - relocate_page), %eax
+ pushl %eax
+ ret
+
+virtual_mapped:
+ movl %edi, %edx
+ movl EBX(%edx), %ebx
+ movl ESI(%edx), %esi
+ movl EDI(%edx), %edi
+ movl EBP(%edx), %ebp
+ movl FLAG(%edx), %eax
+ pushl %eax
+ popf
+ movl ESP(%edx), %esp
+ movl CR4(%edx), %eax
+ movl %eax, %cr4
+ movl CR3(%edx), %eax
+ movl %eax, %cr3
+ movl CR0(%edx), %eax
+ movl %eax, %cr0
+ movl RET(%edx), %eax
+ movl %eax, (%esp)
+ mov $1, %eax
+ ret
+
.globl relocate_kernel
relocate_kernel:
movl 8(%esp), %ebp /* list of pages */
@@ -146,6 +225,15 @@
pushl $0
popfl
+ /* save some information for jumping back */
+ movl PTR(VA_CONTROL_PAGE)(%ebp), %edi
+ movl %edi, CP_VA_CONTROL_PAGE(%edi)
+ movl PTR(PA_PGD)(%ebp), %eax
+ movl %eax, CP_PA_PGD(%edi)
+ movl PTR(PA_SWAP_PAGE)(%ebp), %eax
+ movl %eax, CP_PA_SWAP_PAGE(%edi)
+ movl %ebx, KJUMP_BACKUP_PAGES_MAP_OFF(%edi)
+
/* get physical address of control page now */
/* this is impossible after page table switch */
movl PTR(PA_CONTROL_PAGE)(%ebp), %edi
@@ -155,11 +243,11 @@
movl %eax, %cr3
/* setup a new stack at the end of the physical control page */
- lea 4096(%edi), %esp
+ lea STACK_TOP(%edi), %esp
/* jump to identity mapped page */
movl %edi, %eax
- addl $(identity_mapped - relocate_kernel), %eax
+ addl $(identity_mapped - relocate_page), %eax
pushl %eax
ret
@@ -197,8 +285,44 @@
xorl %eax, %eax
movl %eax, %cr3
+ movl CP_PA_SWAP_PAGE(%edi), %eax
+ pushl %eax
+ pushl %ebx
+ call swap_pages
+ addl $8, %esp
+
+ /* To be certain of avoiding problems with self-modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ movw KJUMP_MAGIC_OFF(%edi), %ax
+ cmpw $KJUMP_MAGIC_NUMBER, %ax
+ jz 1f
+ xorl %edi, %edi
+1:
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %ebp, %ebp
+ ret
+
/* Do the copies */
- movl %ebx, %ecx
+swap_pages:
+ movl 8(%esp), %edx
+ movl 4(%esp), %ecx
+ pushl %ebp
+ pushl %ebx
+ pushl %edi
+ pushl %esi
+ movl %ecx, %ebx
jmp 1f
0: /* top, read another word from the indirection page */
@@ -226,27 +350,50 @@
movl %ecx, %esi /* For every source page do a copy */
andl $0xfffff000, %esi
+ movl %edi, %eax
+ movl %esi, %ebp
+
+ movl %edx, %edi
movl $1024, %ecx
rep ; movsl
- jmp 0b
-3:
+ movl %ebp, %edi
+ movl %eax, %esi
+ movl $1024, %ecx
+ rep ; movsl
- /* To be certain of avoiding problems with self-modifying code
- * I need to execute a serializing instruction here.
- * So I flush the TLB, it's handy, and not processor dependent.
- */
- xorl %eax, %eax
- movl %eax, %cr3
+ movl %eax, %edi
+ movl %edx, %esi
+ movl $1024, %ecx
+ rep ; movsl
- /* set all of the registers to known values */
- /* leave %esp alone */
+ lea 4096(%ebp), %esi
+ jmp 0b
+3:
+ popl %esi
+ popl %edi
+ popl %ebx
+ popl %ebp
+ ret
- xorl %eax, %eax
- xorl %ebx, %ebx
- xorl %ecx, %ecx
- xorl %edx, %edx
- xorl %esi, %esi
- xorl %edi, %edi
- xorl %ebp, %ebp
+ .globl kexec_jump_save_cpu
+kexec_jump_save_cpu:
+ movl 4(%esp), %edx
+ movl %ebx, EBX(%edx)
+ movl %esi, ESI(%edx)
+ movl %edi, EDI(%edx)
+ movl %ebp, EBP(%edx)
+ movl %esp, ESP(%edx)
+ movl %cr0, %eax
+ movl %eax, CR0(%edx)
+ movl %cr3, %eax
+ movl %eax, CR3(%edx)
+ movl %cr4, %eax
+ movl %eax, CR4(%edx)
+ pushf
+ popl %eax
+ movl %eax, FLAG(%edx)
+ movl (%esp), %eax
+ movl %eax, RET(%edx)
+ mov $0, %eax
ret
Index: linux-2.6.23-rc6/include/asm-i386/bootparam.h
===================================================================
--- linux-2.6.23-rc6.orig/include/asm-i386/bootparam.h 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/include/asm-i386/bootparam.h 2007-09-20 11:24:31.000000000 +0800
@@ -41,6 +41,9 @@
u32 initrd_addr_max;
u32 kernel_alignment;
u8 relocatable_kernel;
+ u8 _pad2[3];
+ u32 cmdline_size;
+ u32 jump_back_entry;
} __attribute__((packed));
struct sys_desc_table {
Index: linux-2.6.23-rc6/arch/i386/boot/header.S
===================================================================
--- linux-2.6.23-rc6.orig/arch/i386/boot/header.S 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/arch/i386/boot/header.S 2007-09-20 11:24:31.000000000 +0800
@@ -214,6 +214,8 @@
#added with boot protocol
#version 2.06
+jump_back_entry: .long 0 #jump back entry point
+
# End of setup header #####################################################
.section ".inittext", "ax"
Index: linux-2.6.23-rc6/arch/i386/kernel/setup.c
===================================================================
--- linux-2.6.23-rc6.orig/arch/i386/kernel/setup.c 2007-09-20 11:24:25.000000000 +0800
+++ linux-2.6.23-rc6/arch/i386/kernel/setup.c 2007-09-20 13:14:19.000000000 +0800
@@ -60,6 +60,7 @@
#include <asm/ist.h>
#include <asm/io.h>
#include <asm/vmi.h>
+#include <asm/kexec.h>
#include <setup_arch.h>
#include <bios_ebda.h>
@@ -566,6 +567,8 @@
data_resource.start = virt_to_phys(_etext);
data_resource.end = virt_to_phys(_edata)-1;
+ parse_kexec_jump_back_entry();
+
parse_early_param();
if (user_defined_memmap) {
Index: linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.23-rc6/Documentation/i386/jump_back_protocol.txt 2007-09-20 11:24:31.000000000 +0800
@@ -0,0 +1,81 @@
+ THE LINUX/I386 JUMP BACK PROTOCOL
+ ---------------------------------
+
+ Huang Ying <ying.huang@intel.com>
+ Last update 2007-09-19
+
+Currently, the following versions of the jump back protocol exist.
+
+Protocol 1.00: Jumping between original kernel and kexeced kernel
+ support.
+
+
+**** LOAD THE JUMP BACK IMAGE
+
+Jump back image is an ordinary ELF 64 executable file, it can be
+loaded just as other ELF64 image. That is, the PT_LOAD segments should
+be loaded into their physical address.
+
+Before loading all segments of jump back image, the jump back header
+should be checked. Jump back header can be loaded from the 4K page at
+the jump back entry in jump back image.
+
+The header looks like:
+
+Offset Proto Name Meaning
+/Size
+
+C00/2 1.00+ magic Magic number: 0x626A
+C02/2 1.00+ version Jump back protocol version
+C04/4 1.00+ backup_pages_map Map from target page to backup page
+
+Note: unlike ordinary ELF 64 file, the jump back image may occupy most
+memory pages, so it is important for loader to verify there is no
+conflict between pages of loaded image and pages used by loader
+itself.
+
+
+**** DETAILS OF JUMP BACK HEADER
+
+For each field, some are information from the jump back image to
+loader ("read"), some are expected to be filled out by the loader
+("write"), and some are expected to be read and modified by the loader
+("modify").
+
+All general purpose boot loaders should write the fields marked
+(obligatory).
+
+The byte order of all fields is little endian.
+
+Field name: magic
+Type: read
+Offset/size: 0xc00/2
+Protocol: 1.00+
+
+ Contains the magic number "jb" (0x626A)
+
+Field name: version
+Type: read
+Offset/size: 0xc02/2
+Protocol: 1.00+
+
+ Contains the version number, in (major << 8)+minor format,
+ e.g. 0x0100 for version 1.00.
+
+Field name: backup_pages_map
+Type: read
+Offset/size: 0xc04/4
+Protocol: 1.00+
+
+ The map from target address to backup address, it is kimage->head in
+ fact.
+ TODO: detailed description
+
+
+**** JUMP BACK TO THE JUMP BACK IMAGE
+
+To jump back to the jump back image, just jump to the jump back
+entry. At entry, the CPU must be in 32-bit protected mode with paging
+disabled; the CS, DS and SS must be 4G flat segments; if jumping back
+to loader is supported, %edi should be the jump back entry of loader,
+otherwise it should be zero.
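
As an illustration of the header check the protocol text asks a loader to perform, here is a small user-space sketch that decodes the three documented fields from a 4K page. It is an assumption-laden example, not code from the patch: struct jb_header, jb_parse_header() and the helper names are hypothetical. Note that the protocol text lists the header at offset 0xC00, while KJUMP_DATA_BASE in include/asm-i386/kexec.h is 0x800, so the sketch takes the base offset as a parameter rather than choosing one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JB_MAGIC 0x626a /* "jb", same value as KJUMP_MAGIC_NUMBER */

/* Decoded form of the three documented header fields. */
struct jb_header {
    uint16_t magic;
    uint16_t version; /* (major << 8) + minor */
    uint32_t backup_pages_map;
};

/* Read little-endian values byte by byte so host endianness is irrelevant. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the header located at byte offset `base` inside a 4K page.
 * Returns 0 on success, -1 if the magic number does not match.
 */
static int jb_parse_header(const unsigned char *page, size_t base,
                           struct jb_header *out)
{
    out->magic = read_le16(page + base + 0x0);
    out->version = read_le16(page + base + 0x2);
    out->backup_pages_map = read_le32(page + base + 0x4);
    return out->magic == JB_MAGIC ? 0 : -1;
}

/* Split the version word as described for the "version" field. */
static unsigned jb_major(const struct jb_header *h) { return h->version >> 8; }
static unsigned jb_minor(const struct jb_header *h) { return h->version & 0xff; }
```

A loader would run this check before copying any PT_LOAD segment and refuse images whose magic or major version it does not recognise.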