* Re: ktime_get_ts64() splat during resume
[not found] ` <20160617132943.GA3336@nazgul.tnic>
@ 2016-06-17 14:33 ` Borislav Petkov
2016-06-17 15:28 ` Rafael J. Wysocki
0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2016-06-17 14:33 UTC (permalink / raw)
To: Thomas Gleixner, Rafael J. Wysocki
Cc: Ingo Molnar, Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
Rafael J. Wysocki, Kees Cook, stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
Linux PM list, Stephen Smalley
Ok,
bisect is done, full log below.
Rafael, that fix
70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
breaks s2disk here. It explodes during resume and a statically allocated
struct's member is NULL. See
https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
for the splat and some debugging attempts.
Reverting 70595b479ce1 fixes the issue here.
@stable folk: You might want to hold off on picking up that one yet...
Now, I have serial connected to that box so if you want me to dump stuff
and try patches, let me know.
Btw, while I have your ear, can you pick up my einj fixes please?
:-)
-> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email-bp@alien8.de
Thanks!
git bisect start
# bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 'tip/master' into tip-timers-splat
git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb
# good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
git bisect good bb967271c0317de8eed56b20aae978d31507b033
# good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core'
git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f
# bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm'
git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698
# bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot'
git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17
# good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time checks to the siginfo compat code
git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8
# good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core'
git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1
# good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm'
git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b
# bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel
git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c
# good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the Calgary IOMMU entry
git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582
# first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: ktime_get_ts64() splat during resume
2016-06-17 14:33 ` ktime_get_ts64() splat during resume Borislav Petkov
@ 2016-06-17 15:28 ` Rafael J. Wysocki
2016-06-17 16:12 ` Borislav Petkov
2016-06-20 8:17 ` chenyu
0 siblings, 2 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-17 15:28 UTC (permalink / raw)
To: Borislav Petkov
Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley
On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
> Ok,
>
> bisect is done, full log below.
>
> Rafael, that fix
>
> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
>
> breaks s2disk here. It explodes during resume and a statically allocated
> struct's member is NULL. See
>
> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
>
> for the splat and some debugging attempts.
>
> Reverting 70595b479ce1 fixes the issue here.
Quite evidently, memory is corrupted in the image kernel, but this
particular commit only affects the boot kernel, so it can't really
corrupt anything in the image one.
A couple of questions:
- I guess this is reproducible 100% of the time?
- If you do "echo disk > /sys/power/state" instead of using s2disk,
does it still crash in the same way?
- Are both the image and boot kernels the same binary?
> @stable folk: You might want to hold off on picking up that one yet...
>
>
> Now, I have serial connected to that box so if you want me to dump stuff
> and try patches, let me know.
>
> Btw, while I have your ear, can you pick up my einj fixes please?
>
> :-)
>
> -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email-bp@alien8.de
Sure. I thought Tony would pick them up.
* Re: ktime_get_ts64() splat during resume
2016-06-17 15:28 ` Rafael J. Wysocki
@ 2016-06-17 16:12 ` Borislav Petkov
2016-06-17 21:03 ` Rafael J. Wysocki
2016-06-20 8:17 ` chenyu
1 sibling, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2016-06-17 16:12 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley
On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> A couple of questions:
> - I guess this is reproducible 100% of the time?
Yap.
I took latest Linus + tip/master which has your commit.
> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> does it still crash in the same way?
My suspend to disk script does:
echo 3 > /proc/sys/vm/drop_caches
echo "shutdown" > /sys/power/disk
echo "disk" > /sys/power/state
I haven't used anything else for years now.
> - Are both the image and boot kernels the same binary?
Yep.
> Sure. I thought Tony would pick them up.
Oh ok, next time I'll pester him.
:-)
But seriously, should we route the RAS-relevant stuff touching
drivers/acpi/apei/ through our tree instead? Provided you're fine with
them?
This way we'll take some of the burden off you...
Thanks!
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
* Re: ktime_get_ts64() splat during resume
2016-06-17 16:12 ` Borislav Petkov
@ 2016-06-17 21:03 ` Rafael J. Wysocki
2016-06-18 1:11 ` Rafael J. Wysocki
2016-06-20 14:38 ` Rafael J. Wysocki
0 siblings, 2 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-17 21:03 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rafael J. Wysocki, Thomas Gleixner, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
Linux PM list, Stephen Smalley
On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>> A couple of questions:
>> - I guess this is reproducible 100% of the time?
>
> Yap.
>
> I took latest Linus + tip/master which has your commit.
>
>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>> does it still crash in the same way?
>
> My suspend to disk script does:
>
> echo 3 > /proc/sys/vm/drop_caches
> echo "shutdown" > /sys/power/disk
> echo "disk" > /sys/power/state
>
> I haven't used anything else for years now.
>
>> - Are both the image and boot kernels the same binary?
>
> Yep.
OK, we need to find out what's wrong, then.
First, please revert the changes in hibernate_asm_64.S alone and see
if that makes any difference.
Hibernation should still work then most of the time, but the bug fixed
by this commit will be back.
* Re: ktime_get_ts64() splat during resume
2016-06-17 21:03 ` Rafael J. Wysocki
@ 2016-06-18 1:11 ` Rafael J. Wysocki
2016-06-20 14:38 ` Rafael J. Wysocki
1 sibling, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-18 1:11 UTC (permalink / raw)
To: Borislav Petkov
Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley
On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
>> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>>> A couple of questions:
>>> - I guess this is reproducible 100% of the time?
>>
>> Yap.
>>
>> I took latest Linus + tip/master which has your commit.
>>
>>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>>> does it still crash in the same way?
>>
>> My suspend to disk script does:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>> echo "shutdown" > /sys/power/disk
>> echo "disk" > /sys/power/state
>>
>> I haven't used anything else for years now.
>>
>>> - Are both the image and boot kernels the same binary?
>>
>> Yep.
>
> OK, we need to find out what's wrong, then.
>
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.
Due to the nature of the memory corruption you are seeing (the same
address appears to be corrupted every time in the same way) with 100%
reproducibility and due to the fact that new code added by the commit
in question only writes to dynamically allocated memory (and you've
already verified that https://patchwork.kernel.org/patch/9185165/
doesn't help), it is quite unlikely that the memory corruption comes
from that commit itself.
However, I see a couple of ways in which that commit might uncover a latent bug.
First, it changed the layout of the kernel text by adding the
PAGE_SIZE alignment of restore_registers(). That likely pushed stuff
behind it to new offsets, possibly including the static struct field
that is now corrupted. Now, say that the memory corruption has always
happened at the same memory location, but previously there was nothing
in there or whatever was in there, wasn't vital. In that case the
memory corruption might have gone unnoticed until the commit in
question caused things to move to new locations and the corrupted
location contains a vital piece of data now. This is my current
theory.
Second, it added two invocations of get_safe_page() that in theory
might push things a bit too far towards the limit and they started to
break there. I don't see how that can happen ATM, but I'm not
excluding this possibility yet. It seems, though, that in that case
the corruption would be more random and I certainly wouldn't expect it
to happen at the same location every time.
One more indicator is that multiple people reported success with that
commit over many hibernation runs, so the problem appears to be very
specific to your system and/or kernel configuration. It also is
interesting that the memory corruption only becomes visible during the
thawing of tasks and given the piece of data that is corrupted, it
should become visible much earlier if the memory was corrupted during
image restoration by the boot kernel.
In any case, reverting the changes in hibernate_asm_64.S alone should
show us the direction, but if it makes things work again, I would try
to change the restore_registers() alignment to something smaller, like
64 (which should be safe) and see what happens then.
* Re: ktime_get_ts64() splat during resume
2016-06-17 15:28 ` Rafael J. Wysocki
2016-06-17 16:12 ` Borislav Petkov
@ 2016-06-20 8:17 ` chenyu
2016-06-20 12:21 ` Rafael J. Wysocki
1 sibling, 1 reply; 16+ messages in thread
From: chenyu @ 2016-06-20 8:17 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Borislav Petkov, Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar,
Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
Linux PM list, Stephen Smalley
On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
>> Ok,
>>
>> bisect is done, full log below.
>>
>> Rafael, that fix
>>
>> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
>>
>> breaks s2disk here. It explodes during resume and a statically allocated
>> struct's member is NULL. See
>>
>> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
>>
>> for the splat and some debugging attempts.
>>
>> Reverting 70595b479ce1 fixes the issue here.
>
> Quite evidently, memory is corrupted in the image kernel, but this
> particular commit only affects the boot kernel, so it can't really
> corrupt anything in the image one.
>
In the previous patch, before we jump to the new kernel entry, we add the
text mapping to temp_level4_pgt:
/* switch over to the temporary kernel text mapping */
movq %r8, (%r9)
If I understand correctly, r9 contains the virtual address of
restore_pgd_addr. Since the page table for restore_pgd_addr might be
incoherent across hibernation (the NX patch changes the kernel text mapping
to be set up dynamically), we might write the pmd entry to an incorrect
place in temp_level4_pgt?
Yu
* Re: ktime_get_ts64() splat during resume
2016-06-20 8:17 ` chenyu
@ 2016-06-20 12:21 ` Rafael J. Wysocki
0 siblings, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-20 12:21 UTC (permalink / raw)
To: chenyu
Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
Linux PM list, Stephen Smalley
On Monday, June 20, 2016 04:17:13 PM chenyu wrote:
> On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
> >> Ok,
> >>
> >> bisect is done, full log below.
> >>
> >> Rafael, that fix
> >>
> >> 70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
> >>
> >> breaks s2disk here. It explodes during resume and a statically allocated
> >> struct's member is NULL. See
> >>
> >> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
> >>
> >> for the splat and some debugging attempts.
> >>
> >> Reverting 70595b479ce1 fixes the issue here.
> >
> > Quite evidently, memory is corrupted in the image kernel, but this
> > particular commit only affects the boot kernel, so it can't really
> > corrupt anything in the image one.
> >
> In the previous patch, before we jump to the new kernel entry, we add the
> text mapping to temp_level4_pgt:
>
> /* switch over to the temporary kernel text mapping */
> movq %r8, (%r9)
>
> If I understand correctly, r9 contains the virtual address of
> restore_pgd_addr. Since the page table for restore_pgd_addr might be
> incoherent across hibernation (the NX patch changes the kernel text mapping
> to be set up dynamically), we might write the pmd entry to an incorrect
> place in temp_level4_pgt?
I'm not sure what you mean.
r9 contains the address previously stored in restore_pgd_addr. This is the
address of the cell in (the page pointed to by temp_level4_pgt) to write the
new pgd entry to.
Why do you think it may be incorrect?
In any case, the corruption reported by Boris also happens if all of the
assembly changes in commit 70595b479ce1 are reverted, in which case the code
should work as before that commit modulo some additional actions that
shouldn't matter. Those additional actions turn out to matter, though.
I'll write about that in detail shortly.
Thanks,
Rafael
* Re: ktime_get_ts64() splat during resume
2016-06-17 21:03 ` Rafael J. Wysocki
2016-06-18 1:11 ` Rafael J. Wysocki
@ 2016-06-20 14:38 ` Rafael J. Wysocki
2016-06-20 18:29 ` Linus Torvalds
1 sibling, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-20 14:38 UTC (permalink / raw)
To: Borislav Petkov, Logan Gunthorpe, Kees Cook
Cc: Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
Linux PM list, Stephen Smalley
On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> >> A couple of questions:
> >> - I guess this is reproducible 100% of the time?
> >
> > Yap.
> >
> > I took latest Linus + tip/master which has your commit.
> >
> >> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> >> does it still crash in the same way?
> >
> > My suspend to disk script does:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> > echo "shutdown" > /sys/power/disk
> > echo "disk" > /sys/power/state
> >
> > I haven't used anything else for years now.
> >
> >> - Are both the image and boot kernels the same binary?
> >
> > Yep.
>
> OK, we need to find out what's wrong, then.
Boris is traveling this week, so the investigation will continue when he's
back, but we spent quite some time on this during the weekend, so below is
a summary of what we found plus some comments.
Overall, we seem to be heading towards the "really weird" territory here.
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.
First of all, reverting the changes in hibernate_asm_64.S doesn't make the
breakage go away, which means that the most essential changes in
"x86/power/64: Fix crash whan the hibernation code passes control to the image
kernel" don't make a difference.
Moreover, after some back-and-forth we narrowed down the memory corruption to
a single line in the (new) prepare_temporary_text_mapping() function and it
turned out that it is triggered by scribbling on a page returned by
get_safe_page(), immediately after obtaining it. Memory is corrupted no matter
how the page is written to. In particular, filling that page with all ones
triggers memory corruption later, for example.
So appended is the last tested patch sufficient to trigger the breakage on
Boris' system (modulo some comments). Quite evidently, the breakage is
triggered by the memset() line in prepare_temporary_text_mapping().
In addition to the above, there are some interesting observations from the
investigation so far.
First, what is corrupted is the image kernel memory. This means that the line
in question scribbles on image data (which are stored in memory at that point)
somewhere. Nevertheless, the image kernel is evidently able to continue after
it has received control and carry out the full device resume up to the point
in which user space is thawed and then it crashes as a result of page tables
corruption. How those page tables are corrupted depends on what is written to
the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course,
that write happens in the "boot" or "restore" kernel, before jumping to the
image one).
The above is what we know and now conclusions that it seems to lead to.
Generally, I see two possible scenarios here.
One is that the changes not directly related to prepare_temporary_text_mapping()
in the patch below, e.g. the changes in arch_hibernation_header_save/restore()
(and related), make a difference, but that doesn't look likely to me. In any
case, that will be the first thing to try next, as it is relatively easy.
The second scenario is fundamentally less attractive, because it means that the
address we end up with in pmd in prepare_temporary_text_mapping() is bogus.
My first reaction to that option would be "Well, what if get_safe_page() is
broken somehow?", but that's not so simple. Namely, for the observed corruption
to happen, get_safe_page() would need to return a page that (1) had already
been allocated once before and (2) such that image data had been written to
it already. Still, the hibernation core only writes image data to pages that
have been explicitly allocated by it and it never frees them individually
(in particular, they are never freed before jumping to the image kernel at
all). It sets two bits in two different bitmaps for every page allocated by it
and checks those two bits every time before writing image data to any page.
One may think "Well, maybe the bitmaps don't work correctly then and the bits
are set when they shouldn't be sometimes?", but the problem with that line of
reasoning is that get_safe_page() checks one of those very bits before returning
a page and the page is only returned if that bit is clear. Thus, even if the
bits were set when they shouldn't be sometimes, get_safe_page() would still
not return such a page to the caller.
The next question to ask might be "What if, horror shock, get_zeroed_page()
called by get_safe_page() returns a page that has been allocated already?", but
then the bitmap check mentioned above would still prevent get_safe_page() from
returning that page (the bit in question would be set for it).
Hence, the only way get_safe_page() might return a wrong page in
prepare_temporary_text_mapping() would be if both get_zeroed_page() and the
bitmap failed simultaneously to produce the bogus result.
Failures of one of the two individually are not observed, though, let alone a
combined failure leading to exactly the worst outcome.
And there are more questions, like for example, if get_safe_page() returns a
wrong page in that particular place in prepare_temporary_text_mapping(), why
doesn't it return wrong pages anywhere else? There are at least several
invocations of it in set_up_temporary_mappings() and swsusp_arch_resume()
already and they don't lead to any kind of memory corruption, so why would
this particular one do that?
I'm honestly unsure about where that leads us, but it starts to look totally
weird to me.
Thanks,
Rafael
---
arch/x86/power/hibernate_64.c | 45 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 41 insertions(+), 4 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
* Address to jump to in the last phase of restore in order to get to the image
* kernel's text (this value is passed in the image header).
*/
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -37,8 +38,37 @@ unsigned long restore_cr3 __visible;
pgd_t *temp_level4_pgt __visible;
+void *restore_pgd_addr __visible;
+pgd_t restore_pgd __visible;
+
void *relocated_restore_code __visible;
+static int prepare_temporary_text_mapping(void)
+{
+ unsigned long vaddr = (unsigned long)restore_jump_address;
+ unsigned long paddr = jump_address_phys & PMD_MASK;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ restore_pgd = __pgd(__pa(pud) | _KERNPG_TABLE);
+
+ pud += pud_index(vaddr);
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+
+ memset(pmd, 0xff, PAGE_SIZE); /* CORRUPTION HERE! */
+
+ restore_pgd_addr = temp_level4_pgt + pgd_index(vaddr);
+ return 0;
+}
+
static void *alloc_pgt_page(void *context)
{
return (void *)get_safe_page(GFP_ATOMIC);
@@ -63,6 +93,10 @@ static int set_up_temporary_mappings(voi
set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ result = prepare_temporary_text_mapping();
+ if (result)
+ return result;
+
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
mstart = pfn_mapped[i].start << PAGE_SHIFT;
@@ -108,12 +142,13 @@ int pfn_is_nosave(unsigned long pfn)
}
struct restore_data_record {
- unsigned long jump_address;
+ void *jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +161,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = &restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +178,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
* Re: ktime_get_ts64() splat during resume
2016-06-20 14:38 ` Rafael J. Wysocki
@ 2016-06-20 18:29 ` Linus Torvalds
2016-06-20 21:15 ` Rafael J. Wysocki
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2016-06-20 18:29 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Borislav Petkov, Logan Gunthorpe, Kees Cook, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml, John Stultz,
Rafael J. Wysocki, Stable, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> Overall, we seem to be heading towards the "really weird" territory here.
So the whole commit that Boris bisected down to is weird to me.
Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?
Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?
All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?
I suspect whatever corruption happens boils down to the same issue
that led people to make that odd decision in the first place.
And regardless, those games are too ugly to live. So I would suggest
that that original commit should just be considered broken, and
reverted (or just removed - I'm not sure if it's in a stable branch or
not) and the fix be rethought so that the code mapping can be done
once and for all and *without* the extra games.
Linus
* Re: ktime_get_ts64() splat during resume
2016-06-20 18:29 ` Linus Torvalds
@ 2016-06-20 21:15 ` Rafael J. Wysocki
2016-06-21 0:05 ` Rafael J. Wysocki
0 siblings, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-20 21:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rafael J. Wysocki, Borislav Petkov, Logan Gunthorpe, Kees Cook,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
Stephen Smalley
On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>
>> Overall, we seem to be heading towards the "really weird" territory here.
>
> So the whole commit that Boris bisected down to is weird to me.
>
> Why isn't the temporary text mapping just set up unconditionally in
> the temp_level4_pgt?
>
> Why does it have that insane "let's leave the temp_level4_pgt alone
> until we actually switch to it, and save away restore_pgd_addr and the
> restore_pgd, to then be set up at restore time"?
>
> All the other temporary mappings are set up statically in the
> temp_level4_pgt, why not that one?
The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on. That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.
So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.
For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done. I'll try doing that.
> I suspect whatever corruption happens boils down to the same issue
> that made people do that odd decision in the first place.
The breakage appears to happen regardless of these changes, though.
> And regardless, those games are too ugly to live. So I would suggest
> that that original commit should just be considered broken, and
> reverted (or just removed - I'm not sure if it's in a stable branch or
> not) and the fix be rethought so that the code mapping can be done
> once and for all and *without* the extra games.
OK
Thanks,
Rafael
* Re: ktime_get_ts64() splat during resume
2016-06-20 21:15 ` Rafael J. Wysocki
@ 2016-06-21 0:05 ` Rafael J. Wysocki
2016-06-21 1:22 ` Rafael J. Wysocki
0 siblings, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 0:05 UTC (permalink / raw)
To: Linus Torvalds, Logan Gunthorpe, Kees Cook
Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >>
> >> Overall, we seem to be heading towards the "really weird" territory here.
> >
> > So the whole commit that Boris bisected down to is weird to me.
> >
> > Why isn't the temporary text mapping just set up unconditionally in
> > the temp_level4_pgt?
> >
> > Why does it have that insane "let's leave the temp_level4_pgt alone
> > until we actually switch to it, and save away restore_pgd_addr and the
> > restore_pgd, to then be set up at restore time"?
> >
> > All the other temporary mappings are set up statically in the
> > temp_level4_pgt, why not that one?
>
> The text mapping in temp_level4_pgt has to map the image kernel's
> physical entry address to the same virtual address that the image
> kernel had for it, because the image kernel will switch over to its
> own page tables first and it will use its own kernel text mapping from
> that point on. That may not be the same as the text mapping of the
> (currently running) restore (or "boot") kernel.
>
> So if we set up the text mapping in temp_level4_pgt upfront, we
> basically can't reference the original kernel text (or do any
> addressing relative to it) any more after switching over to
> temp_level4_pgt.
>
> For some reason I thought that was not doable, but now that I look at
> the code it looks like it can be done. I'll try doing that.
>
> > I suspect whatever corruption happens boils down to the same issue
> > that made people do that odd decision in the first place.
>
> The breakage appears to happen regardless of these changes, though.
>
> > And regardless, those games are too ugly to live. So I would suggest
> > that that original commit should just be considered broken, and
> > reverted (or just removed - I'm not sure if it's in a stable branch or
> > not) and the fix be rethought so that the code mapping can be done
> > once and for all and *without* the extra games.
>
> OK
Below is a new simplified version of the offending patch (without the
restore_pgd_addr and restore_pgd things). It works for me. :-)
Logan, Kees, can you please check if this one works for you too?
Thanks,
Rafael
---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image restoration
Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).
That turns out to be a consequence of a long-standing issue in the
64-bit image restoration code on x86: the temporary page tables it
sets up, to avoid corrupting the page tables while the last bits of
the image kernel's memory contents are copied into their original
page frames, re-use the boot kernel's text mapping, but that mapping
may very well get corrupted just like any other part of the page
tables. Of course, if that happens, the final jump to the image
kernel's entry point will go nowhere.
The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.
To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch. Hence, the kernel text mapping is only needed before
that code starts to run, and after that it is only used for the
final jump to the image kernel's entry point.
Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too. That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables. Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.
With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it). Update RESTORE_MAGIC
too to reflect the image header format change.
Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.
This makes the concern about possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.
Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
arch/x86/power/hibernate_64.c | 58 ++++++++++++++++++++++++++++++++------
arch/x86/power/hibernate_asm_64.S | 55 +++++++++++++++---------------------
2 files changed, 72 insertions(+), 41 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
* Address to jump to in the last phase of restore in order to get to the image
* kernel's text (this value is passed in the image header).
*/
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
void *relocated_restore_code __visible;
+static int set_up_temporary_text_mapping(void)
+{
+ unsigned long vaddr = (unsigned long)restore_jump_address;
+ unsigned long paddr = jump_address_phys & PMD_MASK;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ /*
+ * The new mapping only has to cover the page containing the image
+ * kernel's entry point (jump_address_phys), because the switch over to
+ * it is carried out by relocated code running from a page allocated
+ * specifically for this purpose and covered by the identity mapping, so
+ * the temporary kernel text mapping is only needed for the final jump.
+ * Moreover, in that mapping the virtual address of the image kernel's
+ * entry point must be the same as its virtual address in the image
+ * kernel (restore_jump_address), so the image kernel's
+ * restore_registers() code doesn't find itself in a different area of
+ * the virtual address space after switching over to the original page
+ * tables used by the image kernel.
+ */
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
+ set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
+ set_pgd(temp_level4_pgt + pgd_index(vaddr),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+
+ return 0;
+}
+
static void *alloc_pgt_page(void *context)
{
return (void *)get_safe_page(GFP_ATOMIC);
@@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;
- /* It is safe to reuse the original kernel mapping */
- set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
- init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ /* Prepare a temporary mapping for the kernel text */
+ result = set_up_temporary_text_mapping();
+ if (result)
+ return result;
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -89,8 +127,7 @@ int swsusp_arch_resume(void)
relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
if (!relocated_restore_code)
return -ENOMEM;
- memcpy(relocated_restore_code, &core_restore_code,
- &restore_registers - &core_restore_code);
+ memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
restore_image();
return 0;
@@ -108,12 +145,13 @@ int pfn_is_nosave(unsigned long pfn)
}
struct restore_data_record {
- unsigned long jump_address;
+ void *jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +164,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = &restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +181,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
pushfq
popq pt_regs_flags(%rax)
- /* save the address of restore_registers */
- movq $restore_registers, %rax
- movq %rax, restore_jump_address(%rip)
/* save cr3 */
movq %cr3, %rax
movq %rax, restore_cr3(%rip)
@@ -57,28 +54,29 @@ ENTRY(swsusp_arch_suspend)
ENDPROC(swsusp_arch_suspend)
ENTRY(restore_image)
+ /* prepare to jump to the image kernel */
+ movq restore_jump_address(%rip), %r8
+ movq restore_cr3(%rip), %r9
+
+ /* prepare to copy image data to their original locations */
+ movq restore_pblist(%rip), %rdx
+ movq relocated_restore_code(%rip), %r10
+
/* switch to temporary page tables */
- movq $__PAGE_OFFSET, %rdx
movq temp_level4_pgt(%rip), %rax
- subq %rdx, %rax
+ movq mmu_cr4_features(%rip), %rbx
+ movq $__PAGE_OFFSET, %rcx
+ subq %rcx, %rax
movq %rax, %cr3
- /* Flush TLB */
- movq mmu_cr4_features(%rip), %rax
- movq %rax, %rdx
- andq $~(X86_CR4_PGE), %rdx
- movq %rdx, %cr4; # turn off PGE
+ /* flush TLB */
+ movq %rbx, %rcx
+ andq $~(X86_CR4_PGE), %rcx
+ movq %rcx, %cr4; # turn off PGE
movq %cr3, %rcx; # flush TLB
movq %rcx, %cr3;
- movq %rax, %cr4; # turn PGE back on
-
- /* prepare to jump to the image kernel */
- movq restore_jump_address(%rip), %rax
- movq restore_cr3(%rip), %rbx
-
- /* prepare to copy image data to their original locations */
- movq restore_pblist(%rip), %rdx
- movq relocated_restore_code(%rip), %rcx
- jmpq *%rcx
+ movq %rbx, %cr4; # turn PGE back on
+ /* jump to relocated restore code */
+ jmpq *%r10
/* code below has been relocated to a safe page */
ENTRY(core_restore_code)
@@ -96,24 +94,17 @@ ENTRY(core_restore_code)
/* progress to the next pbe */
movq pbe_next(%rdx), %rdx
jmp .Lloop
+
.Ldone:
/* jump to the restore_registers address from the image header */
- jmpq *%rax
- /*
- * NOTE: This assumes that the boot kernel's text mapping covers the
- * image kernel's page containing restore_registers and the address of
- * this page is the same as in the image kernel's text mapping (it
- * should always be true, because the text mapping is linear, starting
- * from 0, and is supposed to cover the entire kernel text for every
- * kernel).
- *
- * code below belongs to the image kernel
- */
+ jmpq *%r8
+ /* code below belongs to the image kernel */
+ .align PAGE_SIZE
ENTRY(restore_registers)
FRAME_BEGIN
/* go back to the original page tables */
- movq %rbx, %cr3
+ movq %r9, %cr3
/* Flush TLB, including "global" things (vmalloc) */
movq mmu_cr4_features(%rip), %rax
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: ktime_get_ts64() splat during resume
2016-06-21 0:05 ` Rafael J. Wysocki
@ 2016-06-21 1:22 ` Rafael J. Wysocki
2016-06-21 4:35 ` Logan Gunthorpe
0 siblings, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 1:22 UTC (permalink / raw)
To: Linus Torvalds, Logan Gunthorpe, Kees Cook
Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >>
> > >> Overall, we seem to be heading towards the "really weird" territory here.
> > >
> > > So the whole commit that Boris bisected down to is weird to me.
> > >
> > > Why isn't the temporary text mapping just set up unconditionally in
> > > the temp_level4_pgt?
> > >
> > > Why does it have that insane "let's leave the temp_level4_pgt alone
> > > until we actually switch to it, and save away restore_pgd_addr and the
> > > restore_pgd, to then be set up at restore time"?
> > >
> > > All the other temporary mappings are set up statically in the
> > > temp_level4_pgt, why not that one?
> >
> > The text mapping in temp_level4_pgt has to map the image kernel's
> > physical entry address to the same virtual address that the image
> > kernel had for it, because the image kernel will switch over to its
> > own page tables first and it will use its own kernel text mapping from
> > that point on. That may not be the same as the text mapping of the
> > (currently running) restore (or "boot") kernel.
> >
> > So if we set up the text mapping in temp_level4_pgt upfront, we
> > basically can't reference the original kernel text (or do any
> > addressing relative to it) any more after switching over to
> > temp_level4_pgt.
> >
> > For some reason I thought that was not doable, but now that I look at
> > the code it looks like it can be done. I'll try doing that.
I recalled what the problem was. :-)
In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.
To get rid of that implicit assumption it has to switch to temp_level4_pgt
from the relocated code itself, but for that to work, the page containing
the relocated code must be executable in the original page tables (it isn't
usually).
So an updated patch is appended.
Again, it works for me, but I'm wondering about everybody else.
Thanks,
Rafael
---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).
That turns out to be a consequence of a long-standing issue in the
64-bit image restoration code on x86: the temporary page tables it
sets up, to avoid corrupting the page tables while the last bits of
the image kernel's memory contents are copied into their original
page frames, re-use the boot kernel's text mapping, but that mapping
may very well get corrupted just like any other part of the page
tables. Of course, if that happens, the final jump to the image
kernel's entry point will go nowhere.
The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.
To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch. Hence, the kernel text mapping is only needed before
that code starts to run, and after that it is only used for the
final jump to the image kernel's entry point.
Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too. That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables. Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.
With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it). Update RESTORE_MAGIC
too to reflect the image header format change.
Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.
This makes the concern about possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.
Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
arch/x86/power/hibernate_64.c | 93 ++++++++++++++++++++++++++++++++------
arch/x86/power/hibernate_asm_64.S | 55 +++++++++-------------
2 files changed, 104 insertions(+), 44 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
* Address to jump to in the last phase of restore in order to get to the image
* kernel's text (this value is passed in the image header).
*/
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
void *relocated_restore_code __visible;
+static int set_up_temporary_text_mapping(void)
+{
+ unsigned long vaddr = (unsigned long)restore_jump_address;
+ unsigned long paddr = jump_address_phys & PMD_MASK;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ /*
+ * The new mapping only has to cover the page containing the image
+ * kernel's entry point (jump_address_phys), because the switch over to
+ * it is carried out by relocated code running from a page allocated
+ * specifically for this purpose and covered by the identity mapping, so
+ * the temporary kernel text mapping is only needed for the final jump.
+ * Moreover, in that mapping the virtual address of the image kernel's
+ * entry point must be the same as its virtual address in the image
+ * kernel (restore_jump_address), so the image kernel's
+ * restore_registers() code doesn't find itself in a different area of
+ * the virtual address space after switching over to the original page
+ * tables used by the image kernel.
+ */
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
+ set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
+ set_pgd(temp_level4_pgt + pgd_index(vaddr),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+
+ return 0;
+}
+
static void *alloc_pgt_page(void *context)
{
return (void *)get_safe_page(GFP_ATOMIC);
@@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;
- /* It is safe to reuse the original kernel mapping */
- set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
- init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ /* Prepare a temporary mapping for the kernel text */
+ result = set_up_temporary_text_mapping();
+ if (result)
+ return result;
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +116,45 @@ static int set_up_temporary_mappings(voi
return 0;
}
+static int relocate_restore_code(void)
+{
+ unsigned long addr;
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
+ if (!relocated_restore_code)
+ return -ENOMEM;
+
+ memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+ /* Make the page containing the relocated code executable */
+ addr = (unsigned long)relocated_restore_code;
+ pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
+ pmd = pmd_offset(pud_offset(pgd, addr), addr);
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ } else {
+ pte_t *pte = pte_offset_kernel(pmd, addr);
+
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+ }
+
+ return 0;
+}
+
int swsusp_arch_resume(void)
{
int error;
/* We have got enough memory and from now on we cannot recover */
- if ((error = set_up_temporary_mappings()))
+ error = set_up_temporary_mappings();
+ if (error)
return error;
- relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
- if (!relocated_restore_code)
- return -ENOMEM;
- memcpy(relocated_restore_code, &core_restore_code,
- &restore_registers - &core_restore_code);
+ error = relocate_restore_code();
+ if (error)
+ return error;
restore_image();
return 0;
@@ -108,12 +172,13 @@ int pfn_is_nosave(unsigned long pfn)
}
struct restore_data_record {
- unsigned long jump_address;
+ void *jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +191,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = &restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +208,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
pushfq
popq pt_regs_flags(%rax)
- /* save the address of restore_registers */
- movq $restore_registers, %rax
- movq %rax, restore_jump_address(%rip)
/* save cr3 */
movq %cr3, %rax
movq %rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
ENDPROC(swsusp_arch_suspend)
ENTRY(restore_image)
- /* switch to temporary page tables */
- movq $__PAGE_OFFSET, %rdx
- movq temp_level4_pgt(%rip), %rax
- subq %rdx, %rax
- movq %rax, %cr3
- /* Flush TLB */
- movq mmu_cr4_features(%rip), %rax
- movq %rax, %rdx
- andq $~(X86_CR4_PGE), %rdx
- movq %rdx, %cr4; # turn off PGE
- movq %cr3, %rcx; # flush TLB
- movq %rcx, %cr3;
- movq %rax, %cr4; # turn PGE back on
-
/* prepare to jump to the image kernel */
- movq restore_jump_address(%rip), %rax
- movq restore_cr3(%rip), %rbx
+ movq restore_jump_address(%rip), %r8
+ movq restore_cr3(%rip), %r9
+
+ /* prepare to switch to temporary page tables */
+ movq temp_level4_pgt(%rip), %rax
+ movq mmu_cr4_features(%rip), %rbx
/* prepare to copy image data to their original locations */
movq restore_pblist(%rip), %rdx
+
+ /* jump to relocated restore code */
movq relocated_restore_code(%rip), %rcx
jmpq *%rcx
/* code below has been relocated to a safe page */
ENTRY(core_restore_code)
+ /* switch to temporary page tables */
+ movq $__PAGE_OFFSET, %rcx
+ subq %rcx, %rax
+ movq %rax, %cr3
+ /* flush TLB */
+ movq %rbx, %rcx
+ andq $~(X86_CR4_PGE), %rcx
+ movq %rcx, %cr4; # turn off PGE
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;
+ movq %rbx, %cr4; # turn PGE back on
.Lloop:
testq %rdx, %rdx
jz .Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
/* progress to the next pbe */
movq pbe_next(%rdx), %rdx
jmp .Lloop
+
.Ldone:
/* jump to the restore_registers address from the image header */
- jmpq *%rax
- /*
- * NOTE: This assumes that the boot kernel's text mapping covers the
- * image kernel's page containing restore_registers and the address of
- * this page is the same as in the image kernel's text mapping (it
- * should always be true, because the text mapping is linear, starting
- * from 0, and is supposed to cover the entire kernel text for every
- * kernel).
- *
- * code below belongs to the image kernel
- */
+ jmpq *%r8
+ /* code below belongs to the image kernel */
+ .align PAGE_SIZE
ENTRY(restore_registers)
FRAME_BEGIN
/* go back to the original page tables */
- movq %rbx, %cr3
+ movq %r9, %cr3
/* Flush TLB, including "global" things (vmalloc) */
movq mmu_cr4_features(%rip), %rax
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: ktime_get_ts64() splat during resume
2016-06-21 1:22 ` Rafael J. Wysocki
@ 2016-06-21 4:35 ` Logan Gunthorpe
2016-06-21 11:36 ` Rafael J. Wysocki
2016-06-21 18:04 ` Kees Cook
0 siblings, 2 replies; 16+ messages in thread
From: Logan Gunthorpe @ 2016-06-21 4:35 UTC (permalink / raw)
To: Rafael J. Wysocki, Linus Torvalds, Kees Cook
Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
Hey Rafael,
This patch appears to be working on my laptop. Thanks.
Logan
On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:
> On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
>> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
>>> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>>>>
>>>>> Overall, we seem to be heading towards the "really weird" territory here.
>>>>
>>>> So the whole commit that Boris bisected down to is weird to me.
>>>>
>>>> Why isn't the temporary text mapping just set up unconditionally in
>>>> the temp_level4_pgt?
>>>>
>>>> Why does it have that insane "let's leave the temp_level4_pgt alone
>>>> until we actually switch to it, and save away restore_pgd_addr and the
>>>> restore_pgd, to then be set up at restore time"?
>>>>
>>>> All the other temporary mappings are set up statically in the
>>>> temp_level4_pgt, why not that one?
>>>
>>> The text mapping in temp_level4_pgt has to map the image kernel's
>>> physical entry address to the same virtual address that the image
>>> kernel had for it, because the image kernel will switch over to its
>>> own page tables first and it will use its own kernel text mapping from
>>> that point on. That may not be the same as the text mapping of the
>>> (currently running) restore (or "boot") kernel.
>>>
>>> So if we set up the text mapping in temp_level4_pgt upfront, we
>>> basically can't reference the original kernel text (or do any
>>> addressing relative to it) any more after switching over to
>>> temp_level4_pgt.
>>>
>>> For some reason I thought that was not doable, but now that I look at
>>> the code it looks like it can be done. I'll try doing that.
>
> I recalled what the problem was. :-)
>
> In principle, the kernel text mapping in the image kernel may be different
> from the kernel text mapping in the restore ("boot") kernel, but the patch
> I posted a couple of hours ago actually assumed them to be the same, because
> it switched to temp_level4_pgt before jumping to the relocated code.
>
> To get rid of that implicit assumption it has to switch to temp_level4_pgt
> from the relocated code itself, but for that to work, the page containing
> the relocated code must be executable in the original page tables (it isn't
> usually).
>
> So an updated patch is appended.
>
> Again, it works for me, but I'm wondering about everybody else.
>
> Thanks,
> Rafael
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue in the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up, to avoid corrupting the page tables while the last bits of
> the image kernel's memory contents are copied into their original
> page frames, re-use the boot kernel's text mapping, but that mapping
> may very well get corrupted just like any other part of the page
> tables. Of course, if that happens, the final jump to the image
> kernel's entry point will go nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames occupied by them
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch. Hence, the kernel text mapping is only needed before
> that code starts to run, and after that it is only used for the
> final jump to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too. That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables. Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it). Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> arch/x86/power/hibernate_64.c | 93 ++++++++++++++++++++++++++++++++------
> arch/x86/power/hibernate_asm_64.S | 55 +++++++++-------------
> 2 files changed, 104 insertions(+), 44 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
> * Address to jump to in the last phase of restore in order to get to the image
> * kernel's text (this value is passed in the image header).
> */
> -unsigned long restore_jump_address __visible;
> +void *restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
> /*
> * Value of the cr3 register from before the hibernation (this value is passed
> @@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
>
> void *relocated_restore_code __visible;
>
> +static int set_up_temporary_text_mapping(void)
> +{
> + unsigned long vaddr = (unsigned long)restore_jump_address;
> + unsigned long paddr = jump_address_phys & PMD_MASK;
> + pmd_t *pmd;
> + pud_t *pud;
> +
> + /*
> + * The new mapping only has to cover the page containing the image
> + * kernel's entry point (jump_address_phys), because the switch over to
> + * it is carried out by relocated code running from a page allocated
> + * specifically for this purpose and covered by the identity mapping, so
> + * the temporary kernel text mapping is only needed for the final jump.
> + * Moreover, in that mapping the virtual address of the image kernel's
> + * entry point must be the same as its virtual address in the image
> + * kernel (restore_jump_address), so the image kernel's
> + * restore_registers() code doesn't find itself in a different area of
> + * the virtual address space after switching over to the original page
> + * tables used by the image kernel.
> + */
> + pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> + if (!pud)
> + return -ENOMEM;
> +
> + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> + if (!pmd)
> + return -ENOMEM;
> +
> + set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
> + set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
> + set_pgd(temp_level4_pgt + pgd_index(vaddr),
> + __pgd(__pa(pud) | _KERNPG_TABLE));
> +
> + return 0;
> +}
> +
> static void *alloc_pgt_page(void *context)
> {
> return (void *)get_safe_page(GFP_ATOMIC);
> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
> if (!temp_level4_pgt)
> return -ENOMEM;
>
> - /* It is safe to reuse the original kernel mapping */
> - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> - init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> + /* Prepare a temporary mapping for the kernel text */
> + result = set_up_temporary_text_mapping();
> + if (result)
> + return result;
>
> /* Set up the direct mapping from scratch */
> for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +116,45 @@ static int set_up_temporary_mappings(voi
> return 0;
> }
>
> +static int relocate_restore_code(void)
> +{
> + unsigned long addr;
> + pgd_t *pgd;
> + pmd_t *pmd;
> +
> + relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> + if (!relocated_restore_code)
> + return -ENOMEM;
> +
> + memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> + /* Make the page containing the relocated code executable */
> + addr = (unsigned long)relocated_restore_code;
> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
> + pmd = pmd_offset(pud_offset(pgd, addr), addr);
> + if (pmd_large(*pmd)) {
> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> + } else {
> + pte_t *pte = pte_offset_kernel(pmd, addr);
> +
> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> + }
> +
> + return 0;
> +}
> +
> int swsusp_arch_resume(void)
> {
> int error;
>
> /* We have got enough memory and from now on we cannot recover */
> - if ((error = set_up_temporary_mappings()))
> + error = set_up_temporary_mappings();
> + if (error)
> return error;
>
> - relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> - if (!relocated_restore_code)
> - return -ENOMEM;
> - memcpy(relocated_restore_code, &core_restore_code,
> - &restore_registers - &core_restore_code);
> + error = relocate_restore_code();
> + if (error)
> + return error;
>
> restore_image();
> return 0;
> @@ -108,12 +172,13 @@ int pfn_is_nosave(unsigned long pfn)
> }
>
> struct restore_data_record {
> - unsigned long jump_address;
> + void *jump_address;
> + unsigned long jump_address_phys;
> unsigned long cr3;
> unsigned long magic;
> };
>
> -#define RESTORE_MAGIC 0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC 0x123456789ABCDEF0UL
>
> /**
> * arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +191,8 @@ int arch_hibernation_header_save(void *a
>
> if (max_size < sizeof(struct restore_data_record))
> return -EOVERFLOW;
> - rdr->jump_address = restore_jump_address;
> + rdr->jump_address = &restore_registers;
> + rdr->jump_address_phys = __pa_symbol(&restore_registers);
> rdr->cr3 = restore_cr3;
> rdr->magic = RESTORE_MAGIC;
> return 0;
> @@ -142,6 +208,7 @@ int arch_hibernation_header_restore(void
> struct restore_data_record *rdr = addr;
>
> restore_jump_address = rdr->jump_address;
> + jump_address_phys = rdr->jump_address_phys;
> restore_cr3 = rdr->cr3;
> return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
> }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
> pushfq
> popq pt_regs_flags(%rax)
>
> - /* save the address of restore_registers */
> - movq $restore_registers, %rax
> - movq %rax, restore_jump_address(%rip)
> /* save cr3 */
> movq %cr3, %rax
> movq %rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
> ENDPROC(swsusp_arch_suspend)
>
> ENTRY(restore_image)
> - /* switch to temporary page tables */
> - movq $__PAGE_OFFSET, %rdx
> - movq temp_level4_pgt(%rip), %rax
> - subq %rdx, %rax
> - movq %rax, %cr3
> - /* Flush TLB */
> - movq mmu_cr4_features(%rip), %rax
> - movq %rax, %rdx
> - andq $~(X86_CR4_PGE), %rdx
> - movq %rdx, %cr4; # turn off PGE
> - movq %cr3, %rcx; # flush TLB
> - movq %rcx, %cr3;
> - movq %rax, %cr4; # turn PGE back on
> -
> /* prepare to jump to the image kernel */
> - movq restore_jump_address(%rip), %rax
> - movq restore_cr3(%rip), %rbx
> + movq restore_jump_address(%rip), %r8
> + movq restore_cr3(%rip), %r9
> +
> + /* prepare to switch to temporary page tables */
> + movq temp_level4_pgt(%rip), %rax
> + movq mmu_cr4_features(%rip), %rbx
>
> /* prepare to copy image data to their original locations */
> movq restore_pblist(%rip), %rdx
> +
> + /* jump to relocated restore code */
> movq relocated_restore_code(%rip), %rcx
> jmpq *%rcx
>
> /* code below has been relocated to a safe page */
> ENTRY(core_restore_code)
> + /* switch to temporary page tables */
> + movq $__PAGE_OFFSET, %rcx
> + subq %rcx, %rax
> + movq %rax, %cr3
> + /* flush TLB */
> + movq %rbx, %rcx
> + andq $~(X86_CR4_PGE), %rcx
> + movq %rcx, %cr4; # turn off PGE
> + movq %cr3, %rcx; # flush TLB
> + movq %rcx, %cr3;
> + movq %rbx, %cr4; # turn PGE back on
> .Lloop:
> testq %rdx, %rdx
> jz .Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
> /* progress to the next pbe */
> movq pbe_next(%rdx), %rdx
> jmp .Lloop
> +
> .Ldone:
> /* jump to the restore_registers address from the image header */
> - jmpq *%rax
> - /*
> - * NOTE: This assumes that the boot kernel's text mapping covers the
> - * image kernel's page containing restore_registers and the address of
> - * this page is the same as in the image kernel's text mapping (it
> - * should always be true, because the text mapping is linear, starting
> - * from 0, and is supposed to cover the entire kernel text for every
> - * kernel).
> - *
> - * code below belongs to the image kernel
> - */
> + jmpq *%r8
>
> + /* code below belongs to the image kernel */
> + .align PAGE_SIZE
> ENTRY(restore_registers)
> FRAME_BEGIN
> /* go back to the original page tables */
> - movq %rbx, %cr3
> + movq %r9, %cr3
>
> /* Flush TLB, including "global" things (vmalloc) */
> movq mmu_cr4_features(%rip), %rax
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: ktime_get_ts64() splat during resume
2016-06-21 4:35 ` Logan Gunthorpe
@ 2016-06-21 11:36 ` Rafael J. Wysocki
2016-06-21 18:04 ` Kees Cook
1 sibling, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 11:36 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Linus Torvalds, Kees Cook, Rafael J. Wysocki, Borislav Petkov,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml, John Stultz,
Rafael J. Wysocki, Stable, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote:
> Hey Rafael,
Hi,
> This patch appears to be working on my laptop. Thanks.
Thanks for the confirmation!
Rafael
* Re: ktime_get_ts64() splat during resume
2016-06-21 4:35 ` Logan Gunthorpe
2016-06-21 11:36 ` Rafael J. Wysocki
@ 2016-06-21 18:04 ` Kees Cook
2016-06-21 23:29 ` Rafael J. Wysocki
1 sibling, 1 reply; 16+ messages in thread
From: Kees Cook @ 2016-06-21 18:04 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Rafael J. Wysocki, Linus Torvalds, Rafael J. Wysocki,
Borislav Petkov, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
Stephen Smalley
On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey Rafael,
>
> This patch appears to be working on my laptop. Thanks.
Same for me: resume still works with KASLR in my tests too.
-Kees
--
Kees Cook
Chrome OS & Brillo Security
* Re: ktime_get_ts64() splat during resume
2016-06-21 18:04 ` Kees Cook
@ 2016-06-21 23:29 ` Rafael J. Wysocki
0 siblings, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 23:29 UTC (permalink / raw)
To: Kees Cook
Cc: Logan Gunthorpe, Linus Torvalds, Rafael J. Wysocki,
Borislav Petkov, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
Stephen Smalley
On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
>
> Same for me: resume still works with KASLR in my tests too.
Thanks for the confirmation!
Rafael
end of thread, other threads:[~2016-06-21 23:25 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20160617105435.GB15997@pd.tnic>
[not found] ` <alpine.DEB.2.11.1606171351390.5839@nanos>
[not found] ` <20160617132943.GA3336@nazgul.tnic>
2016-06-17 14:33 ` ktime_get_ts64() splat during resume Borislav Petkov
2016-06-17 15:28 ` Rafael J. Wysocki
2016-06-17 16:12 ` Borislav Petkov
2016-06-17 21:03 ` Rafael J. Wysocki
2016-06-18 1:11 ` Rafael J. Wysocki
2016-06-20 14:38 ` Rafael J. Wysocki
2016-06-20 18:29 ` Linus Torvalds
2016-06-20 21:15 ` Rafael J. Wysocki
2016-06-21 0:05 ` Rafael J. Wysocki
2016-06-21 1:22 ` Rafael J. Wysocki
2016-06-21 4:35 ` Logan Gunthorpe
2016-06-21 11:36 ` Rafael J. Wysocki
2016-06-21 18:04 ` Kees Cook
2016-06-21 23:29 ` Rafael J. Wysocki
2016-06-20 8:17 ` chenyu
2016-06-20 12:21 ` Rafael J. Wysocki