* Question regardin intel64 arch and page table setup
@ 2010-08-11 19:47 Neil Horman
2010-08-11 20:02 ` H. Peter Anvin
0 siblings, 1 reply; 10+ messages in thread
From: Neil Horman @ 2010-08-11 19:47 UTC (permalink / raw)
To: kexec; +Cc: ebiederm, hpa
Hey all-
I've got a question regarding x86_64 and how Linux uses the paging
hardware. I'm tinkering with ways to get kexec to boot a new kernel on panic
without leaving long mode. The idea is that if we can do that, then we don't
need to store the new kdump kernel below the 4G physical limit for 32 bit
systems. In doing this, though, I figured I would have to re-initialize the page
table with an identity mapped set of page tables to cover all of ram and load
that into cr3. My question is: is it safe to do so while paging is enabled?
The docs I've read are unclear on that, and if I have to disable paging, that
automatically drops me out of long mode, which is bad. I would think it's safe
to do, since I imagined we had to do it on context switches in the scheduler, but
the __switch_to implementation for x86_64 seems to do nothing but update the task
register. Intel vol 3a says we need to update cr3, but I don't see where that
happens, so I'm not sure if there's some automated bit that does a cr3 update
safely when we write tr.
Anywho, any guidance or clarification would be appreciated. Thanks!
Neil
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
* Re: Question regardin intel64 arch and page table setup
2010-08-11 19:47 Question regardin intel64 arch and page table setup Neil Horman
@ 2010-08-11 20:02 ` H. Peter Anvin
2010-08-11 21:51 ` Eric W. Biederman
2010-08-12 1:05 ` Neil Horman
0 siblings, 2 replies; 10+ messages in thread
From: H. Peter Anvin @ 2010-08-11 20:02 UTC (permalink / raw)
To: Neil Horman; +Cc: kexec, ebiederm
It is definitely safe to load a new CR3 while paging is enabled; it is done
all the time. The currently executing page needs to be mapped to the
same physical and virtual address in both sets of page tables, as it is
in most kernels.
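A minimal user-space sketch of that invariant, not kernel code: it builds toy 4-level page-table trees with 2 MiB pages in a simulated arena and checks that the address being executed translates to the same physical address under both roots before a CR3 switch would be attempted. The arena, the helper names (alloc_table, map_2m, translate, safe_to_switch) and the 2 MiB-only walk are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x001ULL
#define PTE_WRITE   0x002ULL
#define PTE_PS      0x080ULL               /* 2 MiB page at the PD level */
#define ADDR_MASK   0x000ffffffffff000ULL  /* physical address bits 51:12 */
#define LARGE_MASK  0x1fffffULL            /* offset within a 2 MiB page */

/* Simulated physical memory holding the tables; "physical" address = index << 12. */
#define ARENA_PAGES 64
static uint64_t arena[ARENA_PAGES][512];
static unsigned next_page = 1;             /* page 0 is left unused */

static uint64_t *table_at(uint64_t pa) { return arena[pa >> 12]; }

/* Hand out one zeroed table page from the arena and return its address. */
static uint64_t alloc_table(void)
{
    assert(next_page < ARENA_PAGES);
    return (uint64_t)next_page++ << 12;
}

/* Return the next-level table for entry idx, allocating it if missing. */
static uint64_t *next_level(uint64_t *table, unsigned idx)
{
    if (!(table[idx] & PTE_PRESENT))
        table[idx] = alloc_table() | PTE_PRESENT | PTE_WRITE;
    return table_at(table[idx] & ADDR_MASK);
}

/* Map the 2 MiB page containing va to the 2 MiB frame containing pa. */
static void map_2m(uint64_t root, uint64_t va, uint64_t pa)
{
    uint64_t *pdpt = next_level(table_at(root), (va >> 39) & 511);
    uint64_t *pd   = next_level(pdpt, (va >> 30) & 511);
    pd[(va >> 21) & 511] = (pa & ~LARGE_MASK) | PTE_PRESENT | PTE_WRITE | PTE_PS;
}

/* Software page walk; only understands the 2 MiB mappings created above. */
static bool translate(uint64_t root, uint64_t va, uint64_t *pa)
{
    uint64_t e = table_at(root)[(va >> 39) & 511];
    if (!(e & PTE_PRESENT)) return false;
    e = table_at(e & ADDR_MASK)[(va >> 30) & 511];
    if (!(e & PTE_PRESENT)) return false;
    e = table_at(e & ADDR_MASK)[(va >> 21) & 511];
    if (!(e & PTE_PRESENT) || !(e & PTE_PS)) return false;
    *pa = (e & ADDR_MASK & ~LARGE_MASK) | (va & LARGE_MASK);
    return true;
}

/* True if the code at va stays at the same physical address across the switch. */
static bool safe_to_switch(uint64_t old_root, uint64_t new_root, uint64_t va)
{
    uint64_t old_pa, new_pa;
    return translate(old_root, va, &old_pa) &&
           translate(new_root, va, &new_pa) &&
           old_pa == new_pa;
}
```

For example, identity mapping the 2 MiB page around an address at 5 GiB in two trees makes safe_to_switch() return true, while a tree that maps the same virtual address to a different frame makes it return false.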
However, there are a *LOT* of issues with having a kernel that is
completely above 4 GiB. For one thing, a lot of device drivers simply
will not work if there is no memory below 4 GiB available to the
kernel. As such, I don't think you will be successful in this project.
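To illustrate the driver concern, here is a small sketch of the usual reachability test for a device with a limited DMA address width. dma_bit_mask() mirrors the kernel's DMA_BIT_MASK() macro; dma_reachable() is a hypothetical helper for illustration, not a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low n bits set, in the style of the kernel's DMA_BIT_MASK(n). */
static uint64_t dma_bit_mask(unsigned n)
{
    return n >= 64 ? ~0ULL : (1ULL << n) - 1;
}

/* True if every byte of [addr, addr + len) is addressable under mask. */
static bool dma_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
    if (len == 0)
        return false;
    uint64_t last = addr + len - 1;
    if (last < addr)            /* address range wrapped around */
        return false;
    return last <= mask;
}
```

A device limited to 32-bit DMA can reach a buffer that ends exactly at the 4 GiB boundary, but not one that starts at or above it; without an IOMMU or bounce buffers below 4 GiB, such a device has nothing to work with.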
-hpa
* Re: Question regardin intel64 arch and page table setup
2010-08-11 20:02 ` H. Peter Anvin
@ 2010-08-11 21:51 ` Eric W. Biederman
2010-08-11 21:54 ` Eric W. Biederman
2010-08-11 22:02 ` H. Peter Anvin
2010-08-12 1:05 ` Neil Horman
1 sibling, 2 replies; 10+ messages in thread
From: Eric W. Biederman @ 2010-08-11 21:51 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: kexec, Neil Horman
"H. Peter Anvin" <hpa@zytor.com> writes:
> It is definitely safe to load a new CR3 while paging is enabled; it is done
> all the time. The currently executing page needs to be mapped to the
> same physical and virtual address in most kernels.
>
> However, there are a *LOT* of issues with having a kernel that is
> completely above 4 GiB. For one thing, a lot of device drivers simply
> will not work if there is no memory below 4 GiB available to the
> kernel. As such, I don't think you will be successful in this
> project.
A couple of pieces.
1) The kernel side of kexec and kexec on panic does not leave long mode.
Long mode is left by the glue code in /sbin/kexec.
2) I agree about the DMA limitation; however, there are enough systems
with IOMMUs these days that you may be able to get it to work.
3) I would start by just getting the normal kexec case to work.
The 64bit kernel does support starting at the 64bit entry point,
but I don't think it has been tested if loaded above 4G.
It certainly should work, and as time goes by I expect running
a kernel above 4G to become an increasingly interesting use case.
So it is certainly worth playing with.
But as Peter says, having a kernel completely above 4GiB is likely
to uncover a lot of baked-in assumptions, so real problems might
result.
Hmm. On the normal kexec side you don't lose the low 4GiB, so that
case should be a lot easier to bootstrap with. Once it works with
the low 4GiB you can add a mem= or whatever to disable using the low
4GiB and see what happens.
Have fun.
Eric
* Re: Question regardin intel64 arch and page table setup
2010-08-11 21:51 ` Eric W. Biederman
@ 2010-08-11 21:54 ` Eric W. Biederman
2010-08-11 22:02 ` H. Peter Anvin
1 sibling, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2010-08-11 21:54 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: kexec, Neil Horman
I guess the one place where we have a bottleneck with loading above 4GiB
today is that we don't export the kernel's 64bit entry point in bzImage
(although it is at a stable offset from the 32bit one), and we can't
make up the kernel parameters from scratch because there are variables
in there with non-zero changing values that the kernel expects to have
initialized.
But hacking around that for testing should not be hard.
Eric
* Re: Question regardin intel64 arch and page table setup
2010-08-11 21:51 ` Eric W. Biederman
2010-08-11 21:54 ` Eric W. Biederman
@ 2010-08-11 22:02 ` H. Peter Anvin
2010-08-12 0:22 ` Eric W. Biederman
1 sibling, 1 reply; 10+ messages in thread
From: H. Peter Anvin @ 2010-08-11 22:02 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: kexec, Neil Horman
On 08/11/2010 02:51 PM, Eric W. Biederman wrote:
>
> 3) I would start just getting the normal kexec case to work.
> The 64bit kernel does support starting at the 64bit entry point,
> but I don't think it has been tested if loaded above 4G.
>
I can guarantee that it hasn't; I looked at that code not all that long
ago and it's chock-full of 32- and 39-bit assumptions.
-hpa
* Re: Question regardin intel64 arch and page table setup
2010-08-11 22:02 ` H. Peter Anvin
@ 2010-08-12 0:22 ` Eric W. Biederman
0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2010-08-12 0:22 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: kexec, Neil Horman
"H. Peter Anvin" <hpa@zytor.com> writes:
> On 08/11/2010 02:51 PM, Eric W. Biederman wrote:
>>
>> 3) I would start just getting the normal kexec case to work.
>> The 64bit kernel does support starting at the 64bit entry point,
>> but I don't think it has been tested if loaded above 4G.
>>
>
> I can guarantee that it hasn't; I looked at that code not all that long
> ago and it's chock-full of 32- and 39-bit assumptions.
Ugh. I thought I had purged the 32bit assumptions. I guess it has been
a while. 39-bit assumptions are forgivable; the architecture didn't
support more than 40bit physical addresses when it was written.
I wonder if this is a problem for SGI. I remember on the ia64 NUMA
machines only node 0 had memory below 4GiB, and so if you booted without
node 0 you had no memory below 4GiB. I wonder if this restriction has
carried over to the x86_64 descendants of the Altix.
What is definitely true (unless someone has added an extension since last
I looked) is that on a normal x86 SMP architecture you can't start
additional processors without memory in the low 1MiB, because that is all
you can specify in the startup IPI.
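The limit follows from the encoding: the startup IPI carries an 8-bit vector, and the target processor begins executing in real mode at physical address vector << 12. A short sketch of the resulting check, with sipi_vector() as a hypothetical helper name and any reserved vector ranges ignored:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the 8-bit startup IPI vector for a trampoline at physical
 * address phys, or -1 if no vector can express that address.
 */
static int sipi_vector(uint64_t phys)
{
    if (phys & 0xfff)           /* start address must be 4 KiB aligned */
        return -1;
    if (phys >= (1ULL << 20))   /* 8 bits << 12 cannot reach 1 MiB or above */
        return -1;
    return (int)(phys >> 12);
}
```

A trampoline at 0x9000 yields vector 0x09, while any address at or above 1 MiB, including anything above 4 GiB, cannot be encoded at all.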
Eric
* Re: Question regardin intel64 arch and page table setup
2010-08-11 20:02 ` H. Peter Anvin
2010-08-11 21:51 ` Eric W. Biederman
@ 2010-08-12 1:05 ` Neil Horman
2010-08-12 1:46 ` H. Peter Anvin
` (2 more replies)
1 sibling, 3 replies; 10+ messages in thread
From: Neil Horman @ 2010-08-12 1:05 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: kexec, ebiederm, Neil Horman
Thanks for all the info, guys. I hadn't considered that we couldn't access the
64 bit startup point for the bzImage. I just figured we could jump to
startup_32 + 0x200 in the bzImage header once I had the page table bit set up
properly.
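As a sketch of the arithmetic only: the 0x200 offset is the one quoted above and is treated as an assumption here, and the function names are hypothetical. It also shows why a 32-bit jump target cannot describe an image loaded above 4 GiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY64_OFFSET 0x200ULL   /* offset from startup_32, as quoted in this thread */

/* Address of the 64-bit entry point for an image loaded at load_addr. */
static uint64_t entry64_addr(uint64_t load_addr)
{
    return load_addr + ENTRY64_OFFSET;
}

/* True if addr can be held in a 32-bit register or jump target. */
static bool fits_in_32bit(uint64_t addr)
{
    return addr <= 0xffffffffULL;
}
```

For an image at 16 MiB the entry address still fits in 32 bits; for one at 5 GiB it does not, so the jump has to be made from code that is already in long mode.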
I hadn't considered the problems we might encounter with driver issues loading
above 4 GiB and what have you, nor the starting of APs.
Regardless, I'll keep tinkering. One more question. When setting up the page
table in the panic boot case, is it sufficient to set up an identity map for the
pages in the reserved crashkernel range, or do we need to identity map the
entire range of ram?
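For a rough sense of what either choice costs, this sketch counts the page-table pages needed to identity map a physical range with 2 MiB pages under 4-level paging: one PML4, one PDPT per 512 GiB region touched, and one PD per 1 GiB region touched. The function names and the example ranges are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of (1 << shift)-sized regions touched by [start, end). */
static uint64_t regions_touched(uint64_t start, uint64_t end, unsigned shift)
{
    return ((end - 1) >> shift) - (start >> shift) + 1;
}

/* 4 KiB table pages needed to identity map [start, end) with 2 MiB pages. */
static uint64_t tables_needed(uint64_t start, uint64_t end)
{
    assert(start < end);
    return 1                                  /* PML4 */
         + regions_touched(start, end, 39)    /* PDPTs, one per 512 GiB */
         + regions_touched(start, end, 30);   /* PDs, one per 1 GiB */
}
```

Under these assumptions a 128 MiB crashkernel range at 5 GiB needs 3 table pages (12 KiB), while 64 GiB of RAM starting at 0 needs 66 (264 KiB), so the table memory itself is small in both cases.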
Best
Neil
* Re: Question regardin intel64 arch and page table setup
2010-08-12 1:05 ` Neil Horman
@ 2010-08-12 1:46 ` H. Peter Anvin
2010-08-12 1:53 ` H. Peter Anvin
2010-08-12 3:21 ` Eric W. Biederman
2 siblings, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2010-08-12 1:46 UTC (permalink / raw)
To: Neil Horman; +Cc: kexec, ebiederm, Neil Horman
On 08/11/2010 06:05 PM, Neil Horman wrote:
>
> Thanks for all the info, guys. I hadn't considered that we couldn't access the
> 64 bit startup point for the bzImage. I just figured we could jump to
> startup_32 + 0x200 in the bzImage header once I had the page table bit set up
> properly.
>
> I hadn't considered the problems we might encounter with driver issues loading
> above 4 GiB and what have you, nor the starting of APs.
>
> Regardless, I'll keep tinkering. One more question. When setting up the page
> table in the panic boot case, is it sufficient to set up an identity map for the
> pages in the reserved crashkernel range, or do we need to identity map the
> entire range of ram?
>
Not clear to me. Probably the former, but you might find problems.
Another issue: if you are planning to run SMP, the AP trampoline needs
low memory, too.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: Question regardin intel64 arch and page table setup
2010-08-12 1:05 ` Neil Horman
2010-08-12 1:46 ` H. Peter Anvin
@ 2010-08-12 1:53 ` H. Peter Anvin
2010-08-12 3:21 ` Eric W. Biederman
2 siblings, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2010-08-12 1:53 UTC (permalink / raw)
To: Neil Horman; +Cc: kexec, ebiederm, Neil Horman
One thing... if the crashkernel reservation can be made non-contiguous,
it probably doesn't require a whole lot of memory below 1 MiB (a handful
of pages) and below 4 GiB (probably less than a megabyte) to achieve
full functionality. It just means that the crashkernel has to operate
with a true memory map for its reservations.
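A sketch of what checking such a split reservation could look like, assuming the reservation is described as a list of physical ranges. The struct, the function names and the thresholds passed by the caller are hypothetical; the text above only says "a handful of pages" below 1 MiB and "probably less than a megabyte" below 4 GiB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ONE_MIB  (1ULL << 20)
#define FOUR_GIB (1ULL << 32)

struct mem_range { uint64_t start, end; };   /* half-open: [start, end) */

/* Total bytes of the reservation that lie below limit. */
static uint64_t bytes_below(const struct mem_range *r, size_t n, uint64_t limit)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t end = r[i].end < limit ? r[i].end : limit;
        if (end > r[i].start)
            total += end - r[i].start;
    }
    return total;
}

/* True if the reservation has enough memory under both boundaries. */
static bool reservation_ok(const struct mem_range *r, size_t n,
                           uint64_t need_below_1m, uint64_t need_below_4g)
{
    return bytes_below(r, n, ONE_MIB) >= need_below_1m &&
           bytes_below(r, n, FOUR_GIB) >= need_below_4g;
}
```

Note that memory below 1 MiB also counts toward the below-4 GiB total here. A reservation made of a few low pages, a small chunk below 4 GiB and a large chunk above it passes the check, whereas a single range above 4 GiB does not.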
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: Question regardin intel64 arch and page table setup
2010-08-12 1:05 ` Neil Horman
2010-08-12 1:46 ` H. Peter Anvin
2010-08-12 1:53 ` H. Peter Anvin
@ 2010-08-12 3:21 ` Eric W. Biederman
2 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2010-08-12 3:21 UTC (permalink / raw)
To: Neil Horman; +Cc: kexec, Neil Horman, H. Peter Anvin
Neil Horman <nhorman@redhat.com> writes:
> Thanks for all the info, guys. I hadn't considered that we couldn't access the
> 64 bit startup point for the bzImage. I just figured we could jump to
> startup_32 + 0x200 in the bzImage header once I had the page table bit set up
> properly.
>
> I hadn't considered the problems we might encounter with driver issues loading
> above 4 GiB and what have you, nor the starting of APs.
>
> Regardless, I'll keep tinkering. One more question. When setting up the page
> table in the panic boot case, is it sufficient to set up an identity map for the
> pages in the reserved crashkernel range, or do we need to identity map the
> entire range of ram?
You should be able to get away with simply using the page tables the
crashing/initial kernel sets up, as those should map all of memory,
and definitely all of memory the kernel you will be booting needs to
run in (the memory areas we tell it we are using).
I didn't do it much but I did test the 64bit kernel entry point ages ago
when I did the first round of implementing everything.
Eric