* [RFC/PATCH LGUEST X86_64 00/13] Lguest for the x86_64
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

Hi all!

Lately, Glauber and I have been working on getting both paravirt_ops
and lguest running on the x86_64.

I already pushed the x86_64 patches and, as promised, I'm now pushing the
lguest64 patches.

These patches are greatly influenced by Rusty Russell's work, and we tried
to stay somewhat consistent with his i386 code.  But there are some major
differences that we had to overcome.  Here's some of the thought we put
into this.

Major factors:

x86_64 has a much larger virtual address space
x86_64 has 4 levels of page tables!!!!


Because of the large virtual address space that x86_64 gives us, we
originally planned to map both the guest and the host into the same
address space.  One major requirement we had was to use a single kernel
image for both the host and the guest, and we thought that with the
relocatable kernel going upstream, we could use it to map that one
kernel at two different locations.

The problem we found with the relocatable kernel is that it is focused on
relocating the kernel in physical memory, not virtual memory!  It remaps
itself to a fixed virtual address and only the physical location changes.
So that approach wasn't an option.

So back to the drawing board!

What we came up with instead was something a little like i386 lguest:
map in only the hypervisor.  So how do we do this without causing too
much change in the kernel?

Well, it would be nice to have the hypervisor text mapped in at
the same virtual address for both the host and the guest.  But how
do we do this easily?

Well, the solution that we came up with, was to use a FIXMAP area.

Why?

Since we plan on using the same kernel for both the guest and the host,
a fixmap entry guarantees a location in the guest's virtual address
space that the host can use and the guest will not.

And since the area is virtual, we can make it as big as we need.

So we map the hypervisor text into this area for both the host
and the guest. The guest permissions for this area will obviously
be restricted to DPL 0 only (guest runs in PL 3).
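
As a rough sketch of why a fixmap slot works here (FIX_LGUEST_HV below is
a made-up index name for illustration, not what the patches use): a
fixmap index resolves to a compile-time-constant virtual address, and
since host and guest run the same kernel image, that address is identical
on both sides.

#include <asm/fixmap.h>

/* Illustrative only: FIX_LGUEST_HV would be a fixmap index reserved in
 * asm/fixmap.h for the lguest hypervisor area. */
static unsigned long lguest_hv_addr(void)
{
	/* fix_to_virt() yields the same constant address whether this
	 * kernel is running as the host or as a guest. */
	return fix_to_virt(FIX_LGUEST_HV);
}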

Now what about guest data?  As opposed to the i386 code, we don't put
any data in hypervisor.S.  All data goes into a shared guest data
structure called lguest_vcpu.  Each guest (and eventually, each guest
cpu) will have its own lguest_vcpu, and this structure is mapped into
the HV FIXMAP area at the same location for both the host and the guest.

What's also nice about this is that the host can see all the guest vcpu
shared data, but each guest only has access to its own, and only while
running at DPL 0.

These vcpu structures hold lots of data: the host's current gdt and idt
pointers, the cr3s (both guest and host), an NMI trampoline section, and
lots more.

Each guest also has a unique lguest_guest_info structure that stores
generic data for the guest, but nothing that would be needed for
running a specific VCPU.
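
To make that split concrete, here is a rough, illustrative sketch of the
two structures (field names, sizes and the per-guest vcpu count are
invented for this example; the real layout in the patches differs):

/* Illustrative sketch only; not the actual layout from the patches. */
#define LG_NMI_STACK_SIZE	4096	/* hypothetical */
#define LG_MAX_VCPUS		1	/* hypothetical */

struct lguest_vcpu {			/* per-vcpu, shared with the guest */
	unsigned long regs[32];		/* register/trap save area; see the
					 * Interrupts section below */
	unsigned long host_cr3, guest_cr3;
	unsigned long host_gdt_desc[2], host_idt_desc[2];
	unsigned char nmi_stack[LG_NMI_STACK_SIZE];
	/* NMI save area, NMI trampoline, ... */
};

struct lguest_guest_info {		/* per-guest, host-side bookkeeping */
	struct lguest_vcpu *vcpu[LG_MAX_VCPUS];
	/* pud/pmd/pte page hashes, pending interrupts, ... */
};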


Loading the hypervisor:
-----------------------

As opposed to compiling a separate hypervisor.c blob, we build the
hypervisor itself into the lg.o module.  We bracket it with start and
end tags and align it so that it sits on its own page.  We then use the
tags to map it into the HV FIXMAP area.
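
A sketch of what the host-side mapping amounts to (hv_start/hv_end and
FIX_LGUEST_HV_BEGIN are illustrative names for the start/end tags and the
first reserved fixmap slot, not the names in the patches):

#include <linux/kernel.h>
#include <asm/fixmap.h>
#include <asm/io.h>

/* Hypothetical symbols emitted around the page-aligned hypervisor text
 * that is linked into lg.o. */
extern char hv_start[], hv_end[];

static void map_hypervisor_text(void)
{
	unsigned long addr;
	int idx = FIX_LGUEST_HV_BEGIN;	/* hypothetical first fixmap slot */

	for (addr = (unsigned long)hv_start; addr < (unsigned long)hv_end;
	     addr += PAGE_SIZE, idx++)
		__set_fixmap(idx, virt_to_phys((void *)addr),
			     PAGE_KERNEL_EXEC);
}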

On starting a guest, the lguest64 loader maps the guest into memory the
same way the lguest32 loader does, and then calls into the kernel the
same way as well.

But once in the kernel, we do things slightly differently.

The lguest_vcpu struct is allocated (via get_free_pages) and then
mapped into the HV FIXMAP area.  The host then maps the HV pages and
this vcpu data into the guest's address space at the same location.

Then we jump to the hypervisor, which switches the gdt, idt and cr3
over to the guest's (as well as the process GS base) and does an iretq
into the guest.
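
A minimal sketch of that setup path, with helper and constant names
invented here (the real entry points live in the lguest64 patches):

#include <linux/gfp.h>
#include <linux/mm.h>

#define LG_VCPU_ORDER	1	/* hypothetical: vcpu struct spans 2 pages */

/* Stand-in for the code that installs these pages into the HV fixmap
 * slots and mirrors the HV text plus this vcpu into the guest's page
 * tables at the same virtual addresses. */
static void map_into_hv_area(void *addr, int order)
{
	/* ... */
}

static void *lguest64_alloc_vcpu(void)
{
	void *vcpu = (void *)__get_free_pages(GFP_KERNEL, LG_VCPU_ORDER);

	if (!vcpu)
		return NULL;
	map_into_hv_area(vcpu, LG_VCPU_ORDER);
	/* The switch code then loads the guest gdt/idt/cr3 and GS base
	 * and does an iretq into the guest. */
	return vcpu;
}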


Page faulting:
--------------

This is a bit different too.

When the guest takes a page fault, we jump back to the host via
switch_to_host, and the host maps in the page.
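
In host C code, that service path could look roughly like this (all names
are invented; this mirrors the shadow-page-table style of i386 lguest's
demand paging rather than quoting the actual lguest64 code):

struct lguest_guest_info;

/* Walk the guest's own 4-level page tables for vaddr; stubbed here. */
static int guest_pte_lookup(struct lguest_guest_info *linfo,
			    unsigned long vaddr, unsigned long *gpte)
{
	/* ... */
	return 0;
}

/* Install the matching entry in the shadow tables the CPU really uses. */
static void shadow_install(struct lguest_guest_info *linfo,
			   unsigned long vaddr, unsigned long gpte)
{
	/* ... */
}

/* Called in the host, after switch_to_host, when the guest faulted. */
static int lguest64_demand_page(struct lguest_guest_info *linfo,
				unsigned long vaddr)
{
	unsigned long gpte;

	if (!guest_pte_lookup(linfo, vaddr, &gpte))
		return 0;	/* guest has no mapping: reflect the fault */

	shadow_install(linfo, vaddr, gpte);
	return 1;		/* mapped in; resume the guest */
}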


Page Hashes
-----------

The lguest_guest_info structure holds a bunch of pud, pmd, and
pte page hashes, so that when we take a fault and add a new pte
to the guest, we have a way to traverse back to the original cr3
of the guest.

With 4 level paging, we need to keep track of this hierarchy.

Say the guest does a set_pte (or set_pmd or set_pud, for that matter):
we need a way to know which page to free.  So we look up the pte being
touched in the hash.  The info in the hash points us back to the pmd
that holds the pte, and if needed we can find the pud that holds the
pmd, and the pgd/cr3 that holds the pud.

This makes managing the page tables much easier.
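
A sketch of what one of these hashes could look like (types, names and
the hash size are invented for illustration):

#include <linux/hash.h>
#include <linux/list.h>

#define LG_PGHASH_BITS	7

/* One entry per shadow page-table page, hashed by the guest page frame
 * it shadows, with a back pointer to its parent in the hierarchy. */
struct lg_pt_page {
	struct hlist_node hash;
	unsigned long gpfn;		/* guest pfn of this pte/pmd/pud page */
	struct lg_pt_page *parent;	/* the pmd/pud/pgd page holding us */
	int parent_idx;			/* our slot in the parent */
};

static struct hlist_head lg_pte_hash[1 << LG_PGHASH_BITS];

static struct lg_pt_page *lg_find_pt_page(unsigned long gpfn)
{
	struct hlist_head *head = &lg_pte_hash[hash_long(gpfn, LG_PGHASH_BITS)];
	struct hlist_node *n;

	hlist_for_each(n, head) {
		struct lg_pt_page *p = hlist_entry(n, struct lg_pt_page, hash);

		if (p->gpfn == gpfn)
			return p;	/* p->parent leads back to the pmd,
					   and from there up to the pgd/cr3 */
	}
	return NULL;
}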

TODO:
=====

To prevent a guest from stealing all the host's memory pages, we can
use these hashes to also limit the number of puds, pmds, and ptes.

If the page is not pinned (currently used), we can set up LRU lists,
and find those pages that are somewhat stale, and free them.  This
can be done safely since we have all the info we need to put them
back if the guest needs them again.
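
A sketch of what that could look like (again, names and the limit are
invented; this is the TODO, not code in the patches):

#include <linux/list.h>

static LIST_HEAD(lg_pt_lru);			/* hot at head, cold at tail */
static unsigned int lg_nr_pt_pages;
static unsigned int lg_pt_page_limit = 256;	/* arbitrary per-guest budget */

/* Called whenever an unpinned shadow page-table page is touched. */
static void lg_pt_page_used(struct list_head *lru_link)
{
	list_move(lru_link, &lg_pt_lru);
}

/* Free stale pages from the cold end until we are back under the limit.
 * The release callback would unhook the page from the hashes and free it;
 * the hashes let us rebuild the entry later if the guest touches it. */
static void lg_pt_shrink(void (*release)(struct list_head *))
{
	while (lg_nr_pt_pages > lg_pt_page_limit && !list_empty(&lg_pt_lru)) {
		struct list_head *cold = lg_pt_lru.prev;

		list_del(cold);
		release(cold);
		lg_nr_pt_pages--;
	}
}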


cr3:
====

Right now we hold many more cr3/pgds than the i386 version does.  This
is because we have the ability to implement page cleaning at a lower
level, and this lets us limit the number of pages the guest can take
from the host.


Interrupts:
===========

When an interrupt goes off, we have tss->rsp0 pointing to the vcpu
struct's regs field.  This way we push onto the vcpu struct the trapnum,
error code, rip, cs, rflags, rsp and ss regs.  We also put the guest's
general regs and cr3 into this field.  This is somewhat similar to the
i386 way of doing things.

We then put back the host gdt, idt, tr and cr3 regs and jump back to
the host.

We use the stack pointer to find the location of the vcpu struct.
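
Roughly, and with an invented layout (the real regs ordering and the way
the vcpu is located may differ), the idea is:

#include <asm/page.h>

#define LG_VCPU_ORDER	1	/* hypothetical allocation order */

struct lguest_regs {
	/* pushed by the switch/interrupt code, lowest address first */
	unsigned long gprs[15];			/* rax ... r15 */
	unsigned long cr3;
	unsigned long trapnum, errcode;
	/* pushed by the CPU itself on the trap */
	unsigned long rip, cs, rflags, rsp, ss;
};

/* tss->rsp0 points just past this regs area inside the vcpu struct, so a
 * trap dumps the frame straight into the vcpu.  If the vcpu allocation is
 * naturally aligned, masking the in-trap stack pointer recovers it. */
static void *lg_vcpu_from_rsp(unsigned long rsp)
{
	return (void *)(rsp & ~((PAGE_SIZE << LG_VCPU_ORDER) - 1));
}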

NMI:
====

NMI is a big PITA!!!!

I don't know how it works with i386 lguest, but this caused us loads of
hell.  The nmi can go off at any time, and having interrupts disabled
doesn't protect you from it.  So what to do about it!

Well, the order of loading the TR register is important.  The guest's
TSS segment uses the same IST for the NMI as the host does, so if an NMI
goes off before we load the guest IDT, the host should still function.
But the guest also has its own IST for its NMI, and the NMI stack for
the guest is also in the vcpu struct.  It needs its own stack because
the NMI can go off while we are in the process of storing data from an
interrupt, and we would otherwise mess up the vcpu struct.

After an NMI goes off, we really don't know what state we are in, so
basically we save everything.  But we only save on the first NMI of a
nested NMI sequence (explained further down).

When an NMI goes off, we find the vcpu struct from the offset of the
stack pointer.  We check a flag letting us know if we are in a nested
NMI (you'll see soon), and if we are not, then we save the current GDT,
regs, GS base and GS shadow (we don't know whether a swapgs has
happened; remember that the guest uses its GS too, so the shadow and
normal GS base can hold the same address, which is how Linux knows
whether to swap).  All this data is stored in a separate location in the
vcpu, reserved for NMI usage only.

We then set up the GDT, cr3 and GS base for the host, regardless of
whether we are in a nested NMI or not.

We then set up the call to the actual NMI handler, set the flag that
says we are in an NMI handler, and call the host NMI handler.  The
return address of that set-up actually points back into the HV text that
invoked the NMI handler.  But once the host handler has done its iret,
we are once again susceptible to more NMIs (hence the nested NMI).  So
we start restoring everything from the NMI storage back to the state
before the NMI.

If another NMI goes off, it will skip the save part (and so avoid
blowing away all the data from the original NMI), load the host context,
and jump again to the NMI handler.  On return we jump back and try the
restore again.  We don't jump back into the previous restore, since we
don't need to; we just keep retrying the restore until it completes
without another NMI going off.

Once everything is back to normal and we have the return path set up, we
clear the NMI flag and do an iretq back to the original code that was
interrupted by the original NMI.
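
To make the flow easier to follow, here is a plain-C model of that
bookkeeping (this is not the hypervisor.S code; the state and names are
invented, and the real thing runs with no usable stack):

#include <stdbool.h>

struct cpu_state { unsigned long gdt, cr3, gs_base, gs_shadow; };

static struct cpu_state live;		/* state at the instant the NMI hit */
static struct cpu_state nmi_save;	/* NMI-only save area in the vcpu   */
static bool in_nmi;			/* the "nested NMI" flag            */

static void load_host_context(void) { /* reload host GDT/cr3/GS */ }
static void host_nmi_handler(void)  { /* the host's real handler */ }

static void nmi_entry(void)
{
	/* Only the outermost NMI saves state; a nested NMI must not blow
	 * away the original save. */
	if (!in_nmi) {
		nmi_save = live;
		in_nmi = true;
	}

	/* Host context is (re)loaded unconditionally. */
	load_host_context();
	host_nmi_handler();

	/* Restore.  If another NMI lands here, it re-enters nmi_entry(),
	 * skips the save, and we simply retry the restore afterwards,
	 * repeating until one restore completes undisturbed. */
	live = nmi_save;
	in_nmi = false;
	/* iretq back to the interrupted code. */
}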


Debug:
=====

We've added lots of debugging features to make it easier to debug.
hypervisor.S is loaded with print-to-serial code.  Be careful: the
output of hex numbers is backwards.  So if you do a PRINT_QUAD(%rax)
and %rax contains 0x12345, you will get 54321 out of the serial port.
It's just easier that way (code wise).  The macros with an 'S_' prefix
will save the regs they use on the stack, but that's not always good,
since most of the hypervisor code does not have a usable stack.
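
For the curious, the reason the digits come out reversed is that emitting
the lowest nibble first needs no buffer and no extra registers to hold
digits, which matters when there is no usable stack.  A small userspace C
model of the idea (serial_putc() is a stand-in for the real port write;
the real macros are in hypervisor.S):

#include <stdio.h>

static void serial_putc(char c)
{
	putchar(c);		/* stand-in for writing to the serial port */
}

static void print_quad_reversed(unsigned long val)
{
	static const char hex[] = "0123456789abcdef";

	/* Emit the lowest nibble first; 0x12345 comes out as "54321". */
	do {
		serial_putc(hex[val & 0xf]);
		val >>= 4;
	} while (val);
	serial_putc('\n');
}

int main(void)
{
	print_quad_reversed(0x12345);
	return 0;
}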

Page tables: there are functions in lguest_debug.c that allow dumping
either the guest page tables or the host page tables.

kill_guest(linfo) - is just like i386 kill_guest and takes the
lguest_guest_info pointer as input.

kill_guest_dump(vcpu) - when possible, use this vcpu version, since it
will also dump the guest's regs and a guest back trace to the host
printk, which can be really useful.


Well that's it!  We currently get to just before console_init in
init/main.c of the guest before we take a timer interrupt storm (guest
only, the host still runs fine).  This happens after we enable
interrupts, but we are working on that.  If you want to help, we would
love to accept patches!!!

So, now go ahead and play, but don't hurt the puppies!

-- Steve


* Re: [RFC/PATCH LGUEST X86_64 00/13] Lguest for the x86_64
From: Rusty Russell @ 2007-03-09  0:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Chris Wright, virtualization, Ingo Molnar

On Thu, 2007-03-08 at 12:38 -0500, Steven Rostedt wrote:
> So we map the hypervisor text into this area for both the host
> and the guest. The guest permissions for this area will obviously
> be restricted to DPL 0 only (guest runs in PL 3).
> 
> Now what about guest data?  As opposed to the i386 code, we don't put
> any data in hypervisor.S.  All data goes into a shared guest data
> structure called lguest_vcpu.  Each guest (and eventually, each guest
> cpu) will have its own lguest_vcpu, and this structure is mapped into
> the HV FIXMAP area at the same location for both the host and the guest.

Hi Steven!

	In anticipation of the x86-64 limitations, and after discussion with
Andi and Zach Amsden, I've converted 32-bit lguest to use read-only
pages for the switcher code, rather than segment limits.  I just ran
into it breaking on SMP hosts, otherwise patches would have been sent
yesterday.  But importantly, it brings us much closer together.

> As opposed to compiling a separate hypervisor.c blob, we build the
> hypervisor itself into the lg.o module.  We bracket it with start and
> end tags and align it so that it sits on its own page.

I'll take a look; I don't see a reason to be different here?

> TODO:
> =====
> 
> To prevent a guest from stealing all the host's memory pages, we can
> use these hashes to also limit the number of puds, pmds, and ptes.
> 
> If the page is not pinned (currently used), we can set up LRU lists,
> and find those pages that are somewhat stale, and free them.  This
> can be done safely since we have all the info we need to put them
> back if the guest needs them again.

This is the same issue with 32-bit (one main reason why it's root-only).
In my case it's not too hard to add a shrinker (it would drop PTE pages
out of the pagetable of any non-running guest, just needs locking), but
we also want to avoid pinning in guest (ie. userspace) pages: for this I
think we really want a per-mm callback when the swapper wants to kick
something out.

I imagine kvm will have the same or similar issues (they restrict their
pagetables to 256 pages per guest, which is simultaneously too many and
too few IMHO).

> cr3:
> ====
> 
> Right now we hold many more cr3/pgds than the i386 version does.  This
> is because we have the ability to implement page cleaning at a lower
> level, and this lets us limit the number of pages the guest can take
> from the host.

Not sure I follow this, but I'll read the code.

> Interrupts:
> ===========
> 
> When an interrupt goes off, we have tss->rsp0 pointing to the vcpu
> struct's regs field.  This way we push onto the vcpu struct the trapnum,
> error code, rip, cs, rflags, rsp and ss regs.  We also put the guest's
> general regs and cr3 into this field.  This is somewhat similar to the
> i386 way of doing things.
> 
> We then put back the host gdt, idt, tr and cr3 regs and jump back to
> the host.
> 
> We use the stack pointer to find the location of the vcpu struct.

This is now identical, from this description.  Great minds think alike
8)

> NMI:
> ====
> 
> NMI is a big PITA!!!!
> 
> I don't know how it works with i386 lguest, but this caused us loads of
> hell.  The nmi can go off at any time, and having interrupts disabled
> doesn't protect you from it.  So what to do about it!

We crash.  I have a patch which improves this to just ignore it (iret).
I tried to actually switch into the host and deliver the NMI, but since
qemu didn't seem to give NMIs at all, I spent a day toying with it on
crashing hardware before moving on to something else.  Plus the
hypervisor.S code was almost doubled for this crap.

Nested NMIs are, as you found too, particularly nasty.  I considered
actually calling the host NMI handler directly so it didn't iret back to
us...

> Debug:
> =====
> 
> We've added lots of debugging features to make it easier to debug.
> hypervisor.S is loaded with print-to-serial code.  Be careful: the
> output of hex numbers is backwards.  So if you do a PRINT_QUAD(%rax)
> and %rax contains 0x12345, you will get 54321 out of the serial port.
> It's just easier that way (code wise).  The macros with an 'S_' prefix
> will save the regs they use on the stack, but that's not always good,
> since most of the hypervisor code does not have a usable stack.

Heh, I simply used qemu, but this has more geek points 8)

> Well that's it!  We currently get to just before console_init in
> init/main.c of the guest before we take a timer interrupt storm (guest
> only, the host still runs fine).  This happens after we enable
> interrupts, but we are working on that.  If you want to help, we would
> love to accept patches!!!

Awesome, will give detailed feedback after reading patches!

Thanks!
Rusty.


* Re: [RFC/PATCH LGUEST X86_64 00/13] Lguest for the x86_64
From: Avi Kivity @ 2007-03-09 11:20 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Chris Wright, virtualization, Ingo Molnar

Rusty Russell wrote:
>> To prevent a guest from stealing all the host's memory pages, we can
>> use these hashes to also limit the number of puds, pmds, and ptes.
>>
>> If the page is not pinned (currently used), we can set up LRU lists,
>> and find those pages that are somewhat stale, and free them.  This
>> can be done safely since we have all the info we need to put them
>> back if the guest needs them again.
>>     
>
> This is the same issue with 32-bit (one main reason why it's root-only).
> In my case it's not too hard to add a shrinker (it would drop PTE pages
> out of the pagetable of any non-running guest, just needs locking), but
> we also want to avoid pinning in guest (ie. userspace) pages: for this I
> think we really want a per-mm callback when the swapper wants to kick
> something out.
>
> I imagine kvm will have the same or similar issues (they restrict their
> pagetables to 256 pages per guest, which is simultaneously too many and
> too few IMHO).
>   

We have similar issues, but they are easily fixed since at most four 
pages are pinned per vcpu (sixteen with Ingo's cr3 cache).  A per-mm 
swapper callback sounds great, especially when thinking about swapping 
regular guest pages, and even more in the context of nested page tables.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

