virtualization.lists.linux-foundation.org archive mirror
* [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV.
       [not found] <20070308162348.299676000@redhat.com>
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-09  3:52   ` Rusty Russell
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 02/13] hvvm export page utils Steven Rostedt
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (hvvm.patch)
OK, some explanation is needed here. The goal of lguest with paravirt
ops is to have one kernel that can be loaded both as a host and as a
guest. To do this, we need to map an area in virtual memory that both
the host and guest share, but without any conflicts with the guest.

One solution is to set up a single area at boot and then use vmalloc
to map it. But that gets quite complex, since we would need to force
the guest to map a given area after the fact, hoping that it didn't
already map something else there before we get to the code that maps
it. That can be done, but doing it this way is (for now) much easier.

What I've done here is make a large area in the FIXMAP region.
The guest will not use this area for anything, since it is reserved
solely for running an HV. So by making a FIXMAP area, we keep this
region reserved for HV use.  Now the host can load the hypervisor
text section into this area and force it mapped into the guest without
worrying that the guest will want to use this area for anything else.

Each guest will have its own shared data placed in this section too.
The guest will only get the hypervisor text and its own shared data
mapped into this area, but the host will map the hypervisor text
and all guest shared areas in this region.  And what makes this so
easy is that the virtual addresses of these locations are the same
in the host and in every guest!

To explain this a little better, here's what the virtual addresses
of the host and guests will look like:


    Host            Guest1          Guest2
 +-----------+   +-----------+  +-----------+
 |           |   |           |  |           |
 +-----------+   +-----------+  +-----------+
 | HV FIXMAP |   | HV FIXMAP |  | HV FIXMAP |
 |   TEXT    |   |   TEXT    |  |   TEXT    |
 +-----------+   +-----------+  +-----------+
 | GUEST 1   |   | GUEST 1   |  | UNMAPPED  |
 |SHARED DATA|   |SHARED DATA|  |           |
 +-----------+   +-----------+  +-----------+
 | GUEST 2   |   | UNMAPPED  |  | GUEST 2   |
 |SHARED DATA|   |           |  |SHARED DATA|
 +-----------+   |           |  +-----------+
 |           |   |           |  |           |



Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>


Index: work-pv/arch/x86_64/lguest/hv_vm.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/hv_vm.c
@@ -0,0 +1,367 @@
+/*
+ *  arch/x86_64/lguest/hv_vm.c
+ *
+ *  Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com>, Red Hat
+ *
+ *  Some of this code was influenced by mm/vmalloc.c
+ *
+ *  FIXME: This should not be limited to lguest, but should be put
+ *         into the kernel proper, since this code should be
+ *         HV agnostic.
+ *
+ *  The purpose of the HV VM area is to create a virtual address
+ *  space that can be set aside for passing information to and
+ *  from a guest.  A small hypervisor text section may be loaded
+ *  into this area and shared across all guests that use that
+ *  hypervisor.  Each guest may have a data page so that it can
+ *  communicate back and forth with the host.
+ *
+ *  The reason for this is to provide a virtual address range that
+ *  will not be used by the guest for any other purpose.  This
+ *  gives a nice place to map code that will communicate with the guest.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+
+#include <asm/hv_vm.h>
+
+static DEFINE_MUTEX(hvvm_lock);
+
+static DECLARE_BITMAP(hvvm_avail_pages, NR_HV_PAGES);
+
+
+static void hvvm_pte_unmap(pmd_t *pmd, unsigned long addr)
+{
+	pte_t *pte;
+	pte_t ptent;
+
+	pte = pte_offset_kernel(pmd, addr);
+	ptent = ptep_get_and_clear(&init_mm, addr, pte);
+	WARN_ON(!pte_none(ptent) && !pte_present(ptent));
+}
+
+static inline void hvvm_pmd_unmap(pud_t *pud, unsigned long addr)
+{
+	pmd_t *pmd;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none_or_clear_bad(pmd))
+		return;
+	hvvm_pte_unmap(pmd, addr);
+}
+
+static inline void hvvm_pud_unmap(pgd_t *pgd, unsigned long addr)
+{
+	pud_t *pud;
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none_or_clear_bad(pud))
+		return;
+	hvvm_pmd_unmap(pud, addr);
+}
+
+static void hvvm_unmap_page(unsigned long addr)
+{
+	pgd_t *pgd;
+
+	pgd = pgd_offset_k(addr);
+	hvvm_pud_unmap(pgd, addr);
+}
+
+static int hvvm_pte_alloc(pmd_t *pmd, unsigned long addr,
+			  unsigned long page, pgprot_t prot)
+{
+	pte_t *pte;
+
+	pte = pte_alloc_kernel(pmd, addr);
+	if (!pte)
+		return -ENOMEM;
+
+	WARN_ON(!pte_none(*pte));
+	set_pte_at(&init_mm, addr, pte,
+		   mk_pte(pfn_to_page(page >> PAGE_SHIFT), prot));
+
+	return 0;
+}
+
+static inline int hvvm_pmd_alloc(pud_t *pud, unsigned long addr,
+				 unsigned long page, pgprot_t prot)
+{
+	pmd_t *pmd;
+
+	pmd = pmd_alloc(&init_mm, pud, addr);
+	if (!pmd)
+		return -ENOMEM;
+	if (hvvm_pte_alloc(pmd, addr, page, prot))
+		return -ENOMEM;
+	return 0;
+}
+
+static inline int hvvm_pud_alloc(pgd_t *pgd, unsigned long addr,
+				 unsigned long page, pgprot_t prot)
+{
+	pud_t *pud;
+
+	pud = pud_alloc(&init_mm, pgd, addr);
+	if (!pud)
+		return -ENOMEM;
+	if (hvvm_pmd_alloc(pud, addr, page, prot))
+		return -ENOMEM;
+	return 0;
+}
+
+static int hvvm_alloc_page(unsigned long addr, unsigned long page, pgprot_t prot)
+{
+	pgd_t *pgd;
+	int err;
+
+	pgd = pgd_offset_k(addr);
+	err = hvvm_pud_alloc(pgd, addr, page, prot);
+	return err;
+}
+
+static unsigned long *get_vaddr(unsigned long paddr)
+{
+	paddr &= ~(0xfff);
+	return (unsigned long*)(paddr + PAGE_OFFSET);
+}
+
+unsigned long hvvm_get_actual_phys(void *addr, pgprot_t *prot)
+{
+	unsigned long vaddr;
+	unsigned long offset;
+	unsigned long cr3;
+	unsigned long pgd;
+	unsigned long pud;
+	unsigned long pmd;
+	unsigned long pte;
+	unsigned long mask;
+
+	unsigned long *p;
+
+	/*
+	 * Traverse the page tables to get the actual
+	 * physical address. I want this to work for
+	 * all addresses, regardless of where they are mapped.
+	 */
+
+	/* FIXME: Do this better!! */
+
+	/* grab the start of the page tables */
+	asm ("movq %%cr3, %0" : "=r"(cr3));
+
+	p = get_vaddr(cr3);
+
+	offset = (unsigned long)addr;
+	offset >>= PGDIR_SHIFT;
+	offset &= PTRS_PER_PGD-1;
+
+	pgd = p[offset];
+
+	if (!(pgd & 1))
+		return 0;
+
+	p = get_vaddr(pgd);
+
+	offset = (unsigned long)addr;
+	offset >>= PUD_SHIFT;
+	offset &= PTRS_PER_PUD-1;
+
+	pud = p[offset];
+
+	if (!(pud & 1))
+		return 0;
+
+	p = get_vaddr(pud);
+
+	offset = (unsigned long)addr;
+	offset >>= PMD_SHIFT;
+	offset &= PTRS_PER_PMD-1;
+
+	pmd = p[offset];
+
+	if (!(pmd & 1))
+		return 0;
+
+	/* Now check to see if we are 2M pages or 4K pages */
+	if (pmd & (1 << 7)) {
+		/* stop here, we are 2M pages */
+		pte = pmd;
+		mask = (1<<21)-1;
+		goto calc;
+	}
+
+	p = get_vaddr(pmd);
+
+	offset = (unsigned long)addr;
+	offset >>= PAGE_SHIFT;
+	offset &= PTRS_PER_PTE-1;
+
+	pte = p[offset];
+	mask = PAGE_SIZE-1;
+
+ calc:
+
+	if (!(pte & 1))
+		return 0;
+
+	vaddr = pte & ~(0xfff) & ~(1UL << 63);
+
+	if (prot)
+		pgprot_val(*prot) = pte & 0xfff;
+
+	offset = (unsigned long)addr;
+	offset &= mask;
+
+	vaddr += offset;
+	/* Potentially clear the nx bit */
+	vaddr &= ~(1UL << 63);
+
+	return vaddr;
+}
+
+static unsigned long alloc_hv_pages(int pages)
+{
+	unsigned int bit = 0;
+	unsigned int found_bit;
+	int i;
+
+	/* FIXME : ADD LOCKING!!! */
+
+	/*
+	 * Scan the available bitmask for free pages.
+	 * 0 - available : 1 - used
+	 */
+	do {
+		bit = find_next_zero_bit(hvvm_avail_pages, NR_HV_PAGES, bit);
+		if (bit >= NR_HV_PAGES)
+			return 0;
+
+		found_bit = bit;
+
+		for (i=1; i < pages; i++) {
+			bit++;
+			if (test_bit(bit, hvvm_avail_pages))
+				break;
+		}
+	} while (i < pages);
+
+	if (i < pages)
+		return 0;
+
+	/*
+	 * OK we found a location where we can map our pages
+	 * so now we set them used, and do the mapping.
+	 */
+	bit = found_bit;
+	for (i=0; i < pages; i++)
+		set_bit(bit++, hvvm_avail_pages);
+
+	return HVVM_START + found_bit * PAGE_SIZE;
+}
+
+static void release_hv_pages(unsigned long addr, int pages)
+{
+	unsigned int bit;
+	int i;
+
+	/* FIXME : ADD LOCKING!!! */
+
+	bit = (addr - HVVM_START) / PAGE_SIZE;
+
+	for (i=0; i < pages; i++) {
+		BUG_ON(!test_bit(bit, hvvm_avail_pages));
+		clear_bit(bit++, hvvm_avail_pages);
+	}
+}
+
+
+/**
+ *	hvvm_map_pages - map an address to the HV VM area.
+ *	@vaddr:		virtual address to map
+ *	@pages:		Number of pages from that virtual address to map.
+ *	@hvaddr:	address returned that holds the mapping.
+ *
+ *	This function maps the pages represented by @vaddr into
+ *	the HV VM area and stores the chosen address in @hvaddr.
+ *	Returns 0 on success or a negative errno on failure.
+ */
+int hvvm_map_pages(void *vaddr, int pages, unsigned long *hvaddr)
+{
+	unsigned long paddr;
+	unsigned long addr;
+	pgprot_t prot;
+	int i;
+	int ret;
+
+	if ((unsigned long)vaddr & (PAGE_SIZE - 1)) {
+		printk("bad vaddr for hv mapping (%p)\n",
+		       vaddr);
+		return -EINVAL;
+	}
+
+	/*
+	 * First we need to find a place to allocate.
+	 */
+	/* FIXME - ADD LOCKING!!! */
+	addr = alloc_hv_pages(pages);
+	*hvaddr = addr;
+	printk("addr=%lx\n", addr);
+	if (!addr)
+		return -ENOMEM;
+
+	ret = -ENOMEM;
+
+	for (i=0; i < pages; i++, vaddr += PAGE_SIZE, addr += PAGE_SIZE) {
+		paddr = hvvm_get_actual_phys(vaddr, &prot);
+		printk("%d: paddr=%lx\n", i, paddr);
+		if (!paddr)
+			goto out;
+		ret = hvvm_alloc_page(addr, paddr, prot);
+		printk("%d: ret=%d addr=%lx\n", i, ret, addr);
+		if (ret < 0)
+			goto out;
+	}
+
+	addr = *hvaddr;
+	vaddr -= PAGE_SIZE * pages;
+	printk("vaddr=%p (%lx)\naddr=%p (%lx)\n",
+	       vaddr, *(unsigned long*)vaddr,
+	       (void*)addr, *(unsigned long*)addr);
+
+	return 0;
+out:
+	for (--i; i >=0; i--) {
+		addr -= PAGE_SIZE;
+		hvvm_unmap_page(addr);
+	}
+
+	release_hv_pages(addr, pages);
+	return ret;
+}
+
+void hvvm_unmap_pages(unsigned long addr, int pages)
+{
+	int i;
+
+	release_hv_pages(addr, pages);
+	for (i=0; i < pages; i++, addr += PAGE_SIZE)
+		hvvm_unmap_page(addr);
+}
+
+void hvvm_release_all(void)
+{
+	int bit;
+	unsigned long vaddr = HVVM_START;
+
+	for (bit=0; bit < NR_HV_PAGES; bit++, vaddr += PAGE_SIZE)
+		if (test_bit(bit, hvvm_avail_pages)) {
+			hvvm_unmap_page(vaddr);
+			clear_bit(bit, hvvm_avail_pages);
+		}
+}
Index: work-pv/include/asm-x86_64/fixmap.h
===================================================================
--- work-pv.orig/include/asm-x86_64/fixmap.h
+++ work-pv/include/asm-x86_64/fixmap.h
@@ -16,6 +16,7 @@
 #include <asm/page.h>
 #include <asm/vsyscall.h>
 #include <asm/vsyscall32.h>
+#include <asm/hv_vm.h>
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -40,6 +41,8 @@ enum fixed_addresses {
 	FIX_APIC_BASE,	/* local (CPU) APIC) -- required for SMP or not */
 	FIX_IO_APIC_BASE_0,
 	FIX_IO_APIC_BASE_END = FIX_IO_APIC_BASE_0 + MAX_IO_APICS-1,
+	FIX_HV_BASE,
+	FIX_HV_BASE_END = FIX_HV_BASE + HV_VIRT_SIZE - 1,
 	__end_of_fixed_addresses
 };
 
Index: work-pv/include/asm-x86_64/hv_vm.h
===================================================================
--- /dev/null
+++ work-pv/include/asm-x86_64/hv_vm.h
@@ -0,0 +1,13 @@
+#ifndef _LINUX_HV_VM
+#define _LINUX_HV_VM
+
+#define NR_HV_PAGES  256 /* meg? */
+#define HV_VIRT_SIZE (NR_HV_PAGES << PAGE_SHIFT)
+
+#define HVVM_START (__fix_to_virt(FIX_HV_BASE_END))
+
+int hvvm_map_pages(void *vaddr, int pages, unsigned long *hvaddr);
+void hvvm_unmap_pages(unsigned long addr, int pages);
+void hvvm_release_all(void);
+
+#endif

--

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH LGUEST X86_64 02/13] hvvm export page utils
       [not found] <20070308162348.299676000@redhat.com>
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV Steven Rostedt
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 03/13] lguest64 core Steven Rostedt
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (hvvm-mm-export.patch)
I would like to have the HV VM code in the kernel proper, but until
then it needs to get at some of the page table utilities (pud_alloc
and friends). So, for use from a module, we export them.

I can probably change HV VM to just be compiled into the kernel like
some of the other lguest stuff, too.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>

Index: work-pv/mm/memory.c
===================================================================
--- work-pv.orig/mm/memory.c
+++ work-pv/mm/memory.c
@@ -2798,3 +2798,10 @@ int access_process_vm(struct task_struct
 	return buf - old_buf;
 }
 EXPORT_SYMBOL_GPL(access_process_vm);
+
+/* temp until we put the hv vm stuff into the kernel */
+EXPORT_SYMBOL_GPL(__pud_alloc);
+EXPORT_SYMBOL_GPL(__pmd_alloc);
+EXPORT_SYMBOL_GPL(__pte_alloc_kernel);
+EXPORT_SYMBOL_GPL(pmd_clear_bad);
+EXPORT_SYMBOL_GPL(pud_clear_bad);

--


* [RFC/PATCH LGUEST X86_64 03/13] lguest64 core
       [not found] <20070308162348.299676000@redhat.com>
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV Steven Rostedt
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 02/13] hvvm export page utils Steven Rostedt
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-09  4:10   ` Rusty Russell
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 04/13] Useful debugging Steven Rostedt
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64.patch)
This is the main core code for the lguest64.

Have fun, and don't hurt the puppies!

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/arch/x86_64/lguest/Makefile
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/Makefile
@@ -0,0 +1,24 @@
+# Guest requires the paravirt_ops replacement and the bus driver.
+obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_bus.o
+
+# Host requires the other files, which can be a module.
+obj-$(CONFIG_LGUEST)	+= lg.o
+lg-objs := core.o hypervisor.o lguest_user.o hv_vm.o page_tables.o \
+hypercalls.o io.o interrupts_and_traps.o lguest_debug.o
+
+# hypercalls.o page_tables.o interrupts_and_traps.o \
+#	segments.o io.o lguest_user.o
+
+# We use top 4MB for guest traps page, then hypervisor.
+HYPE_ADDR := (0xFFC00000+4096)
+# The data is only 1k (256 interrupt handler pointers)
+HYPE_DATA_SIZE := 1024
+CFLAGS += -DHYPE_ADDR="$(HYPE_ADDR)" -DHYPE_DATA_SIZE="$(HYPE_DATA_SIZE)"
+
+##$(obj)/core.o: $(obj)/hypervisor-blob.c
+### This links the hypervisor in the right place and turns it into a C array.
+##$(obj)/hypervisor-raw: $(obj)/hypervisor.o
+##	@$(LD) -static -Tdata=`printf %#x $$(($(HYPE_ADDR)))` -Ttext=`printf %#x $$(($(HYPE_ADDR)+$(HYPE_DATA_SIZE)))` -o $@ $< && $(OBJCOPY) -O binary $@
+##$(obj)/hypervisor-blob.c: $(obj)/hypervisor-raw
+##	@od -tx1 -An -v $< | sed -e 's/^ /0x/' -e 's/$$/,/' -e 's/ /,0x/g' > $@
+
Index: work-pv/arch/x86_64/lguest/core.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/core.c
@@ -0,0 +1,379 @@
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/kallsyms.h>
+#include <asm/paravirt.h>
+#include <asm/hv_vm.h>
+#include <asm/uaccess.h>
+#include <asm/i387.h>
+#include "lguest.h"
+
+#define HV_OFFSET(x) (typeof(x))((unsigned long)(x)+lguest_hv_offset)
+
+unsigned long lguest_hv_addr;
+unsigned long lguest_hv_offset;
+int lguest_hv_pages;
+
+int lguest_vcpu_pages;
+int lguest_vcpu_order;
+
+DEFINE_MUTEX(lguest_lock);
+
+int lguest_address_ok(const struct lguest_guest_info *linfo, u64 addr)
+{
+	return addr / PAGE_SIZE < linfo->pfn_limit;
+}
+
+u8 lhread_u8(struct lguest_vcpu *vcpu, u64 addr)
+{
+	u8 val = 0;
+
+	if (!lguest_address_ok(vcpu->guest, addr)
+	    || get_user(val, (u8 __user *)addr) != 0)
+			kill_guest_dump(vcpu, "bad read address %llx", addr);
+	return val;
+}
+
+u16 lhread_u16(struct lguest_vcpu *vcpu, u64 addr)
+{
+	u16 val = 0;
+
+	if (!lguest_address_ok(vcpu->guest, addr)
+	    || get_user(val, (u16 __user *)addr) != 0)
+			kill_guest_dump(vcpu, "bad read address %llx", addr);
+	return val;
+}
+
+u64 lhread_u64(struct lguest_vcpu *vcpu, u64 addr)
+{
+	u64 val = 0;
+
+	if (!lguest_address_ok(vcpu->guest, addr)
+	    || get_user(val, (u64 __user *)addr) != 0)
+			kill_guest_dump(vcpu, "bad read address %llx", addr);
+	return val;
+}
+
+void lhwrite_u64(struct lguest_vcpu *vcpu, u64 addr, u64 val)
+{
+	if (!lguest_address_ok(vcpu->guest, addr)
+	    || put_user(val, (u64 __user *)addr) != 0)
+			kill_guest_dump(vcpu, "bad write address %llx", addr);
+}
+
+void lhread(struct lguest_guest_info *linfo, void *b, u64 addr, unsigned bytes)
+{
+	if (addr + bytes < addr || !lguest_address_ok(linfo, addr+bytes)
+	   || copy_from_user(b, (void __user *)addr, bytes) != 0) {
+		/* copy_from_user should do this, but as we rely on it... */
+		memset(b, 0, bytes);
+		kill_guest(linfo, "bad read address %llx len %u", addr, bytes);
+	}
+}
+
+void lhwrite(struct lguest_guest_info *linfo, u64 addr, const void *b,
+								unsigned bytes)
+{
+	if (addr + bytes < addr
+	   || !lguest_address_ok(linfo, addr+bytes)
+	   || copy_to_user((void __user *)addr, b, bytes) != 0)
+		kill_guest(linfo, "bad write address %llx len %u", addr, bytes);
+}
+
+static struct gate_struct *get_idt_table(void)
+{
+	struct desc_ptr idt;
+
+	asm("sidt %0":"=m" (idt));
+	return (void *)idt.address;
+}
+
+static int emulate_insn(struct lguest_vcpu *vcpu)
+{
+	u8 insn;
+	unsigned int insnlen = 0, in = 0, shift = 0;
+	unsigned long physaddr = guest_pa(vcpu->guest, vcpu->regs.rip);
+
+	if (vcpu->regs.rip < vcpu->guest->page_offset)
+		return 0;
+
+	lhread(vcpu->guest, &insn, physaddr, 1);
+
+	/* Operand size prefix means it's actually for ax. */
+	if (insn == 0x66) {
+		shift = 16;
+		insnlen = 1;
+		printk("physaddr + len: %lx\n",physaddr+insnlen);
+		lhread(vcpu->guest, &insn, physaddr + insnlen, 1);
+	}
+
+	switch (insn & 0xFE) {
+	case 0xE4: /* in     <next byte>,%al */
+		insnlen += 2;
+		in = 1;
+		break;
+	case 0xEC: /* in     (%dx),%al */
+		insnlen += 1;
+		in = 1;
+		break;
+	case 0xE6: /* out    %al,<next byte> */
+		insnlen += 2;
+		break;
+	case 0xEE: /* out    %al,(%dx) */
+		insnlen += 1;
+		break;
+	default:
+		printk("%llx: %02x unimplemented op\n", vcpu->regs.rip, insn);
+		kill_guest_dump(vcpu, "bad op");
+		return 0;
+	}
+	if (in) {
+		/* The low bit tells us whether it's a 16 or 32 bit access */
+		if (insn & 0x1)
+			vcpu->regs.rax = 0xFFFFFFFF;
+		else
+			vcpu->regs.rax |= (0xFFFF << shift);
+	}
+	vcpu->regs.rip += insnlen;
+	return 1;
+}
+
+#define SAVE_CR2(cr2)	asm volatile ("movq %%cr2, %0" : "=r" (cr2))
+
+static void run_guest_once(struct lguest_vcpu *vcpu)
+{
+	void (*sw_guest)(struct lguest_vcpu *) = HV_OFFSET(&switch_to_guest);
+	unsigned long foo, bar;
+
+	BUG_ON(!vcpu->regs.cr3);
+	BUG_ON(!vcpu->pgdir);
+	BUG_ON(!vcpu->pgdir->pgdir);
+	asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
+		      /* The stack we pushed is off by 8, due to the previous pushq */
+		      "addq $8, %%rsp"
+		      : "=D"(foo), "=a"(bar)
+		      : "i" (__KERNEL_DS), "i" (__KERNEL_CS), "0" (vcpu), "1"(get_idt_table()),
+			"r" (sw_guest)
+		      : "memory", "cc");
+}
+
+/* FIXME: don't know yet the right parameters to put here */
+int run_guest(struct lguest_vcpu *vcpu, char *__user user)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct desc_struct *gdt_table;
+	struct lguest_regs *regs = &vcpu->regs;
+	int ret;
+
+	unsigned long cr2 = 0;
+
+	while (!linfo->dead) {
+
+		if (regs->trapnum == LGUEST_TRAP_ENTRY) {
+
+			if (lguest_debug) {
+				printk("hit trap %lld rip=", regs->trapnum);
+				lguest_print_address(vcpu, regs->rip);
+				printk("calling hypercall %d!\n", (unsigned)regs->rax);
+			}
+
+			regs->trapnum = 255;
+			hypercall(vcpu);
+			if (linfo->dead)
+				lguest_dump_vcpu_regs(vcpu);
+		}
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		maybe_do_interrupt(vcpu);
+
+		try_to_freeze();
+
+		if (linfo->dead)
+			return -1;
+
+
+		local_irq_disable();
+
+		/*
+		 * keep a pointer to the host GDT tss address.
+		 * Do this after disabling interrupts to make sure we
+		 * are on the same CPU.
+		 */
+		gdt_table = cpu_gdt(smp_processor_id());
+		vcpu->host_gdt_ptr = (unsigned long)gdt_table;
+		asm volatile ("sidt %0" : "=m"(vcpu->host_idt));
+
+		/* Even if *we* don't want FPU trap, guest might... */
+		if (vcpu->ts)
+			stts();
+
+		run_guest_once(vcpu);
+
+		if (regs->trapnum == 14) {
+			SAVE_CR2(cr2);
+			lgdebug_print("faulting cr2: %lx\n",cr2);
+		}
+
+		else if (regs->trapnum == 7)
+			math_state_restore();
+
+		if (lguest_debug && regs->trapnum < 32) {
+			printk("hit trap %lld rip=", regs->trapnum);
+			lguest_print_address(vcpu, regs->rip);
+		}
+
+		local_irq_enable();
+
+		BUG_ON(regs->trapnum > 0xFF);
+
+		switch (regs->trapnum) {
+		case 7:
+			/* We've intercepted a Device Not Available fault. */
+			/* If they don't want to know, just absorb it. */
+			if (!vcpu->ts)
+				continue;
+			if (reflect_trap(vcpu, 7, 1))
+				continue;
+			kill_guest(vcpu->guest, "Unhandled FPU trap at %#llx",
+								regs->rip);
+		case 13:
+			if (!regs->errcode) {
+				ret = emulate_insn(vcpu);
+				if (ret < 0) {
+					lguest_dump_vcpu_regs(vcpu);
+					return ret;
+				}
+				continue;
+			}
+			kill_guest_dump(vcpu, "took gfp errcode %lld\n", regs->errcode);
+			lguest_dump_vcpu_regs(vcpu);
+			break;
+		case 14:
+			if (demand_page(vcpu, cr2, regs->errcode & PF_WRITE))
+				continue;
+
+			if (lguest_debug) {
+				printk ("guest taking a page fault\n");
+				lguest_print_page_tables(vcpu->pgdir->pgdir);
+			}
+
+			/* inform guest on the current state of cr2 */
+			put_user(cr2, &linfo->lguest_data->cr2);
+			if (reflect_trap(vcpu, 14, 1))
+				continue;
+
+			lguest_dump_vcpu_regs(vcpu);
+			kill_guest_dump(vcpu, "unhandled page fault at %#lx"
+					" (rip=%#llx, errcode=%#llx)",
+					cr2, regs->rip, regs->errcode);
+			break;
+		case LGUEST_TRAP_ENTRY:
+			/* hypercall! */
+			continue;
+
+		case 32 ... 255:
+			cond_resched();
+			break;
+		default:
+			kill_guest_dump(vcpu, "bad trapnum %lld\n", regs->trapnum);
+			lguest_dump_vcpu_regs(vcpu);
+			return -EINVAL;
+		}
+	}
+	return -ENOENT;
+}
+
+extern long end_hyper_text;
+extern long start_hyper_text;
+
+static int __init init(void)
+{
+	unsigned long pages;
+	unsigned long hvaddr;
+#if 0
+	unsigned long lg_hcall = (unsigned long)HV_OFFSET(&hcall_teste);
+	unsigned long *lg_host_syscall =
+				(unsigned long *)HV_OFFSET(&host_syscall);
+#endif
+	int order;
+	int ret;
+
+	int i;
+	printk("start_hyper_text=%p\n",&start_hyper_text);
+	printk("end_hyper_text=%p\n",&end_hyper_text);
+	printk("default_idt_entries=%p\n",&_lguest_default_idt_entries);
+	printk("sizeof(vcpu)=%ld\n",sizeof(struct lguest_vcpu));
+
+	pages = (sizeof(struct lguest_vcpu)+(PAGE_SIZE-1))/PAGE_SIZE;
+	for (order = 0; (1<<order) < pages; order++)
+		;
+
+	lguest_vcpu_pages = pages;
+	lguest_vcpu_order = order;
+
+	ret = paravirt_enabled();
+	if (ret < 0)
+		return -EPERM;
+
+	ret = lguest_device_init();
+	if (ret < 0) {
+		return ret;
+	}
+
+	pages = (unsigned long)&end_hyper_text -
+		(unsigned long)&start_hyper_text;
+	pages = (pages + (PAGE_SIZE - 1)) / PAGE_SIZE;
+
+	ret = hvvm_map_pages(&start_hyper_text, pages, &hvaddr);
+	if (ret < 0)
+		goto out;
+	printk("hvaddr=%lx\n",hvaddr);
+
+	lguest_hv_addr = hvaddr;
+	lguest_hv_pages = pages;
+	lguest_hv_offset = hvaddr - (unsigned long)&start_hyper_text;
+
+	/* Setup LGUEST segments on all cpus */
+	for_each_possible_cpu(i) {
+		struct desc_struct *gdt_table;
+		gdt_table = cpu_gdt(i);
+		gdt_table[GDT_ENTRY_HV_CS] = gdt_table[gdt_index(__KERNEL_CS)];
+		gdt_table[GDT_ENTRY_HV_DS] = gdt_table[gdt_index(__KERNEL_DS)];
+	}
+
+//	rdmsrl(MSR_LSTAR, *lg_host_syscall);
+//	wrmsrl(MSR_LSTAR, lg_hcall);
+	return 0;
+#if 0
+	ret = init_pagetables(hvaddr);
+	if (ret < 0)
+		goto out2;
+
+	return 0;
+
+out2:
+	hvvm_unmap_pages(hvaddr, pages);
+#endif
+out:
+	lguest_device_remove();
+	return ret;
+}
+
+
+static void __exit fini(void)
+{
+#if 0
+	unsigned long *lg_host_syscall =
+			(unsigned long *)HV_OFFSET(&host_syscall);
+
+	wrmsrl(MSR_LSTAR, *lg_host_syscall);
+#endif
+	hvvm_release_all();
+	lguest_device_remove();
+}
+
+module_init(init);
+module_exit(fini);
+MODULE_LICENSE("GPL");
Index: work-pv/arch/x86_64/lguest/hypercalls.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/hypercalls.c
@@ -0,0 +1,324 @@
+/*  Actual hypercalls, which allow guests to actually do something.
+    Copyright (C) 2007, Glauber de Oliveira Costa <gcosta@redhat.com>
+                        Steven Rostedt <srostedt@redhat.com>
+                        Red Hat Inc
+    Standing on the shoulders of Rusty Russell.
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+*/
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/mm.h>
+#include <asm/lguest.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/msr.h>
+#include "lguest.h"
+
+/* FIXME: add this to Kconfig */
+#define CONFIG_LGUEST_DEBUG 1
+
+static void guest_set_stack(struct lguest_vcpu *vcpu,
+			    u64 rsp, unsigned int pages)
+{
+	/* You cannot have a stack segment with priv level 0. */
+	if (pages > 2)
+		kill_guest_dump(vcpu, "bad stack pages %u", pages);
+	vcpu->tss.rsp2 = rsp;
+	/* FIXME */
+//	lg->stack_pages = pages;
+//	pin_stack_pages(lg);
+}
+
+static DEFINE_MUTEX(hcall_print_lock);
+#define HCALL_PRINT_SIZ 1024
+static char hcall_print_buf[HCALL_PRINT_SIZ];
+
+/* Return true if DMA to host userspace now pending. */
+static int do_hcall(struct lguest_vcpu *vcpu)
+{
+	struct lguest_regs *regs = &vcpu->regs;
+	struct lguest_guest_info *linfo = vcpu->guest;
+	unsigned long val;
+	unsigned long ret;
+
+	switch (regs->rax) {
+	case LHCALL_PRINT:
+		mutex_lock(&hcall_print_lock);
+		ret = strncpy_from_user(hcall_print_buf,
+					(const char __user *)regs->rdx,
+					HCALL_PRINT_SIZ);
+		if (ret < 0) {
+			kill_guest_dump(vcpu,
+					"bad hcall print pointer (%llx)",
+					regs->rdx);
+			mutex_unlock(&hcall_print_lock);
+			return -EFAULT;
+		}
+		printk("LGUEST: %s", hcall_print_buf);
+		mutex_unlock(&hcall_print_lock);
+
+		break;
+	case LHCALL_FLUSH_ASYNC:
+		break;
+	case LHCALL_LGUEST_INIT:
+		kill_guest_dump(vcpu, "already have lguest_data");
+		break;
+	case LHCALL_RDMSR:
+		switch (regs->rdx) {
+		case MSR_KERNEL_GS_BASE:
+			val = (vcpu->guest_gs_shadow_a & ((1UL << 32)-1)) |
+				(vcpu->guest_gs_shadow_d << 32);
+			lhwrite_u64(vcpu, regs->rbx, val);
+			break;
+		case MSR_GS_BASE:
+			val = (vcpu->guest_gs_a & ((1UL << 32)-1)) |
+				(vcpu->guest_gs_d << 32);
+			lhwrite_u64(vcpu, regs->rbx, val);
+		break;
+		case MSR_FS_BASE:
+			lhwrite_u64(vcpu, regs->rbx, 0);
+		break;
+		case MSR_EFER:
+			val = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+			lhwrite_u64(vcpu, regs->rbx, val);
+		break;
+		default:
+			kill_guest_dump(vcpu, "bad read of msr %llx\n", regs->rdx);
+		}
+		break;
+	case LHCALL_WRMSR:
+		switch (regs->rdx) {
+		case MSR_KERNEL_GS_BASE:
+			if ((regs->rbx >= HVVM_START) &&
+			    (regs->rbx < (HVVM_START + HV_VIRT_SIZE))) {
+				kill_guest_dump(vcpu,
+						"guest trying to set GS shadow base"
+						" in hypervisor");
+				break;
+			}
+			vcpu->guest_gs_shadow_a = regs->rbx;
+			vcpu->guest_gs_shadow_d = regs->rbx >> 32;
+		break;
+		case MSR_GS_BASE:
+			if ((regs->rbx >= HVVM_START) &&
+			    (regs->rbx < (HVVM_START + HV_VIRT_SIZE))) {
+				kill_guest_dump(vcpu,
+						"guest trying to set GS base in hypervisor");
+				break;
+			}
+			vcpu->guest_gs_a = regs->rbx;
+			vcpu->guest_gs_d = regs->rbx >> 32;
+		break;
+		case MSR_FS_BASE:
+			/* always zero */
+		break;
+		default:
+			kill_guest(linfo, "bad write to msr %llx\n", regs->rdx);
+		}
+		break;
+	case LHCALL_SET_PMD:
+		guest_set_pmd(vcpu, regs->rdx, regs->rbx, regs->rcx);
+		break;
+	case LHCALL_SET_PUD:
+		guest_set_pud(vcpu, regs->rdx, regs->rbx, regs->rcx);
+		break;
+	case LHCALL_SET_PGD:
+		guest_set_pgd(vcpu, regs->rdx, regs->rbx, regs->rcx);
+		break;
+	case LHCALL_SET_PTE:
+		guest_set_pte(vcpu, regs->rdx, regs->rbx, regs->rcx);
+		break;
+
+	case LHCALL_FLUSH_TLB_SIG:
+		guest_flush_tlb_single(vcpu, regs->rdx, regs->rbx);
+		break;
+	case LHCALL_FLUSH_TLB:
+		if (regs->rdx)
+			guest_pagetable_clear_all(vcpu);
+		else
+			guest_pagetable_flush_user(vcpu);
+		break;
+
+	case LHCALL_NEW_PGTABLE:
+		guest_new_pagetable(vcpu, regs->rdx);
+		break;
+
+	case LHCALL_CRASH: {
+		char msg[128];
+		lhread(linfo, msg, regs->rdx, sizeof(msg));
+		msg[sizeof(msg)-1] = '\0';
+		kill_guest_dump(vcpu, "CRASH: %s", msg);
+		break;
+	}
+	case LHCALL_LOAD_GDT:
+		/* i386 does a lot of gdt reloads.  We don't.
+		 * We may want to support it in the future for more
+		 * unusual code paths, but not now. */
+		return -ENOSYS;
+
+	case LHCALL_LOAD_IDT_ENTRY: {
+		struct gate_struct g;
+		if (regs->rdx > 0xFF) {
+			kill_guest(linfo, "There are only 256 idt entries. "
+					"What are you trying to do??");
+		}
+		lhread(linfo, &g, regs->rbx, sizeof(g));
+		load_guest_idt_entry(vcpu, regs->rdx,&g);
+		break;
+	}
+	case LHCALL_SET_STACK:
+		guest_set_stack(vcpu, regs->rdx, regs->rbx);
+		break;
+	case LHCALL_TS:
+		vcpu->ts = regs->rdx;
+		break;
+	case LHCALL_TIMER_READ: {
+		u32 now = jiffies;
+		mb();
+		regs->rax = now - linfo->last_timer;
+		linfo->last_timer = now;
+		break;
+	}
+	case LHCALL_TIMER_START:
+		linfo->timer_on = 1;
+		if (regs->rdx != HZ)
+			kill_guest(linfo, "Bad clock speed %lli", regs->rdx);
+		linfo->last_timer = jiffies;
+		break;
+	case LHCALL_HALT:
+		linfo->halted = 1;
+		break;
+	case LHCALL_GET_WALLCLOCK: {
+		struct timeval tv;
+		do_gettimeofday(&tv);
+		regs->rax = tv.tv_sec;
+		break;
+	}
+	case LHCALL_BIND_DMA:
+		printk("Binding dma....\n");
+		regs->rax = bind_dma(linfo, regs->rdx, regs->rbx,
+				     regs->rcx >> 8, regs->rcx & 0xFF);
+		break;
+	case LHCALL_SEND_DMA:
+		printk("Sending dma....\n");
+		return send_dma(linfo, regs->rdx, regs->rbx);
+
+	case LHCALL_IRET:
+		guest_iret(vcpu);
+		break;
+#if 0
+	case LHCALL_LOAD_TLS:
+		guest_load_tls(lg, (struct desc_struct __user*)regs->rdx);
+		break;
+#endif
+
+	case LHCALL_DEBUG_ME:
+#ifdef CONFIG_LGUEST_DEBUG
+		lguest_debug = regs->rdx;
+		printk("lguest debug turned %s\n", regs->rdx ? "on" : "off");
+		lguest_dump_vcpu_regs(vcpu);
+#else
+		{
+			static int once = 1;
+			if (once) {
+				once = 0;
+				printk("lguest debug is disabled, to use this "
+				       "please enable CONFIG_LGUEST_DEBUG\n");
+			}
+		}
+#endif
+		break;
+	default:
+		kill_guest(linfo, "Bad hypercall %lli\n", regs->rax);
+	}
+	return 0;
+}
+
+#if 0
+/* We always do queued calls before actual hypercall. */
+int do_async_hcalls(struct lguest *lg)
+{
+	unsigned int i, pending;
+	u8 st[LHCALL_RING_SIZE];
+
+	if (!lg->lguest_data)
+		return 0;
+
+	if (copy_from_user(&st, &lg->lguest_data->hcall_status, sizeof(st)))
+		return -EFAULT;
+
+	for (i = 0; i < ARRAY_SIZE(st); i++) {
+		struct lguest_regs regs;
+		unsigned int n = lg->next_hcall;
+
+		if (st[n] == 0xFF)
+			break;
+
+		if (++lg->next_hcall == LHCALL_RING_SIZE)
+			lg->next_hcall = 0;
+
+		get_user(regs.rax, &lg->lguest_data->hcalls[n].eax);
+		get_user(regs.rdx, &lg->lguest_data->hcalls[n].edx);
+		get_user(regs.rcx, &lg->lguest_data->hcalls[n].ecx);
+		get_user(regs.rbx, &lg->lguest_data->hcalls[n].ebx);
+		pending = do_hcall(lg, &regs);
+		put_user(0xFF, &lg->lguest_data->hcall_status[n]);
+		if (pending)
+			return 1;
+	}
+
+	set_wakeup_process(lg, NULL);
+	return 0;
+}
+#endif
+
+int hypercall(struct lguest_vcpu *vcpu)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_regs *regs = &vcpu->regs;
+	int pending;
+
+	if (!linfo->lguest_data) {
+		if (regs->rax != LHCALL_LGUEST_INIT) {
+			kill_guest(linfo, "hypercall %lli before LGUEST_INIT",
+				   regs->rax);
+			return 0;
+		}
+
+		linfo->lguest_data = (struct lguest_data __user *)regs->rdx;
+		/* We check here so we can simply copy_to_user/from_user */
+		if (!lguest_address_ok(linfo, (long)linfo->lguest_data)
+		    || !lguest_address_ok(linfo, (long)(linfo->lguest_data+1))){
+			kill_guest(linfo, "bad guest page %p", linfo->lguest_data);
+			return 0;
+		}
+		/* update the page_offset info */
+		get_user(linfo->page_offset, &linfo->lguest_data->page_offset);
+		get_user(linfo->start_kernel_map, &linfo->lguest_data->start_kernel_map);
+
+#if 0
+		get_user(linfo->noirq_start, &linfo->lguest_data->noirq_start);
+		get_user(linfo->noirq_end, &linfo->lguest_data->noirq_end);
+#endif
+		/* We reserve the top pgd entry. */
+		put_user(4U*1024*1024, &linfo->lguest_data->reserve_mem);
+		put_user(linfo->guest_id, &linfo->lguest_data->guest_id);
+		return 0;
+	}
+	pending = do_hcall(vcpu);
+	//set_wakeup_process(vcpu, NULL);
+	return pending;
+}
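The LGUEST_INIT gate above can be modelled in a few lines of plain C, which may help when reviewing the handshake: until the guest registers its `lguest_data` page, any other hypercall kills it. This is a user-space sketch only; `model_hypercall`, `struct guest`, and the call numbers are hypothetical stand-ins for the real `hypercall()`/`kill_guest()` machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call numbers, for illustration only. */
enum { LHCALL_LGUEST_INIT = 0, LHCALL_TS = 1 };

struct guest {
	void *lguest_data;	/* NULL until the guest announces itself */
	int killed;
};

/* Mirrors the gate in hypercall(): any call before LGUEST_INIT
 * kills the guest; LGUEST_INIT records the shared data page. */
static int model_hypercall(struct guest *g, int call, void *arg)
{
	if (!g->lguest_data) {
		if (call != LHCALL_LGUEST_INIT) {
			g->killed = 1;	/* kill_guest(...) in the real code */
			return -1;
		}
		g->lguest_data = arg;	/* handshake: remember the page */
		return 0;
	}
	return 0;			/* would dispatch via do_hcall() */
}
```

The real code additionally validates the page with `lguest_address_ok()` before trusting it; the sketch skips that.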
Index: work-pv/arch/x86_64/lguest/hypervisor.S
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/hypervisor.S
@@ -0,0 +1,711 @@
+#include <asm/asm-offsets.h>
+#include <asm/page.h>
+#include <asm/msr.h>
+#include <asm/segment.h>
+#include "lguest.h"
+
+.text
+.align PAGE_SIZE
+
+.global start_hyper_text
+	.type start_hyper_text, @function
+start_hyper_text:
+
+.global	host_syscall
+host_syscall:
+	.quad 0
+
+#define PRINT_L(L)				\
+        PRINT_OUT($L)
+
+#define PRINT_N(n)				\
+        PRINT_OUT($'0' + $n)
+
+#define PRINT_HEX(n)				\
+	mov     n, %cl;				\
+	and     $0xf, %cl;			\
+	cmp     $0xa, %cl;			\
+	jge     11f;				\
+	add     $'0', %cl;			\
+	jmp     12f;				\
+11:	add     $('a' - 10), %cl;               \
+12:	PRINT_OUT(%cl);
+
+#define PRINT_NUM_BX				\
+9:	PRINT_HEX(%bl);				\
+	shr     $4, %rbx;			\
+	jne     9b
+
+#define PRINT_NUM(n)				\
+	movl    $n, %ebx;			\
+	PRINT_NUM_BX;				\
+	PRINT_L('\n');				\
+	PRINT_L('\r')
+
+#define PRINT_LONG(n)				\
+	movl    n, %ebx;			\
+	PRINT_NUM_BX;				\
+	PRINT_L('\n');				\
+	PRINT_L('\r')
+
+#define PRINT_QUAD(n)				\
+	movq    n, %rbx;			\
+	PRINT_NUM_BX;				\
+	PRINT_L('\n');				\
+	PRINT_L('\r')
+
+#define PRINT_X					\
+	PRINT_L('x')
+
+#define PRINT_OUT(x)				\
+	mov $0x3f8, %esi;			\
+21:	lea  0x5(%esi), %edx;			\
+	movzwl %dx, %edx;			\
+	in  (%dx), %al;				\
+	test $0x20,%al;				\
+	jne 22f;				\
+	pause;					\
+	jmp 21b;				\
+22:						\
+	movl    %esi, %edx;			\
+	movzwl  %dx, %edx;			\
+	mov     x, %al;				\
+	out     %al, (%dx);			\
+31:						\
+	lea  0x5(%esi), %edx;			\
+	movzwl %dx, %edx;			\
+	in  (%dx), %al;				\
+	test $0x20,%al;				\
+	jne 32f;				\
+	pause;					\
+	jmp 31b;				\
+32:						\
+
+#define PUSH_NUM				\
+	pushq %rcx;				\
+	pushq %rbx;
+
+#define POP_NUM					\
+	popq %rbx;				\
+	popq %rcx;
+
+#define PUSH_PRINT				\
+	pushq %rsi;				\
+	pushq %rdx;				\
+	pushq %rax;				\
+
+#define POP_PRINT				\
+	popq %rax;				\
+	popq %rdx;				\
+	popq %rsi;
+
+#define S_PRINT_NUM(_n)				\
+	PUSH_PRINT;				\
+	PUSH_NUM;				\
+	PRINT_NUM(_n);				\
+	POP_NUM;				\
+	POP_PRINT;
+
+#define S_PRINT_L(x)				\
+	PUSH_PRINT;				\
+	PRINT_L(x);				\
+	POP_PRINT;
+
+#define S_PRINT_QUAD(_n)			\
+	PUSH_PRINT;				\
+	PUSH_NUM;				\
+	PRINT_QUAD(_n);				\
+	POP_NUM;				\
+	POP_PRINT;
+
+/* Save registers on the current stack. Both for
+ * switch_to_guest and switch_to_host usage */
+#define SAVE_REGS				\
+	/* Save old guest/host state */		\
+	pushq	%fs;				\
+	pushq	%rax;				\
+	pushq	%r15;				\
+	pushq	%r14;				\
+	pushq	%r13;				\
+	pushq	%r12;				\
+	pushq	%r11;				\
+	pushq	%r10;				\
+	pushq	%r9;				\
+	pushq	%r8;				\
+	pushq	%rbp;				\
+	pushq	%rdi;				\
+	pushq	%rsi;				\
+	pushq	%rdx;				\
+	pushq	%rcx;				\
+	pushq	%rbx;				\
+
+#define RESTORE_REGS				\
+	/* Restore old guest/host state */	\
+	popq	%rbx;				\
+	popq	%rcx;				\
+	popq	%rdx;				\
+	popq	%rsi;				\
+	popq	%rdi;				\
+	popq	%rbp;				\
+	popq	%r8;				\
+	popq	%r9;				\
+	popq	%r10;				\
+	popq	%r11;				\
+	popq	%r12;				\
+	popq	%r13;				\
+	popq	%r14;				\
+	popq	%r15;				\
+	popq	%rax;				\
+	popq	%fs;				\
+
+.macro dump_stack_regs PREFIX
+	movq	$LGUEST_REGS_size, %r10
+	xorq	%r11, %r11
+1:	PRINT_L(\PREFIX);
+	movq	%r11, %rbx;
+	PRINT_NUM_BX;
+	PRINT_L(':'); PRINT_L(' ');
+	movq	%rsp, %r9
+	addq	%r11, %r9
+	PRINT_QUAD((%r9))
+	addq	$8, %r11
+	cmp	%r11, %r10
+	ja	1b
+.endm
+
+.macro debugme VCPU C
+	testb	$1,LGUEST_VCPU_debug(\VCPU)
+	jz	23f
+	PRINT_L(\C)
+23:
+.endm
+
+
+#if 0
+.global hcall_teste
+	.type hcall_teste, @function
+hcall_teste:
+	cmpq	$0, %gs:pda_vcpu
+	jne	handle_guest
+	jmp	*host_syscall
+handle_guest:
+	/* SAVE_REGS  maybe it is not the macro we want */
+	#cmpq	$__PAGE_OFFSET, %rcx;
+	jb	do_hypercall
+	movq	%gs:pda_vcpu, %rcx;
+	movq	LGUEST_VCPU_guest_syscall(%rcx), %rcx;
+#endif
+
+/**
+ * DECODE_IDT  parse an IDT descriptor to find the target.
+ *  @IDT     - The register that holds the IDT descriptor location
+ *  @IDTWORD - The word version of the IDT register
+ *	        (i.e. if IDT is %rax, then IDTWORD must be %ax)
+ *  @RESULT  - The register to place the result in.
+ *
+ * This clobbers both the IDT and RESULT regs.
+ */
+.macro DECODE_IDT IDT IDTWORD RESULT
+	movzwq	(\IDT), \RESULT
+	movq	4(\IDT), \IDT
+	xorw	\IDTWORD, \IDTWORD
+	orq	\IDT, \RESULT
+.endm
+
+/**
+ * DECODE_SSEG  parse a System Segment descriptor to find the target.
+ *  @SEG       - The register that holds the Sys Seg descriptor location
+ *  @RESULT    - The register to place the result in.
+ *  @RW	       - The word version of the RESULT register
+ *  @RH	       - The high byte version of the RESULT register
+ *
+ * (i.e. if RESULT is %rax, then RW must be %ax and RH must be %ah)
+ *
+ * This clobbers both the SEG and RESULT regs.
+ */
+/* Why does Intel need to make everything so darn complex! */
+.macro DECODE_SSEG SEG RESULT RW RH
+	movzbq	7(\SEG), \RESULT
+	shl	$16, \RESULT
+	movb	4(\SEG), \RH
+	shl	$8, \RESULT
+	movw	2(\SEG), \RW
+	movq	8(\SEG), \SEG
+	shlq	$32, \SEG
+	orq	\SEG, \RESULT
+.endm
+
+.global switch_to_guest
+	.type switch_to_guest, @function
+/* rdi holds the pointer to vcpu.
+ * Interrupts are off on entry   */
+switch_to_guest:
+	SAVE_REGS
+	/* save host stack */
+	movq	%rsp, LGUEST_VCPU_host_stack(%rdi)
+	/* put the guest's stack in */
+	movq	%rdi, %rsp
+	/* move the stack to point to guest regs */
+	addq	$LGUEST_VCPU_regs, %rsp
+	/* filling this pointer has the effect of signaling that we're
+	 * running guest code */
+	movq	%rdi, %gs:pda_vcpu
+
+	/* save this host's gdt and idt */
+	sgdt LGUEST_VCPU_host_gdt(%rdi)
+	sidt LGUEST_VCPU_host_idt(%rdi)
+
+	/* Save the gs base of the host (for nmi use) */
+	movl	$MSR_GS_BASE, %ecx
+	rdmsr
+	movq	%rax, LGUEST_VCPU_host_gs_a(%rdi)
+	movq	%rdx, LGUEST_VCPU_host_gs_d(%rdi)
+
+	/* Save the host proc gs pointer */
+	movl	$MSR_KERNEL_GS_BASE, %ecx
+	rdmsr
+	movq	%rax, LGUEST_VCPU_host_proc_gs_a(%rdi)
+	movq	%rdx, LGUEST_VCPU_host_proc_gs_d(%rdi)
+
+	/* save the hosts page tables */
+	movq %cr3, %rax
+	movq %rax, LGUEST_VCPU_host_cr3(%rdi)
+
+	/*
+	 * The NMI is a big PITA. There's no way to atomically load the
+	 * TSS and IDT, so we can't just switch to the guest TSS without
+	 * causing a race condition with the NMI.
+	 * So we set up the host NMI stack in the guest TSS IST so that
+	 * in case we take an NMI after loading our TR register
+	 * but before we've updated the lidt, we still have a valid
+	 * stack for the host nmi handler to use.
+	 */
+	/* Load the guest gdt */
+	lgdt LGUEST_VCPU_gdt(%rdi)
+
+	/* Switch to guest's TSS (before loading the idt) */
+	movl	$(GDT_ENTRY_TSS*8), %ebx
+	ltr	%bx
+
+	/* Set host's TSS to available (clear the busy bit in byte 5). */
+	movq	LGUEST_VCPU_host_gdt_ptr(%rdi), %rax
+	andb	$0xFD, (GDT_ENTRY_TSS*8+5)(%rax)
+
+	/* Now load the guest idt */
+	lidt LGUEST_VCPU_idt(%rdi)
+
+	/* Load the guest gs pointer */
+	movl	$MSR_KERNEL_GS_BASE, %ecx
+	movq	LGUEST_VCPU_guest_gs_a(%rdi), %rax
+	movq	LGUEST_VCPU_guest_gs_d(%rdi), %rdx
+	wrmsr
+
+	/* Flush the TLB */
+	movq	%cr4, %rax
+	movq	%rax, %rbx
+	andb	$~(1<<7), %al
+	movq	%rax, %cr4
+	movq	%rbx, %cr4
+
+	/* switch to the guests page tables */
+	popq %rax
+	movq %rax, %cr3
+
+	/* Now we swap gs to the guest gs base */
+	swapgs
+
+	/* restore guest registers */
+	RESTORE_REGS
+	/* skip trapnum and errorcode */
+	addq	$0x10, %rsp;
+	iretq
+
+.macro print_trap VCPU REG
+	movq	LGUEST_VCPU_trapnum(\VCPU), \REG
+	PRINT_QUAD(\REG)
+.endm
+
+#define SWITCH_TO_HOST							\
+	SAVE_REGS;							\
+	/* Save old pgdir */						\
+	movq	%cr3, %rax;						\
+	pushq	%rax;							\
+	/* Point rdi to the vcpu struct */				\
+	movq	%rsp, %rdi;						\
+	subq	$LGUEST_VCPU_regs, %rdi;				\
+	/* Load lguest ds segment for convenience. */			\
+	movq	$(__HV_DS), %rax;					\
+	movq	%rax, %ds;						\
+	/* Load the host page tables since that's where the gdt is */	\
+	movq    LGUEST_VCPU_host_cr3(%rdi), %rax;			\
+	movq    %rax, %cr3;						\
+	/* Switch to hosts gdt */					\
+	lgdt    LGUEST_VCPU_host_gdt(%rdi);				\
+	/* Set guest's TSS to available (clear the busy bit in byte 5). */ \
+	movq    LGUEST_VCPU_vcpu(%rdi), %rax;				\
+	andb	$0xFD, (LGUEST_VCPU_gdt_table+GDT_ENTRY_TSS*8+5)(%rax);	\
+	/* Swap back to the host PDA */					\
+	swapgs;								\
+	/* Put back the host process gs as well */			\
+	movl  	$MSR_KERNEL_GS_BASE,%ecx;				\
+	movq    LGUEST_VCPU_host_proc_gs_a(%rdi), %rax;			\
+	movq    LGUEST_VCPU_host_proc_gs_d(%rdi), %rdx;			\
+	wrmsr;								\
+	/* With PDA back now switch to host idt */			\
+	lidt    LGUEST_VCPU_host_idt(%rdi);				\
+	/* Switch to host's TSS. */					\
+	movl	$(GDT_ENTRY_TSS*8), %eax;				\
+	ltr	%ax;							\
+	/* put flag down. We're in the host again */			\
+	movq	$0, %gs:pda_vcpu;					\
+	movq	LGUEST_VCPU_host_stack(%rdi), %rsp;			\
+	RESTORE_REGS;
+
+/* Return to run_guest_once. */
+return_to_host:
+	SWITCH_TO_HOST
+	iretq
+
+deliver_to_host:
+	SWITCH_TO_HOST
+decode_idt_and_jmp:
+	/* Decode IDT and jump to the host's irq handler.  When that does
+	 * iret, it will return to run_guest_once.  This is a feature. */
+	/* We told gcc we'd clobber rdi and rax... */
+	movq	LGUEST_VCPU_trapnum(%rdi), %rdi
+	shl	$1, %rdi
+	leaq	(%rax,%rdi,8), %rdi
+	DECODE_IDT %rdi %di %rax
+	jmp	*%rax
+
+#define NMI_SWITCH_TO_HOST						\
+	/* Force switch to host, GDT, CR3, and both GS bases */		\
+	movl    $MSR_GS_BASE, %ecx;					\
+	movq    LGUEST_VCPU_host_gs_a(%rdi), %rax;			\
+	movq    LGUEST_VCPU_host_gs_d(%rdi), %rdx;			\
+	wrmsr;								\
+	movl    $MSR_KERNEL_GS_BASE, %ecx;				\
+	movq    LGUEST_VCPU_host_proc_gs_a(%rdi), %rax;			\
+	movq    LGUEST_VCPU_host_proc_gs_d(%rdi), %rdx;			\
+	wrmsr;								\
+	movq    LGUEST_VCPU_host_cr3(%rdi), %rax;			\
+	movq	%rax, %cr3;						\
+	lgdt    LGUEST_VCPU_host_gdt(%rdi);
+
+#if 0
+	/* Set host's TSS to available (clear byte 5 bit 2). */		\
+	movq	LGUEST_VCPU_host_gdt_ptr(%rdi), %rax;			\
+	andb	$0xFD, (GDT_ENTRY_TSS*8+5)(%rax);			\
+
+#endif
+
+/* Used by NMI only */
+/*
+ * The NMI is special because it uses its own stack, and needs to
+ * find the vcpu struct differently.
+ */
+nmi_trampoline:
+	/* nmi has its own stack */
+	SAVE_REGS
+
+	/* save the cr3 */
+	movq     %cr3, %rax
+	pushq	 %rax
+
+	/* get the vcpu struct */
+	movq     %rsp, %rdi
+	subq     $LGUEST_VCPU_nmi_stack_end, %rdi
+	addq     $LGUEST_REGS_size, %rdi  /* compensate for saved regs */
+
+	/* compensate if our end pointer is not 16 bytes aligned */
+	movq	 $LGUEST_VCPU_nmi_stack_end, %rax
+	andq	 $0xf, %rax;
+	addq	 %rax, %rdi;
+
+#if 0 /* in case we want to see where the nmi hit */
+	movq	LGUEST_REGS_rip(%rsp), %r8
+	PRINT_L('R')
+	PRINT_QUAD(%r8)
+#endif
+
+	/*
+	 * All guest descriptors are above the HV text code (here!)
+	 * If we hit the suspected NMI race, our stack will be the host
+	 * kernel stack, and that is in lower address space than the HV.
+	 * So test to see if we are screwed. Don't do anything, but just
+	 * report it!
+	 */
+	call   1f
+1:
+	movq	0(%rsp), %rax /* put this RIP into rax */
+	/* If rsp >= rax; jmp */
+	cmpq	%rax, %rsp
+	jge	1f
+
+	PRINT_L('H'); PRINT_L('i'); PRINT_L('t'); PRINT_L(' ');
+	PRINT_L('N'); PRINT_L('M'); PRINT_L('I'); PRINT_L(' ');
+	PRINT_L('r'); PRINT_L('a'); PRINT_L('c'); PRINT_L('e');
+	PRINT_L('\n'); PRINT_L('\r');
+
+1:
+	/* put back the stack from the previous call */
+	addq   $8, %rsp
+
+	/*
+	 * If we take another NMI while saving, we need to start over
+	 * and try again. It's OK as long as we don't overwrite
+	 * the saved material.
+	 */
+	testq    $1,LGUEST_VCPU_nmi_sw(%rdi)
+	jnz      1f
+
+	/* Copy the saved regs */
+	cld
+	movq	%rdi,  %rbx   /* save off vcpu struct */
+	leaq	LGUEST_VCPU_nmi_regs(%rdi), %rdi
+	leaq	0(%rsp), %rsi
+	movq	$(LGUEST_REGS_size/8), %rcx
+	rep	movsq
+
+	movq	%rbx, %rdi  /* put back vcpu struct */
+
+	/* save the gs base and shadow */
+	movl	$MSR_GS_BASE, %ecx
+	rdmsr
+	movq	%rax, LGUEST_VCPU_nmi_gs_a(%rdi)
+	movq	%rdx, LGUEST_VCPU_nmi_gs_d(%rdi)
+
+	movl	$MSR_KERNEL_GS_BASE, %ecx
+	rdmsr
+	movq	%rax, LGUEST_VCPU_nmi_gs_shadow_a(%rdi)
+	movq	%rdx, LGUEST_VCPU_nmi_gs_shadow_d(%rdi)
+
+	/* save the gdt */
+	sgdt	LGUEST_VCPU_nmi_gdt(%rdi)
+
+	/* set the switch flag to prevent another nmi from saving over this */
+	movq   $1, LGUEST_VCPU_nmi_sw(%rdi)
+
+1:
+
+#if 0
+	S_PRINT_L('N')
+	S_PRINT_L('M')
+	S_PRINT_L('I')
+	S_PRINT_L(' ')
+	S_PRINT_L('l')
+	S_PRINT_L('g')
+	S_PRINT_L('u')
+	S_PRINT_L('e')
+	S_PRINT_L('s')
+	S_PRINT_L('t')
+	S_PRINT_L('\n')
+	S_PRINT_L('\r')
+#endif
+	NMI_SWITCH_TO_HOST
+
+	/* we want to come back here on the iret */
+	pushq  $__HV_DS
+	/* put the vcpu struct as our stack */
+	pushq %rdi
+	pushfq
+	pushq	$__HV_CS
+
+	movq    LGUEST_VCPU_host_idt_address(%rdi), %rax
+
+	/* Decode the location of the host NMI handler */
+	leaq   32(%rax), %rbx   /* NMI IDT entry */
+	DECODE_IDT %rbx %bx %rax
+
+	callq   *%rax
+
+	/*
+	 * Back from NMI, stack points to vcpu, and we can take
+	 * more NMIs at this point. That's OK, since we only
+	 * want to get to the original NMI interruption. We
+	 * just restart this restore process. Nested NMIs will
+	 * not destroy this data while the nmi_sw flag is set.
+	 */
+	movq    %rsp, %rdi
+
+	/* restore the cr3 */
+	addq   $(LGUEST_VCPU_nmi_regs), %rsp
+	popq   %rax
+	movq   %rax, %cr3
+
+	/* restore the gdt */
+	lgdt	LGUEST_VCPU_nmi_gdt(%rdi)
+
+#if 0 /* print magic */
+	movq	LGUEST_VCPU_magic(%rdi), %r8
+	movq	$(6*8), %r9
+1:	subq	$8, %r9
+	movq	%r9, %rcx
+	movq	%r8, %rbx
+	shr	%cl, %rbx
+	PRINT_OUT(%bl)
+	cmp	$0, %r9
+	jne	1b
+#endif
+
+	/* make both host and guest TSS available */
+#if 1
+	movq	LGUEST_VCPU_host_gdt_ptr(%rdi), %rax
+	andb	$0xFD, (GDT_ENTRY_TSS*8+5)(%rax)
+
+	andb	$0xFD, (LGUEST_VCPU_gdt_table+GDT_ENTRY_TSS*8+5)(%rdi)
+#endif
+
+#if 0
+	movl	$(GDT_ENTRY_TSS*8), %ebx
+	ltr	%bx
+#endif
+
+	/* restore the gs base and shadow */
+	movl   $MSR_GS_BASE, %ecx
+	movq   LGUEST_VCPU_nmi_gs_a(%rdi), %rax
+	movq   LGUEST_VCPU_nmi_gs_d(%rdi), %rdx
+	wrmsr
+
+	movl   $MSR_KERNEL_GS_BASE, %ecx
+	movq   LGUEST_VCPU_nmi_gs_shadow_a(%rdi), %rax
+	movq   LGUEST_VCPU_nmi_gs_shadow_d(%rdi), %rdx
+	wrmsr
+
+#if 0
+	PRINT_L('O')
+	PRINT_L('U')
+	PRINT_L('T')
+	PRINT_L('\n')
+	PRINT_L('\r')
+#endif
+
+#if 1
+	/* Flush the TLB */
+	movq	%cr4, %rax
+	movq	%rax, %rbx
+	andb	$~(1<<7), %al
+	movq	%rax, %cr4
+	movq	%rbx, %cr4
+#endif
+
+	RESTORE_REGS
+
+	/* skip trapnum and errcode */
+	addq	$0x10, %rsp
+
+	/*
+	 * Careful here, we can't modify any regs anymore
+	 * but we now have to zero out the nmi switch flag.
+	 * So all the work will be done by the stack pointer.
+	 */
+
+#define SW_OFFSET (LGUEST_VCPU_nmi_sw - \
+		   (LGUEST_VCPU_nmi_regs + LGUEST_REGS_rip))
+	 movq  $0, SW_OFFSET(%rsp)
+
+	 /* use iret to get back to where we were. */
+	 iretq;
+	 /* Whoo, all done! */
+
+do_crash:
+	SAVE_REGS
+	movq	%cr3, %rax;
+	pushq	%rax;
+	PRINT_L('C');PRINT_L('r');PRINT_L('a');PRINT_L('s');
+	PRINT_L('h');PRINT_L('i');PRINT_L('n');PRINT_L('g');
+	PRINT_L('\n');PRINT_L('\r');
+
+	dump_stack_regs 'S'
+
+	addq	$16, %rsp
+	sgdt	0(%rsp)
+	PRINT_L('G');PRINT_L('D');PRINT_L('T');PRINT_L('L');PRINT_L(':');PRINT_L(' ');
+	xorq	%r8, %r8
+	movw	(%rsp), %r8
+	PRINT_QUAD(%r8)
+	PRINT_L('G');PRINT_L('D');PRINT_L('T');PRINT_L('A');PRINT_L(':');PRINT_L(' ');
+	movq	2(%rsp), %r8
+	PRINT_QUAD(%r8)
+
+	PRINT_L('C');PRINT_L('S');PRINT_L(':');PRINT_L(' ');
+	movq	%cs, %rbx
+	PRINT_QUAD(%rbx)
+	movq	%cs, %rbx
+	andb	$(~3), %bl
+	addq	%rbx, %r8
+	movq	0(%r8), %r9
+	PRINT_L('S');PRINT_L('E');PRINT_L('G');PRINT_L(':');PRINT_L(' ');
+	PRINT_QUAD(%r9);
+	movq	$1, %r8;
+	shl	$47, %r8
+	andq	%r9, %r8
+	PRINT_L('P');PRINT_L(' ');PRINT_L(':');PRINT_L(' ');
+	PRINT_QUAD(%r8);
+	PRINT_L('D');PRINT_L('P');PRINT_L(':');PRINT_L(' ');
+	movq	$3, %r8;
+	shl	$45, %r8
+	andq	%r9, %r8
+	PRINT_QUAD(%r8);
+
+
+	/* just die! */
+2:
+	pause
+	jmp 2b
+
+
+/* Real hardware interrupts are delivered straight to the host.  Others
+   cause us to return to run_guest_once so it can decide what to do.  Note
+   that some of these are overridden by the guest to deliver directly, and
+   never enter here (see load_guest_idt_entry). */
+.macro IRQ_STUB N TARGET
+	.data; .quad 1f; .text; 1:
+ /* Make an error number for most traps, which don't have one. */
+/*  .if (\N <> 2) && (\N <> 8) && (\N < 10 || \N > 14) && (\N <> 17) */
+  .if (\N < 10 || \N > 14) && (\N <> 17)
+	pushq	$0
+ .endif
+	pushq	$\N
+	jmp	\TARGET
+	.align 8
+.endm
+
+.macro IRQ_STUBS FIRST LAST TARGET
+ irq=\FIRST
+ .rept \LAST-\FIRST+1
+	IRQ_STUB irq \TARGET
+  irq=irq+1
+ .endr
+.endm
+
+/* We intercept every interrupt, because we may need to switch back to
+ * host.  Unfortunately we can't tell them apart except by entry
+ * point, so we need 256 entry points.
+ */
+irq_stubs:
+.data
+.global _lguest_default_idt_entries
+_lguest_default_idt_entries:
+.text
+	IRQ_STUBS 0 1 return_to_host		/* First two traps */
+	IRQ_STUB 2 nmi_trampoline	/* NMI */
+	IRQ_STUBS 3 7 return_to_host		/* Rest of traps */
+/*debug for now */
+	IRQ_STUB 8 do_crash			/* Double fault! */
+#if 1
+	IRQ_STUBS 9 31 return_to_host		/* Rest of traps */
+#else
+	IRQ_STUBS 9 12 return_to_host		/* Rest of traps */
+	IRQ_STUB 13 do_crash			/* GPF! */
+	IRQ_STUBS 14 31 return_to_host		/* Rest of traps */
+#endif
+	IRQ_STUBS 32 127 deliver_to_host	/* Real interrupts */
+	IRQ_STUB 128 return_to_host		/* System call (overridden) */
+	IRQ_STUBS 129 255 deliver_to_host	/* Other real interrupts */
+
+	.align PAGE_SIZE
+.global end_hyper_text
+	.type end_hyper_text, @function
+end_hyper_text:
+	nop
Index: work-pv/arch/x86_64/lguest/interrupts_and_traps.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/interrupts_and_traps.c
@@ -0,0 +1,292 @@
+#include <linux/uaccess.h>
+#include <asm/lguest.h>
+#include <asm/desc.h>
+#include <asm/hw_irq.h>
+#include "lguest.h"
+
+static void push_guest_stack(struct lguest_vcpu *vcpu,
+					u64 __user **gstack, u64 val)
+{
+	lhwrite_u64(vcpu, (u64)--(*gstack), val);
+}
+
+static u64 pop_guest_stack(struct lguest_vcpu *vcpu,
+			   u64 __user **gstack)
+{
+	return lhread_u64(vcpu, (u64)(*gstack)++);
+}
+
+void guest_iret(struct lguest_vcpu *vcpu)
+{
+	struct lguest_regs *regs = &vcpu->regs;
+	u64 __user *gstack;
+	u64 cs;
+
+	gstack = (u64 __user *)guest_pa(vcpu->guest, regs->rsp);
+
+	regs->rip = pop_guest_stack(vcpu, &gstack);
+	cs = pop_guest_stack(vcpu, &gstack);
+
+	/* FIXME: determine if we are going back to userland */
+
+	regs->rflags = pop_guest_stack(vcpu, &gstack);
+	/* FIXME: check if this is correct */
+
+	if (regs->rflags & 512)
+		put_user(512, &vcpu->guest->lguest_data->irq_enabled);
+
+	/* make sure interrupts are enabled */
+	regs->rflags |= 512;
+
+	regs->rsp = pop_guest_stack(vcpu, &gstack);
+	regs->ss = pop_guest_stack(vcpu, &gstack);
+
+	/* restore the rax reg, since it was used by the guest to do the hcall */
+	regs->rax = vcpu->rax;
+
+	return;
+}
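The pop order in `guest_iret()` above is the mirror image of the frame that `reflect_trap()` builds below. A minimal sketch, assuming a down-growing stack; `push64`/`pop64` are illustrative stand-ins for `push_guest_stack()`/`pop_guest_stack()`:

```c
#include <assert.h>
#include <stdint.h>

/* The stack grows down: a push pre-decrements, a pop post-increments.
 * reflect_trap() pushes ss, rsp, rflags, cs, rip, and guest_iret()
 * pops rip, cs, rflags, rsp, ss -- the same hardware iret layout. */
static void push64(uint64_t **sp, uint64_t v) { *--(*sp) = v; }
static uint64_t pop64(uint64_t **sp) { return *(*sp)++; }
```

Round-tripping a frame through these two helpers recovers the values in the opposite order they were pushed, which is exactly what the guest's iret path relies on.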
+
+int reflect_trap(struct lguest_vcpu *vcpu, int trap_num, int has_err)
+{
+	struct lguest_regs *regs = &vcpu->regs;
+	u64 __user *gstack;
+	u64 rflags, irq_enable;
+	u64 offset;
+
+	if (!vcpu->interrupt[trap_num]) {
+		printk("No trap handler registered yet for %d\n", trap_num);
+		return 0;
+	}
+
+	/* save off the rax reg */
+	vcpu->rax = regs->rax;
+
+	/* FIXME: test for ring change and set up vcpu->tss.rsp2 ? */
+	gstack = (u64 __user *)guest_pa(vcpu->guest, regs->rsp);
+	offset = regs->rsp - (u64)gstack;
+
+	/* We use IF bit in eflags to indicate whether irqs were disabled
+	   (it's always 0, since irqs are enabled when guest is running). */
+	get_user(irq_enable, &vcpu->guest->lguest_data->irq_enabled);
+	rflags = regs->rflags;
+	rflags |= (irq_enable & 512);
+
+	/* FIXME: Really? */
+	push_guest_stack(vcpu, &gstack, regs->ss);
+	push_guest_stack(vcpu, &gstack, regs->rsp);
+	push_guest_stack(vcpu, &gstack, rflags);
+	/* FIXME: determine if guest is in kernel or user mode */
+	push_guest_stack(vcpu, &gstack, __KERNEL_CS);
+	push_guest_stack(vcpu, &gstack, regs->rip);
+
+	if (has_err)
+		push_guest_stack(vcpu, &gstack, regs->errcode);
+
+	/* Change the real stack so hypervisor returns to trap handler */
+	regs->ss = __USER_DS;
+	regs->rsp = (u64)gstack + offset;
+	regs->cs = __USER_CS;
+	lgdebug_print("rip was at %p\n", (void*)regs->rip);
+	regs->rip = vcpu->interrupt[trap_num];
+
+	/* Disable interrupts for an interrupt gate. */
+	if (test_bit(trap_num, vcpu->interrupt_disabled))
+		put_user(0, &vcpu->guest->lguest_data->irq_enabled);
+	return 1;
+#if 0
+	/* What is this for? */
+	/* GS will be neutered on way back to guest. */
+	put_user(0, &lg->lguest_data->gs_gpf_eip);
+#endif
+	return 0;
+}
+
+void maybe_do_interrupt(struct lguest_vcpu *vcpu)
+{
+	unsigned int irq;
+	DECLARE_BITMAP(irqs, LGUEST_IRQS);
+
+	if (!vcpu->guest->lguest_data)
+		return;
+
+	/* If timer has changed, set timer interrupt. */
+	if (vcpu->guest->timer_on && jiffies != vcpu->guest->last_timer)
+		set_bit(0, vcpu->irqs_pending);
+
+	/* Mask out any interrupts they have blocked. */
+	if (copy_from_user(&irqs, vcpu->guest->lguest_data->interrupts,
+								sizeof(irqs)))
+		return;
+
+	bitmap_andnot(irqs, vcpu->irqs_pending, irqs, LGUEST_IRQS);
+
+	irq = find_first_bit(irqs, LGUEST_IRQS);
+	if (irq >= LGUEST_IRQS)
+		return;
+
+	/* If they're halted, we re-enable interrupts. */
+	if (vcpu->guest->halted) {
+		/* Re-enable interrupts. */
+		put_user(512, &vcpu->guest->lguest_data->irq_enabled);
+		vcpu->guest->halted = 0;
+	} else {
+		/* Maybe they have interrupts disabled? */
+		u32 irq_enabled;
+		get_user(irq_enabled, &vcpu->guest->lguest_data->irq_enabled);
+		if (!irq_enabled) {
+			lgdebug_print("Irqs are disabled\n");
+			return;
+		}
+	}
+
+	if (vcpu->interrupt[irq + FIRST_EXTERNAL_VECTOR] != 0) {
+		lgdebug_print("Reflect trap: %x\n", irq + FIRST_EXTERNAL_VECTOR);
+		clear_bit(irq, vcpu->irqs_pending);
+		reflect_trap(vcpu, irq + FIRST_EXTERNAL_VECTOR, 0);
+	} else {
+		lgdebug_print("out without doing it!!\n");
+	}
+}
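The selection logic in `maybe_do_interrupt()` reduces to a bitmap operation: deliverable = pending & ~blocked, then take the lowest set bit. A user-space sketch, using a 64-bit word in place of the `LGUEST_IRQS` bitmap and a GCC builtin in place of `find_first_bit()`:

```c
#include <assert.h>
#include <stdint.h>

/* Models maybe_do_interrupt(): mask the pending interrupts against
 * the guest's blocked bitmap and deliver the lowest remaining one.
 * Returns the irq number, or -1 if nothing is deliverable. */
static int pick_irq(uint64_t pending, uint64_t blocked)
{
	uint64_t deliverable = pending & ~blocked;	/* bitmap_andnot() */
	if (!deliverable)
		return -1;
	return __builtin_ctzll(deliverable);		/* find_first_bit() */
}
```

The real function additionally handles the halted case and checks the guest's `irq_enabled` flag before reflecting the trap; the sketch covers only the masking step.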
+
+void check_bug_kill(struct lguest_vcpu *vcpu)
+{
+/* FIXME: Use rostedt magic kallsyms */
+#if 0
+#ifdef CONFIG_BUG
+	u32 eip = lg->state->regs.rip - PAGE_OFFSET;
+	u16 insn;
+
+	/* This only works for addresses in linear mapping... */
+	if (lg->state->regs.rip < PAGE_OFFSET)
+		return;
+	lhread(lg, &insn, eip, sizeof(insn));
+	if (insn == 0x0b0f) {
+#ifdef CONFIG_DEBUG_BUGVERBOSE
+		u16 l;
+		u32 f;
+		char file[128];
+		lhread(lg, &l, eip+sizeof(insn), sizeof(l));
+		lhread(lg, &f, eip+sizeof(insn)+sizeof(l), sizeof(f));
+		lhread(lg, file, f - PAGE_OFFSET, sizeof(file));
+		file[sizeof(file)-1] = 0;
+		kill_guest(lg, "BUG() at %#x %s:%u", eip, file, l);
+#else
+		kill_guest(lg, "BUG() at %#x", eip);
+#endif	/* CONFIG_DEBUG_BUGVERBOSE */
+	}
+#endif	/* CONFIG_BUG */
+#endif
+}
+
+static void copy_trap(struct lguest_vcpu *vcpu,
+		      unsigned int trap_num,
+		      const struct gate_struct *desc)
+{
+
+	/* Not present? */
+	if (!desc->p) {
+		vcpu->interrupt[trap_num] = 0;
+		return;
+	}
+
+	switch (desc->type) {
+	case 0xE:
+		set_bit(trap_num, vcpu->interrupt_disabled);
+		break;
+	case 0xF:
+		clear_bit(trap_num, vcpu->interrupt_disabled);
+		break;
+	default:
+		kill_guest(vcpu->guest, "bad IDT type %i for irq %x",
+			   desc->type, trap_num);
+	}
+
+	vcpu->interrupt[trap_num] = GATE_ADDRESS((*desc));
+}
+
+#if 0
+
+/* FIXME: Put this in hypervisor.S and do something clever with relocs? */
+static u8 tramp[]
+= { 0x0f, 0xa8, 0x0f, 0xa9, /* push %gs; pop %gs */
+    0x36, 0xc7, 0x05, 0x55, 0x55, 0x55, 0x55, 0x00, 0x00, 0x00, 0x00,
+    /* movl 0, %ss:lguest_data.gs_gpf_eip */
+    0xe9, 0x55, 0x55, 0x55, 0x55 /* jmp dstaddr */
+};
+#define TRAMP_MOVL_TARGET_OFF 7
+#define TRAMP_JMP_TARGET_OFF 16
+
+static u32 setup_trampoline(struct lguest *lg, unsigned int i, u32 dstaddr)
+{
+	u32 addr, off;
+
+	off = sizeof(tramp)*i;
+	memcpy(lg->trap_page + off, tramp, sizeof(tramp));
+
+	/* 0 is to be placed in lguest_data.gs_gpf_eip. */
+	addr = (u32)&lg->lguest_data->gs_gpf_eip + lg->page_offset;
+	memcpy(lg->trap_page + off + TRAMP_MOVL_TARGET_OFF, &addr, 4);
+
+	/* Address is relative to where end of jmp will be. */
+	addr = dstaddr - ((-4*1024*1024) + off + sizeof(tramp));
+	memcpy(lg->trap_page + off + TRAMP_JMP_TARGET_OFF, &addr, 4);
+	return (-4*1024*1024) + off;
+}
+
+#endif
+/* We bounce through the trap page for two reasons: firstly, we need
+   the interrupt destination always mapped, to avoid double faults;
+   secondly, we want to reload %gs to make it innocuous on entering
+   the kernel. */
+/* The guest kernel will not be mapped; we'd better use another scheme. */
+static void setup_idt(struct lguest_vcpu *vcpu,
+		      unsigned int i,
+		      const struct gate_struct *desc)
+{
+	u64 taddr;
+
+	/* Not present? */
+	if (!desc->p) {
+		/* FIXME: When we need this, we'll know... */
+		if (vcpu->idt_table[i].p)
+			kill_guest(vcpu->guest, "trying to remove irq line %i: "
+					"removing interrupts not supported", i);
+		return;
+	}
+
+#if 0
+	/* We could reflect and disable interrupts, but guest can do itself. */
+	if (desc->type != 0xF)
+		kill_guest(vcpu->guest, "bad direct IDT %i type 0x%x",
+								i, desc->type);
+#endif
+
+	/* FIXME: We may need to fix segment? */
+	/* FIXME: taddr is used uninitialized here; the trampoline setup
+	 * below is still disabled. */
+	_lguest_set_gate(&vcpu->idt_table[i], desc->type, GUEST_DPL, taddr, 0);
+#if 0
+	taddr = setup_trampoline(lg, i, (desc->a&0xFFFF)|(desc->b&0xFFFF0000));
+#endif
+}
+
+void load_guest_idt_entry(struct lguest_vcpu *vcpu, unsigned int i,
+				struct gate_struct *d)
+{
+	switch (i) {
+	/* Ignore NMI, doublefault, page fault, spurious int, hypercall. */
+	case 2:
+	case 8:
+	case 14:
+	case 15:
+	case LGUEST_TRAP_ENTRY:
+	/* FIXME: We should handle debug and int3 */
+	case 1:
+	case 3:
+		return;
+	default:
+		copy_trap(vcpu, i, d);
+	}
+}
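The vector filter in `load_guest_idt_entry()` above is just a fixed allow/deny list. A sketch of that routing as a predicate; the value chosen for `LGUEST_TRAP_ENTRY_MODEL` is hypothetical, since the real constant lives in the lguest headers:

```c
#include <assert.h>

/* Models the filter in load_guest_idt_entry(): these vectors stay
 * under host control, so guest updates to them are silently ignored.
 * LGUEST_TRAP_ENTRY's value here is illustrative only. */
#define LGUEST_TRAP_ENTRY_MODEL 0x1F

static int guest_may_set_vector(unsigned int i)
{
	switch (i) {
	case 1:		/* debug */
	case 2:		/* NMI */
	case 3:		/* int3 */
	case 8:		/* double fault */
	case 14:	/* page fault */
	case 15:	/* spurious interrupt */
	case LGUEST_TRAP_ENTRY_MODEL:
		return 0;
	default:
		return 1;	/* forwarded to copy_trap() */
	}
}
```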
+
Index: work-pv/arch/x86_64/lguest/lguest.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/lguest.c
@@ -0,0 +1,705 @@
+/*
+ * Lguest specific paravirt-ops implementation
+ *
+ * Copyright (C) 2007, Glauber de Oliveira Costa <gcosta@redhat.com>
+ *                     Steven Rostedt <srostedt@redhat.com>
+ *                     Red Hat Inc
+ * Standing on the shoulders of Rusty Russell.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ * NON INFRINGEMENT.  See the GNU General Public License for more
+ * details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/kernel.h>
+#include <linux/start_kernel.h>
+#include <linux/string.h>
+#include <linux/console.h>
+#include <linux/screen_info.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/pfn.h>
+#include <asm/bootsetup.h>
+#include <asm/paravirt.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include <asm/param.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/e820.h>
+#include <asm/pda.h>
+#include <asm/asm-offsets.h>
+#include <asm/mce.h>
+#include <asm/proto.h>
+#include <asm/sections.h>
+
+struct lguest_data lguest_data;
+struct lguest_device_desc *lguest_devices;
+static __initdata const struct lguest_boot_info *boot = (void*)__START_KERNEL_map;
+static struct lguest_text_ptr code_stack[2];
+extern int acpi_disabled;
+extern int acpi_ht;
+
+extern const unsigned long kallsyms_addresses[] __attribute__((weak));
+extern const unsigned long kallsyms_num_syms __attribute__((weak));
+extern const u8 kallsyms_names[] __attribute__((weak));
+extern const u8 kallsyms_token_table[] __attribute__((weak));
+extern const u16 kallsyms_token_index[] __attribute__((weak));
+extern const unsigned long kallsyms_markers[] __attribute__((weak));
+
+static DEFINE_SPINLOCK(hcall_print_lock);
+#define HCALL_BUFF_SIZ 1024
+static char hcall_buff[HCALL_BUFF_SIZ];
+
+/* Set to true when the lguest_init is called. */
+static int lguest_paravirt;
+
+struct lguest_print_ops {
+	void (*vprint)(const char *fmt, va_list ap);
+} *lguest_pops;
+
+void lguest_vprint(const char *fmt, va_list ap)
+{
+	if (lguest_pops)
+		lguest_pops->vprint(fmt, ap);
+}
+
+void lguest_print(const char *fmt, ...)
+{
+	va_list ap;
+
+	/* irq save? */
+	va_start(ap, fmt);
+	lguest_vprint(fmt, ap);
+	va_end(ap);
+}
+
+static void __lguest_vprint(const char *fmt, va_list ap)
+{
+	/* need to do this with interrupts disabled */
+//	spin_lock(&hcall_print_lock);
+	vsnprintf(hcall_buff, HCALL_BUFF_SIZ-1, fmt, ap);
+
+	hcall(LHCALL_PRINT, __pa(hcall_buff), 0, 0);
+//	spin_unlock(&hcall_print_lock);
+}
+
+struct lguest_print_ops local_pops = {__lguest_vprint };
+
+void lguest_set_debug(int d)
+{
+	if (lguest_paravirt)
+		hcall(LHCALL_DEBUG_ME, d, 0, 0);
+}
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	/* Note: This code assumes we're uniprocessor. */
+	static unsigned int next_call;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (lguest_data.hcall_status[next_call] != 0xFF) {
+		/* Table full, so do normal hcall which will flush table. */
+		hcall(call, arg1, arg2, arg3);
+	} else {
+		lguest_data.hcalls[next_call].eax = call;
+		lguest_data.hcalls[next_call].edx = arg1;
+		lguest_data.hcalls[next_call].ebx = arg2;
+		lguest_data.hcalls[next_call].ecx = arg3;
+		wmb();
+		lguest_data.hcall_status[next_call] = 0;
+		if (++next_call == LHCALL_RING_SIZE)
+			next_call = 0;
+	}
+	local_irq_restore(flags);
+}
+
+#ifdef PARAVIRT_LAZY_NONE 	/* Not in 2.6.20. */
+static int lazy_mode;
+static void lguest_lazy_mode(int mode)
+{
+	lazy_mode = mode;
+	if (mode == PARAVIRT_LAZY_NONE)
+		hcall(LHCALL_FLUSH_ASYNC, 0, 0, 0);
+}
+
+static void lazy_hcall(unsigned long call,
+		       unsigned long arg1,
+		       unsigned long arg2,
+		       unsigned long arg3)
+{
+	if (lazy_mode == PARAVIRT_LAZY_NONE)
+		hcall(call, arg1, arg2, arg3);
+	else
+		async_hcall(call, arg1, arg2, arg3);
+}
+#else
+#define lazy_hcall hcall
+#endif
+
+static unsigned long save_fl(void)
+{
+	return lguest_data.irq_enabled;
+}
+
+static void restore_fl(unsigned long flags)
+{
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = flags;
+}
+
+static void irq_disable(void)
+{
+	lguest_data.irq_enabled = 0;
+}
+
+static void irq_enable(void)
+{
+	/* Linux i386 code expects bit 9 set. */
+	/* FIXME: Check if interrupt pending... */
+	lguest_data.irq_enabled = 512;
+}
+
+static void lguest_load_gdt(const struct desc_ptr *desc)
+{
+	/* Does nothing: the HV has already set up the GDT for us. */
+}
+
+static void lguest_load_idt(const struct desc_ptr *desc)
+{
+	unsigned int i;
+	struct gate_struct *idt = (void *)desc->address;
+
+	for (i = 0; i < (desc->size+1)/16; i++) {
+		hcall(LHCALL_LOAD_IDT_ENTRY, i, __pa((u64)&idt[i]), 0);
+	}
+}
+
+static int lguest_panic(struct notifier_block *nb, unsigned long l, void *p)
+{
+	hcall(LHCALL_CRASH, __pa(p), 0, 0);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block paniced = {
+	.notifier_call = lguest_panic
+};
+
+static void lguest_memory_setup(void)
+{
+	/* We do this here because lockdep barfs if done before start_kernel */
+	atomic_notifier_chain_register(&panic_notifier_list, &paniced);
+
+	e820.nr_map = 0;
+	add_memory_region(0, PFN_PHYS(boot->max_pfn), E820_RAM);
+}
+
+static void lguest_cpuid(unsigned int *eax, unsigned int *ebx,
+				 unsigned int *ecx, unsigned int *edx)
+{
+	int is_feature = (*eax == 1);
+
+	native_cpuid(eax, ebx, ecx, edx);
+	if (is_feature) {
+		/* The feature words are 32 bits; don't use the long-sized
+		 * set_bit()/clear_bit() helpers on pointers to ints. */
+		/* Hypervisor needs to know when we flush kernel pages. */
+		*edx |= 1 << X86_FEATURE_PGE;
+		/* We don't have any features! */
+		*edx &= ~((1 << X86_FEATURE_VME) |
+			  (1 << X86_FEATURE_DE) |
+			  (1 << X86_FEATURE_PSE) |
+			  (1 << X86_FEATURE_PAE) |
+			  (1 << X86_FEATURE_SEP) |
+			  (1 << X86_FEATURE_APIC) |
+			  (1 << X86_FEATURE_MTRR));
+		/* No MWAIT, either (ECX bit 3). */
+		*ecx &= ~(1 << 3);
+	}
+}
+
+static unsigned long current_cr3;
+static void lguest_write_cr3(unsigned long cr3)
+{
+	hcall(LHCALL_NEW_PGTABLE, cr3, 0, 0);
+	current_cr3 = cr3;
+}
+
+static u64 lguest_read_msr(unsigned int msr, int *err)
+{
+	unsigned long val;
+
+	*err = 0;
+	hcall(LHCALL_RDMSR, msr, __pa(&val), 0);
+	return val;
+}
+
+static int lguest_write_msr(unsigned int msr, u64 val)
+{
+	hcall(LHCALL_WRMSR, msr, (unsigned long)val, 0);
+	return val;
+}
+
+static u64 lguest_read_tsc(void)
+{
+	/* we don't use natives, otherwise they can recurse */
+	unsigned int a, b;
+	asm volatile("rdtsc" : "=a" (a), "=d" (b));
+	return a | ((u64)b << 32);
+}
+
+static void lguest_flush_tlb(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 0, 0, 0);
+}
+
+static void lguest_flush_tlb_kernel(void)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB, 1, 0, 0);
+}
+
+static void lguest_flush_tlb_single(u64 addr)
+{
+	lazy_hcall(LHCALL_FLUSH_TLB_SIG, current_cr3, addr, 0);
+}
+
+static void lguest_set_pte(pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	hcall(LHCALL_SET_PTE, current_cr3, __pa(ptep), pte_val(pteval));
+}
+
+static void lguest_set_pte_at(struct mm_struct *mm, u64 addr,
+			      pte_t *ptep, pte_t pteval)
+{
+	*ptep = pteval;
+	lazy_hcall(LHCALL_SET_PTE, __pa(mm->pgd), __pa(ptep), pte_val(pteval));
+}
+
+static void lguest_set_pmd(pmd_t *pmdp, pmd_t pmdval)
+{
+	*pmdp = pmdval;
+	lazy_hcall(LHCALL_SET_PMD, current_cr3, __pa(pmdp)&PTE_MASK,
+		   (__pa(pmdp)&(PAGE_SIZE-1))/8);
+}
+
+static void lguest_set_pud(pud_t *pudp, pud_t pudval)
+{
+	*pudp = pudval;
+	lazy_hcall(LHCALL_SET_PUD, current_cr3, __pa(pudp)&PTE_MASK,
+		   (__pa(pudp)&(PAGE_SIZE-1))/8);
+}
+
+static void lguest_set_pgd(pgd_t *pgdp, pgd_t pgdval)
+{
+	*pgdp = pgdval;
+	lazy_hcall(LHCALL_SET_PGD, current_cr3, __pa(pgdp)&PTE_MASK,
+		   (__pa(pgdp)&(PAGE_SIZE-1))/8);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static void lguest_apic_write(unsigned long reg, unsigned int v)
+{
+}
+
+static unsigned int lguest_apic_read(unsigned long reg)
+{
+	return 0;
+}
+#endif
+
+#if 0
+/* We move eflags word to lguest_data.irq_enabled to restore interrupt
+   state.  For page faults, gpfs and virtual interrupts, the
+   hypervisor has saved eflags manually, otherwise it was delivered
+   directly and so eflags reflects the real machine IF state,
+   ie. interrupts on.  Since the kernel always dies if it takes such a
+   trap with interrupts disabled anyway, turning interrupts back on
+   unconditionally here is OK. */
+asm("lguest_iret:"
+    " pushq	%rax;"
+    " movq	0x18(%rsp), %rax;"
+    "lguest_noirq_start:;"
+    " movq	%rax, lguest_data+"__stringify(LGUEST_DATA_irq_enabled)";"
+    " popq	%rax;"
+    " iretq;"
+    "lguest_noirq_end:");
+extern char lguest_noirq_start[], lguest_noirq_end[];
+#endif
+
+extern void lguest_iret(void);
+asm("lguest_iret:"
+    "  movq  $" __stringify(LHCALL_IRET) ", %rax\n"
+    "  int   $" __stringify(LGUEST_TRAP_ENTRY) );
+
+
+static void lguest_load_rsp0(struct tss_struct *tss,
+				     struct thread_struct *thread)
+{
+	lazy_hcall(LHCALL_SET_STACK, thread->rsp0, THREAD_SIZE/PAGE_SIZE, 0);
+}
+
+static void lguest_load_tr_desc(void)
+{
+}
+
+static void lguest_set_ldt(const void *addr, unsigned entries)
+{
+	/* FIXME: Implement. */
+	BUG_ON(entries);
+}
+
+static void lguest_load_tls(struct thread_struct *t, unsigned int cpu)
+{
+	lazy_hcall(LHCALL_LOAD_TLS, __pa(&t->tls_array), cpu, 0);
+}
+
+static void lguest_set_debugreg(int regno, unsigned long value)
+{
+	/* FIXME: Implement */
+}
+
+static unsigned int lguest_cr0;
+static void lguest_clts(void)
+{
+	lazy_hcall(LHCALL_TS, 0, 0, 0);
+	lguest_cr0 &= ~8U;
+}
+
+static unsigned long lguest_read_cr0(void)
+{
+	return lguest_cr0;
+}
+
+static void lguest_write_cr0(unsigned long val)
+{
+	hcall(LHCALL_TS, val & 8, 0, 0);
+	lguest_cr0 = val;
+}
+
+static unsigned long lguest_read_cr2(void)
+{
+	return lguest_data.cr2;
+}
+
+static unsigned long lguest_read_cr3(void)
+{
+	return current_cr3;
+}
+
+/* Used to enable/disable PGE, but we don't care. */
+static unsigned long lguest_read_cr4(void)
+{
+	return 0;
+}
+
+static void lguest_write_cr4(unsigned long val)
+{
+}
+
+static void lguest_time_irq(unsigned int irq, struct irq_desc *desc)
+{
+	do_timer(hcall(LHCALL_TIMER_READ, 0, 0, 0));
+	update_process_times(user_mode_vm(get_irq_regs()));
+}
+
+static void disable_lguest_irq(unsigned int irq)
+{
+	set_bit(irq, lguest_data.interrupts);
+}
+
+static void enable_lguest_irq(unsigned int irq)
+{
+	clear_bit(irq, lguest_data.interrupts);
+	/* FIXME: If it's pending? */
+}
+
+static struct irq_chip lguest_irq_controller = {
+	.name		= "lguest",
+	.mask		= disable_lguest_irq,
+	.mask_ack	= disable_lguest_irq,
+	.unmask		= enable_lguest_irq,
+};
+
+static void lguest_time_init(void)
+{
+	set_irq_handler(0, lguest_time_irq);
+	hcall(LHCALL_TIMER_START, HZ, 0, 0);
+}
+
+static void lguest_ebda_info(unsigned *addr, unsigned *size)
+{
+	*addr = *size = 0;
+}
+
+/* From i8259.c */
+extern void (*interrupt[])(void);
+static void __init lguest_init_IRQ(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_IRQS; i++) {
+		int vector = FIRST_EXTERNAL_VECTOR + i;
+		if (i >= NR_IRQS)
+			break;
+		/* FIXME: We should be doing this in a lot of other places */
+		if (vector != IA32_SYSCALL_VECTOR) {
+			printk(KERN_DEBUG "Setting vector %x as %p\n",
+			       vector, &interrupt[i]);
+			set_intr_gate(vector, interrupt[i]);
+			set_irq_chip_and_handler(i, &lguest_irq_controller,
+							 handle_level_irq);
+			hcall(LHCALL_LOAD_IDT_ENTRY, vector, __pa((u64)&idt_table[vector]), 0);
+		}
+	}
+}
+
+static inline void native_write_dt_entry(void *dt, int entry, u32 entry_low, u32 entry_high)
+{
+	u32 *lp = (u32 *)((char *)dt + entry*8);
+	lp[0] = entry_low;
+	lp[1] = entry_high;
+}
+
+static void lguest_write_ldt_entry(void *dt, int entrynum, u32 low, u32 high)
+{
+	/* FIXME: Allow this. */
+	BUG();
+}
+
+static void lguest_write_gdt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_GDT, __pa(dt), GDT_ENTRIES, 0);
+}
+
+static void lguest_write_idt_entry(void *dt, int entrynum,
+					   u32 low, u32 high)
+{
+	native_write_dt_entry(dt, entrynum, low, high);
+	hcall(LHCALL_LOAD_IDT_ENTRY, entrynum, low, high);
+}
+
+#define LGUEST_IRQ "lguest_data+"__stringify(LGUEST_DATA_irq_enabled)
+#define DEF_LGUEST(name, code)				\
+	extern const char start_##name[], end_##name[];		\
+	asm("start_" #name ": " code "; end_" #name ":")
+DEF_LGUEST(cli, "movl $0," LGUEST_IRQ);
+DEF_LGUEST(sti, "movl $512," LGUEST_IRQ);
+DEF_LGUEST(popf, "movl %eax," LGUEST_IRQ);
+DEF_LGUEST(pushf, "movl " LGUEST_IRQ ",%eax");
+DEF_LGUEST(pushf_cli, "movl " LGUEST_IRQ ",%eax; movl $0," LGUEST_IRQ);
+DEF_LGUEST(iret, ".byte 0xE9,0,0,0,0"); /* jmp ... */
+
+static const struct lguest_insns
+{
+	const char *start, *end;
+} lguest_insns[] = {
+	[PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
+	[PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
+	[PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
+	[PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
+	[PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
+	[PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
+};
+static unsigned lguest_patch(u8 type, u16 clobber, void *insns, unsigned len)
+{
+	unsigned int insn_len;
+
+	/* Don't touch it if we don't have a replacement */
+	if (type >= ARRAY_SIZE(lguest_insns) || !lguest_insns[type].start)
+		return len;
+
+	insn_len = lguest_insns[type].end - lguest_insns[type].start;
+
+	/* Similarly if we can't fit replacement. */
+	if (len < insn_len)
+		return len;
+
+	memcpy(insns, lguest_insns[type].start, insn_len);
+	if (type == PARAVIRT_INTERRUPT_RETURN) {
+		/* Jumps are relative; the E9 opcode takes a 32-bit offset. */
+		s32 off = (s64)lguest_iret - ((s64)insns + insn_len);
+		memcpy(insns+1, &off, sizeof(off));
+	}
+	return insn_len;
+}
+
+static void lguest_safe_halt(void)
+{
+	hcall(LHCALL_HALT, 0, 0, 0);
+}
+
+static unsigned long lguest_get_wallclock(void)
+{
+	return hcall(LHCALL_GET_WALLCLOCK, 0, 0, 0);
+}
+
+static void lguest_power_off(void)
+{
+	hcall(LHCALL_CRASH, __pa("Power down"), 0, 0);
+}
+
+static void lguest_syscall_init(void)
+{
+	/* FIXME: Will have to implement it later */
+}
+
+static __attribute_used__ __init void lguest_init(void)
+{
+	int i;
+
+	current_cr3 = __pa(&boot_level4_pgt);
+	paravirt_ops.name = "lguest";
+	paravirt_ops.mem_type = "LGUEST";
+	paravirt_ops.paravirt_enabled = 1;
+	paravirt_ops.syscall_init = lguest_syscall_init;
+
+	paravirt_ops.save_fl = save_fl;
+	paravirt_ops.restore_fl = restore_fl;
+	paravirt_ops.irq_disable = irq_disable;
+	paravirt_ops.irq_enable = irq_enable;
+	paravirt_ops.load_gdt = lguest_load_gdt;
+	paravirt_ops.memory_setup = lguest_memory_setup;
+	paravirt_ops.cpuid = lguest_cpuid;
+	paravirt_ops.write_cr3 = lguest_write_cr3;
+	paravirt_ops.read_msr = lguest_read_msr;
+	paravirt_ops.write_msr = lguest_write_msr;
+	paravirt_ops.read_tsc = lguest_read_tsc;
+	paravirt_ops.flush_tlb_user = lguest_flush_tlb;
+	paravirt_ops.flush_tlb_single = lguest_flush_tlb_single;
+	paravirt_ops.flush_tlb_kernel = lguest_flush_tlb_kernel;
+	paravirt_ops.set_pte = lguest_set_pte;
+	paravirt_ops.set_pte_at = lguest_set_pte_at;
+	paravirt_ops.set_pmd = lguest_set_pmd;
+	paravirt_ops.set_pud = lguest_set_pud;
+	paravirt_ops.set_pgd = lguest_set_pgd;
+#ifdef CONFIG_X86_LOCAL_APIC
+	paravirt_ops.apic_write = lguest_apic_write;
+	paravirt_ops.apic_read = lguest_apic_read;
+#endif
+	paravirt_ops.load_idt = lguest_load_idt;
+	paravirt_ops.iret = lguest_iret;
+	paravirt_ops.load_rsp0 = lguest_load_rsp0;
+	paravirt_ops.load_tr_desc = lguest_load_tr_desc;
+	paravirt_ops.set_ldt = lguest_set_ldt;
+	paravirt_ops.load_tls = lguest_load_tls;
+	paravirt_ops.set_debugreg = lguest_set_debugreg;
+	paravirt_ops.clts = lguest_clts;
+	paravirt_ops.read_cr0 = lguest_read_cr0;
+	paravirt_ops.write_cr0 = lguest_write_cr0;
+	paravirt_ops.init_IRQ = lguest_init_IRQ;
+	paravirt_ops.read_cr2 = lguest_read_cr2;
+	paravirt_ops.read_cr3 = lguest_read_cr3;
+	paravirt_ops.read_cr4 = lguest_read_cr4;
+	paravirt_ops.write_cr4 = lguest_write_cr4;
+	paravirt_ops.write_ldt_entry = lguest_write_ldt_entry;
+	paravirt_ops.write_gdt_entry = lguest_write_gdt_entry;
+	paravirt_ops.write_idt_entry = lguest_write_idt_entry;
+	paravirt_ops.patch = lguest_patch;
+	paravirt_ops.safe_halt = lguest_safe_halt;
+	paravirt_ops.get_wallclock = lguest_get_wallclock;
+	paravirt_ops.time_init = lguest_time_init;
+#ifdef PARAVIRT_LAZY_NONE
+	paravirt_ops.set_lazy_mode = lguest_lazy_mode;
+#endif
+	paravirt_ops.ebda_info = lguest_ebda_info;
+
+	memset(lguest_data.hcall_status, 0xFF, sizeof(lguest_data.hcall_status));
+#if 0
+	lguest_data.noirq_start = (u64)lguest_noirq_start;
+	lguest_data.noirq_end = (u64)lguest_noirq_end;
+#endif
+	lguest_data.start_kernel_map = __START_KERNEL_map; /* current page offset */
+	lguest_data.page_offset = PAGE_OFFSET;
+
+	code_stack[0].next = __pa(&code_stack[1]);
+	code_stack[0].start = (unsigned long)_stext;
+	code_stack[0].end = (unsigned long)_etext;
+	code_stack[1].next = 0;
+	code_stack[1].start = (unsigned long)_sinittext;
+	code_stack[1].end = (unsigned long)_einittext;
+
+	lguest_data.text = __pa(&code_stack[0]);
+
+	lguest_data.kallsyms_addresses = __pa(&kallsyms_addresses);
+	lguest_data.kallsyms_num_syms = kallsyms_num_syms;
+	lguest_data.kallsyms_names = __pa(&kallsyms_names);
+	lguest_data.kallsyms_token_table = __pa(&kallsyms_token_table);
+	lguest_data.kallsyms_token_index = __pa(&kallsyms_token_index);
+	lguest_data.kallsyms_markers = __pa(&kallsyms_markers);
+
+	hcall(LHCALL_LGUEST_INIT, __pa(&lguest_data), 0, 0);
+
+	lguest_pops = &local_pops;
+	lguest_paravirt = 1;
+
+	memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));
+	lguest_write_cr3(__pa_symbol(&init_level4_pgt));
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_pda(i) = &boot_cpu_pda[i];
+
+	pda_init(0);
+//	copy_bootdata(real_mode_data);
+#ifdef CONFIG_SMP
+	cpu_set(0, cpu_online_map);
+#endif
+
+//	strncpy(boot_command_line, boot->cmdline, COMMAND_LINE_SIZE);
+
+	/* We use top of mem for initial pagetables. */
+//	init_pg_tables_end = __pa(pg0);
+
+//	reserve_top_address(lguest_data.reserve_mem);
+
+	/* FIXME: Better way? */
+	/* Suppress vgacon startup code */
+	SCREEN_INFO.orig_video_isVGA = VIDEO_TYPE_VLFB;
+
+	add_preferred_console("hvc", 0, NULL);
+/*
+#ifdef CONFIG_X86_MCE
+	mcheck_disable(NULL);
+#endif
+*/
+#ifdef CONFIG_ACPI
+	acpi_disabled = 1;
+	acpi_ht = 0;
+#endif
+	if (boot->initrd_size) {
+		/* We stash this at top of memory. */
+		INITRD_START = boot->max_pfn*PAGE_SIZE - boot->initrd_size;
+		INITRD_SIZE = boot->initrd_size;
+		LOADER_TYPE = 0xFF;
+	}
+	pm_power_off = lguest_power_off;
+
+	start_kernel();
+}
+
+asm("lguest_maybe_init:\n"
+    "	cmpq $"__stringify(LGUEST_MAGIC_R13)", %r13\n"
+    "	jne 1f\n"
+    "	cmpq $"__stringify(LGUEST_MAGIC_R14)", %r14\n"
+    "	jne 1f\n"
+    "	cmpq $"__stringify(LGUEST_MAGIC_R15)", %r15\n"
+    "	je lguest_init\n"
+    "1: ret");
+
+extern void asmlinkage lguest_maybe_init(void);
+paravirt_probe(lguest_maybe_init);
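As an aside on the file above: async_hcall() batches hypercalls into the lguest_data.hcalls ring, marking a slot pending only after the arguments are written, and falls back to a synchronous hcall (which flushes the table) when the ring is full. The queueing logic can be sketched standalone in userspace; the names RING_SIZE, async_call and forced_sync below are illustrative, not part of the patch:

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE  4
#define ST_FREE    0xFF	/* host resets a slot to 0xFF when consumed */
#define ST_PENDING 0x00

struct call { unsigned long nr, a1, a2, a3; };

static unsigned char status[RING_SIZE];
static struct call ring[RING_SIZE];
static unsigned int next_call;
static int forced_sync;	/* counts fallbacks to a synchronous hypercall */

/* Mirror of the guest-side async_hcall() logic: queue the call if the
 * next slot is free, otherwise fall back to a synchronous hypercall,
 * which would flush the whole table on the host side. */
static void async_call(unsigned long nr, unsigned long a1,
		       unsigned long a2, unsigned long a3)
{
	if (status[next_call] != ST_FREE) {
		forced_sync++;	/* table full: real code does hcall() here */
		return;
	}
	ring[next_call] = (struct call){ nr, a1, a2, a3 };
	/* The kernel issues wmb() here: the arguments must be visible
	 * before the status byte flips to "pending". */
	status[next_call] = ST_PENDING;
	if (++next_call == RING_SIZE)
		next_call = 0;
}
```

The real code also wraps the update in local_irq_save()/restore(), and explicitly assumes a uniprocessor guest.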
Index: work-pv/arch/x86_64/lguest/lguest.h
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/lguest.h
@@ -0,0 +1,161 @@
+#ifndef _LGUEST_GUEST_H_
+#define _LGUEST_GUEST_H_
+
+#define GUEST_DPL 0x3
+
+#define gdt_index(x) ((x) >> 3)
+
+/*
+ * Must be less than fixmap!
+ *
+ * To keep the hypervisor from needing any data sections,
+ * we need to hard code the difference between what the hypervisor
+ * may put into the GS base, and what we let the guest put in.
+ * We allow the guest to put in "Kernel addresses" to simplify
+ * the guest PDA code.
+ */
+#define LGUEST_HV_OFFSET_HIGH 0xffffffff
+#define LGUEST_HV_OFFSET_LOW  0xff000000
+
+#define LGUEST_NMI_IST 7
+
+#define LGUEST_MAGIC 0x6c6775657374 /* "lguest" */
+
+#ifndef __ASSEMBLY__
+#include <asm/lguest.h>
+
+extern void switch_to_guest(struct lguest_vcpu *);
+extern unsigned long hcall_teste;
+extern unsigned long host_syscall;
+extern unsigned long _lguest_default_idt_entries[];
+extern unsigned long lguest_hv_addr;
+extern unsigned long lguest_hv_offset;
+extern int lguest_hv_pages;
+extern int lguest_vcpu_pages;
+extern int lguest_vcpu_order;
+extern struct mutex lguest_lock;
+
+/* FIXME: Those would live better in some main kernel header */
+/* Page fault error code bits */
+#define PF_PROT	(1<<0)		/* or no page found */
+#define PF_WRITE	(1<<1)
+#define PF_USER	(1<<2)
+#define PF_RSVD	(1<<3)
+#define PF_INSTR	(1<<4)
+
+#define kill_guest(guest, fmt...)				\
+do {								\
+	if (!(guest)->dead) {					\
+		(guest)->dead = kasprintf(GFP_ATOMIC, fmt);	\
+		if (!(guest)->dead)				\
+			(guest)->dead = (void *)-1;		\
+	}							\
+} while (0)
+
+#define kill_guest_dump(vcpu, fmt...)		\
+do {						\
+	kill_guest((vcpu)->guest, fmt);		\
+	lguest_dump_vcpu_regs(vcpu);		\
+} while (0)
+
+static inline void _lguest_set_gate(struct gate_struct *s, unsigned type, unsigned long func,
+				    unsigned dpl, unsigned ist)
+{
+        s->offset_low = PTR_LOW(func);
+        s->segment = __HV_CS;
+        s->ist = ist;
+        s->p = 1;
+        s->dpl = dpl;
+        s->zero0 = 0;
+        s->zero1 = 0;
+        s->type = type;
+        s->offset_middle = PTR_MIDDLE(func);
+        s->offset_high = PTR_HIGH(func);
+}
+
+static inline unsigned long guest_pa(struct lguest_guest_info *linfo, u64 addr)
+{
+	return (addr >= linfo->start_kernel_map) ?
+		(addr - linfo->start_kernel_map) :
+		(addr - linfo->page_offset);
+}
+
+int lguest_address_ok(const struct lguest_guest_info *, u64);
+
+int demand_page(struct lguest_vcpu *, u64, int);
+/* FIXME: put this in hv_vm.h */
+unsigned long hvvm_get_actual_phys(void *addr, pgprot_t *prot);
+
+int lguest_device_init(void);
+void lguest_device_remove(void);
+
+/* page_tables.h */
+int lguest_map_hv_pages(struct lguest_guest_info *lguest,
+			   unsigned long vaddr, int pages,
+			   pgprot_t *prot);
+int lguest_map_guest_page(struct lguest_guest_info *lguest,
+			  unsigned long vaddr, unsigned long paddr,
+			  pgprot_t prot);
+void lguest_unmap_guest_pages(struct lguest_guest_info *lguest,
+			      unsigned long vaddr, int pages);
+void lguest_free_guest_pages(struct lguest_guest_info *lguest);
+
+void *lguest_mem_addr(struct lguest_vcpu *vcpu, u64 vaddr);
+
+void guest_set_pte(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long idx);
+void guest_set_pmd(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long val);
+void guest_set_pud(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long val);
+void guest_set_pgd(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long val);
+void guest_flush_tlb_single(struct lguest_vcpu *vcpu, u64 cr3, u64 vaddr);
+void guest_pagetable_clear_all(struct lguest_vcpu *vcpu);
+void guest_pagetable_flush_user(struct lguest_vcpu *vcpu);
+void guest_new_pagetable(struct lguest_vcpu *vcpu, u64 pgtable);
+
+int init_guest_pagetable(struct lguest_guest_info *linfo, u64 pgtable);
+int lguest_init_vcpu_pagetable(struct lguest_vcpu *vcpu);
+
+int hypercall(struct lguest_vcpu *vcpu);
+
+/* core.c */
+u8 lhread_u8(struct lguest_vcpu *vcpu, u64 addr);
+u16 lhread_u16(struct lguest_vcpu *vcpu, u64 addr);
+u64 lhread_u64(struct lguest_vcpu *vcpu, u64 addr);
+void lhwrite_u64(struct lguest_vcpu *vcpu, u64 addr, u64 val);
+
+void lhread(struct lguest_guest_info *, void *, u64, unsigned);
+void lhwrite(struct lguest_guest_info *, u64, const void *, unsigned);
+
+/* io.c */
+u32 bind_dma(struct lguest_guest_info *, unsigned long, unsigned long,
+					u16, u8);
+int send_dma(struct lguest_guest_info *, unsigned long, unsigned long);
+
+/* interrupts_and_traps.c */
+
+void load_guest_idt_entry(struct lguest_vcpu *, unsigned int,
+						struct gate_struct *);
+void maybe_do_interrupt(struct lguest_vcpu *);
+void guest_iret(struct lguest_vcpu *vcpu);
+int reflect_trap(struct lguest_vcpu *, int, int);
+
+/* lguest_debug.c */
+extern int lguest_debug;
+void lgdebug_print(const char *fmt, ...);
+void lgdebug_vprint(const char *fmt, va_list ap);
+void lguest_dump_vcpu_regs(struct lguest_vcpu *vcpu);
+void lguest_dump_trace(struct lguest_vcpu *vcpu, struct lguest_regs *regs);
+void lguest_print_address(struct lguest_vcpu *vcpu, unsigned long address);
+void lguest_print_page_tables(u64 *cr3);
+void lguest_print_guest_page_tables(struct lguest_vcpu *vcpu, u64 cr3);
+
+#endif /* !__ASSEMBLY__ */
+
+#endif
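One detail of the header above worth illustrating: guest_pa() chooses which offset to subtract depending on whether the virtual address lies in the kernel-text mapping or the linear mapping. A minimal sketch of that decision, with made-up offsets (KERNEL_MAP and PAGE_OFF are illustrative constants, not the guest's real start_kernel_map/page_offset values):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative offsets only; the real values come from the guest's
 * lguest_guest_info (start_kernel_map and page_offset). */
#define KERNEL_MAP 0xffffffff80000000ULL
#define PAGE_OFF   0xffff810000000000ULL

/* Same decision as guest_pa(): kernel-text addresses subtract the
 * kernel map base, everything else subtracts the linear-map offset. */
static uint64_t guest_pa(uint64_t addr)
{
	return (addr >= KERNEL_MAP) ? addr - KERNEL_MAP
				    : addr - PAGE_OFF;
}
```

This works because the kernel mapping sits above the linear mapping in the x86_64 address space, so a single comparison distinguishes the two.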
Index: work-pv/arch/x86_64/lguest/lguest_user.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/lguest_user.c
@@ -0,0 +1,436 @@
+/* Userspace control of the guest, via /dev/lguest. */
+#include <linux/uaccess.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <asm/lguest_user.h>
+#include <asm/hv_vm.h>
+#include "lguest.h"
+
+static int next_guest_id;
+
+#if 0
+/* + addr */
+static long user_get_dma(struct lguest *lg, const u32 __user *input)
+{
+	unsigned long addr, udma, irq;
+
+	if (get_user(addr, input) != 0)
+		return -EFAULT;
+	udma = get_dma_buffer(lg, addr, &irq);
+	if (!udma)
+		return -ENOENT;
+
+	/* We put irq number in udma->used_len. */
+	lhwrite_u32(lg, udma + offsetof(struct lguest_dma, used_len), irq);
+	return udma;
+}
+
+/* + irq */
+static int user_send_irq(struct lguest *lg, const u32 __user *input)
+{
+	u32 irq;
+
+	if (get_user(irq, input) != 0)
+		return -EFAULT;
+	if (irq >= LGUEST_IRQS)
+		return -EINVAL;
+	set_bit(irq, lg->irqs_pending);
+	return 0;
+}
+#endif
+
+static ssize_t read(struct file *file, char __user *user, size_t size,
+		    loff_t *off)
+{
+	struct lguest_vcpu *vcpu = file->private_data;
+	struct lguest_guest_info *linfo;
+	int ret;
+
+	if (!vcpu)
+		return -EINVAL;
+
+	linfo = vcpu->guest;
+
+	if (linfo->dead) {
+		size_t len;
+
+		if (linfo->dead == (void *)-1)
+			return -ENOMEM;
+
+		len = min(size, strlen(linfo->dead)+1);
+		if (copy_to_user(user, linfo->dead, len) != 0)
+			return -EFAULT;
+		return len;
+	}
+
+#if 0
+	if (lg->dma_is_pending)
+		lg->dma_is_pending = 0;
+#endif
+
+	ret = run_guest(vcpu, user);
+	if (ret != -EINTR)
+		ret = -ENOENT;
+	return ret;
+}
+
+struct lguest_vcpu *allocate_vcpu(struct lguest_guest_info *linfo)
+{
+	struct lguest_vcpu *vcpu;
+	unsigned long hv_vcpu;
+	int ret;
+
+	vcpu = (void*)__get_free_pages(GFP_KERNEL, lguest_vcpu_order);
+	if (!vcpu)
+		return NULL;
+	memset(vcpu, 0, sizeof(*vcpu));
+
+	ret = hvvm_map_pages(vcpu, lguest_vcpu_pages, &hv_vcpu);
+	if (ret < 0)
+		goto out;
+
+	ret = lguest_map_hv_pages(linfo, hv_vcpu, lguest_vcpu_pages, NULL);
+	if (ret < 0)
+		goto out2;
+
+	vcpu->host_page = (unsigned long)vcpu;
+
+	return (struct lguest_vcpu*)hv_vcpu;
+
+out2:
+	hvvm_unmap_pages(hv_vcpu, lguest_vcpu_pages);
+out:
+	free_pages((unsigned long)vcpu, lguest_vcpu_order);
+
+	return NULL;
+}
+
+void free_vcpu(struct lguest_guest_info *linfo, struct lguest_vcpu *vcpu)
+{
+	unsigned long hv_vcpu = (unsigned long)vcpu;
+	unsigned long host_page = vcpu->host_page;
+
+	/* Unmap the HV-side mappings before freeing the backing pages. */
+	lguest_unmap_guest_pages(linfo, hv_vcpu, lguest_vcpu_pages);
+	hvvm_unmap_pages(hv_vcpu, lguest_vcpu_pages);
+	lguest_free_guest_pages(linfo);
+	free_pages(host_page, lguest_vcpu_order);
+}
+
+#if 0
+static void print_tss(struct ldttss_desc *tss)
+{
+	u64 base;
+	u64 limit;
+	int i;
+	u16 iobp = 0x64;
+
+	base = (tss->base0) + ((u64)tss->base1 << 16) +
+		((u64)tss->base2 << 24) + ((u64)tss->base3 << 32);
+	limit = (tss->limit0) + ((u64)tss->limit1 << 16);
+	if (tss->g)
+		limit <<= 12;
+	printk("    base: %016llx\n", base);
+	printk("   limit: %llx\n", limit);
+	printk("    type: %x\n", tss->type);
+	printk("     dpl: %d\n", tss->dpl);
+	printk("       p: %d\n", tss->p);
+	printk("       g: %d\n", tss->g);
+
+	for (i=0; i < limit; i += 4) {
+		printk("   %8x: %08x\n", i, *(u32*)(base+i));
+		if (i == 0x64) {
+			iobp = (u16)((*(u32*)(base+i))>>16);
+		}
+		if (i >= iobp && *(s32*)(base+i) == -1L)
+			break;
+	}
+}
+#endif
+
+/* should be in some other file ? */
+int vcpu_start(int cpu, struct lguest_guest_info *linfo,
+				unsigned long entry_point,
+				void *pgd)
+{
+	struct lguest_vcpu *vcpu;
+	struct desc_struct *gdt_table;
+	struct lguest_regs *regs;
+	struct ldttss_desc *tss;
+	struct lguest_tss_struct *tss_ptr;
+	u64 target;
+	u64 limit;
+	u64 base;
+	int i;
+
+	if (cpu >= LGUEST_MAX_VCPUS)
+		return -EINVAL;
+
+	vcpu = allocate_vcpu(linfo);
+	if (!vcpu)
+		return -ENOMEM;
+
+	printk("vcpu: %p\n", vcpu);
+
+	/*
+	 * Point back to itself to make it easier to read from gs:base in
+	 * hypervisor.S
+	 */
+	vcpu->vcpu = vcpu;
+	vcpu->magic = LGUEST_MAGIC;
+	gdt_table = cpu_gdt(get_cpu());
+	put_cpu();
+
+	/* Our gdt is basically host's, except for the privilege level */
+	for (i = 0; i < GDT_ENTRIES; i++) {
+		vcpu->gdt_table[i] = gdt_table[i];
+
+		if (!gdt_table[i].type)
+			continue;
+
+		switch (i) {
+		/* Keep the TSS, HV, and host kernel segments the same */
+		case GDT_ENTRY_TSS:
+			/* The TSS will be modified below */
+		case GDT_ENTRY_HV_CS:
+		case GDT_ENTRY_HV_DS:
+		case __KERNEL_CS >> 3:
+		case __KERNEL_DS >> 3:
+			break;
+		default:
+			vcpu->gdt_table[i].dpl = GUEST_DPL;
+		}
+	}
+
+	for (i = 0; i < IDT_ENTRIES; i++) {
+		unsigned dpl = i == LGUEST_TRAP_ENTRY ? GUEST_DPL : 0;
+		/* NMI gets its own stack */
+		int ist = (i == 2) ? LGUEST_NMI_IST :
+			/* temp debug for now */
+			(i == 8) ? 6 :   /* Double Fault */
+//			(i == 13) ? 5 :  /* GPF */
+			0;
+
+		_lguest_set_gate(&vcpu->idt_table[i], 0xe,
+				 _lguest_default_idt_entries[i] +
+				 lguest_hv_offset, dpl, ist);
+	}
+
+	vcpu->gdt.size = 8 * GDT_ENTRIES - 1;
+	vcpu->gdt.address = (unsigned long)&vcpu->gdt_table;
+
+	vcpu->idt.size = 16 * IDT_ENTRIES - 1;
+	vcpu->idt.address = (unsigned long)vcpu->idt_table;
+	rdmsrl(MSR_LSTAR, vcpu->host_syscall);
+
+	vcpu->id = cpu;
+	vcpu->guest = linfo;
+	linfo->vcpu[cpu] = vcpu;
+
+	lguest_init_vcpu_pagetable(vcpu);
+
+	/* setup the tss */
+	tss = (struct ldttss_desc*)&vcpu->gdt_table[GDT_ENTRY_TSS];
+	limit = sizeof(struct lguest_tss_struct) - 1;
+	base = (u64)&vcpu->tss;
+	tss->limit0 = (u16)limit;
+	tss->base0 = (u16)base;
+	tss->base1 = (u8)(base>>16);
+	tss->base2 = (u8)(base>>24);
+	tss->base3 = (u32)(base>>32);
+	tss->type = 0x9;
+	tss->g = 0; /* small tss */
+
+	vcpu->tss.rsp0 = (unsigned long)(&vcpu->regs.size);
+
+	/* NMI can happen at any time, so give it its own stack */
+	vcpu->tss.ist[LGUEST_NMI_IST-1] = (unsigned long)(&vcpu->nmi_stack_end);
+	printk("nmi stack at: %llx\n", vcpu->tss.ist[LGUEST_NMI_IST-1]);
+
+	/* temp debug stuff */
+	vcpu->tss.ist[5-1] = (unsigned long)(&vcpu->gpf_stack_end);
+	vcpu->tss.ist[6-1] = (unsigned long)(&vcpu->df_stack_end);
+	/*
+	 * Load the host nmi stack into the guest tss. This prevents races
+	 * in loading the TR and IDT.
+	 */
+	tss = (struct ldttss_desc *)&gdt_table[GDT_ENTRY_TSS];
+	target = (u64)tss->base0 |
+		((u64)tss->base1 << 16) |
+		((u64)tss->base2 << 24) |
+		((u64)tss->base3 << 32);
+
+	tss_ptr = (struct lguest_tss_struct*)target;
+
+	vcpu->tss.ist[NMI_STACK-1] = tss_ptr->ist[NMI_STACK-1];
+
+	/*
+	 * The rsp0 had better be 16-byte aligned, or the interrupt
+	 * will put the stack at an undesirable location.
+	 */
+	/* Don't remove this test!!! */
+	if (unlikely(vcpu->tss.rsp0 & 0xf)) {
+		printk("HV ALIGNMENT BUG! don't put stack here!!\n");
+		printk(" tss.rsp0 stack was set to %llx\n",
+		       vcpu->tss.rsp0);
+		goto out;
+	}
+
+	vcpu->tss.io_bitmap_base = 0x68;
+	vcpu->tss.io_bitmap[0] = -1UL;
+
+	regs = &vcpu->regs;
+	regs->cr3 = __pa(vcpu->pgdir->pgdir);
+	regs->rax = regs->rbx = regs->rcx = regs->rdx =
+	regs->r8 = regs->r9 = regs->r10 = regs->r11 =
+	regs->r12 = regs->rdi = regs->rsi = regs->rbp = 0;
+	regs->r13 = LGUEST_MAGIC_R13;
+	regs->r14 = LGUEST_MAGIC_R14;
+	regs->r15 = LGUEST_MAGIC_R15;
+	regs->fs = 0;
+	regs->trapnum = 0;
+	regs->errcode = 0;
+	regs->rip = entry_point;
+//	regs->rip = 0x1000100;
+	regs->cs = __USER_CS;
+	regs->rflags = 0x202;   /* Interrupts enabled. */
+	regs->rsp = 0;
+	regs->ss = __USER_DS;
+
+	return 0;
+out:
+	free_vcpu(linfo, vcpu);
+	return -EINVAL;
+}
+
+static int initialize_guest(struct file *file, const u64 __user *input)
+{
+	struct lguest_guest_info *linfo;
+	int err;
+	u64 args[4];
+	int i;
+
+	if (file->private_data)
+		return -EBUSY;
+
+	if (copy_from_user(args, input, sizeof(args)) != 0)
+		return -EFAULT;
+
+	linfo = kzalloc(sizeof(*linfo), GFP_KERNEL);
+	if (!linfo)
+		return -ENOMEM;
+
+	mutex_init(&linfo->page_lock);
+
+	/* FIXME: protect the guest_id counter */
+	linfo->guest_id = ++next_guest_id;
+
+	linfo->pfn_limit = args[0];
+	linfo->page_offset = args[3];
+	linfo->start_kernel_map = args[3];
+
+	INIT_LIST_HEAD(&linfo->pgd_list);
+
+	for (i=0; i < PUD_HASH_SIZE; i++)
+		INIT_LIST_HEAD(&linfo->pud_hash[i]);
+
+	for (i=0; i < PMD_HASH_SIZE; i++)
+		INIT_LIST_HEAD(&linfo->pmd_hash[i]);
+
+	for (i=0; i < PTE_HASH_SIZE; i++)
+		INIT_LIST_HEAD(&linfo->pte_hash[i]);
+
+	err = init_guest_pagetable(linfo, args[1]);
+	if (err) {
+		kfree(linfo);
+		return err;
+	}
+#if 0
+
+	lg->state = setup_guest_state(i, lg->pgdirs[lg->pgdidx].pgdir,args[2]);
+	if (!lg->state) {
+		err = -ENOEXEC;
+		goto release_pgtable;
+	}
+#endif
+	err = vcpu_start(0, linfo, args[2], __va(read_cr3()));
+	if (err < 0)
+		return err;
+
+	file->private_data = linfo->vcpu[0];
+
+	return sizeof(args);
+}
+
+static ssize_t write(struct file *file, const char __user *input,
+		     size_t size, loff_t *off)
+{
+	struct lguest_vcpu *vcpu = file->private_data;
+	u64 req;
+
+	if (get_user(req, input) != 0)
+		return -EFAULT;
+	input += sizeof(req);
+
+	if (req != LHREQ_INITIALIZE && !vcpu)
+		return -EINVAL;
+#if 0
+	if (lg && lg->dead)
+		return -ENOENT;
+#endif
+
+	switch (req) {
+	case LHREQ_INITIALIZE:
+		return initialize_guest(file, (const u64 __user *)input);
+#if 0
+	case LHREQ_GETDMA:
+		return user_get_dma(lg, (const u32 __user *)input);
+	case LHREQ_IRQ:
+		return user_send_irq(lg, (const u32 __user *)input);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+static int close(struct inode *inode, struct file *file)
+{
+	struct lguest_vcpu *vcpu = file->private_data;
+	struct lguest_guest_info *linfo;
+
+	if (!vcpu)
+		return -EBADFD;
+
+	linfo = vcpu->guest;
+	/* FIXME: need to handle multiple vcpus */
+	free_vcpu(linfo, vcpu);
+	kfree(linfo);
+#if 0
+	mutex_lock(&lguest_lock);
+	release_all_dma(lg);
+	free_page((long)lg->trap_page);
+	free_guest_pagetable(lg);
+	mmput(lg->mm);
+	if (lg->dead != (void *)1)
+		kfree(lg->dead);
+	memset(lg->state, 0, sizeof(*lg->state));
+	memset(lg, 0, sizeof(*lg));
+	mutex_unlock(&lguest_lock);
+#endif
+	return 0;
+}
+
+static struct file_operations lguest_fops = {
+	.owner	 = THIS_MODULE,
+	.release = close,
+	.write	 = write,
+	.read	 = read,
+};
+
+static struct miscdevice lguest_dev = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "lguest",
+	.fops	= &lguest_fops,
+};
+
+int __init lguest_device_init(void)
+{
+	return misc_register(&lguest_dev);
+}
+
+void __exit lguest_device_remove(void)
+{
+	misc_deregister(&lguest_dev);
+}
Index: work-pv/arch/x86_64/lguest/page_tables.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/page_tables.c
@@ -0,0 +1,1285 @@
+/* Shadow page table operations.
+ * Copyright (C) Steven Rostedt, Red Hat Inc, 2007
+ * GPL v2 and any later version */
+#include <linux/mm.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/random.h>
+#include <linux/percpu.h>
+#include <asm/tlbflush.h>
+#include <asm/hv_vm.h>
+#include "lguest.h"
+
+/* FIXME: move this to hv_vm.h */
+#define HVVM_END (HVVM_START + HV_VIRT_SIZE)
+
+#define HASH_PUD(x) (((u64)(x)>>PAGE_SHIFT) & (PUD_HASH_SIZE-1))
+#define HASH_PMD(x) (((u64)(x)>>PAGE_SHIFT) & (PMD_HASH_SIZE-1))
+#define HASH_PTE(x) (((u64)(x)>>PAGE_SHIFT) & (PTE_HASH_SIZE-1))
+
+/* guest and host share the same offset into the page tables */
+/* 9 bits at 8 byte increments */
+#define guest_host_idx(vaddr) ((vaddr) & (0x1ff<<3))
+
+
+/* These access the guest versions. */
+static u64 gtoplev(struct lguest_vcpu *vcpu, unsigned long vaddr)
+{
+	unsigned index = pgd_index(vaddr);
+
+	return vcpu->pgdir->cr3 + index * sizeof(u64);
+}
+
+
+#if 0
+
+/* FIXME: we need to put these in and make it more secure! */
+static u32 check_pgtable_entry(struct lguest *lg, u32 entry)
+{
+	if ((entry & (_PAGE_PWT|_PAGE_PSE))
+	    || (entry >> PAGE_SHIFT) >= lg->pfn_limit)
+		kill_guest(lg, "bad page table entry");
+	return entry & ~_PAGE_GLOBAL;
+}
+
+void pin_stack_pages(struct lguest *lg)
+{
+	unsigned int i;
+	u32 stack = lg->state->tss.esp1;
+
+	for (i = 0; i < lg->stack_pages; i++)
+		if (!demand_page(lg, stack - i*PAGE_SIZE, 1))
+			kill_guest(lg, "bad stack page %i@%#x", i, stack);
+}
+
+void free_guest_pagetable(struct lguest *lg)
+{
+	unsigned int i;
+
+	release_all_pagetables(lg);
+	for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
+		free_page((long)lg->pgdirs[i].pgdir);
+}
+
+/* Caller must be preempt-safe */
+void map_trap_page(struct lguest *lg)
+{
+	int cpu = smp_processor_id();
+
+	hypervisor_pte_page(cpu)[0] = (__pa(lg->trap_page)|_PAGE_PRESENT);
+
+	/* Since the hypervisor is less than 4MB, we simply mug the top pte page. */
+	lg->pgdirs[lg->pgdidx].pgdir[HYPERVISOR_PGD_ENTRY] =
+		(__pa(hypervisor_pte_page(cpu))| __PAGE_KERNEL);
+}
+
+#endif
+
+static int __lguest_map_guest_page(struct lguest_guest_info *linfo, u64 *cr3,
+				   unsigned long vaddr, unsigned long paddr,
+				   pgprot_t pprot);
+
+/* Do a virtual -> physical mapping on a user page. */
+static unsigned long get_pfn(unsigned long virtpfn, int write)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long ret = -1UL;
+
+	down_read(&current->mm->mmap_sem);
+	if (get_user_pages(current, current->mm, virtpfn << PAGE_SHIFT,
+			   1, write, 1, &page, &vma) == 1)
+		ret = page_to_pfn(page);
+	up_read(&current->mm->mmap_sem);
+	return ret;
+}
+
+static int is_hv_page(int pgd_idx, int pud_idx, int pmd_idx, int pte_idx)
+{
+	/* Never release the hv pages */
+	u64 addr = (u64)pgd_idx << PGDIR_SHIFT |
+		(u64)pud_idx << PUD_SHIFT |
+		(u64)pmd_idx << PMD_SHIFT |
+		(u64)pte_idx << PAGE_SHIFT;
+	/* sign extend */
+	if (pgd_idx & (1<<8))
+		addr |= 0xffffULL << 48;
+	return (addr >= HVVM_START) &&
+		(addr < (HVVM_START + HV_VIRT_SIZE));
+}
+
+static void release_pte(u64 pte)
+{
+	if (pte & _PAGE_PRESENT)
+		put_page(pfn_to_page(pte >> PAGE_SHIFT));
+}
+
+static int release_pmd(int pgd_idx, int pud_idx, u64 *pmd, int idx)
+{
+	int save = 0;
+	if (pmd[idx] & _PAGE_PRESENT) {
+		int i;
+		u64 *ptepage = __va(pmd[idx] & PTE_MASK);
+		for (i=0; i < PTRS_PER_PMD; i++)
+			if (is_hv_page(pgd_idx, pud_idx, idx, i))
+				save = 1;
+			else
+				release_pte(ptepage[i]);
+		/* never free the HV pmds */
+		if (!save) {
+			free_page((unsigned long)ptepage);
+			pmd[idx] = 0;
+		}
+	}
+	return save;
+}
+
+static int release_pud(int pgd_idx, u64 *pud, int idx)
+{
+	int save = 0;
+	if (pud[idx] & _PAGE_PRESENT) {
+		int i;
+		u64 *pmdpage = __va(pud[idx] & PTE_MASK);
+		for (i=0; i < PTRS_PER_PUD; i++)
+			if (release_pmd(pgd_idx, idx, pmdpage, i))
+				save = 1;
+		/* never free the HV puds */
+		if (!save) {
+			free_page((unsigned long)pmdpage);
+			pud[idx] = 0;
+		}
+	}
+	return save;
+}
+
+static int release_pgd(u64 *pgd, int idx)
+{
+	int save = 0;
+
+	if (pgd[idx] & _PAGE_PRESENT) {
+		int i;
+		u64 *pudpage = __va(pgd[idx] & PTE_MASK);
+		for (i=0; i < PTRS_PER_PGD; i++) {
+			if (release_pud(idx, pudpage, i))
+				save = 1;
+		}
+		/* never free the HV pgd */
+		if (!save) {
+			free_page((unsigned long)pudpage);
+			pgd[idx] = 0;
+		}
+	}
+	return save;
+}
+
+static struct lguest_pgd *find_pgd(struct lguest_guest_info *linfo, u64 cr3)
+{
+	struct lguest_pgd *pgdir;
+
+	list_for_each_entry(pgdir, &linfo->pgd_list, list)
+		if (!(pgdir->flags & LGUEST_PGD_MASTER_FL) && pgdir->cr3 == cr3)
+			break;
+
+	if (pgdir == list_entry(&linfo->pgd_list, struct lguest_pgd, list))
+		return NULL;
+
+	return pgdir;
+}
+
+static struct lguest_pud *find_pud(struct lguest_guest_info *linfo, u64 gpud)
+{
+	unsigned idx = HASH_PUD(gpud);
+	struct lguest_pud *pudir;
+
+	list_for_each_entry(pudir, &linfo->pud_hash[idx], list)
+		if (pudir->gpud == gpud)
+			break;
+
+	if (pudir == list_entry(&linfo->pud_hash[idx], struct lguest_pud, list))
+		return NULL;
+
+	return pudir;
+}
+
+static struct lguest_pmd *find_pmd(struct lguest_guest_info *linfo, u64 gpmd)
+{
+	unsigned idx = HASH_PMD(gpmd);
+	struct lguest_pmd *pmdir;
+
+	list_for_each_entry(pmdir, &linfo->pmd_hash[idx], list)
+		if (pmdir->gpmd == gpmd)
+			break;
+
+	if (pmdir == list_entry(&linfo->pmd_hash[idx], struct lguest_pmd, list))
+		return NULL;
+
+	return pmdir;
+}
+
+static struct lguest_pte *find_pte(struct lguest_guest_info *linfo, u64 gpte)
+{
+	unsigned idx = HASH_PTE(gpte);
+	struct lguest_pte *pte;
+
+	list_for_each_entry(pte, &linfo->pte_hash[idx], list)
+		if (pte->gpte == gpte)
+			break;
+
+	if (pte == list_entry(&linfo->pte_hash[idx], struct lguest_pte, list))
+		return NULL;
+
+	return pte;
+}
+
+static void __release_pte_hash(struct lguest_vcpu *vcpu, struct lguest_pte *pte)
+{
+	list_del(&pte->list);
+	kfree(pte);
+}
+
+static void __release_pmd_hash(struct lguest_vcpu *vcpu, struct lguest_pmd *pmdir)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pte *pte;
+	int i;
+
+	list_del(&pmdir->list);
+
+	for (i=0; i < PTRS_PER_PMD; i++) {
+		u64 gpte;
+
+		gpte = lhread_u64(vcpu, pmdir->gpmd+i*sizeof(u64));
+		if (!gpte)
+			continue;
+		pte = find_pte(linfo, gpte & PTE_MASK);
+		if (!pte)
+			continue;
+		__release_pte_hash(vcpu, pte);
+	}
+
+	kfree(pmdir);
+}
+
+static void __release_pud_hash(struct lguest_vcpu *vcpu, struct lguest_pud *pudir)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pmd *pmdir;
+	int i;
+
+	list_del(&pudir->list);
+
+	for (i=0; i < PTRS_PER_PUD; i++) {
+		u64 gpmd;
+
+		gpmd = lhread_u64(vcpu, pudir->gpud+i*sizeof(u64));
+		if (!gpmd)
+			continue;
+		pmdir = find_pmd(linfo, gpmd & PTE_MASK);
+		if (!pmdir)
+			continue;
+		__release_pmd_hash(vcpu, pmdir);
+	}
+
+	kfree(pudir);
+}
+
+static struct lguest_pud *hash_pud(struct lguest_vcpu *vcpu, u64 gpud, unsigned idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pud *pudir;
+	unsigned h;
+
+	mutex_lock(&linfo->page_lock);
+	pudir = find_pud(linfo, gpud);
+	if (!pudir) {
+		/* FIXME: make this a slab? */
+		pudir = kzalloc(sizeof(*pudir), GFP_KERNEL);
+		if (!pudir)
+			goto out;
+		h = HASH_PUD(gpud);
+		list_add(&pudir->list, &linfo->pud_hash[h]);
+		pudir->pgdir = vcpu->pgdir;
+		pudir->gpud = gpud;
+		pudir->idx = idx;
+	}
+out:
+	mutex_unlock(&linfo->page_lock);
+
+	return pudir;
+}
+
+static struct lguest_pmd *hash_pmd(struct lguest_vcpu *vcpu, struct lguest_pud *pudir,
+				   u64 gpmd, unsigned idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pmd *pmdir;
+	unsigned h;
+
+	mutex_lock(&linfo->page_lock);
+	pmdir = find_pmd(linfo, gpmd);
+	if (!pmdir) {
+		/* FIXME: make this a slab? */
+		pmdir = kzalloc(sizeof(*pmdir), GFP_KERNEL);
+		if (!pmdir)
+			goto out;
+		h = HASH_PMD(gpmd);
+		list_add(&pmdir->list, &linfo->pmd_hash[h]);
+		pmdir->pudir = pudir;
+		pmdir->gpmd = gpmd;
+		pmdir->idx = idx;
+	}
+out:
+	mutex_unlock(&linfo->page_lock);
+
+	return pmdir;
+}
+
+static struct lguest_pte *hash_pte(struct lguest_vcpu *vcpu, struct lguest_pmd *pmdir,
+				   u64 gpte, unsigned idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pte *pte;
+	unsigned h;
+
+	mutex_lock(&linfo->page_lock);
+	pte = find_pte(linfo, gpte);
+	if (!pte) {
+		/* FIXME: make this a slab? */
+		pte = kzalloc(sizeof(*pte), GFP_KERNEL);
+		if (!pte)
+			goto out;
+		h = HASH_PTE(gpte);
+		list_add(&pte->list, &linfo->pte_hash[h]);
+		pte->pmdir = pmdir;
+		pte->gpte = gpte;
+		pte->idx = idx;
+	}
+out:
+	mutex_unlock(&linfo->page_lock);
+
+	return pte;
+}
+
+void guest_set_pte(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long vaddr,
+		   unsigned long value)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pud *pudir;
+	struct lguest_pmd *pmdir;
+	struct lguest_pte *ptedir;
+	unsigned long idx = (vaddr & (PAGE_SIZE-1)) / 8;
+	u64 base = vaddr & PTE_MASK;
+	u64 pgd;
+	u64 pud;
+	u64 pmd;
+	u64 pte;
+	u64 *pudpage;
+	u64 *pmdpage;
+	u64 *ptepage;
+
+	mutex_lock(&linfo->page_lock);
+
+	ptedir = find_pte(linfo, base);
+	if (!ptedir)
+		goto out;
+
+	pmdir = ptedir->pmdir;
+	pudir = pmdir->pudir;
+
+	pgd = vcpu->pgdir->pgdir[pudir->idx];
+	if (!(pgd & _PAGE_PRESENT))
+		goto out;
+
+	pudpage = __va(pgd & PTE_MASK);
+	pud = pudpage[pmdir->idx];
+
+	if (!(pud & _PAGE_PRESENT))
+		goto out;
+
+	pmdpage = __va(pud & PTE_MASK);
+	pmd = pmdpage[ptedir->idx];
+
+	if (!(pmd & _PAGE_PRESENT))
+		goto out;
+
+	ptepage = __va(pmd & PTE_MASK);
+	pte = ptepage[idx];
+
+	if (!(pte & _PAGE_PRESENT))
+		goto out;
+
+	/* If the guest is trying to touch HV area, kill it! */
+	if (is_hv_page(pudir->idx, pmdir->idx, ptedir->idx, idx)) {
+		kill_guest_dump(vcpu, "guest trying to write to HV area\n");
+		goto out;
+	}
+
+	/* FIXME: perhaps we could set the pte now ? */
+
+	release_pte(ptepage[idx]);
+	__release_pte_hash(vcpu, ptedir);
+
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+void guest_set_pmd(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pud *pudir;
+	struct lguest_pmd *pmdir;
+	u64 pgd;
+	u64 pud;
+	u64 pmd;
+	u64 *pudpage;
+	u64 *pmdpage;
+	int save;
+
+	if (idx >= PTRS_PER_PMD) {
+		kill_guest_dump(vcpu, "illegal index for pmd (%ld)\n", idx);
+		return;
+	}
+
+	mutex_lock(&linfo->page_lock);
+
+	pmdir = find_pmd(linfo, base);
+	if (!pmdir)
+		goto out;
+
+	pudir = pmdir->pudir;
+
+	pgd = vcpu->pgdir->pgdir[pudir->idx];
+	if (!(pgd & _PAGE_PRESENT))
+		goto out;
+
+	pudpage = __va(pgd & PTE_MASK);
+	pud = pudpage[pmdir->idx];
+
+	if (!(pud & _PAGE_PRESENT))
+		goto out;
+
+	pmdpage = __va(pud & PTE_MASK);
+	pmd = pmdpage[idx];
+
+	if (!(pmd & _PAGE_PRESENT))
+		goto out;
+
+	save = release_pmd(pudir->idx, pmdir->idx, pmdpage, idx);
+	if (!save)
+		__release_pmd_hash(vcpu, pmdir);
+
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+void guest_set_pud(struct lguest_vcpu *vcpu,
+		   unsigned long cr3, unsigned long base,
+		   unsigned long idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pud *pudir;
+	u64 pgd;
+	u64 pud;
+	u64 *pudpage;
+	int save;
+
+	if (idx >= PTRS_PER_PUD) {
+		kill_guest_dump(vcpu, "illegal index for pud (%ld)\n", idx);
+		return;
+	}
+
+	mutex_lock(&linfo->page_lock);
+
+	pudir = find_pud(linfo, base);
+	if (!pudir)
+		goto out;
+
+	pgd = vcpu->pgdir->pgdir[pudir->idx];
+	if (!(pgd & _PAGE_PRESENT))
+		goto out;
+
+	pudpage = __va(pgd & PTE_MASK);
+	pud = pudpage[idx];
+
+	if (!(pud & _PAGE_PRESENT))
+		goto out;
+
+	save = release_pud(pudir->idx, pudpage, idx);
+	if (!save)
+		__release_pud_hash(vcpu, pudir);
+
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+void guest_set_pgd(struct lguest_vcpu *vcpu, unsigned long cr3,
+		   unsigned long base, unsigned long idx)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pgd *pgdir;
+	struct lguest_pud *pudir;
+	u64 gpud;
+	u64 pgd;
+	u64 pud;
+	int save;
+
+	pgdir = vcpu->pgdir;
+
+	if (idx >= PTRS_PER_PGD) {
+		kill_guest_dump(vcpu, "illegal index for pgd (%ld)\n", idx);
+		return;
+	}
+
+	mutex_lock(&linfo->page_lock);
+
+	pgd = pgdir->pgdir[idx];
+	if (!(pgd & _PAGE_PRESENT))
+		goto out;
+
+	pud = pgd & PTE_MASK;
+
+	gpud = lhread_u64(vcpu, base + idx * sizeof(u64));
+	pudir = find_pud(linfo, gpud & PTE_MASK);
+	if (pudir)
+		__release_pud_hash(vcpu, pudir);
+	save = release_pgd(pgdir->pgdir, idx);
+
+	if (!save && idx >= pgd_index(linfo->page_offset)) {
+		/*
+		 * All guest processes share the same kernel PML4Es,
+		 * so we only free the tree once, but then reset
+		 * all the others.
+		 */
+		list_for_each_entry(pgdir, &linfo->pgd_list, list) {
+			pgd = pgdir->pgdir[idx];
+			if (!(pgd & _PAGE_PRESENT))
+				continue;
+			BUG_ON((pgd & PTE_MASK) != pud);
+			pgdir->pgdir[idx] = 0;
+		}
+	}
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+void guest_flush_tlb_single(struct lguest_vcpu *vcpu, u64 cr3, u64 vaddr)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pgd *pgdir;
+	unsigned long pgd_idx;
+	unsigned long pud_idx;
+	unsigned long pmd_idx;
+	unsigned long idx;
+	u64 pgd;
+	u64 pud;
+	u64 pmd;
+	u64 pte;
+	u64 *pudpage;
+	u64 *pmdpage;
+	u64 *ptepage;
+
+	mutex_lock(&linfo->page_lock);
+
+	if (vaddr >= linfo->page_offset)
+		pgdir = &linfo->kpgdir;
+	else
+		pgdir = find_pgd(linfo, cr3);
+	if (!pgdir)
+		goto out;
+
+	pgd_idx = pgd_index(vaddr);
+	pgd = pgdir->pgdir[pgd_idx];
+	if (!(pgd & _PAGE_PRESENT))
+		goto out;
+
+	pud_idx = pud_index(vaddr);
+	pudpage = __va(pgd & PTE_MASK);
+	pud = pudpage[pud_idx];
+
+	if (!(pud & _PAGE_PRESENT))
+		goto out;
+
+	pmd_idx = pmd_index(vaddr);
+	pmdpage = __va(pud & PTE_MASK);
+	pmd = pmdpage[pmd_idx];
+
+	if (!(pmd & _PAGE_PRESENT))
+		goto out;
+
+	idx = pte_index(vaddr);
+	ptepage = __va(pmd & PTE_MASK);
+	pte = ptepage[idx];
+
+	if (!(pte & _PAGE_PRESENT))
+		goto out;
+
+	/* If the guest is trying to touch HV area, kill it! */
+	if (is_hv_page(pgd_idx, pud_idx, pmd_idx, idx)) {
+		kill_guest_dump(vcpu, "guest trying to touch HV area\n");
+		goto out;
+	}
+
+	release_pte(ptepage[idx]);
+	/* FIXME: what about the hash?? */
+
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+static void flush_user_mappings(struct lguest_guest_info *linfo, struct lguest_pgd *pgdir)
+{
+	unsigned int i;
+	for (i = 0; i < pgd_index(linfo->page_offset); i++)
+		release_pgd(pgdir->pgdir, i);
+}
+
+static struct lguest_pgd *new_pgdir(struct lguest_guest_info *linfo, u64 cr3)
+{
+	unsigned int next;
+	unsigned int i;
+
+	next = random32() % LGUEST_PGDIRS;
+	for (i=(next+1) % LGUEST_PGDIRS; i != next; i = (i+1) % LGUEST_PGDIRS) {
+		if (linfo->pgdirs[i].flags & LGUEST_PGD_BUSY_FL)
+			continue;
+		break;
+	}
+	BUG_ON(linfo->pgdirs[i].flags & LGUEST_PGD_BUSY_FL);
+
+	next = i;
+
+	linfo->pgdirs[next].cr3 = cr3;
+	if (!linfo->pgdirs[next].pgdir) {
+		linfo->pgdirs[next].pgdir = (u64 *)get_zeroed_page(GFP_KERNEL);
+		if (!linfo->pgdirs[next].pgdir)
+			return NULL;
+		/* all kernel pages are the same */
+		for (i=pgd_index(linfo->page_offset); i < PTRS_PER_PGD; i++)
+			linfo->pgdirs[next].pgdir[i] = linfo->kpgdir.pgdir[i];
+	} else {
+		BUG_ON(!(linfo->pgdirs[next].flags & LGUEST_PGD_LINK_FL));
+		/* Release all the non-kernel mappings. */
+		flush_user_mappings(linfo, &linfo->pgdirs[next]);
+	}
+
+	return &linfo->pgdirs[next];
+}
+
+void guest_new_pagetable(struct lguest_vcpu *vcpu, u64 pgtable)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pgd *newpgdir;
+
+	mutex_lock(&linfo->page_lock);
+	newpgdir = find_pgd(linfo, pgtable);
+	if (vcpu->pgdir) {
+		if (!(--vcpu->pgdir->count))
+			vcpu->pgdir->flags &= ~(LGUEST_PGD_BUSY_FL);
+	}
+	if (!newpgdir)
+		newpgdir = new_pgdir(linfo, pgtable);
+	if (!newpgdir) {
+		kill_guest_dump(vcpu, "no more pgd's available!\n");
+		goto out;
+	}
+	vcpu->pgdir = newpgdir;
+	if (!vcpu->pgdir->count++)
+		vcpu->pgdir->flags |= LGUEST_PGD_BUSY_FL;
+	vcpu->regs.cr3 = __pa(vcpu->pgdir->pgdir);
+	if (!(vcpu->pgdir->flags & LGUEST_PGD_LINK_FL)) {
+		list_add(&vcpu->pgdir->list, &linfo->pgd_list);
+		vcpu->pgdir->flags |= LGUEST_PGD_LINK_FL;
+	}
+	/* pin_stack_pages(lg); */
+out:
+	mutex_unlock(&linfo->page_lock);
+}
+
+static void release_all_pagetables(struct lguest_guest_info *linfo)
+{
+	struct lguest_pgd *pgdir, *next;
+	int i;
+
+	/* We share the kernel pages, so do them once */
+	for (i=0; i < PTRS_PER_PGD; i++)
+		release_pgd(linfo->kpgdir.pgdir, i);
+
+	list_for_each_entry(pgdir, &linfo->pgd_list, list) {
+		if (pgdir->pgdir)
+			for (i=0; i < pgd_index(linfo->page_offset); i++)
+				release_pgd(pgdir->pgdir, i);
+	}
+	/* now release any pgdirs that are not busy */
+	list_for_each_entry_safe(pgdir, next, &linfo->pgd_list, list) {
+		if (!(pgdir->flags & LGUEST_PGD_BUSY_FL)) {
+			BUG_ON(pgdir->count);
+			pgdir->flags &= ~LGUEST_PGD_LINK_FL;
+			list_del(&pgdir->list);
+			free_page((u64)pgdir->pgdir);
+			pgdir->cr3 = 0;
+			pgdir->pgdir = NULL;
+		}
+	}
+}
+
+void guest_pagetable_clear_all(struct lguest_vcpu *vcpu)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+
+	mutex_lock(&linfo->page_lock);
+	release_all_pagetables(linfo);
+	/* pin_stack_pages(lg); */
+	mutex_unlock(&linfo->page_lock);
+}
+
+void guest_pagetable_flush_user(struct lguest_vcpu *vcpu)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	unsigned int i;
+
+	for (i = 0; i < pgd_index(linfo->page_offset); i++)
+		release_pgd(vcpu->pgdir->pgdir, i);
+}
+
+/* FIXME: We hold reference to pages, which prevents them from being
+   swapped.  It'd be nice to have a callback when Linux wants to swap out. */
+
+/* We fault pages in, which allows us to update accessed/dirty bits.
+ * Return 0 if failed, 1 if good */
+static int page_in(struct lguest_vcpu *vcpu, u64 vaddr, pgprot_t prot)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_pud *pudir;
+	struct lguest_pmd *pmdir;
+	struct lguest_pte *ptedir;
+	u64 val;
+	u64 paddr;
+	u64 gpgd, gpud, gpmd, gpte;
+	u64 flags = pgprot_val(prot);
+	int write;
+	int ret;
+
+	gpgd = gtoplev(vcpu, vaddr);
+	val = lhread_u64(vcpu, gpgd);
+	if (!(val & _PAGE_PRESENT)) {
+		printk("pgd not present pgd:%llx vaddr:%llx val:%llx\n", gpgd, vaddr, val);
+		return 0;
+	}
+
+	gpud = val & PTE_MASK;
+
+	pudir = hash_pud(vcpu, gpud, pgd_index(vaddr));
+	if (!pudir)
+		return 0; /* -ENOMEM */
+
+	if (vaddr >= linfo->page_offset)
+		pudir->flags |= LGUEST_PUD_KERNEL_FL;
+
+	gpud += pud_index(vaddr) * sizeof(u64);
+	val = lhread_u64(vcpu, gpud);
+	if (!(val & _PAGE_PRESENT)) {
+		printk("pud not present?\n");
+		return 0;
+	}
+
+	gpmd = val & PTE_MASK;
+
+	pmdir = hash_pmd(vcpu, pudir, gpmd, pud_index(vaddr));
+	if (!pmdir)
+		return 0; /* -ENOMEM */
+
+	if (vaddr >= linfo->page_offset)
+		pmdir->flags |= LGUEST_PMD_KERNEL_FL;
+
+	gpmd += pmd_index(vaddr) * sizeof(u64);
+	val = lhread_u64(vcpu, gpmd);
+	if (!(val & _PAGE_PRESENT)) {
+		printk("pmd not present?\n");
+		return 0;
+	}
+
+	/* The guest might have set up a 2M page */
+	if (val & (1<<7)) {
+		/* 2M pages */
+		/*
+		 * Although the guest may have mapped this as a 2M page,
+		 * we haven't and won't, so we still need to find the 4K
+		 * page position.
+		 */
+		paddr = val & ~(PMD_SIZE-1);
+		paddr += pte_index(vaddr) << PAGE_SHIFT;
+		paddr &= PTE_MASK; /* can still have the NX bit set */
+	} else {
+		/* 4K pages */
+		gpte = val & PTE_MASK;
+
+		ptedir = hash_pte(vcpu, pmdir, gpte, pmd_index(vaddr));
+		if (!ptedir)
+			return 0; /* -ENOMEM */
+
+		gpte += pte_index(vaddr) * sizeof(u64);
+		val = lhread_u64(vcpu, gpte);
+		if (!(val & _PAGE_PRESENT) || ((flags & _PAGE_DIRTY) && !(val & _PAGE_RW))) {
+			printk("pte not present or not writable?\n");
+			return 0;
+		}
+		/* this is the guest's paddr */
+		paddr = val & PTE_MASK;
+
+	}
+
+	/* FIXME: check these values */
+
+	/*
+	 * FIXME: if this isn't write, we lose the lguest_data when we do
+	 *  a put_user in the hypercall init.
+	 */
+	write = 1; /* val & _PAGE_DIRTY ? 1 : 0 */
+
+	val = get_pfn(paddr >> PAGE_SHIFT, write);
+	if (val == (unsigned long)-1UL) {
+		printk("bad 1\n");
+		kill_guest_dump(vcpu, "page %llx not mapped", paddr);
+		return 0;
+	}
+
+	/* now we have the actual paddr */
+	val <<= PAGE_SHIFT;
+
+	ret = __lguest_map_guest_page(vcpu->guest, vcpu->pgdir->pgdir,
+				      vaddr, val, __pgprot(flags));
+	if (ret < 0) {
+		printk("bad 2\n");
+		kill_guest_dump(vcpu, "can't map page");
+		return 0;
+	}
+	return 1;
+}
+
+int demand_page(struct lguest_vcpu *vcpu, u64 vaddr, int write)
+{
+	return page_in(vcpu, vaddr, (write ? PAGE_SHARED_EXEC : PAGE_COPY_EXEC));
+}
+
+
+static pud_t *pud_from_index(unsigned long addr, unsigned index)
+{
+	pud_t *pud = (pud_t*)addr;
+
+	return &pud[index];
+}
+
+static pmd_t *pmd_from_index(unsigned long addr, unsigned index)
+{
+	pmd_t *pmd = (pmd_t*)addr;
+
+	return &pmd[index];
+}
+
+static pte_t *pte_from_index(unsigned long addr, unsigned index)
+{
+	pte_t *pte = (pte_t*)addr;
+
+	return &pte[index];
+}
+
+static int __lguest_map_guest_pte(pmd_t *pmd, unsigned long vaddr,
+				  unsigned long paddr, pgprot_t prot)
+{
+	unsigned long page;
+	pte_t *pte;
+	unsigned index;
+
+	page = pmd_page_vaddr(*pmd);
+
+	index = pte_index(vaddr);
+	pte = pte_from_index(page, index);
+	if (pte_val(*pte) & _PAGE_PRESENT &&
+	    pte_val(*pte) == pte_val(pfn_pte(paddr>>PAGE_SHIFT, prot)) ) {
+		printk("strange page faulting!\n");
+		printk("paddr=%lx (paddr)=%lx\n", paddr, *(unsigned long *)__va(paddr));
+		printk("vaddr: %lx pte %x val: %lx\n", vaddr, index, pte_val(*pte));
+	}
+
+	set_pte(pte, mk_pte(pfn_to_page(paddr >> PAGE_SHIFT), prot));
+
+	return 0;
+}
+
+static int __lguest_map_guest_pmd(pud_t *pud, unsigned long vaddr, unsigned long paddr,
+				  pgprot_t prot)
+{
+	unsigned long page;
+	pmd_t *pmd;
+	unsigned index;
+
+	page = pud_page_vaddr(*pud);
+
+	index = pmd_index(vaddr);
+	pmd = pmd_from_index(page, index);
+	if (!pmd_val(*pmd)) {
+		page = get_zeroed_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(page)));
+	}
+
+	return __lguest_map_guest_pte(pmd, vaddr, paddr, prot);
+}
+
+static int __lguest_map_guest_pud(pgd_t *pgd, unsigned long vaddr, unsigned long paddr,
+				  pgprot_t prot)
+{
+	unsigned long page;
+	pud_t *pud;
+	unsigned index;
+
+	page = pgd_page_vaddr(*pgd);
+
+	index = pud_index(vaddr);
+	pud = pud_from_index(page, index);
+	if (!pud_val(*pud)) {
+		page = get_zeroed_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		set_pud(pud, __pud(_PAGE_TABLE | __pa(page)));
+	}
+
+	return __lguest_map_guest_pmd(pud, vaddr, paddr, prot);
+}
+
+static int __lguest_map_guest_pgd(u64 *cr3,
+				  unsigned long vaddr, unsigned long paddr,
+				  pgprot_t prot)
+{
+	unsigned long page;
+	unsigned index;
+	pgd_t *pgd;
+
+	index = pgd_index(vaddr);
+	pgd = (pgd_t*)&cr3[index];
+	if (!pgd_val(*pgd)) {
+		page = get_zeroed_page(GFP_KERNEL);
+		if (!page)
+			return -ENOMEM;
+		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(page)));
+	}
+
+	return __lguest_map_guest_pud(pgd, vaddr, paddr, prot);
+}
+
+static int __lguest_map_guest_page(struct lguest_guest_info *linfo, u64 *cr3,
+				   unsigned long vaddr, unsigned long paddr,
+				   pgprot_t prot)
+{
+	int ret;
+
+	ret = __lguest_map_guest_pgd(cr3, vaddr, paddr, prot);
+	if (ret < 0)
+		return ret;
+
+	/* All guest kernel pages are the same */
+	if (vaddr >= linfo->page_offset) {
+		struct lguest_pgd *pgdir;
+		unsigned index;
+		pgd_t *pgd;
+		u64 val;
+
+		index = pgd_index(vaddr);
+		pgd = (pgd_t*)&cr3[index];
+		val = pgd_val(*pgd);
+
+		list_for_each_entry(pgdir, &linfo->pgd_list, list)
+			pgdir->pgdir[index] = val;
+	}
+	return ret;
+}
+
+static void __lguest_unmap_page_pmd(pmd_t *pmd, unsigned long vaddr)
+{
+	pte_t *pte;
+	unsigned index;
+	unsigned long page;
+
+	page = pmd_page_vaddr(*pmd);
+
+	index = pte_index(vaddr);
+	pte = pte_from_index(page, index);
+	if (pte_val(*pte) & 1)
+		set_pte(pte, __pte(0));
+}
+
+static void __lguest_unmap_page_pud(pud_t *pud, unsigned long vaddr)
+{
+	pmd_t *pmd;
+	unsigned index;
+	unsigned long page;
+
+	page = pud_page_vaddr(*pud);
+
+	index = pmd_index(vaddr);
+	pmd = pmd_from_index(page, index);
+	if (pmd_val(*pmd) & 1)
+		__lguest_unmap_page_pmd(pmd, vaddr);
+}
+
+static void __lguest_unmap_page_pgd(pgd_t *pgd, unsigned long vaddr)
+{
+	pud_t *pud;
+	unsigned index;
+	unsigned long page;
+
+	page = pgd_page_vaddr(*pgd);
+
+	index = pud_index(vaddr);
+	pud = pud_from_index(page, index);
+	if (pud_val(*pud) & 1)
+		__lguest_unmap_page_pud(pud, vaddr);
+}
+
+static void __lguest_unmap_guest_page(struct lguest_guest_info *linfo,
+				      unsigned long vaddr)
+{
+	pgd_t *pgd;
+	unsigned index;
+	u64 *cr3 = linfo->kpgdir.pgdir;
+
+	if (!cr3)
+		return;
+
+	index = pgd_index(vaddr);
+	pgd = (pgd_t*)&cr3[index];
+	if (!(pgd_val(*pgd)&1))
+		return;
+
+	__lguest_unmap_page_pgd(pgd, vaddr);
+}
+
+int lguest_map_hv_pages(struct lguest_guest_info *lguest,
+			unsigned long vaddr, int pages,
+			pgprot_t *pprot)
+{
+	unsigned long page;
+	int i;
+	int ret;
+	pgprot_t prot;
+
+	ret = -ENOMEM;
+	for (i=0; i < pages; i++) {
+		/* now add the page we want */
+		page = hvvm_get_actual_phys((void*)vaddr+PAGE_SIZE*i, &prot);
+		if (!page)
+			goto failed;
+
+		if (pprot)
+			prot = *pprot;
+		ret = __lguest_map_guest_page(lguest, lguest->kpgdir.pgdir,
+					      vaddr+PAGE_SIZE*i, page, prot);
+		if (ret < 0)
+			goto failed;
+	}
+	return 0;
+failed:
+	for (--i; i >= 0; i--)
+		__lguest_unmap_guest_page(lguest, vaddr+PAGE_SIZE*i);
+	return ret;
+}
+
+/**
+ * lguest_mem_addr - retrieve page that's mapped from guest.
+ * @vcpu: lguest vcpu descriptor.
+ * @addr: address to get from the guest's address space.
+ *
+ *  ONLY USE WHEN ALL ELSE FAILS!
+ */
+void *lguest_mem_addr(struct lguest_vcpu *vcpu, u64 addr)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	u64 *cr3 = linfo->kpgdir.pgdir;
+	unsigned long page;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned index = pgd_index(addr);
+
+	pgd = (pgd_t*)&cr3[index];
+	if (!(pgd_val(*pgd) & 1))
+		return NULL;
+
+	page = pgd_page_vaddr(*pgd);
+	index = pud_index(addr);
+	pud = pud_from_index(page, index);
+	if (!(pud_val(*pud) & 1))
+		return NULL;
+
+	page = pud_page_vaddr(*pud);
+	index = pmd_index(addr);
+	pmd = pmd_from_index(page, index);
+	if (!(pmd_val(*pmd) & 1))
+		return NULL;
+
+	page = pmd_page_vaddr(*pmd);
+	index = pte_index(addr);
+	pte = pte_from_index(page, index);
+	if (!(pte_val(*pte) & 1))
+		return NULL;
+
+	page = ((pte_val(*pte) & PAGE_MASK) + (addr & (PAGE_SIZE-1)));
+
+	return (void *)(page + PAGE_OFFSET);
+}
+
+void __lguest_free_guest_pmd(pmd_t *pmd)
+{
+	pte_t *pte;
+	unsigned long page;
+	int i;
+
+	page = pmd_page_vaddr(*pmd);
+
+	for (i=0; i < PTRS_PER_PTE; i++) {
+		pte = pte_from_index(page, i);
+		if (!(pte_val(*pte) & 1))
+			continue;
+		/* FIXME: do some checks here??? */
+	}
+	set_pmd(pmd, __pmd(0));
+	free_page(page);
+}
+
+void __lguest_free_guest_pud(pud_t *pud)
+{
+	pmd_t *pmd;
+	unsigned long page;
+	int i;
+
+	page = pud_page_vaddr(*pud);
+
+	for (i=0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_from_index(page, i);
+		if (!(pmd_val(*pmd) & 1))
+			continue;
+		__lguest_free_guest_pmd(pmd);
+	}
+	set_pud(pud, __pud(0));
+	free_page(page);
+}
+
+void __lguest_free_guest_pgd(pgd_t *pgd)
+{
+	pud_t *pud;
+	unsigned long page;
+	int i;
+
+	page = pgd_page_vaddr(*pgd);
+
+	for (i=0; i < PTRS_PER_PUD; i++) {
+		pud = pud_from_index(page, i);
+		if (!(pud_val(*pud) & 1))
+			continue;
+		__lguest_free_guest_pud(pud);
+	}
+	set_pgd(pgd, __pgd(0));
+	free_page(page);
+}
+
+void __lguest_free_guest_pages(u64 *cr3)
+{
+	pgd_t *pgd;
+	int i;
+
+	if (!cr3)
+		return;
+
+	for (i=0; i < PTRS_PER_PGD; i++) {
+		pgd = (pgd_t*)&cr3[i];
+		if (!(pgd_val(*pgd) & 1))
+			continue;
+		__lguest_free_guest_pgd(pgd);
+	}
+	free_page((u64)cr3);
+}
+
+void __lguest_free_guest_upages(struct lguest_guest_info *linfo, u64 *cr3)
+{
+	pgd_t *pgd;
+	int i;
+
+	if (!cr3)
+		return;
+
+	for (i=0; i < pgd_index(linfo->page_offset); i++) {
+		pgd = (pgd_t*)&cr3[i];
+		if (!(pgd_val(*pgd) & 1))
+			continue;
+		__lguest_free_guest_pgd(pgd);
+	}
+	free_page((u64)cr3);
+}
+
+void lguest_free_guest_pages(struct lguest_guest_info *linfo)
+{
+	int i;
+
+	/* This frees all the guest kernel pages */
+	__lguest_free_guest_pages(linfo->kpgdir.pgdir);
+
+	for (i=0; i < LGUEST_PGDIRS; i++)
+		__lguest_free_guest_upages(linfo, linfo->pgdirs[i].pgdir);
+}
+
+void lguest_unmap_guest_pages(struct lguest_guest_info *lguest,
+			     unsigned long vaddr, int pages)
+{
+	int i;
+
+	for (i=0; i < pages; i++)
+		__lguest_unmap_guest_page(lguest, vaddr+PAGE_SIZE*i);
+}
+
+int lguest_init_vcpu_pagetable(struct lguest_vcpu *vcpu)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+
+	mutex_lock(&linfo->page_lock);
+	vcpu->pgdir = new_pgdir(linfo, linfo->kpgdir.cr3);
+	BUG_ON(!vcpu->pgdir);
+	if (!vcpu->pgdir->count++)
+		vcpu->pgdir->flags |= LGUEST_PGD_BUSY_FL;
+	if (!(vcpu->pgdir->flags & LGUEST_PGD_LINK_FL)) {
+		list_add(&vcpu->pgdir->list, &linfo->pgd_list);
+		vcpu->pgdir->flags |= LGUEST_PGD_LINK_FL;
+	}
+	mutex_unlock(&linfo->page_lock);
+
+	return 0;
+}
+
+int init_guest_pagetable(struct lguest_guest_info *linfo, u64 pgtable)
+{
+	int ret = -ENOMEM;
+
+	linfo->kpgdir.cr3 = pgtable;
+	linfo->kpgdir.pgdir = (u64*)get_zeroed_page(GFP_KERNEL);
+	if (!linfo->kpgdir.pgdir)
+		return -ENOMEM;
+	linfo->kpgdir.flags |= LGUEST_PGD_BUSY_FL | LGUEST_PGD_MASTER_FL;
+	linfo->kpgdir.count = -1;
+
+	/*
+	 * The list is used to update all the kernel page tables,
+	 * so that they all have the same mappings.
+	 */
+	list_add(&linfo->kpgdir.list, &linfo->pgd_list);
+
+	ret = lguest_map_hv_pages(linfo, lguest_hv_addr,
+				  lguest_hv_pages, NULL);
+	if (ret < 0)
+		goto out;
+
+	return 0;
+ out:
+	free_page((u64)linfo->kpgdir.pgdir);
+
+	return ret;
+}
+
Index: work-pv/arch/x86_64/Makefile
===================================================================
--- work-pv.orig/arch/x86_64/Makefile
+++ work-pv/arch/x86_64/Makefile
@@ -84,6 +84,7 @@ core-y					+= arch/x86_64/kernel/ \
 core-$(CONFIG_IA32_EMULATION)		+= arch/x86_64/ia32/
 drivers-$(CONFIG_PCI)			+= arch/x86_64/pci/
 drivers-$(CONFIG_OPROFILE)		+= arch/x86_64/oprofile/
+drivers-$(CONFIG_LGUEST_GUEST)		+= arch/x86_64/lguest/
 
 boot := arch/x86_64/boot
 
Index: work-pv/include/asm-x86_64/lguest.h
===================================================================
--- /dev/null
+++ work-pv/include/asm-x86_64/lguest.h
@@ -0,0 +1,350 @@
+#ifndef _LGUEST_H_
+#define _LGUEST_H_
+#include <asm/desc.h>
+#include <asm/hw_irq.h>
+#include <linux/futex.h>
+#include <asm/lguest_user.h>
+
+/* XXX: Come up with better magic later on */
+#define LGUEST_MAGIC_R13 0x1
+#define LGUEST_MAGIC_R14 0x2
+#define LGUEST_MAGIC_R15 0x3
+
+#define LGUEST_MAX_VCPUS 64
+
+#define LGUEST_PGDS_PER_VCPU 8
+#define LGUEST_PGDIRS (LGUEST_MAX_VCPUS * LGUEST_PGDS_PER_VCPU)
+
+#define LGUEST_IRQS 32
+
+#define LHCALL_FLUSH_ASYNC	0
+#define LHCALL_LGUEST_INIT	1
+#define LHCALL_CRASH		2
+#define LHCALL_LOAD_GDT		3
+#define LHCALL_NEW_PGTABLE	4
+#define LHCALL_FLUSH_TLB	5
+#define LHCALL_LOAD_IDT_ENTRY	6
+#define LHCALL_SET_STACK	7
+#define LHCALL_TS		8
+#define LHCALL_TIMER_READ	9
+#define LHCALL_TIMER_START	10
+#define LHCALL_HALT		11
+#define LHCALL_GET_WALLCLOCK	12
+#define LHCALL_BIND_DMA		13
+#define LHCALL_SEND_DMA		14
+#define LHCALL_FLUSH_TLB_SIG	15
+#define LHCALL_SET_PTE		16
+#define LHCALL_SET_PMD		17
+#define LHCALL_SET_PUD		18
+#define LHCALL_SET_PGD		19
+#define LHCALL_CLEAR_PTE	20
+#define LHCALL_CLEAR_PMD	21
+#define LHCALL_CLEAR_PUD	22
+#define LHCALL_CLEAR_PGD	23
+#define LHCALL_LOAD_TLS		24
+#define LHCALL_RDMSR		25
+#define LHCALL_WRMSR		26
+#define LHCALL_IRET		27
+
+#define LHCALL_PRINT		60
+#define LHCALL_DEBUG_ME		99
+
+#define LGUEST_TRAP_ENTRY 0x1F
+
+static inline unsigned long
+hcall(unsigned long call,
+      unsigned long arg1, unsigned long arg2, unsigned long arg3)
+{
+	asm volatile("int $" __stringify(LGUEST_TRAP_ENTRY)
+		     : "=a"(call)
+		     : "a"(call), "d"(arg1), "b"(arg2), "c"(arg3)
+		     : "memory");
+	return call;
+}
+
+void async_hcall(unsigned long call,
+		 unsigned long arg1, unsigned long arg2, unsigned long arg3);
+
+struct lguest_vcpu;
+
+struct lguest_dma_info
+{
+	struct list_head list;
+	union futex_key key;
+	unsigned long dmas;
+	u16 next_dma;
+	u16 num_dmas;
+	u32 guest_id;
+	u8 interrupt; 	/* 0 when not registered */
+};
+
+
+/* these must be powers of two */
+#define PUD_HASH_SIZE 256
+#define PMD_HASH_SIZE 256
+#define PTE_HASH_SIZE 256
+
+#define LGUEST_PGD_BUSY_FL	(1<<0)
+#define LGUEST_PGD_MASTER_FL	(1<<1)
+#define LGUEST_PGD_LINK_FL	(1<<2)
+
+#define LGUEST_PUD_KERNEL_FL	(1<<1)
+#define LGUEST_PMD_KERNEL_FL	(1<<1)
+#define LGUEST_PTE_KERNEL_FL	(1<<1)
+
+struct lguest_pgd {
+	struct list_head list;
+	u64 cr3;
+	u64 *pgdir;
+	u64 *user_pgdir;
+	unsigned count;
+	unsigned flags;
+};
+
+struct lguest_pud {
+	struct list_head list;
+	struct lguest_pgd *pgdir;
+	u64 gpud;  /* guest pud */
+	unsigned flags;
+	unsigned idx;
+};
+
+struct lguest_pmd {
+	struct list_head list;
+	struct lguest_pud *pudir;
+	u64 gpmd;  /* guest pmd */
+	unsigned flags;
+	unsigned idx;
+};
+
+struct lguest_pte {
+	struct list_head list;
+	struct lguest_pmd *pmdir;
+	u64 gpte;  /* guest pte */
+	unsigned flags;
+	unsigned idx;
+};
+
+struct lguest_guest_info {
+	struct lguest_data __user *lguest_data;
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+	u32 guest_id;
+	u64 pfn_limit;
+	u64 start_kernel_map;
+	u64 page_offset;
+
+	int halted;
+	/* does it really belong here? */
+	char *dead;
+#if 0
+	unsigned long noirq_start, noirq_end;
+#endif
+	int dma_is_pending;
+	unsigned long pending_dma; /* struct lguest_dma */
+	unsigned long pending_addr; /* address they're sending to */
+
+	struct lguest_pgd kpgdir;
+	struct lguest_pgd pgdirs[LGUEST_PGDIRS];
+	struct list_head pgd_list;
+	struct list_head pud_hash[PUD_HASH_SIZE];
+	struct list_head pmd_hash[PMD_HASH_SIZE];
+	struct list_head pte_hash[PTE_HASH_SIZE];
+	struct mutex page_lock;
+
+	int timer_on;
+	int last_timer;
+
+	/* Cached wakeup: we hold a reference to this task. */
+	struct task_struct *wake;
+
+	struct lguest_dma_info dma[LGUEST_MAX_DMA];
+
+	struct lguest_vcpu *vcpu[LGUEST_MAX_VCPUS];
+};
+
+/* copied from old lguest code. Not sure if it's the best layout for us */
+struct lguest_regs
+{
+	u64 cr3;			/*   0 ( 0x0) */
+        /* Manually saved part. */
+        u64 rbx, rcx, rdx;		/*   8 ( 0x8) */
+        u64 rsi, rdi, rbp;		/*  32 (0x20) */
+        u64 r8, r9, r10, r11;		/*  56 (0x38) */
+        u64 r12, r13, r14, r15;		/*  88 (0x58) */
+        u64 rax;			/* 120 (0x78) */
+        u64 fs; /* ds; */		/* 128 (0x80) */
+        u64 trapnum, errcode;		/* 136 (0x88) */
+        /* Trap pushed part */
+        u64 rip;			/* 152 (0x98) */
+        u64 cs;				/* 160 (0xa0) */
+        u64 rflags;			/* 168 (0xa8) */
+        u64 rsp;			/* 176 (0xb0) */
+	u64 ss; /* Crappy Segment! */	/* 184 (0xb8) */
+	/* size = 192  (0xc0) */
+	char size[0];
+};
+
+struct lguest_tss_struct {
+	u32 reserved1;
+	u64 rsp0;
+	u64 rsp1;
+	u64 rsp2;
+	u64 reserved2;
+	u64 ist[7];
+	u32 reserved3;
+	u32 reserved4;
+	u16 reserved5;
+	u16 io_bitmap_base;
+	/* we don't let the guest have io privileges (yet) */
+	unsigned long io_bitmap[1];
+} __attribute__((packed)) ____cacheline_aligned;
+
+struct lguest_vcpu {
+	unsigned long host_syscall;
+	unsigned long guest_syscall;
+
+	/* Must be 16 bytes aligned at regs+sizeof(regs) */
+	struct lguest_regs regs;
+
+	struct lguest_vcpu *vcpu; /* pointer to itself */
+	unsigned long debug;
+	unsigned long magic;
+	unsigned int  id;
+	unsigned long host_stack;
+	unsigned long guest_stack;
+	unsigned long host_cr3;
+	unsigned long host_page;
+	struct desc_ptr host_gdt;
+	u16 host_gdt_buff[3];
+	struct desc_ptr host_idt;
+	u16 host_idt_buff[3];
+	unsigned long host_gdt_ptr;
+	/* Save rax on interrupts, it's used for iret hcall */
+	unsigned long rax;
+
+	/* Host save gs base pointer */
+	unsigned long host_gs_a;
+	unsigned long host_gs_d;
+
+	/* save host process gs base pointer */
+	unsigned long host_proc_gs_a;
+	unsigned long host_proc_gs_d;
+
+	/* save guest gs base pointer */
+	unsigned long guest_gs_a;
+	unsigned long guest_gs_d;
+
+	/* used for guest calling swapgs */
+	unsigned long guest_gs_shadow_a;
+	unsigned long guest_gs_shadow_d;
+
+	struct lguest_pgd *pgdir;
+
+	struct desc_ptr gdt; /* address of the GDT at this vcpu */
+	u16 gdt_buff[3];
+	struct desc_struct gdt_table[GDT_ENTRIES];
+
+	struct desc_ptr idt; /* address of the IDT at this vcpu */
+	u16 idt_buff[3];
+	struct gate_struct idt_table[IDT_ENTRIES];
+
+	struct lguest_guest_info *guest;
+
+	struct lguest_tss_struct tss;
+
+	unsigned long ts;
+
+	/* host ist 7 - we use it to prevent the NMI race */
+	unsigned long host_ist;
+
+	/* only for those above FIRST_EXTERNAL_VECTOR */
+	DECLARE_BITMAP(irqs_pending, LGUEST_IRQS);
+	/* those are general. We catch every possible interrupt */
+	DECLARE_BITMAP(interrupt_disabled, LGUEST_IRQS + FIRST_EXTERNAL_VECTOR);
+	unsigned long interrupt[LGUEST_IRQS + FIRST_EXTERNAL_VECTOR];
+
+	/* nmi trampoline storage */
+
+	struct lguest_regs nmi_regs;
+	unsigned long nmi_gs_a;
+	unsigned long nmi_gs_d;
+	unsigned long nmi_gs_shadow_a;
+	unsigned long nmi_gs_shadow_d;
+	struct desc_ptr nmi_gdt;
+	u16 nmi_gdt_buff[3];
+
+	/* set when we take an nmi */
+	unsigned long nmi_sw;
+
+	/* is this enough? */
+	char nmi_stack[1048];
+	char nmi_stack_end[0];
+	char gpf_stack[1048];
+	char gpf_stack_end[0];
+	char df_stack[1048];
+	char df_stack_end[0];
+};
+
+
+#define LHCALL_RING_SIZE 64
+struct hcall_ring
+{
+	u32 eax, edx, ebx, ecx;
+};
+
+struct lguest_text_ptr {
+	unsigned long next; /* guest pa address of next pointer */
+	unsigned long start;
+	unsigned long end;
+};
+
+struct lguest_data
+{
+/* Fields which change during running: */
+	/* 512 == enabled (same as eflags) */
+	unsigned int irq_enabled;
+	/* Blocked interrupts. */
+	DECLARE_BITMAP(interrupts, LGUEST_IRQS);
+
+	/* Last (userspace) address we got a GPF & reloaded gs. */
+	unsigned int gs_gpf_eip;
+
+	/* Virtual address of page fault. */
+	unsigned long cr2;
+
+	/* Async hypercall ring.  0xFF == done, 0 == pending. */
+	u8 hcall_status[LHCALL_RING_SIZE];
+	struct hcall_ring hcalls[LHCALL_RING_SIZE];
+
+/* Fields initialized by the hypervisor at boot: */
+	/* Memory not to try to access */
+	unsigned long reserve_mem;
+	/* ID of this guest (used by network driver to set ethernet address) */
+	u32 guest_id;
+
+/* Fields initialized by the guest at boot: */
+	/* Instruction range to suppress interrupts even if enabled */
+#if 0
+	unsigned long noirq_start, noirq_end;
+#endif
+	unsigned long start_kernel_map;
+	unsigned long page_offset;
+	unsigned long text; /* pa address of lguest_text_ptr addresses */
+
+/* If the kernel has kallsyms, we can use it to do backtraces of a guest */
+	unsigned long kallsyms_addresses;
+	unsigned long kallsyms_num_syms;
+	unsigned long kallsyms_names;
+	unsigned long kallsyms_token_table;
+	unsigned long kallsyms_token_index;
+	unsigned long kallsyms_markers;
+
+	unsigned long return_address;
+};
+
+extern struct lguest_data lguest_data;
+extern struct lguest_device_desc *lguest_devices; /* Just past max_pfn */
+int run_guest(struct lguest_vcpu *vcpu, char *__user user);
+
+#endif

--

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH LGUEST X86_64 04/13] Useful debugging
       [not found] <20070308162348.299676000@redhat.com>
                   ` (2 preceding siblings ...)
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 03/13] lguest64 core Steven Rostedt
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 05/13] asm-offsets update Steven Rostedt
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-debug-utils.patch)
This patch contains some nice features used to debug the lguest64 guest.

It has a way to print page tables for either the host or the guest.
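The page-table printer below reconstructs, for each PTE it prints, the virtual address that entry maps, from the four table indices. A minimal standalone sketch of that arithmetic (the function name and the shift constants are stand-ins here, using the standard x86_64 4-level paging values, not taken from this patch):

```c
#include <stdint.h>

/* Rebuild a virtual address from x86_64 4-level paging indices, as
 * print_pte() does: OR together each index shifted to its bit position,
 * then sign-extend from bit 47 when the pgd index has bit 8 set (i.e.
 * the address lies in the kernel half of the canonical address space). */
#define T_PAGE_SHIFT  12
#define T_PMD_SHIFT   21
#define T_PUD_SHIFT   30
#define T_PGDIR_SHIFT 39

static uint64_t rebuild_va(uint64_t pgd_idx, uint64_t pud_idx,
			   uint64_t pmd_idx, uint64_t pte_idx)
{
	uint64_t va = (pgd_idx << T_PGDIR_SHIFT) |
		      (pud_idx << T_PUD_SHIFT)  |
		      (pmd_idx << T_PMD_SHIFT)  |
		      (pte_idx << T_PAGE_SHIFT);
	if (pgd_idx & (1 << 8))		/* kernel half: sign-extend */
		va |= 0xffffULL << 48;
	return va;
}
```

For example, pgd index 0x100 with all other indices zero yields 0xffff800000000000, the start of the kernel half.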

It incorporates kallsyms, and can do a nice backtrace of a guest
when it crashes.  The guest obviously needs kallsyms compiled in.
Note: this code needs to be fixed to be more secure!
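The host-side symbol lookup in this patch (get_symbol_pos) is essentially a binary search over the guest's sorted kallsyms address array, stepping back over aliased symbols and using the next distinct address to bound the symbol's size. A self-contained sketch of that search, detached from the guest-memory accessors (the function name is hypothetical):

```c
#include <stddef.h>

/* Find which symbol contains `addr` in a sorted array of symbol start
 * addresses; report the symbol's size and the offset of addr into it.
 * Returns the index of the first alias, or -1 for the last symbol
 * (whose end cannot be known without the section end). */
static long symbol_pos(const unsigned long *addrs, unsigned long nsyms,
		       unsigned long addr,
		       unsigned long *size, unsigned long *offset)
{
	unsigned long low = 0, high = nsyms, mid, i, end = 0;

	while (high - low > 1) {
		mid = (low + high) / 2;
		if (addrs[mid] <= addr)
			low = mid;
		else
			high = mid;
	}
	/* step back over aliased symbols sharing the same address */
	while (low && addrs[low - 1] == addrs[low])
		--low;
	/* the next distinct address bounds this symbol */
	for (i = low + 1; i < nsyms; i++)
		if (addrs[i] > addrs[low]) {
			end = addrs[i];
			break;
		}
	if (!end)
		return -1;
	*size = end - addrs[low];
	*offset = addr - addrs[low];
	return (long)low;
}
```

With addresses {0x100, 0x200, 0x200, 0x300}, looking up 0x250 lands in the aliased 0x200 symbol (index 1), with offset 0x50 and size 0x100.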

It implements an lgdebug_print() that can be used within the host
and that only prints when lguest_debug is true.  There's a hypercall
that the guest can call to turn this on.

There's also a function, lguest_set_debug(n), that makes it easy
for the guest to toggle debugging: any n != 0 turns the debug
prints on, and n == 0 turns them off.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>



Index: work-pv/arch/x86_64/lguest/lguest_debug.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/lguest_debug.c
@@ -0,0 +1,532 @@
+/*
+    lguest debug utils. Modified from various other parts of Linux.
+    What was modified is Copyright 2007 Steven Rostedt, Red Hat
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+*/
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/kallsyms.h>
+#include <asm/paravirt.h>
+#include <asm/hv_vm.h>
+#include <asm/pgtable.h>
+#include <asm/uaccess.h>
+#include "lguest.h"
+
+int lguest_debug;
+
+static DEFINE_SPINLOCK(lgdebug_print_lock);
+#define LGDEBUG_BUF_SIZ 1024
+static char lgdebug_print_buf[LGDEBUG_BUF_SIZ];
+
+void lgdebug_vprint(const char *fmt, va_list ap)
+{
+	unsigned long flags;
+
+	if (!lguest_debug)
+		return;
+
+	spin_lock_irqsave(&lgdebug_print_lock, flags);
+	vsnprintf(lgdebug_print_buf, LGDEBUG_BUF_SIZ-1, fmt, ap);
+	printk("%s", lgdebug_print_buf);
+	spin_unlock_irqrestore(&lgdebug_print_lock, flags);
+}
+
+void lgdebug_print(const char *fmt, ...)
+{
+	va_list ap;
+
+	if (!lguest_debug)
+		return;
+
+	/* irq save? */
+	va_start(ap, fmt);
+	lgdebug_vprint(fmt, ap);
+	va_end(ap);
+}
+
+void lguest_dump_vcpu_regs(struct lguest_vcpu *vcpu)
+{
+	struct lguest_regs *regs = &vcpu->regs;
+
+	printk("Printing VCPU %d regs cr3: %016llx\n", vcpu->id, regs->cr3);
+	printk("RIP: %04llx: ", regs->cs & 0xffff);
+	lguest_print_address(vcpu, regs->rip);
+	printk("RSP: %04llx:%016llx  EFLAGS: %08llx\n", regs->ss, regs->rsp,
+		regs->rflags);
+	printk("RAX: %016llx RBX: %016llx RCX: %016llx\n",
+	       regs->rax, regs->rbx, regs->rcx);
+	printk("RDX: %016llx RSI: %016llx RDI: %016llx\n",
+	       regs->rdx, regs->rsi, regs->rdi);
+	printk("RBP: %016llx R08: %016llx R09: %016llx\n",
+	       regs->rbp, regs->r8, regs->r9);
+	printk("R10: %016llx R11: %016llx R12: %016llx\n",
+	       regs->r10, regs->r11, regs->r12);
+	printk("R13: %016llx R14: %016llx R15: %016llx\n",
+	       regs->r13, regs->r14, regs->r15);
+
+	printk("errcode: %llx   trapnum: %llx\n",
+	       regs->errcode, regs->trapnum);
+
+	lguest_dump_trace(vcpu, regs);
+}
+
+struct guest_ksym_stuff {
+	unsigned long *addresses;
+	unsigned long num_syms;
+	u8 *names;
+	u8 *token_table;
+	u16 *token_index;
+	unsigned long *markers;
+};
+
+static struct lguest_text_ptr *get_text_segs(struct lguest_vcpu *vcpu)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_text_ptr *segs, **p;
+	struct lguest_text_ptr *g;
+	unsigned long addr;
+	int i;
+
+	if (!linfo->lguest_data)
+		return NULL;
+
+	addr = lhread_u64(vcpu, (u64)&linfo->lguest_data->text);
+	if (!addr)
+		return NULL;
+
+	g = (struct lguest_text_ptr*)addr;
+
+	p = &segs;
+
+	/* only allow for 10 segs */
+	for (i=0; i < 10; i++) {
+		*p = kmalloc(sizeof(*segs), GFP_KERNEL);
+		if (!*p)
+			goto free_me;
+		(*p)->start = lhread_u64(vcpu, (u64)&g->start);
+		(*p)->end = lhread_u64(vcpu, (u64)&g->end);
+		addr = lhread_u64(vcpu, (u64)&g->next);
+		p = (struct lguest_text_ptr**)&((*p)->next);
+		if (!addr)
+			break;
+		g = (struct lguest_text_ptr*)addr;
+	}
+	*p = NULL;
+
+	return segs;
+
+free_me:
+	while (segs) {
+		g = (struct lguest_text_ptr*)segs->next;
+		kfree(segs);
+		segs = g;
+	}
+	return NULL;
+}
+
+static int is_text_seg(struct lguest_text_ptr *segs, unsigned long addr)
+{
+	while (segs) {
+		if (addr >= segs->start &&
+		    addr <= segs->end)
+			return 1;
+		segs = (struct lguest_text_ptr*)segs->next;
+	}
+	return 0;
+}
+
+static void put_text_segs(struct lguest_text_ptr *segs)
+{
+	struct lguest_text_ptr *p;
+
+	while (segs) {
+		p = (struct lguest_text_ptr*)segs->next;
+		kfree(segs);
+		segs = p;
+	}
+}
+
+static unsigned int expand_symbol(struct lguest_vcpu *vcpu,
+				  struct guest_ksym_stuff *kstuff,
+				  unsigned int off, char *result)
+{
+	int len, skipped_first = 0;
+	const u8 *tptr, *data;
+
+	/* get the compressed symbol length from the first symbol byte */
+	data = &kstuff->names[off];
+
+	len = lhread_u8(vcpu, (u64)data);
+
+	data++;
+
+	/* update the offset to return the offset for the next symbol on
+	 * the compressed stream */
+	off += len + 1;
+
+	/* for every byte on the compressed symbol data, copy the table
+	   entry for that byte */
+	while(len) {
+		u8 idx;
+		u16 tok;
+		idx = lhread_u8(vcpu, (u64)data);
+		tok = lhread_u16(vcpu, (u64)(&kstuff->token_index[idx]));
+		tptr = &kstuff->token_table[ tok ];
+		data++;
+		len--;
+
+		idx = lhread_u8(vcpu, (u64)tptr);
+		while (idx) {
+			if(skipped_first) {
+				*result = idx;
+				result++;
+			} else
+				skipped_first = 1;
+			tptr++;
+			idx = lhread_u8(vcpu, (u64)tptr);
+		}
+	}
+
+	*result = '\0';
+
+	/* return to offset to the next symbol */
+	return off;
+}
+
+static unsigned long get_symbol_pos(struct lguest_vcpu *vcpu,
+				    struct guest_ksym_stuff *kstuff,
+				    unsigned long addr,
+				    unsigned long *symbolsize,
+				    unsigned long *offset)
+{
+	unsigned long symbol_start = 0, symbol_end = 0;
+	unsigned long i, low, high, mid;
+
+	/* do a binary search on the sorted kallsyms_addresses array */
+	low = 0;
+	high = kstuff->num_syms;
+
+	while (high - low > 1) {
+		mid = (low + high) / 2;
+		if (kstuff->addresses[mid] <= addr)
+			low = mid;
+		else
+			high = mid;
+	}
+
+	/*
+	 * search for the first aliased symbol. Aliased
+	 * symbols are symbols with the same address
+	 */
+	while (low && kstuff->addresses[low-1] == kstuff->addresses[low])
+		--low;
+
+	symbol_start = kstuff->addresses[low];
+
+	/* Search for next non-aliased symbol */
+	for (i = low + 1; i < kstuff->num_syms; i++) {
+		if (kstuff->addresses[i] > symbol_start) {
+			symbol_end = kstuff->addresses[i];
+			break;
+		}
+	}
+
+	/* if we found no next symbol, we use the end of the section */
+	if (!symbol_end) {
+		return (unsigned long)(-1UL);
+#if 0
+		if (is_kernel_inittext(addr))
+			symbol_end = (unsigned long)_einittext;
+		else if (all_var)
+			symbol_end = (unsigned long)_end;
+		else
+			symbol_end = (unsigned long)_etext;
+#endif
+	}
+
+	*symbolsize = symbol_end - symbol_start;
+	*offset = addr - symbol_start;
+
+	return low;
+}
+
+static int is_ksym_addr(struct lguest_guest_info *linfo,
+			unsigned long addr)
+{
+	/* need to look up the segs */
+	return 1;
+}
+
+static unsigned int get_symbol_offset(struct lguest_vcpu *vcpu,
+				      struct guest_ksym_stuff *kstuff,
+				      unsigned long pos)
+{
+	const u8 *name;
+	int i;
+	unsigned long idx;
+
+	idx = lhread_u64(vcpu, (u64)&kstuff->markers[pos>>8]);
+
+	/* use the closest marker we have. We have markers every 256 positions,
+	 * so that should be close enough */
+	name = &kstuff->names[ idx ];
+
+	/* sequentially scan all the symbols up to the point we're searching for.
+	 * Every symbol is stored in a [<len>][<len> bytes of data] format, so we
+	 * just need to add the len to the current pointer for every symbol we
+	 * wish to skip */
+	for(i = 0; i < (pos&0xFF); i++) {
+		u8 c;
+		c = lhread_u8(vcpu, (u64)name);
+		name = name + c + 1;
+	}
+
+	return name - kstuff->names;
+}
+
+static const char *lguest_syms_lookup(struct lguest_vcpu *vcpu,
+				      unsigned long addr,
+				      unsigned long *symbolsize,
+				      unsigned long *offset,
+				      char **modname, char *namebuf)
+{
+	struct lguest_guest_info *linfo = vcpu->guest;
+	struct lguest_data *data = linfo->lguest_data;
+	struct guest_ksym_stuff kstuff;
+	const char *msym;
+	unsigned long *ptr;
+	int i;
+
+	kstuff.addresses = (unsigned long*)lhread_u64(vcpu, (u64)&data->kallsyms_addresses);
+	kstuff.num_syms = lhread_u64(vcpu, (u64)&data->kallsyms_num_syms);
+	kstuff.names = (u8*)lhread_u64(vcpu, (u64)&data->kallsyms_names);
+	kstuff.token_table = (u8*)lhread_u64(vcpu, (u64)&data->kallsyms_token_table);
+	kstuff.token_index = (u16*)lhread_u64(vcpu, (u64)&data->kallsyms_token_index);
+	kstuff.markers = (unsigned long*)lhread_u64(vcpu, (u64)&data->kallsyms_markers);
+
+	if (!kstuff.addresses || !kstuff.num_syms || !kstuff.names ||
+	    !kstuff.token_table || !kstuff.token_index || !kstuff.markers)
+		return NULL;
+
+	/* FIXME: Validate all the kstuff here!! */
+
+	ptr = kmalloc(sizeof(unsigned long)*kstuff.num_syms, GFP_KERNEL);
+	if (!ptr)
+		return NULL;
+
+	for (i=0; i < kstuff.num_syms; i++) {
+		/* FIXME: do this better! */
+		ptr[i] = lhread_u64(vcpu, (u64)&kstuff.addresses[i]);
+		if (i && ptr[i] < ptr[i-1]) {
+			kill_guest(linfo, "bad kallsyms table\n");
+			kstuff.addresses = ptr;
+			goto out;
+		}
+	}
+	kstuff.addresses = ptr;
+
+	namebuf[KSYM_NAME_LEN] = 0;
+	namebuf[0] = 0;
+
+	if (is_ksym_addr(linfo, addr)) {
+		unsigned long pos;
+
+		pos = get_symbol_pos(vcpu, &kstuff, addr, symbolsize, offset);
+		if (pos == (unsigned long)(-1UL))
+			goto out;
+
+		/* Grab name */
+		expand_symbol(vcpu, &kstuff,
+			      get_symbol_offset(vcpu, &kstuff, pos), namebuf);
+		*modname = NULL;
+		kfree(kstuff.addresses);
+		return namebuf;
+	}
+
+	/* see if it's in a module */
+	msym = module_address_lookup(addr, symbolsize, offset, modname);
+	if (msym) {
+		kfree(kstuff.addresses);
+		return strncpy(namebuf, msym, KSYM_NAME_LEN);
+	}
+
+out:
+	kfree(kstuff.addresses);
+	return NULL;
+}
+
+void lguest_print_address(struct lguest_vcpu *vcpu, unsigned long address)
+{
+	unsigned long offset = 0, symsize;
+	const char *symname;
+	char *modname;
+	char *delim = ":";
+	char namebuf[KSYM_NAME_LEN+1];
+
+	symname = lguest_syms_lookup(vcpu, address, &symsize, &offset,
+				     &modname, namebuf);
+	if (!symname) {
+		printk(" [<%016lx>]\n", address);
+		return;
+	}
+	if (!modname)
+		modname = delim = "";
+	printk(" [<%016lx>] %s%s%s%s+0x%lx/0x%lx\n",
+	       address, delim, modname, delim, symname, offset, symsize);
+
+}
+
+void lguest_dump_trace(struct lguest_vcpu *vcpu, struct lguest_regs *regs)
+{
+	unsigned long stack = regs->rsp;
+	unsigned long stack_end = (regs->rsp & PAGE_MASK) + PAGE_SIZE;
+	unsigned long start_kernel_map;
+	unsigned long page_offset;
+	unsigned long addr;
+	struct lguest_text_ptr *segs;
+
+	printk("Stack Dump:\n");
+	start_kernel_map = vcpu->guest->start_kernel_map;
+	page_offset = vcpu->guest->page_offset;
+
+	segs = get_text_segs(vcpu);
+	if (!segs)
+		return;
+
+	for (; stack < stack_end; stack += sizeof(stack)) {
+		addr = lhread_u64(vcpu, guest_pa(vcpu->guest, stack));
+		if (is_text_seg(segs, addr)) {
+			lguest_print_address(vcpu, addr);
+		}
+	}
+
+	put_text_segs(segs);
+}
+
+static u64 read_page(struct lguest_vcpu *vcpu, u64 page, u64 idx)
+{
+	u64 *ptr;
+
+	if (!vcpu) {
+		ptr = __va(page);
+		return ptr[idx];
+	}
+
+	return lhread_u64(vcpu, page+idx*sizeof(u64));
+}
+
+static void print_pte(u64 pte, u64 pgd_idx, u64 pud_idx, u64 pmd_idx, u64 pte_idx)
+{
+	printk("           %3llx: %llx\n", pte_idx, pte);
+	printk ("               (%llx)\n",
+		((pgd_idx&(1<<8)?(-1ULL):0ULL)<<48) |
+		(pgd_idx<<PGDIR_SHIFT) |
+		(pud_idx<<PUD_SHIFT) |
+		(pmd_idx<<PMD_SHIFT) |
+		(pte_idx<<PAGE_SHIFT));
+}
+
+static void print_pmd(struct lguest_vcpu *vcpu,
+		      u64 pmd, u64 pgd_idx, u64 pud_idx, u64 pmd_idx)
+{
+	u64 pte;
+	u64 ptr;
+	u64 i;
+
+	printk("        %3llx: %llx\n", pmd_idx, pmd);
+
+	/* 2M page? */
+	if (pmd & (1<<7)) {
+		printk ("            (%llx)\n",
+			((pgd_idx&(1<<8)?(-1ULL):0ULL)<<48) |
+			(pgd_idx<<PGDIR_SHIFT) |
+			(pud_idx<<PUD_SHIFT) |
+			(pmd_idx<<PMD_SHIFT));
+	} else {
+		pte = pmd & ~(0xfff) & ~(1UL << 63);
+		for (i=0; i < PTRS_PER_PTE; i++) {
+			ptr = read_page(vcpu, pte, i);
+			if (ptr)
+				print_pte(ptr, pgd_idx, pud_idx, pmd_idx, i);
+		}
+	}
+}
+
+static void print_pud(struct lguest_vcpu *vcpu,
+		      u64 pud, u64 pgd_idx, u64 pud_idx)
+{
+	u64 pmd;
+	u64 ptr;
+	u64 i;
+
+	printk("     %3llx: %llx\n", pud_idx, pud);
+
+	pmd = pud & ~(0xfff) & ~(1UL << 63);
+	for (i=0; i < PTRS_PER_PMD; i++) {
+		ptr = read_page(vcpu, pmd, i);
+		if (ptr)
+			print_pmd(vcpu, ptr, pgd_idx, pud_idx, i);
+	}
+}
+
+static void print_pgd(struct lguest_vcpu *vcpu,
+		      u64 pgd, u64 pgd_idx)
+{
+	u64 pud;
+	u64 ptr;
+	u64 i;
+
+	printk(" %3llx:  %llx\n", pgd_idx, pgd);
+	pud = pgd & ~(0xfff) & ~(1UL << 63);
+	for (i=0; i < PTRS_PER_PUD; i++) {
+		ptr = read_page(vcpu, pud, i);
+		if (ptr)
+			print_pud(vcpu, ptr, pgd_idx, i);
+	}
+
+}
+
+static void print_page_tables(struct lguest_vcpu *vcpu,
+			      u64 cr3)
+{
+	u64 pgd;
+	u64 ptr;
+	u64 i;
+
+	printk("cr3: %016llx\n", cr3);
+	pgd = cr3;
+
+	for (i=0; i < PTRS_PER_PGD; i++) {
+		ptr = read_page(vcpu, pgd, i);
+		if (ptr)
+			print_pgd(vcpu, ptr, i);
+	}
+}
+
+void lguest_print_page_tables(u64 *cr3)
+{
+	if (!cr3) {
+		printk("NULL cr3 pointer????\n");
+		return;
+	}
+	print_page_tables(NULL, __pa(cr3));
+}
+
+void lguest_print_guest_page_tables(struct lguest_vcpu *vcpu, u64 cr3)
+{
+	print_page_tables(vcpu, cr3);
+}

--


* [RFC/PATCH LGUEST X86_64 05/13] asm-offsets update
       [not found] <20070308162348.299676000@redhat.com>
                   ` (3 preceding siblings ...)
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 04/13] Useful debugging Steven Rostedt
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig Steven Rostedt
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-asm-offset.patch)
This patch adds the structure-member offsets used by the lguest64 assembly code.
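Each ENTRY()/DEFINE() below emits an offsetof() value through an asm marker line ("->SYM val"), which the kernel build rewrites into "#define SYM val" so assembly can address C struct members. A toy illustration of the values involved (the structs here are hypothetical stand-ins, not the real lguest layouts):

```c
#include <stddef.h>

/* Stand-ins mimicking the shape of struct lguest_vcpu: the generated
 * asm-offsets header would carry these offsetof() values as constants
 * usable from assembly, e.g. LGUEST_VCPU_guest_syscall. */
struct toy_regs {
	unsigned long cr3, rbx, rcx, rdx;
};

struct toy_vcpu {
	unsigned long host_syscall;
	unsigned long guest_syscall;
	struct toy_regs regs;	/* nested members get offsets too */
};
```

Nested designators like regs.rbx are valid offsetof() member designators, which is how DEFINE(LGUEST_VCPU_trapnum, offsetof(struct lguest_vcpu, regs.trapnum)) works.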

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/arch/x86_64/kernel/asm-offsets.c
===================================================================
--- work-pv.orig/arch/x86_64/kernel/asm-offsets.c
+++ work-pv/arch/x86_64/kernel/asm-offsets.c
@@ -18,6 +18,9 @@
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #endif
+#ifdef CONFIG_LGUEST_GUEST
+#include <asm/lguest.h>
+#endif
 
 #define DEFINE(sym, val) \
         asm volatile("\n->" #sym " %0 " #val : : "i" (val))
@@ -89,5 +92,51 @@ int main(void)
 	ENTRY(read_cr2);
 	ENTRY(swapgs);
 #endif
+
+#ifdef CONFIG_LGUEST_GUEST
+#undef ENTRY
+#define ENTRY(entry)  DEFINE(LGUEST_VCPU_ ##entry, offsetof(struct lguest_vcpu, entry))
+	ENTRY(vcpu);
+	ENTRY(debug);
+	ENTRY(magic);
+	ENTRY(guest_syscall);
+	ENTRY(host_stack);
+	ENTRY(host_cr3);
+	ENTRY(host_gs_a);
+	ENTRY(host_gs_d);
+	ENTRY(host_proc_gs_a);
+	ENTRY(host_proc_gs_d);
+	ENTRY(guest_gs_a);
+	ENTRY(guest_gs_d);
+	ENTRY(gdt);
+	ENTRY(idt);
+	ENTRY(host_gdt);
+	ENTRY(host_idt);
+	ENTRY(host_gdt_ptr);
+	ENTRY(gdt_table);
+	DEFINE(LGUEST_VCPU_trapnum, offsetof(struct lguest_vcpu, regs.trapnum));
+	DEFINE(LGUEST_VCPU_errcode, offsetof(struct lguest_vcpu, regs.errcode));
+	DEFINE(LGUEST_VCPU_rflags, offsetof(struct lguest_vcpu, regs.rflags));
+	DEFINE(LGUEST_VCPU_host_idt_address, offsetof(struct lguest_vcpu, host_idt.address));
+	ENTRY(regs);
+	ENTRY(nmi_regs);
+	DEFINE(LGUEST_VCPU_errcode, offsetof(struct lguest_vcpu, regs.errcode));
+	ENTRY(nmi_gs_a);
+	ENTRY(nmi_gs_d);
+	ENTRY(nmi_gs_shadow_a);
+	ENTRY(nmi_gs_shadow_d);
+	ENTRY(nmi_stack_end);
+	ENTRY(nmi_gdt);
+	ENTRY(nmi_sw);
+#undef ENTRY
+#define ENTRY(entry)  DEFINE(LGUEST_DATA_##entry, offsetof(struct lguest_data, entry))
+	ENTRY(irq_enabled);
+#undef ENTRY
+#define ENTRY(entry)  DEFINE(LGUEST_REGS_##entry, offsetof(struct lguest_regs, entry))
+	ENTRY(errcode);
+	ENTRY(rip);
+	ENTRY(size);
+	BLANK();
+#endif
 	return 0;
 }

--


* [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig
       [not found] <20070308162348.299676000@redhat.com>
                   ` (4 preceding siblings ...)
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 05/13] asm-offsets update Steven Rostedt
@ 2007-03-08 17:38 ` Steven Rostedt
  2007-03-09  3:55   ` Rusty Russell
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 07/13] lguest64 loader Steven Rostedt
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:38 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-kconfig.patch)
Put the kconfig options for lguest64 in.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>



Index: work-pv/arch/x86_64/Kconfig
===================================================================
--- work-pv.orig/arch/x86_64/Kconfig
+++ work-pv/arch/x86_64/Kconfig
@@ -320,6 +320,19 @@ config SCHED_MC
 
 source "kernel/Kconfig.preempt"
 
+config LGUEST
+	tristate "Lguest support"
+	depends on PARAVIRT
+	help
+	  Enable this if you think 32 bits are not enough for a puppy.
+
+config LGUEST_GUEST
+	bool
+	depends on LGUEST
+	default y
+	help
+	  Guest definitions for lguest
+
 config NUMA
        bool "Non Uniform Memory Access (NUMA) Support"
        depends on SMP

--


* [RFC/PATCH LGUEST X86_64 07/13] lguest64 loader
       [not found] <20070308162348.299676000@redhat.com>
                   ` (5 preceding siblings ...)
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 08/13] lguest64 user header Steven Rostedt
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-loader.patch)
I noticed that the lguest loader code for i386 was in
Documentation/lguest.  Well, that's fine (I guess), but
it can't just be for i386.  So I made a separate directory
for each architecture's loader code.  So now we have:

 Documentation/lguest/i386/... for the lguest i386 loader.

and
 Documentation/lguest/x86_64/... for the lguest x86_64 loader.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/Documentation/lguest/i386/Makefile
===================================================================
--- /dev/null
+++ work-pv/Documentation/lguest/i386/Makefile
@@ -0,0 +1,21 @@
+# This creates the demonstration utility "lguest" which runs a Linux guest.
+
+# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
+# Some shells (dash on Ubuntu) can't handle numbers that big, so we cheat.
+include ../../../.config
+LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)
+
+CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \
+	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds
+LDLIBS:=-lz
+
+all: lguest.lds lguest
+
+# The linker script on x86 is so complex the only way of creating one
+# which will link our binary in the right place is to mangle the
+# default one.
+lguest.lds:
+	$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
+
+clean:
+	rm -f lguest.lds lguest
Index: work-pv/Documentation/lguest/i386/lguest.c
===================================================================
--- /dev/null
+++ work-pv/Documentation/lguest/i386/lguest.c
@@ -0,0 +1,1039 @@
+/* Simple program to layout "physical" memory for new lguest guest.
+ * Linked high to avoid likely physical memory.  */
+#define _LARGEFILE64_SOURCE
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <err.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <elf.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <netinet/in.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_tun.h>
+#include <sys/uio.h>
+#include <termios.h>
+#include <zlib.h>
+typedef uint32_t u32;
+typedef uint16_t u16;
+typedef uint8_t u8;
+
+#include "../../../include/asm/lguest_user.h"
+
+#define PAGE_PRESENT 0x7 	/* Present, RW, Execute */
+#define NET_PEERNUM 1
+#define BRIDGE_PFX "bridge:"
+
+static bool verbose;
+#define verbose(args...) \
+	do { if (verbose) printf(args); fflush(stdout); } while(0)
+
+struct devices
+{
+	fd_set infds;
+	int max_infd;
+
+	struct device *dev;
+};
+
+struct device
+{
+	struct device *next;
+	struct lguest_device_desc *desc;
+	void *mem;
+
+	/* Watch this fd if handle_input non-NULL. */
+	int fd;
+	int (*handle_input)(int fd, struct device *me);
+
+	/* Watch DMA to this address if handle_input non-NULL. */
+	unsigned long watch_address;
+	u32 (*handle_output)(int fd, const struct iovec *iov,
+			     unsigned int num, struct device *me);
+
+	/* Device-specific data. */
+	void *priv;
+};
+
+static char buf[1024];
+static struct iovec discard_iov = { .iov_base=buf, .iov_len=sizeof(buf) };
+static int zero_fd;
+
+/* LGUEST_GUEST_TOP defined in Makefile, just below us.
+   FIXME: vdso gets mapped just under it, and we need to protect that. */
+#define RESERVE_TOP LGUEST_GUEST_TOP - 1024*1024
+
+static u32 memparse(const char *ptr)
+{
+	char *end;
+	unsigned long ret = strtoul(ptr, &end, 0);
+
+	switch (*end) {
+	case 'G':
+	case 'g':
+		ret <<= 10;
+	case 'M':
+	case 'm':
+		ret <<= 10;
+	case 'K':
+	case 'k':
+		ret <<= 10;
+		end++;
+	default:
+		break;
+	}
+	return ret;
+}
+
+static inline unsigned long page_align(unsigned long addr)
+{
+	return ((addr + getpagesize()-1) & ~(getpagesize()-1));
+}
+
+/* initrd gets loaded at top of memory: return length. */
+static unsigned long load_initrd(const char *name, unsigned long end)
+{
+	int ifd;
+	struct stat st;
+	void *iaddr;
+
+	if (!name)
+		return 0;
+
+	ifd = open(name, O_RDONLY, 0);
+	if (ifd < 0)
+		err(1, "Opening initrd '%s'", name);
+
+	if (fstat(ifd, &st) < 0)
+		err(1, "fstat() on initrd '%s'", name);
+
+	iaddr = mmap((void *)end - st.st_size, st.st_size,
+		     PROT_READ|PROT_EXEC|PROT_WRITE,
+		     MAP_FIXED|MAP_PRIVATE, ifd, 0);
+	if (iaddr != (void *)end - st.st_size)
+		err(1, "Mmaping initrd '%s' returned %p not %p",
+		    name, iaddr, (void *)end - st.st_size);
+	close(ifd);
+	verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr);
+	return st.st_size;
+}
+
+/* First map /dev/zero over entire memory, then insert kernel. */
+static void map_memory(unsigned long mem)
+{
+	if (mmap(0, mem,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)0)
+		err(1, "Mmaping /dev/zero for %li bytes", mem);
+}
+
+static u32 finish(unsigned long mem, unsigned long *page_offset,
+		  const char *initrd, unsigned long *ird_size)
+{
+	u32 *pgdir = NULL, *linear = NULL;
+	int i, pte_pages;
+
+	/* The initrd sits at the top of memory. */
+	*ird_size = load_initrd(initrd, mem);
+
+	/* The pages just below the initrd hold the pgdir and linear pagetables. */
+	pte_pages = 1 + (mem/getpagesize() + 1023)/1024;
+
+	pgdir = (u32 *)page_align(mem - *ird_size - pte_pages*getpagesize());
+	linear = (void *)pgdir + getpagesize();
+
+	/* Linear map all of memory at page_offset (to top of mem). */
+	if (mem > -*page_offset)
+		mem = -*page_offset;
+
+	for (i = 0; i < mem / getpagesize(); i++)
+		linear[i] = ((i * getpagesize()) | PAGE_PRESENT);
+	verbose("Linear %p-%p (%i-%i) = %#08x-%#08x\n",
+		linear, linear+i-1, 0, i-1, linear[0], linear[i-1]);
+
+	/* Now set up pgd so that this memory is at page_offset */
+	for (i = 0; i < mem / getpagesize(); i += getpagesize()/sizeof(u32)) {
+		pgdir[(i + *page_offset/getpagesize())/1024]
+			= (((u32)linear + i*sizeof(u32)) | PAGE_PRESENT);
+		verbose("Top level %lu = %#08x\n",
+			(i + *page_offset/getpagesize())/1024,
+			pgdir[(i + *page_offset/getpagesize())/1024]);
+	}
+
+	return (unsigned long)pgdir;
+}
+
+/* Returns the entry point */
+static u32 map_elf(int elf_fd, const Elf32_Ehdr *ehdr, unsigned long mem,
+		   unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   unsigned long *page_offset)
+{
+	void *addr;
+	Elf32_Phdr phdr[ehdr->e_phnum];
+	unsigned int i;
+
+	/* Sanity checks. */
+	if (ehdr->e_type != ET_EXEC
+	    || ehdr->e_machine != EM_386
+	    || ehdr->e_phentsize != sizeof(Elf32_Phdr)
+	    || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
+		errx(1, "Malformed ELF header");
+
+	if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
+		err(1, "Seeking to program headers");
+	if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
+		err(1, "Reading program headers");
+
+	map_memory(mem);
+
+	*page_offset = 0;
+	/* We map the loadable segments at virtual addresses corresponding
+	 * to their physical addresses (our virtual == guest physical). */
+	for (i = 0; i < ehdr->e_phnum; i++) {
+		if (phdr[i].p_type != PT_LOAD)
+			continue;
+
+		verbose("Segment %i: size %i addr %p\n",
+			i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
+		/* We map everything private, writable. */
+		if (phdr[i].p_paddr + phdr[i].p_memsz > mem)
+			errx(1, "Segment %i overlaps end of memory", i);
+
+		/* We expect linear address space. */
+		if (!*page_offset)
+			*page_offset = phdr[i].p_vaddr - phdr[i].p_paddr;
+		else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
+			errx(1, "Page offset of segment %i different", i);
+
+		/* Recent ld versions don't page align any more. */
+		if (phdr[i].p_paddr % getpagesize()) {
+			phdr[i].p_filesz += (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_offset -= (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_paddr -= (phdr[i].p_paddr % getpagesize());
+		}
+		addr = mmap((void *)phdr[i].p_paddr,
+			    phdr[i].p_filesz,
+			    PROT_READ|PROT_WRITE|PROT_EXEC,
+			    MAP_FIXED|MAP_PRIVATE,
+			    elf_fd, phdr[i].p_offset);
+		if (addr != (void *)phdr[i].p_paddr)
+			err(1, "Mmaping vmlinux segment %i returned %p not %p (%p)",
+			    i, addr, (void *)phdr[i].p_paddr, &phdr[i].p_paddr);
+	}
+
+	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
+	/* Entry is physical address: convert to virtual */
+	return ehdr->e_entry + *page_offset;
+}
+
+static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
+{
+	unsigned int i, possibilities[256] = { 0 };
+
+	for (i = 0; i + 4 < len; i++) {
+		/* mov 0xXXXXXXXX,%eax */
+		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
+			return (unsigned long)img[i+4] << 24;
+	}
+	errx(1, "could not determine page offset");
+}
+
+static u32 bzimage(int fd, unsigned long mem, unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   unsigned long *page_offset)
+{
+	gzFile f;
+	int ret, len = 0;
+	void *img = (void *)0x100000;
+
+	map_memory(mem);
+
+	f = gzdopen(fd, "rb");
+	if (gzdirect(f))
+		errx(1, "did not find correct gzip header");
+	while ((ret = gzread(f, img + len, 65536)) > 0)
+		len += ret;
+	if (ret < 0)
+		err(1, "reading image from bzImage");
+
+	verbose("Unpacked size %i addr %p\n", len, img);
+	*page_offset = intuit_page_offset(img, len);
+	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
+
+	/* Entry is physical address: convert to virtual */
+	return (u32)img + *page_offset;
+}
+
+static u32 load_bzimage(int bzimage_fd, const Elf32_Ehdr *ehdr,
+			unsigned long mem, unsigned long *pgdir_addr,
+			const char *initrd, unsigned long *ird_size,
+			unsigned long *page_offset)
+{
+	unsigned char c;
+	int state = 0;
+
+	/* Just brute force it. */
+	while (read(bzimage_fd, &c, 1) == 1) {
+		switch (state) {
+		case 0:
+			if (c == 0x1F)
+				state++;
+			break;
+		case 1:
+			if (c == 0x8B)
+				state++;
+			else
+				state = 0;
+			break;
+		case 2 ... 8:
+			state++;
+			break;
+		case 9:
+			lseek(bzimage_fd, -10, SEEK_CUR);
+			if (c != 0x03) /* Compressed under UNIX. */
+				state = -1;
+			else
+				return bzimage(bzimage_fd, mem, pgdir_addr,
+					       initrd, ird_size, page_offset);
+		}
+	}
+	errx(1, "Could not find kernel in bzImage");
+}
+
+static void *map_pages(unsigned long addr, unsigned int num)
+{
+	if (mmap((void *)addr, getpagesize() * num,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)addr)
+		err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr);
+	return (void *)addr;
+}
+
+static struct lguest_device_desc *
+get_dev_entry(struct lguest_device_desc *descs, u16 type, u16 num_pages)
+{
+	static unsigned long top = RESERVE_TOP;
+	int i;
+	unsigned long pfn = 0;
+
+	if (num_pages) {
+		top -= num_pages*getpagesize();
+		map_pages(top, num_pages);
+		pfn = top / getpagesize();
+	}
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++) {
+		if (!descs[i].type) {
+			descs[i].features = descs[i].status = 0;
+			descs[i].type = type;
+			descs[i].num_pages = num_pages;
+			descs[i].pfn = pfn;
+			return &descs[i];
+		}
+	}
+	errx(1, "too many devices");
+}
+
+static void set_fd(int fd, struct devices *devices)
+{
+	FD_SET(fd, &devices->infds);
+	if (fd > devices->max_infd)
+		devices->max_infd = fd;
+}
+
+static struct device *new_device(struct devices *devices,
+				 struct lguest_device_desc *descs,
+				 u16 type, u16 num_pages,
+				 int fd,
+				 int (*handle_input)(int, struct device *),
+				 unsigned long watch_off,
+				 u32 (*handle_output)(int,
+						      const struct iovec *,
+						      unsigned,
+						      struct device *))
+{
+	struct device *dev = malloc(sizeof(*dev));
+
+	dev->next = devices->dev;
+	devices->dev = dev;
+
+	dev->fd = fd;
+	if (handle_input)
+		set_fd(dev->fd, devices);
+	dev->desc = get_dev_entry(descs, type, num_pages);
+	dev->mem = (void *)(dev->desc->pfn * getpagesize());
+	dev->handle_input = handle_input;
+	dev->watch_address = (unsigned long)dev->mem + watch_off;
+	dev->handle_output = handle_output;
+	return dev;
+}
+
+static int tell_kernel(u32 pagelimit, u32 pgdir, u32 start, u32 page_offset)
+{
+	u32 args[] = { LHREQ_INITIALIZE,
+		       pagelimit, pgdir, start, page_offset };
+	int fd = open("/dev/lguest", O_RDWR);
+
+	if (fd < 0)
+		err(1, "Opening /dev/lguest");
+
+	verbose("Telling kernel limit %u, pgdir %i, e=%#08x page_off=0x%08x\n",
+		pagelimit, pgdir, start, page_offset);
+	if (write(fd, args, sizeof(args)) < 0)
+		err(1, "Writing to /dev/lguest");
+	return fd;
+}
+
+static void concat(char *dst, char *args[])
+{
+	unsigned int i, len = 0;
+
+	for (i = 0; args[i]; i++) {
+		strcpy(dst+len, args[i]);
+		strcat(dst+len, " ");
+		len += strlen(args[i]) + 1;
+	}
+	/* In case it's empty. */
+	dst[len] = '\0';
+}
+
+static void *_check_pointer(unsigned long addr, unsigned int size,
+			    unsigned int line)
+{
+	if (addr >= RESERVE_TOP || addr + size >= RESERVE_TOP)
+		errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr);
+	return (void *)addr;
+}
+#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
+
+/* Returns pointer to dma->used_len */
+static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num)
+{
+	unsigned int i;
+	struct lguest_dma *udma;
+
+	/* No buffers? */
+	if (dma == 0) {
+		printf("no buffers\n");
+		return NULL;
+	}
+
+	udma = check_pointer(dma, sizeof(*udma));
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!udma->len[i])
+			break;
+
+		iov[i].iov_base = check_pointer(udma->addr[i], udma->len[i]);
+		iov[i].iov_len = udma->len[i];
+	}
+	*num = i;
+	return &udma->used_len;
+}
+
+static u32 *get_dma_buffer(int fd, void *addr,
+			   struct iovec iov[], unsigned *num, u32 *irq)
+{
+	u32 buf[] = { LHREQ_GETDMA, (u32)addr };
+	unsigned long udma;
+	u32 *res;
+
+	udma = write(fd, buf, sizeof(buf));
+	if (udma == (unsigned long)-1)
+		return NULL;
+
+	/* Kernel stashes irq in ->used_len. */
+	res = dma2iov(udma, iov, num);
+	if (res)
+		*irq = *res;
+	return res;
+}
+
+static void trigger_irq(int fd, u32 irq)
+{
+	u32 buf[] = { LHREQ_IRQ, irq };
+	if (write(fd, buf, sizeof(buf)) != 0)
+		err(1, "Triggering irq %i", irq);
+}
+
+static struct termios orig_term;
+static void restore_term(void)
+{
+	tcsetattr(STDIN_FILENO, TCSANOW, &orig_term);
+}
+
+struct console_abort
+{
+	int count;
+	struct timeval start;
+};
+
+/* We DMA input to buffer bound at start of console page. */
+static int handle_console_input(int fd, struct device *dev)
+{
+	u32 num, irq = 0, *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	struct console_abort *abort = dev->priv;
+
+	lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq);
+	if (!lenp) {
+		warn("console: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0) {
+		warnx("Failed to get console input, ignoring console.");
+		len = 0;
+	}
+
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+
+	/* Three ^C within one second?  Exit. */
+	if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) {
+		if (!abort->count++)
+			gettimeofday(&abort->start, NULL);
+		else if (abort->count == 3) {
+			struct timeval now;
+			gettimeofday(&now, NULL);
+			if (now.tv_sec <= abort->start.tv_sec+1)
+				exit(2);
+			abort->count = 0;
+		}
+	} else
+		abort->count = 0;
+
+	if (!len) {
+		restore_term();
+		return 0;
+	}
+	return 1;
+}
+
+static unsigned long peer_offset(unsigned int peernum)
+{
+	return 4 * peernum;
+}
+
+static u32 handle_tun_output(int fd, const struct iovec *iov,
+			     unsigned num, struct device *dev)
+{
+	/* Now we've seen output, we should warn if we can't get buffers. */
+	*(bool *)dev->priv = true;
+	return writev(dev->fd, iov, num);
+}
+
+static u32 handle_block_output(int fd, const struct iovec *iov,
+			       unsigned num, struct device *dev)
+{
+	struct lguest_block_page *p = dev->mem;
+	u32 irq, reply_num, *lenp;
+	int len;
+	struct iovec reply[LGUEST_MAX_DMA_SECTIONS];
+	off64_t device_len, off = (off64_t)p->sector * 512;
+
+	device_len = *(off64_t *)dev->priv;
+
+	if (off >= device_len)
+		errx(1, "Bad offset %llu vs %llu", off, device_len);
+	if (lseek64(dev->fd, off, SEEK_SET) != off)
+		err(1, "Bad seek to sector %i", p->sector);
+
+	verbose("Block: %s at offset %llu\n", p->type ? "WRITE" : "READ", off);
+
+	lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq);
+	if (!lenp)
+		err(1, "Block request didn't give us a dma buffer");
+
+	if (p->type) {
+		len = writev(dev->fd, iov, num);
+		if (off + len > device_len) {
+			ftruncate(dev->fd, device_len);
+			errx(1, "Write past end %llu+%u", off, len);
+		}
+		*lenp = 0;
+	} else {
+		len = readv(dev->fd, reply, reply_num);
+		*lenp = len;
+	}
+
+	p->result = 1 + (p->bytes != len);
+	trigger_irq(fd, irq);
+	return 0;
+}
+
+#define HIPQUAD(ip)				\
+	((u8)(ip >> 24)),			\
+	((u8)(ip >> 16)),			\
+	((u8)(ip >> 8)),			\
+	((u8)(ip))
+
+static void configure_device(int fd, const char *devname, u32 ipaddr,
+			     unsigned char hwaddr[6])
+{
+	struct ifreq ifr;
+	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
+
+	memset(&ifr, 0, sizeof(ifr));
+	strcpy(ifr.ifr_name, devname);
+	sin->sin_family = AF_INET;
+	sin->sin_addr.s_addr = htonl(ipaddr);
+	if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
+		err(1, "Setting %s interface address", devname);
+	ifr.ifr_flags = IFF_UP;
+	if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
+		err(1, "Bringing interface %s up", devname);
+
+	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
+		err(1, "getting hw address for %s", devname);
+
+	memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6);
+}
+
+/* The child signals the parent while input is pending: avoids races. */
+static void wake_parent(int pipefd, struct devices *devices)
+{
+	int parent = getppid();
+	nice(19);
+
+	set_fd(pipefd, devices);
+
+	for (;;) {
+		fd_set rfds = devices->infds;
+
+		select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
+		if (FD_ISSET(pipefd, &rfds)) {
+			int ignorefd;
+			if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
+				exit(0);
+			FD_CLR(ignorefd, &devices->infds);
+		}
+		kill(parent, SIGUSR1);
+	}
+}
+
+/* We don't want the signal to kill us, just jerk us out of the kernel. */
+static void wakeup(int signo)
+{
+}
+
+static int handle_tun_input(int fd, struct device *dev)
+{
+	u32 irq = 0, num, *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+
+	lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num,
+			      &irq);
+	if (!lenp) {
+		if (*(bool *)dev->priv)
+			warn("network: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0)
+		err(1, "reading network");
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+	verbose("tun input packet len %i [%02x %02x] (%s)\n", len,
+		((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1],
+		lenp ? "sent" : "discarded");
+	return 1;
+}
+
+/* We use fcntl locks to reserve network slots (autocleanup!) */
+static unsigned int find_slot(int netfd, const char *filename)
+{
+	struct flock fl;
+
+	fl.l_type = F_WRLCK;
+	fl.l_whence = SEEK_SET;
+	fl.l_len = 1;
+	for (fl.l_start = 0;
+	     fl.l_start < getpagesize()/sizeof(struct lguest_net);
+	     fl.l_start++) {
+		if (fcntl(netfd, F_SETLK, &fl) == 0)
+			return fl.l_start;
+	}
+	errx(1, "No free slots in network file %s", filename);
+}
+
+static void setup_net_file(const char *filename,
+			   struct lguest_device_desc *descs,
+			   struct devices *devices)
+{
+	int netfd;
+	struct device *dev;
+
+	netfd = open(filename, O_RDWR, 0);
+	if (netfd < 0) {
+		if (errno == ENOENT) {
+			netfd = open(filename, O_RDWR|O_CREAT, 0600);
+			if (netfd >= 0) {
+				char page[getpagesize()];
+				/* 0xFFFF == NO_GUEST */
+				memset(page, 0xFF, sizeof(page));
+				write(netfd, page, sizeof(page));
+			}
+		}
+		if (netfd < 0)
+			err(1, "cannot open net file '%s'", filename);
+	}
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 -1, NULL, 0, NULL);
+
+	/* This is the slot for the guest to use. */
+	dev->desc->features = find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM;
+	/* We overwrite the /dev/zero mapping with the actual file. */
+	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
+			 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
+			err(1, "could not mmap '%s'", filename);
+	verbose("device %p@%p: shared net %s, peer %i\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()), filename,
+		dev->desc->features & ~LGUEST_NET_F_NOCSUM);
+}
+
+static u32 str2ip(const char *ipaddr)
+{
+	unsigned int byte[4];
+
+	sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
+	return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
+}
+
+/* adapted from libbridge */
+static void add_to_bridge(int fd, const char *if_name, const char *br_name)
+{
+	int r, ifidx;
+	struct ifreq ifr;
+
+	if (!*br_name)
+		errx(1, "must specify bridge name");
+
+	ifidx = if_nametoindex(if_name);
+	if (!ifidx)
+		errx(1, "interface %s does not exist!", if_name);
+
+	strncpy(ifr.ifr_name, br_name, IFNAMSIZ);
+	ifr.ifr_ifindex = ifidx;
+	r = ioctl(fd, SIOCBRADDIF, &ifr);
+	if (r != -1)
+		return;
+
+	switch (errno) {
+	case ENODEV:
+		errx(1, "bridge %s does not exist!", br_name);
+	case EBUSY:
+		errx(1, "device %s is already a member of a bridge; "
+			"can't enslave it to bridge %s.", if_name, br_name);
+	case ELOOP:
+		errx(1, "device %s is a bridge device itself; "
+			"can't enslave a bridge device to a bridge device.",
+			if_name);
+	default:
+		err(1, "can't add %s to bridge %s", if_name, br_name);
+	}
+}
+
+
+static void setup_tun_net(const char *arg,
+			  struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+	struct ifreq ifr;
+	int netfd, ipfd;
+	u32 ipaddr;
+	const char *br_name = NULL;
+
+	netfd = open("/dev/net/tun", O_RDWR);
+	if (netfd < 0)
+		err(1, "opening /dev/net/tun");
+
+	memset(&ifr, 0, sizeof(ifr));
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+	strcpy(ifr.ifr_name, "tap%d");
+	if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
+		err(1, "configuring /dev/net/tun");
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 netfd, handle_tun_input,
+			 peer_offset(0), handle_tun_output);
+	dev->priv = malloc(sizeof(bool));
+	*(bool *)dev->priv = false;
+
+	ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (ipfd < 0)
+		err(1, "opening IP socket");
+
+	if (!strncmp(BRIDGE_PFX, arg, strlen(BRIDGE_PFX))) {
+		ipaddr = INADDR_ANY;
+		br_name = arg + strlen(BRIDGE_PFX);
+		add_to_bridge(ipfd, ifr.ifr_name, br_name);
+	} else
+		ipaddr = str2ip(arg);
+
+	/* We are peer 0, rest is all NO_GUEST */
+	configure_device(ipfd, ifr.ifr_name, ipaddr, dev->mem);
+	close(ipfd);
+
+	/* You will be peer 1: we should create enough jitter to randomize */
+	dev->desc->features = NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS;
+	verbose("device %p@%p: tun net %u.%u.%u.%u\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()),
+		HIPQUAD(ipaddr));
+	if (br_name)
+		verbose("attached to bridge: %s\n", br_name);
+}
+
+static void setup_block_file(const char *filename,
+			     struct lguest_device_desc *descs,
+			     struct devices *devices)
+{
+	int fd;
+	struct device *dev;
+	off64_t *blocksize;
+	struct lguest_block_page *p;
+
+	fd = open(filename, O_RDWR|O_LARGEFILE|O_DIRECT, 0);
+	if (fd < 0)
+		err(1, "Opening %s", filename);
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_BLOCK, 1,
+			 fd, NULL, 0, handle_block_output);
+	dev->desc->features = LGUEST_DEVICE_F_RANDOMNESS;
+	blocksize = dev->priv = malloc(sizeof(*blocksize));
+	*blocksize = lseek64(fd, 0, SEEK_END);
+	p = dev->mem;
+
+	p->num_sectors = *blocksize/512;
+	verbose("device %p@%p: block %i sectors\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()), p->num_sectors);
+}
+
+static u32 handle_console_output(int fd, const struct iovec *iov,
+				 unsigned num, struct device *dev)
+{
+	return writev(STDOUT_FILENO, iov, num);
+}
+
+static void setup_console(struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+
+	if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
+		struct termios term = orig_term;
+		term.c_lflag &= ~(ISIG|ICANON|ECHO);
+		tcsetattr(STDIN_FILENO, TCSANOW, &term);
+		atexit(restore_term);
+	}
+
+	/* We don't currently require a page for the console. */
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_CONSOLE, 0,
+			 STDIN_FILENO, handle_console_input,
+			 4, handle_console_output);
+	dev->priv = malloc(sizeof(struct console_abort));
+	((struct console_abort *)dev->priv)->count = 0;
+	verbose("device %p@%p: console\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()));
+}
+
+static const char *get_arg(const char *arg, const char *prefix)
+{
+	if (strncmp(arg, prefix, strlen(prefix)) == 0)
+		return arg + strlen(prefix);
+	return NULL;
+}
+
+static u32 handle_device(int fd, unsigned long dma, unsigned long addr,
+			 struct devices *devices)
+{
+	struct device *i;
+	u32 *lenp;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	unsigned num = 0;
+
+	lenp = dma2iov(dma, iov, &num);
+	if (!lenp)
+		errx(1, "Bad SEND_DMA %li for address %#lx", dma, addr);
+
+	for (i = devices->dev; i; i = i->next) {
+		if (i->handle_output && addr == i->watch_address) {
+			*lenp = i->handle_output(fd, iov, num, i);
+			return 0;
+		}
+	}
+	warnx("Pending dma %p, addr %p", (void *)dma, (void *)addr);
+	return 0;
+}
+
+static void handle_input(int fd, int childfd, struct devices *devices)
+{
+	struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
+
+	for (;;) {
+		struct device *i;
+		fd_set fds = devices->infds;
+
+		if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0)
+			break;
+
+		for (i = devices->dev; i; i = i->next) {
+			if (i->handle_input && FD_ISSET(i->fd, &fds)) {
+				if (!i->handle_input(fd, i)) {
+					FD_CLR(i->fd, &devices->infds);
+					/* Tell child to ignore it too... */
+					write(childfd, &i->fd, sizeof(i->fd));
+				}
+			}
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long mem, pgdir, entry, initrd_size, page_offset;
+	int arg, kern_fd, fd, child, pipefd[2];
+	Elf32_Ehdr hdr;
+	struct sigaction act;
+	sigset_t sigset;
+	struct lguest_device_desc *devdescs;
+	struct devices devices;
+	struct lguest_boot_info *boot = (void *)0;
+	const char *initrd_name = NULL;
+	u32 (*load)(int, const Elf32_Ehdr *ehdr, unsigned long,
+		    unsigned long *, const char *, unsigned long *,
+		    unsigned long *);
+
+	if (argv[1] && strcmp(argv[1], "--verbose") == 0) {
+		verbose = true;
+		argv++;
+		argc--;
+	}
+
+	if (argc < 4)
+		errx(1, "Usage: lguest [--verbose] <mem> vmlinux "
+			"[--sharenet=<filename>|--tunnet=(<ipaddr>|bridge:<bridgename>)"
+			"|--block=<filename>|--initrd=<filename>]... [args...]");
+
+	zero_fd = open("/dev/zero", O_RDONLY, 0);
+	if (zero_fd < 0)
+		err(1, "Opening /dev/zero");
+
+	mem = memparse(argv[1]);
+	kern_fd = open(argv[2], O_RDONLY, 0);
+	if (kern_fd < 0)
+		err(1, "Opening %s", argv[2]);
+
+	if (read(kern_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+		err(1, "Reading %s elf header", argv[2]);
+
+	if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
+		load = map_elf;
+	else
+		load = load_bzimage;
+
+	devices.max_infd = -1;
+	devices.dev = NULL;
+	FD_ZERO(&devices.infds);
+
+	devdescs = map_pages(mem, 1);
+	arg = 3;
+	while (argv[arg] && argv[arg][0] == '-') {
+		const char *argval;
+
+		if ((argval = get_arg(argv[arg], "--sharenet=")) != NULL)
+			setup_net_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--tunnet=")) != NULL)
+			setup_tun_net(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--block=")) != NULL)
+			setup_block_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--initrd=")) != NULL)
+			initrd_name = argval;
+		else
+			errx(1, "unknown arg '%s'", argv[arg]);
+		arg++;
+	}
+
+	entry = load(kern_fd, &hdr, mem, &pgdir, initrd_name, &initrd_size,
+		     &page_offset);
+	setup_console(devdescs, &devices);
+
+	concat(boot->cmdline, argv+arg);
+	boot->max_pfn = mem/getpagesize();
+	boot->initrd_size = initrd_size;
+
+	act.sa_handler = wakeup;
+	sigemptyset(&act.sa_mask);
+	act.sa_flags = 0;
+	sigaction(SIGUSR1, &act, NULL);
+
+	pipe(pipefd);
+	child = fork();
+	if (child == -1)
+		err(1, "forking");
+
+	if (child == 0) {
+		close(pipefd[1]);
+		wake_parent(pipefd[0], &devices);
+	}
+	close(pipefd[0]);
+
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGUSR1);
+	sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+	fd = tell_kernel(RESERVE_TOP/getpagesize(), pgdir, entry, page_offset);
+
+	for (;;) {
+		unsigned long arr[2];
+		int readval;
+
+		sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+		readval = read(fd, arr, sizeof(arr));
+		sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+		switch (readval) {
+		case sizeof(arr):
+			handle_device(fd, arr[0], arr[1], &devices);
+			break;
+		case -1:
+			if (errno == EINTR)
+				break;
+		default:
+			if (errno == ENOENT) {
+				char reason[1024];
+				if (read(fd, reason, sizeof(reason)) > 0)
+					errx(1, "%s", reason);
+			}
+			err(1, "Running guest failed");
+		}
+		handle_input(fd, pipefd[1], &devices);
+	}
+}
Index: work-pv/Documentation/lguest/x86_64/Makefile
===================================================================
--- /dev/null
+++ work-pv/Documentation/lguest/x86_64/Makefile
@@ -0,0 +1,22 @@
+# This creates the demonstration utility "lguest" which runs a Linux guest.
+
+# For now, on x86_64, we hard-code the location of the lguest binary loader.
+# But when we can get a relocatable kernel, we'll have to work to make this
+# dynamic.
+LGUEST_GUEST_TOP := 0x7f000000
+
+CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \
+	-g \
+	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds
+LDLIBS:=-lz
+
+all: lguest.lds lguest
+
+# The linker script on x86 is so complex the only way of creating one
+# which will link our binary in the right place is to mangle the
+# default one.
+lguest.lds: Makefile
+	$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
+
+clean:
+	rm -f lguest.lds lguest
Index: work-pv/Documentation/lguest/x86_64/lguest.c
===================================================================
--- /dev/null
+++ work-pv/Documentation/lguest/x86_64/lguest.c
@@ -0,0 +1,1021 @@
+/* Simple program to layout "physical" memory for new lguest guest.
+ * Linked high to avoid likely physical memory.  */
+#define _LARGEFILE64_SOURCE
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <err.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <elf.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <netinet/in.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+#include <asm/vsyscall.h>
+#include <sys/uio.h>
+#include <termios.h>
+#include <zlib.h>
+typedef uint64_t u64;
+typedef uint32_t u32;
+typedef uint16_t u16;
+typedef uint8_t u8;
+
+#include "../../../include/asm/lguest_user.h"
+
+#define PAGE_PRESENT 0x7	/* Present, RW, User */
+#define NET_PEERNUM 1
+
+static bool verbose;
+#define verbose(args...) \
+	do { if (verbose) printf(args); fflush(stdout); } while(0)
+
+struct devices
+{
+	fd_set infds;
+	int max_infd;
+
+	struct device *dev;
+};
+
+struct device
+{
+	struct device *next;
+	struct lguest_device_desc *desc;
+	void *mem;
+
+	/* Watch this fd if handle_input non-NULL. */
+	int fd;
+	int (*handle_input)(int fd, struct device *me);
+
+	/* Watch DMA to this address if handle_output non-NULL. */
+	unsigned long watch_address;
+	u64 (*handle_output)(int fd, const struct iovec *iov,
+			     unsigned int num, struct device *me);
+
+	/* Device-specific data. */
+	void *priv;
+};
+
+static char buf[1024];
+static struct iovec discard_iov = { .iov_base=buf, .iov_len=sizeof(buf) };
+static int zero_fd;
+
+static u64 memparse(const char *ptr)
+{
+	char *end;
+	unsigned long ret = strtoul(ptr, &end, 0);
+
+	switch (*end) {
+	case 'G':
+	case 'g':
+		ret <<= 10;
+		/* fall through */
+	case 'M':
+	case 'm':
+		ret <<= 10;
+		/* fall through */
+	case 'K':
+	case 'k':
+		ret <<= 10;
+		end++;
+		/* fall through */
+	default:
+		break;
+	}
+	return ret;
+}
+
+static inline unsigned long page_align(unsigned long addr)
+{
+	return ((addr + getpagesize()-1) & ~(getpagesize()-1));
+}
+
+/* initrd gets loaded at top of memory: return length. */
+static unsigned long load_initrd(const char *name, unsigned long end)
+{
+	int ifd;
+	struct stat st;
+	void *iaddr;
+
+	if (!name)
+		return 0;
+
+	ifd = open(name, O_RDONLY, 0);
+	if (ifd < 0)
+		err(1, "Opening initrd '%s'", name);
+
+	if (fstat(ifd, &st) < 0)
+		err(1, "fstat() on initrd '%s'", name);
+
+	iaddr = mmap((void *)end - st.st_size, st.st_size,
+		     PROT_READ|PROT_EXEC|PROT_WRITE,
+		     MAP_FIXED|MAP_PRIVATE, ifd, 0);
+	if (iaddr != (void *)end - st.st_size)
+		err(1, "Mmaping initrd '%s' returned %p not %p",
+		    name, iaddr, (void *)end - st.st_size);
+	close(ifd);
+	verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr);
+	return st.st_size;
+}
+
+/* First map /dev/zero over entire memory, then insert kernel. */
+static void map_memory(unsigned long mem)
+{
+	if (mmap(0, mem,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)0)
+		err(1, "Mmaping /dev/zero for %li bytes", mem);
+}
+
+/* Returns the entry point */
+static u64 map_elf(int elf_fd, const Elf64_Ehdr *ehdr, unsigned long mem,
+		   unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   u64 *page_offset)
+{
+	void *addr;
+	Elf64_Phdr phdr[ehdr->e_phnum];
+	unsigned int i;
+	Elf64_Shdr sec[ehdr->e_shnum];
+	Elf64_Sym *syms;
+	char *strtab = NULL;
+	unsigned long nsyms = 0;
+
+	/* Sanity checks. */
+	if (ehdr->e_type != ET_EXEC
+	    || ehdr->e_machine != EM_X86_64
+	    || ehdr->e_phentsize != sizeof(Elf64_Phdr)
+	    || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf64_Phdr))
+		errx(1, "Malformed ELF header");
+
+	if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
+		err(1, "Seeking to program headers");
+	if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
+		err(1, "Reading program headers");
+
+	map_memory(mem);
+
+	*page_offset = 0;
+	*pgdir_addr = 0;
+	/* We map the loadable segments at virtual addresses corresponding
+	 * to their physical addresses (our virtual == guest physical). */
+	for (i = 0; i < ehdr->e_phnum; i++) {
+		if (phdr[i].p_type != PT_LOAD)
+			continue;
+
+		verbose("Segment %i: size %li addr %p\n",
+			i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
+		/* We map everything private, writable. */
+		if (phdr[i].p_paddr + phdr[i].p_memsz > mem)
+			errx(1, "Segment %i overlaps end of memory", i);
+
+		/* We expect linear address space. */
+		if (!*page_offset)
+			*page_offset = phdr[i].p_vaddr - phdr[i].p_paddr;
+		else if ((*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr) &&
+			 phdr[i].p_vaddr != VSYSCALL_START)
+			errx(1, "Page offset of segment %i different (got %lx, expected %lx)",
+			     i, (phdr[i].p_vaddr - phdr[i].p_paddr), *page_offset);
+
+		/* Recent ld versions don't page align any more. */
+		if (phdr[i].p_paddr % getpagesize()) {
+			phdr[i].p_filesz += (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_offset -= (phdr[i].p_paddr % getpagesize());
+			phdr[i].p_paddr -= (phdr[i].p_paddr % getpagesize());
+		}
+		addr = mmap((void *)phdr[i].p_paddr,
+			    phdr[i].p_filesz,
+			    PROT_READ|PROT_WRITE|PROT_EXEC,
+			    MAP_FIXED|MAP_PRIVATE,
+			    elf_fd, phdr[i].p_offset);
+		if (addr != (void *)phdr[i].p_paddr)
+			err(1, "Mmaping vmlinux segment %i returned %p not %p (%p)",
+			    i, addr, (void *)phdr[i].p_paddr, &phdr[i].p_paddr);
+	}
+
+	/* Now process the sections, searching for the boot page tables.
+	 * Start by finding the symtab section. */
+	if (lseek(elf_fd, ehdr->e_shoff, SEEK_SET) < 0)
+		err(1, "Seeking to section headers");
+	if (read(elf_fd, sec, sizeof(sec)) != sizeof(sec))
+		err(1, "Reading section headers");
+
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		if (sec[i].sh_type == SHT_SYMTAB) {
+			ssize_t ret = 0;
+			syms = malloc(sec[i].sh_size);
+			if (!syms)
+				err(1, "Not enough memory for symbol table");
+			ret = lseek(elf_fd, sec[i].sh_offset, SEEK_SET);
+			if (ret < 0)
+				err(1, "Seeking to symbol table");
+			ret = read(elf_fd, syms, sec[i].sh_size);
+			if (ret != sec[i].sh_size)
+				err(1, "Reading symbol table");
+			nsyms = sec[i].sh_size / sizeof(Elf64_Sym);
+
+			/* symtab links to strtab. We use it to find symbol
+			 * names */
+			strtab = malloc(sec[sec[i].sh_link].sh_size);
+			if (!strtab)
+				err(1, "Not enough memory for string table");
+			ret = lseek(elf_fd, sec[sec[i].sh_link].sh_offset, SEEK_SET);
+			if (ret < 0)
+				err(1, "Seeking to string table");
+			ret = read(elf_fd, strtab, sec[sec[i].sh_link].sh_size);
+			if (ret != sec[sec[i].sh_link].sh_size)
+				err(1, "Reading string table");
+			break;
+		}
+	}
+
+	/* We now have a pointer to the symtab, start searching for the symbol */
+	for (i = 0; i < nsyms; i++) {
+		if ((syms[i].st_shndx == SHN_UNDEF) || !syms[i].st_name)
+			continue;
+		if (!strcmp("boot_level4_pgt",
+			    strtab + syms[i].st_name)) {
+			*pgdir_addr = syms[i].st_value - *page_offset;
+			break;
+		}
+	}
+
+	if (!*pgdir_addr)
+		errx(1, "Unable to find boot pgdir");
+
+	*ird_size = load_initrd(initrd, mem);
+
+	/* Entry is physical address: convert to virtual */
+	verbose("entry=%lx page_offset=%lx entry+page_offset=%lx\n",
+	       ehdr->e_entry, *page_offset, ehdr->e_entry + *page_offset);
+	return ehdr->e_entry + *page_offset;
+}
+
+static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
+{
+	unsigned int i, possibilities[256] = { 0 };
+
+	for (i = 0; i + 4 < len; i++) {
+		/* mov 0xXXXXXXXX,%eax */
+		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
+			return (unsigned long)img[i+4] << 24;
+	}
+	errx(1, "could not determine page offset");
+}
+
+static u64 bzimage(int fd, unsigned long mem, unsigned long *pgdir_addr,
+		   const char *initrd, unsigned long *ird_size,
+		   u64 *page_offset)
+{
+	gzFile f;
+	int ret, len = 0;
+	void *img = (void *)0x100000;
+
+	map_memory(mem);
+
+	f = gzdopen(fd, "rb");
+	if (gzdirect(f))
+		errx(1, "did not find correct gzip header");
+	while ((ret = gzread(f, img + len, 65536)) > 0)
+		len += ret;
+	if (ret < 0)
+		err(1, "reading image from bzImage");
+
+	verbose("Unpacked size %i addr %p\n", len, img);
+	*page_offset = intuit_page_offset(img, len);
+	/* FIXME: no 64-bit finish() yet, so the bzImage path does not
+	 * build boot page tables and leaves *pgdir_addr unset. */
+
+	/* Entry is physical address: convert to virtual */
+	return (u64)img + *page_offset;
+}
+
+static u64 load_bzimage(int bzimage_fd, const Elf64_Ehdr *ehdr,
+			unsigned long mem, unsigned long *pgdir_addr,
+			const char *initrd, unsigned long *ird_size,
+			u64 *page_offset)
+{
+	unsigned char c;
+	int state = 0;
+
+	/* Just brute force it. */
+	while (read(bzimage_fd, &c, 1) == 1) {
+		switch (state) {
+		case 0:
+			if (c == 0x1F)
+				state++;
+			break;
+		case 1:
+			if (c == 0x8B)
+				state++;
+			else
+				state = 0;
+			break;
+		case 2 ... 8:
+			state++;
+			break;
+		case 9:
+			lseek(bzimage_fd, -10, SEEK_CUR);
+			if (c != 0x03) /* Compressed under UNIX. */
+				state = -1;
+			else
+				return bzimage(bzimage_fd, mem, pgdir_addr,
+					       initrd, ird_size, page_offset);
+		}
+	}
+	errx(1, "Could not find kernel in bzImage");
+}
+
+static void *map_pages(unsigned long addr, unsigned int num)
+{
+	if (mmap((void *)addr, getpagesize() * num,
+		 PROT_READ|PROT_WRITE|PROT_EXEC,
+		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)addr)
+		err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr);
+	return (void *)addr;
+}
+
+static struct lguest_device_desc *
+get_dev_entry(struct lguest_device_desc *descs, u16 type, u16 num_pages)
+{
+	static unsigned long top = LGUEST_GUEST_TOP;
+	int i;
+	unsigned long pfn = 0;
+
+	if (num_pages) {
+		top -= num_pages*getpagesize();
+		map_pages(top, num_pages);
+		pfn = top / getpagesize();
+	}
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++) {
+		if (!descs[i].type) {
+			descs[i].features = descs[i].status = 0;
+			descs[i].type = type;
+			descs[i].num_pages = num_pages;
+			descs[i].pfn = pfn;
+			return &descs[i];
+		}
+	}
+	errx(1, "too many devices");
+}
+
+static void set_fd(int fd, struct devices *devices)
+{
+	FD_SET(fd, &devices->infds);
+	if (fd > devices->max_infd)
+		devices->max_infd = fd;
+}
+
+static struct device *new_device(struct devices *devices,
+				 struct lguest_device_desc *descs,
+				 u16 type, u16 num_pages,
+				 int fd,
+				 int (*handle_input)(int, struct device *),
+				 unsigned long watch_off,
+				 u64 (*handle_output)(int,
+						      const struct iovec *,
+						      unsigned,
+						      struct device *))
+{
+	struct device *dev = malloc(sizeof(*dev));
+
+	dev->next = devices->dev;
+	devices->dev = dev;
+
+	dev->fd = fd;
+	if (handle_input)
+		set_fd(dev->fd, devices);
+	dev->desc = get_dev_entry(descs, type, num_pages);
+	dev->mem = (void *)(dev->desc->pfn * getpagesize());
+	dev->handle_input = handle_input;
+	dev->watch_address = (unsigned long)dev->mem + watch_off;
+	dev->handle_output = handle_output;
+	return dev;
+}
+
+#define DEVNAME "/dev/lguest"
+
+static int tell_kernel(u64 pagelimit, u64 pgdir, u64 start, u64 page_offset)
+{
+	u64 args[] = { LHREQ_INITIALIZE,
+		       pagelimit, pgdir, start, page_offset };
+	int fd;
+
+	fd = open(DEVNAME, O_RDWR);
+	if (fd < 0)
+		err(1, "Opening %s", DEVNAME);
+
+	verbose("Telling kernel limit %lu, pgdir %#lx, e=%#08lx page_off=%#08lx\n",
+		pagelimit, pgdir, start, page_offset);
+	if (write(fd, args, sizeof(args)) < 0)
+		err(1, "Writing to /dev/lguest");
+	return fd;
+}
+
+static void concat(char *dst, char *args[])
+{
+	unsigned int i, len = 0;
+
+	for (i = 0; args[i]; i++) {
+		strcpy(dst+len, args[i]);
+		strcat(dst+len, " ");
+		len += strlen(args[i]) + 1;
+	}
+	/* In case it's empty. */
+	dst[len] = '\0';
+}
+
+static void *_check_pointer(unsigned long addr, unsigned int size,
+			    unsigned int line)
+{
+	if (addr >= LGUEST_GUEST_TOP || addr + size >= LGUEST_GUEST_TOP)
+		errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr);
+	return (void *)addr;
+}
+#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
+
+/* Returns pointer to dma->used_len */
+static u64 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num)
+{
+	unsigned int i;
+	struct lguest_dma *udma;
+
+	/* No buffers? */
+	if (dma == 0) {
+		printf("no buffers\n");
+		return NULL;
+	}
+
+	udma = check_pointer(dma, sizeof(*udma));
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!udma->len[i])
+			break;
+
+		iov[i].iov_base = check_pointer(udma->addr[i], udma->len[i]);
+		iov[i].iov_len = udma->len[i];
+	}
+	*num = i;
+	return &udma->used_len;
+}
+
+static u64 *get_dma_buffer(int fd, void *addr,
+			   struct iovec iov[], unsigned *num, u32 *irq)
+{
+	u64 buf[] = { LHREQ_GETDMA, (u64)addr };
+	unsigned long udma;
+	u64 *res;
+
+	udma = write(fd, buf, sizeof(buf));
+	if (udma == (unsigned long)-1)
+		return NULL;
+
+	/* Kernel stashes irq in ->used_len. */
+	res = dma2iov(udma, iov, num);
+	if (res)
+		*irq = *res;
+	return res;
+}
+
+static void trigger_irq(int fd, u32 irq)
+{
+	u64 buf[] = { LHREQ_IRQ, irq };
+	if (write(fd, buf, sizeof(buf)) != 0)
+		err(1, "Triggering irq %i", irq);
+}
+
+static struct termios orig_term;
+static void restore_term(void)
+{
+	tcsetattr(STDIN_FILENO, TCSANOW, &orig_term);
+}
+
+struct console_abort
+{
+	int count;
+	struct timeval start;
+};
+
+/* We DMA input to buffer bound at start of console page. */
+static int handle_console_input(int fd, struct device *dev)
+{
+	u32 num, irq = 0;
+	u64 *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	struct console_abort *abort = dev->priv;
+
+	lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq);
+	if (!lenp) {
+		warn("console: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0) {
+		warnx("Failed to get console input, ignoring console.");
+		len = 0;
+	}
+
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+
+	/* Three ^C within one second?  Exit. */
+	if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) {
+		if (!abort->count++)
+			gettimeofday(&abort->start, NULL);
+		else if (abort->count == 3) {
+			struct timeval now;
+			gettimeofday(&now, NULL);
+			if (now.tv_sec <= abort->start.tv_sec+1)
+				exit(2);
+			abort->count = 0;
+		}
+	} else
+		abort->count = 0;
+
+	if (!len) {
+		restore_term();
+		return 0;
+	}
+	return 1;
+}
+
+static unsigned long peer_offset(unsigned int peernum)
+{
+	return 4 * peernum;
+}
+
+static u64 handle_tun_output(int fd, const struct iovec *iov,
+			     unsigned num, struct device *dev)
+{
+	/* Now we've seen output, we should warn if we can't get buffers. */
+	*(bool *)dev->priv = true;
+	return writev(dev->fd, iov, num);
+}
+
+static u64 handle_block_output(int fd, const struct iovec *iov,
+			       unsigned num, struct device *dev)
+{
+	struct lguest_block_page *p = dev->mem;
+	u32 irq, reply_num;
+	u64 *lenp;
+	int len;
+	struct iovec reply[LGUEST_MAX_DMA_SECTIONS];
+	off64_t device_len, off = (off64_t)p->sector * 512;
+
+	device_len = *(off64_t *)dev->priv;
+
+	if (off >= device_len)
+		errx(1, "Bad offset %lu vs %lu", off, device_len);
+	if (lseek64(dev->fd, off, SEEK_SET) != off)
+		err(1, "Bad seek to sector %i", p->sector);
+
+	verbose("Block: %s at offset %lu\n", p->type ? "WRITE" : "READ", off);
+
+	lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq);
+	if (!lenp)
+		err(1, "Block request didn't give us a dma buffer");
+
+	if (p->type) {
+		len = writev(dev->fd, iov, num);
+		if (off + len > device_len) {
+			ftruncate(dev->fd, device_len);
+			errx(1, "Write past end %lu+%u", off, len);
+		}
+		*lenp = 0;
+	} else {
+		len = readv(dev->fd, reply, reply_num);
+		*lenp = len;
+	}
+
+	p->result = 1 + (p->bytes != len);
+	trigger_irq(fd, irq);
+	return 0;
+}
+
+#define HIPQUAD(ip)				\
+	((u8)(ip >> 24)),			\
+	((u8)(ip >> 16)),			\
+	((u8)(ip >> 8)),			\
+	((u8)(ip))
+
+static void configure_device(const char *devname, u64 ipaddr,
+			     unsigned char hwaddr[6])
+{
+	struct ifreq ifr;
+	int fd;
+	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
+
+	memset(&ifr, 0, sizeof(ifr));
+	strcpy(ifr.ifr_name, devname);
+	sin->sin_family = AF_INET;
+	sin->sin_addr.s_addr = htonl(ipaddr);
+	fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+	if (fd < 0)
+		err(1, "opening IP socket");
+	if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
+		err(1, "Setting %s interface address", devname);
+	ifr.ifr_flags = IFF_UP;
+	if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
+		err(1, "Bringing interface %s up", devname);
+
+	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
+		err(1, "getting hw address for %s", devname);
+
+	memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6);
+}
+
+/* We send SIGUSR1 to the parent while input is pending: avoids races. */
+static void wake_parent(int pipefd, struct devices *devices)
+{
+	int parent = getppid();
+	nice(19);
+
+	set_fd(pipefd, devices);
+
+	for (;;) {
+		fd_set rfds = devices->infds;
+
+		select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
+		if (FD_ISSET(pipefd, &rfds)) {
+			int ignorefd;
+			if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
+				exit(0);
+			FD_CLR(ignorefd, &devices->infds);
+		}
+		kill(parent, SIGUSR1);
+	}
+}
+
+/* We don't want signal to kill us, just jerk us out of kernel. */
+static void wakeup(int signo)
+{
+}
+
+static int handle_tun_input(int fd, struct device *dev)
+{
+	u32 irq = 0, num;
+	u64 *lenp;
+	int len;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+
+	lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num,
+			      &irq);
+	if (!lenp) {
+		if (*(bool *)dev->priv)
+			warn("network: no dma buffer!");
+		iov[0] = discard_iov;
+		num = 1;
+	}
+
+	len = readv(dev->fd, iov, num);
+	if (len <= 0)
+		err(1, "reading network");
+	if (lenp) {
+		*lenp = len;
+		trigger_irq(fd, irq);
+	}
+	verbose("tun input packet len %i [%02x %02x] (%s)\n", len,
+		((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1],
+		lenp ? "sent" : "discarded");
+	return 1;
+}
+
+/* We use fcntl locks to reserve network slots (autocleanup!) */
+static unsigned int find_slot(int netfd, const char *filename)
+{
+	struct flock fl;
+
+	fl.l_type = F_WRLCK;
+	fl.l_whence = SEEK_SET;
+	fl.l_len = 1;
+	for (fl.l_start = 0;
+	     fl.l_start < getpagesize()/sizeof(struct lguest_net);
+	     fl.l_start++) {
+		if (fcntl(netfd, F_SETLK, &fl) == 0)
+			return fl.l_start;
+	}
+	errx(1, "No free slots in network file %s", filename);
+}
+
+static void setup_net_file(const char *filename,
+			   struct lguest_device_desc *descs,
+			   struct devices *devices)
+{
+	int netfd;
+	struct device *dev;
+
+	netfd = open(filename, O_RDWR, 0);
+	if (netfd < 0) {
+		if (errno == ENOENT) {
+			netfd = open(filename, O_RDWR|O_CREAT, 0600);
+			if (netfd >= 0) {
+				char page[getpagesize()];
+				/* 0xFFFF == NO_GUEST */
+				memset(page, 0xFF, sizeof(page));
+				write(netfd, page, sizeof(page));
+			}
+		}
+		if (netfd < 0)
+			err(1, "cannot open net file '%s'", filename);
+	}
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 -1, NULL, 0, NULL);
+
+	/* This is the slot for the guest to use. */
+	dev->desc->features = find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM;
+	/* We overwrite the /dev/zero mapping with the actual file. */
+	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
+			 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
+			err(1, "could not mmap '%s'", filename);
+	verbose("device %p@%p: shared net %s, peer %i\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()), filename,
+		dev->desc->features & ~LGUEST_NET_F_NOCSUM);
+}
+
+static u64 str2ip(const char *ipaddr)
+{
+	unsigned int byte[4];
+
+	sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
+	return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
+}
+
+static void setup_tun_net(const char *ipaddr,
+			  struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+	struct ifreq ifr;
+	int netfd;
+
+	netfd = open("/dev/net/tun", O_RDWR);
+	if (netfd < 0)
+		err(1, "opening /dev/net/tun");
+
+	memset(&ifr, 0, sizeof(ifr));
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+	strcpy(ifr.ifr_name, "tap%d");
+	if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
+		err(1, "configuring /dev/net/tun");
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
+			 netfd, handle_tun_input,
+			 peer_offset(0), handle_tun_output);
+	dev->priv = malloc(sizeof(bool));
+	*(bool *)dev->priv = false;
+
+	/* We are peer 0, rest is all NO_GUEST */
+	memset(dev->mem, 0xFF, getpagesize());
+	configure_device(ifr.ifr_name, str2ip(ipaddr), dev->mem);
+
+	/* You will be peer 1: we should create enough jitter to randomize */
+	dev->desc->features = NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS;
+	verbose("device %p@%p: tun net %u.%u.%u.%u\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()),
+		HIPQUAD(str2ip(ipaddr)));
+}
+
+static void setup_block_file(const char *filename,
+			     struct lguest_device_desc *descs,
+			     struct devices *devices)
+{
+	int fd;
+	struct device *dev;
+	off64_t *blocksize;
+	struct lguest_block_page *p;
+
+	fd = open(filename, O_RDWR|O_LARGEFILE|O_DIRECT, 0);
+	if (fd < 0)
+		err(1, "Opening %s", filename);
+
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_BLOCK, 1,
+			 fd, NULL, 0, handle_block_output);
+	dev->desc->features = LGUEST_DEVICE_F_RANDOMNESS;
+	blocksize = dev->priv = malloc(sizeof(*blocksize));
+	*blocksize = lseek64(fd, 0, SEEK_END);
+	p = dev->mem;
+
+	p->num_sectors = *blocksize/512;
+	verbose("device %p@%p: block %i sectors\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()), p->num_sectors);
+}
+
+static u64 handle_console_output(int fd, const struct iovec *iov,
+				 unsigned num, struct device *dev)
+{
+	return writev(STDOUT_FILENO, iov, num);
+}
+
+static void setup_console(struct lguest_device_desc *descs,
+			  struct devices *devices)
+{
+	struct device *dev;
+
+	if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
+		struct termios term = orig_term;
+		term.c_lflag &= ~(ISIG|ICANON|ECHO);
+		tcsetattr(STDIN_FILENO, TCSANOW, &term);
+		atexit(restore_term);
+	}
+
+	/* We don't currently require a page for the console. */
+	dev = new_device(devices, descs, LGUEST_DEVICE_T_CONSOLE, 0,
+			 STDIN_FILENO, handle_console_input,
+			 4, handle_console_output);
+	dev->priv = malloc(sizeof(struct console_abort));
+	((struct console_abort *)dev->priv)->count = 0;
+	verbose("device %p@%p: console\n", dev->desc,
+		(void *)(dev->desc->pfn * getpagesize()));
+}
+
+static const char *get_arg(const char *arg, const char *prefix)
+{
+	if (strncmp(arg, prefix, strlen(prefix)) == 0)
+		return arg + strlen(prefix);
+	return NULL;
+}
+
+static u32 handle_device(int fd, unsigned long dma, unsigned long addr,
+			 struct devices *devices)
+{
+	struct device *i;
+	u64 *lenp;
+	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
+	unsigned num = 0;
+
+	lenp = dma2iov(dma, iov, &num);
+	if (!lenp)
+		errx(1, "Bad SEND_DMA %#lx for address %#lx", dma, addr);
+
+	for (i = devices->dev; i; i = i->next) {
+		if (i->handle_output && addr == i->watch_address) {
+			*lenp = i->handle_output(fd, iov, num, i);
+			return 0;
+		}
+	}
+	warnx("Pending dma %p, addr %p", (void *)dma, (void *)addr);
+	return 0;
+}
+
+static void handle_input(int fd, int childfd, struct devices *devices)
+{
+	struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
+
+	for (;;) {
+		struct device *i;
+		fd_set fds = devices->infds;
+
+		if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0)
+			break;
+
+		for (i = devices->dev; i; i = i->next) {
+			if (i->handle_input && FD_ISSET(i->fd, &fds)) {
+				if (!i->handle_input(fd, i)) {
+					FD_CLR(i->fd, &devices->infds);
+					/* Tell child to ignore it too... */
+					write(childfd, &i->fd, sizeof(i->fd));
+				}
+			}
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned long mem, pgdir, entry, initrd_size, page_offset;
+	int arg, kern_fd, fd, child, pipefd[2];
+	Elf64_Ehdr hdr;
+	struct sigaction act;
+	sigset_t sigset;
+	struct lguest_device_desc *devdescs;
+	struct devices devices;
+	struct lguest_boot_info *boot = (void *)0;
+	const char *initrd_name = NULL;
+	u64 (*load)(int, const Elf64_Ehdr *ehdr, unsigned long,
+		    unsigned long *, const char *, unsigned long *,
+		    u64 *);
+
+	if (argv[1] && strcmp(argv[1], "--verbose") == 0) {
+		verbose = true;
+		argv++;
+		argc--;
+	}
+
+	if (argc < 3)
+		errx(1, "Usage: lguest [--verbose] <mem> vmlinux "
+			"[--sharenet=<filename>|--tunnet=<ipaddr>|--block=<filename>"
+			"|--initrd=<filename>]... [args...]");
+
+	zero_fd = open("/dev/zero", O_RDONLY, 0);
+	if (zero_fd < 0)
+		err(1, "Opening /dev/zero");
+
+	mem = memparse(argv[1]);
+	kern_fd = open(argv[2], O_RDONLY, 0);
+	if (kern_fd < 0)
+		err(1, "Opening %s", argv[2]);
+
+	if (read(kern_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+		err(1, "Reading %s elf header", argv[2]);
+
+	if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
+		load = map_elf;
+	else
+		load = load_bzimage;
+
+	devices.max_infd = -1;
+	devices.dev = NULL;
+	FD_ZERO(&devices.infds);
+
+	devdescs = map_pages(mem, 1);
+	arg = 3;
+	while (argv[arg] && argv[arg][0] == '-') {
+		const char *argval;
+
+		if ((argval = get_arg(argv[arg], "--sharenet=")) != NULL)
+			setup_net_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--tunnet=")) != NULL)
+			setup_tun_net(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--block=")) != NULL)
+			setup_block_file(argval, devdescs, &devices);
+		else if ((argval = get_arg(argv[arg], "--initrd=")) != NULL)
+			initrd_name = argval;
+		else
+			errx(1, "unknown arg '%s'", argv[arg]);
+		arg++;
+	}
+
+	entry = load(kern_fd, &hdr, mem, &pgdir, initrd_name, &initrd_size,
+		     &page_offset);
+	setup_console(devdescs, &devices);
+
+	concat(boot->cmdline, argv+arg);
+	boot->max_pfn = mem/getpagesize();
+	boot->initrd_size = initrd_size;
+
+	act.sa_handler = wakeup;
+	sigemptyset(&act.sa_mask);
+	act.sa_flags = 0;
+	sigaction(SIGUSR1, &act, NULL);
+
+	pipe(pipefd);
+	child = fork();
+	if (child == -1)
+		err(1, "forking");
+
+	if (child == 0) {
+		close(pipefd[1]);
+		wake_parent(pipefd[0], &devices);
+	}
+	close(pipefd[0]);
+
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGUSR1);
+	sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+	/* LGUEST_GUEST_TOP defined in Makefile, just below us. */
+	fd = tell_kernel(LGUEST_GUEST_TOP/getpagesize(),
+			 pgdir, entry, page_offset);
+
+	for (;;) {
+		unsigned long arr[2];
+		int readval;
+
+		sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+		readval = read(fd, arr, sizeof(arr));
+		sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+		switch (readval) {
+		case sizeof(arr):
+			handle_device(fd, arr[0], arr[1], &devices);
+			break;
+		case -1:
+			if (errno == EINTR)
+				break;
+			/* Fall through. */
+		default:
+			if (errno == ENOENT) {
+				char reason[1024];
+				int rlen = read(fd, reason, sizeof(reason)-1);
+				if (rlen > 0) {
+					/* The kernel's message is not
+					 * NUL-terminated. */
+					reason[rlen] = '\0';
+					errx(1, "%s", reason);
+				}
+			}
+			err(1, "Running guest failed");
+		}
+		handle_input(fd, pipefd[1], &devices);
+	}
+}
Index: work-pv/Documentation/lguest/Makefile
===================================================================
--- work-pv.orig/Documentation/lguest/Makefile
+++ /dev/null
@@ -1,21 +0,0 @@
-# This creates the demonstration utility "lguest" which runs a Linux guest.
-
-# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
-# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
-include ../../.config
-LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)
-
-CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \
-	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds
-LDLIBS:=-lz
-
-all: lguest.lds lguest
-
-# The linker script on x86 is so complex the only way of creating one
-# which will link our binary in the right place is to mangle the
-# default one.
-lguest.lds:
-	$(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@
-
-clean:
-	rm -f lguest.lds lguest
Index: work-pv/Documentation/lguest/lguest.c
===================================================================
--- work-pv.orig/Documentation/lguest/lguest.c
+++ /dev/null
@@ -1,1039 +0,0 @@
-/* Simple program to layout "physical" memory for new lguest guest.
- * Linked high to avoid likely physical memory.  */
-#define _LARGEFILE64_SOURCE
-#define _GNU_SOURCE
-#include <stdio.h>
-#include <string.h>
-#include <unistd.h>
-#include <err.h>
-#include <stdint.h>
-#include <stdlib.h>
-#include <elf.h>
-#include <sys/mman.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <sys/wait.h>
-#include <fcntl.h>
-#include <assert.h>
-#include <stdbool.h>
-#include <errno.h>
-#include <signal.h>
-#include <sys/socket.h>
-#include <sys/ioctl.h>
-#include <sys/time.h>
-#include <time.h>
-#include <netinet/in.h>
-#include <net/if.h>
-#include <linux/sockios.h>
-#include <linux/if_tun.h>
-#include <sys/uio.h>
-#include <termios.h>
-#include <zlib.h>
-typedef uint32_t u32;
-typedef uint16_t u16;
-typedef uint8_t u8;
-
-#include "../../include/asm/lguest_user.h"
-
-#define PAGE_PRESENT 0x7 	/* Present, RW, Execute */
-#define NET_PEERNUM 1
-#define BRIDGE_PFX "bridge:"
-
-static bool verbose;
-#define verbose(args...) \
-	do { if (verbose) printf(args); fflush(stdout); } while(0)
-
-struct devices
-{
-	fd_set infds;
-	int max_infd;
-
-	struct device *dev;
-};
-
-struct device
-{
-	struct device *next;
-	struct lguest_device_desc *desc;
-	void *mem;
-
-	/* Watch this fd if handle_input non-NULL. */
-	int fd;
-	int (*handle_input)(int fd, struct device *me);
-
-	/* Watch DMA to this address if handle_input non-NULL. */
-	unsigned long watch_address;
-	u32 (*handle_output)(int fd, const struct iovec *iov,
-			     unsigned int num, struct device *me);
-
-	/* Device-specific data. */
-	void *priv;
-};
-
-static char buf[1024];
-static struct iovec discard_iov = { .iov_base=buf, .iov_len=sizeof(buf) };
-static int zero_fd;
-
-/* LGUEST_GUEST_TOP defined in Makefile, just below us.
-   FIXME: vdso gets mapped just under it, and we need to protect that. */
-#define RESERVE_TOP LGUEST_GUEST_TOP - 1024*1024
-
-static u32 memparse(const char *ptr)
-{
-	char *end;
-	unsigned long ret = strtoul(ptr, &end, 0);
-
-	switch (*end) {
-	case 'G':
-	case 'g':
-		ret <<= 10;
-	case 'M':
-	case 'm':
-		ret <<= 10;
-	case 'K':
-	case 'k':
-		ret <<= 10;
-		end++;
-	default:
-		break;
-	}
-	return ret;
-}
-
-static inline unsigned long page_align(unsigned long addr)
-{
-	return ((addr + getpagesize()-1) & ~(getpagesize()-1));
-}
-
-/* initrd gets loaded at top of memory: return length. */
-static unsigned long load_initrd(const char *name, unsigned long end)
-{
-	int ifd;
-	struct stat st;
-	void *iaddr;
-
-	if (!name)
-		return 0;
-
-	ifd = open(name, O_RDONLY, 0);
-	if (ifd < 0)
-		err(1, "Opening initrd '%s'", name);
-		
-	if (fstat(ifd, &st) < 0)
-		err(1, "fstat() on initrd '%s'", name);
-
-	iaddr = mmap((void *)end - st.st_size, st.st_size,
-		     PROT_READ|PROT_EXEC|PROT_WRITE,
-		     MAP_FIXED|MAP_PRIVATE, ifd, 0);
-	if (iaddr != (void *)end - st.st_size)
-		err(1, "Mmaping initrd '%s' returned %p not %p",
-		    name, iaddr, (void *)end - st.st_size);
-	close(ifd);
-	verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr);
-	return st.st_size;
-}
-
-/* First map /dev/zero over entire memory, then insert kernel. */
-static void map_memory(unsigned long mem)
-{
-	if (mmap(0, mem,
-		 PROT_READ|PROT_WRITE|PROT_EXEC,
-		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)0)
-		err(1, "Mmaping /dev/zero for %li bytes", mem);
-}
-
-static u32 finish(unsigned long mem, unsigned long *page_offset,
-		  const char *initrd, unsigned long *ird_size)
-{
-	u32 *pgdir = NULL, *linear = NULL;
-	int i, pte_pages;
-
-	/* This is a top of mem. */
-	*ird_size = load_initrd(initrd, mem);
-
-	/* Below initrd is used as top level of pagetable. */
-	pte_pages = 1 + (mem/getpagesize() + 1023)/1024;
-
-	pgdir = (u32 *)page_align(mem - *ird_size - pte_pages*getpagesize());
-	linear = (void *)pgdir + getpagesize();
-
-	/* Linear map all of memory at page_offset (to top of mem). */
-	if (mem > -*page_offset)
-		mem = -*page_offset;
-
-	for (i = 0; i < mem / getpagesize(); i++)
-		linear[i] = ((i * getpagesize()) | PAGE_PRESENT);
-	verbose("Linear %p-%p (%i-%i) = %#08x-%#08x\n",
-		linear, linear+i-1, 0, i-1, linear[0], linear[i-1]);
-
-	/* Now set up pgd so that this memory is at page_offset */
-	for (i = 0; i < mem / getpagesize(); i += getpagesize()/sizeof(u32)) {
-		pgdir[(i + *page_offset/getpagesize())/1024] 
-			= (((u32)linear + i*sizeof(u32)) | PAGE_PRESENT);
-		verbose("Top level %lu = %#08x\n",
-			(i + *page_offset/getpagesize())/1024,
-			pgdir[(i + *page_offset/getpagesize())/1024]);
-	}
-
-	return (unsigned long)pgdir;
-}
-
-/* Returns the entry point */
-static u32 map_elf(int elf_fd, const Elf32_Ehdr *ehdr, unsigned long mem,
-		   unsigned long *pgdir_addr,
-		   const char *initrd, unsigned long *ird_size,
-		   unsigned long *page_offset)
-{
-	void *addr;
-	Elf32_Phdr phdr[ehdr->e_phnum];
-	unsigned int i;
-
-	/* Sanity checks. */
-	if (ehdr->e_type != ET_EXEC
-	    || ehdr->e_machine != EM_386
-	    || ehdr->e_phentsize != sizeof(Elf32_Phdr)
-	    || ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
-		errx(1, "Malformed elf header");
-
-	if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
-		err(1, "Seeking to program headers");
-	if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
-		err(1, "Reading program headers");
-
-	map_memory(mem);
-
-	*page_offset = 0;
-	/* We map the loadable segments at virtual addresses corresponding
-	 * to their physical addresses (our virtual == guest physical). */
-	for (i = 0; i < ehdr->e_phnum; i++) {
-		if (phdr[i].p_type != PT_LOAD)
-			continue;
-
-		verbose("Section %i: size %i addr %p\n",
-			i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
-		/* We map everything private, writable. */
-		if (phdr[i].p_paddr + phdr[i].p_memsz > mem)
-			errx(1, "Segment %i overlaps end of memory", i);
-
-		/* We expect linear address space. */
-		if (!*page_offset)
-			*page_offset = phdr[i].p_vaddr - phdr[i].p_paddr;
-		else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
-			errx(1, "Page offset of section %i different", i);
-
-		/* Recent ld versions don't page align any more. */
-		if (phdr[i].p_paddr % getpagesize()) {
-			phdr[i].p_filesz += (phdr[i].p_paddr % getpagesize());
-			phdr[i].p_offset -= (phdr[i].p_paddr % getpagesize());
-			phdr[i].p_paddr -= (phdr[i].p_paddr % getpagesize());
-		}
-		addr = mmap((void *)phdr[i].p_paddr,
-			    phdr[i].p_filesz,
-			    PROT_READ|PROT_WRITE|PROT_EXEC,
-			    MAP_FIXED|MAP_PRIVATE,
-			    elf_fd, phdr[i].p_offset);
-		if (addr != (void *)phdr[i].p_paddr)
-			err(1, "Mmaping vmlinux segment %i returned %p not %p (%p)",
-			    i, addr, (void *)phdr[i].p_paddr, &phdr[i].p_paddr);
-	}
-
-	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
-	/* Entry is physical address: convert to virtual */
-	return ehdr->e_entry + *page_offset;
-}
-
-static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
-{
-	unsigned int i, possibilities[256];
-
-	for (i = 0; i + 4 < len; i++) {
-		/* mov 0xXXXXXXXX,%eax */
-		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
-			return (unsigned long)img[i+4] << 24;
-	}
-	errx(1, "could not determine page offset");
-}
-
-static u32 bzimage(int fd, unsigned long mem, unsigned long *pgdir_addr,
-		   const char *initrd, unsigned long *ird_size,
-		   unsigned long *page_offset)
-{
-	gzFile f;
-	int ret, len = 0;
-	void *img = (void *)0x100000;
-
-	map_memory(mem);
-
-	f = gzdopen(fd, "rb");
-	if (gzdirect(f))
-		errx(1, "did not find correct gzip header");
-	while ((ret = gzread(f, img + len, 65536)) > 0)
-		len += ret;
-	if (ret < 0)
-		err(1, "reading image from bzImage");
-
-	verbose("Unpacked size %i addr %p\n", len, img);
-	*page_offset = intuit_page_offset(img, len);
-	*pgdir_addr = finish(mem, page_offset, initrd, ird_size);
-
-	/* Entry is physical address: convert to virtual */
-	return (u32)img + *page_offset;
-}
-
-static u32 load_bzimage(int bzimage_fd, const Elf32_Ehdr *ehdr, 
-			unsigned long mem, unsigned long *pgdir_addr,
-			const char *initrd, unsigned long *ird_size,
-			unsigned long *page_offset)
-{
-	unsigned char c;
-	int state = 0;
-
-	/* Just brute force it. */
-	while (read(bzimage_fd, &c, 1) == 1) {
-		switch (state) {
-		case 0:
-			if (c == 0x1F)
-				state++;
-			break;
-		case 1:
-			if (c == 0x8B)
-				state++;
-			else
-				state = 0;
-			break;
-		case 2 ... 8:
-			state++;
-			break;
-		case 9:
-			lseek(bzimage_fd, -10, SEEK_CUR);
-			if (c != 0x03) /* Compressed under UNIX. */
-				state = -1;
-			else
-				return bzimage(bzimage_fd, mem, pgdir_addr,
-					       initrd, ird_size, page_offset);
-		}
-	}
-	errx(1, "Could not find kernel in bzImage");
-}
-
-static void *map_pages(unsigned long addr, unsigned int num)
-{
-	if (mmap((void *)addr, getpagesize() * num,
-		 PROT_READ|PROT_WRITE|PROT_EXEC,
-		 MAP_FIXED|MAP_PRIVATE, zero_fd, 0) != (void *)addr)
-		err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr);
-	return (void *)addr;
-}
-
-static struct lguest_device_desc *
-get_dev_entry(struct lguest_device_desc *descs, u16 type, u16 num_pages)
-{
-	static unsigned long top = RESERVE_TOP;
-	int i;
-	unsigned long pfn = 0;
-
-	if (num_pages) {
-		top -= num_pages*getpagesize();
-		map_pages(top, num_pages);
-		pfn = top / getpagesize();
-	}
-
-	for (i = 0; i < LGUEST_MAX_DEVICES; i++) {
-		if (!descs[i].type) {
-			descs[i].features = descs[i].status = 0;
-			descs[i].type = type;
-			descs[i].num_pages = num_pages;
-			descs[i].pfn = pfn;
-			return &descs[i];
-		}
-	}
-	errx(1, "too many devices");
-}
-
-static void set_fd(int fd, struct devices *devices)
-{
-	FD_SET(fd, &devices->infds);
-	if (fd > devices->max_infd)
-		devices->max_infd = fd;
-}
-
-static struct device *new_device(struct devices *devices,
-				 struct lguest_device_desc *descs,
-				 u16 type, u16 num_pages,
-				 int fd,
-				 int (*handle_input)(int, struct device *),
-				 unsigned long watch_off,
-				 u32 (*handle_output)(int,
-						      const struct iovec *,
-						      unsigned,
-						      struct device *))
-{
-	struct device *dev = malloc(sizeof(*dev));
-
-	dev->next = devices->dev;
-	devices->dev = dev;
-
-	dev->fd = fd;
-	if (handle_input)
-		set_fd(dev->fd, devices);
-	dev->desc = get_dev_entry(descs, type, num_pages);
-	dev->mem = (void *)(dev->desc->pfn * getpagesize());
-	dev->handle_input = handle_input;
-	dev->watch_address = (unsigned long)dev->mem + watch_off;
-	dev->handle_output = handle_output;
-	return dev;
-}
-
-static int tell_kernel(u32 pagelimit, u32 pgdir, u32 start, u32 page_offset)
-{
-	u32 args[] = { LHREQ_INITIALIZE,
-		       pagelimit, pgdir, start, page_offset };
-	int fd = open("/dev/lguest", O_RDWR);
-
-	if (fd < 0)
-		err(1, "Opening /dev/lguest");
-
-	verbose("Telling kernel limit %u, pgdir %i, e=%#08x page_off=0x%08x\n",
-		pagelimit, pgdir, start, page_offset);
-	if (write(fd, args, sizeof(args)) < 0)
-		err(1, "Writing to /dev/lguest");
-	return fd;
-}
-
-static void concat(char *dst, char *args[])
-{
-	unsigned int i, len = 0;
-
-	for (i = 0; args[i]; i++) {
-		strcpy(dst+len, args[i]);
-		strcat(dst+len, " ");
-		len += strlen(args[i]) + 1;
-	}
-	/* In case it's empty. */
-	dst[len] = '\0';
-}
-
-static void *_check_pointer(unsigned long addr, unsigned int size,
-			    unsigned int line)
-{
-	if (addr >= RESERVE_TOP || addr + size >= RESERVE_TOP)
-		errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr);
-	return (void *)addr;
-}
-#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
-
-/* Returns pointer to dma->used_len */
-static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num)
-{
-	unsigned int i;
-	struct lguest_dma *udma;
-
-	/* No buffers? */
-	if (dma == 0) {
-		printf("no buffers\n");
-		return NULL;
-	}
-
-	udma = check_pointer(dma, sizeof(*udma));
-	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
-		if (!udma->len[i])
-			break;
-
-		iov[i].iov_base = check_pointer(udma->addr[i], udma->len[i]);
-		iov[i].iov_len = udma->len[i];
-	}
-	*num = i;
-	return &udma->used_len;
-}
-
-static u32 *get_dma_buffer(int fd, void *addr,
-			   struct iovec iov[], unsigned *num, u32 *irq)
-{
-	u32 buf[] = { LHREQ_GETDMA, (u32)addr };
-	unsigned long udma;
-	u32 *res;
-
-	udma = write(fd, buf, sizeof(buf));
-	if (udma == (unsigned long)-1)
-		return NULL;
-
-	/* Kernel stashes irq in ->used_len. */
-	res = dma2iov(udma, iov, num);
-	if (res)
-		*irq = *res;
-	return res;
-}
-
-static void trigger_irq(int fd, u32 irq)
-{
-	u32 buf[] = { LHREQ_IRQ, irq };
-	if (write(fd, buf, sizeof(buf)) != 0)
-		err(1, "Triggering irq %i", irq);
-}
-
-static struct termios orig_term;
-static void restore_term(void)
-{
-	tcsetattr(STDIN_FILENO, TCSANOW, &orig_term);
-}
-
-struct console_abort
-{
-	int count;
-	struct timeval start;
-};
-
-/* We DMA input to buffer bound at start of console page. */
-static int handle_console_input(int fd, struct device *dev)
-{
-	u32 num, irq = 0, *lenp;
-	int len;
-	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
-	struct console_abort *abort = dev->priv;
-
-	lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq);
-	if (!lenp) {
-		warn("console: no dma buffer!");
-		iov[0] = discard_iov;
-		num = 1;
-	}
-
-	len = readv(dev->fd, iov, num);
-	if (len <= 0) {
-		warnx("Failed to get console input, ignoring console.");
-		len = 0;
-	}
-
-	if (lenp) {
-		*lenp = len;
-		trigger_irq(fd, irq);
-	}
-
-	/* Three ^C within one second?  Exit. */
-	if (len == 1 && ((char *)iov[0].iov_base)[0] == 3) {
-		if (!abort->count++)
-			gettimeofday(&abort->start, NULL);
-		else if (abort->count == 3) {
-			struct timeval now;
-			gettimeofday(&now, NULL);
-			if (now.tv_sec <= abort->start.tv_sec+1)
-				exit(2);
-			abort->count = 0;
-		}
-	} else
-		abort->count = 0;
-
-	if (!len) {
-		restore_term();
-		return 0;
-	}
-	return 1;
-}
-
-static unsigned long peer_offset(unsigned int peernum)
-{
-	return 4 * peernum;
-}
-
-static u32 handle_tun_output(int fd, const struct iovec *iov,
-			     unsigned num, struct device *dev)
-{
-	/* Now we've seen output, we should warn if we can't get buffers. */
-	*(bool *)dev->priv = true;
-	return writev(dev->fd, iov, num);
-}
-
-static u32 handle_block_output(int fd, const struct iovec *iov,
-			       unsigned num, struct device *dev)
-{
-	struct lguest_block_page *p = dev->mem;
-	u32 irq, reply_num, *lenp;
-	int len;
-	struct iovec reply[LGUEST_MAX_DMA_SECTIONS];
-	off64_t device_len, off = (off64_t)p->sector * 512;
-
-	device_len = *(off64_t *)dev->priv;
-
-	if (off >= device_len)
-		err(1, "Bad offset %llu vs %llu", off, device_len);
-	if (lseek64(dev->fd, off, SEEK_SET) != off)
-		err(1, "Bad seek to sector %i", p->sector);
-
-	verbose("Block: %s at offset %llu\n", p->type ? "WRITE" : "READ", off);
-
-	lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq);
-	if (!lenp)
-		err(1, "Block request didn't give us a dma buffer");
-
-	if (p->type) {
-		len = writev(dev->fd, iov, num);
-		if (off + len > device_len) {
-			ftruncate(dev->fd, device_len);
-			errx(1, "Write past end %llu+%u", off, len);
-		}
-		*lenp = 0;
-	} else {
-		len = readv(dev->fd, reply, reply_num);
-		*lenp = len;
-	}
-
-	p->result = 1 + (p->bytes != len);
-	trigger_irq(fd, irq);
-	return 0;
-}
-
-#define HIPQUAD(ip)				\
-	((u8)(ip >> 24)),			\
-	((u8)(ip >> 16)),			\
-	((u8)(ip >> 8)),			\
-	((u8)(ip))
-
-static void configure_device(int fd, const char *devname, u32 ipaddr,
-			     unsigned char hwaddr[6])
-{
-	struct ifreq ifr;
-	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
-
-	memset(&ifr, 0, sizeof(ifr));
-	strcpy(ifr.ifr_name, devname);
-	sin->sin_family = AF_INET;
-	sin->sin_addr.s_addr = htonl(ipaddr);
-	if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
-		err(1, "Setting %s interface address", devname);
-	ifr.ifr_flags = IFF_UP;
-	if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
-		err(1, "Bringing interface %s up", devname);
-
-	if (ioctl(fd, SIOCGIFHWADDR, &ifr) != 0)
-		err(1, "getting hw address for %s", devname);
-
-	memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6);
-}
-
-/* We send lguest_add signals while input is pending: avoids races. */
-static void wake_parent(int pipefd, struct devices *devices)
-{
-	int parent = getppid();
-	nice(19);
-
-	set_fd(pipefd, devices);
-
-	for (;;) {
-		fd_set rfds = devices->infds;
-
-		select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
-		if (FD_ISSET(pipefd, &rfds)) {
-			int ignorefd;
-			if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
-				exit(0);
-			FD_CLR(ignorefd, &devices->infds);
-		}
-		kill(parent, SIGUSR1);
-	}
-}
-
-/* We don't want signal to kill us, just jerk us out of kernel. */
-static void wakeup(int signo)
-{
-}
-
-static int handle_tun_input(int fd, struct device *dev)
-{
-	u32 irq = 0, num, *lenp;
-	int len;
-	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
-
-	lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num,
-			      &irq);
-	if (!lenp) {
-		if (*(bool *)dev->priv)
-			warn("network: no dma buffer!");
-		iov[0] = discard_iov;
-		num = 1;
-	}
-
-	len = readv(dev->fd, iov, num);
-	if (len <= 0)
-		err(1, "reading network");
-	if (lenp) {
-		*lenp = len;
-		trigger_irq(fd, irq);
-	}
-	verbose("tun input packet len %i [%02x %02x] (%s)\n", len,
-		((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1],
-		lenp ? "sent" : "discarded");
-	return 1;
-}
-
-/* We use fnctl locks to reserve network slots (autocleanup!) */
-static unsigned int find_slot(int netfd, const char *filename)
-{
-	struct flock fl;
-
-	fl.l_type = F_WRLCK;
-	fl.l_whence = SEEK_SET;
-	fl.l_len = 1;
-	for (fl.l_start = 0;
-	     fl.l_start < getpagesize()/sizeof(struct lguest_net);
-	     fl.l_start++) {
-		if (fcntl(netfd, F_SETLK, &fl) == 0)
-			return fl.l_start;
-	}
-	errx(1, "No free slots in network file %s", filename);
-}
-
-static void setup_net_file(const char *filename,
-			   struct lguest_device_desc *descs,
-			   struct devices *devices)
-{
-	int netfd;
-	struct device *dev;
-
-	netfd = open(filename, O_RDWR, 0);
-	if (netfd < 0) {
-		if (errno == ENOENT) {
-			netfd = open(filename, O_RDWR|O_CREAT, 0600);
-			if (netfd >= 0) {
-				char page[getpagesize()];
-				/* 0xFFFF == NO_GUEST */
-				memset(page, 0xFF, sizeof(page));
-				write(netfd, page, sizeof(page));
-			}
-		}
-		if (netfd < 0)
-			err(1, "cannot open net file '%s'", filename);
-	}
-
-	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
-			 -1, NULL, 0, NULL);
-
-	/* This is the slot for the guest to use. */
-	dev->desc->features = find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM;
-	/* We overwrite the /dev/zero mapping with the actual file. */
-	if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE,
-			 MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem)
-			err(1, "could not mmap '%s'", filename);
-	verbose("device %p@%p: shared net %s, peer %i\n", dev->desc, 
-		(void *)(dev->desc->pfn * getpagesize()), filename, 
-		dev->desc->features & ~LGUEST_NET_F_NOCSUM);
-}
-
-static u32 str2ip(const char *ipaddr)
-{
-	unsigned int byte[4];
-
-	sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
-	return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
-}
-
-/* adapted from libbridge */
-static void add_to_bridge(int fd, const char *if_name, const char *br_name)
-{
-	int r, ifidx;
-	struct ifreq ifr;
-
-	if (!*br_name)
-		errx(1, "must specify bridge name");
-
-	ifidx = if_nametoindex(if_name);
-	if (!ifidx)
-		errx(1, "interface %s does not exist!\n", if_name);
-
-	strncpy(ifr.ifr_name, br_name, IFNAMSIZ);
-	ifr.ifr_ifindex = ifidx;
-	r = ioctl(fd, SIOCBRADDIF, &ifr);
-	if (r != -1)
-		return;
-
-	switch (errno) {
-	case ENODEV:
-		errx(1, "bridge %s does not exist!\n", br_name);
-	case EBUSY:
-		errx(1, "device %s is already a member of a bridge; "
-			"can't enslave it to bridge %s.\n", if_name, br_name);
-	case ELOOP:
-		errx(1, "device %s is a bridge device itself; "
-			"can't enslave a bridge device to a bridge device.\n",
-			if_name);
-	default:
-		err(1, "can't add %s to bridge %s\n", if_name, br_name);
-	}
-}
-
-
-static void setup_tun_net(const char *arg,
-			  struct lguest_device_desc *descs,
-			  struct devices *devices)
-{
-	struct device *dev;
-	struct ifreq ifr;
-	int netfd, ipfd;
-	u32 ipaddr;
-	const char *br_name = NULL;
-
-	netfd = open("/dev/net/tun", O_RDWR);
-	if (netfd < 0)
-		err(1, "opening /dev/net/tun");
-
-	memset(&ifr, 0, sizeof(ifr));
-	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
-	strcpy(ifr.ifr_name, "tap%d");
-	if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
-		err(1, "configuring /dev/net/tun");
-
-	dev = new_device(devices, descs, LGUEST_DEVICE_T_NET, 1,
-			 netfd, handle_tun_input,
-			 peer_offset(0), handle_tun_output);
-	dev->priv = malloc(sizeof(bool));
-	*(bool *)dev->priv = false;
-
-	ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
-	if (ipfd < 0)
-		err(1, "opening IP socket");
-
-	if (!strncmp(BRIDGE_PFX, arg, strlen(BRIDGE_PFX))) {
-		ipaddr = INADDR_ANY;
-		br_name = arg + strlen(BRIDGE_PFX);
-		add_to_bridge(ipfd, ifr.ifr_name, br_name);
-	} else
-		ipaddr = str2ip(arg);
-
-	/* We are peer 0, rest is all NO_GUEST */
-	configure_device(ipfd, ifr.ifr_name, ipaddr, dev->mem);
-	close (ipfd);
-
-	/* You will be peer 1: we should create enough jitter to randomize */
-	dev->desc->features = NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS;
-	verbose("device %p@%p: tun net %u.%u.%u.%u\n", dev->desc,
-		(void *)(dev->desc->pfn * getpagesize()),
-		HIPQUAD(ipaddr));
-	if (br_name)
-		verbose("attched to bridge: %s\n", br_name);
-}
-
-static void setup_block_file(const char *filename,
-			     struct lguest_device_desc *descs,
-			     struct devices *devices)
-{
-	int fd;
-	struct device *dev;
-	off64_t *blocksize;
-	struct lguest_block_page *p;
-
-	fd = open(filename, O_RDWR|O_LARGEFILE|O_DIRECT, 0);
-	if (fd < 0)
-		err(1, "Opening %s", filename);
-
-	dev = new_device(devices, descs, LGUEST_DEVICE_T_BLOCK, 1,
-			 fd, NULL, 0, handle_block_output);
-	dev->desc->features = LGUEST_DEVICE_F_RANDOMNESS;
-	blocksize = dev->priv = malloc(sizeof(*blocksize));
-	*blocksize = lseek64(fd, 0, SEEK_END);
-	p = dev->mem;
-
-	p->num_sectors = *blocksize/512;
-	verbose("device %p@%p: block %i sectors\n", dev->desc, 
-		(void *)(dev->desc->pfn * getpagesize()), p->num_sectors);
-}
-
-static u32 handle_console_output(int fd, const struct iovec *iov,
-				 unsigned num, struct device*dev)
-{
-	return writev(STDOUT_FILENO, iov, num);
-}
-
-static void setup_console(struct lguest_device_desc *descs,
-			  struct devices *devices)
-{
-	struct device *dev;
-
-	if (tcgetattr(STDIN_FILENO, &orig_term) == 0) {
-		struct termios term = orig_term;
-		term.c_lflag &= ~(ISIG|ICANON|ECHO);
-		tcsetattr(STDIN_FILENO, TCSANOW, &term);
-		atexit(restore_term);
-	}
-
-	/* We don't currently require a page for the console. */
-	dev = new_device(devices, descs, LGUEST_DEVICE_T_CONSOLE, 0,
-			 STDIN_FILENO, handle_console_input,
-			 4, handle_console_output);
-	dev->priv = malloc(sizeof(struct console_abort));
-	((struct console_abort *)dev->priv)->count = 0;
-	verbose("device %p@%p: console\n", dev->desc, 
-		(void *)(dev->desc->pfn * getpagesize()));
-}
-
-static const char *get_arg(const char *arg, const char *prefix)
-{
-	if (strncmp(arg, prefix, strlen(prefix)) == 0)
-		return arg + strlen(prefix);
-	return NULL;
-}
-
-static u32 handle_device(int fd, unsigned long dma, unsigned long addr,
-			 struct devices *devices)
-{
-	struct device *i;
-	u32 *lenp;
-	struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
-	unsigned num = 0;
-
-	lenp = dma2iov(dma, iov, &num);
-	if (!lenp)
-		errx(1, "Bad SEND_DMA %li for address %#lx\n", dma, addr);
-
-	for (i = devices->dev; i; i = i->next) {
-		if (i->handle_output && addr == i->watch_address) {
-			*lenp = i->handle_output(fd, iov, num, i);
-			return 0;
-		}
-	}
-	warnx("Pending dma %p, addr %p", (void *)dma, (void *)addr);
-	return 0;
-}
-
-static void handle_input(int fd, int childfd, struct devices *devices)
-{
-	struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
-
-	for (;;) {
-		struct device *i;
-		fd_set fds = devices->infds;
-
-		if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0)
-			break;
-
-		for (i = devices->dev; i; i = i->next) {
-			if (i->handle_input && FD_ISSET(i->fd, &fds)) {
-				if (!i->handle_input(fd, i)) {
-					FD_CLR(i->fd, &devices->infds);
-					/* Tell child to ignore it too... */
-					write(childfd, &i->fd, sizeof(i->fd));
-				}
-			}
-		}
-	}
-}
-
-int main(int argc, char *argv[])
-{
-	unsigned long mem, pgdir, entry, initrd_size, page_offset;
-	int arg, kern_fd, fd, child, pipefd[2];
-	Elf32_Ehdr hdr;
-	struct sigaction act;
-	sigset_t sigset;
-	struct lguest_device_desc *devdescs;
-	struct devices devices;
-	struct lguest_boot_info *boot = (void *)0;
-	const char *initrd_name = NULL;
-	u32 (*load)(int, const Elf32_Ehdr *ehdr, unsigned long,
-		    unsigned long *, const char *, unsigned long *,
-		    unsigned long *);
-
-	if (argv[1] && strcmp(argv[1], "--verbose") == 0) {
-		verbose = true;
-		argv++;
-		argc--;
-	}
-
-	if (argc < 4)
-		errx(1, "Usage: lguest [--verbose] <mem> vmlinux "
-			"[--sharenet=<filename>|--tunnet=(<ipaddr>|bridge:<bridgename>)"
-			"|--block=<filename>|--initrd=<filename>]... [args...]");
-
-	zero_fd = open("/dev/zero", O_RDONLY, 0);
-	if (zero_fd < 0)
-		err(1, "Opening /dev/zero");
-
-	mem = memparse(argv[1]);
-	kern_fd = open(argv[2], O_RDONLY, 0);
-	if (kern_fd < 0)
-		err(1, "Opening %s", argv[2]);
-
-	if (read(kern_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
-		err(1, "Reading %s elf header", argv[2]);
-
-	if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0)
-		load = map_elf;
-	else
-		load = load_bzimage;
-
-	devices.max_infd = -1;
-	devices.dev = NULL;
-	FD_ZERO(&devices.infds);
-
-	devdescs = map_pages(mem, 1);
-	arg = 3;
-	while (argv[arg] && argv[arg][0] == '-') {
-		const char *argval;
-
-		if ((argval = get_arg(argv[arg], "--sharenet=")) != NULL)
-			setup_net_file(argval, devdescs, &devices);
-		else if ((argval = get_arg(argv[arg], "--tunnet=")) != NULL)
-			setup_tun_net(argval, devdescs, &devices);
-		else if ((argval = get_arg(argv[arg], "--block=")) != NULL)
-			setup_block_file(argval, devdescs, &devices);
-		else if ((argval = get_arg(argv[arg], "--initrd=")) != NULL)
-			initrd_name = argval;
-		else
-			errx(1, "unknown arg '%s'", argv[arg]);
-		arg++;
-	}
-
-	entry = load(kern_fd, &hdr, mem, &pgdir, initrd_name, &initrd_size,
-		     &page_offset);
-	setup_console(devdescs, &devices);
-
-	concat(boot->cmdline, argv+arg);
-	boot->max_pfn = mem/getpagesize();
-	boot->initrd_size = initrd_size;
-
-	act.sa_handler = wakeup;
-	sigemptyset(&act.sa_mask);
-	act.sa_flags = 0;
-	sigaction(SIGUSR1, &act, NULL);
-
-	pipe(pipefd);
-	child = fork();
-	if (child == -1)
-		err(1, "forking");
-
-	if (child == 0) {
-		close(pipefd[1]);
-		wake_parent(pipefd[0], &devices);
-	}
-	close(pipefd[0]);
-
-	sigemptyset(&sigset);
-	sigaddset(&sigset, SIGUSR1);
-	sigprocmask(SIG_BLOCK, &sigset, NULL);
-
-	fd = tell_kernel(RESERVE_TOP/getpagesize(), pgdir, entry, page_offset);
-
-	for (;;) {
-		unsigned long arr[2];
-		int readval;
-
-		sigprocmask(SIG_UNBLOCK, &sigset, NULL);
-		readval = read(fd, arr, sizeof(arr));
-		sigprocmask(SIG_BLOCK, &sigset, NULL);
-
-		switch (readval) {
-		case sizeof(arr):
-			handle_device(fd, arr[0], arr[1], &devices);
-			break;
-		case -1:
-			if (errno == EINTR)
-				break;
-		default:
-			if (errno == ENOENT) {
-				char reason[1024];
-				if (read(fd, reason, sizeof(reason)) > 0)
-					errx(1, "%s", reason);
-			}
-			err(1, "Running guest failed");
-		}
-		handle_input(fd, pipefd[1], &devices);
-	}
-}

--

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH LGUEST X86_64 08/13] lguest64 user header.
       [not found] <20070308162348.299676000@redhat.com>
                   ` (6 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 07/13] lguest64 loader Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 09/13] lguest64 devices Steven Rostedt
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-user.patch)
This patch adds the header used by the lguest64 loader.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/include/asm-x86_64/lguest_user.h
===================================================================
--- /dev/null
+++ work-pv/include/asm-x86_64/lguest_user.h
@@ -0,0 +1,88 @@
+#ifndef _ASM_LGUEST_USER
+#define _ASM_LGUEST_USER
+
+/* Everything the "lguest" userspace program needs to know. */
+/* They can register up to 32 arrays of lguest_dma. */
+#define LGUEST_MAX_DMA		32
+
+/* How many devices?  Assume each one wants up to two dma arrays per device. */
+#define LGUEST_MAX_DEVICES (LGUEST_MAX_DMA/2)
+
+/* At most we can dma 16 lguest_dma in one op. */
+#define LGUEST_MAX_DMA_SECTIONS	16
+
+struct lguest_dma
+{
+	/* 0 if free to be used, filled by hypervisor. */
+	u64 used_len;
+	u64 used_len;
+	u64 addr[LGUEST_MAX_DMA_SECTIONS];
+	u16 len[LGUEST_MAX_DMA_SECTIONS];
+};
+
+/* This is found at address 0. */
+struct lguest_boot_info
+{
+	u32 max_pfn;
+	u32 initrd_size;
+	char cmdline[256];
+};
+
+struct lguest_block_page
+{
+	/* 0 is a read, 1 is a write. */
+	int type;
+	u32 sector; 	/* Offset in device = sector * 512. */
+	u32 bytes;	/* Length expected to be read/written in bytes */
+	/* 0 = pending, 1 = done, 2 = done, error */
+	int result;
+	u32 num_sectors; /* Disk length = num_sectors * 512 */
+};
+
+/* There is a shared page of these. */
+struct lguest_net
+{
+	union {
+		unsigned char mac[6];
+		struct {
+			u8 promisc;
+			u8 pad;
+			u16 guestid;
+		};
+	};
+};
+
+/* lguest_device_desc->type */
+#define LGUEST_DEVICE_T_CONSOLE	1
+#define LGUEST_DEVICE_T_NET	2
+#define LGUEST_DEVICE_T_BLOCK	3
+
+/* lguest_device_desc->status.  256 and above are device specific. */
+#define LGUEST_DEVICE_S_ACKNOWLEDGE	1 /* We have seen device. */
+#define LGUEST_DEVICE_S_DRIVER		2 /* We have found a driver */
+#define LGUEST_DEVICE_S_DRIVER_OK	4 /* Driver says OK! */
+#define LGUEST_DEVICE_S_REMOVED		8 /* Device has gone away. */
+#define LGUEST_DEVICE_S_REMOVED_ACK	16 /* Driver has been told. */
+#define LGUEST_DEVICE_S_FAILED		128 /* Something actually failed */
+
+#define LGUEST_NET_F_NOCSUM		0x4000 /* Don't bother checksumming */
+#define LGUEST_DEVICE_F_RANDOMNESS	0x8000 /* IRQ is fairly random */
+
+/* We have a page of these descriptors in the lguest_device page. */
+struct lguest_device_desc {
+	u16 type;
+	u16 features;
+	u16 status;
+	u16 num_pages;
+	u64 pfn;
+};
+
+/* Write command first word is a request. */
+enum lguest_req
+{
+	LHREQ_INITIALIZE, /* + pfnlimit, pgdir, start, pageoffset */
+	LHREQ_GETDMA, /* + addr (returns &lguest_dma, irq in ->used_len) */
+	LHREQ_IRQ, /* + irq */
+};
+
+
+#endif /* _ASM_LGUEST_USER */

--

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH LGUEST X86_64 09/13] lguest64 devices
       [not found] <20070308162348.299676000@redhat.com>
                   ` (7 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 08/13] lguest64 user header Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 10/13] dont compile in the lguest_net Steven Rostedt
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-device.patch)
We have started working on the devices for lguest64.
This is still very much a work in progress and needs considerably more work.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>



Index: work-pv/include/asm-x86_64/lguest_device.h
===================================================================
--- /dev/null
+++ work-pv/include/asm-x86_64/lguest_device.h
@@ -0,0 +1,31 @@
+#ifndef _ASM_LGUEST_DEVICE_H
+#define _ASM_LGUEST_DEVICE_H
+/* Everything you need to know about lguest devices. */
+#include <linux/device.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+
+struct lguest_device {
+	/* Unique busid, and index into lguest_page->devices[] */
+	/* By convention, each device can use irq index+1 if it wants to. */
+	unsigned int index;
+
+	struct device dev;
+
+	/* Driver can hang data off here. */
+	void *private;
+};
+
+struct lguest_driver {
+	const char *name;
+	struct module *owner;
+	u16 device_type;
+	int (*probe)(struct lguest_device *dev);
+	void (*remove)(struct lguest_device *dev);
+
+	struct device_driver drv;
+};
+
+extern int register_lguest_driver(struct lguest_driver *drv);
+extern void unregister_lguest_driver(struct lguest_driver *drv);
+#endif /* _ASM_LGUEST_DEVICE_H */
Index: work-pv/arch/x86_64/lguest/lguest_bus.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/lguest_bus.c
@@ -0,0 +1,180 @@
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/lguest_device.h>
+#include <asm/lguest.h>
+#include <asm/io.h>
+
+static ssize_t type_show(struct device *_dev,
+                         struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hu", lguest_devices[dev->index].type);
+}
+static ssize_t features_show(struct device *_dev,
+                             struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].features);
+}
+static ssize_t pfn_show(struct device *_dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%llu", lguest_devices[dev->index].pfn);
+}
+static ssize_t status_show(struct device *_dev,
+                           struct device_attribute *attr, char *buf)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	return sprintf(buf, "%hx", lguest_devices[dev->index].status);
+}
+static ssize_t status_store(struct device *_dev, struct device_attribute *attr,
+                            const char *buf, size_t count)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	if (sscanf(buf, "%hi", &lguest_devices[dev->index].status) != 1)
+		return -EINVAL;
+	return count;
+}
+static struct device_attribute lguest_dev_attrs[] = {
+	__ATTR_RO(type),
+	__ATTR_RO(features),
+	__ATTR_RO(pfn),
+	__ATTR(status, 0644, status_show, status_store),
+	__ATTR_NULL
+};
+
+static int lguest_dev_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(_drv,struct lguest_driver,drv);
+
+	return (drv->device_type == lguest_devices[dev->index].type);
+}
+
+struct lguest_bus {
+	struct bus_type bus;
+	struct device dev;
+};
+
+static struct lguest_bus lguest_bus = {
+	.bus = {
+		.name  = "lguest",
+		.match = lguest_dev_match,
+		.dev_attrs = lguest_dev_attrs,
+	},
+	.dev = {
+		.parent = NULL,
+		.bus_id = "lguest",
+	}
+};
+
+static int lguest_dev_probe(struct device *_dev)
+{
+	int ret;
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER;
+	ret = drv->probe(dev);
+	if (ret == 0)
+		lguest_devices[dev->index].status |= LGUEST_DEVICE_S_DRIVER_OK;
+	return ret;
+}
+
+static int lguest_dev_remove(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+	struct lguest_driver *drv = container_of(dev->dev.driver,
+						struct lguest_driver, drv);
+
+	if (dev->dev.driver && drv->remove)
+		drv->remove(dev);
+	put_device(&dev->dev);
+	return 0;
+}
+
+int register_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return 0;
+
+	drv->drv.bus = &lguest_bus.bus;
+	drv->drv.name = drv->name;
+	drv->drv.owner = drv->owner;
+	drv->drv.probe = lguest_dev_probe;
+	drv->drv.remove = lguest_dev_remove;
+
+	return driver_register(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(register_lguest_driver);
+
+void unregister_lguest_driver(struct lguest_driver *drv)
+{
+	if (!lguest_devices)
+		return;
+
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(unregister_lguest_driver);
+
+static void release_lguest_device(struct device *_dev)
+{
+	struct lguest_device *dev = container_of(_dev,struct lguest_device,dev);
+
+	lguest_devices[dev->index].status |= LGUEST_DEVICE_S_REMOVED_ACK;
+	kfree(dev);
+}
+
+static void add_lguest_device(unsigned int index)
+{
+	struct lguest_device *new;
+
+	lguest_devices[index].status |= LGUEST_DEVICE_S_ACKNOWLEDGE;
+	new = kmalloc(sizeof(struct lguest_device), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_EMERG "Cannot allocate lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		return;
+	}
+
+	new->index = index;
+	new->private = NULL;
+	memset(&new->dev, 0, sizeof(new->dev));
+	new->dev.parent = &lguest_bus.dev;
+	new->dev.bus = &lguest_bus.bus;
+	new->dev.release = release_lguest_device;
+	sprintf(new->dev.bus_id, "%u", index);
+	if (device_register(&new->dev) != 0) {
+		printk(KERN_EMERG "Cannot register lguest device %u\n", index);
+		lguest_devices[index].status |= LGUEST_DEVICE_S_FAILED;
+		kfree(new);
+	}
+}
+
+static void scan_devices(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DEVICES; i++)
+		if (lguest_devices[i].type)
+			add_lguest_device(i);
+}
+
+static int __init lguest_bus_init(void)
+{
+	if (strcmp(paravirt_ops.name, "lguest") != 0)
+		return 0;
+
+	/* Devices are in page above top of "normal" mem. */
+	lguest_devices = ioremap(max_pfn << PAGE_SHIFT, PAGE_SIZE);
+
+	if (bus_register(&lguest_bus.bus) != 0
+	    || device_register(&lguest_bus.dev) != 0)
+		panic("lguest bus registration failed");
+
+	scan_devices();
+	return 0;
+}
+postcore_initcall(lguest_bus_init);
Index: work-pv/arch/x86_64/lguest/io.c
===================================================================
--- /dev/null
+++ work-pv/arch/x86_64/lguest/io.c
@@ -0,0 +1,425 @@
+/* Simple I/O model for guests, based on shared memory.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+#include <linux/types.h>
+#include <linux/futex.h>
+#include <linux/jhash.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uaccess.h>
+#include <asm/lguest.h>
+#include <asm/lguest_user.h>
+#include "lguest.h"
+
+static struct list_head dma_hash[64];
+
+/* FIXME: allow multi-page lengths. */
+static int check_dma_list(struct lguest_guest_info *linfo,
+				const struct lguest_dma *dma)
+{
+	unsigned int i;
+
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (!dma->len[i])
+			return 1;
+		if (!lguest_address_ok(linfo, dma->addr[i]))
+			goto kill;
+		if (dma->len[i] > PAGE_SIZE)
+			goto kill;
+		/* We could do over a page, but is it worth it? */
+		if ((dma->addr[i] % PAGE_SIZE) + dma->len[i] > PAGE_SIZE)
+			goto kill;
+	}
+	return 1;
+
+kill:
+	kill_guest(linfo, "bad DMA entry: %u@%#llx", dma->len[i], dma->addr[i]);
+	return 0;
+}
+
+static unsigned int hash(const union futex_key *key)
+{
+	return jhash2((u32*)&key->both.word,
+		      (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+		      key->both.offset)
+		% ARRAY_SIZE(dma_hash);
+}
+
+/* Must hold read lock on dmainfo owner's current->mm->mmap_sem */
+static void unlink_dma(struct lguest_dma_info *dmainfo)
+{
+	BUG_ON(!mutex_is_locked(&lguest_lock));
+	dmainfo->interrupt = 0;
+	list_del(&dmainfo->list);
+	drop_futex_key_refs(&dmainfo->key);
+}
+
+static inline int key_eq(const union futex_key *a, const union futex_key *b)
+{
+	return (a->both.word == b->both.word
+		&& a->both.ptr == b->both.ptr
+		&& a->both.offset == b->both.offset);
+}
+
+static u32 unbind_dma(struct lguest_guest_info *linfo,
+		      const union futex_key *key,
+		      unsigned long dmas)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (key_eq(key, &linfo->dma[i].key) && dmas == linfo->dma[i].dmas) {
+			unlink_dma(&linfo->dma[i]);
+			ret = 1;
+			break;
+		}
+	}
+	return ret;
+}
+
+u32 bind_dma(struct lguest_guest_info *linfo, unsigned long addr,
+				unsigned long dmas, u16 numdmas, u8 interrupt)
+{
+	unsigned int i;
+	u32 ret = 0;
+	union futex_key key;
+
+	printk("inside the handler, with args: %lx, %lx, %x, %x\n", addr, dmas, numdmas, interrupt);
+	if (interrupt >= LGUEST_IRQS)
+		return 0;
+
+	mutex_lock(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	printk("Trying to get futex key...  ");
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(linfo, "bad dma address %#lx", addr);
+		goto unlock;
+	}
+	printk("Got it.\n");
+	get_futex_key_refs(&key);
+
+	if (interrupt == 0)
+		ret = unbind_dma(linfo, &key, dmas);
+	else {
+		for (i = 0; i < LGUEST_MAX_DMA; i++) {
+			if (linfo->dma[i].interrupt == 0) {
+				linfo->dma[i].dmas = dmas;
+				linfo->dma[i].num_dmas = numdmas;
+				linfo->dma[i].next_dma = 0;
+				linfo->dma[i].key = key;
+				linfo->dma[i].guest_id = linfo->guest_id;
+				linfo->dma[i].interrupt = interrupt;
+				list_add(&linfo->dma[i].list,
+					 &dma_hash[hash(&key)]);
+				ret = 1;
+				printk("Will return, holding a reference\n");
+				goto unlock;
+			}
+		}
+	}
+	printk("Will return, _without_ a reference\n");
+	drop_futex_key_refs(&key);
+unlock:
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&lguest_lock);
+	return ret;
+}
+/* lhread from another guest */
+static int lhread_other(struct lguest_guest_info *linfo,
+			void *buf, u32 addr, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(linfo, addr+bytes)
+	    || access_process_vm(linfo->tsk, addr, buf, bytes, 0) != bytes) {
+		memset(buf, 0, bytes);
+		kill_guest(linfo, "bad address in registered DMA struct");
+		return 0;
+	}
+	return 1;
+}
+
+/* lhwrite to another guest */
+static int lhwrite_other(struct lguest_guest_info *linfo, u32 addr,
+			 const void *buf, unsigned bytes)
+{
+	if (addr + bytes < addr
+	    || !lguest_address_ok(linfo, addr+bytes)
+	    || (access_process_vm(linfo->tsk, addr, (void *)buf, bytes, 1)
+		!= bytes)) {
+		kill_guest(linfo, "bad address writing to registered DMA");
+		return 0;
+	}
+	return 1;
+}
+
+static u32 copy_data(const struct lguest_dma *src,
+		     const struct lguest_dma *dst,
+		     struct page *pages[])
+{
+	unsigned int totlen, si, di, srcoff, dstoff;
+	void *maddr = NULL;
+
+	totlen = 0;
+	si = di = 0;
+	srcoff = dstoff = 0;
+	while (si < LGUEST_MAX_DMA_SECTIONS && src->len[si]
+	       && di < LGUEST_MAX_DMA_SECTIONS && dst->len[di]) {
+		u32 len = min(src->len[si] - srcoff, dst->len[di] - dstoff);
+
+		if (!maddr)
+			maddr = kmap(pages[di]);
+
+		/* FIXME: This is not completely portable, since
+		   archs do different things for copy_to_user_page. */
+		if (copy_from_user(maddr + (dst->addr[di] + dstoff)%PAGE_SIZE,
+				   (void __user *)src->addr[si], len) != 0) {
+			totlen = 0;
+			break;
+		}
+
+		totlen += len;
+		srcoff += len;
+		dstoff += len;
+		if (srcoff == src->len[si]) {
+			si++;
+			srcoff = 0;
+		}
+		if (dstoff == dst->len[di]) {
+			kunmap(pages[di]);
+			maddr = NULL;
+			di++;
+			dstoff = 0;
+		}
+	}
+
+	if (maddr)
+		kunmap(pages[di]);
+
+	return totlen;
+}
+
+/* Src is us, ie. current. */
+static u32 do_dma(struct lguest_guest_info *srclg, const struct lguest_dma *src,
+		  struct lguest_guest_info *dstlg, const struct lguest_dma *dst)
+{
+	int i;
+	u32 ret;
+	struct page *pages[LGUEST_MAX_DMA_SECTIONS];
+
+	if (!check_dma_list(dstlg, dst) || !check_dma_list(srclg, src))
+		return 0;
+
+	/* First get the destination pages */
+	for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) {
+		if (dst->len[i] == 0)
+			break;
+		if (get_user_pages(dstlg->tsk, dstlg->mm,
+				   dst->addr[i], 1, 1, 1, pages+i, NULL)
+		    != 1) {
+			ret = 0;
+			goto drop_pages;
+		}
+	}
+
+	/* Now copy until we run out of src or dst. */
+	ret = copy_data(src, dst, pages);
+
+drop_pages:
+	while (--i >= 0)
+		put_page(pages[i]);
+	return ret;
+}
+
+/* We cache one process to wakeup: helps for batching & wakes outside locks. */
+void set_wakeup_process(struct lguest_guest_info *linfo,
+						struct task_struct *p)
+{
+	if (p == linfo->wake)
+		return;
+
+	if (linfo->wake) {
+		wake_up_process(linfo->wake);
+		put_task_struct(linfo->wake);
+	}
+	linfo->wake = p;
+	if (linfo->wake)
+		get_task_struct(linfo->wake);
+}
+
+static int dma_transfer(struct lguest_guest_info *srclg,
+			unsigned long udma,
+			struct lguest_dma_info *dst)
+{
+#if 0
+	struct lguest_dma dst_dma, src_dma;
+	struct lguest_guest_info *dstlg;
+	u32 i, dma = 0;
+
+	dstlg = &lguests[dst->guest_id];
+	/* Get our dma list. */
+	lhread(srclg, &src_dma, udma, sizeof(src_dma));
+
+	/* We can't deadlock against them dmaing to us, because this
+	 * is all under the lguest_lock. */
+	down_read(&dstlg->mm->mmap_sem);
+
+	for (i = 0; i < dst->num_dmas; i++) {
+		dma = (dst->next_dma + i) % dst->num_dmas;
+		if (!lhread_other(dstlg, &dst_dma,
+				  dst->dmas + dma * sizeof(struct lguest_dma),
+				  sizeof(dst_dma))) {
+			goto fail;
+		}
+		if (!dst_dma.used_len)
+			break;
+	}
+	if (i != dst->num_dmas) {
+		unsigned long used_lenp;
+		unsigned int ret;
+
+		ret = do_dma(srclg, &src_dma, dstlg, &dst_dma);
+		/* Put used length in src. */
+		lhwrite_u32(srclg,
+			    udma+offsetof(struct lguest_dma, used_len), ret);
+		if (ret == 0 && src_dma.len[0] != 0)
+			goto fail;
+
+		/* Make sure destination sees contents before length. */
+		mb();
+		used_lenp = dst->dmas
+			+ dma * sizeof(struct lguest_dma)
+			+ offsetof(struct lguest_dma, used_len);
+		lhwrite_other(dstlg, used_lenp, &ret, sizeof(ret));
+		dst->next_dma++;
+	}
+	up_read(&dstlg->mm->mmap_sem);
+
+	/* Do this last so dst doesn't simply sleep on lock. */
+	set_bit(dst->interrupt, dstlg->irqs_pending);
+	set_wakeup_process(srclg, dstlg->tsk);
+	return i == dst->num_dmas;
+
+fail:
+	up_read(&dstlg->mm->mmap_sem);
+#endif
+	return 0;
+}
+
+int send_dma(struct lguest_guest_info *linfo, unsigned long addr,
+							unsigned long udma)
+{
+	union futex_key key;
+	int pending = 0, empty = 0;
+
+	printk("inside send_dma, with args: %lx, %lx\n", addr, udma);
+again:
+	mutex_lock(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(linfo, "bad sending DMA address");
+		goto unlock;
+	}
+	/* Shared mapping?  Look for other guests... */
+	if (key.shared.offset & 1) {
+		struct lguest_dma_info *i, *n;
+		list_for_each_entry_safe(i, n, &dma_hash[hash(&key)], list) {
+			if (i->guest_id == linfo->guest_id)
+				continue;
+			if (!key_eq(&key, &i->key))
+				continue;
+
+			empty += dma_transfer(linfo, udma, i);
+			break;
+		}
+		if (empty == 1) {
+			/* Give any recipients one chance to restock. */
+			up_read(&current->mm->mmap_sem);
+			mutex_unlock(&lguest_lock);
+			yield();
+			empty++;
+			goto again;
+		}
+		pending = 0;
+	} else {
+		/* Private mapping: tell our userspace. */
+		linfo->dma_is_pending = 1;
+		linfo->pending_dma = udma;
+		linfo->pending_addr = addr;
+		pending = 1;
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&lguest_lock);
+	printk("Returning send_dma with pending: %x\n", pending);
+	return pending;
+}
+void release_all_dma(struct lguest_guest_info *linfo)
+{
+	unsigned int i;
+
+	BUG_ON(!mutex_is_locked(&lguest_lock));
+
+	down_read(&linfo->mm->mmap_sem);
+	for (i = 0; i < LGUEST_MAX_DMA; i++) {
+		if (linfo->dma[i].interrupt)
+			unlink_dma(&linfo->dma[i]);
+	}
+	up_read(&linfo->mm->mmap_sem);
+}
+
+/* Userspace wants a dma buffer from this guest. */
+unsigned long get_dma_buffer(struct lguest_guest_info *linfo,
+			     unsigned long addr, unsigned long *interrupt)
+{
+	unsigned long ret = 0;
+	union futex_key key;
+	struct lguest_dma_info *i;
+
+	mutex_lock(&lguest_lock);
+	down_read(&current->mm->mmap_sem);
+	if (get_futex_key((u32 __user *)addr, &key) != 0) {
+		kill_guest(linfo, "bad registered DMA buffer");
+		goto unlock;
+	}
+	list_for_each_entry(i, &dma_hash[hash(&key)], list) {
+		if (key_eq(&key, &i->key) && i->guest_id == linfo->guest_id) {
+			unsigned int j;
+			for (j = 0; j < i->num_dmas; j++) {
+				struct lguest_dma dma;
+
+				ret = i->dmas + j * sizeof(struct lguest_dma);
+				lhread(linfo, &dma, ret, sizeof(dma));
+				if (dma.used_len == 0)
+					break;
+			}
+			*interrupt = i->interrupt;
+			break;
+		}
+	}
+unlock:
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&lguest_lock);
+	return ret;
+}
+
+void lguest_io_init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(dma_hash); i++)
+		INIT_LIST_HEAD(&dma_hash[i]);
+}

--

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC/PATCH LGUEST X86_64 10/13] dont compile in the lguest_net
       [not found] <20070308162348.299676000@redhat.com>
                   ` (8 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 09/13] lguest64 devices Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 11/13] x86_64 HVC attempt Steven Rostedt
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-not-net.patch)
Right now we don't have lguest_net compiling for x86_64, so turn it
off there for now.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/drivers/net/Makefile
===================================================================
--- work-pv.orig/drivers/net/Makefile
+++ work-pv/drivers/net/Makefile
@@ -217,4 +217,6 @@ obj-$(CONFIG_NETCONSOLE) += netconsole.o
 obj-$(CONFIG_FS_ENET) += fs_enet/
 
 obj-$(CONFIG_NETXEN_NIC) += netxen/
+ifneq ($(CONFIG_X86_64),y)
 obj-$(CONFIG_LGUEST_GUEST) += lguest_net.o
+endif

--


* [RFC/PATCH LGUEST X86_64 11/13] x86_64 HVC attempt.
       [not found] <20070308162348.299676000@redhat.com>
                   ` (9 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 10/13] dont compile in the lguest_net Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 12/13] dump stack on crash Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 13/13] Hack to get output Steven Rostedt
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-hvc.patch)
This is a first attempt at getting HVC working for x86_64.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/drivers/char/Kconfig
===================================================================
--- work-pv.orig/drivers/char/Kconfig
+++ work-pv/drivers/char/Kconfig
@@ -595,6 +595,12 @@ config HVC_CONSOLE
 	  pSeries machines when partitioned support a hypervisor virtual
 	  console. This driver allows each pSeries partition to have a console
 	  which is accessed via the HMC.
+config HVC_LGUEST
+	bool "lguest hypervisor console"
+	depends on LGUEST_GUEST
+	select HVC_DRIVER
+	help
+	  Console driver for lguest guests, using the generic HVC layer (still a work in progress).
 
 config HVC_ISERIES
 	bool "iSeries Hypervisor Virtual Console support"
Index: work-pv/drivers/char/Makefile
===================================================================
--- work-pv.orig/drivers/char/Makefile
+++ work-pv/drivers/char/Makefile
@@ -43,7 +43,7 @@ obj-$(CONFIG_AMIGA_BUILTIN_SERIAL) += am
 obj-$(CONFIG_SX)		+= sx.o generic_serial.o
 obj-$(CONFIG_RIO)		+= rio/ generic_serial.o
 obj-$(CONFIG_HVC_CONSOLE)	+= hvc_vio.o hvsi.o
-obj-$(CONFIG_LGUEST_GUEST)	+= hvc_lguest.o
+obj-$(CONFIG_HVC_LGUEST)	+= hvc_lguest.o
 obj-$(CONFIG_HVC_ISERIES)	+= hvc_iseries.o
 obj-$(CONFIG_HVC_RTAS)		+= hvc_rtas.o
 obj-$(CONFIG_HVC_DRIVER)	+= hvc_console.o
Index: work-pv/drivers/char/hvc_lguest.c
===================================================================
--- work-pv.orig/drivers/char/hvc_lguest.c
+++ work-pv/drivers/char/hvc_lguest.c
@@ -25,7 +25,6 @@ static int cons_irq;
 static int cons_offset;
 static char inbuf[256];
 static struct lguest_dma cons_input = { .used_len = 0,
-					.addr[0] = __pa(inbuf),
 					.len[0] = sizeof(inbuf),
 					.len[1] = 0 };
 
@@ -66,6 +65,12 @@ struct hv_ops lguest_cons = {
 
 static int __init cons_init(void)
 {
+	/*
+	 * Can't initialize this in the const declarations,
+	 * since __pa(inbuf) does not evaluate into a constant.
+	 */
+	cons_input.addr[0] = __pa(inbuf);
+
 	if (strcmp(paravirt_ops.name, "lguest") != 0)
 		return 0;
 

--


* [RFC/PATCH LGUEST X86_64 12/13] dump stack on crash
       [not found] <20070308162348.299676000@redhat.com>
                   ` (10 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 11/13] x86_64 HVC attempt Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 13/13] Hack to get output Steven Rostedt
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-dump-panic.patch)
It's nice to see a backtrace when we panic.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/kernel/panic.c
===================================================================
--- work-pv.orig/kernel/panic.c
+++ work-pv/kernel/panic.c
@@ -78,6 +78,7 @@ NORET_TYPE void panic(const char * fmt, 
 	vsnprintf(buf, sizeof(buf), fmt, args);
 	va_end(args);
 	printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
+	dump_stack();
 	bust_spinlocks(0);
 
 	/*

--


* [RFC/PATCH LGUEST X86_64 13/13] Hack to get output
       [not found] <20070308162348.299676000@redhat.com>
                   ` (11 preceding siblings ...)
  2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 12/13] dump stack on crash Steven Rostedt
@ 2007-03-08 17:39 ` Steven Rostedt
  12 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2007-03-08 17:39 UTC (permalink / raw)
  To: virtualization; +Cc: Chris Wright, Ingo Molnar

plain text document attachment (lguest64-debug.patch)
This is just a hack patch to get printk output from the guest.
It makes printk call lguest_vprint(), a hypercall that asks the
host to do the printk on the guest's behalf.

Chris Wright recommended that I put this into early_printk,
but until I can get that to work, I'm posting this.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Glauber de Oliveira Costa <glommer@gmail.com>
Cc: Chris Wright <chrisw@sous-sol.org>


Index: work-pv/kernel/printk.c
===================================================================
--- work-pv.orig/kernel/printk.c
+++ work-pv/kernel/printk.c
@@ -499,12 +499,17 @@ static int have_callable_console(void)
  * printf(3)
  */
 
+extern void lguest_vprint(const char *fmt, va_list ap);
 asmlinkage int printk(const char *fmt, ...)
 {
 	va_list args;
+	va_list lgargs;
 	int r;
 
 	va_start(args, fmt);
+	va_copy(lgargs, args);
+	lguest_vprint(fmt, lgargs);
+	va_end(lgargs);
 	r = vprintk(fmt, args);
 	va_end(args);
 

--


* Re: [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV.
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV Steven Rostedt
@ 2007-03-09  3:52   ` Rusty Russell
  0 siblings, 0 replies; 16+ messages in thread
From: Rusty Russell @ 2007-03-09  3:52 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Chris Wright, virtualization, Ingo Molnar

On Thu, 2007-03-08 at 12:38 -0500, Steven Rostedt wrote:
> One solution is just to do a single area for boot up, and then
> use the vmalloc to map. But this gets quite complex, since we need to
> force the guest to map a given area, after the fact, hoping that
> it didn't map it someplace else before we get to the code to map it.
> This can be done, but doing it this way is (for now) much easier.

Well, this way was more code, but you're right about the theoretical
failure mode of the vmalloc method.

>     Host            Guest1          Guest2
>  +-----------+   +-----------+  +-----------+
>  |           |   |           |  |           |
>  +-----------+   +-----------+  +-----------+
>  | HV FIXMAP |   | HV FIXMAP |  | HV FIXMAP |
>  |   TEXT    |   |   TEXT    |  |   TEXT    |
>  +-----------+   +-----------+  +-----------+
>  | GUEST 1   |   | GUEST 1   |  | UNMAPPED  |
>  |SHARED DATA|   |SHARED DATA|  |           |
>  +-----------+   +-----------+  +-----------+
>  | GUEST 2   |   | UNMAPPED  |  | GUEST 2   |
>  |SHARED DATA|   |           |  |SHARED DATA|
>  +-----------+   |           |  +-----------+
>  |           |   |           |  |           |

I think it's better to do this per-cpu, as in the recently posted 32-bit
patches.  You have to copy in when changing guests, but you can support
an infinite number of guests with (HV TEXTSIZE + NR_CPUS*2) pages.

Damn, I forgot to cc you on that patch.  Sorry, I suck 8(

They went to lkml as:
[PATCH 7/9] lguest: use read-only pages rather than segments to protect high-mapped switcher
[PATCH 8/9] lguest: Optimize away copy in and out of per-cpu guest pages

Cheers!
Rusty.


* Re: [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig Steven Rostedt
@ 2007-03-09  3:55   ` Rusty Russell
  0 siblings, 0 replies; 16+ messages in thread
From: Rusty Russell @ 2007-03-09  3:55 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Chris Wright, virtualization, Ingo Molnar

On Thu, 2007-03-08 at 12:38 -0500, Steven Rostedt wrote:
> +config LGUEST
> +	tristate "Lguest support"
> +	depends on PARAVIRT
> +	help
> +	  Enable this is you think 32 bits are not enough fur a puppie.

8)

I think this deserves "&& EXPERIMENTAL" here 8)

Rusty.


* Re: [RFC/PATCH LGUEST X86_64 03/13] lguest64 core
  2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 03/13] lguest64 core Steven Rostedt
@ 2007-03-09  4:10   ` Rusty Russell
  0 siblings, 0 replies; 16+ messages in thread
From: Rusty Russell @ 2007-03-09  4:10 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Chris Wright, virtualization, Ingo Molnar

On Thu, 2007-03-08 at 12:38 -0500, Steven Rostedt wrote:
> +lg-objs := core.o hypervisor.o lguest_user.o hv_vm.o page_tables.o \
> +hypercalls.o io.o interrupts_and_traps.o lguest_debug.o

Right, I missed the trick here: hypervisor.S doesn't require any
relocations, so the fact that it's linked at the wrong address doesn't
matter at all.  Excuse me while I prepare a patch 8)

> +extern long end_hyper_text;
> +extern long start_hyper_text;

The standard way of doing this is "extern char end_hyper_text[];".
It doesn't matter on x86/x86-64, but on some platforms gcc can make
assumptions about addresses based on the size of the variable (sbss
etc.), so it's nice to use that declaration everywhere for asm constants.

Cheers,
Rusty.


end of thread, other threads:[~2007-03-09  4:10 UTC | newest]

Thread overview: 16+ messages
     [not found] <20070308162348.299676000@redhat.com>
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 01/13] HV VM Fix map area for HV Steven Rostedt
2007-03-09  3:52   ` Rusty Russell
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 02/13] hvvm export page utils Steven Rostedt
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 03/13] lguest64 core Steven Rostedt
2007-03-09  4:10   ` Rusty Russell
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 04/13] Useful debugging Steven Rostedt
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 05/13] asm-offsets update Steven Rostedt
2007-03-08 17:38 ` [RFC/PATCH LGUEST X86_64 06/13] lguest64 Kconfig Steven Rostedt
2007-03-09  3:55   ` Rusty Russell
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 07/13] lguest64 loader Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 08/13] lguest64 user header Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 09/13] lguest64 devices Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 10/13] dont compile in the lguest_net Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 11/13] x86_64 HVC attempt Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 12/13] dump stack on crash Steven Rostedt
2007-03-08 17:39 ` [RFC/PATCH LGUEST X86_64 13/13] Hack to get output Steven Rostedt
