* [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
@ 2008-03-20 5:59 Rusty Russell
[not found] ` <200803201659.14344.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
0 siblings, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 5:59 UTC (permalink / raw)
To: virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel, lguest
Hi all,
Just finished my prototype of inter-guest virtio, using networking as an
example. Each guest mmaps the other's address space and uses a FIFO for
notifications.
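
As an illustration of the FIFO notification idea above, here is a minimal
sketch; it is not the code from these patches. It assumes an anonymous pipe
can stand in for the named FIFO, and notify_peer() and poll_notification()
are hypothetical helper names chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wake the peer by writing a single byte to the
 * write end of the notification pipe.  Returns 0 on success, -1 on error. */
static int notify_peer(int wfd)
{
	char c = 1;

	return write(wfd, &c, 1) == 1 ? 0 : -1;
}

/* Hypothetical helper: check the read end without blocking.  Returns 1 if
 * a notification was pending (and consumes it), 0 if none, -1 on error. */
static int poll_notification(int rfd)
{
	struct pollfd pfd = { .fd = rfd, .events = POLLIN };
	char c;
	int r = poll(&pfd, 1, 0);

	if (r < 0)
		return -1;
	if (r == 0 || !(pfd.revents & POLLIN))
		return 0;
	return read(rfd, &c, 1) == 1 ? 1 : -1;
}
```

A caller would create the descriptor pair with pipe(), give the write end to
the notifying side, and call poll_notification() on the read end from its
main loop.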
There are two issues with this approach. The first is that neither guest
can change its mappings (see patch 1). The second is that our feature
configuration is "host presents, guest chooses", which breaks down when we
don't know the capabilities of each guest: in particular, TSO capability for
networking.
There are three possible solutions:
1) Just offer the lowest common denominator to both sides (ie. no features).
This is what I do with lguest in these patches.
2) Offer something and handle the case where one Guest accepts and another
doesn't by emulating it. ie. de-TSO the packets manually.
3) "Hot unplug" the device from the guest which asks for the greater features,
then re-add it offering fewer features. Requires hotplug in the guest OS.
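
The trade-off between options 1 and 2 can be sketched with plain bitmasks.
This is only an illustration under the assumption that each side's accepted
features are known as a bitmask; the FEAT_* bits and both function names are
hypothetical and are not the virtio feature definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_CSUM (1u << 0)
#define FEAT_TSO  (1u << 1)

/* Option 1 taken to its general form: only offer what both sides accept.
 * With nothing known about the peer, this degenerates to "no features". */
static uint32_t common_features(uint32_t a, uint32_t b)
{
	return a & b;
}

/* Option 2: features the sender may use but the receiver did not accept
 * would have to be emulated in between (e.g. de-TSO the packets). */
static uint32_t must_emulate(uint32_t sender, uint32_t receiver)
{
	return sender & ~receiver;
}
```

For example, if one Guest accepts checksum offload and TSO while the other
accepts only checksum offload, must_emulate() reports TSO as the feature that
would need manual handling.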
I haven't tuned or even benchmarked these patches, but it pings!
Rusty.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <200803201659.14344.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
@ 2008-03-20 6:05 ` Rusty Russell
[not found] ` <200803201705.44422.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 8:16 ` [Lguest] " Tim Post
2008-03-20 6:54 ` [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest Avi Kivity
2008-03-20 14:11 ` [kvm-devel] " Anthony Liguori
2 siblings, 2 replies; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 6:05 UTC (permalink / raw)
To: virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel, lguest
From: Paul TBBle Hampson <Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org>
This creates a file in $HOME/.lguest/ to directly back the RAM and DMA memory
mappings created by map_zeroed_pages.
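
For readers unfamiliar with the technique, here is a minimal standalone
sketch of a file-backed shared mapping. It is not the patch code: it uses
mkstemp() rather than a PID-named file under $HOME/.lguest, and
map_backing_file() and map_peer() are hypothetical names for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a zero-filled file of len bytes from a mkstemp() template and map
 * it shared.  On success returns the mapping and stores the open file
 * descriptor in *fdp; on failure returns NULL and removes the file. */
static void *map_backing_file(char *path_template, size_t len, int *fdp)
{
	void *addr;
	int fd = mkstemp(path_template);

	if (fd < 0)
		return NULL;
	/* A freshly created file is empty; extending it yields zeroes. */
	if (ftruncate(fd, (off_t)len) != 0)
		goto fail;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		goto fail;
	*fdp = fd;
	return addr;
fail:
	close(fd);
	unlink(path_template);
	return NULL;
}

/* Map the same file a second time, as another process (a peer) would.
 * Returns NULL on failure. */
static void *map_peer(int fd, size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return addr == MAP_FAILED ? NULL : addr;
}
```

Because both mappings are MAP_SHARED views of the same file, a store through
one is visible through the other, which is the property the inter-guest
setup relies on. A MAP_PRIVATE mapping would instead give each side its own
copy on write.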
Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
Documentation/lguest/lguest.c | 59 ++++++++++++++++++++++++++++--------------
1 file changed, 40 insertions(+), 19 deletions(-)
diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -236,19 +236,51 @@ static int open_or_die(const char *name,
return fd;
}
-/* map_zeroed_pages() takes a number of pages. */
+/* unlink_memfile() removes the backing file for the Guest's memory, if we exit
+ * cleanly. */
+static char memfile_path[PATH_MAX];
+
+static void unlink_memfile(void)
+{
+ unlink(memfile_path);
+}
+
+/* map_zeroed_pages() takes a number of pages, and creates a mapping file where
+ * this Guest's memory lives. */
static void *map_zeroed_pages(unsigned int num)
{
- int fd = open_or_die("/dev/zero", O_RDONLY);
+ int fd;
void *addr;
- /* We use a private mapping (ie. if we write to the page, it will be
- * copied). */
+ /* We create a .lguest directory in the user's home, to put the memory
+ * files into. */
+ snprintf(memfile_path, PATH_MAX, "%s/.lguest", getenv("HOME") ?: "");
+ if (mkdir(memfile_path, S_IRWXU) != 0 && errno != EEXIST)
+ err(1, "Creating directory %s", memfile_path);
+
+ /* Name the memfiles by the process ID of this launcher. */
+ snprintf(memfile_path, PATH_MAX, "%s/.lguest/%u",
+ getenv("HOME") ?: "", getpid());
+ fd = open(memfile_path, O_RDWR | O_CREAT | O_TRUNC, S_IRWXU);
+ if (fd < 0)
+ err(1, "Creating memory backing file %s", memfile_path);
+
+ /* Make sure we remove it when we're finished. */
+ atexit(unlink_memfile);
+
+ /* Now, we opened it with O_TRUNC, so the file is 0 bytes long. Here
+ * we expand it to the length we need, and it will be filled with
+ * zeroes. */
+ if (ftruncate(fd, num * getpagesize()) != 0)
+ err(1, "Truncating file %s %u pages", memfile_path, num);
+
+ /* We use a shared mapping, so others can share with us. */
addr = mmap(NULL, getpagesize() * num,
- PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0);
+ PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
err(1, "Mmaping %u pages of /dev/zero", num);
+ verbose("Memory backing file is %s @ %p\n", memfile_path, addr);
return addr;
}
@@ -263,23 +295,12 @@ static void *get_pages(unsigned int num)
return addr;
}
-/* This routine is used to load the kernel or initrd. It tries mmap, but if
- * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries),
- * it falls back to reading the memory in. */
+/* This routine is used to load the kernel or initrd. We used to mmap, but now
+ * we simply read it in, so it will be present in the shared underlying
+ * file. */
static void map_at(int fd, void *addr, unsigned long offset, unsigned long len)
{
ssize_t r;
-
- /* We map writable even though for some segments are marked read-only.
- * The kernel really wants to be writable: it patches its own
- * instructions.
- *
- * MAP_PRIVATE means that the page won't be copied until a write is
- * done to it. This allows us to share untouched memory between
- * Guests. */
- if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC,
- MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED)
- return;
/* pread does a seek and a read in one shot: saves a few lines. */
r = pread(fd, addr, len, offset);
* [RFC PATCH 2/5] lguest: Encapsulate Guest memory ready for dealing with other Guests.
[not found] ` <200803201705.44422.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
@ 2008-03-20 6:22 ` Rusty Russell
2008-03-20 6:36 ` [RFC PATCH 3/5] lguest: separate out virtqueue info from device info Rusty Russell
2008-03-20 14:04 ` [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file Anthony Liguori
1 sibling, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 6:22 UTC (permalink / raw)
To: virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel, lguest
We currently keep the Guest memory pointer and size in globals. We move
these into a structure and explicitly hand that to to_guest_phys() and
from_guest_phys() so we can deal with other Guests' memory.
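
A reduced sketch of the idea, under the assumption that a second Guest's
memory is simply another mapping in the Launcher. The names mirror the patch,
but this version uses char * for the base to avoid relying on GNU void
pointer arithmetic, so it is an illustration rather than the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* One Guest's "physical" memory as seen from the Launcher. */
struct guest_memory {
	char *base;		/* Launcher-virtual start of Guest memory. */
	unsigned long limit;	/* Maximum Guest-physical address allowed. */
};

/* Guest-physical address to Launcher pointer, for the given Guest. */
static void *from_guest_phys(const struct guest_memory *mem,
			     unsigned long addr)
{
	return mem->base + addr;
}

/* Launcher pointer back to a Guest-physical address. */
static unsigned long to_guest_phys(const struct guest_memory *mem,
				   const void *addr)
{
	return (unsigned long)((const char *)addr - mem->base);
}
```

With the memory passed explicitly, the same Guest-physical address resolves
to different Launcher pointers depending on which struct guest_memory is
handed in.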
Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
Documentation/lguest/lguest.c | 89 +++++++++++++++++++++++-------------------
1 file changed, 49 insertions(+), 40 deletions(-)
diff -r 95558c7d210e Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Thu Mar 13 14:11:40 2008 +1100
+++ b/Documentation/lguest/lguest.c Thu Mar 13 23:05:35 2008 +1100
@@ -76,10 +76,20 @@ static bool verbose;
/* The pipe to send commands to the waker process */
static int waker_fd;
-/* The pointer to the start of guest memory. */
-static void *guest_base;
-/* The maximum guest physical address allowed, and maximum possible. */
-static unsigned long guest_limit, guest_max;
+
+struct guest_memory
+{
+ /* The pointer to the start of guest memory. */
+ void *base;
+ /* The maximum guest physical address allowed. */
+ unsigned long limit;
+};
+
+/* The maximum possible page for the guest. */
+static unsigned long guest_max;
+
+/* This Guest's memory. */
+static struct guest_memory gmem;
/* a per-cpu variable indicating whose vcpu is currently running */
static unsigned int __thread cpu_id;
@@ -207,20 +217,19 @@ static u8 *get_feature_bits(struct devic
* will get you through this section. Or, maybe not.
*
* The Launcher sets up a big chunk of memory to be the Guest's "physical"
- * memory and stores it in "guest_base". In other words, Guest physical ==
- * Launcher virtual with an offset.
+ * memory. In other words, Guest physical == Launcher virtual with an offset.
*
* This can be tough to get your head around, but usually it just means that we
* use these trivial conversion functions when the Guest gives us it's
* "physical" addresses: */
-static void *from_guest_phys(unsigned long addr)
+static void *from_guest_phys(struct guest_memory *mem, unsigned long addr)
{
- return guest_base + addr;
+ return mem->base + addr;
}
-static unsigned long to_guest_phys(const void *addr)
+static unsigned long to_guest_phys(struct guest_memory *mem, const void *addr)
{
- return (addr - guest_base);
+ return (addr - mem->base);
}
/*L:130
@@ -287,10 +296,10 @@ static void *map_zeroed_pages(unsigned i
/* Get some more pages for a device. */
static void *get_pages(unsigned int num)
{
- void *addr = from_guest_phys(guest_limit);
+ void *addr = from_guest_phys(&gmem, gmem.limit);
- guest_limit += num * getpagesize();
- if (guest_limit > guest_max)
+ gmem.limit += num * getpagesize();
+ if (gmem.limit > guest_max)
errx(1, "Not enough memory for devices");
return addr;
}
@@ -351,7 +360,7 @@ static unsigned long map_elf(int elf_fd,
i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
/* We map this section of the file at its physical address. */
- map_at(elf_fd, from_guest_phys(phdr[i].p_paddr),
+ map_at(elf_fd, from_guest_phys(&gmem, phdr[i].p_paddr),
phdr[i].p_offset, phdr[i].p_filesz);
}
@@ -371,7 +380,7 @@ static unsigned long load_bzimage(int fd
struct boot_params boot;
int r;
/* Modern bzImages get loaded at 1M. */
- void *p = from_guest_phys(0x100000);
+ void *p = from_guest_phys(&gmem, 0x100000);
/* Go back to the start of the file and read the header. It should be
* a Linux boot header (see Documentation/i386/boot.txt) */
@@ -444,7 +453,7 @@ static unsigned long load_initrd(const c
/* We map the initrd at the top of memory, but mmap wants it to be
* page-aligned, so we round the size up for that. */
len = page_align(st.st_size);
- map_at(ifd, from_guest_phys(mem - len), 0, st.st_size);
+ map_at(ifd, from_guest_phys(&gmem, mem - len), 0, st.st_size);
/* Once a file is mapped, you can close the file descriptor. It's a
* little odd, but quite useful. */
close(ifd);
@@ -473,7 +482,7 @@ static unsigned long setup_pagetables(un
linear_pages = (mapped_pages + ptes_per_page-1)/ptes_per_page;
/* We put the toplevel page directory page at the top of memory. */
- pgdir = from_guest_phys(mem) - initrd_size - getpagesize();
+ pgdir = from_guest_phys(&gmem, mem) - initrd_size - getpagesize();
/* Now we use the next linear_pages pages as pte pages */
linear = (void *)pgdir - linear_pages*getpagesize();
@@ -487,16 +496,16 @@ static unsigned long setup_pagetables(un
/* The top level points to the linear page table pages above. */
for (i = 0; i < mapped_pages; i += ptes_per_page) {
pgdir[i/ptes_per_page]
- = ((to_guest_phys(linear) + i*sizeof(void *))
+ = ((to_guest_phys(&gmem, linear) + i*sizeof(void *))
| PAGE_PRESENT);
}
verbose("Linear mapping of %u pages in %u pte pages at %#lx\n",
- mapped_pages, linear_pages, to_guest_phys(linear));
+ mapped_pages, linear_pages, to_guest_phys(&gmem, linear));
/* We return the top level (guest-physical) address: the kernel needs
* to know where it is. */
- return to_guest_phys(pgdir);
+ return to_guest_phys(&gmem, pgdir);
}
/*:*/
@@ -525,12 +534,12 @@ static int tell_kernel(unsigned long pgd
static int tell_kernel(unsigned long pgdir, unsigned long start)
{
unsigned long args[] = { LHREQ_INITIALIZE,
- (unsigned long)guest_base,
- guest_limit / getpagesize(), pgdir, start };
+ (unsigned long)gmem.base,
+ gmem.limit / getpagesize(), pgdir, start };
int fd;
verbose("Guest: %p - %p (%#lx)\n",
- guest_base, guest_base + guest_limit, guest_limit);
+ gmem.base, gmem.base + gmem.limit, gmem.limit);
fd = open_or_die("/dev/lguest", O_RDWR);
if (write(fd, args, sizeof(args)) < 0)
err(1, "Writing to /dev/lguest");
@@ -629,18 +638,18 @@ static int setup_waker(int lguest_fd)
* if something funny is going on:
*/
static void *_check_pointer(unsigned long addr, unsigned int size,
- unsigned int line)
+ struct guest_memory *mem, unsigned int line)
{
/* We have to separately check addr and addr+size, because size could
* be huge and addr + size might wrap around. */
- if (addr >= guest_limit || addr + size >= guest_limit)
+ if (addr >= mem->limit || addr + size >= mem->limit)
errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr);
/* We return a pointer for the caller's convenience, now we know it's
* safe to use. */
- return from_guest_phys(addr);
+ return from_guest_phys(&gmem, addr);
}
/* A macro which transparently hands the line number to the real function. */
-#define check_pointer(addr,size) _check_pointer(addr, size, __LINE__)
+#define check_pointer(mem,addr,size) _check_pointer(addr, size, mem, __LINE__)
/* Each buffer in the virtqueues is actually a chain of descriptors. This
* function returns the next descriptor in the chain, or vq->vring.num if we're
@@ -702,7 +711,7 @@ static unsigned get_vq_desc(struct virtq
/* Grab the first descriptor, and check it's OK. */
iov[*out_num + *in_num].iov_len = vq->vring.desc[i].len;
iov[*out_num + *in_num].iov_base
- = check_pointer(vq->vring.desc[i].addr,
+ = check_pointer(&gmem, vq->vring.desc[i].addr,
vq->vring.desc[i].len);
/* If this is an input descriptor, increment that count. */
if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE)
@@ -975,7 +984,7 @@ static void handle_output(int fd, unsign
/* Check each device and virtqueue. */
for (i = devices.dev; i; i = i->next) {
/* Notifications to device descriptors reset the device. */
- if (from_guest_phys(addr) == i->desc) {
+ if (from_guest_phys(&gmem, addr) == i->desc) {
reset_device(i);
return;
}
@@ -1002,11 +1011,11 @@ static void handle_output(int fd, unsign
/* Early console write is done using notify on a nul-terminated string
* in Guest memory. */
- if (addr >= guest_limit)
+ if (addr >= gmem.limit)
errx(1, "Bad NOTIFY %#lx", addr);
- write(STDOUT_FILENO, from_guest_phys(addr),
- strnlen(from_guest_phys(addr), guest_limit - addr));
+ write(STDOUT_FILENO, from_guest_phys(&gmem, addr),
+ strnlen(from_guest_phys(&gmem, addr), gmem.limit - addr));
}
/* This is called when the Waker wakes us up: check for incoming file
@@ -1112,7 +1121,7 @@ static void add_virtqueue(struct device
/* Initialize the configuration. */
vq->config.num = num_descs;
vq->config.irq = devices.next_irq++;
- vq->config.pfn = to_guest_phys(p) / getpagesize();
+ vq->config.pfn = to_guest_phys(&gmem, p) / getpagesize();
/* Initialize the vring. */
vring_init(&vq->vring, num_descs, p, getpagesize());
@@ -1125,7 +1134,7 @@ static void add_virtqueue(struct device
memcpy(device_config(dev), &vq->config, sizeof(vq->config));
dev->desc->num_vq++;
- verbose("Virtqueue page %#lx\n", to_guest_phys(p));
+ verbose("Virtqueue page %#lx\n", to_guest_phys(&gmem, p));
/* Add to tail of list, so dev->vq is first vq, dev->vq->next is
* second. */
@@ -1731,9 +1740,9 @@ int main(int argc, char *argv[])
* guest-physical memory range. This fills it with 0,
* and ensures that the Guest won't be killed when it
* tries to access it. */
- guest_base = map_zeroed_pages(mem / getpagesize()
- + DEVICE_PAGES);
- guest_limit = mem;
+ gmem.base = map_zeroed_pages(mem / getpagesize()
+ + DEVICE_PAGES);
+ gmem.limit = mem;
guest_max = mem + DEVICE_PAGES*getpagesize();
devices.descpage = get_pages(1);
break;
@@ -1765,7 +1774,7 @@ int main(int argc, char *argv[])
if (optind + 2 > argc)
usage();
- verbose("Guest base is at %p\n", guest_base);
+ verbose("Guest base is at %p\n", gmem.base);
/* We always have a console device */
setup_console();
@@ -1774,7 +1783,7 @@ int main(int argc, char *argv[])
start = load_kernel(open_or_die(argv[optind+1], O_RDONLY));
/* Boot information is stashed at physical address 0 */
- boot = from_guest_phys(0);
+ boot = from_guest_phys(&gmem, 0);
/* Map the initrd image if requested (at top of physical memory) */
if (initrd_name) {
@@ -1796,7 +1805,7 @@ int main(int argc, char *argv[])
boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM });
/* The boot header contains a command line pointer: we put the command
* line after the boot header. */
- boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1);
+ boot->hdr.cmd_line_ptr = to_guest_phys(&gmem, boot + 1);
/* We use a simple helper to copy the arguments separated by spaces. */
concat((char *)(boot + 1), argv+optind+2);
* [RFC PATCH 3/5] lguest: separate out virtqueue info from device info.
2008-03-20 6:22 ` [RFC PATCH 2/5] lguest: Encapsulate Guest memory ready for dealing with other Guests Rusty Russell
@ 2008-03-20 6:36 ` Rusty Russell
[not found] ` <200803201736.01883.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
0 siblings, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 6:36 UTC (permalink / raw)
To: virtualization; +Cc: kvm-devel, lguest
To deal with other Guests' virtqueues, we need to separate out the
parts of the structure which deal with the actual virtqueue from the
configuration information and the device. Then we can change the
virtqueue descriptor handling functions to take that smaller
structure.
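
A simplified sketch of what the smaller structure makes possible. struct
vq_info here is a hypothetical stand-in with a toy ring, not the real vring
layout, and next_avail() and checked_ptr() are names invented for the
example. Note that a function returning -1 for "nothing available" needs its
result stored in a signed int: an unsigned variable would never compare less
than zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct guest_memory {
	char *base;
	unsigned long limit;
};

/* Toy stand-in for the per-queue state: which memory the queue lives in,
 * plus just enough ring state to show the "available" bookkeeping. */
struct vq_info {
	struct guest_memory *mem;	/* Our Guest's memory, or a peer's. */
	uint16_t num;			/* Ring size. */
	uint16_t avail_idx;		/* Written by the Guest. */
	uint16_t last_avail_idx;	/* Last index we consumed. */
	uint16_t ring[8];
};

/* Returns the next available head, or -1 if there is nothing new. */
static int next_avail(struct vq_info *vqi)
{
	if (vqi->avail_idx == vqi->last_avail_idx)
		return -1;
	return vqi->ring[vqi->last_avail_idx++ % vqi->num];
}

/* Bounds-check a buffer against the memory this queue belongs to.  Written
 * to avoid overflow in addr + size; returns NULL if out of range. */
static void *checked_ptr(const struct vq_info *vqi, unsigned long addr,
			 unsigned long size)
{
	const struct guest_memory *mem = vqi->mem;

	if (addr >= mem->limit || size > mem->limit - addr)
		return NULL;
	return mem->base + addr;
}
```

Since both helpers take only the queue state, they work unchanged whether
vqi->mem points at this Guest's memory or at another Guest's.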
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
Documentation/lguest/lguest.c | 142 ++++++++++++++++++++++--------------------
1 file changed, 76 insertions(+), 66 deletions(-)
diff -r 49ed4fa72c7c Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Mon Mar 17 15:33:54 2008 +1100
+++ b/Documentation/lguest/lguest.c Mon Mar 17 22:33:20 2008 +1100
@@ -148,6 +148,18 @@ struct device
};
/* The virtqueue structure describes a queue attached to a device. */
+struct virtqueue_info
+{
+ /* The memory this virtqueue sits in (usually gmem, our Guest). */
+ struct guest_memory *mem;
+
+ /* The actual ring of buffers. */
+ struct vring vring;
+
+ /* Last available index we saw. */
+ u16 last_avail_idx;
+};
+
struct virtqueue
{
struct virtqueue *next;
@@ -158,11 +170,8 @@ struct virtqueue
/* The configuration for this queue. */
struct lguest_vqconfig config;
- /* The actual ring of buffers. */
- struct vring vring;
-
- /* Last available index we saw. */
- u16 last_avail_idx;
+ /* Information about the Guest's virtqueue. */
+ struct virtqueue_info vqi;
/* The routine to call when the Guest pings us. */
void (*handle_output)(int fd, struct virtqueue *me);
@@ -656,7 +665,7 @@ static void *_check_pointer(unsigned lon
errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr);
/* We return a pointer for the caller's convenience, now we know it's
* safe to use. */
- return from_guest_phys(&gmem, addr);
+ return from_guest_phys(mem, addr);
}
/* A macro which transparently hands the line number to the real function. */
#define check_pointer(mem,addr,size) _check_pointer(addr, size, mem, __LINE__)
@@ -664,20 +673,20 @@ static void *_check_pointer(unsigned lon
/* Each buffer in the virtqueues is actually a chain of descriptors. This
* function returns the next descriptor in the chain, or vq->vring.num if we're
* at the end. */
-static unsigned next_desc(struct virtqueue *vq, unsigned int i)
+static unsigned next_desc(struct virtqueue_info *vqi, unsigned int i)
{
unsigned int next;
/* If this descriptor says it doesn't chain, we're done. */
- if (!(vq->vring.desc[i].flags & VRING_DESC_F_NEXT))
- return vq->vring.num;
+ if (!(vqi->vring.desc[i].flags & VRING_DESC_F_NEXT))
+ return vqi->vring.num;
/* Check they're not leading us off end of descriptors. */
- next = vq->vring.desc[i].next;
+ next = vqi->vring.desc[i].next;
/* Make sure compiler knows to grab that: we don't want it changing! */
wmb();
- if (next >= vq->vring.num)
+ if (next >= vqi->vring.num)
errx(1, "Desc next is %u", next);
return next;
@@ -688,29 +697,29 @@ static unsigned next_desc(struct virtque
* number of output then some number of input descriptors, it's actually two
* iovecs, but we pack them into one and note how many of each there were.
*
- * This function returns the descriptor number found, or vq->vring.num (which
- * is never a valid descriptor number) if none was found. */
-static unsigned get_vq_desc(struct virtqueue *vq,
- struct iovec iov[],
- unsigned int *out_num, unsigned int *in_num)
+ * This function returns the descriptor number found, or -1 if none was
+ * found. */
+static int get_vq_desc(struct virtqueue_info *vqi,
+ struct iovec iov[],
+ unsigned int *out_num, unsigned int *in_num)
{
unsigned int i, head;
/* Check it isn't doing very strange things with descriptor numbers. */
- if ((u16)(vq->vring.avail->idx - vq->last_avail_idx) > vq->vring.num)
+ if ((u16)(vqi->vring.avail->idx - vqi->last_avail_idx) > vqi->vring.num)
errx(1, "Guest moved used index from %u to %u",
- vq->last_avail_idx, vq->vring.avail->idx);
+ vqi->last_avail_idx, vqi->vring.avail->idx);
/* If there's nothing new since last we looked, return invalid. */
- if (vq->vring.avail->idx == vq->last_avail_idx)
- return vq->vring.num;
+ if (vqi->vring.avail->idx == vqi->last_avail_idx)
+ return -1;
/* Grab the next descriptor number they're advertising, and increment
* the index we've seen. */
- head = vq->vring.avail->ring[vq->last_avail_idx++ % vq->vring.num];
+ head = vqi->vring.avail->ring[vqi->last_avail_idx++ % vqi->vring.num];
/* If their number is silly, that's a fatal mistake. */
- if (head >= vq->vring.num)
+ if (head >= vqi->vring.num)
errx(1, "Guest says index %u is available", head);
/* When we start there are none of either input nor output. */
@@ -719,12 +728,12 @@ static unsigned get_vq_desc(struct virtq
i = head;
do {
/* Grab the first descriptor, and check it's OK. */
- iov[*out_num + *in_num].iov_len = vq->vring.desc[i].len;
+ iov[*out_num + *in_num].iov_len = vqi->vring.desc[i].len;
iov[*out_num + *in_num].iov_base
- = check_pointer(&gmem, vq->vring.desc[i].addr,
- vq->vring.desc[i].len);
+ = check_pointer(vqi->mem, vqi->vring.desc[i].addr,
+ vqi->vring.desc[i].len);
/* If this is an input descriptor, increment that count. */
- if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE)
+ if (vqi->vring.desc[i].flags & VRING_DESC_F_WRITE)
(*in_num)++;
else {
/* If it's an output descriptor, they're all supposed
@@ -735,27 +744,27 @@ static unsigned get_vq_desc(struct virtq
}
/* If we've got too many, that implies a descriptor loop. */
- if (*out_num + *in_num > vq->vring.num)
+ if (*out_num + *in_num > vqi->vring.num)
errx(1, "Looped descriptor");
- } while ((i = next_desc(vq, i)) != vq->vring.num);
+ } while ((i = next_desc(vqi, i)) != vqi->vring.num);
return head;
}
/* After we've used one of their buffers, we tell them about it. We'll then
* want to send them an interrupt, using trigger_irq(). */
-static void add_used(struct virtqueue *vq, unsigned int head, int len)
+static void add_used(struct virtqueue_info *vqi, unsigned int head, int len)
{
struct vring_used_elem *used;
/* The virtqueue contains a ring of used buffers. Get a pointer to the
* next entry in that used ring. */
- used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num];
+ used = &vqi->vring.used->ring[vqi->vring.used->idx % vqi->vring.num];
used->id = head;
used->len = len;
/* Make sure buffer is written before we update index. */
wmb();
- vq->vring.used->idx++;
+ vqi->vring.used->idx++;
}
/* This actually sends the interrupt for this virtqueue */
@@ -764,7 +773,7 @@ static void trigger_irq(int fd, struct v
unsigned long buf[] = { LHREQ_IRQ, vq->config.irq };
/* If they don't want an interrupt, don't send one. */
- if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+ if (vq->vqi.vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
return;
/* Send the Guest an interrupt tell them we used something up. */
@@ -776,7 +785,7 @@ static void add_used_and_trigger(int fd,
static void add_used_and_trigger(int fd, struct virtqueue *vq,
unsigned int head, int len)
{
- add_used(vq, head, len);
+ add_used(&vq->vqi, head, len);
trigger_irq(fd, vq);
}
@@ -803,17 +812,17 @@ struct console_abort
/* This is the routine which handles console input (ie. stdin). */
static bool handle_console_input(int fd, struct device *dev)
{
- int len;
- unsigned int head, in_num, out_num;
- struct iovec iov[dev->vq->vring.num];
+ int len, head;
+ unsigned int in_num, out_num;
+ struct iovec iov[dev->vq->vqi.vring.num];
struct console_abort *abort = dev->priv;
/* First we need a console buffer from the Guests's input virtqueue. */
- head = get_vq_desc(dev->vq, iov, &out_num, &in_num);
+ head = get_vq_desc(&dev->vq->vqi, iov, &out_num, &in_num);
/* If they're not ready for input, stop listening to this file
* descriptor. We'll start again once they add an input buffer. */
- if (head == dev->vq->vring.num)
+ if (head < 0)
return false;
if (out_num)
@@ -872,12 +881,12 @@ static bool handle_console_input(int fd,
* and write them to stdout. */
static void handle_console_output(int fd, struct virtqueue *vq)
{
- unsigned int head, out, in;
- int len;
- struct iovec iov[vq->vring.num];
+ unsigned int out, in;
+ int head, len;
+ struct iovec iov[vq->vqi.vring.num];
/* Keep getting output buffers from the Guest until we run out. */
- while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) {
+ while ((head = get_vq_desc(&vq->vqi, iov, &out, &in)) >= 0) {
if (in)
errx(1, "Input buffers in output queue?");
len = writev(STDOUT_FILENO, iov, out);
@@ -1003,17 +1012,17 @@ static void complete_net_setup(struct de
* and write them to this device's file descriptor (the tap device). */
static void handle_net_output(int fd, struct virtqueue *vq)
{
- unsigned int head, out, in;
- int len;
+ unsigned int out, in;
+ int head, len;
struct net_priv *priv = vq->dev->priv;
- struct iovec iov[vq->vring.num];
+ struct iovec iov[vq->vqi.vring.num];
/* We might not know whether this Guest speaks GSO until now. */
if (!priv->done_setup)
complete_net_setup(vq->dev);
/* Keep getting output buffers from the Guest until we run out. */
- while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) {
+ while ((head = get_vq_desc(&vq->vqi, iov, &out, &in)) >= 0) {
if (in)
errx(1, "Input buffers in output queue?");
@@ -1035,10 +1044,10 @@ static void handle_net_output(int fd, st
* Guest. */
static bool handle_tun_input(int fd, struct device *dev)
{
- unsigned int head, in_num, out_num;
- int len;
+ unsigned int in_num, out_num;
+ int head, len;
struct net_priv *priv = dev->priv;
- struct iovec iov[dev->vq->vring.num];
+ struct iovec iov[dev->vq->vqi.vring.num];
/* We might not know whether this Guest speaks GSO until now. */
if (!priv->done_setup) {
@@ -1049,8 +1058,8 @@ static bool handle_tun_input(int fd, str
}
/* First we need a network buffer from the Guests's recv virtqueue. */
- head = get_vq_desc(dev->vq, iov, &out_num, &in_num);
- if (head == dev->vq->vring.num) {
+ head = get_vq_desc(&dev->vq->vqi, iov, &out_num, &in_num);
+ if (head < 0) {
/* FIXME: Only do this if DRIVER_ACTIVE. */
warn("network: no dma buffer!");
/* We'll turn this back on if input buffers are registered. */
@@ -1082,7 +1091,7 @@ static bool handle_tun_input(int fd, str
verbose("tun input packet len %i [%02x %02x] (%s)\n", len,
((u8 *)iov[1].iov_base)[0], ((u8 *)iov[1].iov_base)[1],
- head != dev->vq->vring.num ? "sent" : "discarded");
+ head >= 0 ? "sent" : "discarded");
/* All good. */
return true;
@@ -1113,9 +1122,9 @@ static void reset_device(struct device *
/* Zero out the virtqueues. */
for (vq = dev->vq; vq; vq = vq->next) {
- memset(vq->vring.desc, 0,
+ memset(vq->vqi.vring.desc, 0,
vring_size(vq->config.num, getpagesize()));
- vq->last_avail_idx = 0;
+ vq->vqi.last_avail_idx = 0;
}
}
@@ -1262,8 +1271,9 @@ static void add_virtqueue(struct device
/* Initialize the virtqueue */
vq->next = NULL;
- vq->last_avail_idx = 0;
vq->dev = dev;
+ vq->vqi.last_avail_idx = 0;
+ vq->vqi.mem = &gmem;
/* Initialize the configuration. */
vq->config.num = num_descs;
@@ -1271,7 +1281,7 @@ static void add_virtqueue(struct device
vq->config.pfn = to_guest_phys(&gmem, p) / getpagesize();
/* Initialize the vring. */
- vring_init(&vq->vring, num_descs, p, getpagesize());
+ vring_init(&vq->vqi.vring, num_descs, p, getpagesize());
/* Append virtqueue to this device's descriptor. We use
* device_config() to get the end of the device's current virtqueues;
@@ -1295,7 +1305,7 @@ static void add_virtqueue(struct device
/* As an optimization, set the advisory "Don't Notify Me" flag if we
* don't have a handler */
if (!handle_output)
- vq->vring.used->flags = VRING_USED_F_NO_NOTIFY;
+ vq->vqi.vring.used->flags = VRING_USED_F_NO_NOTIFY;
}
/* The first half of the feature bitmask is for us to advertise features. The
@@ -1508,16 +1518,16 @@ static bool service_io(struct device *de
static bool service_io(struct device *dev)
{
struct vblk_info *vblk = dev->priv;
- unsigned int head, out_num, in_num, wlen;
- int ret;
+ unsigned int out_num, in_num, wlen;
+ int head, ret;
struct virtio_blk_inhdr *in;
struct virtio_blk_outhdr *out;
- struct iovec iov[dev->vq->vring.num];
+ struct iovec iov[dev->vq->vqi.vring.num];
off64_t off;
/* See if there's a request waiting. If not, nothing to do. */
- head = get_vq_desc(dev->vq, iov, &out_num, &in_num);
- if (head == dev->vq->vring.num)
+ head = get_vq_desc(&dev->vq->vqi, iov, &out_num, &in_num);
+ if (head < 0)
return false;
/* Every block request should contain at least one output buffer
@@ -1587,7 +1597,7 @@ static bool service_io(struct device *de
/* We can't trigger an IRQ, because we're not the Launcher. It does
* that when we tell it we're done. */
- add_used(dev->vq, head, wlen);
+ add_used(&dev->vq->vqi, head, wlen);
return true;
}
@@ -1722,15 +1732,15 @@ static bool handle_rng_input(int fd, str
{
- int len;
- unsigned int head, in_num, out_num;
+ int len, head;
+ unsigned int in_num, out_num;
- struct iovec iov[dev->vq->vring.num];
+ struct iovec iov[dev->vq->vqi.vring.num];
printf("Got input on rng fd!\n");
/* First we need a buffer from the Guests's virtqueue. */
- head = get_vq_desc(dev->vq, iov, &out_num, &in_num);
+ head = get_vq_desc(&dev->vq->vqi, iov, &out_num, &in_num);
/* If they're not ready for input, stop listening to this file
* descriptor. We'll start again once they add an input buffer. */
- if (head == dev->vq->vring.num) {
+ if (head < 0) {
printf("But no buffer!\n");
return false;
}
* [RFC PATCH 4/5] lguest: ignore bad virtqueues.
[not found] ` <200803201736.01883.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
@ 2008-03-20 6:40 ` Rusty Russell
2008-03-20 6:45 ` [RFC PATCH 5/5] lguest: Inter-guest networking Rusty Russell
0 siblings, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 6:40 UTC (permalink / raw)
To: virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel, lguest
Currently the lguest Launcher aborts when a Guest puts something bogus
in a virtio queue. If we want to deal with other (untrusted) Guests'
queues, that's a bad idea: instead, print a warning, mark the queue
broken, and ignore it from now on.
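
The pattern can be sketched in isolation as follows. This is a reduced
illustration with a toy ring (struct vq_info and next_avail() are
hypothetical, and fprintf() stands in for warnx()), not the get_vq_desc()
code in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy queue state with a sticky "broken" flag. */
struct vq_info {
	uint16_t num;			/* Ring size. */
	uint16_t avail_idx;		/* Written by the (untrusted) Guest. */
	uint16_t last_avail_idx;	/* Last index we consumed. */
	uint16_t ring[8];
	bool broken;
};

/* Returns the next available head, or -1 if there is none or the queue
 * has been marked broken.  Bad input marks the queue broken instead of
 * killing the whole process. */
static int next_avail(struct vq_info *vqi)
{
	uint16_t head;

	/* Once broken, pretend there is never anything there. */
	if (vqi->broken)
		return -1;

	/* The Guest cannot have added more entries than the ring holds. */
	if ((uint16_t)(vqi->avail_idx - vqi->last_avail_idx) > vqi->num) {
		fprintf(stderr, "avail index moved from %u to %u\n",
			(unsigned)vqi->last_avail_idx,
			(unsigned)vqi->avail_idx);
		goto broken;
	}

	if (vqi->avail_idx == vqi->last_avail_idx)
		return -1;

	head = vqi->ring[vqi->last_avail_idx++ % vqi->num];
	if (head >= vqi->num) {
		fprintf(stderr, "index %u is out of range\n", (unsigned)head);
		goto broken;
	}
	return head;

broken:
	vqi->broken = true;
	return -1;
}
```

Callers need no new error path: a broken queue looks exactly like an empty
one, so the existing "nothing to do" handling covers it.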
Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
Documentation/lguest/lguest.c | 45 +++++++++++++++++++++++++++++++-----------
1 file changed, 34 insertions(+), 11 deletions(-)
diff -r 784299890d4a Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Thu Mar 13 23:10:14 2008 +1100
+++ b/Documentation/lguest/lguest.c Thu Mar 13 23:21:55 2008 +1100
@@ -156,6 +156,9 @@ struct virtqueue_info
/* Last available index we saw. */
u16 last_avail_idx;
+
+ /* Are we broken? If so, ignore it from now on. */
+ bool broken;
};
struct virtqueue
@@ -676,8 +679,11 @@ static unsigned next_desc(struct virtque
/* Make sure compiler knows to grab that: we don't want it changing! */
wmb();
- if (next >= vqi->vring.num)
- errx(1, "Desc next is %u", next);
+ if (next >= vqi->vring.num) {
+ warnx("Desc next is %u", next);
+ vqi->broken = true;
+ return vqi->vring.num;
+ }
return next;
}
@@ -695,10 +701,16 @@ static int get_vq_desc(struct virtqueue_
{
unsigned int i, head;
+ /* If the queue is broken, we just pretend there's nothing there. */
+ if (vqi->broken)
+ return -1;
+
/* Check it isn't doing very strange things with descriptor numbers. */
- if ((u16)(vqi->vring.avail->idx - vqi->last_avail_idx) > vqi->vring.num)
- errx(1, "Guest moved used index from %u to %u",
- vqi->last_avail_idx, vqi->vring.avail->idx);
+ if ((u16)(vqi->vring.avail->idx-vqi->last_avail_idx) > vqi->vring.num) {
+ warnx("Guest moved used index from %u to %u",
+ vqi->last_avail_idx, vqi->vring.avail->idx);
+ goto broken;
+ }
/* If there's nothing new since last we looked, return invalid. */
if (vqi->vring.avail->idx == vqi->last_avail_idx)
@@ -709,8 +721,10 @@ static int get_vq_desc(struct virtqueue_
head = vqi->vring.avail->ring[vqi->last_avail_idx++ % vqi->vring.num];
/* If their number is silly, that's a fatal mistake. */
- if (head >= vqi->vring.num)
- errx(1, "Guest says index %u is available", head);
+ if (head >= vqi->vring.num) {
+ warnx("Guest says index %u is available", head);
+ goto broken;
+ }
/* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
@@ -728,17 +742,25 @@ static int get_vq_desc(struct virtqueue_
else {
/* If it's an output descriptor, they're all supposed
* to come before any input descriptors. */
- if (*in_num)
- errx(1, "Descriptor has out after in");
+ if (*in_num) {
+ warnx("Descriptor has out after in");
+ goto broken;
+ }
(*out_num)++;
}
/* If we've got too many, that implies a descriptor loop. */
- if (*out_num + *in_num > vqi->vring.num)
- errx(1, "Looped descriptor");
+ if (*out_num + *in_num > vqi->vring.num) {
+ warnx("Looped descriptor");
+ goto broken;
+ }
} while ((i = next_desc(vqi, i)) != vqi->vring.num);
return head;
+
+broken:
+ vqi->broken = true;
+ return -1;
}
/* After we've used one of their buffers, we tell them about it. We'll then
@@ -1127,6 +1149,7 @@ static void add_virtqueue(struct device
vq->dev = dev;
vq->vqi.last_avail_idx = 0;
vq->vqi.mem = &gmem;
+ vq->vqi.broken = false;
/* Initialize the configuration. */
vq->config.num = num_descs;
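
The pattern this patch applies can be shown in isolation. What follows is a minimal sketch, not code from lguest.c: `struct demo_queue` and `demo_next_index()` are hypothetical names, and the struct carries only the two fields the sketch needs. A validation failure warns, marks the queue broken, and reports "nothing available" instead of exiting the Launcher.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct virtqueue_info: only the fields the
 * sketch needs. */
struct demo_queue {
	unsigned int num;	/* Number of descriptors in the ring. */
	bool broken;		/* Set once the peer has misbehaved. */
};

/* Validate an index supplied by an untrusted peer.  Returns the index, or -1
 * if the queue is (or has just become) broken. */
static int demo_next_index(struct demo_queue *q, unsigned int idx)
{
	/* A broken queue stays broken: pretend nothing is there. */
	if (q->broken)
		return -1;

	if (idx >= q->num) {
		/* Warn and remember, instead of calling errx(1, ...). */
		fprintf(stderr, "Peer says index %u is available\n", idx);
		q->broken = true;
		return -1;
	}
	return (int)idx;
}
```

Once the flag is set, every later call returns -1 even for an index that would otherwise be valid, which matches the "ignore it from now on" wording of the patch description.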
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH 5/5] lguest: Inter-guest networking
2008-03-20 6:40 ` [RFC PATCH 4/5] lguest: ignore bad virtqueues Rusty Russell
@ 2008-03-20 6:45 ` Rusty Russell
0 siblings, 0 replies; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 6:45 UTC (permalink / raw)
To: virtualization; +Cc: kvm-devel, lguest
We open two FIFOs, mmap the other Guest's memory, and copy between
its send queue and our Guest's receive queue. A single byte written to
the FIFO notifies the other Guest about virtqueue activity.
Note the FIXMEs, and the fact that we don't suppress notifications
even when we could (based on the flags in the other Guest's
virtqueue).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
Documentation/lguest/lguest.c | 257 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 235 insertions(+), 22 deletions(-)
diff -r d803a2208052 Documentation/lguest/lguest.c
--- a/Documentation/lguest/lguest.c Mon Mar 17 22:33:22 2008 +1100
+++ b/Documentation/lguest/lguest.c Mon Mar 17 22:35:59 2008 +1100
@@ -276,6 +276,12 @@ static void unlink_memfile(void)
unlink(memfile_path);
}
+/* Name the memfiles by the process ID of this launcher. */
+static void guest_memfile(char *buffer, pid_t pid)
+{
+ snprintf(buffer, PATH_MAX, "%s/.lguest/%u", getenv("HOME") ?: "", pid);
+}
+
/* map_zeroed_pages() takes a number of pages, and creates a mapping file where
* this Guest's memory lives. */
static void *map_zeroed_pages(unsigned int num)
@@ -289,9 +295,7 @@ static void *map_zeroed_pages(unsigned i
if (mkdir(memfile_path, S_IRWXU) != 0 && errno != EEXIST)
err(1, "Creating directory %s", memfile_path);
- /* Name the memfiles by the process ID of this launcher. */
- snprintf(memfile_path, PATH_MAX, "%s/.lguest/%u",
- getenv("HOME") ?: "", getpid());
+ guest_memfile(memfile_path, getpid());
fd = open(memfile_path, O_RDWR | O_CREAT | O_TRUNC, S_IRWXU);
if (fd < 0)
err(1, "Creating memory backing file %s", memfile_path);
@@ -1426,22 +1430,6 @@ static void setup_console(void)
}
/*:*/
-/*M:010 Inter-guest networking is an interesting area. Simplest is to have a
- * --sharenet=<name> option which opens or creates a named pipe. This can be
- * used to send packets to another guest in a 1:1 manner.
- *
- * More sopisticated is to use one of the tools developed for project like UML
- * to do networking.
- *
- * Faster is to do virtio bonding in kernel. Doing this 1:1 would be
- * completely generic ("here's my vring, attach to your vring") and would work
- * for any traffic. Of course, namespace and permissions issues need to be
- * dealt with. A more sophisticated "multi-channel" virtio_net.c could hide
- * multiple inter-guest channels behind one interface, although it would
- * require some manner of hotplugging new virtio channels.
- *
- * Finally, we could implement a virtio network switch in the kernel. :*/
-
static void random_ether_addr(u8 *mac)
{
int randfd = open_or_die("/dev/urandom", O_RDONLY);
@@ -1503,10 +1491,230 @@ static void setup_tun_net(const char *ar
if (priv->bridge_name)
verbose("attached to bridge: %s\n", priv->bridge_name);
}
+/*:*/
-/* Our block (disk) device should be really simple: the Guest asks for a block
- * number and we read or write that position in the file. Unfortunately, that
- * was amazingly slow: the Guest waits until the read is finished before
+struct sharenet_priv
+{
+ /* The fifo to write to tell the other Launcher. */
+ int writefd;
+
+ /* The other Guest's send virtqueue. */
+ struct virtqueue_info vq_out;
+
+ /* The information the other Guest gave us. */
+ struct sharenet_other {
+ unsigned int pid;
+ u16 vq_out_num;
+ unsigned long vq_out_addr;
+ struct guest_memory mem;
+ } other;
+};
+
+/* Alters contents of src[] and dst[]. Returns true if all of src copied. */
+static bool iovec_copy(struct iovec *dst, unsigned int dst_num,
+ struct iovec *src, unsigned int src_num,
+ unsigned int *totlen)
+{
+ *totlen = 0;
+ while (src_num) {
+ unsigned int len = src->iov_len < dst->iov_len
+ ? src->iov_len : dst->iov_len;
+ memcpy(dst->iov_base, src->iov_base, len);
+ *totlen += len;
+
+ src->iov_base += len;
+ src->iov_len -= len;
+ if (!src->iov_len) {
+ src++;
+ src_num--;
+ }
+
+ dst->iov_base += len;
+ dst->iov_len -= len;
+ if (!dst->iov_len) {
+ dst++;
+ dst_num--;
+ /* If we're out of dst room, it's only ok if we're out
+ * of src too */
+ if (dst_num == 0)
+ return src_num == 0;
+ }
+ }
+ return true;
+}
+
+static bool inter_iov_copy(struct iovec fiov[],
+ unsigned int fout_num, unsigned int fin_num,
+ struct iovec iov[],
+ unsigned int out_num, unsigned int in_num,
+ unsigned int *len)
+{
+ unsigned int partlen;
+
+ /* Transfer our output to their input (not used by net code). */
+ if (!iovec_copy(fiov + fout_num, fin_num, iov, out_num, &partlen))
+ return false;
+ *len = partlen;
+ if (!iovec_copy(iov + out_num, in_num, fiov, fout_num, &partlen))
+ return false;
+ *len += partlen;
+ return true;
+}
+
+static bool handle_sharenet_input(int fd, struct device *dev)
+{
+ struct sharenet_priv *p = dev->priv;
+ struct virtqueue *vq = dev->vq;
+ struct iovec fiov[p->vq_out.vring.num], iov[vq->vqi.vring.num];
+ unsigned int fin_num, fout_num, in_num, out_num;
+ int fhead, head;
+ char c;
+ bool progress = false, filled = false;
+
+ if (read(dev->fd, &c, 1) != 1) {
+ warn("sharenet: failed to read from other Guest");
+ return false;
+ }
+
+ /* Look in other Guests' (ie. foreign) virtqueue. */
+ /* FIXME: Don't allow arbitrary bidir copies? */
+ while ((fhead = get_vq_desc(&p->vq_out, fiov, &fout_num, &fin_num))>=0){
+ unsigned int len;
+ /* Copy it into our receive queue. */
+ head = get_vq_desc(&vq->vqi, iov, &out_num, &in_num);
+ if (head < 0) {
+ /* We don't have room to take it, put it back. */
+ p->vq_out.last_avail_idx--;
+ filled = true;
+ break;
+ }
+ if (out_num)
+ errx(1, "Output buffers in network recv queue?");
+
+ if (!inter_iov_copy(fiov, fout_num, fin_num,
+ iov, out_num, in_num, &len)) {
+ warnx("Inter-guest network copy failed: too long?");
+ p->vq_out.broken = true;
+ return false;
+ }
+
+ /* We used one buffer of ours, and one of theirs. */
+ add_used(&vq->vqi, head, len);
+ add_used(&p->vq_out, fhead, len);
+ progress = true;
+ }
+
+ if (progress) {
+ trigger_irq(fd, vq);
+ /* FIXME: Only tell it if they want notify. */
+ write(fd, &c, 1);
+ }
+
+ /* If we filled up, return false: enable_fd will re-enable us. */
+ return !filled;
+}
+
+static void handle_sharenet_output(int fd, struct virtqueue *vq)
+{
+ struct sharenet_priv *p = vq->dev->priv;
+ char c = 0;
+
+ /* Tell other Guest we've got something for it. */
+ write(p->writefd, &c, 1);
+}
+
+static void setup_sharenet(const char *arg)
+{
+ struct device *dev;
+ struct sharenet_priv *p = malloc(sizeof(*p));
+ int fd, readfd;
+ char other_memfile[PATH_MAX];
+ struct sharenet_other us;
+ char *other;
+
+ /* Other fifo is the same, with _ appended. */
+ other = malloc(strlen(arg) + 2);
+ sprintf(other, "%s_", arg);
+
+ /* OK, if we're the first, we get to create it. */
+ if (mkfifo(arg, S_IRUSR|S_IWUSR) == 0) {
+ /* We open our own FIFO, then their FIFO */
+ readfd = open_or_die(arg, O_RDONLY);
+ /* Once we're connected, delete arg. */
+ unlink(arg);
+ p->writefd = open_or_die(other, O_WRONLY);
+ unlink(other);
+ } else {
+ /* The other side got there first. */
+ if (errno != EEXIST)
+ err(1, "Creating sharenet fifo %s", arg);
+
+ /* OK, make the fifo for the other side to open. */
+ if (mkfifo(other, S_IRUSR|S_IWUSR) != 0)
+ err(1, "Creating second sharenet fifo %s", other);
+
+ /* Now, open their FIFO, then open ours. We unlink even though
+ * we didn't create it: redundancy is useful. */
+ p->writefd = open_or_die(arg, O_WRONLY);
+ unlink(arg);
+ readfd = open_or_die(other, O_RDONLY);
+ unlink(other);
+ }
+
+ /* Now set up the device. */
+ dev = new_device("sharenet", VIRTIO_ID_NET, readfd,
+ handle_sharenet_input);
+ dev->priv = p;
+
+ /* Network devices need a receive and a send queue. */
+ add_virtqueue(dev, VIRTQUEUE_NUM, enable_fd);
+ add_virtqueue(dev, VIRTQUEUE_NUM, handle_sharenet_output);
+
+ /* Tell the other end about ourselves. */
+ us.pid = getpid();
+ us.vq_out_addr = to_guest_phys(&gmem, dev->vq->next->vqi.vring.desc);
+ us.vq_out_num = dev->vq->next->vqi.vring.num;
+ us.mem = gmem;
+ if (write(p->writefd, &us, sizeof(us)) != sizeof(us))
+ err(1, "Writing to second sharenet fifo");
+
+ /* And, your hobbies are? */
+ if (read(readfd, &p->other, sizeof(p->other)) != sizeof(p->other))
+ err(1, "Reading info from sharenet fifo");
+
+ /* Map their memory file. */
+ guest_memfile(other_memfile, p->other.pid);
+ fd = open_or_die(other_memfile, O_RDWR);
+ p->other.mem.base = mmap(NULL, p->other.mem.limit,
+ PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (p->other.mem.base == MAP_FAILED)
+ err(1, "Failed to mmap other Guest's memory for sharenet");
+ close(fd);
+
+ /* Check for silly virtqueue stats. */
+ if (p->other.vq_out_addr >= p->other.mem.limit
+ ||p->other.vq_out_addr+vring_size(p->other.vq_out_num,getpagesize())
+ >= p->other.mem.limit)
+ err(1, "sharenet: other Guest gave %lu/%u for vq",
+ p->other.vq_out_addr, p->other.vq_out_num);
+
+ p->vq_out.mem = &p->other.mem;
+ p->vq_out.last_avail_idx = 0;
+ p->vq_out.broken = false;
+ vring_init(&p->vq_out.vring, p->other.vq_out_num,
+ from_guest_phys(&p->other.mem, p->other.vq_out_addr),
+ getpagesize());
+
+ /* FIXME: make fifo non-blocking, so other guest can't freeze
+ * us on write. */
+ /* FIXME: kill SIGPIPE, so other guest can't kill us on write. */
+ verbose("device %u: sharenet (%u at %p)\n", devices.device_num++,
+ p->other.pid, p->other.mem.base);
+}
+
+/*L:196 Our block (disk) device should be really simple: the Guest asks for a
+ * block number and we read or write that position in the file. Unfortunately,
+ * that was amazingly slow: the Guest waits until the read is finished before
* running anything else, even if it could have been doing useful work.
*
* We could use async I/O, except it's reputed to suck so hard that characters
@@ -1851,6 +2058,7 @@ static struct option opts[] = {
static struct option opts[] = {
{ "verbose", 0, NULL, 'v' },
{ "tunnet", 1, NULL, 't' },
+ { "sharenet", 1, NULL, 's' },
{ "block", 1, NULL, 'b' },
{ "rng", 0, NULL, 'r' },
{ "initrd", 1, NULL, 'i' },
@@ -1860,6 +2068,7 @@ static void usage(void)
{
errx(1, "Usage: lguest [--verbose] "
"[--tunnet=(<ipaddr>|bridge:<bridgename>)\n"
+ "[--sharenet=<controlfile>]\n"
"|--block=<filename>|--initrd=<filename>]...\n"
"<mem-in-mb> vmlinux [args...]");
}
@@ -1932,6 +2141,9 @@ int main(int argc, char *argv[])
break;
case 'i':
initrd_name = optarg;
+ break;
+ case 's':
+ setup_sharenet(optarg);
break;
default:
warnx("Unknown argument %s", argv[optind]);
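
The two FIXMEs at the end of setup_sharenet() (non-blocking FIFO, SIGPIPE) could be addressed roughly as follows. This is only a sketch under assumptions: `demo_harden_notify_fd()` and `demo_notify()` are hypothetical helpers that do not exist in lguest.c, and ignoring SIGPIPE for the whole process is one possible policy, not necessarily the right one for the Launcher.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: prepare a notification fd so that a stalled or dead
 * peer can neither block us nor kill us with SIGPIPE. */
static int demo_harden_notify_fd(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;
	/* With SIGPIPE ignored, write() to a closed pipe fails with EPIPE. */
	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
		return -1;
	return 0;
}

/* Send one notification byte.  Returns true if it was written, or if the
 * pipe is full (in which case a wakeup is already pending); returns false
 * if the peer has gone away or another error occurred. */
static bool demo_notify(int fd)
{
	char c = 0;
	ssize_t ret;

	do {
		ret = write(fd, &c, 1);
	} while (ret < 0 && errno == EINTR);

	if (ret == 1)
		return true;
	return errno == EAGAIN;
}
```

Treating a full pipe as success relies on the notification carrying no payload: the reader only needs to know that at least one byte arrived.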
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <200803201659.14344.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:05 ` [RFC PATCH 1/5] lguest: mmap backing file Rusty Russell
@ 2008-03-20 6:54 ` Avi Kivity
[not found] ` <47E20A35.2000600-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-03-20 22:14 ` Rusty Russell
2008-03-20 14:11 ` [kvm-devel] " Anthony Liguori
2 siblings, 2 replies; 27+ messages in thread
From: Avi Kivity @ 2008-03-20 6:54 UTC (permalink / raw)
To: Rusty Russell
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Rusty Russell wrote:
> Hi all,
>
> Just finished my prototype of inter-guest virtio, using networking as an
> example. Each guest mmaps the other's address space and uses a FIFO for
> notifications.
>
>
Isn't that a security hole (hole? chasm)? If the two guests can access
each other's memory, they might as well be just one guest, and
communicate internally.
My feeling is that the host needs to copy the data, using dma if
available. Another option is to have one guest map the other's memory
for read and write, while the other guest is unprivileged. This allows
one privileged guest to provide services for other, unprivileged guests,
like domain 0 or driver domains in Xen.
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Lguest] [RFC PATCH 1/5] lguest: mmap backing file
2008-03-20 6:05 ` [RFC PATCH 1/5] lguest: mmap backing file Rusty Russell
[not found] ` <200803201705.44422.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
@ 2008-03-20 8:16 ` Tim Post
[not found] ` <1206000960.6873.124.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 27+ messages in thread
From: Tim Post @ 2008-03-20 8:16 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization
On Thu, 2008-03-20 at 17:05 +1100, Rusty Russell wrote:
> + snprintf(memfile_path, PATH_MAX, "%s/.lguest",
> getenv("HOME") ?: "");
Hi Rusty,
Is that safe if being run via setuid/gid or shared root? It might be
better to just look it up in /etc/passwd against the real UID,
considering that anyone can change (or null) that env string.
Of course it's also practical to just say "DON'T RUN LGUEST AS
SETUID/GID". Even if you say that, someone will do it. You might also
add a warning about sudoers.
For people (like myself and lab mates) who are forced to share machines,
it could breed a whole new strain of practical jokes :)
That will cause lguest to inherit a memory leak from getpwuid(), but it
only leaks once.
Cheers,
--Tim
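
A sketch of the lookup suggested above, for illustration only: `demo_lguest_dir()` is a hypothetical helper, not part of the patch. It takes the home directory from the password database entry for the real UID and ignores $HOME entirely.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build "<home>/.lguest" from the passwd entry of the
 * real UID.  Returns 0 on success, -1 if the user has no passwd entry or
 * the path does not fit in the buffer. */
static int demo_lguest_dir(char *buffer, size_t size)
{
	struct passwd *pw = getpwuid(getuid());
	int len;

	if (!pw || !pw->pw_dir)
		return -1;
	len = snprintf(buffer, size, "%s/.lguest", pw->pw_dir);
	if (len < 0 || (size_t)len >= size)
		return -1;
	return 0;
}
```

Unlike the `getenv("HOME") ?: ""` fallback, this fails outright when no home directory can be found, so the caller has to decide what to do in that case.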
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <47E20A35.2000600-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-03-20 13:55 ` Anthony Liguori
[not found] ` <47E26CC1.8080900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
0 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 13:55 UTC (permalink / raw)
To: Avi Kivity
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Avi Kivity wrote:
> Rusty Russell wrote:
>
>> Hi all,
>>
>> Just finished my prototype of inter-guest virtio, using networking as an
>> example. Each guest mmaps the other's address space and uses a FIFO for
>> notifications.
>>
>>
>>
>
> Isn't that a security hole (hole? chasm)? If the two guests can access
> each other's memory, they might as well be just one guest, and
> communicate internally.
>
Each guest's host userspace mmaps the other guest's address space. The
userspace then does a copy on both the tx and rx paths.
Conceivably, this could be done as a read-only mapping so that each
guest userspace copies only the rx packets. That's about as secure as
you're going to get with this approach I think.
Regards,
Anthony Liguori
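
The read-only variant described above might look like the following sketch. `demo_map_readonly()` is a hypothetical helper and is not in the posted patches, which map the other Guest's memory file with PROT_READ|PROT_WRITE.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map 'len' bytes of another guest's backing file
 * read-only.  Returns the mapping, or MAP_FAILED on error. */
static void *demo_map_readonly(const char *path, size_t len)
{
	void *base;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return MAP_FAILED;
	base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	/* The mapping stays valid after the descriptor is closed. */
	close(fd);
	return base;
}
```

With such a mapping the receiving Launcher can copy packets out of the sender's buffers but cannot update the sender's used ring itself, so completion would have to be signalled back some other way.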
> My feeling is that the host needs to copy the data, using dma if
> available. Another option is to have one guest map the other's memory
> for read and write, while the other guest is unprivileged. This allows
> one privileged guest to provide services for other, unprivileged guests,
> like domain 0 or driver domains in Xen.
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <200803201705.44422.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:22 ` [RFC PATCH 2/5] lguest: Encapsulate Guest memory ready for dealing with other Guests Rusty Russell
@ 2008-03-20 14:04 ` Anthony Liguori
[not found] ` <47E26EE1.5030706-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
1 sibling, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 14:04 UTC (permalink / raw)
To: Rusty Russell
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Rusty Russell wrote:
> From: Paul TBBle Hampson <Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org>
>
> This creates a file in $HOME/.lguest/ to directly back the RAM and DMA memory
> mappings created by map_zeroed_pages.
>
I created a test program recently that measured the latency of
reads/writes to an mmap()'d file in /dev/shm and in a normal filesystem.
Even after unlinking the underlying file, the write latency was much
better with a mmap()'d file in /dev/shm.
/dev/shm is not really for general use. I think we'll want to have our
own tmpfs mount that we use to create VM images. I also prefer to use a
unix socket for communication, unlink the file immediately after open,
and then pass the fd via SCM_RIGHTS to the other process.
Regards,
Anthony Liguori
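
For reference, passing a descriptor via SCM_RIGHTS looks roughly like this. It is a generic sketch of the mechanism using the standard sendmsg()/recvmsg() control-message interface; `demo_send_fd()` and `demo_recv_fd()` are hypothetical names and nothing here comes from lguest or kvm.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send file descriptor 'fd' over the unix socket 'sock', along with one
 * dummy data byte.  Returns 0 on success, -1 on error. */
static int demo_send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* Forces correct alignment. */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor from 'sock'.  Returns the new fd, or -1. */
static int demo_recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET
	    || cmsg->cmsg_type != SCM_RIGHTS
	    || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file, so the backing file can be unlinked before the fd is sent and never needs to be visible in the filesystem afterwards.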
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <1206000960.6873.124.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2008-03-20 14:07 ` Paul TBBle Hampson
2008-03-21 0:29 ` Rusty Russell
1 sibling, 0 replies; 27+ messages in thread
From: Paul TBBle Hampson @ 2008-03-20 14:07 UTC (permalink / raw)
To: Tim Post
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
[-- Attachment #1.1: Type: text/plain, Size: 4261 bytes --]
On Thu, Mar 20, 2008 at 04:16:00PM +0800, Tim Post wrote:
> On Thu, 2008-03-20 at 17:05 +1100, Rusty Russell wrote:
>> + snprintf(memfile_path, PATH_MAX, "%s/.lguest",
>> getenv("HOME") ?: "");
> Hi Rusty,
> Is that safe if being run via setuid/gid or shared root? It might be
> better to just look it up in /etc/passwd against the real UID,
> considering that anyone can change (or null) that env string.
> Of course it's also practical to just say "DON'T RUN LGUEST AS
> SETUID/GID". Even if you say that, someone will do it. You might also
> add a warning about sudoers.
> For people (like myself and lab mates) who are forced to share machines,
> it could breed a whole new strain of practical jokes :)
I'm not sure I see the risk here. Surely not "anyone" can modify your
environment variables out from under you?
Are you worried that other root users are going to point root's .lguest
directory somewhere else, but not the non-root user's directory?
I fear I'm missing something here...
There _is_ an issue I hadn't thought of at the time, which is if your
$HOME is on shared media, and you clash PIDs between lguest launchers on
two machines sharing that media as $HOME, you're going to clash
memfiles, specifically truncating the earlier memfile.
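
One way to at least detect such a clash, sketched under the assumption that the shared filesystem honours O_EXCL: create the memfile exclusively rather than with O_TRUNC. `demo_create_memfile()` is a hypothetical name, not the code in the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of the memfile open: with O_EXCL the open fails with
 * errno set to EEXIST if the file already exists, instead of truncating it
 * as O_TRUNC would. */
static int demo_create_memfile(const char *path)
{
	return open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
}
```

This only turns the silent truncation into an error; a stale file left by a crashed launcher with the same PID would then also need handling.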
(Sorry for the double-up, lguest list. I hit send too quickly)
--
-----------------------------------------------------------
Paul "TBBle" Hampson, B.Sc, LPI, MCSE
Very-later-year Asian Studies student, ANU
The Boss, Bubblesworth Pty Ltd (ABN: 51 095 284 361)
Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org
Of course Pacman didn't influence us as kids. If it did,
we'd be running around in darkened rooms, popping pills and
listening to repetitive music.
-- Kristian Wilson, Nintendo, Inc, 1989
License: http://creativecommons.org/licenses/by/2.1/au/
-----------------------------------------------------------
[-- Attachment #1.2: Type: application/pgp-signature, Size: 189 bytes --]
[-- Attachment #2: Type: text/plain, Size: 158 bytes --]
_______________________________________________
Lguest mailing list
Lguest-mnsaURCQ41sdnm+yROfE0A@public.gmane.org
https://ozlabs.org/mailman/listinfo/lguest
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <200803201659.14344.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:05 ` [RFC PATCH 1/5] lguest: mmap backing file Rusty Russell
2008-03-20 6:54 ` [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest Avi Kivity
@ 2008-03-20 14:11 ` Anthony Liguori
2008-03-23 12:05 ` Rusty Russell
2 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 14:11 UTC (permalink / raw)
To: Rusty Russell
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Rusty Russell wrote:
> Hi all,
>
> Just finished my prototype of inter-guest virtio, using networking as an
> example. Each guest mmaps the other's address space and uses a FIFO for
> notifications.
>
> There are two issues with this approach. The first is that neither guest
> can change its mappings. See patch 1.
Avi mentioned that with MMU notifiers, it may be possible to introduce a
new kernel mechanism whereby you could map an arbitrary region of one
process's memory into another process. This would address this problem
quite nicely.
> The second is that our feature
> configuration is "host presents, guest chooses" which breaks down when we
> don't know the capabilities of each guest. In particular, TSO capability for
> networking.
> There are three possible solutions:
> 1) Just offer the lowest common denominator to both sides (ie. no features).
> This is what I do with lguest in these patches.
> 2) Offer something and handle the case where one Guest accepts and another
> doesn't by emulating it. ie. de-TSO the packets manually.
> 3) "Hot unplug" the device from the guest which asks for the greater features,
> then re-add it offering less features. Requires hotplug in the guest OS.
>
4) Add a feature negotiation feature. The feature that gets set is the
"feature negotiate" feature. If a guest doesn't support feature
negotiation, you end up with the least-common denominator (no
features). If both guests support feature negotiation, you can then add
something new to determine the true common subset.
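
As a rough sketch of option 4, assuming features are represented as a bitmask: the bit names and values below are made up for illustration and are not virtio feature bits, and `demo_common_features()` is a hypothetical function.

```c
#include <stdint.h>

/* Hypothetical feature bits; the values are illustrative only. */
#define DEMO_F_NEGOTIATE	(1u << 0)	/* "I can negotiate" */
#define DEMO_F_TSO		(1u << 1)
#define DEMO_F_CSUM		(1u << 2)

/* Return the feature set both guests can use.  Unless both sides advertise
 * DEMO_F_NEGOTIATE, fall back to no features at all. */
static uint32_t demo_common_features(uint32_t a, uint32_t b)
{
	if (!(a & b & DEMO_F_NEGOTIATE))
		return 0;
	return a & b;
}
```

A guest that never sets the negotiate bit therefore sees exactly the lowest-common-denominator behaviour of option 1.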
> I haven't tuned or even benchmarked these patches, but it pings!
>
Very nice! It's particularly cool that it was possible entirely in
userspace.
Regards,
Anthony Liguori
> Rusty.
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <47E26CC1.8080900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
@ 2008-03-20 14:27 ` Avi Kivity
[not found] ` <47E27461.4090404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-03-20 14:27 UTC (permalink / raw)
To: Anthony Liguori
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Rusty Russell wrote:
>>
>>> Hi all,
>>>
>>> Just finished my prototype of inter-guest virtio, using
>>> networking as an example. Each guest mmaps the other's address
>>> space and uses a FIFO for notifications.
>>>
>>>
>>
>> Isn't that a security hole (hole? chasm)? If the two guests can
>> access each other's memory, they might as well be just one guest, and
>> communicate internally.
>>
>
> Each guest's host userspace mmaps the other guest's address space.
> The userspace then does a copy on both the tx and rx paths.
>
Well, that's better security-wise (I'd still prefer to avoid it, so we
can run each guest under a separate uid), but then we lose performance wise.
> Conceivably, this could be done as a read-only mapping so that each
> guest userspace copies only the rx packets. That's about as secure as
> you're going to get with this approach I think.
>
Maybe we can terminate the virtio queue in the host kernel as a pipe,
and splice pipes together.
That gives us guest-guest and guest-process communications, and if you
use aio the kernel can use a dma engine for the copy.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <47E26EE1.5030706-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
@ 2008-03-20 14:32 ` Paul TBBle Hampson
2008-03-20 15:07 ` Avi Kivity
2008-03-20 22:12 ` [kvm-devel] " Rusty Russell
2 siblings, 0 replies; 27+ messages in thread
From: Paul TBBle Hampson @ 2008-03-20 14:32 UTC (permalink / raw)
To: Anthony Liguori
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
[-- Attachment #1.1: Type: text/plain, Size: 2114 bytes --]
On Thu, Mar 20, 2008 at 09:04:17AM -0500, Anthony Liguori wrote:
> Rusty Russell wrote:
> >From: Paul TBBle Hampson <Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org>
> >This creates a file in $HOME/.lguest/ to directly back the RAM and DMA memory
> >mappings created by map_zeroed_pages.
> I created a test program recently that measured the latency of
> reads/writes to an mmap()'d file in /dev/shm and in a normal filesystem.
> Even after unlinking the underlying file, the write latency was much
> better with a mmap()'d file in /dev/shm.
> /dev/shm is not really for general use. I think we'll want to have our
> own tmpfs mount that we use to create VM images. I also prefer to use a
> unix socket for communication, unlink the file immediately after open,
> and then pass the fd via SCM_RIGHTS to the other process.
The original motivations for the file-backed mmap (rather than the
/dev/zero mmap) were two-fold.
Firstly, to allow suspend and resume to be done to a guest, it would
need somewhere for its memory to survive. (ie. a guest could be
suspended externally immediately, and its state would be resumable from
that mmap file)
Secondly, heading towards some kind of common-page-sharing trick, where
each lguest could spot and share pages in common with other lguests.
Both of these assume the file is going to be visible in the filesystem
until the guest is shut down.
As to whether these are still interesting motivations, I withhold any
opinion in favour of those who know better. ^_^
--
-----------------------------------------------------------
Paul "TBBle" Hampson, B.Sc, LPI, MCSE
Very-later-year Asian Studies student, ANU
The Boss, Bubblesworth Pty Ltd (ABN: 51 095 284 361)
Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org
Of course Pacman didn't influence us as kids. If it did,
we'd be running around in darkened rooms, popping pills and
listening to repetitive music.
-- Kristian Wilson, Nintendo, Inc, 1989
License: http://creativecommons.org/licenses/by/2.1/au/
-----------------------------------------------------------
[-- Attachment #1.2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <47E27461.4090404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-03-20 14:39 ` Anthony Liguori
2008-03-20 14:55 ` Avi Kivity
0 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 14:39 UTC (permalink / raw)
To: Avi Kivity
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>
>> Each guest's host userspace mmaps the other guest's address space.
>> The userspace then does a copy on both the tx and rx paths.
>>
>
> Well, that's better security-wise (I'd still prefer to avoid it, so we
> can run each guest under a separate uid), but then we lose performance
> wise.
What performance win? I'm not sure the copies can be eliminated in the
case of interguest IO.
Fast interguest IO means mmap()'ing the other guest's address space
read-only. If you had a pv dma registration api you could conceivably
only allow the active dma entries to be mapped but my fear would be that
the zap'ing on unregister would hurt performance.
>> Conceivably, this could be done as a read-only mapping so that each
>> guest userspace copies only the rx packets. That's about as secure
>> as you're going to get with this approach I think.
>>
>
> Maybe we can terminate the virtio queue in the host kernel as a pipe,
> and splice pipes together.
>
> That gives us guest-guest and guest-process communications, and if you
> use aio the kernel can use a dma engine for the copy.
Ah, so you're looking to use a DMA engine for accelerated copy. Perhaps
the answer is to expose the DMA engine via a userspace API?
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
2008-03-20 14:39 ` Anthony Liguori
@ 2008-03-20 14:55 ` Avi Kivity
2008-03-20 15:05 ` Anthony Liguori
0 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-03-20 14:55 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel, lguest, virtualization
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> Avi Kivity wrote:
>>>
>>> Each guest's host userspace mmaps the other guest's address space.
>>> The userspace then does a copy on both the tx and rx paths.
>>>
>>
>> Well, that's better security-wise (I'd still prefer to avoid it, so
>> we can run each guest under a separate uid), but then we lose
>> performance wise.
>
> What performance win? I'm not sure the copies can be eliminated in
> the case of interguest IO.
>
I guess not. But at least you can dma instead of busy-copying.
> Fast interguest IO means mmap()'ing the other guest's address space
> read-only.
This implies trusting the other userspace, which is not a good thing.
Let the kernel copy, we already trust it, and it has more resources to
do the copy.
> If you had a pv dma registration api you could conceivably only allow
> the active dma entries to be mapped but my fear would be that the
> zap'ing on unregister would hurt performance.
>
Yes, mmu games are costly. They also only work on page granularity
which isn't always possible to guarantee.
>>> Conceivably, this could be done as a read-only mapping so that each
>>> guest userspace copies only the rx packets. That's about as secure
>>> as you're going to get with this approach I think.
>>>
>>
>> Maybe we can terminate the virtio queue in the host kernel as a pipe,
>> and splice pipes together.
>>
>> That gives us guest-guest and guest-process communications, and if
>> you use aio the kernel can use a dma engine for the copy.
>
> Ah, so you're looking to use a DMA engine for accelerated copy.
> Perhaps the answer is to expose the DMA engine via a userspace API?
That's one option, but it still involves sharing all of memory.
Splicing pipes might be better.
--
error compiling committee.c: too many arguments to function
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
2008-03-20 14:55 ` Avi Kivity
@ 2008-03-20 15:05 ` Anthony Liguori
2008-03-20 15:36 ` Avi Kivity
0 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 15:05 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>> Avi Kivity wrote:
>>>>
>>>> Each guest's host userspace mmaps the other guest's address space.
>>>> The userspace then does a copy on both the tx and rx paths.
>>>>
>>>
>>> Well, that's better security-wise (I'd still prefer to avoid it, so
>>> we can run each guest under a separate uid), but then we lose
>>> performance wise.
>>
>> What performance win? I'm not sure the copies can be eliminated in
>> the case of interguest IO.
>>
>
> I guess not. But at least you can dma instead of busy-copying.
>
>> Fast interguest IO means mmap()'ing the other guest's address space
>> read-only.
You can have the file descriptor be opened O_RDONLY so trust isn't an issue.
> This implies trusting the other userspace, which is not a good thing.
> Let the kernel copy, we already trust it, and it has more resources to
> do the copy.
>
You're going to end up with the same trust issues no matter what unless
you let the kernel look directly at the virtio ring queue. That's the
only way to arbitrate what memory gets copied. There may be a generic
API here for fast interprocess IO, I don't know. splice() is a little
awkward though for this because you really don't want to sit in a
splice() loop. What you want is for both sides to be kick'ing the
kernel and the kernel to raise an event via eventfd() or something.
Absent whatever this kernel API is (which is really just helpful with a
DMA engine), I think the current userspace approach is pretty
reasonable. Not just for interguest IO but also for driver domains
which I think is a logical extension.
Regards,
Anthony Liguori
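
The "kick the kernel, get an event back" pattern mentioned above can be illustrated from userspace with eventfd(). This is only a sketch of the notification primitive, not of a kernel-side virtio arbiter (no such API is proposed in concrete form in this thread); kick() and wait_for_kicks() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical "kick": bump the eventfd counter by one, waking a waiter. */
static int kick(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical wait: block until at least one kick is pending, then
 * return how many kicks were coalesced and reset the counter to zero.
 */
static int wait_for_kicks(int efd, uint64_t *count)
{
	assert(efd >= 0 && count);
	return read(efd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Because the counter coalesces, a slow consumer sees one wakeup covering several kicks rather than one wakeup per kick, which is usually what a ring consumer wants.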
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <47E26EE1.5030706-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2008-03-20 14:32 ` Paul TBBle Hampson
@ 2008-03-20 15:07 ` Avi Kivity
2008-03-20 15:24 ` Anthony Liguori
2008-03-20 22:12 ` [kvm-devel] " Rusty Russell
2 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-03-20 15:07 UTC (permalink / raw)
To: Anthony Liguori
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Anthony Liguori wrote:
> Rusty Russell wrote:
>
>> From: Paul TBBle Hampson <Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org>
>>
>> This creates a file in $HOME/.lguest/ to directly back the RAM and DMA memory
>> mappings created by map_zeroed_pages.
>>
>>
>
> I created a test program recently that measured the latency of a
> reads/writes to an mmap() file in /dev/shm and in a normal filesystem.
> Even after unlinking the underlying file, the write latency was much
> better with a mmap()'d file in /dev/shm.
>
Surely the difference disappears once the pages have been faulted in?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] lguest: mmap backing file
2008-03-20 15:07 ` Avi Kivity
@ 2008-03-20 15:24 ` Anthony Liguori
0 siblings, 0 replies; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 15:24 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization
Avi Kivity wrote:
> Anthony Liguori wrote:
>> Rusty Russell wrote:
>>
>>> From: Paul TBBle Hampson <Paul.Hampson@Pobox.com>
>>>
>>> This creates a file in $HOME/.lguest/ to directly back the RAM and
>>> DMA memory
>>> mappings created by map_zeroed_pages.
>>>
>>
>> I created a test program recently that measured the latency of a
>> reads/writes to an mmap() file in /dev/shm and in a normal
>> filesystem. Even after unlinking the underlying file, the write
>> latency was much better with a mmap()'d file in /dev/shm.
>>
>
> Surely the difference disappears once the pages have been faulted in?
I don't recall. I believe rewrites were okay, but the initial write was
much worse.
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
2008-03-20 15:05 ` Anthony Liguori
@ 2008-03-20 15:36 ` Avi Kivity
[not found] ` <47E28482.9010501-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 27+ messages in thread
From: Avi Kivity @ 2008-03-20 15:36 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel, lguest, virtualization
Anthony Liguori wrote:
>
> You can have the file descriptor be opened O_RDONLY so trust isn't an
> issue.
>
Reading is just as bad as writing.
>> This implies trusting the other userspace, which is not a good
>> thing. Let the kernel copy, we already trust it, and it has more
>> resources to do the copy.
>>
>
> You're going to end up with the same trust issues no matter what
> unless you let the kernel look directly at the virtio ring queue.
> That's the only way to arbitrate what memory gets copied.
That's what we need, then.
> There may be a generic API here for fast interprocess IO, I don't
> know. splice() is a little awkward though for this because you really
> don't want to sit in a splice() loop. What you want is for both sides
> to be kick'ing the kernel and the kernel to raise an event via
> eventfd() or something.
>
> Absent whatever this kernel API is (which is really just helpful with
> a DMA engine), I think the current userspace approach is pretty
> reasonable. Not just for interguest IO but also for driver domains
> which I think is a logical extension.
I disagree. A driver domain is shared between multiple guests, and if
one of the guests manages to break into qemu then it can see the other
guests' data.
[Driver domains are a horrible idea IMO, but that's another story]
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
[not found] ` <47E28482.9010501-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-03-20 15:52 ` Anthony Liguori
0 siblings, 0 replies; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 15:52 UTC (permalink / raw)
To: Avi Kivity
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Avi Kivity wrote:
>
> I disagree. A driver domain is shared between multiple guests, and if
> one of the guests manages to break into qemu then it can see other
> guest's data.
You still don't strictly need to do things in the kernel if this is your
concern. You can have another process map both guests' address spaces
and do the copying on behalf of each guest if you're paranoid about
escaping into QEMU.
> [Driver domains are a horrible idea IMO, but that's another story]
I don't disagree :-)
Regards,
Anthony Liguori
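
A minimal sketch of the copy step such a separate helper process would perform, assuming it already holds both guests' memory mapped. The struct guest_mem type and copy_between_guests() are hypothetical; the point is only that guest-supplied offsets and lengths have to be checked against each mapping before copying.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one guest's memory as mapped by the helper. */
struct guest_mem {
	void *base;	/* start of the mapping in the helper's address space */
	size_t size;	/* length of the mapping in bytes */
};

/* True if [off, off + len) lies entirely inside the mapping. */
static int range_ok(const struct guest_mem *g, size_t off, size_t len)
{
	return off <= g->size && len <= g->size - off;
}

/*
 * Copy len bytes from src guest memory to dst guest memory.
 * Returns 0 on success, -1 if either range falls outside its mapping.
 */
static int copy_between_guests(const struct guest_mem *src, size_t src_off,
			       const struct guest_mem *dst, size_t dst_off,
			       size_t len)
{
	assert(src && dst);
	if (!range_ok(src, src_off, len) || !range_ok(dst, dst_off, len))
		return -1;
	memcpy((char *)dst->base + dst_off,
	       (const char *)src->base + src_off, len);
	return 0;
}
```

The bounds check is written as a subtraction (len <= size - off) rather than an addition so that a large guest-supplied offset cannot overflow.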
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <47E26EE1.5030706-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2008-03-20 14:32 ` Paul TBBle Hampson
2008-03-20 15:07 ` Avi Kivity
@ 2008-03-20 22:12 ` Rusty Russell
2008-03-20 23:46 ` Anthony Liguori
2 siblings, 1 reply; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 22:12 UTC (permalink / raw)
To: Anthony Liguori
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Friday 21 March 2008 01:04:17 Anthony Liguori wrote:
> Rusty Russell wrote:
> > From: Paul TBBle Hampson <Paul.Hampson-vM6MUUi4OUAAvxtiuMwx3w@public.gmane.org>
> >
> > This creates a file in $HOME/.lguest/ to directly back the RAM and DMA
> > memory mappings created by map_zeroed_pages.
>
> I created a test program recently that measured the latency of a
> reads/writes to an mmap() file in /dev/shm and in a normal filesystem.
> Even after unlinking the underlying file, the write latency was much
> better with a mmap()'d file in /dev/shm.
How odd! Do you have any idea why?
> /dev/shm is not really for general use. I think we'll want to have our
> own tmpfs mount that we use to create VM images.
If we're going to mod the kernel, how about a "mmap this part of their address
space" call, with the kernel keeping the mappings in sync? But I think that if
we want to get speed, we should probably be doing the copy between address
spaces in-kernel so we can do lightweight exits.
> I also prefer to use a
> unix socket for communication, unlink the file immediately after open,
> and then pass the fd via SCM_RIGHTS to the other process.
Yeah, I shied away from that because cred passing kills whole litters of
puppies. It makes for better encapsulation tho, so I'd do it that way in a
serious implementation.
Cheers,
Rusty.
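
For reference, the unlink-then-pass-the-fd approach discussed above might look like the sketch below. It is not taken from the patches; create_unlinked_file(), send_fd() and recv_fd() are hypothetical helpers, and the /tmp template is only an illustrative location for the backing file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Create a backing file and unlink it at once: it lives only via its fd. */
static int create_unlinked_file(size_t size)
{
	char path[] = "/tmp/guest-mem-XXXXXX";	/* illustrative location */
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Send fd over a connected unix socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver can then mmap() the descriptor it gets back; no pathname is ever visible to other processes once the file has been unlinked.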
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
2008-03-20 6:54 ` [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest Avi Kivity
[not found] ` <47E20A35.2000600-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2008-03-20 22:14 ` Rusty Russell
1 sibling, 0 replies; 27+ messages in thread
From: Rusty Russell @ 2008-03-20 22:14 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization
On Thursday 20 March 2008 17:54:45 Avi Kivity wrote:
> Rusty Russell wrote:
> > Hi all,
> >
> > Just finished my prototype of inter-guest virtio, using networking as
> > an example. Each guest mmaps the other's address space and uses a FIFO
> > for notifications.
>
> Isn't that a security hole (hole? chasm)? If the two guests can access
> each other's memory, they might as well be just one guest, and
> communicate internally.
Sorry, sloppy language on my part. Each launcher process maps the other
guest's memory as well: ie. copying occurs in the host.
> My feeling is that the host needs to copy the data, using dma if
> available. Another option is to have one guest map the other's memory
> for read and write, while the other guest is unprivileged. This allows
> one privileged guest to provide services for other, unprivileged guests,
> like domain 0 or driver domains in Xen.
One having privilege is possible, even trivial with the current patch (it's
actually doing a completely generic inter-virtio-ring shuffle). I chose the
symmetrical approach for this demo for no particularly good reason.
Cheers,
Rusty.
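
To make "each launcher maps the other guest's memory" concrete, here is a minimal sketch of one launcher mapping a peer's backing file. It shows the read-only variant suggested elsewhere in the thread, which is an assumption and not necessarily what the patches do; map_peer_memory() and its arguments are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map size bytes of another guest's backing file read-only.
 * Returns the mapping, or NULL on failure.
 */
static const void *map_peer_memory(const char *path, size_t size)
{
	void *mem;
	int fd;

	assert(path && size > 0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	mem = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the fd is closed */
	return mem == MAP_FAILED ? NULL : mem;
}
```

With MAP_SHARED the launcher sees the peer guest's writes as they happen, which is what lets it copy packets out of the peer's buffers.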
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] lguest: mmap backing file
2008-03-20 22:12 ` [kvm-devel] " Rusty Russell
@ 2008-03-20 23:46 ` Anthony Liguori
2008-03-23 9:11 ` Avi Kivity
0 siblings, 1 reply; 27+ messages in thread
From: Anthony Liguori @ 2008-03-20 23:46 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization
Rusty Russell wrote:
> How odd! Do you have any idea why?
>
Nope, but part of the reason I did this was I recalled a similar
discussion relating to kqemu and why it used /dev/shm. I thought it was
only an issue with older kernels but apparently not.
>> /dev/shm is not really for general use. I think we'll want to have our
>> own tmpfs mount that we use to create VM images.
>>
>
> If we're going to mod the kernel, how about a "mmap this part of their address
> space" and having the kernel keep the mappings in sync. But I think that if
> we want to get speed, we should probably be doing the copy between address
> spaces in-kernel so we can do lightweight exits.
>
I don't think lightweight exits help the situation very much. The
difference between a lightweight and a heavyweight exit is only 3-4k
cycles or so.
Doing it in-kernel doesn't make the situation much easier. You have to map
pages in from a different task. It's a lot easier if you have both guests
mapped in userspace.
>> I also prefer to use a
>> unix socket for communication, unlink the file immediately after open,
>> and then pass the fd via SCM_RIGHTS to the other process.
>>
>
> Yeah, I shied away from that because cred passing kills whole litters of
> puppies. It makes for better encapsulation tho, so I'd do it that way in a
> serious implementation.
>
I'm working on an implementation for KVM at the moment. Instead of just
supporting two guests, I'm looking to support N-guests and provide a
simple switch. I'll have patches soon.
Regards,
Anthony Liguori
> Cheers,
> Rusty.
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] lguest: mmap backing file
[not found] ` <1206000960.6873.124.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-03-20 14:07 ` Paul TBBle Hampson
@ 2008-03-21 0:29 ` Rusty Russell
1 sibling, 0 replies; 27+ messages in thread
From: Rusty Russell @ 2008-03-21 0:29 UTC (permalink / raw)
To: echo-Czp0qWhDxZq1SnRDb8oMDQ
Cc: kvm-devel, lguest,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Thursday 20 March 2008 19:16:00 Tim Post wrote:
> On Thu, 2008-03-20 at 17:05 +1100, Rusty Russell wrote:
> > + snprintf(memfile_path, PATH_MAX, "%s/.lguest",
> > getenv("HOME") ?: "");
>
> Hi Rusty,
>
> Is that safe if being run via setuid/gid or shared root? It might be
> better to just look it up in /etc/passwd against the real UID,
> considering that anyone can change (or null) that env string.
Hi Tim,
Fair point: it is bogus in this use case. Of course, setuid-ing lguest
is dumb anyway, since you could use --block= to read and write any file in
the filesystem. The mid-term goal is to allow non-root to run lguest, which
fixes this problem (we don't allow that at the moment, as the guest can pin
memory).
Cheers,
Rusty.
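
Tim's suggestion of looking the directory up against the real UID could be sketched as follows. This is not part of the patch; lguest_dir_for_real_uid() is a hypothetical helper that ignores $HOME entirely.

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build "<home>/.lguest" from the passwd entry of the real UID.
 * Returns 0 on success, -1 if there is no entry or the path does not fit.
 */
static int lguest_dir_for_real_uid(char *buf, size_t len)
{
	struct passwd *pw;
	int n;

	assert(buf && len > 0);
	pw = getpwuid(getuid());	/* real UID, not the effective one */
	if (!pw || !pw->pw_dir)
		return -1;
	n = snprintf(buf, len, "%s/.lguest", pw->pw_dir);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Unlike the getenv("HOME") ?: "" fallback, this fails outright when no home directory can be determined, so the caller has to decide what to do in that case.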
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/5] lguest: mmap backing file
2008-03-20 23:46 ` Anthony Liguori
@ 2008-03-23 9:11 ` Avi Kivity
0 siblings, 0 replies; 27+ messages in thread
From: Avi Kivity @ 2008-03-23 9:11 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel, lguest, virtualization
Anthony Liguori wrote:
>>>
>>>
>> If we're going to mod the kernel, how about a "mmap this part of their address
>> space" and having the kernel keep the mappings in sync. But I think that if
>> we want to get speed, we should probably be doing the copy between address
>> spaces in-kernel so we can do lightweight exits.
>>
>>
>
> I don't think lightweight exits help the situation very much. The
> difference between a light weight and heavy weight exit is only 3-4k
> cycles or so.
>
On what host cpu? IIRC the difference was bigger on Intel (and in
relative terms, set to increase).
> in-kernel doesn't make the situation much easier. You have to map pages
> in from a different task. It's a lot easier if you have both guest
> mapped in userspace.
>
The kernel already has everything mapped (kmap_atomic() is an addition
on x86_64).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest
2008-03-20 14:11 ` [kvm-devel] " Anthony Liguori
@ 2008-03-23 12:05 ` Rusty Russell
0 siblings, 0 replies; 27+ messages in thread
From: Rusty Russell @ 2008-03-23 12:05 UTC (permalink / raw)
To: Anthony Liguori; +Cc: kvm-devel, lguest, virtualization
On Friday 21 March 2008 01:11:35 Anthony Liguori wrote:
> Rusty Russell wrote:
> > There are three possible solutions:
> > 1) Just offer the lowest common denominator to both sides (ie. no
> > features). This is what I do with lguest in these patches.
> > 2) Offer something and handle the case where one Guest accepts and
> > another doesn't by emulating it. ie. de-TSO the packets manually.
> > 3) "Hot unplug" the device from the guest which asks for the greater
> > features, then re-add it offering less features. Requires hotplug in the
> > guest OS.
>
> 4) Add a feature negotiation feature. The feature that gets set is the
> "feature negotiate" feature. If a guest doesn't support feature
> negotiation, you end up with the least-common denominator (no
> features). If both guests support feature negotiation, you can then add
> something new to determine the true common subset.
Hmm, I discarded that out of hand as too icky, but we might end up there.
Analyse features as normal, accept feature negotiation, set DRIVER_OK and
wait for a config change; if feature negotiation is still set, go around
again (presumably with some features removed).
I'll prototype it and see how we go.
Thanks,
Rusty.
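
The negotiation loop described above can be modelled with plain bitmasks. This is a toy sketch only: F_NEGOTIATE and its bit position are invented for illustration, and the DRIVER_OK and config-change steps are reduced to one loop iteration per round.

```c
#include <assert.h>
#include <stdint.h>

#define F_NEGOTIATE (1u << 31)	/* hypothetical "feature negotiation" bit */

/* What one guest acks from an offer: the offered bits it supports. */
static uint32_t guest_ack(uint32_t offered, uint32_t supported)
{
	return offered & supported;
}

/*
 * Host side: keep re-offering until both guests ack the same set.
 * If either guest lacks F_NEGOTIATE, fall back to no features.
 * Returns the agreed feature bits (without F_NEGOTIATE).
 */
static uint32_t negotiate(uint32_t host_offer, uint32_t guest_a,
			  uint32_t guest_b)
{
	uint32_t offer = host_offer | F_NEGOTIATE;

	assert(!(host_offer & F_NEGOTIATE));
	for (;;) {
		uint32_t a = guest_ack(offer, guest_a);
		uint32_t b = guest_ack(offer, guest_b);

		if (!(a & b & F_NEGOTIATE))
			return 0;	/* lowest common denominator */
		if (a == b)
			return a & ~F_NEGOTIATE;
		offer = a & b;	/* "config change": re-offer the reduced set */
	}
}
```

In this model the loop settles after at most two rounds, because the second offer is already a subset of what both guests support.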
^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
2008-03-20 5:59 [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest Rusty Russell
[not found] ` <200803201659.14344.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:05 ` [RFC PATCH 1/5] lguest: mmap backing file Rusty Russell
[not found] ` <200803201705.44422.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:22 ` [RFC PATCH 2/5] lguest: Encapsulate Guest memory ready for dealing with other Guests Rusty Russell
2008-03-20 6:36 ` [RFC PATCH 3/5] lguest: separate out virtqueue info from device info Rusty Russell
[not found] ` <200803201736.01883.rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
2008-03-20 6:40 ` [RFC PATCH 4/5] lguest: ignore bad virtqueues Rusty Russell
2008-03-20 6:45 ` [RFC PATCH 5/5] lguest: Inter-guest networking Rusty Russell
2008-03-20 14:04 ` [kvm-devel] [RFC PATCH 1/5] lguest: mmap backing file Anthony Liguori
[not found] ` <47E26EE1.5030706-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2008-03-20 14:32 ` Paul TBBle Hampson
2008-03-20 15:07 ` Avi Kivity
2008-03-20 15:24 ` Anthony Liguori
2008-03-20 22:12 ` [kvm-devel] " Rusty Russell
2008-03-20 23:46 ` Anthony Liguori
2008-03-23 9:11 ` Avi Kivity
2008-03-20 8:16 ` [Lguest] " Tim Post
[not found] ` <1206000960.6873.124.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-03-20 14:07 ` Paul TBBle Hampson
2008-03-21 0:29 ` Rusty Russell
2008-03-20 6:54 ` [kvm-devel] [RFC PATCH 0/4] Inter-guest virtio I/O example with lguest Avi Kivity
[not found] ` <47E20A35.2000600-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-03-20 13:55 ` Anthony Liguori
[not found] ` <47E26CC1.8080900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2008-03-20 14:27 ` Avi Kivity
[not found] ` <47E27461.4090404-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-03-20 14:39 ` Anthony Liguori
2008-03-20 14:55 ` Avi Kivity
2008-03-20 15:05 ` Anthony Liguori
2008-03-20 15:36 ` Avi Kivity
[not found] ` <47E28482.9010501-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2008-03-20 15:52 ` [kvm-devel] " Anthony Liguori
2008-03-20 22:14 ` Rusty Russell
2008-03-20 14:11 ` [kvm-devel] " Anthony Liguori
2008-03-23 12:05 ` Rusty Russell