qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
@ 2011-12-29  1:26 Isaku Yamahata
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 1/2] export necessary symbols Isaku Yamahata
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:26 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This is the Linux kernel driver for qemu/kvm postcopy live migration.
It is used by the qemu/kvm postcopy live migration patches.

TODO:
- Consider FUSE/CUSE option
  Several mmap patches for FUSE/CUSE are floating around (though their
  purpose is not so different from ours). They have not been merged
  upstream yet.
  The driver-specific part of the qemu patches is modularized, so I expect
  it would not be difficult to switch the kernel driver to a CUSE-based one.

ioctl commands:

UMEM_DEV_CREATE_UMEM: create a umem device for qemu
UMEM_DEV_LIST: list the created umem devices
UMEM_DEV_REATTACH: re-attach to a created umem device
                   UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when the
                   process that services page faults disappears or gets stuck.
                   The administrator can then list the umem devices and unblock
                   the process that is waiting for a page.

UMEM_GET_PAGE_REQUEST: retrieve page fault requests from the qemu process
UMEM_MARK_PAGE_CACHED: mark the specified pages as pulled from the source
                       (issued by the daemon)

UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process anonymous.
                         This is _NOT_ implemented yet; I'm not sure whether
                         it can be implemented or not.

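As an illustration of how a page-serving daemon drives these ioctls, here is
a rough sketch written against the structures in include/linux/umem.h from
patch 2/2. It is only a sketch: error handling is omitted, the device name is
made up, and the memset() stands in for fetching the actual page contents
from the migration source.

/* Daemon-side sketch: create a umem device, then serve page faults.
 * Error handling is omitted; memset() stands in for fetching the real
 * page contents from the migration source. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/umem.h>

#define REQ_BATCH 32

int serve_umem(__u64 size)
{
	struct umem_create create;
	memset(&create, 0, sizeof(create));
	create.size = size;
	strncpy(create.name.id, "example-guest", sizeof(create.name.id));

	int dev_fd = open("/dev/umem", O_RDWR);
	if (dev_fd < 0 || ioctl(dev_fd, UMEM_DEV_CREATE_UMEM, &create) < 0)
		return -1;

	/* qemu mmap()s create.umem_fd; the daemon stages page contents
	 * in the shmem file behind create.shmem_fd */
	long page_size = sysconf(_SC_PAGESIZE);
	char *staging = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
			     MAP_SHARED, create.shmem_fd, 0);

	__u64 pgoffs[REQ_BATCH];
	for (;;) {
		struct umem_page_request req = {
			.pgoffs = pgoffs, .nr = REQ_BATCH,
		};
		__u32 i;

		/* blocks until qemu faults on pages not yet marked cached */
		if (ioctl(create.umem_fd, UMEM_GET_PAGE_REQUEST, &req) < 0)
			break;

		for (i = 0; i < req.nr; i++)
			/* a real daemon would copy the page contents from
			 * the migration source here */
			memset(staging + pgoffs[i] * page_size, 0, page_size);

		struct umem_page_cached cached = {
			.pgoffs = pgoffs, .nr = req.nr,
		};
		if (ioctl(create.umem_fd, UMEM_MARK_PAGE_CACHED, &cached) < 0)
			break;
	}
	return 0;
}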

---
Changes version 1 -> 2:
- pad the ioctl structures for alignment
- un-KVM
  KVM_VMEM -> UMEM
- dropped some ioctl commands as Avi requested

Isaku Yamahata (2):
  export necessary symbols
  umem: chardevice for kvm postcopy

 drivers/char/Kconfig  |    9 +
 drivers/char/Makefile |    1 +
 drivers/char/umem.c   |  898 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/umem.h  |   83 +++++
 mm/memcontrol.c       |    1 +
 mm/shmem.c            |    1 +
 6 files changed, 993 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/umem.c
 create mode 100644 include/linux/umem.h


* [Qemu-devel] [PATCH 1/2] export necessary symbols
  2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
@ 2011-12-29  1:26 ` Isaku Yamahata
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:26 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 mm/memcontrol.c |    1 +
 mm/shmem.c      |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..85530fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,6 +2807,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
 
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..d137a37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2546,6 +2546,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(shmem_zero_setup);
 
 /**
  * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.
-- 
1.7.1.1


* [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy
  2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 1/2] export necessary symbols Isaku Yamahata
@ 2011-12-29  1:26 ` Isaku Yamahata
  2011-12-29 11:17   ` Avi Kivity
  2012-01-05  4:08   ` [Qemu-devel] Re: " thfbjyddx
  2011-12-29  1:31 ` [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
  2011-12-29 11:24 ` Avi Kivity
  3 siblings, 2 replies; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:26 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: yamahata, t.hirofuchi, satoshi.itoh

This is a character device that hooks page accesses.
A page fault in the mapped area is reported to another user process by
this char driver; that process then fills in the page contents and
resolves the page fault.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 drivers/char/Kconfig  |    9 +
 drivers/char/Makefile |    1 +
 drivers/char/umem.c   |  898 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/umem.h  |   83 +++++
 4 files changed, 991 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/umem.c
 create mode 100644 include/linux/umem.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 4364303..001e3e4 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -15,6 +15,15 @@ config DEVKMEM
 	  kind of kernel debugging operations.
 	  When in doubt, say "N".
 
+config UMEM
+        tristate "/dev/umem user process backed memory support"
+	default n
+	help
+	  User process backed memory driver provides /dev/umem device.
+	  The /dev/umem device is designed for some sort of distributed
+	  shared memory. Especially post-copy live migration with KVM.
+	  When in doubt, say "N".
+
 config STALDRV
 	bool "Stallion multiport serial support"
 	depends on SERIAL_NONSTANDARD
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 32762ba..1eb14dc 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -3,6 +3,7 @@
 #
 
 obj-y				+= mem.o random.o
+obj-$(CONFIG_UMEM)		+= umem.o
 obj-$(CONFIG_TTY_PRINTK)	+= ttyprintk.o
 obj-y				+= misc.o
 obj-$(CONFIG_ATARI_DSP56K)	+= dsp56k.o
diff --git a/drivers/char/umem.c b/drivers/char/umem.c
new file mode 100644
index 0000000..df669fb
--- /dev/null
+++ b/drivers/char/umem.c
@@ -0,0 +1,898 @@
+/*
+ * UMEM: user process backed memory.
+ *
+ * Copyright (c) 2011,
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/miscdevice.h>
+#include <linux/umem.h>
+
+struct umem_page_req_list {
+	struct list_head list;
+	pgoff_t pgoff;
+};
+
+struct umem {
+	loff_t size;
+	pgoff_t pgoff_end;
+	spinlock_t lock;
+
+	wait_queue_head_t req_wait;
+
+	int async_req_max;
+	int async_req_nr;
+	pgoff_t *async_req;
+
+	int sync_req_max;
+	unsigned long *sync_req_bitmap;
+	unsigned long *sync_wait_bitmap;
+	pgoff_t *sync_req;
+	wait_queue_head_t *page_wait;
+
+	int req_list_nr;
+	struct list_head req_list;
+	wait_queue_head_t req_list_wait;
+
+	unsigned long *cached;
+	unsigned long *faulted;
+
+	bool mmapped;
+	unsigned long vm_start;
+	unsigned int vma_nr;
+	struct task_struct *task;
+
+	struct file *shmem_filp;
+	struct vm_area_struct vma;
+
+	struct kref kref;
+	struct list_head list;
+	struct umem_name name;
+};
+
+
+static LIST_HEAD(umem_list);
+DEFINE_MUTEX(umem_list_mutex);
+
+static bool umem_name_eq(const struct umem_name *lhs,
+			  const struct umem_name *rhs)
+{
+	return memcmp(lhs->id, rhs->id, sizeof(lhs->id)) == 0 &&
+		memcmp(lhs->name, rhs->name, sizeof(lhs->name)) == 0;
+}
+
+static int umem_add_list(struct umem *umem)
+{
+	struct umem *entry;
+	BUG_ON(!mutex_is_locked(&umem_list_mutex));
+	list_for_each_entry(entry, &umem_list, list) {
+		if (umem_name_eq(&entry->name, &umem->name)) {
+			mutex_unlock(&umem_list_mutex);
+			return -EBUSY;
+		}
+	}
+
+	list_add(&umem->list, &umem_list);
+	return 0;
+}
+
+static void umem_release_fake_vmf(int ret, struct vm_fault *fake_vmf)
+{
+	if (ret & VM_FAULT_LOCKED) {
+		unlock_page(fake_vmf->page);
+	}
+	page_cache_release(fake_vmf->page);
+}
+
+static int umem_minor_fault(struct umem *umem,
+			    struct vm_area_struct *vma,
+			    struct vm_fault *vmf)
+{
+	struct vm_fault fake_vmf;
+	int ret;
+	struct page *page;
+
+	BUG_ON(!test_bit(vmf->pgoff, umem->cached));
+	fake_vmf = *vmf;
+	fake_vmf.page = NULL;
+	ret = umem->vma.vm_ops->fault(&umem->vma, &fake_vmf);
+	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))
+		return ret;
+
+	/*
+	 * TODO: pull out fake_vmf->page from shmem file and donate it
+	 * to this vma resolving the page fault.
+	 * vmf->page = fake_vmf->page;
+	 */
+
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+	if (!page)
+		return VM_FAULT_OOM;
+	if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
+		umem_release_fake_vmf(ret, &fake_vmf);
+		page_cache_release(page);
+		return VM_FAULT_OOM;
+	}
+
+	copy_highpage(page, fake_vmf.page);
+	umem_release_fake_vmf(ret, &fake_vmf);
+
+	ret |= VM_FAULT_LOCKED;
+	SetPageUptodate(page);
+	vmf->page = page;
+	set_bit(vmf->pgoff, umem->faulted);
+
+	return ret;
+}
+
+static int umem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct file *filp = vma->vm_file;
+	struct umem *umem = filp->private_data;
+
+	if (vmf->pgoff >= umem->pgoff_end) {
+		return VM_FAULT_SIGBUS;
+	}
+
+	BUG_ON(test_bit(vmf->pgoff, umem->faulted));
+
+	if (!test_bit(vmf->pgoff, umem->cached)) {
+		/* major fault */
+		unsigned long bit;
+		DEFINE_WAIT(wait);
+
+		if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+			/* async page fault */
+			spin_lock(&umem->lock);
+			if (umem->async_req_nr < umem->async_req_max) {
+				umem->async_req[umem->async_req_nr] =
+					vmf->pgoff;
+				umem->async_req_nr++;
+			}
+			spin_unlock(&umem->lock);
+			wake_up_poll(&umem->req_wait, POLLIN);
+
+			if (test_bit(vmf->pgoff, umem->cached))
+				return umem_minor_fault(umem, vma, vmf);
+			return VM_FAULT_MAJOR | VM_FAULT_RETRY;
+		}
+
+		spin_lock(&umem->lock);
+		bit = find_first_zero_bit(umem->sync_wait_bitmap,
+					  umem->sync_req_max);
+		if (likely(bit < umem->sync_req_max)) {
+			umem->sync_req[bit] = vmf->pgoff;
+			prepare_to_wait(&umem->page_wait[bit], &wait,
+					TASK_UNINTERRUPTIBLE);
+			set_bit(bit, umem->sync_req_bitmap);
+			set_bit(bit, umem->sync_wait_bitmap);
+			spin_unlock(&umem->lock);
+			wake_up_poll(&umem->req_wait, POLLIN);
+
+			if (!test_bit(vmf->pgoff, umem->cached))
+				schedule();
+			finish_wait(&umem->page_wait[bit], &wait);
+			clear_bit(bit, umem->sync_wait_bitmap);
+		} else {
+			struct umem_page_req_list page_req_list = {
+				.pgoff = vmf->pgoff,
+			};
+			umem->req_list_nr++;
+			list_add_tail(&page_req_list.list, &umem->req_list);
+			wake_up_poll(&umem->req_wait, POLLIN);
+			for (;;) {
+				prepare_to_wait(&umem->req_list_wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+				if (test_bit(vmf->pgoff, umem->cached)) {
+					umem->req_list_nr--;
+					break;
+				}
+				spin_unlock(&umem->lock);
+				schedule();
+				spin_lock(&umem->lock);
+			}
+			spin_unlock(&umem->lock);
+			finish_wait(&umem->req_list_wait, &wait);
+		}
+
+		return umem_minor_fault(umem, vma, vmf) | VM_FAULT_MAJOR;
+	}
+
+	return umem_minor_fault(umem, vma, vmf);
+}
+
+/* for partial munmap */
+static void umem_vma_open(struct vm_area_struct *vma)
+{
+	struct file *filp = vma->vm_file;
+	struct umem *umem = filp->private_data;
+
+	spin_lock(&umem->lock);
+	umem->vma_nr++;
+	spin_unlock(&umem->lock);
+}
+
+static void umem_vma_close(struct vm_area_struct *vma)
+{
+	struct file *filp = vma->vm_file;
+	struct umem *umem = filp->private_data;
+	struct task_struct *task = NULL;
+
+	spin_lock(&umem->lock);
+	umem->vma_nr--;
+	if (umem->vma_nr == 0) {
+		task = umem->task;
+		umem->task = NULL;
+	}
+	spin_unlock(&umem->lock);
+
+	if (task)
+		put_task_struct(task);
+}
+
+static const struct vm_operations_struct umem_vm_ops = {
+	.open = umem_vma_open,
+	.close = umem_vma_close,
+	.fault = umem_fault,
+};
+
+static int umem_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct umem *umem = filp->private_data;
+	int error;
+
+	/* allow mmap() only once */
+	spin_lock(&umem->lock);
+	if (umem->mmapped) {
+		error = -EBUSY;
+		goto out;
+	}
+	if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff >
+	    umem->pgoff_end) {
+		error = -EINVAL;
+		goto out;
+	}
+
+	umem->mmapped = true;
+	umem->vma_nr = 1;
+	umem->vm_start = vma->vm_start;
+	get_task_struct(current);
+	umem->task = current;
+	spin_unlock(&umem->lock);
+
+	vma->vm_ops = &umem_vm_ops;
+	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+	vma->vm_flags &= ~VM_SHARED;
+	return 0;
+
+out:
+	spin_unlock(&umem->lock);
+	return error;
+}
+
+static bool umem_req_pending(struct umem* umem)
+{
+	return !list_empty(&umem->req_list) ||
+		!bitmap_empty(umem->sync_req_bitmap, umem->sync_req_max) ||
+		(umem->async_req_nr > 0);
+}
+
+static unsigned int umem_poll(struct file* filp, poll_table *wait)
+{
+	struct umem *umem = filp->private_data;
+	unsigned int events = 0;
+
+	poll_wait(filp, &umem->req_wait, wait);
+
+	spin_lock(&umem->lock);
+	if (umem_req_pending(umem))
+		events |= POLLIN;
+	spin_unlock(&umem->lock);
+
+	return events;
+}
+
+/*
+ * return value
+ * true: finished
+ * false: more request
+ */
+static bool umem_copy_page_request(struct umem *umem,
+				   pgoff_t *pgoffs, int req_max,
+				   int *req_nr)
+{
+	struct umem_page_req_list *req_list;
+	struct umem_page_req_list *tmp;
+
+	unsigned long bit;
+
+	*req_nr = 0;
+	list_for_each_entry_safe(req_list, tmp, &umem->req_list, list) {
+		list_del(&req_list->list);
+		pgoffs[*req_nr] = req_list->pgoff;
+		(*req_nr)++;
+		if (*req_nr >= req_max)
+			return false;
+	}
+
+	bit = 0;
+	for (;;) {
+		bit = find_next_bit(umem->sync_req_bitmap, umem->sync_req_max,
+				    bit);
+		if (bit >= umem->sync_req_max)
+			break;
+		pgoffs[*req_nr] = umem->sync_req[bit];
+		(*req_nr)++;
+		clear_bit(bit, umem->sync_req_bitmap);
+		if (*req_nr >= req_max)
+			return false;
+		bit++;
+	}
+
+	if (umem->async_req_nr > 0) {
+		int nr = min(req_max - *req_nr, umem->async_req_nr);
+		memcpy(pgoffs + *req_nr, umem->async_req,
+		       sizeof(*umem->async_req) * nr);
+		umem->async_req_nr -= nr;
+		*req_nr += nr;
+		memmove(umem->async_req, umem->sync_req + nr,
+			umem->async_req_nr * sizeof(*umem->async_req));
+
+	}
+	return umem->async_req_nr == 0;
+}
+
+static int umem_get_page_request(struct umem *umem,
+				 struct umem_page_request *page_req)
+{
+	DEFINE_WAIT(wait);
+#define REQ_MAX	((__u32)32)
+	pgoff_t pgoffs[REQ_MAX];
+	__u32 req_copied = 0;
+	int ret = 0;
+
+	spin_lock(&umem->lock);
+	for (;;) {
+		prepare_to_wait(&umem->req_wait, &wait, TASK_INTERRUPTIBLE);
+		if (umem_req_pending(umem)) {
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		spin_unlock(&umem->lock);
+		schedule();
+		spin_lock(&umem->lock);
+	}
+	finish_wait(&umem->req_wait, &wait);
+	if (ret)
+		goto out_unlock;
+
+	while (req_copied < page_req->nr) {
+		int req_max;
+		int req_nr;
+		bool finished;
+		req_max = min(page_req->nr - req_copied, REQ_MAX);
+		finished = umem_copy_page_request(umem, pgoffs, req_max,
+						  &req_nr);
+
+		spin_unlock(&umem->lock);
+
+		if (req_nr > 0) {
+			ret = 0;
+			if (copy_to_user(page_req->pgoffs + req_copied, pgoffs,
+					 sizeof(*pgoffs) * req_nr)) {
+				ret = -EFAULT;
+				goto out;
+			}
+		}
+		req_copied += req_nr;
+		if (finished)
+			goto out;
+
+		spin_lock(&umem->lock);
+	}
+
+out_unlock:
+	spin_unlock(&umem->lock);
+out:
+	page_req->nr = req_copied;
+	return ret;
+}
+
+static int umem_mark_page_cached(struct umem *umem,
+				 struct umem_page_cached *page_cached)
+{
+	int ret = 0;
+#define PG_MAX	((__u32)32)
+	__u64 pgoffs[PG_MAX];
+	__u32 nr;
+	unsigned long bit;
+	bool wake_up_list = false;
+
+	nr = 0;
+	while (nr < page_cached->nr) {
+		__u32 todo = min(PG_MAX, (page_cached->nr - nr));
+		int i;
+
+		if (copy_from_user(pgoffs, page_cached->pgoffs + nr,
+				   sizeof(*pgoffs) * todo)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		for (i = 0; i < todo; ++i) {
+			if (pgoffs[i] >= umem->pgoff_end) {
+				ret = -EINVAL;
+				goto out;
+			}
+			set_bit(pgoffs[i], umem->cached);
+		}
+		nr += todo;
+	}
+
+	spin_lock(&umem->lock);
+	bit = 0;
+	for (;;) {
+		bit = find_next_bit(umem->sync_wait_bitmap, umem->sync_req_max,
+				    bit);
+		if (bit >= umem->sync_req_max)
+			break;
+		if (test_bit(umem->sync_req[bit], umem->cached))
+			wake_up(&umem->page_wait[bit]);
+		bit++;
+	}
+
+	if (umem->req_list_nr > 0)
+		wake_up_list = true;
+	spin_unlock(&umem->lock);
+
+	if (wake_up_list)
+		wake_up_all(&umem->req_list_wait);
+
+out:
+	return ret;
+}
+
+static int umem_make_vma_anonymous(struct umem *umem)
+{
+#if 1
+	return -ENOSYS;
+#else
+	unsigned long saddr;
+	unsigned long eaddr;
+	unsigned long addr;
+	unsigned long bit;
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	spin_lock(&umem->lock);
+	task = umem->task;
+	saddr = umem->vm_start;
+	eaddr = saddr + umem->size;
+	bit = find_first_zero_bit(umem->faulted, umem->pgoff_end);
+	if (bit < umem->pgoff_end) {
+		spin_unlock(&umem->lock);
+		return -EBUSY;
+	}
+	spin_unlock(&umem->lock);
+	if (task == NULL)
+		return 0;
+	mm = get_task_mm(task);
+	if (mm == NULL)
+		return 0;
+
+	addr = saddr;
+	down_write(&mm->mmap_sem);
+	while (addr < eaddr) {
+		struct vm_area_struct *vma;
+		vma = find_vma(mm, addr);
+		if (umem_is_umem_vma(umem, vma)) {
+			/* XXX incorrect. race/locking and more fix up */
+			struct file *filp = vma->vm_file;
+			vma->vm_ops->close(vma);
+			vma->vm_ops = NULL;
+			vma->vm_file = NULL;
+			/* vma->vm_flags */
+			fput(filp);
+		}
+		addr = vma->vm_end;
+	}
+	up_write(&mm->mmap_sem);
+
+	mmput(mm);
+	return 0;
+#endif
+}
+
+static long umem_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	struct umem *umem = filp->private_data;
+	void __user *argp = (void __user *) arg;
+	long ret = 0;
+
+	switch (ioctl) {
+	case UMEM_GET_PAGE_REQUEST: {
+		struct umem_page_request page_request;
+		ret = -EFAULT;
+		if (copy_from_user(&page_request, argp, sizeof(page_request)))
+			break;
+		ret = umem_get_page_request(umem, &page_request);
+		if (ret == 0 &&
+		    copy_to_user(argp +
+				 offsetof(struct umem_page_request, nr),
+				 &page_request.nr,
+				 sizeof(page_request.nr))) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	}
+	case UMEM_MARK_PAGE_CACHED: {
+		struct umem_page_cached page_cached;
+		ret = -EFAULT;
+		if (copy_from_user(&page_cached, argp, sizeof(page_cached)))
+			break;
+		ret = umem_mark_page_cached(umem, &page_cached);
+		break;
+	}
+	case UMEM_MAKE_VMA_ANONYMOUS:
+		ret = umem_make_vma_anonymous(umem);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static unsigned long umem_bitmap_bytes(const struct umem *umem)
+{
+	return round_up(umem->pgoff_end, BITS_PER_LONG) / 8;
+}
+
+
+static void umem_free(struct kref *kref)
+{
+	struct umem *umem = container_of(kref, struct umem, kref);
+
+	BUG_ON(!mutex_is_locked(&umem_list_mutex));
+	list_del(&umem->list);
+	mutex_unlock(&umem_list_mutex);
+
+	if (umem->task) {
+		put_task_struct(umem->task);
+		umem->task = NULL;
+	}
+
+	if (umem->shmem_filp)
+		fput(umem->shmem_filp);
+	if (umem_bitmap_bytes(umem) > PAGE_SIZE) {
+		vfree(umem->cached);
+		vfree(umem->faulted);
+	} else {
+		kfree(umem->cached);
+		kfree(umem->faulted);
+	}
+	kfree(umem->async_req);
+	kfree(umem->sync_req_bitmap);
+	kfree(umem->sync_wait_bitmap);
+	kfree(umem->page_wait);
+	kfree(umem->sync_req);
+	kfree(umem);
+}
+
+static void umem_put(struct umem *umem)
+{
+	int ret;
+
+	mutex_lock(&umem_list_mutex);
+	ret = kref_put(&umem->kref, umem_free);
+	if (ret == 0) {
+		mutex_unlock(&umem_list_mutex);
+	}
+}
+
+static int umem_release(struct inode *inode, struct file *filp)
+{
+	struct umem *umem = filp->private_data;
+	umem_put(umem);
+	return 0;
+}
+
+static struct file_operations umem_fops = {
+	.release	= umem_release,
+	.unlocked_ioctl = umem_ioctl,
+	.mmap		= umem_mmap,
+	.poll		= umem_poll,
+	.llseek		= noop_llseek,
+};
+
+static int umem_create_umem(struct umem_create *create)
+{
+	int error = 0;
+	struct umem *umem = NULL;
+	struct vm_area_struct *vma;
+	int shmem_fd;
+	unsigned long bitmap_bytes;
+	unsigned long sync_bitmap_bytes;
+	int i;
+
+	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
+	umem->name = create->name;
+	kref_init(&umem->kref);
+	INIT_LIST_HEAD(&umem->list);
+
+	mutex_lock(&umem_list_mutex);
+	error = umem_add_list(umem);
+	if (error) {
+		goto out;
+	}
+
+	umem->task = NULL;
+	umem->mmapped = false;
+	spin_lock_init(&umem->lock);
+	umem->size = roundup(create->size, PAGE_SIZE);
+	umem->pgoff_end = umem->size >> PAGE_SHIFT;
+	init_waitqueue_head(&umem->req_wait);
+
+	vma = &umem->vma;
+	vma->vm_start = 0;
+	vma->vm_end = umem->size;
+	/* this shmem file is used for temporal buffer for pages
+	   so it's unlikely that so many pages exists in this shmem file */
+	vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY |
+		VM_DONTEXPAND;
+	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+	vma->vm_pgoff = 0;
+	INIT_LIST_HEAD(&vma->anon_vma_chain);
+
+	shmem_fd = get_unused_fd();
+	if (shmem_fd < 0) {
+		error = shmem_fd;
+		goto out;
+	}
+	error = shmem_zero_setup(vma);
+	if (error < 0) {
+		put_unused_fd(shmem_fd);
+		goto out;
+	}
+	umem->shmem_filp = vma->vm_file;
+	get_file(umem->shmem_filp);
+	fd_install(shmem_fd, vma->vm_file);
+	create->shmem_fd = shmem_fd;
+
+	create->umem_fd = anon_inode_getfd("umem",
+					   &umem_fops, umem, O_RDWR);
+	if (create->umem_fd < 0) {
+		error = create->umem_fd;
+		goto out;
+	}
+
+	bitmap_bytes = umem_bitmap_bytes(umem);
+	if (bitmap_bytes > PAGE_SIZE) {
+		umem->cached = vzalloc(bitmap_bytes);
+		umem->faulted = vzalloc(bitmap_bytes);
+	} else {
+		umem->cached = kzalloc(bitmap_bytes, GFP_KERNEL);
+		umem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL);
+	}
+
+	/* those constants are not exported.
+	   They are just used for default value */
+#define KVM_MAX_VCPUS	256
+#define ASYNC_PF_PER_VCPU 64
+
+#define ASYNC_REQ_MAX	(ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS)
+	if (create->async_req_max == 0)
+		create->async_req_max = ASYNC_REQ_MAX;
+	umem->async_req_max = create->async_req_max;
+	umem->async_req_nr = 0;
+	umem->async_req = kzalloc(
+		sizeof(*umem->async_req) * umem->async_req_max,
+		GFP_KERNEL);
+
+#define SYNC_REQ_MAX	(KVM_MAX_VCPUS)
+	if (create->sync_req_max == 0)
+		create->sync_req_max = SYNC_REQ_MAX;
+	umem->sync_req_max = round_up(create->sync_req_max, BITS_PER_LONG);
+	sync_bitmap_bytes = sizeof(unsigned long) *
+		(umem->sync_req_max / BITS_PER_LONG);
+	umem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
+	umem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
+	umem->page_wait = kzalloc(sizeof(*umem->page_wait) *
+				  umem->sync_req_max, GFP_KERNEL);
+	for (i = 0; i < umem->sync_req_max; ++i)
+		init_waitqueue_head(&umem->page_wait[i]);
+	umem->sync_req = kzalloc(sizeof(*umem->sync_req) *
+				 umem->sync_req_max, GFP_KERNEL);
+
+	umem->req_list_nr = 0;
+	INIT_LIST_HEAD(&umem->req_list);
+	init_waitqueue_head(&umem->req_list_wait);
+
+	mutex_unlock(&umem_list_mutex);
+	return 0;
+
+ out:
+	umem_free(&umem->kref);
+	return error;
+}
+
+static int umem_list_umem(struct umem_list __user *u_list)
+{
+	struct umem_list k_list;
+	struct umem *entry;
+	struct umem_name __user *u_name = u_list->names;
+	__u32 nr = 0;
+
+	if (copy_from_user(&k_list, u_list, sizeof(k_list))) {
+		return -EFAULT;
+	}
+
+	mutex_lock(&umem_list_mutex);
+	list_for_each_entry(entry, &umem_list, list) {
+		if (nr < k_list.nr) {
+			if (copy_to_user(u_name, &entry->name,
+					 sizeof(entry->name))) {
+				mutex_unlock(&umem_list_mutex);
+				return -EFAULT;
+			}
+			u_name++;
+		}
+		nr++;
+	}
+	mutex_unlock(&umem_list_mutex);
+
+	k_list.nr = nr;
+	if (copy_to_user(u_list, &k_list, sizeof(k_list))) {
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int umem_reattach_umem(struct umem_create *create)
+{
+	struct umem *entry;
+
+	mutex_lock(&umem_list_mutex);
+	list_for_each_entry(entry, &umem_list, list) {
+		if (umem_name_eq(&entry->name, &create->name)) {
+			kref_get(&entry->kref);
+			mutex_unlock(&umem_list_mutex);
+
+			create->shmem_fd = get_unused_fd();
+			if (create->shmem_fd < 0) {
+				umem_put(entry);
+				return create->shmem_fd;
+			}
+			create->umem_fd = anon_inode_getfd(
+				"umem", &umem_fops, entry, O_RDWR);
+			if (create->umem_fd < 0) {
+				put_unused_fd(create->shmem_fd);
+				umem_put(entry);
+				return create->umem_fd;
+			}
+			get_file(entry->shmem_filp);
+			fd_install(create->shmem_fd, entry->shmem_filp);
+
+			create->size = entry->size;
+			create->sync_req_max = entry->sync_req_max;
+			create->async_req_max = entry->async_req_max;
+			return 0;
+		}
+	}
+	mutex_unlock(&umem_list_mutex);
+
+	return -ENOENT;
+}
+
+static long umem_dev_ioctl(struct file *filp, unsigned int ioctl,
+			   unsigned long arg)
+{
+	void __user *argp = (void __user *) arg;
+	long ret;
+	struct umem_create *create = NULL;
+
+
+	switch (ioctl) {
+	case UMEM_DEV_CREATE_UMEM:
+		create = kmalloc(sizeof(*create), GFP_KERNEL);
+		if (copy_from_user(create, argp, sizeof(*create))) {
+			ret = -EFAULT;
+			break;
+		}
+		ret = umem_create_umem(create);
+		if (copy_to_user(argp, create, sizeof(*create))) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	case UMEM_DEV_LIST:
+		ret = umem_list_umem(argp);
+		break;
+	case UMEM_DEV_REATTACH:
+		create = kmalloc(sizeof(*create), GFP_KERNEL);
+		if (copy_from_user(create, argp, sizeof(*create))) {
+			ret = -EFAULT;
+			break;
+		}
+		ret = umem_reattach_umem(create);
+		if (copy_to_user(argp, create, sizeof(*create))) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	kfree(create);
+	return ret;
+}
+
+static int umem_dev_release(struct inode *inode, struct file *filp)
+{
+	return 0;
+}
+
+static struct file_operations umem_dev_fops = {
+	.release = umem_dev_release,
+	.unlocked_ioctl = umem_dev_ioctl,
+};
+
+static struct miscdevice umem_dev = {
+	MISC_DYNAMIC_MINOR,
+	"umem",
+	&umem_dev_fops,
+};
+
+static int __init umem_init(void)
+{
+	int r;
+	r = misc_register(&umem_dev);
+	if (r) {
+		printk(KERN_ERR "umem: misc device register failed\n");
+		return r;
+	}
+	return 0;
+}
+module_init(umem_init);
+
+static void __exit umem_exit(void)
+{
+	misc_deregister(&umem_dev);
+}
+module_exit(umem_exit);
+
+MODULE_DESCRIPTION("UMEM user process backed memory driver "
+		   "for distributed shared memory");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Isaku Yamahata");
diff --git a/include/linux/umem.h b/include/linux/umem.h
new file mode 100644
index 0000000..e1a8633
--- /dev/null
+++ b/include/linux/umem.h
@@ -0,0 +1,83 @@
+/*
+ * User process backed memory.
+ * This is mainly for KVM post copy.
+ *
+ * Copyright (c) 2011,
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_UMEM_H
+#define __LINUX_UMEM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#ifdef __KERNEL__
+#include <linux/compiler.h>
+#else
+#define __user
+#endif
+
+#define UMEM_ID_MAX	256
+#define UMEM_NAME_MAX	256
+
+struct umem_name {
+	char id[UMEM_ID_MAX];		/* non-zero terminated */
+	char name[UMEM_NAME_MAX];	/* non-zero terminated */
+};
+
+struct umem_list {
+	__u32 nr;
+	__u32 padding;
+	struct umem_name names[0];
+};
+
+struct umem_create {
+	__u64 size;	/* in bytes */
+	__s32 umem_fd;
+	__s32 shmem_fd;
+	__u32 async_req_max;
+	__u32 sync_req_max;
+	struct umem_name name;
+};
+
+struct umem_page_request {
+	__u64 __user *pgoffs;
+	__u32 nr;
+	__u32 padding;
+};
+
+struct umem_page_cached {
+	__u64 __user *pgoffs;
+	__u32 nr;
+	__u32 padding;
+};
+
+#define UMEMIO	0x1E
+
+/* ioctl for umem_dev fd */
+#define UMEM_DEV_CREATE_UMEM	_IOWR(UMEMIO, 0x0, struct umem_create)
+#define UMEM_DEV_LIST		_IOWR(UMEMIO, 0x1, struct umem_list)
+#define UMEM_DEV_REATTACH	_IOWR(UMEMIO, 0x2, struct umem_create)
+
+/* ioctl for umem fd */
+#define UMEM_GET_PAGE_REQUEST	_IOWR(UMEMIO, 0x10, struct umem_page_request)
+#define UMEM_MARK_PAGE_CACHED	_IOW (UMEMIO, 0x11, struct umem_page_cached)
+#define UMEM_MAKE_VMA_ANONYMOUS	_IO  (UMEMIO, 0x12)
+
+#endif /* __LINUX_UMEM_H */
-- 
1.7.1.1


* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 1/2] export necessary symbols Isaku Yamahata
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
@ 2011-12-29  1:31 ` Isaku Yamahata
  2011-12-29 11:24 ` Avi Kivity
  3 siblings, 0 replies; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29  1:31 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: t.hirofuchi, satoshi.itoh

On Thu, Dec 29, 2011 at 10:26:16AM +0900, Isaku Yamahata wrote:

> UMEM_DEV_LIST: list the created umem devices
> UMEM_DEV_REATTACH: re-attach to a created umem device
>                    UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when the
>                    process that services page faults disappears or gets stuck.
>                    The administrator can then list the umem devices and unblock
>                    the process that is waiting for a page.

Here is a simple utility which cleans up umem devices.

---------------------------------------------------------------------------

/*
 * simple cleanup utility for umem devices
 *
 * Copyright (c) 2011,
 * National Institute of Advanced Industrial Science and Technology
 *
 * https://sites.google.com/site/grivonhome/quick-kvm-migration
 * Author: Isaku Yamahata <yamahata at valinux co jp>
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
 * more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, see <http://www.gnu.org/licenses/>.
 */

#include <err.h>
#include <errno.h>
#include <inttypes.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <fcntl.h>

#include <linux/umem.h>

void mark_all_pages_cached(int umem_dev_fd, const char *id, const char *name)
{
	struct umem_create create;
	memset(&create, 0, sizeof(create));
	strncpy(create.name.id, id, sizeof(create.name.id));
	strncpy(create.name.name, name, sizeof(create.name.name));

	if (ioctl(umem_dev_fd, UMEM_DEV_REATTACH, &create) < 0) {
		err(EXIT_FAILURE, "UMEM_DEV_REATTACH");
	}

	close(create.shmem_fd);
	long page_size = sysconf(_SC_PAGESIZE);
	int page_shift = ffs(page_size) - 1;
	int umem_fd = create.umem_fd;
	printf("umem_fd %d size %"PRId64"\n", umem_fd, (uint64_t)create.size);

	__u64 i;
	__u64 e_pgoff = (create.size + page_size - 1) >> page_shift;
#define UMEM_CACHED_MAX	512
	__u64 pgoffs[UMEM_CACHED_MAX];
	struct umem_page_cached page_cached = {
		.nr = 0,
		.pgoffs = pgoffs,
	};

	for (i = 0; i < e_pgoff; i++) {
		page_cached.pgoffs[page_cached.nr] = i;
		page_cached.nr++;
		if (page_cached.nr == UMEM_CACHED_MAX) {
			if (ioctl(umem_fd, UMEM_MARK_PAGE_CACHED,
				  &page_cached) < 0) {
				err(EXIT_FAILURE, "UMEM_MARK_PAGE_CACHED");
			}
			page_cached.nr = 0;
		}
	}
	if (page_cached.nr > 0) {
		if (ioctl(umem_fd, UMEM_MARK_PAGE_CACHED, &page_cached) < 0) {
			err(EXIT_FAILURE, "UMEM_MARK_PAGE_CACHED");
		}
	}
	close(umem_fd);
}

#define DEV_UMEM	"/dev/umem"

int main(int argc, char **argv)
{
	const char *id = NULL;
	const char *name = NULL;
	if (argc >= 2) {
		id = argv[1];
	}
	if (argc >= 3) {
		name = argv[2];
	}

	int umem_dev_fd = open(DEV_UMEM, O_RDWR);
	if (umem_dev_fd < 0) {
		perror("can't open "DEV_UMEM);
		exit(EXIT_FAILURE);
	}

	struct umem_list tmp_ulist = {
		.nr = 0,
	};
	if (ioctl(umem_dev_fd, UMEM_DEV_LIST, &tmp_ulist) < 0) {
		err(EXIT_FAILURE, "UMEM_DEV_LIST");
	}
	if (tmp_ulist.nr == 0) {
		printf("no umem files\n");
		exit(EXIT_SUCCESS);
	}
	struct umem_list *ulist = malloc(
		sizeof(*ulist) + sizeof(ulist->names[0]) * tmp_ulist.nr);
	ulist->nr = tmp_ulist.nr;
	if (ioctl(umem_dev_fd, UMEM_DEV_LIST, ulist) < 0) {
		err(EXIT_FAILURE, "UMEM_DEV_LIST");
	}

	uint32_t i;
	for (i = 0; i < ulist->nr; ++i) {
		char *u_id = ulist->names[i].id;
		char *u_name = ulist->names[i].name;

		char tmp_id_c = u_id[UMEM_ID_MAX - 1];
		char tmp_name_c = u_name[UMEM_NAME_MAX - 1];
		u_id[UMEM_ID_MAX - 1] = '\0';
		u_name[UMEM_NAME_MAX - 1] = '\0';
		printf("%d: id: %s name: %s\n", i, u_id, u_name);

		if ((id != NULL || name != NULL) &&
		    (id == NULL || strncmp(id, u_id, UMEM_ID_MAX) == 0) &&
		    (name == NULL ||
		     strncmp(name, u_name, UMEM_NAME_MAX) == 0)) {
			printf("marking cached: %d: id: %s name: %s\n",
			       i, u_id, u_name);
			u_id[UMEM_ID_MAX - 1] = tmp_id_c;
			u_name[UMEM_NAME_MAX - 1] = tmp_name_c;
			mark_all_pages_cached(umem_dev_fd, u_id, u_name);
		}
	}

	close(umem_dev_fd);
	return 0;
}
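
Usage note: run the tool with no arguments to just list the registered umem
devices; pass an id (argv[1]) and optionally a name (argv[2]) to also mark
every page of the matching devices cached, so a process stuck waiting for a
page can make progress. (The binary name is whatever you build it as, e.g.
umem-cleanup <id> [<name>].)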


* Re: [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
@ 2011-12-29 11:17   ` Avi Kivity
  2011-12-29 12:22     ` Isaku Yamahata
  2012-01-05  4:08   ` [Qemu-devel] Re: " thfbjyddx
  1 sibling, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 11:17 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This is a character device that hooks page accesses.
> A page fault in the mapped area is reported to another user process by
> this char driver; that process then fills in the page contents and
> resolves the page fault.
>
>  
> +config UMEM
> +        tristate "/dev/umem user process backed memory support"

tab

> +	default n
> +	help
> +	  User process backed memory driver provides /dev/umem device.
> +	  The /dev/umem device is designed for some sort of distributed
> +	  shared memory. Especially post-copy live migration with KVM.
> +	  When in doubt, say "N".
> +

Need documentation of the protocol between the kernel and userspace; not
just the ioctls, but also how faults are propagated.

> +
> +struct umem_page_req_list {
> +	struct list_head list;
> +	pgoff_t pgoff;
> +};
> +
>
> +
> +
> +static int umem_mark_page_cached(struct umem *umem,
> +				 struct umem_page_cached *page_cached)
> +{
> +	int ret = 0;
> +#define PG_MAX	((__u32)32)
> +	__u64 pgoffs[PG_MAX];
> +	__u32 nr;
> +	unsigned long bit;
> +	bool wake_up_list = false;
> +
> +	nr = 0;
> +	while (nr < page_cached->nr) {
> +		__u32 todo = min(PG_MAX, (page_cached->nr - nr));
> +		int i;
> +
> +		if (copy_from_user(pgoffs, page_cached->pgoffs + nr,
> +				   sizeof(*pgoffs) * todo)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +		for (i = 0; i < todo; ++i) {
> +			if (pgoffs[i] >= umem->pgoff_end) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +			set_bit(pgoffs[i], umem->cached);
> +		}
> +		nr += todo;
> +	}
> +

Probably need an smp_wmb() here.

> +	spin_lock(&umem->lock);
> +	bit = 0;
> +	for (;;) {
> +		bit = find_next_bit(umem->sync_wait_bitmap, umem->sync_req_max,
> +				    bit);
> +		if (bit >= umem->sync_req_max)
> +			break;
> +		if (test_bit(umem->sync_req[bit], umem->cached))
> +			wake_up(&umem->page_wait[bit]);

Why not do this test in the loop above?

> +		bit++;
> +	}
> +
> +	if (umem->req_list_nr > 0)
> +		wake_up_list = true;
> +	spin_unlock(&umem->lock);
> +
> +	if (wake_up_list)
> +		wake_up_all(&umem->req_list_wait);
> +
> +out:
> +	return ret;
> +}
> +
> +
> +
> +static void umem_put(struct umem *umem)
> +{
> +	int ret;
> +
> +	mutex_lock(&umem_list_mutex);
> +	ret = kref_put(&umem->kref, umem_free);
> +	if (ret == 0) {
> +		mutex_unlock(&umem_list_mutex);
> +	}

This looks wrong.

> +}
> +
> +
> +static int umem_create_umem(struct umem_create *create)
> +{
> +	int error = 0;
> +	struct umem *umem = NULL;
> +	struct vm_area_struct *vma;
> +	int shmem_fd;
> +	unsigned long bitmap_bytes;
> +	unsigned long sync_bitmap_bytes;
> +	int i;
> +
> +	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
> +	umem->name = create->name;
> +	kref_init(&umem->kref);
> +	INIT_LIST_HEAD(&umem->list);
> +
> +	mutex_lock(&umem_list_mutex);
> +	error = umem_add_list(umem);
> +	if (error) {
> +		goto out;
> +	}
> +
> +	umem->task = NULL;
> +	umem->mmapped = false;
> +	spin_lock_init(&umem->lock);
> +	umem->size = roundup(create->size, PAGE_SIZE);
> +	umem->pgoff_end = umem->size >> PAGE_SHIFT;
> +	init_waitqueue_head(&umem->req_wait);
> +
> +	vma = &umem->vma;
> +	vma->vm_start = 0;
> +	vma->vm_end = umem->size;
> +	/* this shmem file is used for temporal buffer for pages
> +	   so it's unlikely that so many pages exists in this shmem file */
> +	vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY |
> +		VM_DONTEXPAND;
> +	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> +	vma->vm_pgoff = 0;
> +	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +
> +	shmem_fd = get_unused_fd();
> +	if (shmem_fd < 0) {
> +		error = shmem_fd;
> +		goto out;
> +	}
> +	error = shmem_zero_setup(vma);
> +	if (error < 0) {
> +		put_unused_fd(shmem_fd);
> +		goto out;
> +	}
> +	umem->shmem_filp = vma->vm_file;
> +	get_file(umem->shmem_filp);
> +	fd_install(shmem_fd, vma->vm_file);
> +	create->shmem_fd = shmem_fd;
> +
> +	create->umem_fd = anon_inode_getfd("umem",
> +					   &umem_fops, umem, O_RDWR);
> +	if (create->umem_fd < 0) {
> +		error = create->umem_fd;
> +		goto out;
> +	}
> +
> +	bitmap_bytes = umem_bitmap_bytes(umem);
> +	if (bitmap_bytes > PAGE_SIZE) {
> +		umem->cached = vzalloc(bitmap_bytes);
> +		umem->faulted = vzalloc(bitmap_bytes);
> +	} else {
> +		umem->cached = kzalloc(bitmap_bytes, GFP_KERNEL);
> +		umem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL);
> +	}
> +
> +	/* those constants are not exported.
> +	   They are just used for default value */
> +#define KVM_MAX_VCPUS	256
> +#define ASYNC_PF_PER_VCPU 64

Best to avoid defaults and require userspace choose.

> +
> +#define ASYNC_REQ_MAX	(ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS)
> +	if (create->async_req_max == 0)
> +		create->async_req_max = ASYNC_REQ_MAX;
> +	umem->async_req_max = create->async_req_max;
> +	umem->async_req_nr = 0;
> +	umem->async_req = kzalloc(
> +		sizeof(*umem->async_req) * umem->async_req_max,
> +		GFP_KERNEL);
> +
> +#define SYNC_REQ_MAX	(KVM_MAX_VCPUS)
> +	if (create->sync_req_max == 0)
> +		create->sync_req_max = SYNC_REQ_MAX;
> +	umem->sync_req_max = round_up(create->sync_req_max, BITS_PER_LONG);
> +	sync_bitmap_bytes = sizeof(unsigned long) *
> +		(umem->sync_req_max / BITS_PER_LONG);
> +	umem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
> +	umem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
> +	umem->page_wait = kzalloc(sizeof(*umem->page_wait) *
> +				  umem->sync_req_max, GFP_KERNEL);
> +	for (i = 0; i < umem->sync_req_max; ++i)
> +		init_waitqueue_head(&umem->page_wait[i]);
> +	umem->sync_req = kzalloc(sizeof(*umem->sync_req) *
> +				 umem->sync_req_max, GFP_KERNEL);
> +
> +	umem->req_list_nr = 0;
> +	INIT_LIST_HEAD(&umem->req_list);
> +	init_waitqueue_head(&umem->req_list_wait);
> +
> +	mutex_unlock(&umem_list_mutex);
> +	return 0;
> +
> + out:
> +	umem_free(&umem->kref);
> +	return error;
> +}
> +
> +
> +static int umem_reattach_umem(struct umem_create *create)
> +{
> +	struct umem *entry;
> +
> +	mutex_lock(&umem_list_mutex);
> +	list_for_each_entry(entry, &umem_list, list) {
> +		if (umem_name_eq(&entry->name, &create->name)) {
> +			kref_get(&entry->kref);
> +			mutex_unlock(&umem_list_mutex);
> +
> +			create->shmem_fd = get_unused_fd();
> +			if (create->shmem_fd < 0) {
> +				umem_put(entry);
> +				return create->shmem_fd;
> +			}
> +			create->umem_fd = anon_inode_getfd(
> +				"umem", &umem_fops, entry, O_RDWR);
> +			if (create->umem_fd < 0) {
> +				put_unused_fd(create->shmem_fd);
> +				umem_put(entry);
> +				return create->umem_fd;
> +			}
> +			get_file(entry->shmem_filp);
> +			fd_install(create->shmem_fd, entry->shmem_filp);
> +
> +			create->size = entry->size;
> +			create->sync_req_max = entry->sync_req_max;
> +			create->async_req_max = entry->async_req_max;
> +			return 0;
> +		}
> +	}
> +	mutex_unlock(&umem_list_mutex);
> +
> +	return -ENOENT;
> +}

Can you explain how reattach is used?

> +
> +static long umem_dev_ioctl(struct file *filp, unsigned int ioctl,
> +			   unsigned long arg)
> +{
> +	void __user *argp = (void __user *) arg;
> +	long ret;
> +	struct umem_create *create = NULL;
> +
> +
> +	switch (ioctl) {
> +	case UMEM_DEV_CREATE_UMEM:
> +		create = kmalloc(sizeof(*create), GFP_KERNEL);
> +		if (copy_from_user(create, argp, sizeof(*create))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		ret = umem_create_umem(create);
> +		if (copy_to_user(argp, create, sizeof(*create))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		break;

A simpler approach is that open("/dev/umem") returns an mmap()able fd.
You need to call an ioctl() to set the size, etc., but you only
operate on that fd.

> +	case UMEM_DEV_LIST:
> +		ret = umem_list_umem(argp);
> +		break;
> +	case UMEM_DEV_REATTACH:
> +		create = kmalloc(sizeof(*create), GFP_KERNEL);
> +		if (copy_from_user(create, argp, sizeof(*create))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		ret = umem_reattach_umem(create);
> +		if (copy_to_user(argp, create, sizeof(*create))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	kfree(create);
> +	return ret;
> +}
> +
> +
> +#ifdef __KERNEL__
> +#include <linux/compiler.h>
> +#else
> +#define __user
> +#endif

I think a #include <linux/compiler.h> is sufficient, the export process
(see include/linux/Kbuild, add an entry there) takes care of __user.

> +
> +#define UMEM_ID_MAX	256
> +#define UMEM_NAME_MAX	256
> +
> +struct umem_name {
> +	char id[UMEM_ID_MAX];		/* non-zero terminated */
> +	char name[UMEM_NAME_MAX];	/* non-zero terminated */
> +};

IMO, it would be better to avoid names, and use opaque __u64 identifiers
assigned by userspace, or perhaps file descriptors generated by the
kernel.  With names come the complications of namespaces, etc.  One user
can DoS another by grabbing a name that it knows the other user wants to
use.

> +
> +struct umem_create {
> +	__u64 size;	/* in bytes */
> +	__s32 umem_fd;
> +	__s32 shmem_fd;
> +	__u32 async_req_max;
> +	__u32 sync_req_max;
> +	struct umem_name name;
> +};
> +
> +struct umem_page_request {
> +	__u64 __user *pgoffs;

Pointers change their size in 32-bit and 64-bit userspace, best to avoid
them.

> +	__u32 nr;
> +	__u32 padding;
> +};
> +
> +struct umem_page_cached {
> +	__u64 __user *pgoffs;
> +	__u32 nr;
> +	__u32 padding;
> +};
> +
> +#define UMEMIO	0x1E
> +
> +/* ioctl for umem_dev fd */
> +#define UMEM_DEV_CREATE_UMEM	_IOWR(UMEMIO, 0x0, struct umem_create)
> +#define UMEM_DEV_LIST		_IOWR(UMEMIO, 0x1, struct umem_list)

Why is _LIST needed?

> +#define UMEM_DEV_REATTACH	_IOWR(UMEMIO, 0x2, struct umem_create)
> +
> +/* ioctl for umem fd */
> +#define UMEM_GET_PAGE_REQUEST	_IOWR(UMEMIO, 0x10, struct umem_page_request)
> +#define UMEM_MARK_PAGE_CACHED	_IOW (UMEMIO, 0x11, struct umem_page_cached)

You could make the GET_PAGE_REQUEST / MARK_PAGE_CACHED protocol run over
file descriptors, instead of an ioctl.  It allows you to implement the
other side in either the kernel or userspace.  This is similar to how
kvm uses an eventfd for communication with vhost-net in the kernel, or
an implementation in userspace.

> +#define UMEM_MAKE_VMA_ANONYMOUS	_IO  (UMEMIO, 0x12)
> +
> +#endif /* __LINUX_UMEM_H */


-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
                   ` (2 preceding siblings ...)
  2011-12-29  1:31 ` [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
@ 2011-12-29 11:24 ` Avi Kivity
  2011-12-29 12:39   ` Isaku Yamahata
  3 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 11:24 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> This is the Linux kernel driver for qemu/kvm postcopy live migration.
> It is used by the qemu/kvm postcopy live migration patches.
>
> TODO:
> - Consider FUSE/CUSE option
>   Several mmap patches for FUSE/CUSE are floating around (though their
>   purpose is not so different from ours). They have not been merged
>   upstream yet.
>   The driver-specific part of the qemu patches is modularized, so I expect
>   it would not be difficult to switch the kernel driver to a CUSE-based one.

It would be good to get more input about this; please involve lkml and
the FUSE/CUSE people.

> ioctl commands:
>
> UMEM_DEV_CREATE_UMEM: create a umem device for qemu
> UMEM_DEV_LIST: list the created umem devices
> UMEM_DEV_REATTACH: re-attach to a created umem device
>                    UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when the
>                    process that services page faults disappears or gets stuck.
>                    The administrator can then list the umem devices and unblock
>                    the process that is waiting for a page.

Ah, I asked about this in my patch comments.  I think this is done
better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
new process.

Introducing a global namespace has a lot of complications attached.
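
(For reference, handing an already-created umem fd to another process over a
UNIX domain socket with SCM_RIGHTS would look roughly like the sketch below;
this is only an illustration of the alternative, not code from the patch.)

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd to the peer of the connected UNIX socket sock. */
static int send_fd(int sock, int fd)
{
	char data = 0;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}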

>
> UMEM_GET_PAGE_REQUEST: retrieve page fault requests from the qemu process
> UMEM_MARK_PAGE_CACHED: mark the specified pages as pulled from the source
>                        (issued by the daemon)
>
> UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process anonymous.
>                          This is _NOT_ implemented yet; I'm not sure whether
>                          it can be implemented or not.

How do we find out?  This is fairly important; things like transparent
hugepages and ksm only work on anonymous memory.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy
  2011-12-29 11:17   ` Avi Kivity
@ 2011-12-29 12:22     ` Isaku Yamahata
  2011-12-29 12:47       ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 12:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

Thank you for the review.

On Thu, Dec 29, 2011 at 01:17:51PM +0200, Avi Kivity wrote:
> > +	default n
> > +	help
> > +	  User process backed memory driver provides /dev/umem device.
> > +	  The /dev/umem device is designed for some sort of distributed
> > +	  shared memory. Especially post-copy live migration with KVM.
> > +	  When in doubt, say "N".
> > +
> 
> Need documentation of the protocol between the kernel and userspace; not
> just the ioctls, but also how faults are propagated.

Will do.

> 
> > +
> > +struct umem_page_req_list {
> > +	struct list_head list;
> > +	pgoff_t pgoff;
> > +};
> > +
> >
> > +
> > +
> > +static int umem_mark_page_cached(struct umem *umem,
> > +				 struct umem_page_cached *page_cached)
> > +{
> > +	int ret = 0;
> > +#define PG_MAX	((__u32)32)
> > +	__u64 pgoffs[PG_MAX];
> > +	__u32 nr;
> > +	unsigned long bit;
> > +	bool wake_up_list = false;
> > +
> > +	nr = 0;
> > +	while (nr < page_cached->nr) {
> > +		__u32 todo = min(PG_MAX, (page_cached->nr - nr));
> > +		int i;
> > +
> > +		if (copy_from_user(pgoffs, page_cached->pgoffs + nr,
> > +				   sizeof(*pgoffs) * todo)) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		for (i = 0; i < todo; ++i) {
> > +			if (pgoffs[i] >= umem->pgoff_end) {
> > +				ret = -EINVAL;
> > +				goto out;
> > +			}
> > +			set_bit(pgoffs[i], umem->cached);
> > +		}
> > +		nr += todo;
> > +	}
> > +
> 
> Probably need an smp_wmb() here.
> 
> > +	spin_lock(&umem->lock);
> > +	bit = 0;
> > +	for (;;) {
> > +		bit = find_next_bit(umem->sync_wait_bitmap, umem->sync_req_max,
> > +				    bit);
> > +		if (bit >= umem->sync_req_max)
> > +			break;
> > +		if (test_bit(umem->sync_req[bit], umem->cached))
> > +			wake_up(&umem->page_wait[bit]);
> 
> Why not do this test in the loop above?
> 
> > +		bit++;
> > +	}
> > +
> > +	if (umem->req_list_nr > 0)
> > +		wake_up_list = true;
> > +	spin_unlock(&umem->lock);
> > +
> > +	if (wake_up_list)
> > +		wake_up_all(&umem->req_list_wait);
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +
> > +
> > +static void umem_put(struct umem *umem)
> > +{
> > +	int ret;
> > +
> > +	mutex_lock(&umem_list_mutex);
> > +	ret = kref_put(&umem->kref, umem_free);
> > +	if (ret == 0) {
> > +		mutex_unlock(&umem_list_mutex);
> > +	}
> 
> This looks wrong.
> 
> > +}
> > +
> > +
> > +static int umem_create_umem(struct umem_create *create)
> > +{
> > +	int error = 0;
> > +	struct umem *umem = NULL;
> > +	struct vm_area_struct *vma;
> > +	int shmem_fd;
> > +	unsigned long bitmap_bytes;
> > +	unsigned long sync_bitmap_bytes;
> > +	int i;
> > +
> > +	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
> > +	umem->name = create->name;
> > +	kref_init(&umem->kref);
> > +	INIT_LIST_HEAD(&umem->list);
> > +
> > +	mutex_lock(&umem_list_mutex);
> > +	error = umem_add_list(umem);
> > +	if (error) {
> > +		goto out;
> > +	}
> > +
> > +	umem->task = NULL;
> > +	umem->mmapped = false;
> > +	spin_lock_init(&umem->lock);
> > +	umem->size = roundup(create->size, PAGE_SIZE);
> > +	umem->pgoff_end = umem->size >> PAGE_SHIFT;
> > +	init_waitqueue_head(&umem->req_wait);
> > +
> > +	vma = &umem->vma;
> > +	vma->vm_start = 0;
> > +	vma->vm_end = umem->size;
> > +	/* this shmem file is used for temporal buffer for pages
> > +	   so it's unlikely that so many pages exists in this shmem file */
> > +	vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY |
> > +		VM_DONTEXPAND;
> > +	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> > +	vma->vm_pgoff = 0;
> > +	INIT_LIST_HEAD(&vma->anon_vma_chain);
> > +
> > +	shmem_fd = get_unused_fd();
> > +	if (shmem_fd < 0) {
> > +		error = shmem_fd;
> > +		goto out;
> > +	}
> > +	error = shmem_zero_setup(vma);
> > +	if (error < 0) {
> > +		put_unused_fd(shmem_fd);
> > +		goto out;
> > +	}
> > +	umem->shmem_filp = vma->vm_file;
> > +	get_file(umem->shmem_filp);
> > +	fd_install(shmem_fd, vma->vm_file);
> > +	create->shmem_fd = shmem_fd;
> > +
> > +	create->umem_fd = anon_inode_getfd("umem",
> > +					   &umem_fops, umem, O_RDWR);
> > +	if (create->umem_fd < 0) {
> > +		error = create->umem_fd;
> > +		goto out;
> > +	}
> > +
> > +	bitmap_bytes = umem_bitmap_bytes(umem);
> > +	if (bitmap_bytes > PAGE_SIZE) {
> > +		umem->cached = vzalloc(bitmap_bytes);
> > +		umem->faulted = vzalloc(bitmap_bytes);
> > +	} else {
> > +		umem->cached = kzalloc(bitmap_bytes, GFP_KERNEL);
> > +		umem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL);
> > +	}
> > +
> > +	/* those constants are not exported.
> > +	   They are just used for default value */
> > +#define KVM_MAX_VCPUS	256
> > +#define ASYNC_PF_PER_VCPU 64
> 
> Best to avoid defaults and require userspace choose.

Okay.


> > +
> > +#define ASYNC_REQ_MAX	(ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS)
> > +	if (create->async_req_max == 0)
> > +		create->async_req_max = ASYNC_REQ_MAX;
> > +	umem->async_req_max = create->async_req_max;
> > +	umem->async_req_nr = 0;
> > +	umem->async_req = kzalloc(
> > +		sizeof(*umem->async_req) * umem->async_req_max,
> > +		GFP_KERNEL);
> > +
> > +#define SYNC_REQ_MAX	(KVM_MAX_VCPUS)
> > +	if (create->sync_req_max == 0)
> > +		create->sync_req_max = SYNC_REQ_MAX;
> > +	umem->sync_req_max = round_up(create->sync_req_max, BITS_PER_LONG);
> > +	sync_bitmap_bytes = sizeof(unsigned long) *
> > +		(umem->sync_req_max / BITS_PER_LONG);
> > +	umem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
> > +	umem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
> > +	umem->page_wait = kzalloc(sizeof(*umem->page_wait) *
> > +				  umem->sync_req_max, GFP_KERNEL);
> > +	for (i = 0; i < umem->sync_req_max; ++i)
> > +		init_waitqueue_head(&umem->page_wait[i]);
> > +	umem->sync_req = kzalloc(sizeof(*umem->sync_req) *
> > +				 umem->sync_req_max, GFP_KERNEL);
> > +
> > +	umem->req_list_nr = 0;
> > +	INIT_LIST_HEAD(&umem->req_list);
> > +	init_waitqueue_head(&umem->req_list_wait);
> > +
> > +	mutex_unlock(&umem_list_mutex);
> > +	return 0;
> > +
> > + out:
> > +	umem_free(&umem->kref);
> > +	return error;
> > +}
> > +
> > +
> > +static int umem_reattach_umem(struct umem_create *create)
> > +{
> > +	struct umem *entry;
> > +
> > +	mutex_lock(&umem_list_mutex);
> > +	list_for_each_entry(entry, &umem_list, list) {
> > +		if (umem_name_eq(&entry->name, &create->name)) {
> > +			kref_get(&entry->kref);
> > +			mutex_unlock(&umem_list_mutex);
> > +
> > +			create->shmem_fd = get_unused_fd();
> > +			if (create->shmem_fd < 0) {
> > +				umem_put(entry);
> > +				return create->shmem_fd;
> > +			}
> > +			create->umem_fd = anon_inode_getfd(
> > +				"umem", &umem_fops, entry, O_RDWR);
> > +			if (create->umem_fd < 0) {
> > +				put_unused_fd(create->shmem_fd);
> > +				umem_put(entry);
> > +				return create->umem_fd;
> > +			}
> > +			get_file(entry->shmem_filp);
> > +			fd_install(create->shmem_fd, entry->shmem_filp);
> > +
> > +			create->size = entry->size;
> > +			create->sync_req_max = entry->sync_req_max;
> > +			create->async_req_max = entry->async_req_max;
> > +			return 0;
> > +		}
> > +	}
> > +	mutex_unlock(&umem_list_mutex);
> > +
> > +	return -ENOENT;
> > +}
> 
> Can you explain how reattach is used?
> 
> > +
> > +static long umem_dev_ioctl(struct file *filp, unsigned int ioctl,
> > +			   unsigned long arg)
> > +{
> > +	void __user *argp = (void __user *) arg;
> > +	long ret;
> > +	struct umem_create *create = NULL;
> > +
> > +
> > +	switch (ioctl) {
> > +	case UMEM_DEV_CREATE_UMEM:
> > +		create = kmalloc(sizeof(*create), GFP_KERNEL);
> > +		if (copy_from_user(create, argp, sizeof(*create))) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +		ret = umem_create_umem(create);
> > +		if (copy_to_user(argp, create, sizeof(*create))) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +		break;
> 
> A simpler approach is that open("/dev/umem") returns an mmap()able fd.
> You need to call an ioctl() to set the size, etc., but you only
> operate on that fd.

So you are suggesting introducing /dev/umem and /dev/umemctl and splitting
the functionality between them.


> > +	case UMEM_DEV_LIST:
> > +		ret = umem_list_umem(argp);
> > +		break;
> > +	case UMEM_DEV_REATTACH:
> > +		create = kmalloc(sizeof(*create), GFP_KERNEL);
> > +		if (copy_from_user(create, argp, sizeof(*create))) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +		ret = umem_reattach_umem(create);
> > +		if (copy_to_user(argp, create, sizeof(*create))) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +		break;
> > +	default:
> > +		ret = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	kfree(create);
> > +	return ret;
> > +}
> > +
> > +
> > +#ifdef __KERNEL__
> > +#include <linux/compiler.h>
> > +#else
> > +#define __user
> > +#endif
> 
> I think a #include <linux/compiler.h> is sufficient, the export process
> (see include/linux/Kbuild, add an entry there) takes care of __user.
> 
> > +
> > +#define UMEM_ID_MAX	256
> > +#define UMEM_NAME_MAX	256
> > +
> > +struct umem_name {
> > +	char id[UMEM_ID_MAX];		/* non-zero terminated */
> > +	char name[UMEM_NAME_MAX];	/* non-zero terminated */
> > +};
> 
> IMO, it would be better to avoid names, and use opaque __u64 identifiers
> assigned by userspace, or perhaps file descriptors generated by the
> kernel.  With names come the complications of namespaces, etc.  One user
> can DoS another by grabbing a name that it knows the other user wants to
> use.

So how about the kernel assigning identifiers that are system-global?


> > +
> > +struct umem_create {
> > +	__u64 size;	/* in bytes */
> > +	__s32 umem_fd;
> > +	__s32 shmem_fd;
> > +	__u32 async_req_max;
> > +	__u32 sync_req_max;
> > +	struct umem_name name;
> > +};
> > +
> > +struct umem_page_request {
> > +	__u64 __user *pgoffs;
> 
> Pointers change their size in 32-bit and 64-bit userspace, best to avoid
> them.

Ah yes, right. How about the following?
struct {
       __u32 nr;
       __u32 padding;
       __u64 pgoffs[0];
}

> > +	__u32 nr;
> > +	__u32 padding;
> > +};
> > +
> > +struct umem_page_cached {
> > +	__u64 __user *pgoffs;
> > +	__u32 nr;
> > +	__u32 padding;
> > +};
> > +
> > +#define UMEMIO	0x1E
> > +
> > +/* ioctl for umem_dev fd */
> > +#define UMEM_DEV_CREATE_UMEM	_IOWR(UMEMIO, 0x0, struct umem_create)
> > +#define UMEM_DEV_LIST		_IOWR(UMEMIO, 0x1, struct umem_list)
> 
> Why is _LIST needed?
> 
> > +#define UMEM_DEV_REATTACH	_IOWR(UMEMIO, 0x2, struct umem_create)
> > +
> > +/* ioctl for umem fd */
> > +#define UMEM_GET_PAGE_REQUEST	_IOWR(UMEMIO, 0x10, struct umem_page_request)
> > +#define UMEM_MARK_PAGE_CACHED	_IOW (UMEMIO, 0x11, struct umem_page_cached)
> 
> You could make the GET_PAGE_REQUEST / MARK_PAGE_CACHED protocol run over
> file descriptors, instead of an ioctl.  It allows you to implement the
> other side in either the kernel or userspace.  This is similar to how
> kvm uses an eventfd for communication with vhost-net in the kernel, or
> an implementation in userspace.

Do you mean that read/write on file descriptors is better than ioctl?
Okay, it would be easy to convert ioctl into read/write.
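
For reference, the daemon side of the current protocol looks roughly like
the sketch below.  fetch_page_from_source() is only a placeholder for the
network transfer, and I'm assuming the daemon stages each page by writing
it into the shmem_fd returned by UMEM_DEV_CREATE_UMEM before marking it
cached; that's how I read the driver, please correct me if not.

#include <linux/types.h>
#include <linux/umem.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdlib.h>

/* placeholder: pull one page from the migration source into buf */
void fetch_page_from_source(__u64 pgoff, void *buf, long page_size);

void serve_requests(int umem_fd, int shmem_fd)
{
        long page_size = sysconf(_SC_PAGESIZE);
        void *buf = malloc(page_size);
        __u64 pgoffs[32];
        __u32 i;

        for (;;) {
                struct umem_page_request req = {
                        .pgoffs = pgoffs, .nr = 32,
                };
                struct umem_page_cached cached = {
                        .pgoffs = pgoffs,
                };

                /* blocks until qemu faults on at least one page */
                if (ioctl(umem_fd, UMEM_GET_PAGE_REQUEST, &req) < 0)
                        break;

                for (i = 0; i < req.nr; i++) {
                        /* stage the page in the shmem object the driver
                         * copies from when it resolves the fault */
                        fetch_page_from_source(pgoffs[i], buf, page_size);
                        pwrite(shmem_fd, buf, page_size,
                               (off_t)pgoffs[i] * page_size);
                }

                /* wake the threads blocked in umem_fault() */
                cached.nr = req.nr;
                ioctl(umem_fd, UMEM_MARK_PAGE_CACHED, &cached);
        }
        free(buf);
}

Converting to read()/write() would presumably just replace the two ioctls
with a read of the pgoffs and a write of the completed offsets.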


> > +#define UMEM_MAKE_VMA_ANONYMOUS	_IO  (UMEMIO, 0x12)
> > +
> > +#endif /* __LINUX_UMEM_H */
> 
> 
> -- 
> error compiling committee.c: too many arguments to function
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 11:24 ` Avi Kivity
@ 2011-12-29 12:39   ` Isaku Yamahata
  2011-12-29 12:55     ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 12:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 01:24:32PM +0200, Avi Kivity wrote:
> On 12/29/2011 03:26 AM, Isaku Yamahata wrote:
> > This is Linux kernel driver for qemu/kvm postcopy live migration.
> > This is used by qemu/kvm postcopy live migration patch.
> >
> > TODO:
> > - Consider FUSE/CUSE option
> >   So far several mmap patches for FUSE/CUSE are floating around. (their
> >   purpose isn't different from our purpose, though). They haven't merged
> >   into the upstream yet.
> >   The driver specific part in qemu patches is modularized. So I expect it
> >   wouldn't be difficult to switch kernel driver to CUSE based driver.
> 
> It would be good to get more input about this, please involve lkml and
> the FUSE/CUSE people.

Okay.


> > ioctl commands:
> >
> > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > UMEM_DEV_LIST: list created umem devices
> > UMEM_DEV_REATTACH: re-attach the created umem device
> > 		  UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> > 		  the process that services page fault disappears or get stack.
> > 		  Then, administrator can list the umem devices and unblock
> > 		  the process which is waiting for page.
> 
> Ah, I asked about this in my patch comments.  I think this is done
> better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> new process.

Can you please elaborate? I don't think the approaches you are suggesting
solve the issue. Let me clarify the problem.

  process A (typically incoming qemu)
     |
     | mmap("/dev/umem") and access those pages triggering page faults
     | (the file descriptor might be closed after mmap() before page faults)
     |
     V
   /dev/umem
     ^
     |
     |
   daemon X resolving page faults triggered by process A
   (typically this daemon forked from incoming qemu:process A)

If daemon X disappears accidentally, there is no one left to resolve the
page faults of process A. At that point process A is blocked on a page
fault, and there is no file descriptor available that corresponds to the VMA.
So there is no way to kill process A short of a system reboot.


> Introducing a global namespace has a lot of complications attached.
> 
> >
> > UMEM_GET_PAGE_REQUEST: retrieve page fault of qemu process
> > UMEM_MARK_PAGE_CACHED: mark the specified pages pulled from the source
> >                        for daemon
> >
> > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> > 			 This is _NOT_ implemented yet.
> >                          anonymous I'm not sure whether this can be implemented
> >                          or not.
> 
> How do we find out?  This is fairly important, stuff like transparent
> hugepages and ksm only works on anonymous memory.

I agree that this is important.
At KVM Forum 2011, Andrea said THP and KSM work with non-anonymous VMAs.
(Or at least that he would look into it. My memory is vague, though.
 Please correct me if I'm wrong.)
-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy
  2011-12-29 12:22     ` Isaku Yamahata
@ 2011-12-29 12:47       ` Avi Kivity
  0 siblings, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 12:47 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 02:22 PM, Isaku Yamahata wrote:
> > 
> > A simpler approach is the open("/dev/umem") returns an mmap()able fd. 
> > You need to call an ioctl() to set the size, etc. but only you only
> > operate on that fd.
>
> So you are suggesting that /dev/umem and /dev/umemctl should be introduced
> and split the functionality.

No; perhaps I'm missing some functionality, but I'm suggesting

  fd = open("/dev/umem");
  ftruncate(fd, size);
  struct umem_config config =  { ... };
  ioctl(fd, UMEM_CONFIG, &config);
  mmap(..., fd, size);

> > 
> > IMO, it would be better to avoid names, and use opaque __u64 identifiers
> > assigned by userspace, or perhaps file descriptors generated by the
> > kernel.  With names come the complications of namespaces, etc.  One user
> > can DoS another by grabbing a name that it knows the other user wants to
> > use.
>
> So how about the kernel assigning identifiers which is system global?

Depends on what you do with the identifiers.  Something like reattach
needs security, you can't just reattach to any random umem segment.

It's really best to stick with file descriptors, which already have a
security model.

>
>
> > > +
> > > +struct umem_create {
> > > +	__u64 size;	/* in bytes */
> > > +	__s32 umem_fd;
> > > +	__s32 shmem_fd;
> > > +	__u32 async_req_max;
> > > +	__u32 sync_req_max;
> > > +	struct umem_name name;
> > > +};
> > > +
> > > +struct umem_page_request {
> > > +	__u64 __user *pgoffs;
> > 
> > Pointers change their size in 32-bit and 64-bit userspace, best to avoid
> > them.
>
> Ah yes, right. How about following?
> struct {
>        __u32 nr;
>        __u32 padding;
>        __u64 pgoffs[0];
> }

Sure.

If we use a pipe to transport requests, you can just send them as a
sequence of __u64 addresses.
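
The daemon side then degenerates into a plain read loop, roughly (a
sketch only; req_fd is the request fd handed out by the driver and
service_page() stands in for fetching the page and marking it cached):

        __u64 pgoffs[64];
        ssize_t n;

        /* the driver writes offsets in whole __u64 units, so a short
         * read still contains complete entries */
        while ((n = read(req_fd, pgoffs, sizeof(pgoffs))) > 0) {
                int i, nr = n / sizeof(__u64);

                for (i = 0; i < nr; i++)
                        service_page(pgoffs[i]);
        }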

> > > +
> > > +/* ioctl for umem fd */
> > > +#define UMEM_GET_PAGE_REQUEST	_IOWR(UMEMIO, 0x10, struct umem_page_request)
> > > +#define UMEM_MARK_PAGE_CACHED	_IOW (UMEMIO, 0x11, struct umem_page_cached)
> > 
> > You could make the GET_PAGE_REQUEST / MARK_PAGE_CACHED protocol run over
> > file descriptors, instead of an ioctl.  It allows you to implement the
> > other side in either the kernel or userspace.  This is similar to how
> > kvm uses an eventfd for communication with vhost-net in the kernel, or
> > an implementation in userspace.
>
> Do you mean that read/write on file descriptors is better than ioctl?
> Okay, it would be easy to convert ioctl into read/write.

Yes, they already provide synchronization.  And if you want to implement
a umem provider over RDMA in the kernel, then it's easy to add it; it's
not trivial for the kernel to issue ioctls but reads/writes are easy.

It's also easy to pass file descriptors among processes.

How do FUSE/CUSE pass requests?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 12:39   ` Isaku Yamahata
@ 2011-12-29 12:55     ` Avi Kivity
  2011-12-29 13:49       ` Isaku Yamahata
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 12:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 02:39 PM, Isaku Yamahata wrote:
> > > ioctl commands:
> > >
> > > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > > UMEM_DEV_LIST: list created umem devices
> > > UMEM_DEV_REATTACH: re-attach the created umem device
> > > 		  UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> > > 		  the process that services page fault disappears or get stack.
> > > 		  Then, administrator can list the umem devices and unblock
> > > 		  the process which is waiting for page.
> > 
> > Ah, I asked about this in my patch comments.  I think this is done
> > better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> > new process.
>
> Can you please elaborate? I think those ways you are suggesting doesn't solve
> the issue. Let me clarify the problem.
>
>   process A (typically incoming qemu)
>      |
>      | mmap("/dev/umem") and access those pages triggering page faults
>      | (the file descriptor might be closed after mmap() before page faults)
>      |
>      V
>    /dev/umem
>      ^
>      |
>      |
>    daemon X resolving page faults triggered by process A
>    (typically this daemon forked from incoming qemu:process A)
>
> If daemon X disappears accidentally, there is no one that resolves
> page faults of process A. At this moment process A is blocked due to page
> fault. There is no file descriptor available corresponding to the VMA.
> Here there is no way to kill process A, but system reboot.

qemu can have an extra thread that wait4()s the daemon and relaunches
it.  This extra thread would not be blocked by the page fault.  It can
keep the fd so it isn't lost.
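
Something along these lines (run_page_daemon() and the fd bookkeeping
are made up; the point is only that the watcher keeps the fds and never
touches guest memory itself):

#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct kept_fds {
        int umem_fd;
        int shmem_fd;
};

/* hypothetical daemon entry point: serves page requests on the fds */
void run_page_daemon(struct kept_fds *fds);

static void *daemon_watcher(void *arg)
{
        struct kept_fds *fds = arg;

        for (;;) {
                pid_t pid = fork();

                if (pid == 0) {
                        run_page_daemon(fds);   /* child: serve faults */
                        _exit(0);
                }
                /* this thread never faults on guest RAM, so it can
                 * always reap a dead daemon and respawn it with the
                 * fds it kept open */
                waitpid(pid, NULL, 0);
        }
        return NULL;
}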

The unkillability of process A is a security issue; it could be done on
purpose.  Is it possible to change umem to sleep with
TASK_INTERRUPTIBLE, so it can be killed?

> > Introducing a global namespace has a lot of complications attached.
> > 
> > >
> > > UMEM_GET_PAGE_REQUEST: retrieve page fault of qemu process
> > > UMEM_MARK_PAGE_CACHED: mark the specified pages pulled from the source
> > >                        for daemon
> > >
> > > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> > > 			 This is _NOT_ implemented yet.
> > >                          anonymous I'm not sure whether this can be implemented
> > >                          or not.
> > 
> > How do we find out?  This is fairly important, stuff like transparent
> > hugepages and ksm only works on anonymous memory.
>
> I agree that this is important.
> At KVM-forum 2011, Andrea said THP and KSM works with non-anonymous VMA.
> (Or at lease he'll look into those stuff. My memory is vague, though.
>  Please correct me if I'm wrong)

+= Andrea (who can also provide feedback on umem in general)

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 12:55     ` Avi Kivity
@ 2011-12-29 13:49       ` Isaku Yamahata
  2011-12-29 13:52         ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 13:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 02:55:42PM +0200, Avi Kivity wrote:
> On 12/29/2011 02:39 PM, Isaku Yamahata wrote:
> > > > ioctl commands:
> > > >
> > > > UMEM_DEV_CRATE_UMEM: create umem device for qemu
> > > > UMEM_DEV_LIST: list created umem devices
> > > > UMEM_DEV_REATTACH: re-attach the created umem device
> > > > 		  UMEM_DEV_LIST and UMEM_DEV_REATTACH are used when
> > > > 		  the process that services page fault disappears or get stack.
> > > > 		  Then, administrator can list the umem devices and unblock
> > > > 		  the process which is waiting for page.
> > > 
> > > Ah, I asked about this in my patch comments.  I think this is done
> > > better by using SCM_RIGHTS to pass fds along, or asking qemu to launch a
> > > new process.
> >
> > Can you please elaborate? I think those ways you are suggesting doesn't solve
> > the issue. Let me clarify the problem.
> >
> >   process A (typically incoming qemu)
> >      |
> >      | mmap("/dev/umem") and access those pages triggering page faults
> >      | (the file descriptor might be closed after mmap() before page faults)
> >      |
> >      V
> >    /dev/umem
> >      ^
> >      |
> >      |
> >    daemon X resolving page faults triggered by process A
> >    (typically this daemon forked from incoming qemu:process A)
> >
> > If daemon X disappears accidentally, there is no one that resolves
> > page faults of process A. At this moment process A is blocked due to page
> > fault. There is no file descriptor available corresponding to the VMA.
> > Here there is no way to kill process A, but system reboot.
> 
> qemu can have an extra thread that wait4()s the daemon, and relaunch
> it.  This extra thread would not be blocked by the page fault.  It can
> keep the fd so it isn't lost.
> 
> The unkillability of process A is a security issue; it could be done on
> purpose.  Is it possible to change umem to sleep with
> TASK_INTERRUPTIBLE, so it can be killed?

The issue is how to resolve the page fault, not whether the sleep is
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
I can think of several options.
- When daemon X is dead, all page faults are served by zero pages.
- When daemon X is dead, all page faults are resolved as VM_FAULT_SIGBUS
  (a rough sketch of this option is below).
- list/reattach: complications. You don't like it.
- other?
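
A rough sketch of the SIGBUS option, assuming a hypothetical flag that
records whether a daemon is still attached:

        /* at the top of umem_fault(): fail the fault instead of
         * sleeping forever when no daemon is left to service it */
        spin_lock(&umem->lock);
        if (!umem->daemon_attached) {   /* hypothetical flag */
                spin_unlock(&umem->lock);
                return VM_FAULT_SIGBUS;
        }
        spin_unlock(&umem->lock);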


> > > Introducing a global namespace has a lot of complications attached.
> > > 
> > > >
> > > > UMEM_GET_PAGE_REQUEST: retrieve page fault of qemu process
> > > > UMEM_MARK_PAGE_CACHED: mark the specified pages pulled from the source
> > > >                        for daemon
> > > >
> > > > UMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
> > > > 			 This is _NOT_ implemented yet.
> > > >                          anonymous I'm not sure whether this can be implemented
> > > >                          or not.
> > > 
> > > How do we find out?  This is fairly important, stuff like transparent
> > > hugepages and ksm only works on anonymous memory.
> >
> > I agree that this is important.
> > At KVM-forum 2011, Andrea said THP and KSM works with non-anonymous VMA.
> > (Or at lease he'll look into those stuff. My memory is vague, though.
> >  Please correct me if I'm wrong)
> 
> += Andrea (who can also provide feedback on umem in general)
> 
> -- 
> error compiling committee.c: too many arguments to function
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 13:49       ` Isaku Yamahata
@ 2011-12-29 13:52         ` Avi Kivity
  2011-12-29 14:18           ` Isaku Yamahata
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 13:52 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 03:49 PM, Isaku Yamahata wrote:
> > 
> > qemu can have an extra thread that wait4()s the daemon, and relaunch
> > it.  This extra thread would not be blocked by the page fault.  It can
> > keep the fd so it isn't lost.
> > 
> > The unkillability of process A is a security issue; it could be done on
> > purpose.  Is it possible to change umem to sleep with
> > TASK_INTERRUPTIBLE, so it can be killed?
>
> The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE.
> I can think of several options.
> - When daemon X is dead, all page faults are served by zero pages.
> - When daemon X is dead, all page faults are resovled as VM_FAULT_SIGBUS
> - list/reattach: complications. You don't like it
> - other?

Don't resolve the page fault.  It's up to the user/system to make sure
it happens.  qemu can easily do it by watching for the daemon's death
and respawning it.

When the new daemon is started, it can ask the kernel for a list of
pending requests, and service them.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 13:52         ` Avi Kivity
@ 2011-12-29 14:18           ` Isaku Yamahata
  2011-12-29 14:35             ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 14:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 03:52:58PM +0200, Avi Kivity wrote:
> On 12/29/2011 03:49 PM, Isaku Yamahata wrote:
> > > 
> > > qemu can have an extra thread that wait4()s the daemon, and relaunch
> > > it.  This extra thread would not be blocked by the page fault.  It can
> > > keep the fd so it isn't lost.
> > > 
> > > The unkillability of process A is a security issue; it could be done on
> > > purpose.  Is it possible to change umem to sleep with
> > > TASK_INTERRUPTIBLE, so it can be killed?
> >
> > The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> > TASK_UNINTERRUPTIBLE.
> > I can think of several options.
> > - When daemon X is dead, all page faults are served by zero pages.
> > - When daemon X is dead, all page faults are resovled as VM_FAULT_SIGBUS
> > - list/reattach: complications. You don't like it
> > - other?
> 
> Don't resolve the page fault.  It's up to the user/system to make sure
> it happens.  qemu can easily do it by watching for the daemon's death
> and respawning it.
> 
> When the new daemon is started, it can ask the kernel for a list of
> pending requests, and service them.

Great, then we basically agree on list/reattach.
(Maybe the identity scheme needs reconsideration.)
-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 14:18           ` Isaku Yamahata
@ 2011-12-29 14:35             ` Avi Kivity
  2011-12-29 14:49               ` Isaku Yamahata
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 14:35 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 04:18 PM, Isaku Yamahata wrote:
> > >
> > > The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> > > TASK_UNINTERRUPTIBLE.
> > > I can think of several options.
> > > - When daemon X is dead, all page faults are served by zero pages.
> > > - When daemon X is dead, all page faults are resovled as VM_FAULT_SIGBUS
> > > - list/reattach: complications. You don't like it
> > > - other?
> > 
> > Don't resolve the page fault.  It's up to the user/system to make sure
> > it happens.  qemu can easily do it by watching for the daemon's death
> > and respawning it.
> > 
> > When the new daemon is started, it can ask the kernel for a list of
> > pending requests, and service them.
>
> Great, then we agreed with list/reattach basically.
> (Maybe identity scheme needs reconsideration.)

I guess we miscommunicated.  Why is reattach needed?  If you have the
fd, nothing else is needed.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 14:35             ` Avi Kivity
@ 2011-12-29 14:49               ` Isaku Yamahata
  2011-12-29 14:55                 ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 14:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 04:35:36PM +0200, Avi Kivity wrote:
> On 12/29/2011 04:18 PM, Isaku Yamahata wrote:
> > > >
> > > > The issue is how to solve the page fault, not whether TASK_INTERRUPTIBLE or
> > > > TASK_UNINTERRUPTIBLE.
> > > > I can think of several options.
> > > > - When daemon X is dead, all page faults are served by zero pages.
> > > > - When daemon X is dead, all page faults are resovled as VM_FAULT_SIGBUS
> > > > - list/reattach: complications. You don't like it
> > > > - other?
> > > 
> > > Don't resolve the page fault.  It's up to the user/system to make sure
> > > it happens.  qemu can easily do it by watching for the daemon's death
> > > and respawning it.
> > > 
> > > When the new daemon is started, it can ask the kernel for a list of
> > > pending requests, and service them.
> >
> > Great, then we agreed with list/reattach basically.
> > (Maybe identity scheme needs reconsideration.)
> 
> I guess we miscommunicated.  Why is reattach needed?  If you have the
> fd, nothing else is needed.

What if a malicious process closes the fd and then faults on the pages
intentionally? The unkillable-process issue remains.
I think we are talking about not only the qemu case but the general case.
-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 14:49               ` Isaku Yamahata
@ 2011-12-29 14:55                 ` Avi Kivity
  2011-12-29 15:53                   ` Isaku Yamahata
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 14:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > Great, then we agreed with list/reattach basically.
> > > (Maybe identity scheme needs reconsideration.)
> > 
> > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > fd, nothing else is needed.
>
> What if malicious process close the fd and does page fault intentionally?
> Unkillable process issue remains.
> I think we are talking not only qemu case but also general case.

It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
process signals.  This includes SIGKILL.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 14:55                 ` Avi Kivity
@ 2011-12-29 15:53                   ` Isaku Yamahata
  2011-12-29 16:00                     ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2011-12-29 15:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 04:55:11PM +0200, Avi Kivity wrote:
> On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > > Great, then we agreed with list/reattach basically.
> > > > (Maybe identity scheme needs reconsideration.)
> > > 
> > > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > > fd, nothing else is needed.
> >
> > What if malicious process close the fd and does page fault intentionally?
> > Unkillable process issue remains.
> > I think we are talking not only qemu case but also general case.
> 
> It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
> process signals.  This includes SIGKILL.

Hmm, you said that the fault handler doesn't resolve the page fault.

> > Don't resolve the page fault.  It's up to the user/system to make sure
> > it happens.  qemu can easily do it by watching for the daemon's death
> > and respawning it.

To kill the process, the fault handler must return, and returning means
resolving the fault somehow. What do you expect it to return?
VM_FAULT_SIGBUS? A zero page?
-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 15:53                   ` Isaku Yamahata
@ 2011-12-29 16:00                     ` Avi Kivity
  2011-12-29 16:01                       ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 16:00 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 05:53 PM, Isaku Yamahata wrote:
> On Thu, Dec 29, 2011 at 04:55:11PM +0200, Avi Kivity wrote:
> > On 12/29/2011 04:49 PM, Isaku Yamahata wrote:
> > > > > Great, then we agreed with list/reattach basically.
> > > > > (Maybe identity scheme needs reconsideration.)
> > > > 
> > > > I guess we miscommunicated.  Why is reattach needed?  If you have the
> > > > fd, nothing else is needed.
> > >
> > > What if malicious process close the fd and does page fault intentionally?
> > > Unkillable process issue remains.
> > > I think we are talking not only qemu case but also general case.
> > 
> > It's not unkillable.  If you sleep with TASK_INTERRUPTIBLE then you can
> > process signals.  This includes SIGKILL.
>
> Hmm, you said that the fault handler doesn't resolve the page fault.
>
> > > Don't resolve the page fault.  It's up to the user/system to make sure
> > > it happens.  qemu can easily do it by watching for the daemon's death
> > > and respawning it.
>
> To kill the process, the fault handler must return resolving the fault.
> It must return something. What do you expect? VM_FAULT_SIGBUS? zero page?

   if (signal_pending(current))
        return VM_FAULT_RETRY;

for SIGKILL, the process dies immediately.  For other unblocked signals,
the process starts executing the signal handler, which isn't dependent
on the faulting page (of course the signal handler may itself fault).
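
For umem that would mean turning the synchronous wait into something
like the sketch below (ignoring the FAULT_FLAG_ALLOW_RETRY bookkeeping
and the cleanup of the request slot):

        for (;;) {
                prepare_to_wait(&umem->page_wait[bit], &wait,
                                TASK_INTERRUPTIBLE);
                if (test_bit(vmf->pgoff, umem->cached))
                        break;
                if (signal_pending(current)) {
                        /* SIGKILL terminates the task instead of
                         * looping on the fault; other signals run
                         * their handlers and then retry the access */
                        finish_wait(&umem->page_wait[bit], &wait);
                        return VM_FAULT_RETRY;
                }
                schedule();
        }
        finish_wait(&umem->page_wait[bit], &wait);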

The NFS client has exactly the same issue, if you mount it with the intr
option.  In fact you could use the NFS client as a trivial umem/cuse
prototype.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 16:00                     ` Avi Kivity
@ 2011-12-29 16:01                       ` Avi Kivity
  2012-01-02 17:05                         ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-12-29 16:01 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 12/29/2011 06:00 PM, Avi Kivity wrote:
> The NFS client has exactly the same issue, if you mount it with the intr
> option.  In fact you could use the NFS client as a trivial umem/cuse
> prototype.

Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2011-12-29 16:01                       ` Avi Kivity
@ 2012-01-02 17:05                         ` Andrea Arcangeli
  2012-01-02 17:55                           ` Paolo Bonzini
  2012-01-04  3:03                           ` Isaku Yamahata
  0 siblings, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2012-01-02 17:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Isaku Yamahata, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > The NFS client has exactly the same issue, if you mount it with the intr
> > option.  In fact you could use the NFS client as a trivial umem/cuse
> > prototype.
> 
> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.

During KVM Forum I suggested to a few people that it could be done
entirely in userland with PROT_NONE. The problem is that if we do it in
userland with the current functionality you'll run out of VMAs and slow
performance down too much.

But all you need is the ability to map single pages in the address
space. The only special requirement is that a new vma must not be
created during the map operation. It'd be very similar to
remap_file_pages for MAP_SHARED, which was also created for no other
reason than to avoid having to create new vmas on a large MAP_SHARED
mapping. In our case we deal with a large MAP_ANONYMOUS mapping and we
must alter the pte without creating new vmas, but the problem is very
similar to remap_file_pages.
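
For comparison, remap_file_pages (needs _GNU_SOURCE and sys/mman.h)
rearranges which file pages back which addresses inside a single
MAP_SHARED vma without creating new vmas:

        addr = mmap(NULL, 4 * page_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        /* make the first page of the window show file page 3 instead
         * of file page 0; the vma itself is left alone */
        remap_file_pages(addr, page_size, 0, 3, 0);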

Qemu in the dst node can do:

	mmap(MAP_ANONYMOUS....)
	fault_area_prepare(start, end, signalnr)

prepare_fault_area will map the range with the magic pte.

Then when the signalnr fires, you do:

     send(givemepageX)
     recv(&tmpaddr_aligned, PAGE_SIZE,...);
     fault_area_map(final_dest_aligned, tmpaddr_aligned, size)

map_fault_area will check that the two vmas mapping final_dest_aligned
and tmpaddr_aligned have the same vma->vm_pgprot and various other vma
bits, and if all is ok, it'll just copy the pte from tmpaddr_aligned to
final_dest_aligned and update the page->index. It can fail if the page
is shared, to avoid dealing with the non-linearity of a page mapped in
multiple vmas.
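
So the handler for that signal on the destination would look more or
less like this (fault_area_map is the hypothetical syscall above, and
I'm assuming the faulting address is delivered in si_addr the same way
SIGSEGV does it):

#include <signal.h>
#include <stdint.h>
#include <sys/socket.h>

static void *ram_start;         /* base of the MAP_ANONYMOUS guest ram */
static long page_size;
static int src_sock;            /* connection to the source node */
static char tmp_page[4096] __attribute__((aligned(4096)));

static void postcopy_fault(int sig, siginfo_t *si, void *uctx)
{
        char *dest = (char *)((unsigned long)si->si_addr &
                              ~((unsigned long)page_size - 1));
        uint64_t pgoff = (dest - (char *)ram_start) / page_size;

        /* placeholder wire protocol: "give me page X", get the data */
        send(src_sock, &pgoff, sizeof(pgoff), 0);
        recv(src_sock, tmp_page, page_size, MSG_WAITALL);

        /* donate the received page into the faulting address without
         * touching the vma, then return and let the access retry */
        fault_area_map(dest, tmp_page, page_size);
}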

You basically need a bypass to avoid altering the pgprot of the vma,
and enter into the pte a "magic" thing that fires signal handlers
if accessed, without having to create new vmas. gup/gup_fast and stuff
should just always fallback into handle_mm_fault when encountering such a
thing, so returning failure as if gup_fast was run on a address beyond
the end of the i_size in the MAP_SHARED case.

THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
/dev/zero.

The device driver has the advantage of being self-contained, but I
doubt it's simpler. I suppose after migration is complete you'll still
switch the vma back to a regular anonymous vma, leading to the same
result?

Patch 2/2 is small and self-contained, so it's quite attractive. I
didn't see patch 1/2; was it posted?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-02 17:05                         ` Andrea Arcangeli
@ 2012-01-02 17:55                           ` Paolo Bonzini
  2012-01-03 14:25                             ` Andrea Arcangeli
  2012-01-04  3:03                           ` Isaku Yamahata
  1 sibling, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2012-01-02 17:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Isaku Yamahata,
	Avi Kivity

On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
>> On 12/29/2011 06:00 PM, Avi Kivity wrote:
>>> The NFS client has exactly the same issue, if you mount it with the intr
>>> option.  In fact you could use the NFS client as a trivial umem/cuse
>>> prototype.
>>
>> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
>
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE.

Or MAP_NORESERVE.

Anything you do that is CUSE-based should be doable in a separate QEMU 
thread (rather than a different process that talks to CUSE).  If a 
userspace CUSE-based solution could be done with acceptable performance, 
the same thing would have the same or better performance if done 
entirely within QEMU.

> So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
>
> But all you need is the ability to map single pages in the address
> space.

Would this also let you set different pgprots for different pages in the 
same VMA?  It would be useful for write barriers in garbage collectors 
(such as boehm-gc).  These do not have _that_ many VMAs, because every 
GC cycle could merge all of them back to a single VMA with PROT_READ 
permissions; however, they still put some strain on the VM subsystem.
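
For context, the usual mprotect-based barrier looks like the sketch
below (record_dirty(), heap and heap_size are the collector's own
bookkeeping); every page that takes its first write fault needs its own
pgprot change, which is where the vma splitting comes from:

#include <signal.h>
#include <sys/mman.h>

static long page_size;

void record_dirty(void *page);          /* collector bookkeeping */

static void barrier_handler(int sig, siginfo_t *si, void *uctx)
{
        void *page = (void *)((unsigned long)si->si_addr &
                              ~((unsigned long)page_size - 1));

        record_dirty(page);             /* remember it for the next GC */
        /* splits the vma: this page becomes RW while its neighbours
         * stay read-only */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

static void after_gc_cycle(void *heap, size_t heap_size)
{
        /* re-arm the barrier on the whole heap at once, letting the
         * kernel merge the split vmas back together */
        mprotect(heap, heap_size, PROT_READ);
}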

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-02 17:55                           ` Paolo Bonzini
@ 2012-01-03 14:25                             ` Andrea Arcangeli
  2012-01-12 13:57                               ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2012-01-03 14:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Isaku Yamahata,
	Avi Kivity

On Mon, Jan 02, 2012 at 06:55:18PM +0100, Paolo Bonzini wrote:
> On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> > On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> >> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> >>> The NFS client has exactly the same issue, if you mount it with the intr
> >>> option.  In fact you could use the NFS client as a trivial umem/cuse
> >>> prototype.
> >>
> >> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> >
> > During KVMForum I suggested to a few people that it could be done
> > entirely in userland with PROT_NONE.
> 
> Or MAP_NORESERVE.

MAP_NORESERVE has no effect with the default
/proc/sys/vm/overcommit_memory == 0, and in general has no effect until you
run out of memory. It's an accounting on/off switch only, mostly a noop.

> Anything you do that is CUSE-based should be doable in a separate QEMU 
> thread (rather than a different process that talks to CUSE).  If a 
> userspace CUSE-based solution could be done with acceptable performance, 
> the same thing would have the same or better performance if done 
> entirely within QEMU.

It should somehow be doable within qemu, and the source node could
handle one connection per vcpu thread for the async network page-ins.
 
> > So the problem is if we do it in
> > userland with the current functionality you'll run out of VMAs and
> > slowdown performance too much.
> >
> > But all you need is the ability to map single pages in the address
> > space.
> 
> Would this also let you set different pgprots for different pages in the 
> same VMA?  It would be useful for write barriers in garbage collectors 
> (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> GC cycles could merge all of them back to a single VMA with PROT_READ 
> permissions; however, they still put some strain on the VM subsystem.

Changing permissions sounds trickier, as more code may make
assumptions about the vma before checking the pte.

Adding a magic unmapped pte entry sounds fairly safe because there's
the migration pte already used by migrate, which halts page faults and
waits; that creates a precedent. So I guess we could reuse the code
that already exists for the migration entry, and we'd need to fire a
signal and return to userland instead of waiting. The signal should be
invoked before the page fault triggers again. Of course if the signal
handler returns and does nothing it'll loop at 100% cpu load, but
that's ok. Maybe it's possible to tweak the permissions, but that will
need a lot more thought. Specifically for anon pages, marking them
readonly sounds possible if they are supposed to behave like regular
COWs (not segfaulting or anything), as you can already have a mixture
of readonly and read-write ptes (not to mention readonly KSM pages),
but for any other case it's non-trivial. Last but not least, the API
here would be like a vma-less mremap, moving a page from one address to
another without modifying the vmas; the permission tweak sounds more
like an mprotect, so I'm unsure whether it could do both or whether it
should be an optimization to consider independently.

In theory I suspect we could also teach mremap to do a
non-vma-mangling mremap if we move pages that aren't shared, so we can
adjust the page->index of the pages instead of creating new vmas at the
dst address with an adjusted vma->vm_pgoff, but I suspect a syscall
that only works on top of fault-unmapped areas is simpler and safer.
mremap semantics require nuking the dst region before the move starts.
If we taught mremap how to handle the fault-unmapped areas, we could
just add one syscall, prepare_fault_area (or whatever name you choose).

The locking for a vma-less mremap still sounds tricky, but I doubt you
can avoid that locking complexity by using the chardevice, as long as
the chardevice-backed memory still allows THP, migration and swap and
you want to do it atomic and zerocopy. I think zerocopy would be
better, especially if the network card is fast and all vcpus are
faulting into unmapped pages simultaneously, triggering a heavy amount
of copying from all physical cpus.

I don't mean that the current device driver doing a copy_user won't
work or is a bad idea; it's more self-contained and maybe easier to
merge upstream. I'm just presenting another, more VM-integrated
zerocopy option with just 2 syscalls.

vmas must not be involved in the mremap for reliability, or too much
memory could get pinned in vmas even if we temporarily lift
/proc/sys/vm/max_map_count for the process. Plus, sending another
signal (not sigsegv or sigbus) should be more reliable in case the
migration crashes for real.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-02 17:05                         ` Andrea Arcangeli
  2012-01-02 17:55                           ` Paolo Bonzini
@ 2012-01-04  3:03                           ` Isaku Yamahata
  2012-01-12 13:59                             ` Avi Kivity
  1 sibling, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-04  3:03 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: t.hirofuchi, satoshi.itoh, Avi Kivity, kvm, qemu-devel

On Mon, Jan 02, 2012 at 06:05:51PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> > On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > > The NFS client has exactly the same issue, if you mount it with the intr
> > > option.  In fact you could use the NFS client as a trivial umem/cuse
> > > prototype.
> > 
> > Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> 
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE. So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
> 
> But all you need is the ability to map single pages in the address
> space. The only special requirement is that a new vma must not be
> created during the map operation. It'd be very similar to
> remap_file_pages for MAP_SHARED, it also was created to avoid having
> to create new vmas on a large MAP_SHARED mapping and no other reason
> at all. In our case we deal with a large MAP_ANONYMOUS mapping and we
> must alter the pte without creating new vmas but the problem is very
> similar to remap_file_pages.
> 
> Qemu in the dst node can do:
> 
> 	mmap(MAP_ANONYMOUS....)
> 	fault_area_prepare(start, end, signalnr)
> 
> prepare_fault_area will map the range with the magic pte.
> 
> Then when the signalnr fires, you do:
> 
>      send(givemepageX)
>      recv(&tmpaddr_aligned, PAGE_SIZE,...);
>      fault_area_map(final_dest_aligned, tmpaddr_aligned, size)
> 
> map_fault_area will check the pgprot of the two vmas mapping
> final_dest_aligned and tmpaddr_aligned have the same vma->vm_pgprot
> and various other vma bits, and if all ok, it'll just copy the pte
> from tmpaddr_aligned, to final_dest_aligned and it'll update the
> page->index. It can fail if the page is shared to avoid dealing with
> the non-linearity of the page mapped in multiple vmas.
> 
> You basically need a bypass to avoid altering the pgprot of the vma,
> and enter into the pte a "magic" thing that fires signal handlers
> if accessed, without having to create new vmas. gup/gup_fast and stuff
> should just always fallback into handle_mm_fault when encountering such a
> thing, so returning failure as if gup_fast was run on a address beyond
> the end of the i_size in the MAP_SHARED case.

Yes, it's quite doable in user space (qemu) with a kernel enhancement.
And it would be easy to convert the separate daemon process into a
thread in qemu.

I think it should be done outside of the qemu process for several reasons.
(I'm just repeating the discussion from the KVM Forum because no one
remembers it.)

- ptrace (and its variants)
  Some people want to investigate guest RAM on the host (with qemu
  stopped or live).
  For example, the crash utility could be enhanced to attach to the
  qemu process and debug the guest kernel.

- core dump
  The qemu process may core-dump.
  For postmortem analysis, people want to investigate guest RAM.
  Again, the crash utility could be enhanced to read the core file and
  analyze the guest kernel.
  When the core is created, the qemu process is already dead.

Handling the faults inside the qemu process precludes the above
possibilities.


> THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
> KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
> /dev/zero.

Oh great. So they seem to work with anonymous pages generally, even in a
non-anonymous VMA. Is that right?
If so, THP/KSM would work with mmap(MAP_PRIVATE, /dev/umem...), wouldn't they?
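
A quick way to check the /dev/zero claim (my sketch): with this
running, AnonHugePages in smaps should grow and, with KSM on,
/sys/kernel/mm/ksm/pages_sharing should go up:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 256UL << 20;
        int fd = open("/dev/zero", O_RDONLY);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

        madvise(p, len, MADV_HUGEPAGE);         /* let THP back it */
        madvise(p, len, MADV_MERGEABLE);        /* let KSM scan it */
        memset(p, 0x5a, len);                   /* fault everything in */

        pause();        /* keep it mapped while inspecting smaps/ksm */
        return 0;
}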


> The device driver provides an advantage in being self contained but I
> doubt it's simpler. I suppose after migration is complete you'll still
> switch the vma back to regular anonymous vma so leading to the same
> result?

Yes, that was my original intention.
The pages are anonymous, but the vma isn't. I was concerned that
KSM/THP don't work with such pages.
If they do, it isn't necessary to switch the VMA to anonymous.


> The patch 2/2 is small and self contained so it's quite attractive, I
> didn't see patch 1/2, was it posted?

Posted. It's quite short and trivial; it just does EXPORT_SYMBOL_GPL of
mem_cgroup_cache_charge and shmem_zero_setup.
I've included it here for convenience.

>From e8bfda16a845eef4381872a331c6f0f200c3f7d7 Mon Sep 17 00:00:00 2001
Message-Id: <e8bfda16a845eef4381872a331c6f0f200c3f7d7.1325055066.git.yamahata@valinux.co.jp>
In-Reply-To: <cover.1325055065.git.yamahata@valinux.co.jp>
References: <cover.1325055065.git.yamahata@valinux.co.jp>
From: Isaku Yamahata <yamahata@valinux.co.jp>
Date: Thu, 11 Aug 2011 20:05:28 +0900
Subject: [PATCH 1/2] export necessary symbols

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 mm/memcontrol.c |    1 +
 mm/shmem.c      |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..85530fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,6 +2807,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
 
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..d137a37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2546,6 +2546,7 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(shmem_zero_setup);
 
 /**
  * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.
-- 
1.7.1.1

-- 
yamahata

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] Re:  [PATCH 2/2] umem: chardevice for kvm postcopy
  2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
  2011-12-29 11:17   ` Avi Kivity
@ 2012-01-05  4:08   ` thfbjyddx
  2012-01-05 10:48     ` [Qemu-devel] Re: " Isaku Yamahata
  1 sibling, 1 reply; 42+ messages in thread
From: thfbjyddx @ 2012-01-05  4:08 UTC (permalink / raw)
  To: Isaku Yamahata, kvm, qemu-devel; +Cc: t.hirofuchi, satoshi.itoh

[-- Attachment #1: Type: text/plain, Size: 26364 bytes --]

Hi,
I've tried to use this patch,
but it doesn't build; compilation fails on

 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->virtual_address?

I guess it's because of a kernel version mismatch?
Can you give me some details about this or any clue?
Thanks




From: Isaku Yamahata
Date: 2011-12-29 09:26
To: kvm; qemu-devel
CC: yamahata; t.hirofuchi; satoshi.itoh
Subject: [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy
This is a character device to hook page access.
The page fault in the area is reported to another user process by
this chardriver. Then, the process fills the page contents and
resolves the page fault.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 drivers/char/Kconfig  |    9 +
 drivers/char/Makefile |    1 +
 drivers/char/umem.c   |  898 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/umem.h  |   83 +++++
 4 files changed, 991 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/umem.c
 create mode 100644 include/linux/umem.h

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 4364303..001e3e4 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -15,6 +15,15 @@ config DEVKMEM
    kind of kernel debugging operations.
    When in doubt, say "N".

+config UMEM
+        tristate "/dev/umem user process backed memory support"
+ default n
+ help
+   User process backed memory driver provides /dev/umem device.
+   The /dev/umem device is designed for some sort of distributed
+   shared memory. Especially post-copy live migration with KVM.
+   When in doubt, say "N".
+
 config STALDRV
  bool "Stallion multiport serial support"
  depends on SERIAL_NONSTANDARD
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 32762ba..1eb14dc 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -3,6 +3,7 @@
 #

 obj-y += mem.o random.o
+obj-$(CONFIG_UMEM) += umem.o
 obj-$(CONFIG_TTY_PRINTK) += ttyprintk.o
 obj-y += misc.o
 obj-$(CONFIG_ATARI_DSP56K) += dsp56k.o
diff --git a/drivers/char/umem.c b/drivers/char/umem.c
new file mode 100644
index 0000000..df669fb
--- /dev/null
+++ b/drivers/char/umem.c
@@ -0,0 +1,898 @@
+/*
+ * UMEM: user process backed memory.
+ *
+ * Copyright (c) 2011,
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/miscdevice.h>
+#include <linux/umem.h>
+
+struct umem_page_req_list {
+ struct list_head list;
+ pgoff_t pgoff;
+};
+
+struct umem {
+ loff_t size;
+ pgoff_t pgoff_end;
+ spinlock_t lock;
+
+ wait_queue_head_t req_wait;
+
+ int async_req_max;
+ int async_req_nr;
+ pgoff_t *async_req;
+
+ int sync_req_max;
+ unsigned long *sync_req_bitmap;
+ unsigned long *sync_wait_bitmap;
+ pgoff_t *sync_req;
+ wait_queue_head_t *page_wait;
+
+ int req_list_nr;
+ struct list_head req_list;
+ wait_queue_head_t req_list_wait;
+
+ unsigned long *cached;
+ unsigned long *faulted;
+
+ bool mmapped;
+ unsigned long vm_start;
+ unsigned int vma_nr;
+ struct task_struct *task;
+
+ struct file *shmem_filp;
+ struct vm_area_struct vma;
+
+ struct kref kref;
+ struct list_head list;
+ struct umem_name name;
+};
+
+
+static LIST_HEAD(umem_list);
+DEFINE_MUTEX(umem_list_mutex);
+
+static bool umem_name_eq(const struct umem_name *lhs,
+   const struct umem_name *rhs)
+{
+ return memcmp(lhs->id, rhs->id, sizeof(lhs->id)) == 0 &&
+ memcmp(lhs->name, rhs->name, sizeof(lhs->name)) == 0;
+}
+
+static int umem_add_list(struct umem *umem)
+{
+ struct umem *entry;
+ BUG_ON(!mutex_is_locked(&umem_list_mutex));
+ list_for_each_entry(entry, &umem_list, list) {
+ if (umem_name_eq(&entry->name, &umem->name)) {
+ mutex_unlock(&umem_list_mutex);
+ return -EBUSY;
+ }
+ }
+
+ list_add(&umem->list, &umem_list);
+ return 0;
+}
+
+static void umem_release_fake_vmf(int ret, struct vm_fault *fake_vmf)
+{
+ if (ret & VM_FAULT_LOCKED) {
+ unlock_page(fake_vmf->page);
+ }
+ page_cache_release(fake_vmf->page);
+}
+
+static int umem_minor_fault(struct umem *umem,
+     struct vm_area_struct *vma,
+     struct vm_fault *vmf)
+{
+ struct vm_fault fake_vmf;
+ int ret;
+ struct page *page;
+
+ BUG_ON(!test_bit(vmf->pgoff, umem->cached));
+ fake_vmf = *vmf;
+ fake_vmf.page = NULL;
+ ret = umem->vma.vm_ops->fault(&umem->vma, &fake_vmf);
+ if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))
+ return ret;
+
+ /*
+  * TODO: pull out fake_vmf->page from shmem file and donate it
+  * to this vma resolving the page fault.
+  * vmf->page = fake_vmf->page;
+  */
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+ if (!page)
+ return VM_FAULT_OOM;
+ if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
+ umem_release_fake_vmf(ret, &fake_vmf);
+ page_cache_release(page);
+ return VM_FAULT_OOM;
+ }
+
+ copy_highpage(page, fake_vmf.page);
+ umem_release_fake_vmf(ret, &fake_vmf);
+
+ ret |= VM_FAULT_LOCKED;
+ SetPageUptodate(page);
+ vmf->page = page;
+ set_bit(vmf->pgoff, umem->faulted);
+
+ return ret;
+}
+
+static int umem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct file *filp = vma->vm_file;
+ struct umem *umem = filp->private_data;
+
+ if (vmf->pgoff >= umem->pgoff_end) {
+ return VM_FAULT_SIGBUS;
+ }
+
+ BUG_ON(test_bit(vmf->pgoff, umem->faulted));
+
+ if (!test_bit(vmf->pgoff, umem->cached)) {
+ /* major fault */
+ unsigned long bit;
+ DEFINE_WAIT(wait);
+
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
+ /* async page fault */
+ spin_lock(&umem->lock);
+ if (umem->async_req_nr < umem->async_req_max) {
+ umem->async_req[umem->async_req_nr] =
+ vmf->pgoff;
+ umem->async_req_nr++;
+ }
+ spin_unlock(&umem->lock);
+ wake_up_poll(&umem->req_wait, POLLIN);
+
+ if (test_bit(vmf->pgoff, umem->cached))
+ return umem_minor_fault(umem, vma, vmf);
+ return VM_FAULT_MAJOR | VM_FAULT_RETRY;
+ }
+
+ spin_lock(&umem->lock);
+ bit = find_first_zero_bit(umem->sync_wait_bitmap,
+   umem->sync_req_max);
+ if (likely(bit < umem->sync_req_max)) {
+ umem->sync_req[bit] = vmf->pgoff;
+ prepare_to_wait(&umem->page_wait[bit], &wait,
+ TASK_UNINTERRUPTIBLE);
+ set_bit(bit, umem->sync_req_bitmap);
+ set_bit(bit, umem->sync_wait_bitmap);
+ spin_unlock(&umem->lock);
+ wake_up_poll(&umem->req_wait, POLLIN);
+
+ if (!test_bit(vmf->pgoff, umem->cached))
+ schedule();
+ finish_wait(&umem->page_wait[bit], &wait);
+ clear_bit(bit, umem->sync_wait_bitmap);
+ } else {
+ struct umem_page_req_list page_req_list = {
+ .pgoff = vmf->pgoff,
+ };
+ umem->req_list_nr++;
+ list_add_tail(&page_req_list.list, &umem->req_list);
+ wake_up_poll(&umem->req_wait, POLLIN);
+ for (;;) {
+ prepare_to_wait(&umem->req_list_wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+ if (test_bit(vmf->pgoff, umem->cached)) {
+ umem->req_list_nr--;
+ break;
+ }
+ spin_unlock(&umem->lock);
+ schedule();
+ spin_lock(&umem->lock);
+ }
+ spin_unlock(&umem->lock);
+ finish_wait(&umem->req_list_wait, &wait);
+ }
+
+ return umem_minor_fault(umem, vma, vmf) | VM_FAULT_MAJOR;
+ }
+
+ return umem_minor_fault(umem, vma, vmf);
+}
+
+/* for partial munmap */
+static void umem_vma_open(struct vm_area_struct *vma)
+{
+ struct file *filp = vma->vm_file;
+ struct umem *umem = filp->private_data;
+
+ spin_lock(&umem->lock);
+ umem->vma_nr++;
+ spin_unlock(&umem->lock);
+}
+
+static void umem_vma_close(struct vm_area_struct *vma)
+{
+ struct file *filp = vma->vm_file;
+ struct umem *umem = filp->private_data;
+ struct task_struct *task = NULL;
+
+ spin_lock(&umem->lock);
+ umem->vma_nr--;
+ if (umem->vma_nr == 0) {
+ task = umem->task;
+ umem->task = NULL;
+ }
+ spin_unlock(&umem->lock);
+
+ if (task)
+ put_task_struct(task);
+}
+
+static const struct vm_operations_struct umem_vm_ops = {
+ .open = umem_vma_open,
+ .close = umem_vma_close,
+ .fault = umem_fault,
+};
+
+static int umem_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ struct umem *umem = filp->private_data;
+ int error;
+
+ /* allow mmap() only once */
+ spin_lock(&umem->lock);
+ if (umem->mmapped) {
+ error = -EBUSY;
+ goto out;
+ }
+ if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff >
+     umem->pgoff_end) {
+ error = -EINVAL;
+ goto out;
+ }
+
+ umem->mmapped = true;
+ umem->vma_nr = 1;
+ umem->vm_start = vma->vm_start;
+ get_task_struct(current);
+ umem->task = current;
+ spin_unlock(&umem->lock);
+
+ vma->vm_ops = &umem_vm_ops;
+ vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+ vma->vm_flags &= ~VM_SHARED;
+ return 0;
+
+out:
+ spin_unlock(&umem->lock);
+ return error;
+}
+
+static bool umem_req_pending(struct umem* umem)
+{
+ return !list_empty(&umem->req_list) ||
+ !bitmap_empty(umem->sync_req_bitmap, umem->sync_req_max) ||
+ (umem->async_req_nr > 0);
+}
+
+static unsigned int umem_poll(struct file* filp, poll_table *wait)
+{
+ struct umem *umem = filp->private_data;
+ unsigned int events = 0;
+
+ poll_wait(filp, &umem->req_wait, wait);
+
+ spin_lock(&umem->lock);
+ if (umem_req_pending(umem))
+ events |= POLLIN;
+ spin_unlock(&umem->lock);
+
+ return events;
+}
+
+/*
+ * return value
+ * true: finished
+ * false: more request
+ */
+static bool umem_copy_page_request(struct umem *umem,
+    pgoff_t *pgoffs, int req_max,
+    int *req_nr)
+{
+ struct umem_page_req_list *req_list;
+ struct umem_page_req_list *tmp;
+
+ unsigned long bit;
+
+ *req_nr = 0;
+ list_for_each_entry_safe(req_list, tmp, &umem->req_list, list) {
+ list_del(&req_list->list);
+ pgoffs[*req_nr] = req_list->pgoff;
+ (*req_nr)++;
+ if (*req_nr >= req_max)
+ return false;
+ }
+
+ bit = 0;
+ for (;;) {
+ bit = find_next_bit(umem->sync_req_bitmap, umem->sync_req_max,
+     bit);
+ if (bit >= umem->sync_req_max)
+ break;
+ pgoffs[*req_nr] = umem->sync_req[bit];
+ (*req_nr)++;
+ clear_bit(bit, umem->sync_req_bitmap);
+ if (*req_nr >= req_max)
+ return false;
+ bit++;
+ }
+
+ if (umem->async_req_nr > 0) {
+ int nr = min(req_max - *req_nr, umem->async_req_nr);
+ memcpy(pgoffs + *req_nr, umem->async_req,
+        sizeof(*umem->async_req) * nr);
+ umem->async_req_nr -= nr;
+ *req_nr += nr;
+ memmove(umem->async_req, umem->sync_req + nr,
+ umem->async_req_nr * sizeof(*umem->async_req));
+
+ }
+ return umem->async_req_nr == 0;
+}
+
+static int umem_get_page_request(struct umem *umem,
+  struct umem_page_request *page_req)
+{
+ DEFINE_WAIT(wait);
+#define REQ_MAX ((__u32)32)
+ pgoff_t pgoffs[REQ_MAX];
+ __u32 req_copied = 0;
+ int ret = 0;
+
+ spin_lock(&umem->lock);
+ for (;;) {
+ prepare_to_wait(&umem->req_wait, &wait, TASK_INTERRUPTIBLE);
+ if (umem_req_pending(umem)) {
+ break;
+ }
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ spin_unlock(&umem->lock);
+ schedule();
+ spin_lock(&umem->lock);
+ }
+ finish_wait(&umem->req_wait, &wait);
+ if (ret)
+ goto out_unlock;
+
+ while (req_copied < page_req->nr) {
+ int req_max;
+ int req_nr;
+ bool finished;
+ req_max = min(page_req->nr - req_copied, REQ_MAX);
+ finished = umem_copy_page_request(umem, pgoffs, req_max,
+   &req_nr);
+
+ spin_unlock(&umem->lock);
+
+ if (req_nr > 0) {
+ ret = 0;
+ if (copy_to_user(page_req->pgoffs + req_copied, pgoffs,
+  sizeof(*pgoffs) * req_nr)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ }
+ req_copied += req_nr;
+ if (finished)
+ goto out;
+
+ spin_lock(&umem->lock);
+ }
+
+out_unlock:
+ spin_unlock(&umem->lock);
+out:
+ page_req->nr = req_copied;
+ return ret;
+}
+
+static int umem_mark_page_cached(struct umem *umem,
+  struct umem_page_cached *page_cached)
+{
+ int ret = 0;
+#define PG_MAX ((__u32)32)
+ __u64 pgoffs[PG_MAX];
+ __u32 nr;
+ unsigned long bit;
+ bool wake_up_list = false;
+
+ nr = 0;
+ while (nr < page_cached->nr) {
+ __u32 todo = min(PG_MAX, (page_cached->nr - nr));
+ int i;
+
+ if (copy_from_user(pgoffs, page_cached->pgoffs + nr,
+    sizeof(*pgoffs) * todo)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ for (i = 0; i < todo; ++i) {
+ if (pgoffs[i] >= umem->pgoff_end) {
+ ret = -EINVAL;
+ goto out;
+ }
+ set_bit(pgoffs[i], umem->cached);
+ }
+ nr += todo;
+ }
+
+ spin_lock(&umem->lock);
+ bit = 0;
+ for (;;) {
+ bit = find_next_bit(umem->sync_wait_bitmap, umem->sync_req_max,
+     bit);
+ if (bit >= umem->sync_req_max)
+ break;
+ if (test_bit(umem->sync_req[bit], umem->cached))
+ wake_up(&umem->page_wait[bit]);
+ bit++;
+ }
+
+ if (umem->req_list_nr > 0)
+ wake_up_list = true;
+ spin_unlock(&umem->lock);
+
+ if (wake_up_list)
+ wake_up_all(&umem->req_list_wait);
+
+out:
+ return ret;
+}
+
+static int umem_make_vma_anonymous(struct umem *umem)
+{
+#if 1
+ return -ENOSYS;
+#else
+ unsigned long saddr;
+ unsigned long eaddr;
+ unsigned long addr;
+ unsigned long bit;
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ spin_lock(&umem->lock);
+ task = umem->task;
+ saddr = umem->vm_start;
+ eaddr = saddr + umem->size;
+ bit = find_first_zero_bit(umem->faulted, umem->pgoff_end);
+ if (bit < umem->pgoff_end) {
+ spin_unlock(&umem->lock);
+ return -EBUSY;
+ }
+ spin_unlock(&umem->lock);
+ if (task == NULL)
+ return 0;
+ mm = get_task_mm(task);
+ if (mm == NULL)
+ return 0;
+
+ addr = saddr;
+ down_write(&mm->mmap_sem);
+ while (addr < eaddr) {
+ struct vm_area_struct *vma;
+ vma = find_vma(mm, addr);
+ if (umem_is_umem_vma(umem, vma)) {
+ /* XXX incorrect. race/locking and more fix up */
+ struct file *filp = vma->vm_file;
+ vma->vm_ops->close(vma);
+ vma->vm_ops = NULL;
+ vma->vm_file = NULL;
+ /* vma->vm_flags */
+ fput(filp);
+ }
+ addr = vma->vm_end;
+ }
+ up_write(&mm->mmap_sem);
+
+ mmput(mm);
+ return 0;
+#endif
+}
+
+static long umem_ioctl(struct file *filp, unsigned int ioctl,
+    unsigned long arg)
+{
+ struct umem *umem = filp->private_data;
+ void __user *argp = (void __user *) arg;
+ long ret = 0;
+
+ switch (ioctl) {
+ case UMEM_GET_PAGE_REQUEST: {
+ struct umem_page_request page_request;
+ ret = -EFAULT;
+ if (copy_from_user(&page_request, argp, sizeof(page_request)))
+ break;
+ ret = umem_get_page_request(umem, &page_request);
+ if (ret == 0 &&
+     copy_to_user(argp +
+  offsetof(struct umem_page_request, nr),
+  &page_request.nr,
+  sizeof(page_request.nr))) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ }
+ case UMEM_MARK_PAGE_CACHED: {
+ struct umem_page_cached page_cached;
+ ret = -EFAULT;
+ if (copy_from_user(&page_cached, argp, sizeof(page_cached)))
+ break;
+ ret = umem_mark_page_cached(umem, &page_cached);
+ break;
+ }
+ case UMEM_MAKE_VMA_ANONYMOUS:
+ ret = umem_make_vma_anonymous(umem);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static unsigned long umem_bitmap_bytes(const struct umem *umem)
+{
+ return round_up(umem->pgoff_end, BITS_PER_LONG) / 8;
+}
+
+
+static void umem_free(struct kref *kref)
+{
+ struct umem *umem = container_of(kref, struct umem, kref);
+
+ BUG_ON(!mutex_is_locked(&umem_list_mutex));
+ list_del(&umem->list);
+ mutex_unlock(&umem_list_mutex);
+
+ if (umem->task) {
+ put_task_struct(umem->task);
+ umem->task = NULL;
+ }
+
+ if (umem->shmem_filp)
+ fput(umem->shmem_filp);
+ if (umem_bitmap_bytes(umem) > PAGE_SIZE) {
+ vfree(umem->cached);
+ vfree(umem->faulted);
+ } else {
+ kfree(umem->cached);
+ kfree(umem->faulted);
+ }
+ kfree(umem->async_req);
+ kfree(umem->sync_req_bitmap);
+ kfree(umem->sync_wait_bitmap);
+ kfree(umem->page_wait);
+ kfree(umem->sync_req);
+ kfree(umem);
+}
+
+static void umem_put(struct umem *umem)
+{
+ int ret;
+
+ mutex_lock(&umem_list_mutex);
+ ret = kref_put(&umem->kref, umem_free);
+ if (ret == 0) {
+ mutex_unlock(&umem_list_mutex);
+ }
+}
+
+static int umem_release(struct inode *inode, struct file *filp)
+{
+ struct umem *umem = filp->private_data;
+ umem_put(umem);
+ return 0;
+}
+
+static struct file_operations umem_fops = {
+ .release = umem_release,
+ .unlocked_ioctl = umem_ioctl,
+ .mmap = umem_mmap,
+ .poll = umem_poll,
+ .llseek = noop_llseek,
+};
+
+static int umem_create_umem(struct umem_create *create)
+{
+ int error = 0;
+ struct umem *umem = NULL;
+ struct vm_area_struct *vma;
+ int shmem_fd;
+ unsigned long bitmap_bytes;
+ unsigned long sync_bitmap_bytes;
+ int i;
+
+ umem = kzalloc(sizeof(*umem), GFP_KERNEL);
+ umem->name = create->name;
+ kref_init(&umem->kref);
+ INIT_LIST_HEAD(&umem->list);
+
+ mutex_lock(&umem_list_mutex);
+ error = umem_add_list(umem);
+ if (error) {
+ goto out;
+ }
+
+ umem->task = NULL;
+ umem->mmapped = false;
+ spin_lock_init(&umem->lock);
+ umem->size = roundup(create->size, PAGE_SIZE);
+ umem->pgoff_end = umem->size >> PAGE_SHIFT;
+ init_waitqueue_head(&umem->req_wait);
+
+ vma = &umem->vma;
+ vma->vm_start = 0;
+ vma->vm_end = umem->size;
+ /* this shmem file is used as a temporary buffer for pages,
+    so it's unlikely that many pages exist in this shmem file at once */
+ vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY |
+ VM_DONTEXPAND;
+ vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+ vma->vm_pgoff = 0;
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
+
+ shmem_fd = get_unused_fd();
+ if (shmem_fd < 0) {
+ error = shmem_fd;
+ goto out;
+ }
+ error = shmem_zero_setup(vma);
+ if (error < 0) {
+ put_unused_fd(shmem_fd);
+ goto out;
+ }
+ umem->shmem_filp = vma->vm_file;
+ get_file(umem->shmem_filp);
+ fd_install(shmem_fd, vma->vm_file);
+ create->shmem_fd = shmem_fd;
+
+ create->umem_fd = anon_inode_getfd("umem",
+    &umem_fops, umem, O_RDWR);
+ if (create->umem_fd < 0) {
+ error = create->umem_fd;
+ goto out;
+ }
+
+ bitmap_bytes = umem_bitmap_bytes(umem);
+ if (bitmap_bytes > PAGE_SIZE) {
+ umem->cached = vzalloc(bitmap_bytes);
+ umem->faulted = vzalloc(bitmap_bytes);
+ } else {
+ umem->cached = kzalloc(bitmap_bytes, GFP_KERNEL);
+ umem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL);
+ }
+
+ /* these constants are not exported.
+    They are just used as default values */
+#define KVM_MAX_VCPUS 256
+#define ASYNC_PF_PER_VCPU 64
+
+#define ASYNC_REQ_MAX (ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS)
+ if (create->async_req_max == 0)
+ create->async_req_max = ASYNC_REQ_MAX;
+ umem->async_req_max = create->async_req_max;
+ umem->async_req_nr = 0;
+ umem->async_req = kzalloc(
+ sizeof(*umem->async_req) * umem->async_req_max,
+ GFP_KERNEL);
+
+#define SYNC_REQ_MAX (KVM_MAX_VCPUS)
+ if (create->sync_req_max == 0)
+ create->sync_req_max = SYNC_REQ_MAX;
+ umem->sync_req_max = round_up(create->sync_req_max, BITS_PER_LONG);
+ sync_bitmap_bytes = sizeof(unsigned long) *
+ (umem->sync_req_max / BITS_PER_LONG);
+ umem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
+ umem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL);
+ umem->page_wait = kzalloc(sizeof(*umem->page_wait) *
+   umem->sync_req_max, GFP_KERNEL);
+ for (i = 0; i < umem->sync_req_max; ++i)
+ init_waitqueue_head(&umem->page_wait[i]);
+ umem->sync_req = kzalloc(sizeof(*umem->sync_req) *
+  umem->sync_req_max, GFP_KERNEL);
+
+ umem->req_list_nr = 0;
+ INIT_LIST_HEAD(&umem->req_list);
+ init_waitqueue_head(&umem->req_list_wait);
+
+ mutex_unlock(&umem_list_mutex);
+ return 0;
+
+ out:
+ umem_free(&umem->kref);
+ return error;
+}
+
+static int umem_list_umem(struct umem_list __user *u_list)
+{
+ struct umem_list k_list;
+ struct umem *entry;
+ struct umem_name __user *u_name = u_list->names;
+ __u32 nr = 0;
+
+ if (copy_from_user(&k_list, u_list, sizeof(k_list))) {
+ return -EFAULT;
+ }
+
+ mutex_lock(&umem_list_mutex);
+ list_for_each_entry(entry, &umem_list, list) {
+ if (nr < k_list.nr) {
+ if (copy_to_user(u_name, &entry->name,
+  sizeof(entry->name))) {
+ mutex_unlock(&umem_list_mutex);
+ return -EFAULT;
+ }
+ u_name++;
+ }
+ nr++;
+ }
+ mutex_unlock(&umem_list_mutex);
+
+ k_list.nr = nr;
+ if (copy_to_user(u_list, &k_list, sizeof(k_list))) {
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static int umem_reattach_umem(struct umem_create *create)
+{
+ struct umem *entry;
+
+ mutex_lock(&umem_list_mutex);
+ list_for_each_entry(entry, &umem_list, list) {
+ if (umem_name_eq(&entry->name, &create->name)) {
+ kref_get(&entry->kref);
+ mutex_unlock(&umem_list_mutex);
+
+ create->shmem_fd = get_unused_fd();
+ if (create->shmem_fd < 0) {
+ umem_put(entry);
+ return create->shmem_fd;
+ }
+ create->umem_fd = anon_inode_getfd(
+ "umem", &umem_fops, entry, O_RDWR);
+ if (create->umem_fd < 0) {
+ put_unused_fd(create->shmem_fd);
+ umem_put(entry);
+ return create->umem_fd;
+ }
+ get_file(entry->shmem_filp);
+ fd_install(create->shmem_fd, entry->shmem_filp);
+
+ create->size = entry->size;
+ create->sync_req_max = entry->sync_req_max;
+ create->async_req_max = entry->async_req_max;
+ return 0;
+ }
+ }
+ mutex_unlock(&umem_list_mutex);
+
+ return -ENOENT;
+}
+
+static long umem_dev_ioctl(struct file *filp, unsigned int ioctl,
+    unsigned long arg)
+{
+ void __user *argp = (void __user *) arg;
+ long ret;
+ struct umem_create *create = NULL;
+
+
+ switch (ioctl) {
+ case UMEM_DEV_CREATE_UMEM:
+ create = kmalloc(sizeof(*create), GFP_KERNEL);
+ if (copy_from_user(create, argp, sizeof(*create))) {
+ ret = -EFAULT;
+ break;
+ }
+ ret = umem_create_umem(create);
+ if (copy_to_user(argp, create, sizeof(*create))) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ case UMEM_DEV_LIST:
+ ret = umem_list_umem(argp);
+ break;
+ case UMEM_DEV_REATTACH:
+ create = kmalloc(sizeof(*create), GFP_KERNEL);
+ if (copy_from_user(create, argp, sizeof(*create))) {
+ ret = -EFAULT;
+ break;
+ }
+ ret = umem_reattach_umem(create);
+ if (copy_to_user(argp, create, sizeof(*create))) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ kfree(create);
+ return ret;
+}
+
+static int umem_dev_release(struct inode *inode, struct file *filp)
+{
+ return 0;
+}
+
+static struct file_operations umem_dev_fops = {
+ .release = umem_dev_release,
+ .unlocked_ioctl = umem_dev_ioctl,
+};
+
+static struct miscdevice umem_dev = {
+ MISC_DYNAMIC_MINOR,
+ "umem",
+ &umem_dev_fops,
+};
+
+static int __init umem_init(void)
+{
+ int r;
+ r = misc_register(&umem_dev);
+ if (r) {
+ printk(KERN_ERR "umem: misc device register failed\n");
+ return r;
+ }
+ return 0;
+}
+module_init(umem_init);
+
+static void __exit umem_exit(void)
+{
+ misc_deregister(&umem_dev);
+}
+module_exit(umem_exit);
+
+MODULE_DESCRIPTION("UMEM user process backed memory driver "
+    "for distributed shared memory");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Isaku Yamahata");
diff --git a/include/linux/umem.h b/include/linux/umem.h
new file mode 100644
index 0000000..e1a8633
--- /dev/null
+++ b/include/linux/umem.h
@@ -0,0 +1,83 @@
+/*
+ * User process backed memory.
+ * This is mainly for KVM post copy.
+ *
+ * Copyright (c) 2011,
+ * National Institute of Advanced Industrial Science and Technology
+ *
+ * https://sites.google.com/site/grivonhome/quick-kvm-migration
+ * Author: Isaku Yamahata <yamahata at valinux co jp>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __LINUX_UMEM_H
+#define __LINUX_UMEM_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#ifdef __KERNEL__
+#include <linux/compiler.h>
+#else
+#define __user
+#endif
+
+#define UMEM_ID_MAX 256
+#define UMEM_NAME_MAX 256
+
+struct umem_name {
+ char id[UMEM_ID_MAX]; /* not necessarily zero-terminated */
+ char name[UMEM_NAME_MAX]; /* not necessarily zero-terminated */
+};
+
+struct umem_list {
+ __u32 nr;
+ __u32 padding;
+ struct umem_name names[0];
+};
+
+struct umem_create {
+ __u64 size; /* in bytes */
+ __s32 umem_fd;
+ __s32 shmem_fd;
+ __u32 async_req_max;
+ __u32 sync_req_max;
+ struct umem_name name;
+};
+
+struct umem_page_request {
+ __u64 __user *pgoffs;
+ __u32 nr;
+ __u32 padding;
+};
+
+struct umem_page_cached {
+ __u64 __user *pgoffs;
+ __u32 nr;
+ __u32 padding;
+};
+
+#define UMEMIO 0x1E
+
+/* ioctl for umem_dev fd */
+#define UMEM_DEV_CREATE_UMEM _IOWR(UMEMIO, 0x0, struct umem_create)
+#define UMEM_DEV_LIST _IOWR(UMEMIO, 0x1, struct umem_list)
+#define UMEM_DEV_REATTACH _IOWR(UMEMIO, 0x2, struct umem_create)
+
+/* ioctl for umem fd */
+#define UMEM_GET_PAGE_REQUEST _IOWR(UMEMIO, 0x10, struct umem_page_request)
+#define UMEM_MARK_PAGE_CACHED _IOW (UMEMIO, 0x11, struct umem_page_cached)
+#define UMEM_MAKE_VMA_ANONYMOUS _IO  (UMEMIO, 0x12)
+
+#endif /* __LINUX_UMEM_H */
-- 
1.7.1.1
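
For illustration only, a minimal sketch of how a userspace pager might drive the
ioctl interface defined above. It is not part of the patch; it assumes that the
pager stages fetched page contents into the shmem object obtained via
create.shmem_fd before issuing UMEM_MARK_PAGE_CACHED, that the umem fd returned
by UMEM_DEV_CREATE_UMEM is the one the destination qemu would mmap() as guest
RAM, and that fetch_page_from_source() is an invented stand-in for pulling a
page from the migration source. Error handling is mostly omitted.

/*
 * Illustrative sketch only -- not part of the posted patch.
 * Assumes 4 KiB pages and a 256 MB guest (matching the -m 256 example).
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/umem.h>

#define GUEST_RAM_SIZE	(256UL << 20)
#define PAGE_SZ		4096UL
#define BATCH		32

/* invented stand-in: a real pager would read the page from the source host */
static void fetch_page_from_source(__u64 pgoff, void *dst)
{
	memset(dst, 0, PAGE_SZ);
}

int main(void)
{
	struct umem_create create;
	__u64 pgoffs[BATCH];
	char *staging;
	int dev_fd;

	dev_fd = open("/dev/umem", O_RDWR);
	if (dev_fd < 0) {
		perror("open /dev/umem");
		return 1;
	}

	memset(&create, 0, sizeof(create));
	create.size = GUEST_RAM_SIZE;
	strncpy(create.name.id, "vm-0", UMEM_ID_MAX);
	strncpy(create.name.name, "postcopy-demo", UMEM_NAME_MAX);
	if (ioctl(dev_fd, UMEM_DEV_CREATE_UMEM, &create) < 0) {
		perror("UMEM_DEV_CREATE_UMEM");
		return 1;
	}
	/* create.umem_fd is what the destination qemu would mmap() as RAM */

	/* stage incoming pages through the shmem buffer created above */
	staging = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, create.shmem_fd, 0);
	if (staging == MAP_FAILED) {
		perror("mmap shmem_fd");
		return 1;
	}

	for (;;) {
		struct umem_page_request req = { .pgoffs = pgoffs, .nr = BATCH };
		struct umem_page_cached cached = { .pgoffs = pgoffs };
		struct pollfd pfd = { .fd = create.umem_fd, .events = POLLIN };
		__u32 i;

		/* wait until faulting guest threads have produced requests */
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* blocks until requests are pending; nr is updated on return */
		if (ioctl(create.umem_fd, UMEM_GET_PAGE_REQUEST, &req) < 0)
			break;

		for (i = 0; i < req.nr; i++)
			fetch_page_from_source(pgoffs[i],
					       staging + pgoffs[i] * PAGE_SZ);

		/* unblock the faulting threads for these page offsets */
		cached.nr = req.nr;
		if (ioctl(create.umem_fd, UMEM_MARK_PAGE_CACHED, &cached) < 0)
			break;
	}
	return 0;
}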


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复:  [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-05  4:08   ` [Qemu-devel] 回复: " thfbjyddx
@ 2012-01-05 10:48     ` Isaku Yamahata
  2012-01-05 11:10       ` Tommy
  0 siblings, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-05 10:48 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
> hi,
> I've tried to use this patch,

Oh great! Can you share your results?


> but it doesn't work for compiling error on
>  
>  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
> virtual_address?
>  
> I guess it's for the wrong kernel version?
> can you give me some detail about this or any clue?
> 3x 

Thank you for the report. The following should fix it.
It depends on the kernel configuration; my config didn't catch it.

diff --git a/drivers/char/umem.c b/drivers/char/umem.c
index 4d031b5..853f1ce 100644
--- a/drivers/char/umem.c
+++ b/drivers/char/umem.c
@@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
 	 * vmf->page = fake_vmf->page;
 	 */
 
-	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+	page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
 	if (!page)
 		return VM_FAULT_OOM;
 	if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {



-- 
yamahata

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复:  [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-05 10:48     ` [Qemu-devel] 回复: " Isaku Yamahata
@ 2012-01-05 11:10       ` Tommy
  2012-01-05 12:18         ` Isaku Yamahata
  0 siblings, 1 reply; 42+ messages in thread
From: Tommy @ 2012-01-05 11:10 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


After applying this series of patches, the migration failed.
1. I start migrate -d -p -n tcp:xxx:4444 on the outgoing node.
2. On the incoming side, the qemu gets stuck and the migration fails;
the destination does not accept any typing any more.

Today I found it stops right at qemu_loadvm_state, just after the while loop, maybe in cpu_synchronize_all_post_init.
I think there is some problem on the qemu side, because it never gets to the umem part.
I'm not sure about the problem.
Do you have any suggestions?



Tommy

From: Isaku Yamahata
Date: 2012-01-05 18:48
To: thfbjyddx
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel]回复: [PATCH 2/2] umem: chardevice for kvm postcopy
On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
> hi,
> I've tried to use this patch,

Oh great! Can we share your results?


> but it doesn't work for compiling error on
>  
>  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
> virtual_address?
>  
> I guess it's for the wrong kernel version?
> can you give me some detail about this or any clue?
> 3x 

Thank you for report. The following should fix.
It depends on kernel configuration. My config didn't catch it.

diff --git a/drivers/char/umem.c b/drivers/char/umem.c
index 4d031b5..853f1ce 100644
--- a/drivers/char/umem.c
+++ b/drivers/char/umem.c
@@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
   * vmf->page = fake_vmf->page;
   */

- page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
  if (!page)
  return VM_FAULT_OOM;
  if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {



-- 
yamahata


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-05 11:10       ` Tommy
@ 2012-01-05 12:18         ` Isaku Yamahata
  2012-01-05 15:02           ` Tommy Tang
                             ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-05 12:18 UTC (permalink / raw)
  To: Tommy; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

Hmm, this sounds like you haven't specified the -postcopy option at the
incoming qemu.
How did you start the incoming qemu?


On Thu, Jan 05, 2012 at 07:10:42PM +0800, Tommy wrote:
> After I use this series of patches, but the migration failed.
> 2, I start migrate -d -p -n tcp:xxx:4444 on the outgoing node
> 2, on the incoming part, the qemu get stuck and migration failed
> the  destnation can not typing any more
>  
> today I found it's just at qemu_loadvm_state, just after the while loop ,maybe
> in cpu_synchronize_all_post_init
> I think there is some problems with qemu side for it doesn't get to the umem
> part
> I'm not sure about the problem
> do you have some suggestion?
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-05 18:48
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
> On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
> > hi,
> > I've tried to use this patch,
>  
> Oh great! Can we share your results?
>  
>  
> > but it doesn't work for compiling error on
> >  
> >  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
> > virtual_address?
> >  
> > I guess it's for the wrong kernel version?
> > can you give me some detail about this or any clue?
> > 3x 
>  
> Thank you for report. The following should fix.
> It depends on kernel configuration. My config didn't catch it.
>  
> diff --git a/drivers/char/umem.c b/drivers/char/umem.c
> index 4d031b5..853f1ce 100644
> --- a/drivers/char/umem.c
> +++ b/drivers/char/umem.c
> @@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
>    * vmf->page = fake_vmf->page;
>    */
>  
> - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
>   if (!page)
>   return VM_FAULT_OOM;
>   if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
>  
>  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复:  [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-05 12:18         ` Isaku Yamahata
@ 2012-01-05 15:02           ` Tommy Tang
       [not found]           ` <4F05BB68.9050302@hotmail.com>
  2012-01-06  7:02           ` thfbjyddx
  2 siblings, 0 replies; 42+ messages in thread
From: Tommy Tang @ 2012-01-05 15:02 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


qemu -m 256 -hda xxx -monitor stdio -enable-kvm -postcopy -incoming
tcp:xxx:4444 -vnc :1
I don't think there is anything wrong with it.

On 2012/1/5 20:18, Isaku Yamahata wrote:
> Hmm, this sounds like you haven't specified -postcopy option at the
> incoming qemu.
> How did you start incoming qemu?
>
>
> On Thu, Jan 05, 2012 at 07:10:42PM +0800, Tommy wrote:
>> After I use this series of patches, but the migration failed.
>> 2, I start migrate -d -p -n tcp:xxx:4444 on the outgoing node
>> 2, on the incoming part, the qemu get stuck and migration failed
>> the  destnation can not typing any more
>>  
>> today I found it's just at qemu_loadvm_state, just after the while loop ,maybe
>> in cpu_synchronize_all_post_init
>> I think there is some problems with qemu side for it doesn't get to the umem
>> part
>> I'm not sure about the problem
>> do you have some suggestion?
>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>> Tommy
>>  
>> From: Isaku Yamahata
>> Date: 2012-01-05 18:48
>> To: thfbjyddx
>> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
>> Subject: Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
>> On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
>>> hi,
>>> I've tried to use this patch,
>>  
>> Oh great! Can we share your results?
>>  
>>  
>>> but it doesn't work for compiling error on
>>>  
>>>  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
>>> virtual_address?
>>>  
>>> I guess it's for the wrong kernel version?
>>> can you give me some detail about this or any clue?
>>> 3x 
>>  
>> Thank you for report. The following should fix.
>> It depends on kernel configuration. My config didn't catch it.
>>  
>> diff --git a/drivers/char/umem.c b/drivers/char/umem.c
>> index 4d031b5..853f1ce 100644
>> --- a/drivers/char/umem.c
>> +++ b/drivers/char/umem.c
>> @@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
>>    * vmf->page = fake_vmf->page;
>>    */
>>  
>> - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
>> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
>>   if (!page)
>>   return VM_FAULT_OOM;
>>   if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
>>  
>>  
>>  
>> -- 
>> yamahata
>>  
>>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复:  [PATCH 2/2] umem: chardevice for kvm postcopy
       [not found]           ` <4F05BB68.9050302@hotmail.com>
@ 2012-01-05 15:05             ` Tommy Tang
  0 siblings, 0 replies; 42+ messages in thread
From: Tommy Tang @ 2012-01-05 15:05 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


sorry, it's:
qemu -m 256 -hda xxx -monitor stdio -enable-kvm -postcopy -incoming
tcp:0:4444 -vnc :1

anything wrong?

On 2012/1/5 23:02, Tommy Tang wrote:
> qemu -m 256 -hda xxx -monitor stdio -enable-kvm -postcopy -incoming
> tcp:xxx:4444 -vnc :1
> I think it doesn't go wrong
>
> On 2012/1/5 20:18, Isaku Yamahata wrote:
>> Hmm, this sounds like you haven't specified -postcopy option at the
>> incoming qemu.
>> How did you start incoming qemu?
>>
>>
>> On Thu, Jan 05, 2012 at 07:10:42PM +0800, Tommy wrote:
>>> After I use this series of patches, but the migration failed.
>>> 2, I start migrate -d -p -n tcp:xxx:4444 on the outgoing node
>>> 2, on the incoming part, the qemu get stuck and migration failed
>>> the  destnation can not typing any more
>>>  
>>> today I found it's just at qemu_loadvm_state, just after the while loop ,maybe
>>> in cpu_synchronize_all_post_init
>>> I think there is some problems with qemu side for it doesn't get to the umem
>>> part
>>> I'm not sure about the problem
>>> do you have some suggestion?
>>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>>> Tommy
>>>  
>>> From: Isaku Yamahata
>>> Date: 2012-01-05 18:48
>>> To: thfbjyddx
>>> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
>>> Subject: Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
>>> On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
>>>> hi,
>>>> I've tried to use this patch,
>>>  
>>> Oh great! Can we share your results?
>>>  
>>>  
>>>> but it doesn't work for compiling error on
>>>>  
>>>>  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
>>>> virtual_address?
>>>>  
>>>> I guess it's for the wrong kernel version?
>>>> can you give me some detail about this or any clue?
>>>> 3x 
>>>  
>>> Thank you for report. The following should fix.
>>> It depends on kernel configuration. My config didn't catch it.
>>>  
>>> diff --git a/drivers/char/umem.c b/drivers/char/umem.c
>>> index 4d031b5..853f1ce 100644
>>> --- a/drivers/char/umem.c
>>> +++ b/drivers/char/umem.c
>>> @@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
>>>    * vmf->page = fake_vmf->page;
>>>    */
>>>  
>>> - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
>>> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
>>>   if (!page)
>>>   return VM_FAULT_OOM;
>>>   if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
>>>  
>>>  
>>>  
>>> -- 
>>> yamahata
>>>  
>>>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-05 12:18         ` Isaku Yamahata
  2012-01-05 15:02           ` Tommy Tang
       [not found]           ` <4F05BB68.9050302@hotmail.com>
@ 2012-01-06  7:02           ` thfbjyddx
  2012-01-06 17:13             ` [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
  2 siblings, 1 reply; 42+ messages in thread
From: thfbjyddx @ 2012-01-06  7:02 UTC (permalink / raw)
  To: Isaku Yamahata; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh


Hi,
Can you tell me the base version of qemu?
The postcopy patches cause some conflicts on the qemu tree I cloned from git.
Thanks!



Tommy

From: Isaku Yamahata
Date: 2012-01-05 20:18
To: Tommy
CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
Subject: Re: [Qemu-devel]回复: [PATCH 2/2] umem: chardevice for kvm postcopy
Hmm, this sounds like you haven't specified -postcopy option at the
incoming qemu.
How did you start incoming qemu?


On Thu, Jan 05, 2012 at 07:10:42PM +0800, Tommy wrote:
> After I use this series of patches, but the migration failed.
> 2, I start migrate -d -p -n tcp:xxx:4444 on the outgoing node
> 2, on the incoming part, the qemu get stuck and migration failed
> the  destnation can not typing any more
>  
> today I found it's just at qemu_loadvm_state, just after the while loop ,maybe
> in cpu_synchronize_all_post_init
> I think there is some problems with qemu side for it doesn't get to the umem
> part
> I'm not sure about the problem
> do you have some suggestion?
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-05 18:48
> To: thfbjyddx
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
> On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
> > hi,
> > I've tried to use this patch,
>  
> Oh great! Can we share your results?
>  
>  
> > but it doesn't work for compiling error on
> >  
> >  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
> > virtual_address?
> >  
> > I guess it's for the wrong kernel version?
> > can you give me some detail about this or any clue?
> > 3x 
>  
> Thank you for report. The following should fix.
> It depends on kernel configuration. My config didn't catch it.
>  
> diff --git a/drivers/char/umem.c b/drivers/char/umem.c
> index 4d031b5..853f1ce 100644
> --- a/drivers/char/umem.c
> +++ b/drivers/char/umem.c
> @@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
>    * vmf->page = fake_vmf->page;
>    */
>  
> - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
>   if (!page)
>   return VM_FAULT_OOM;
>   if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
>  
>  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
  2012-01-06  7:02           ` thfbjyddx
@ 2012-01-06 17:13             ` Isaku Yamahata
  0 siblings, 0 replies; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-06 17:13 UTC (permalink / raw)
  To: thfbjyddx; +Cc: t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Fri, Jan 06, 2012 at 03:02:00PM +0800, thfbjyddx wrote:
> Hi,
> Can you tell me the base version of the qemu?
> the postcopy patches make some conflicts on the qemu which I clone from the git

03ecd2c80a64d030a22fe67cc7a60f24e17ff211


> Thanks! 
> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Tommy
>  
> From: Isaku Yamahata
> Date: 2012-01-05 20:18
> To: Tommy
> CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> Subject: Re: [Qemu-devel]回复: [PATCH 2/2] umem: chardevice for kvm postcopy
> Hmm, this sounds like you haven't specified -postcopy option at the
> incoming qemu.
> How did you start incoming qemu?
>  
>  
> On Thu, Jan 05, 2012 at 07:10:42PM +0800, Tommy wrote:
> > After I use this series of patches, but the migration failed.
> > 2, I start migrate -d -p -n tcp:xxx:4444 on the outgoing node
> > 2, on the incoming part, the qemu get stuck and migration failed
> > the  destnation can not typing any more
> >  
> >
>  today I found it's just at qemu_loadvm_state, just after the while loop ,maybe
> > in cpu_synchronize_all_post_init
> > I think there is some problems with qemu side for it doesn't get to the umem
> > part
> > I'm not sure about the problem
> > do you have some suggestion?
> > ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> ━
> > Tommy
> >  
> > From: Isaku Yamahata
> > Date: 2012-01-05 18:48
> > To: thfbjyddx
> > CC: t.hirofuchi; qemu-devel; kvm; satoshi.itoh
> > Subject: Re: [Qemu-devel] 回复: [PATCH 2/2] umem: chardevice for kvm postcopy
> > On Thu, Jan 05, 2012 at 12:08:50PM +0800, thfbjyddx wrote:
> > > hi,
> > > I've tried to use this patch,
> >  
> > Oh great! Can we share your results?
> >  
> >  
> > > but it doesn't work for compiling error on
> > >  
> > >  page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);//vmf->
> > > virtual_address?
> > >  
> > > I guess it's for the wrong kernel version?
> > > can you give me some detail about this or any clue?
> > > 3x 
> >  
> > Thank you for report. The following should fix.
> > It depends on kernel configuration. My config didn't catch it.
> >  
> > diff --git a/drivers/char/umem.c b/drivers/char/umem.c
> > index 4d031b5..853f1ce 100644
> > --- a/drivers/char/umem.c
> > +++ b/drivers/char/umem.c
> > @@ -129,7 +129,7 @@ static int umem_minor_fault(struct umem *umem,
> >    * vmf->page = fake_vmf->page;
> >    */
> >  
> > - page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
> > + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->virtual_address);
> >   if (!page)
> >   return VM_FAULT_OOM;
> >   if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) {
> >  
> >  
> >  
> > -- 
> > yamahata
> >  
> >  
>  
> -- 
> yamahata
>  
>  

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-03 14:25                             ` Andrea Arcangeli
@ 2012-01-12 13:57                               ` Avi Kivity
  2012-01-13  2:06                                 ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2012-01-12 13:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Isaku Yamahata,
	Paolo Bonzini

On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
>  
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> > 
> > Would this also let you set different pgprots for different pages in the 
> > same VMA?  It would be useful for write barriers in garbage collectors 
> > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again. 

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task.  I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.
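
As a purely hypothetical sketch of that idea (none of these structures exist
in the posted patch or in the kernel; the names are invented here), such an
fd-based fault/ack protocol could be framed as fixed-size records read from
and written to the same fd, so that the serving end could live either in a
userspace daemon or in an in-kernel RDMA backend:

/* Hypothetical framing only -- nothing below exists in the patch or kernel.
 * Each fault produces one request record on the fd; the pager answers with
 * an explicit ack once the page is resident, so no signal delivery is
 * needed and the reply can be generated in-kernel for RDMA. */
#include <linux/types.h>

struct page_fault_req {
	__u64	pgoff;		/* faulting guest page offset */
	__u32	flags;		/* e.g. write fault, prefault hint */
	__u32	token;		/* echoed back in the ack */
};

struct page_fault_ack {
	__u64	pgoff;		/* page that is now resident */
	__u32	token;		/* matches page_fault_req.token */
	__s32	error;		/* 0 on success */
};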

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
ptes is cheap, removing them is not.  I wonder if we can make a
write-only page?  Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC.  Probably too weird.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-04  3:03                           ` Isaku Yamahata
@ 2012-01-12 13:59                             ` Avi Kivity
  2012-01-13  1:09                               ` Benoit Hudzia
  2012-01-13  2:09                               ` Andrea Arcangeli
  0 siblings, 2 replies; 42+ messages in thread
From: Avi Kivity @ 2012-01-12 13:59 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> And it would be easy to convert a separated daemon process into a thread
> in qemu.
>
> I think it should be done out side of qemu process for some reasons.
> (I just repeat same discussion at the KVM-forum because no one remembers
> it)
>
> - ptrace (and its variant)
>   Some people want to investigate guest ram on host (qemu stopped or lively).
>   For example, enhance crash utility and it will attach qemu process and
>   debug guest kernel.

To debug the guest kernel you don't need to stop qemu itself.   I agree
it's a problem for qemu debugging though.

>
> - core dump
>   qemu process may core-dump.
>   As postmortem analysis, people want to investigate guest RAM.
>   Again enhance crash utility and it will read the core file and analyze
>   guest kernel.
>   When creating core, the qemu process is already dead.

Yes, strong point.

> It precludes the above possibilities to handle fault in qemu process.

I agree.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-12 13:59                             ` Avi Kivity
@ 2012-01-13  1:09                               ` Benoit Hudzia
  2012-01-13  1:31                                 ` Takuya Yoshikawa
  2012-01-13  2:03                                 ` Isaku Yamahata
  2012-01-13  2:09                               ` Andrea Arcangeli
  1 sibling, 2 replies; 42+ messages in thread
From: Benoit Hudzia @ 2012-01-13  1:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Isaku Yamahata

Hi,

Sorry to hijack the thread like that; however, I would just like
to inform you that we recently achieved a milestone in the
research project I'm leading. We enhanced KVM in order to deliver
post-copy live migration using RDMA at the kernel level.

A few points on the architecture of the system:

* RDMA communication engine in kernel (you can use Soft-iWARP or Soft-RoCE
if you don't have hardware acceleration; however, we also support
standard RDMA-enabled NICs).
* Naturally, pages are transferred with a zero-copy protocol
* Leverage the async page fault system.
* Pre-paging / pre-faulting
* No context switch, as everything is handled within the kernel using
the page fault system.
* Hybrid migration (pre + post copy) available
* Relies on an independent kernel module
* No modification to the KVM kernel module
* Minimal modification to the qemu-kvm code
* We plan to add the page prioritization algorithm in order to optimise the
pre-paging and background transfer


You can learn a little bit more and see a demo here:
http://tinyurl.com/8xa2bgl
I hope to be able to provide more detail on the design soon, as well
as a more concrete demo of the system (live migration of a VM running
large enterprise apps such as ERP or an in-memory DB).

Note: this is just a stepping stone, as the post-copy live migration mainly
enables us to validate the architecture design and code.

Regards
Benoit


On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> And it would be easy to convert a separated daemon process into a thread
>> in qemu.
>>
>> I think it should be done out side of qemu process for some reasons.
>> (I just repeat same discussion at the KVM-forum because no one remembers
>> it)
>>
>> - ptrace (and its variant)
>>   Some people want to investigate guest ram on host (qemu stopped or lively).
>>   For example, enhance crash utility and it will attach qemu process and
>>   debug guest kernel.
>
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.
>
>>
>> - core dump
>>   qemu process may core-dump.
>>   As postmortem analysis, people want to investigate guest RAM.
>>   Again enhance crash utility and it will read the core file and analyze
>>   guest kernel.
>>   When creating core, the qemu process is already dead.
>
> Yes, strong point.
>
>> It precludes the above possibilities to handle fault in qemu process.
>
> I agree.
>
>
> --
> error compiling committee.c: too many arguments to function
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
" The production of too many useful things results in too many useless people"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  1:09                               ` Benoit Hudzia
@ 2012-01-13  1:31                                 ` Takuya Yoshikawa
  2012-01-13  9:40                                   ` Benoit Hudzia
  2012-01-13  2:03                                 ` Isaku Yamahata
  1 sibling, 1 reply; 42+ messages in thread
From: Takuya Yoshikawa @ 2012-01-13  1:31 UTC (permalink / raw)
  To: Benoit Hudzia
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Isaku Yamahata, Avi Kivity

(2012/01/13 10:09), Benoit Hudzia wrote:
> Hi,
>
> Sorry to jump to hijack the thread  like that , however i would like
> to just to inform you  that we recently achieve a milestone out of the
> research project I'm leading. We enhanced KVM in order to deliver
> post copy live migration using RDMA at kernel level.
>
> Few point on the architecture of the system :
>
> * RDMA communication engine in kernel ( you can use soft iwarp or soft
> ROCE if you don't have hardware acceleration, however we also support
> standard RDMA enabled NIC) .
> * Naturally Page are transferred with Zerop copy protocol
> * Leverage the async page fault system.
> * Pre paging / faulting
> * No context switch as everything is handled within kernel and using
> the page fault system.
> * Hybrid migration ( pre + post copy) available
> * Rely on an independent Kernel Module
> * No modification to the KVM kernel Module
> * Minimal Modification to the Qemu-Kvm code
> * We plan to add the page prioritization algo in order to optimise the
> pre paging algo and background transfer
>
>
> You can learn a little bit more and see a demo here:
> http://tinyurl.com/8xa2bgl
> I hope to be able to provide more detail on the design soon. As well
> as more concrete demo of the system ( live migration of VM running
> large  enterprise apps such as ERP or In memory DB)
>
> Note: this is just a step stone as the post copy live migration mainly
> enable us to validate the architecture design and  code.

Do you have any plan to send the patch series of your implementation?

	Takuya

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  1:09                               ` Benoit Hudzia
  2012-01-13  1:31                                 ` Takuya Yoshikawa
@ 2012-01-13  2:03                                 ` Isaku Yamahata
  2012-01-13  2:15                                   ` Isaku Yamahata
  2012-01-13  9:48                                   ` Benoit Hudzia
  1 sibling, 2 replies; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-13  2:03 UTC (permalink / raw)
  To: Benoit Hudzia
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Avi Kivity

Very interesting. We can cooperate for better (postcopy) live migration.
The code doesn't seem to be available yet; I'm eager to see it.


On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
> Hi,
> 
> Sorry to jump to hijack the thread  like that , however i would like
> to just to inform you  that we recently achieve a milestone out of the
> research project I'm leading. We enhanced KVM in order to deliver
> post copy live migration using RDMA at kernel level.
> 
> Few point on the architecture of the system :
> 
> * RDMA communication engine in kernel ( you can use soft iwarp or soft
> ROCE if you don't have hardware acceleration, however we also support
> standard RDMA enabled NIC) .

Do you mean the InfiniBand subsystem?


> * Naturally Page are transferred with Zerop copy protocol
> * Leverage the async page fault system.
> * Pre paging / faulting
> * No context switch as everything is handled within kernel and using
> the page fault system.
> * Hybrid migration ( pre + post copy) available

Ah, I've also been planning this.
After the pre-copy phase, is the dirty bitmap sent?

So far I've naively thought that the pre-copy phase would be finished by the
number of iterations. On the other hand, your choice is a timeout on the
pre-copy phase. Do you have a rationale, or was it just natural for you?
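
To make the two policies being compared concrete, here is a toy sketch
(invented here, not taken from either implementation) of a switch-over check
that could bound pre-copy either by iteration count or by elapsed time:

/* Toy illustration only -- not from either implementation. */
#include <stdbool.h>

struct precopy_limits {
	unsigned int max_iterations;	/* policy A: bound the iteration count */
	unsigned long timeout_ms;	/* policy B: bound the elapsed time */
	unsigned long dirty_threshold;	/* common: working set has converged */
};

static bool switch_to_postcopy(const struct precopy_limits *lim,
			       unsigned int iterations,
			       unsigned long elapsed_ms,
			       unsigned long dirty_pages)
{
	if (dirty_pages <= lim->dirty_threshold)
		return true;			/* pre-copy converged */
	if (iterations >= lim->max_iterations)
		return true;			/* policy A */
	return elapsed_ms >= lim->timeout_ms;	/* policy B */
}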


> * Rely on an independent Kernel Module
> * No modification to the KVM kernel Module
> * Minimal Modification to the Qemu-Kvm code
> * We plan to add the page prioritization algo in order to optimise the
> pre paging algo and background transfer

Where do you plan to implement it? In qemu or in your kernel module?
This algorithm could be shared.

thanks in advance.

> You can learn a little bit more and see a demo here:
> http://tinyurl.com/8xa2bgl
> I hope to be able to provide more detail on the design soon. As well
> as more concrete demo of the system ( live migration of VM running
> large  enterprise apps such as ERP or In memory DB)
> 
> Note: this is just a step stone as the post copy live migration mainly
> enable us to validate the architecture design and  code.
> 
> Regards
> Benoit
> 
> 
> 
> 
> 
> 
> 
> Regards
> Benoit
> 
> 
> On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> >> And it would be easy to convert a separated daemon process into a thread
> >> in qemu.
> >>
> >> I think it should be done out side of qemu process for some reasons.
> >> (I just repeat same discussion at the KVM-forum because no one remembers
> >> it)
> >>
> >> - ptrace (and its variant)
> >>   Some people want to investigate guest ram on host (qemu stopped or lively).
> >>   For example, enhance crash utility and it will attach qemu process and
> >>   debug guest kernel.
> >
> > To debug the guest kernel you don't need to stop qemu itself.   I agree
> > it's a problem for qemu debugging though.
> >
> >>
> >> - core dump
> >>   qemu process may core-dump.
> >>   As postmortem analysis, people want to investigate guest RAM.
> >>   Again enhance crash utility and it will read the core file and analyze
> >>   guest kernel.
> >>   When creating core, the qemu process is already dead.
> >
> > Yes, strong point.
> >
> >> It precludes the above possibilities to handle fault in qemu process.
> >
> > I agree.
> >
> >
> > --
> > error compiling committee.c: too many arguments to function
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> " The production of too many useful things results in too many useless people"
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-12 13:57                               ` Avi Kivity
@ 2012-01-13  2:06                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2012-01-13  2:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Isaku Yamahata,
	Paolo Bonzini

On Thu, Jan 12, 2012 at 03:57:47PM +0200, Avi Kivity wrote:
> On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
> >  
> > > > So the problem is if we do it in
> > > > userland with the current functionality you'll run out of VMAs and
> > > > slowdown performance too much.
> > > >
> > > > But all you need is the ability to map single pages in the address
> > > > space.
> > > 
> > > Would this also let you set different pgprots for different pages in the 
> > > same VMA?  It would be useful for write barriers in garbage collectors 
> > > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > > permissions; however, they still put some strain on the VM subsystem.
> >
> > Changing permission sounds more tricky as more code may make
> > assumptions on the vma before checking the pte.
> >
> > Adding a magic unmapped pte entry sounds fairly safe because there's
> > the migration pte already used by migrate which halts page faults and
> > wait, that creates a precedent. So I guess we could reuse the same
> > code that already exists for the migration entry and we'd need to fire
> > a signal and returns to userland instead of waiting. The signal should
> > be invoked before the page fault will trigger again. 
> 
> Delivering signals is slow, and you can't use signalfd for it, because
> that can be routed to a different task.  I would like an fd based
> protocol with an explicit ack so the other end can be implemented by the
> kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
> via a kvm ioeventfd/irqfd.

As long as we tell qemu to run some per-vcpu handler (or io-thread
handler in case of access through the I/O thread) before accessing the
same memory address again, I don't see a problem in using a faster
mechanism than signals. In theory we could even implement this
functionality in kvm itself and just add two ioctls to the kvm fd
instead of a syscall, but then it'd be mangling VM things like
page->index.

> > Of course if the
> > signal returns and does nothing it'll loop at 100% cpu load but that's
> > ok. Maybe it's possible to tweak the permissions but it will need a
> > lot more thoughts. Specifically for anon pages marking them readonly
> > sounds possible if they are supposed to behave like regular COWs (not
> > segfaulting or anything), as you already can have a mixture of
> > readonly and read-write ptes (not to tell readonly KSM pages), but for
> > any other case it's non trivial. Last but not the least the API here
> > would be like a vma-less-mremap, moving a page from one address to
> > another without modifying the vmas, the permission tweak sounds more
> > like an mprotect, so I'm unsure if it could do both or if it should be
> > an optimization to consider independently.
> 
> Doesn't this stuff require tlb flushes across all threads?

It does, to do it zerocopy and atomic we must move the pte.

> > In theory I suspect we could also teach mremap to do a
> > not-vma-mangling mremap if we move pages that aren't shared and so we
> > can adjust the page->index of the pages, instead of creating new vmas
> > at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> > syscall that only works on top of fault-unmapped areas is simpler and
> > safer. mremap semantics requires nuking the dst region before the move
> > starts. If we would teach mremap how to handle the fault-unmapped
> > areas we could just add one syscall prepare_fault_area (or whatever
> > name you choose).
> >
> > The locking of doing a vma-less-mremap still sounds tricky but I doubt
> > you can avoid that locking complexity by using the chardevice as long
> > as the chardevice backed-memory still allows THP, migration and swap,
> > if you want to do it atomic-zerocopy and I think zerocopy would be
> > better especially if the network card is fast and all vcpus are
> > faulting into unmapped pages simultaneously so triggering heavy amount
> > of copying from all physical cpus.
> >
> > I don't mean the current device driver doing a copy_user won't work or
> > is bad idea, it's more self contained and maybe easier to merge
> > upstream. I'm just presenting another option more VM integrated
> > zerocopy with just 2 syscalls.
> 
> Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
> ptes is cheap, removing them is not.  I wonder if we can make a
> write-only page?  Of course it's unmapped for cpu access, but we can
> allow DMA write access from the NIC.  Probably too wierd.

Keeping it mapped in two places gives problems with the non-linearity
of the page->index. We have one page->index and two different
vma->vm_pgoff, so it's not just a problem of being read-only. We could
even let it be read-write as long as it is nuked when we swap. The problem
is we can't leave it there; we must update page->index, and if we don't, the
rmap walk breaks and swap/migrate with it... we would allow any thread
to see random memory through that window, even if read-only
(post-swapout), and it wouldn't be secure in a multiuser system.

Maybe we can find a way to have qemu call a "recv" directly on the
guest physical address of the final destination and never expose the
received page to userland during the receive. Not having to pass
through a buffer mapped in userland would avoid the tlb flushes. That
however would prevent lzo compression or any other trick that we might
do, and it'd force zerocopy behavior all the way through the kernel
recv syscall; after recv we would just return from the "post-copy
handler" and return to guest mode (or return and access the data in the
io-thread case). Maybe splice or sendfile or some other trick allows
that. The problem is we can't have the pte and sptes established until
the full receive is complete, or other vcpu threads will see a partial
receive. I suspect the post-migrate faults will be network I/O
dominated anyway.

As for doing a copy, that doesn't require the TLB flush; the problem is
doing it atomically without exposing partial data to the vcpus and the
io-thread. I can imagine various ways to achieve that, but I don't see
how that can be done for "free", plus you pay for the copy too.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-12 13:59                             ` Avi Kivity
  2012-01-13  1:09                               ` Benoit Hudzia
@ 2012-01-13  2:09                               ` Andrea Arcangeli
  1 sibling, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2012-01-13  2:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Isaku Yamahata, t.hirofuchi, qemu-devel, kvm, satoshi.itoh

On Thu, Jan 12, 2012 at 03:59:59PM +0200, Avi Kivity wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > And it would be easy to convert a separated daemon process into a thread
> > in qemu.
> >
> > I think it should be done out side of qemu process for some reasons.
> > (I just repeat same discussion at the KVM-forum because no one remembers
> > it)
> >
> > - ptrace (and its variant)
> >   Some people want to investigate guest ram on host (qemu stopped or lively).
> >   For example, enhance crash utility and it will attach qemu process and
> >   debug guest kernel.
> 
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.

But you need to debug postcopy migration itself with gdb too, don't
you? I don't see a big benefit in trying to prevent gdb from seeing
what is really going on in the qemu image.

> > - core dump
> >   qemu process may core-dump.
> >   As postmortem analysis, people want to investigate guest RAM.
> >   Again enhance crash utility and it will read the core file and analyze
> >   guest kernel.
> >   When creating core, the qemu process is already dead.
> 
> Yes, strong point.
> 
> > It precludes the above possibilities to handle fault in qemu process.
> 
> I agree.

On the receiving node, if the memory is not there yet (and it isn't),
I'm not sure how you plan to get a clean core dump (as if live
migration weren't running) by preventing the kernel from dumping zeroes
when qemu crashes during postcopy migration. It surely won't be the
kernel crash handler completing the postcopy migration; it wouldn't
even know where to write the data in memory.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  2:03                                 ` Isaku Yamahata
@ 2012-01-13  2:15                                   ` Isaku Yamahata
  2012-01-13  9:55                                     ` Benoit Hudzia
  2012-01-13  9:48                                   ` Benoit Hudzia
  1 sibling, 1 reply; 42+ messages in thread
From: Isaku Yamahata @ 2012-01-13  2:15 UTC (permalink / raw)
  To: Benoit Hudzia
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Avi Kivity

One more question.
Does your architecture/implementation (in theory) allow KVM memory
features like swap, KSM, THP?


On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
> 
> 
> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
> > Hi,
> > 
> > Sorry to jump to hijack the thread  like that , however i would like
> > to just to inform you  that we recently achieve a milestone out of the
> > research project I'm leading. We enhanced KVM in order to deliver
> > post copy live migration using RDMA at kernel level.
> > 
> > Few point on the architecture of the system :
> > 
> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
> > ROCE if you don't have hardware acceleration, however we also support
> > standard RDMA enabled NIC) .
> 
> Do you mean infiniband subsystem?
> 
> 
> > * Naturally Page are transferred with Zerop copy protocol
> > * Leverage the async page fault system.
> > * Pre paging / faulting
> > * No context switch as everything is handled within kernel and using
> > the page fault system.
> > * Hybrid migration ( pre + post copy) available
> 
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?
> 
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?
> 
> 
> > * Rely on an independent Kernel Module
> > * No modification to the KVM kernel Module
> > * Minimal Modification to the Qemu-Kvm code
> > * We plan to add the page prioritization algo in order to optimise the
> > pre paging algo and background transfer
> 
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.
> 
> thanks in advance.
> 
> > You can learn a little bit more and see a demo here:
> > http://tinyurl.com/8xa2bgl
> > I hope to be able to provide more detail on the design soon. As well
> > as more concrete demo of the system ( live migration of VM running
> > large  enterprise apps such as ERP or In memory DB)
> > 
> > Note: this is just a step stone as the post copy live migration mainly
> > enable us to validate the architecture design and  code.
> > 
> > Regards
> > Benoit
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Regards
> > Benoit
> > 
> > 
> > On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > >> And it would be easy to convert a separated daemon process into a thread
> > >> in qemu.
> > >>
> > >> I think it should be done out side of qemu process for some reasons.
> > >> (I just repeat same discussion at the KVM-forum because no one remembers
> > >> it)
> > >>
> > >> - ptrace (and its variant)
> > >>   Some people want to investigate guest ram on host (qemu stopped or lively).
> > >>   For example, enhance crash utility and it will attach qemu process and
> > >>   debug guest kernel.
> > >
> > > To debug the guest kernel you don't need to stop qemu itself.  I agree
> > > it's a problem for qemu debugging though.
> > >
> > >>
> > >> - core dump
> > >>   qemu process may core-dump.
> > >>   As postmortem analysis, people want to investigate guest RAM.
> > >>   Again enhance crash utility and it will read the core file and analyze
> > >>   guest kernel.
> > >>   When creating core, the qemu process is already dead.
> > >
> > > Yes, strong point.
> > >
> > >> It precludes the above possibilities to handle fault in qemu process.
> > >
> > > I agree.
> > >
> > >
> > > --
> > > error compiling committee.c: too many arguments to function
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > 
> > 
> > 
> > -- 
> > " The production of too many useful things results in too many useless people"
> > 
> 
> -- 
> yamahata
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  1:31                                 ` Takuya Yoshikawa
@ 2012-01-13  9:40                                   ` Benoit Hudzia
  0 siblings, 0 replies; 42+ messages in thread
From: Benoit Hudzia @ 2012-01-13  9:40 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Isaku Yamahata, Avi Kivity

Yes, we plan to release the patches as soon as we have cleaned up the
code and we get the green light from our company (and sadly that can
take months..)

On 13 January 2012 01:31, Takuya Yoshikawa
<yoshikawa.takuya@oss.ntt.co.jp> wrote:
> (2012/01/13 10:09), Benoit Hudzia wrote:
>>
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>>
>>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>
>
> Do you have any plan to send the patch series of your implementation?
>
>        Takuya



-- 
" The production of too many useful things results in too many useless people"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  2:03                                 ` Isaku Yamahata
  2012-01-13  2:15                                   ` Isaku Yamahata
@ 2012-01-13  9:48                                   ` Benoit Hudzia
  1 sibling, 0 replies; 42+ messages in thread
From: Benoit Hudzia @ 2012-01-13  9:48 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Avi Kivity

On 13 January 2012 02:03, Isaku Yamahata <yamahata@valinux.co.jp> wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
>
>
> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>
> Do you mean infiniband subsystem?

Yes, basically any software or hardware implementation that supports
the standard RDMA / OFED verbs stack in kernel.
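
To make the in-kernel verbs usage concrete, here is a minimal,
hypothetical sketch (not our actual code) of pulling one missing guest
page from the source with an RDMA READ, using the in-kernel verbs API
roughly as it looks in the 3.x kernels this discussion targets;
qp/lkey/rkey setup and completion handling are assumed to happen
elsewhere:

#include <rdma/ib_verbs.h>

/*
 * Hypothetical sketch: issue an RDMA READ of one page from the source
 * node into a locally DMA-mapped page.  The completion reaped from the
 * CQ (not shown) tells us the page has arrived and can be mapped.
 */
static int pull_page_rdma_read(struct ib_qp *qp, u32 lkey,
                               u64 local_dma_addr, u64 remote_addr, u32 rkey)
{
        struct ib_sge sge = {
                .addr   = local_dma_addr,       /* DMA address of the local page */
                .length = PAGE_SIZE,
                .lkey   = lkey,
        };
        struct ib_send_wr wr = {
                .opcode     = IB_WR_RDMA_READ,
                .sg_list    = &sge,
                .num_sge    = 1,
                .send_flags = IB_SEND_SIGNALED,
        };
        struct ib_send_wr *bad_wr;

        wr.wr.rdma.remote_addr = remote_addr;   /* guest page on the source */
        wr.wr.rdma.rkey        = rkey;

        return ib_post_send(qp, &wr, &bad_wr);
}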
>
>
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?

We send over the dirty bitmap, yes, in order to identify what is left
to be transferred. Combined with the priority algo, we then prioritise
those pages for the background transfer (rough sketch below).
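
Purely to illustrate the idea (the queueing helpers and the priority
function below are hypothetical placeholders, not our code): walk the
received dirty bitmap and queue the still-missing pages for background
pull, highest priority first.

#include <linux/bitops.h>

struct pull_queue;                                        /* hypothetical */
void queue_background_pull(struct pull_queue *q,
                           unsigned long pfn, int prio);  /* hypothetical */
int page_priority(unsigned long pfn);                     /* hypothetical */

/*
 * Pages still set in the dirty bitmap received at the end of pre-copy
 * are exactly the ones missing on the destination, so queue them for
 * background transfer in priority order.
 */
static void schedule_background_pulls(struct pull_queue *q,
                                      const unsigned long *dirty_bitmap,
                                      unsigned long nr_pages)
{
        unsigned long pfn;

        for_each_set_bit(pfn, dirty_bitmap, nr_pages)
                queue_background_pull(q, pfn, page_priority(pfn));
}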

>
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?


The main rationale behind that is that any normal sysadmin tends to be
human, and a live migration iteration cycle has no meaning for him. As
a result we preferred to provide a time constraint rather than an
iteration constraint. Also, it is hard to estimate how much time and
bandwidth will be used per iteration cycle, which leads to poor
determinism.

>
>
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.

Yes, we actually plan to release the algo first, before the RDMA post
copy. The algo can be used as a standard optimisation of the normal
pre-copy process (as demonstrated in my talk at the KVM Forum), and the
priority is reversed for the postcopy page pull. My colleague Aidan
Shribman is done with the implementation and we are now in the testing
phase in order to quantify the improvement.


>
> thanks in advance.
>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>>
>> Regards
>> Benoit
>>
>>
>>
>>
>>
>>
>>
>> Regards
>> Benoit
>>
>>
>> On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
>> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> >> And it would be easy to convert a separated daemon process into a thread
>> >> in qemu.
>> >>
>> >> I think it should be done out side of qemu process for some reasons.
>> >> (I just repeat same discussion at the KVM-forum because no one remembers
>> >> it)
>> >>
>> >> - ptrace (and its variant)
>> >>   Some people want to investigate guest ram on host (qemu stopped or lively).
>> >>   For example, enhance crash utility and it will attach qemu process and
>> >>   debug guest kernel.
>> >
>> > To debug the guest kernel you don't need to stop qemu itself.  I agree
>> > it's a problem for qemu debugging though.
>> >
>> >>
>> >> - core dump
>> >>   qemu process may core-dump.
>> >>   As postmortem analysis, people want to investigate guest RAM.
>> >>   Again enhance crash utility and it will read the core file and analyze
>> >>   guest kernel.
>> >>   When creating core, the qemu process is already dead.
>> >
>> > Yes, strong point.
>> >
>> >> It precludes the above possibilities to handle fault in qemu process.
>> >
>> > I agree.
>> >
>> >
>> > --
>> > error compiling committee.c: too many arguments to function
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> " The production of too many useful things results in too many useless people"
>>
>
> --
> yamahata



-- 
" The production of too many useful things results in too many useless people"

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
  2012-01-13  2:15                                   ` Isaku Yamahata
@ 2012-01-13  9:55                                     ` Benoit Hudzia
  0 siblings, 0 replies; 42+ messages in thread
From: Benoit Hudzia @ 2012-01-13  9:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, kvm, satoshi.itoh, t.hirofuchi, qemu-devel,
	Avi Kivity

On 13 January 2012 02:15, Isaku Yamahata <yamahata@valinux.co.jp> wrote:
> One more question.
> Does your architecture/implementation (in theory) allow KVM memory
> features like swap, KSM, THP?

* Swap: Yes, we support swap to disk (the page is pulled from swap
before being sent over); the swap process then does its job on the
other side.
* KSM: same, we support KSM. The KSMed page is broken down and split,
and the copies are sent individually (yes, sub-optimal, but it keeps
the protocol less messy); we let the KSM daemon do its job on the other
side.
* THP: stickier here. Due to time constraints we decided to support it
only partially. What that means: if we encounter a THP we break it down
to standard page granularity, since that is the memory unit we are
currently manipulating. As a result you can have THP on the source but
you won't have THP on the other side (see the sketch after this list).
           _ Note: we didn't fully explore the ramifications of THP
with RDMA; I don't know whether THP plays well with the MMU of a HW
RDMA NIC. One thing I would like to explore is whether it is possible
to break the THP down into standard pages and then reassemble them on
the other side (does anyone of you know whether it is possible to
aggregate pages to form a THP in kernel?)
* cgroup: should work transparently, but we need to do more testing to
confirm that.
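
A minimal sketch of the THP break-down step described above, assuming
the migration module already holds a reference on the page (the wrapper
itself is hypothetical); as far as I know, re-forming a THP from base
pages on the destination is normally left to khugepaged collapsing them
lazily rather than something the migration code would do explicitly:

#include <linux/huge_mm.h>
#include <linux/mm.h>

/*
 * Sketch only, not the actual module code: if the page about to be sent
 * is a transparent huge page, split it into base pages first so the
 * existing 4K per-page protocol can handle it.  split_huge_page() is
 * the mainline helper (returns 0 on success); exact locking/refcount
 * rules vary with kernel version, so treat this as control-flow
 * pseudocode.
 */
static int prepare_page_for_transfer(struct page *page)
{
        page = compound_head(page);

        if (PageTransHuge(page) && split_huge_page(page))
                return -EBUSY;          /* couldn't split; retry later */

        /* page (and its former tail pages) are now base pages */
        return 0;
}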




>
>
> On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
>> Very interesting. We can cooperate for better (postcopy) live migration.
>> The code doesn't seem available yet, I'm eager for it.
>>
>>
>> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
>> > Hi,
>> >
>> > Sorry to jump to hijack the thread  like that , however i would like
>> > to just to inform you  that we recently achieve a milestone out of the
>> > research project I'm leading. We enhanced KVM in order to deliver
>> > post copy live migration using RDMA at kernel level.
>> >
>> > Few point on the architecture of the system :
>> >
>> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> > ROCE if you don't have hardware acceleration, however we also support
>> > standard RDMA enabled NIC) .
>>
>> Do you mean infiniband subsystem?
>>
>>
>> > * Naturally Page are transferred with Zerop copy protocol
>> > * Leverage the async page fault system.
>> > * Pre paging / faulting
>> > * No context switch as everything is handled within kernel and using
>> > the page fault system.
>> > * Hybrid migration ( pre + post copy) available
>>
>> Ah, I've been also planing this.
>> After pre-copy phase, is the dirty bitmap sent?
>>
>> So far I've thought naively that pre-copy phase would be finished by the
>> number of iterations. On the other hand your choice is timeout of
>> pre-copy phase. Do you have rationale? or it was just natural for you?
>>
>>
>> > * Rely on an independent Kernel Module
>> > * No modification to the KVM kernel Module
>> > * Minimal Modification to the Qemu-Kvm code
>> > * We plan to add the page prioritization algo in order to optimise the
>> > pre paging algo and background transfer
>>
>> Where do you plan to implement? in qemu or in your kernel module?
>> This algo could be shared.
>>
>> thanks in advance.
>>
>> > You can learn a little bit more and see a demo here:
>> > http://tinyurl.com/8xa2bgl
>> > I hope to be able to provide more detail on the design soon. As well
>> > as more concrete demo of the system ( live migration of VM running
>> > large  enterprise apps such as ERP or In memory DB)
>> >
>> > Note: this is just a step stone as the post copy live migration mainly
>> > enable us to validate the architecture design and  code.
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> > On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
>> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> > >> And it would be easy to convert a separated daemon process into a thread
>> > >> in qemu.
>> > >>
>> > >> I think it should be done out side of qemu process for some reasons.
>> > >> (I just repeat same discussion at the KVM-forum because no one remembers
>> > >> it)
>> > >>
>> > >> - ptrace (and its variant)
>> > >>   Some people want to investigate guest ram on host (qemu stopped or lively).
>> > >>   For example, enhance crash utility and it will attach qemu process and
>> > >>   debug guest kernel.
>> > >
>> > > To debug the guest kernel you don't need to stop qemu itself.  I agree
>> > > it's a problem for qemu debugging though.
>> > >
>> > >>
>> > >> - core dump
>> > >>   qemu process may core-dump.
>> > >>   As postmortem analysis, people want to investigate guest RAM.
>> > >>   Again enhance crash utility and it will read the core file and analyze
>> > >>   guest kernel.
>> > >>   When creating core, the qemu process is already dead.
>> > >
>> > > Yes, strong point.
>> > >
>> > >> It precludes the above possibilities to handle fault in qemu process.
>> > >
>> > > I agree.
>> > >
>> > >
>> > > --
>> > > error compiling committee.c: too many arguments to function
>> > >
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> >
>> > --
>> > " The production of too many useful things results in too many useless people"
>> >
>>
>> --
>> yamahata
>>
>
> --
> yamahata



-- 
" The production of too many useful things results in too many useless people"

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2012-01-13  9:55 UTC | newest]

Thread overview: 42+ messages
2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
2011-12-29  1:26 ` [Qemu-devel] [PATCH 1/2] export necessary symbols Isaku Yamahata
2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
2011-12-29 11:17   ` Avi Kivity
2011-12-29 12:22     ` Isaku Yamahata
2011-12-29 12:47       ` Avi Kivity
2012-01-05  4:08   ` [Qemu-devel] 回复: " thfbjyddx
2012-01-05 10:48     ` [Qemu-devel] 回??: " Isaku Yamahata
2012-01-05 11:10       ` Tommy
2012-01-05 12:18         ` Isaku Yamahata
2012-01-05 15:02           ` Tommy Tang
     [not found]           ` <4F05BB68.9050302@hotmail.com>
2012-01-05 15:05             ` Tommy Tang
2012-01-06  7:02           ` thfbjyddx
2012-01-06 17:13             ` [Qemu-devel] 回??: [PATCH 2/2] umem: chardevice for kvm?postcopy Isaku Yamahata
2011-12-29  1:31 ` [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
2011-12-29 11:24 ` Avi Kivity
2011-12-29 12:39   ` Isaku Yamahata
2011-12-29 12:55     ` Avi Kivity
2011-12-29 13:49       ` Isaku Yamahata
2011-12-29 13:52         ` Avi Kivity
2011-12-29 14:18           ` Isaku Yamahata
2011-12-29 14:35             ` Avi Kivity
2011-12-29 14:49               ` Isaku Yamahata
2011-12-29 14:55                 ` Avi Kivity
2011-12-29 15:53                   ` Isaku Yamahata
2011-12-29 16:00                     ` Avi Kivity
2011-12-29 16:01                       ` Avi Kivity
2012-01-02 17:05                         ` Andrea Arcangeli
2012-01-02 17:55                           ` Paolo Bonzini
2012-01-03 14:25                             ` Andrea Arcangeli
2012-01-12 13:57                               ` Avi Kivity
2012-01-13  2:06                                 ` Andrea Arcangeli
2012-01-04  3:03                           ` Isaku Yamahata
2012-01-12 13:59                             ` Avi Kivity
2012-01-13  1:09                               ` Benoit Hudzia
2012-01-13  1:31                                 ` Takuya Yoshikawa
2012-01-13  9:40                                   ` Benoit Hudzia
2012-01-13  2:03                                 ` Isaku Yamahata
2012-01-13  2:15                                   ` Isaku Yamahata
2012-01-13  9:55                                     ` Benoit Hudzia
2012-01-13  9:48                                   ` Benoit Hudzia
2012-01-13  2:09                               ` Andrea Arcangeli
