Linux virtualization list
* [PATCH v7 03/11] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner
From: Pan Xinhui @ 2016-11-02  9:08 UTC
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, dave, David.Laight, rkrcmar, peterz, benh,
	konrad.wilk, will.deacon, Pan Xinhui, mingo, paulus, mpe,
	pbonzini, paulmck, boqun.feng
In-Reply-To: <1478077718-37424-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
in the two spin_on_owner functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to check
whether a vCPU is currently running. Break out of the spin loops when it
returns true.
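
As a rough userspace illustration of the exit condition described above
(not the kernel code itself), the check can be sketched as follows. The
struct owner_state type and should_stop_spinning() are hypothetical names
introduced only for this sketch; in the kernel the three inputs come from
owner->on_cpu, need_resched() and vcpu_is_preempted(task_cpu(owner)).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the kernel consults about the owner. */
struct owner_state {
	bool on_cpu;          /* owner task is currently scheduled on its CPU */
	bool vcpu_preempted;  /* the hypervisor has descheduled that vCPU */
};

/*
 * Same shape as the check added to the spin loops: give up spinning when
 * the owner is off-CPU, when the spinner itself should reschedule, or when
 * the owner's vCPU has been preempted by the host.
 */
static bool should_stop_spinning(const struct owner_state *owner,
				 bool need_resched)
{
	return !owner->on_cpu || need_resched || owner->vcpu_preempted;
}
```

The third term is the only new one: an owner that looks "on CPU" to the
guest may still make no progress if the host has preempted its vCPU.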

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
 kernel/locking/mutex.c      | 13 +++++++++++--
 kernel/locking/rwsem-xadd.c | 14 +++++++++++---
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..24face6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detect lock holder preemption issue.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +265,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..b664ce1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11

* [PATCH v7 02/11] locking/osq: Drop the overload of osq_lock()
From: Pan Xinhui @ 2016-11-02  9:08 UTC
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, dave, David.Laight, rkrcmar, peterz, benh,
	konrad.wilk, will.deacon, Pan Xinhui, mingo, paulus, mpe,
	pbonzini, paulmck, boqun.feng
In-Reply-To: <1478077718-37424-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
in osq_lock().

This is because vCPU A holds the osq lock and yields, while vCPU B waits
for its per-CPU node->locked to be set. In other words, vCPU B waits for
vCPU A to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to detect
whether a vCPU is currently running. Break out of the spin loops when it
returns true.
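
The check needs the CPU number of the previous node in the queue. OSQ
stores CPU numbers incremented by one so that 0 can mean "no CPU", which is
why the patch adds a node_cpu() helper that subtracts one. A minimal
userspace sketch of that round trip, where struct spin_node_sketch is a
hypothetical stand-in for struct optimistic_spin_node:

```c
/* Hypothetical, reduced stand-in for struct optimistic_spin_node. */
struct spin_node_sketch {
	int cpu;	/* encoded CPU number: real CPU number + 1 */
};

/* 0 is reserved for "no CPU", so CPU n is stored as n + 1. */
static inline int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Recover the real CPU number from a queued node. */
static inline int node_cpu(const struct spin_node_sketch *node)
{
	return node->cpu - 1;
}
```

Passing node_cpu(node->prev) to vcpu_is_preempted() therefore asks about
the vCPU that this waiter is queued behind.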

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
 kernel/locking/osq_lock.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..091f97f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detect lock holder preemption issue.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11

* [PATCH v7 01/11] kernel/sched: introduce vcpu preempted check interface
From: Pan Xinhui @ 2016-11-02  9:08 UTC
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, dave, David.Laight, rkrcmar, peterz, benh,
	konrad.wilk, will.deacon, Pan Xinhui, mingo, paulus, mpe,
	pbonzini, paulmck, boqun.feng
In-Reply-To: <1478077718-37424-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

This patch adds support for fixing the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect
whether a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted
check.
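
The override pattern can be sketched in userspace as below. This is only an
illustration under stated assumptions: ARCH_HAS_VCPU_PREEMPT_SKETCH,
arch_vcpu_is_preempted(), preempted_table and spins_allowed() are
hypothetical names, not part of the patch. Only the #ifndef default at the
bottom of the first section mirrors what the patch adds.

```c
#include <stdbool.h>

#ifdef ARCH_HAS_VCPU_PREEMPT_SKETCH
/* Hypothetical arch hook: look the answer up in per-CPU state. */
static bool preempted_table[4];

static inline bool arch_vcpu_is_preempted(int cpu)
{
	return preempted_table[cpu];
}
#define vcpu_is_preempted(cpu)	arch_vcpu_is_preempted(cpu)
#endif

/* Generic fallback: a constant, so callers' checks fold away. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/* Count how many iterations a spinner would run before giving up. */
static int spins_allowed(int owner_cpu, int max_spins)
{
	int spins = 0;

	while (spins < max_spins && !vcpu_is_preempted(owner_cpu))
		spins++;
	return spins;
}
```

With the fallback in effect the condition reduces to spins < max_spins, so
architectures without a preempted check pay nothing for the new call sites.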

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11

* [PATCH v7 00/11] implement vcpu preempted check
From: Pan Xinhui @ 2016-11-02  9:08 UTC
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, dave, David.Laight, rkrcmar, peterz, benh,
	konrad.wilk, will.deacon, Pan Xinhui, mingo, paulus, mpe,
	pbonzini, paulmck, boqun.feng

change from v6:
	fix typos and remove unnecessary comments.
change from v5:
	split x86/kvm patch into guest/host part.
	introduce kvm_write_guest_offset_cached.
	fix some typos.
	rebase patch onto 4.9.2
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks to boqun and Peter for their suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some
spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before this patch set was
applied.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks      2188223.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks      1804433.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks      1237257.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks      1032658.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks       768000.0 KBps  (30.1 s, 1 samples)

With patch: 

1 copy - File Write 4096 bufsize 8000 maxblocks      2209189.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks      1943816.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks      1405591.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks      1065080.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks       904762.0 KBps  (30.0 s, 1 samples)
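
The summary percentages above appear to be the gain expressed relative to
the with-patch throughput. That is an inference from the figures, not
something the posting states, and gain_percent() is a hypothetical helper
written only to check the arithmetic:

```c
/*
 * Throughput gain in percent, measured against the patched result
 * (assumed convention; see the note above).
 */
static double gain_percent(double without_patch, double with_patch)
{
	return (with_patch - without_patch) / with_patch * 100.0;
}
```

For example, the 2 copy case is (1943816.0 - 1804433.0) / 1943816.0, which
is about 7.17%, and the other rows agree with the summary to within
rounding or truncation.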

X86 test result:
test-case                              |  after-patch    |  before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps 
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps 
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps 
Process Creation                       |    29881.2 lps  |    28572.8 lps 
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm 
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm 
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps 


Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
 arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
 arch/s390/include/asm/spinlock.h      |  8 ++++++++
 arch/s390/kernel/smp.c                |  9 +++++++--
 arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h       |  8 ++++++++
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
 arch/x86/kernel/kvm.c                 | 12 ++++++++++++
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
 arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
 arch/x86/xen/spinlock.c               |  3 ++-
 include/linux/kvm_host.h              |  2 ++
 include/linux/sched.h                 | 12 ++++++++++++
 kernel/locking/mutex.c                | 13 +++++++++++--
 kernel/locking/osq_lock.c             |  8 +++++++-
 kernel/locking/rwsem-xadd.c           | 14 +++++++++++---
 virt/kvm/kvm_main.c                   | 20 ++++++++++++++------
 18 files changed, 145 insertions(+), 34 deletions(-)

-- 
2.4.11

* [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info
From: Liang Li @ 2016-11-02  6:17 UTC
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, virtualization, mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

Support the request for the VM's unused page information and respond with
a page bitmap. QEMU can use this bitmap together with the dirty page
logging mechanism to skip transferring some of these unused pages, which
helps reduce network traffic and speed up the live migration process.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 128 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 121 insertions(+), 7 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c6c94b6..ba2d37b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -83,6 +83,8 @@ struct virtio_balloon {
 	unsigned int nr_page_bmap;
 	/* Used to record the processed pfn range */
 	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+	/* Request header */
+	struct virtio_balloon_req_hdr req_hdr;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -552,6 +554,63 @@ static void update_balloon_stats(struct virtio_balloon *vb)
 				pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+				unsigned long req_id)
+{
+	struct scatterlist sg_in;
+	unsigned long pfn = 0, bmap_len, pfn_limit, last_pfn, nr_pfn;
+	struct virtqueue *vq = vb->req_vq;
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+	int ret = 1, used_nr_bmap = 0, i;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP) &&
+		vb->nr_page_bmap == 1)
+		extend_page_bitmap(vb);
+
+	pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+	mutex_lock(&vb->balloon_lock);
+	last_pfn = get_max_pfn();
+
+	while (ret) {
+		clear_page_bitmap(vb);
+		ret = get_unused_pages(pfn, pfn + pfn_limit, vb->page_bitmap,
+			 PFNS_PER_BMAP, vb->nr_page_bmap);
+		if (ret < 0)
+			break;
+		hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+		hdr->id = req_id;
+		bmap_len = BALLOON_BMAP_SIZE * vb->nr_page_bmap;
+
+		if (!ret) {
+			hdr->flag = BALLOON_FLAG_DONE;
+			nr_pfn = last_pfn - pfn;
+			used_nr_bmap = nr_pfn / PFNS_PER_BMAP;
+			if (nr_pfn % PFNS_PER_BMAP)
+				used_nr_bmap++;
+			bmap_len = nr_pfn / BITS_PER_BYTE;
+		} else {
+			hdr->flag = BALLOON_FLAG_CONT;
+			used_nr_bmap = vb->nr_page_bmap;
+		}
+		for (i = 0; i < used_nr_bmap; i++) {
+			unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+			if (i + 1 == used_nr_bmap)
+				bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+			set_bulk_pages(vb, vq, pfn + i * PFNS_PER_BMAP,
+				 vb->page_bitmap[i], bmap_size, true);
+		}
+		if (vb->resp_pos > 0)
+			send_resp_data(vb, vq, true);
+		pfn += pfn_limit;
+	}
+
+	mutex_unlock(&vb->balloon_lock);
+	sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+	virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+	virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -686,18 +745,56 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_req_hdr *ptr_hdr;
+	unsigned int len;
+
+	ptr_hdr = virtqueue_get_buf(vb->req_vq, &len);
+	if (!ptr_hdr || len != sizeof(vb->req_hdr))
+		return;
+
+	switch (ptr_hdr->cmd) {
+	case BALLOON_GET_UNUSED_PAGES:
+		send_unused_pages_info(vb, ptr_hdr->param);
+		break;
+	default:
+		break;
+	}
+}
+
+static void misc_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	misc_handle_rq(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
+	struct virtqueue *vqs[4];
+	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
+					 stats_request, misc_request };
+	static const char * const names[] = { "inflate", "deflate", "stats",
+						 "misc" };
 	int err, nvqs;
 
 	/*
 	 * We expect two virtqueues: inflate and deflate, and
 	 * optionally stat.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ))
+		nvqs = 4;
+	else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs = 3;
+	else
+		nvqs = 2;
+
+	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
 	if (err)
 		return err;
@@ -718,6 +815,18 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ)) {
+		struct scatterlist sg_in;
+
+		vb->req_vq = vqs[3];
+		sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+		if (virtqueue_add_inbuf(vb->req_vq, &sg_in, 1,
+		    &vb->req_hdr, GFP_KERNEL) < 0)
+			__virtio_clear_bit(vb->vdev,
+					VIRTIO_BALLOON_F_HOST_REQ_VQ);
+		else
+			virtqueue_kick(vb->req_vq);
+	}
 	return 0;
 }
 
@@ -851,11 +960,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->resp_hdr = kzalloc(sizeof(struct virtio_balloon_resp_hdr),
 				 GFP_KERNEL);
 	/* Clear the feature bit if memory allocation fails */
-	if (!vb->resp_hdr)
+	if (!vb->resp_hdr) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
-	else {
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+	} else {
 		vb->page_bitmap[0] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
 		if (!vb->page_bitmap[0]) {
+			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
 			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 			kfree(vb->resp_hdr);
 		} else {
@@ -864,6 +975,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 			if (!vb->resp_data) {
 				__virtio_clear_bit(vdev,
 						VIRTIO_BALLOON_F_PAGE_BITMAP);
+				__virtio_clear_bit(vdev,
+						VIRTIO_BALLOON_F_HOST_REQ_VQ);
 				kfree(vb->page_bitmap[0]);
 				kfree(vb->resp_hdr);
 			}
@@ -987,6 +1100,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_PAGE_BITMAP,
+	VIRTIO_BALLOON_F_HOST_REQ_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
-- 
1.8.3.1

* [PATCH kernel v4 6/7] virtio-balloon: define flags and head for host request vq
From: Liang Li @ 2016-11-02  6:17 UTC
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, Andrew Morton, virtualization,
	mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

Define the flags and header struct for a new host request virtual
queue. The guest can receive requests from the host and respond to them
on this new virtual queue.
The host can use this virtual queue to ask the guest to perform some
operations, e.g. drop the page cache, synchronize the file system, etc.
The hypervisor can also get some of the guest's runtime information
through this virtual queue, e.g. the guest's unused page
information, which can be used for live migration optimization.
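
To make the request header layout concrete, here is a userspace sketch
using fixed-width types in place of __le16/__le64. The names
balloon_req_hdr_sketch, le16_at(), le64_at() and parse_req_hdr() are
hypothetical and exist only for this illustration; the field order follows
the struct defined in this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the request header: 2 + 6 + 8 = 16 bytes. */
struct balloon_req_hdr_sketch {
	uint16_t cmd;		/* which request this is */
	uint16_t reserved[3];	/* reserved */
	uint64_t param;		/* request parameter */
};

_Static_assert(sizeof(struct balloon_req_hdr_sketch) == 16,
	       "header is expected to be 16 bytes");
_Static_assert(offsetof(struct balloon_req_hdr_sketch, param) == 8,
	       "param is expected at byte offset 8");

/* Read little-endian fields regardless of host byte order. */
static uint16_t le16_at(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint64_t le64_at(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode a 16-byte wire buffer into host-endian fields. */
static void parse_req_hdr(const uint8_t buf[16],
			  struct balloon_req_hdr_sketch *out)
{
	int i;

	out->cmd = le16_at(buf);
	for (i = 0; i < 3; i++)
		out->reserved[i] = le16_at(buf + 2 + 2 * i);
	out->param = le64_at(buf + 8);
}
```

The three reserved words pad the 16-bit command so that the 64-bit
parameter sits on a naturally aligned 8-byte boundary.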

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index bed6f41..c4e34d0 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ	4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
 	__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+	/* Get unused page information */
+	BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+	/* Have more data for a request */
+	BALLOON_FLAG_CONT,
+	/* No more data for a request */
+	BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+	/* Used to distinguish different requests */
+	__le16 cmd;
+	/* Reserved */
+	__le16 reserved[3];
+	/* Request parameter */
+	__le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1

* [PATCH kernel v4 5/7] mm: add the related functions to get unused page
From: Liang Li @ 2016-11-02  6:17 UTC
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, Andrew Morton, virtualization,
	mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

Save the unused page info into a split page bitmap. The virtio
balloon driver will use this new API to get the unused page bitmap
and send it to the hypervisor (QEMU) to speed up live migration.
While the bitmap is being sent, some of the pages may be modified and
no longer be free; this inaccuracy can be corrected by the dirty
page logging mechanism.
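
The "split" bitmap is an array of small bitmaps rather than one large
allocation. A minimal userspace sketch of how a PFN maps onto it, following
the same index arithmetic as the patch (offset / bits selects the bitmap,
offset % bits selects the bit). mark_pfn_unused() and BITS_PER_ULONG are
hypothetical names for this sketch, and the kernel's bitmap_set() is
replaced by plain bit operations:

```c
#include <limits.h>

#define BITS_PER_ULONG	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Mark 'pfn' as unused in an array of 'nr_bmap' bitmaps holding 'bits'
 * bits each, where bit 0 of bitmap[0] corresponds to 'start_pfn'.
 * Returns 0 on success, -1 if the pfn falls outside the covered range.
 */
static int mark_pfn_unused(unsigned long *bitmap[], unsigned long bits,
			   unsigned int nr_bmap, unsigned long start_pfn,
			   unsigned long pfn)
{
	unsigned long off, idx, pos;

	if (bits == 0 || pfn < start_pfn)
		return -1;

	off = pfn - start_pfn;
	idx = off / bits;		/* which small bitmap */
	if (idx >= nr_bmap)
		return -1;

	pos = off % bits;		/* which bit inside that bitmap */
	bitmap[idx][pos / BITS_PER_ULONG] |= 1UL << (pos % BITS_PER_ULONG);
	return 0;
}
```

Splitting the bitmap this way trades a little index arithmetic for
allocations that are small enough to succeed on a fragmented system.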

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/page_alloc.c    | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f47862a..7014d8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1773,6 +1773,8 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 extern unsigned long get_max_pfn(void);
+extern int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+	unsigned long *bitmap[], unsigned long len, unsigned int nr_bmap);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12cc8ed..72537cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4438,6 +4438,91 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_unused_pages_bitmap(struct zone *zone,
+		unsigned long start_pfn, unsigned long end_pfn,
+		unsigned long *bitmap[], unsigned long bits,
+		unsigned int nr_bmap)
+{
+	unsigned long pfn, flags, nr_pg, pos, *bmap;
+	unsigned int order, i, t, bmap_idx;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+
+	end_pfn = min(start_pfn + nr_bmap * bits, end_pfn);
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			if (pfn < start_pfn || pfn >= end_pfn)
+				continue;
+			nr_pg = 1UL << order;
+			if (pfn + nr_pg > end_pfn)
+				nr_pg = end_pfn - pfn;
+			bmap_idx = (pfn - start_pfn) / bits;
+			if (bmap_idx == (pfn + nr_pg - start_pfn) / bits) {
+				bmap = bitmap[bmap_idx];
+				pos = (pfn - start_pfn) % bits;
+				bitmap_set(bmap, pos, nr_pg);
+			} else
+				for (i = 0; i < nr_pg; i++) {
+					pos = pfn - start_pfn + i;
+					bmap_idx = pos / bits;
+					bmap = bitmap[bmap_idx];
+					pos = pos % bits;
+					bitmap_set(bmap, pos, 1);
+				}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * During live migration, page is always discardable unless it's
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * whose pfn are range from start_pfn to end_pfn, and set the
+ * corresponding bit in the bitmap if an unused page is found.
+ *
+ * Allocating a large bitmap may fail because of fragmentation,
+ * instead of using a single bitmap, we use a scatter/gather bitmap.
+ * The 'bitmap' is the start address of an array which contains
+ * 'nr_bmap' separate small bitmaps, each bitmap contains 'bits' bits.
+ *
+ * return -1 if parameters are invalid
+ * return 0 when end_pfn >= max_pfn
+ * return 1 when end_pfn < max_pfn
+ */
+int get_unused_pages(unsigned long start_pfn, unsigned long end_pfn,
+	unsigned long *bitmap[], unsigned long bits, unsigned int nr_bmap)
+{
+	struct zone *zone;
+	int ret = 0;
+
+	if (bitmap == NULL || *bitmap == NULL || nr_bmap == 0 ||
+		 bits == 0 || start_pfn > end_pfn)
+		return -1;
+	if (end_pfn < max_pfn)
+		ret = 1;
+	if (end_pfn >= max_pfn)
+		ret = 0;
+
+	for_each_populated_zone(zone)
+		mark_unused_pages_bitmap(zone, start_pfn, end_pfn, bitmap,
+					 bits, nr_bmap);
+
+	return ret;
+}
+EXPORT_SYMBOL(get_unused_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
1.8.3.1

* [PATCH kernel v4 4/7] virtio-balloon: speed up inflate/deflate process
From: Liang Li @ 2016-11-02  6:17 UTC
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, virtualization, mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

The current virtio-balloon implementation is not very
efficient. The time spent in the different stages of inflating
the balloon to 7GB on an 8GB idle guest is:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stage b and stage d.

If we use a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b considerably. Furthermore, we
can do the address translation and call madvise() on a bulk of
RAM pages instead of one page at a time, so the overhead
of stage c and stage d can also be reduced considerably.

This patch is the kernel side implementation, which is intended to
speed up the inflating and deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB on an 8GB idle guest takes only 590ms, a
performance improvement of about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.
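
A back-of-the-envelope check of the figures above, as a sketch only. It
assumes 4 KiB pages and one 32-bit PFN entry per page for the PFN-based
interface; both are assumptions of this example rather than statements
from the posting, and the three helper names are hypothetical.

```c
/* Percentage reduction in elapsed time. */
static double time_reduction_percent(double before_ms, double after_ms)
{
	return (before_ms - after_ms) / before_ms * 100.0;
}

/* Bytes needed to describe nr_pages contiguous pages at one bit each. */
static unsigned long long bitmap_bytes(unsigned long long nr_pages)
{
	return (nr_pages + 7) / 8;
}

/* Bytes needed at one 32-bit PFN per page (assumed entry size). */
static unsigned long long pfn_array_bytes(unsigned long long nr_pages)
{
	return nr_pages * 4;
}
```

Going from 4126ms to 590ms is a reduction of roughly 85.7%, matching the
"about 85%" quoted. Under the stated assumptions, 7GB is 1835008 pages,
which is about 7 MiB of PFN entries versus 224 KiB of bitmap if the pages
were contiguous; a real bitmap covers the whole PFN range spanned, so its
size depends on how scattered the pages are.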

Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 398 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 369 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 59ffe5a..c6c94b6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE	(8 * PAGE_SIZE)
+#define PFNS_PER_BMAP		(BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+	/* Pointer to the response header. */
+	void *resp_hdr;
+	/* Pointer to the start address of response data. */
+	unsigned long *resp_data;
+	/* Pointer offset of the response data. */
+	unsigned long resp_pos;
+	/* Bitmap and bitmap count used to tell the host the pages */
+	unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+	/* Number of split page bitmaps */
+	unsigned int nr_page_bmap;
+	/* Used to record the processed pfn range */
+	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -110,20 +126,227 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-	struct scatterlist sg;
+	vb->min_pfn = ULONG_MAX;
+	vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+				 struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+	vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb)
+{
+	int i, bmap_count;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(get_max_pfn(), BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+	bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+				 BALLOON_BMAP_COUNT);
+
+	for (i = 1; i < bmap_count; i++) {
+		vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bitmap[i])
+			vb->nr_page_bmap++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+	int i, bmap_count = vb->nr_page_bmap;
+
+
+	for (i = 1; i < bmap_count; i++) {
+		kfree(vb->page_bitmap[i]);
+		vb->page_bitmap[i] = NULL;
+		vb->nr_page_bmap--;
+	}
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->nr_page_bmap; i++)
+		kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->nr_page_bmap; i++)
+		memset(vb->page_bitmap[i], 0, BALLOON_BMAP_SIZE);
+}
+
+static unsigned long do_set_resp_bitmap(struct virtio_balloon *vb,
+	unsigned long *bitmap,	unsigned long base_pfn,
+	unsigned long pos, int nr_page)
+
+{
+	struct virtio_balloon_bmap_hdr *hdr;
+	unsigned long end, new_pos, new_end, nr_left, proccessed = 0;
+
+	new_pos = pos;
+	new_end = end = pos + nr_page;
+
+	if (pos % BITS_PER_LONG) {
+		unsigned long pos_s;
+
+		pos_s = rounddown(pos, BITS_PER_LONG);
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + pos_s;
+		hdr->head.page_shift = PAGE_SHIFT;
+		hdr->head.bmap_len = sizeof(unsigned long);
+		hdr->bmap[0] = cpu_to_virtio64(vb->vdev,
+				 bitmap[pos_s / BITS_PER_LONG]);
+		vb->resp_pos += 2;
+		if (pos_s + BITS_PER_LONG >= end)
+			return roundup(end, BITS_PER_LONG) - pos;
+		new_pos = roundup(pos, BITS_PER_LONG);
+	}
+
+	if (end % BITS_PER_LONG) {
+		unsigned long pos_e;
+
+		pos_e = roundup(end, BITS_PER_LONG);
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + pos_e - BITS_PER_LONG;
+		hdr->head.page_shift = PAGE_SHIFT;
+		hdr->head.bmap_len = sizeof(unsigned long);
+		hdr->bmap[0] = bitmap[pos_e / BITS_PER_LONG - 1];
+		vb->resp_pos += 2;
+		if (new_pos + BITS_PER_LONG >= pos_e)
+			return pos_e - pos;
+		new_end = rounddown(end, BITS_PER_LONG);
+	}
+
+	nr_left = nr_page = new_end - new_pos;
+
+	while (proccessed < nr_page) {
+		int bulk, order;
+
+		order = get_order(nr_left << PAGE_SHIFT);
+		if ((1 << order) > nr_left)
+			order--;
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + new_pos + proccessed;
+		hdr->head.page_shift = order + PAGE_SHIFT;
+		hdr->head.bmap_len = 0;
+		bulk = 1 << order;
+		nr_left -= bulk;
+		proccessed += bulk;
+		vb->resp_pos++;
+	}
+
+	return roundup(end, BITS_PER_LONG) - pos;
+}
+
+static void send_resp_data(struct virtio_balloon *vb, struct virtqueue *vq,
+			bool busy_wait)
+{
+	struct scatterlist sg[2];
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	len = hdr->data_len = vb->resp_pos * sizeof(unsigned long);
+	sg_init_table(sg, 2);
+	sg_set_buf(&sg[0], hdr, sizeof(struct virtio_balloon_resp_hdr));
+	sg_set_buf(&sg[1], vb->resp_data, len);
+
+	if (virtqueue_add_outbuf(vq, sg, 2, vb, GFP_KERNEL) == 0) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len)
+				&& !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->resp_pos = 0;
+		free_extended_page_bitmap(vb);
+	}
+}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+static void set_bulk_pages(struct virtio_balloon *vb, struct virtqueue *vq,
+		unsigned long start_pfn, unsigned long *bitmap,
+		unsigned long len, bool busy_wait)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bitmap, end, pos);
+
+		if ((vb->resp_pos + 64) * sizeof(unsigned long) >
+			 BALLOON_BMAP_SIZE)
+			send_resp_data(vb, vq, busy_wait);
+		if (one < end) {
+			unsigned long pages, zero;
+
+			zero = find_next_zero_bit(bitmap, end, one + 1);
+			if (zero >= end)
+				pages = end - one;
+			else
+				pages = zero - one;
+			if (pages) {
+				pages = do_set_resp_bitmap(vb, bitmap,
+					 start_pfn, one, pages);
+			}
+			pos = one + pages;
+		} else
+			pos = one;
+	}
+}
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+		int nr_pfn, nr_used_bmap, i;
+		unsigned long start_pfn, bmap_len;
+
+		start_pfn = vb->start_pfn;
+		nr_pfn = vb->end_pfn - start_pfn + 1;
+		nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
+		nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
+		if (nr_pfn % PFNS_PER_BMAP)
+			nr_used_bmap++;
+		bmap_len = nr_pfn / BITS_PER_BYTE;
+
+		for (i = 0; i < nr_used_bmap; i++) {
+			unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+			if (i + 1 == nr_used_bmap)
+				bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+			set_bulk_pages(vb, vq, start_pfn + i * PFNS_PER_BMAP,
+				 vb->page_bitmap[i], bmap_size, false);
+		}
+		if (vb->resp_pos > 0)
+			send_resp_data(vb, vq, false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
+
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
+		/* We should always be able to add one buffer to an
+		 * empty queue
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -138,13 +361,59 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
-static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
+static void set_page_bitmap(struct virtio_balloon *vb,
+			 struct list_head *pages, struct virtqueue *vq)
 {
-	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
-	unsigned num_allocated_pages;
+	unsigned long pfn, pfn_limit;
+	struct page *page;
+	bool found;
+	int bmap_idx;
+
+	vb->min_pfn = rounddown(vb->min_pfn, BITS_PER_LONG);
+	vb->max_pfn = roundup(vb->max_pfn, BITS_PER_LONG);
+	pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+
+	for (pfn = vb->min_pfn; pfn < vb->max_pfn; pfn += pfn_limit) {
+		unsigned long end_pfn;
+
+		clear_page_bitmap(vb);
+		vb->start_pfn = pfn;
+		end_pfn = pfn;
+		found = false;
+		list_for_each_entry(page, pages, lru) {
+			unsigned long pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn || balloon_pfn >= pfn + pfn_limit)
+				continue;
+			bmap_idx = (balloon_pfn - pfn) / PFNS_PER_BMAP;
+			pos = (balloon_pfn - pfn) % PFNS_PER_BMAP;
+			set_bit(pos, vb->page_bitmap[bmap_idx]);
+			if (balloon_pfn > end_pfn)
+				end_pfn = balloon_pfn;
+			found = true;
+		}
+		if (found) {
+			vb->end_pfn = end_pfn;
+			tell_host(vb, vq);
+		}
+	}
+}
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+static unsigned int fill_balloon(struct virtio_balloon *vb, size_t num)
+{
+	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	unsigned int num_allocated_pages;
+	bool use_bmap = virtio_has_feature(vb->vdev,
+				 VIRTIO_BALLOON_F_PAGE_BITMAP);
+
+	if (use_bmap) {
+		if (vb->nr_page_bmap == 1)
+			extend_page_bitmap(vb);
+		init_bmap_pfn_range(vb);
+	} else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +428,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_bmap_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +440,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -189,15 +466,22 @@ static void release_pages_balloon(struct virtio_balloon *vb,
 	}
 }
 
-static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num)
 {
-	unsigned num_freed_pages;
+	unsigned int num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_bmap = virtio_has_feature(vb->vdev,
+			 VIRTIO_BALLOON_F_PAGE_BITMAP);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (use_bmap) {
+		if (vb->nr_page_bmap == 1)
+			extend_page_bitmap(vb);
+		init_bmap_pfn_range(vb);
+	} else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -207,7 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_bmap_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +722,20 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+static void tell_host_one_page(struct virtio_balloon *vb,
+	struct virtqueue *vq, struct page *page)
+{
+	struct virtio_balloon_bmap_hdr *bmap_hdr;
+
+	bmap_hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+	bmap_hdr->head.start_pfn = page_to_pfn(page);
+	bmap_hdr->head.page_shift = PAGE_SHIFT;
+	bmap_hdr->head.bmap_len = 0;
+	vb->resp_pos++;
+	send_resp_data(vb, vq, false);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -455,6 +760,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
 	unsigned long flags;
+	bool use_bmap = virtio_has_feature(vb->vdev,
+				 VIRTIO_BALLOON_F_PAGE_BITMAP);
 
 	/*
 	 * In order to avoid lock contention while migrating pages concurrently
@@ -475,15 +782,23 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
+	if (use_bmap)
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
+	if (use_bmap)
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 
 	mutex_unlock(&vb->balloon_lock);
 
@@ -533,6 +848,28 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+	vb->resp_hdr = kzalloc(sizeof(struct virtio_balloon_resp_hdr),
+				 GFP_KERNEL);
+	/* Clear the feature bit if memory allocation fails */
+	if (!vb->resp_hdr)
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+	else {
+		vb->page_bitmap[0] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->page_bitmap[0]) {
+			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+			kfree(vb->resp_hdr);
+		} else {
+			vb->nr_page_bmap = 1;
+			vb->resp_data = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+			if (!vb->resp_data) {
+				__virtio_clear_bit(vdev,
+						VIRTIO_BALLOON_F_PAGE_BITMAP);
+				kfree(vb->page_bitmap[0]);
+				kfree(vb->resp_hdr);
+			}
+		}
+	}
+	vb->resp_pos = 0;
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -609,6 +946,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	remove_common(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
+	kfree_page_bitmap(vb);
+	kfree(vb->resp_hdr);
 	kfree(vb);
 }
 
@@ -647,6 +986,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_BITMAP,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH kernel v4 3/7] mm: add a function to get the max pfn
From: Liang Li @ 2016-11-02  6:17 UTC (permalink / raw)
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, Andrew Morton, virtualization,
	mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

Expose a function to get the max pfn, so it can be used in the
virtio-balloon device driver. Simply including 'linux/bootmem.h'
is not enough: if the device driver is built as a module, referring
to max_pfn directly leads to a build failure.
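
For context, the sketch below mirrors in user space how the balloon
driver patch in this series (extend_page_bitmap()) appears to use the
value as a sizing hint: it derives how many fixed-size bitmaps are
needed to cover all PFNs, capped at BALLOON_BMAP_COUNT. The 4 KiB page
size and the 64-bit long are assumptions of this sketch, and
bitmap_count() and align_up() are hypothetical names.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE     4096ULL  /* assumption: 4 KiB pages */
#define SKETCH_BITS_PER_LONG 64ULL    /* assumption: 64-bit long */
#define BITS_PER_BYTE        8ULL
#define BALLOON_BMAP_SIZE    (8 * SKETCH_PAGE_SIZE) /* bytes per bitmap */
#define BALLOON_BMAP_COUNT   32ULL                  /* max bitmaps */

/* Round x up to a multiple of a; a must be a power of two. */
static unsigned long long align_up(unsigned long long x, unsigned long long a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Number of bitmaps needed to cover PFNs [0, max_pfn), capped. */
static unsigned int bitmap_count(unsigned long long max_pfn)
{
	unsigned long long len, count;

	assert(max_pfn > 0);

	/* One bit per PFN, rounded up to whole longs, then to whole bitmaps. */
	len = align_up(max_pfn, SKETCH_BITS_PER_LONG) / BITS_PER_BYTE;
	len = align_up(len, BALLOON_BMAP_SIZE);
	count = len / BALLOON_BMAP_SIZE;

	return (unsigned int)(count < BALLOON_BMAP_COUNT ?
			      count : BALLOON_BMAP_COUNT);
}
```

Under these assumptions an 8GB guest (2097152 pages) needs 8 bitmaps,
and the cap of 32 bitmaps covers 32GB of guest memory in one pass.
Because max_pfn can change with memory hotplug, the result is only a
hint, as the comment on get_max_pfn() says.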

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..f47862a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1772,6 +1772,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern unsigned long get_max_pfn(void);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fd42aa..12cc8ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4428,6 +4428,16 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
+/*
+ * The max_pfn can change because of memory hot plug, so it's only good
+ * as a hint. e.g. for sizing data structures.
+ */
+unsigned long get_max_pfn(void)
+{
+	return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH kernel v4 2/7] virtio-balloon: define new feature bit and head struct
From: Liang Li @ 2016-11-02  6:17 UTC (permalink / raw)
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, virtualization, mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

Add a new feature which supports sending the page information with
a bitmap. The current implementation uses a PFN array, which is not
very efficient. Using a bitmap can improve the performance of
inflating/deflating significantly.

The page bitmap header will be used to tell the host some information
about the page bitmap, e.g. the page size, page bitmap length and
start pfn.
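
The three header fields fit in one 64-bit word (52 + 6 + 6 bits). The
sketch below packs and unpacks them with explicit shifts. Placing
start_pfn in the low bits is an assumption of this sketch: the patch
uses C bitfields, whose layout is implementation-defined, so the real
wire layout may differ. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define START_PFN_BITS  52
#define PAGE_SHIFT_BITS 6
#define BMAP_LEN_BITS   6

/* Pack the three header fields into one 64-bit word. */
static uint64_t pack_bmap_head(uint64_t start_pfn, unsigned int page_shift,
			       unsigned int bmap_len)
{
	/* Each value must fit in its field, or it would corrupt a neighbor. */
	assert(start_pfn < (UINT64_C(1) << START_PFN_BITS));
	assert(page_shift < (1u << PAGE_SHIFT_BITS));
	assert(bmap_len < (1u << BMAP_LEN_BITS));

	return start_pfn |
	       ((uint64_t)page_shift << START_PFN_BITS) |
	       ((uint64_t)bmap_len << (START_PFN_BITS + PAGE_SHIFT_BITS));
}

static uint64_t head_start_pfn(uint64_t head)
{
	return head & ((UINT64_C(1) << START_PFN_BITS) - 1);
}

static unsigned int head_page_shift(uint64_t head)
{
	return (unsigned int)((head >> START_PFN_BITS) &
			      ((1u << PAGE_SHIFT_BITS) - 1));
}

static unsigned int head_bmap_len(uint64_t head)
{
	return (unsigned int)((head >> (START_PFN_BITS + PAGE_SHIFT_BITS)) &
			      ((1u << BMAP_LEN_BITS) - 1));
}
```

Explicit shifts and masks keep the layout the same on every compiler
and endianness, which bitfields do not guarantee. Note that a 6-bit
bmap_len can describe at most 63 bytes of bitmap per header.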

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..bed6f41 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+	__le64 cmd : 8; /* Distinguish different requests type */
+	__le64 flag: 8; /* Mark status for a specific request type */
+	__le64 id : 16; /* Distinguish requests of a specific type */
+	__le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+	struct {
+		__le64 start_pfn : 52; /* start pfn for the bitmap */
+		__le64 page_shift : 6; /* page shift width, in bytes */
+		__le64 bmap_len : 6;  /* bitmap length, in bytes */
+	} head;
+	__le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH kernel v4 1/7] virtio-balloon: rework deflate to add page to a list
From: Liang Li @ 2016-11-02  6:17 UTC (permalink / raw)
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, virtualization, mgorman
In-Reply-To: <1478067447-24654-1-git-send-email-liang.z.li@intel.com>

When doing the inflating/deflating operation, the current virtio-balloon
implementation uses an array to save 256 PFNs, then sends these PFNs to
the host through virtio and processes each PFN one by one. This is not
efficient when inflating/deflating a large amount of memory, because the
following operations are repeated too many times:

    1. Virtio data transmission
    2. Page allocate/free
    3. Address translation(GPA->HVA)
    4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete, which may impact the QoS of the guest as
well as the host. The overhead will be reduced a lot if batch processing
is used. E.g. if there are several pages whose addresses are physically
contiguous in the guest, these bulk pages can be processed in one
operation.

The main idea of the optimization is to reduce the above operations as
much as possible. This can be achieved by using a bitmap instead of a
PFN array. Compared with a PFN array, for a buffer of a given size a
bitmap can represent more pages, which is very important for batch
processing.
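
To make the capacity comparison concrete, here is a small sketch that
computes how many pages each encoding can describe in a buffer of a
given size. It assumes 32-bit PFN entries, which matches the driver's
use of virtio32_to_cpu() on vb->pfns; the function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pages describable by an array of 32-bit PFNs in buf_bytes. */
static size_t pfn_array_capacity(size_t buf_bytes)
{
	return buf_bytes / sizeof(uint32_t);
}

/* Pages describable by a bitmap in buf_bytes: one bit per page. */
static size_t bitmap_capacity(size_t buf_bytes)
{
	assert(buf_bytes <= SIZE_MAX / 8); /* guard against overflow */
	return buf_bytes * 8;
}
```

For the 1024 bytes occupied by today's 256-entry PFN array, a bitmap
can describe up to 8192 pages, a factor of 32. The bitmap covers a
contiguous PFN range, though, so this advantage only holds when the
ballooned pages are reasonably dense within that range.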

Using a bitmap instead of PFNs is not very helpful when inflating/deflating
a small number of pages; in this case, using PFNs is better. But using a
bitmap will not impact the QoS of the guest or host heavily, because the
operation completes very quickly for a small number of pages, and we
will use some methods to make sure the efficiency does not drop too much.

This patch saves the deflated pages to a list instead of the PFN array,
which will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it is no longer used.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4e7003d..59ffe5a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH kernel v4 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Liang Li @ 2016-11-02  6:17 UTC (permalink / raw)
  To: mst, dave.hansen
  Cc: virtio-dev, kvm, linux-kernel, Liang Li, qemu-devel, dgilbert,
	linux-mm, amit.shah, pbonzini, virtualization, mgorman

This patch set contains two sets of changes to virtio-balloon.

The first speeds up the inflating & deflating process. The main idea
of this optimization is to use a bitmap to send the page information
to the host instead of the PFNs, to reduce the overhead of virtio data
transmission, address translation and madvise(). This can help to
improve the performance by about 85%.

The second speeds up live migration. By skipping the processing of the
guest's unused pages in the first round of data copy, it reduces needless
data processing and can save quite a lot of CPU cycles and network
bandwidth. We put the guest's unused page information in a bitmap and
send it to the host through the virtqueue of virtio-balloon. For an idle
guest with 8GB RAM, this can help to shorten the total live migration
time from 2 seconds to about 500ms in a 10Gbps network environment.
 
Changes from v3 to v4:
    * Use the new scheme suggested by Dave Hansen to encode the bitmap.
    * Add the code, missing in v3, to handle page migration.
    * Free the bitmap memory promptly once the operation is done.
    * Address some of the comments in v3.

Changes from v2 to v3:
    * Change the name of 'free page' to 'unused page'.
    * Use the scatter & gather bitmap instead of a 1MB page bitmap.
    * Fix overwriting the page bitmap after kicking.
    * Some of MST's comments for v2.
 
Changes from v1 to v2:
    * Abandon the patch for dropping page cache.
    * Move some structures to the uapi header file.
    * Use a new way to determine the page bitmap size.
    * Use a unified way to send the free page information with the bitmap.
    * Address the issues raised in MST's comments.

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  mm: add the related functions to get unused page
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c     | 546 ++++++++++++++++++++++++++++++++----
 include/linux/mm.h                  |   3 +
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c                     |  95 +++++++
 4 files changed, 636 insertions(+), 49 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* SR-PCIM support in Xen
From: Bharat Kumar Gogada @ 2016-11-01 13:47 UTC (permalink / raw)
  To: xen-devel-request@lists.xenproject.org
  Cc: jgross@suse.com, david.vrabel@citrix.com

Hi All,

Is there any support for software mailbox communication between the PF and VF drivers in Xen?

As per the following link:
http://www-archive.xenproject.org/files/xensummitboston08/Xen-SR-IOV.pdf
It says that for software based virtual mailbox we require SR-PCIM support.

So, for this purpose, is SR-PCIM implemented as part of Xen?
Or is there any other method that Xen provides for communication between PF & VF?

Thanks & regards,
Bharat

^ permalink raw reply

* [PULL] virtio: tests, cleanups and fixes
From: Michael S. Tsirkin @ 2016-11-01  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jgross, lprosek, kvm, mst, netdev, kneumoin, will.deacon,
	linux-kernel, stable, virtualization, luto, pbonzini, den,
	matt.redfearn, elfring

The following changes since commit a909d3e636995ba7c349e2ca5dbb528154d4ac30:

  Linux 4.9-rc3 (2016-10-29 13:52:02 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 75bfa81bf0897ba87f1e1b9b576a07536029b86a:

  virtio_ring: mark vring_dma_dev inline (2016-10-31 00:40:08 +0200)

----------------------------------------------------------------
virtio: tests, fixes and cleanups

Just minor tweaks, there's nothing major in this cycle.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Juergen Gross (1):
      virtio: remove config.c

Konstantin Neumoin (1):
      virtio: update balloon size in balloon "probe"

Ladi Prosek (1):
      virtio_ring: Make interrupt suppression spec compliant

Markus Elfring (2):
      virtio_blk: Use kmalloc_array() in init_vq()
      virtio_blk: Delete an unnecessary initialisation in init_vq()

Matt Redfearn (1):
      virtio: console: Unlock vqs while freeing buffers

Michael S. Tsirkin (2):
      virtio/vhost: add Jason to list of maintainers
      virtio_ring: mark vring_dma_dev inline

Paolo Bonzini (3):
      ringtest: use link-time optimization
      ringtest: commonize implementation of poll_avail/poll_used
      ringtest: poll for new buffers once before updating event index

Will Deacon (1):
      virtio_pci: Limit DMA mask to 44 bits for legacy virtio devices

 tools/virtio/ringtest/main.h            |  4 +--
 drivers/block/virtio_blk.c              | 10 +++---
 drivers/char/virtio_console.c           | 22 ++++++++----
 drivers/virtio/config.c                 | 12 -------
 drivers/virtio/virtio_balloon.c         |  2 ++
 drivers/virtio/virtio_pci_legacy.c      | 16 ++++++---
 drivers/virtio/virtio_ring.c            | 16 +++++----
 tools/virtio/ringtest/main.c            | 20 ++++++++---
 tools/virtio/ringtest/noring.c          |  6 ++--
 tools/virtio/ringtest/ptr_ring.c        | 22 +++---------
 tools/virtio/ringtest/ring.c            | 18 ++++------
 tools/virtio/ringtest/virtio_ring_0_9.c | 64 ++++++++-------------------------
 MAINTAINERS                             |  2 ++
 tools/virtio/ringtest/Makefile          |  4 +--
 14 files changed, 96 insertions(+), 122 deletions(-)
 delete mode 100644 drivers/virtio/config.c

^ permalink raw reply

* Re: [PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()
From: Pan Xinhui @ 2016-10-30 14:39 UTC (permalink / raw)
  To: Davidlohr Bueso, Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <20161029165216.GA17451@linux-80c1.suse>



On 2016/10/30 00:52, Davidlohr Bueso wrote:
> On Fri, 28 Oct 2016, Pan Xinhui wrote:
>>         /*
>>          * If we need to reschedule bail... so we can block.
>> +         * Use vcpu_is_preempted to detech lock holder preemption issue
>                                            ^^ detect
ok. thanks for pointing it out.
>> +         * and break.
>
> Could you please remove the rest of this comment? Its just noise to point out
> that vcpu_is_preempted is a macro defined by arch/false. This is standard protocol
> in the kernel.
>
fair enough.

> Same goes for all locks you change with this.
>
> Thanks,
> Davidlohr
>
>>                * vcpu_is_preempted is a macro defined by false if
>> +         * arch does not support vcpu preempted check,
>>          */
>> -        if (need_resched())
>> +        if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
>>             goto unqueue;
>>
>>         cpu_relax_lowlatency();
>> --
>> 2.4.11
>>
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* WorldCIST'2017 - 5th World Conference on Information Systems and Technologies - Deadline: November 20
From: ML @ 2016-10-29 21:06 UTC (permalink / raw)
  To: virtualization

[-- Attachment #1: Type: text/plain, Size: 7853 bytes --]

*Proceedings by Springer
**Best papers published in SCI/SSCI-indexed journals

---------------------------------------------------------------------------
WorldCIST'17 - 5th World Conference on Information Systems and Technologies 
Porto Santo Island, Madeira, Portugal
11th-13th of April 2017
http://www.worldcist.org/
--------------------------------------------------------------------


SCOPE

The WorldCist'17 - 5th World Conference on Information Systems and Technologies, to be held at Porto Santo Island, Madeira, Portugal, 11 - 13 April 2017, is a global forum for researchers and practitioners to present and discuss the most recent innovations, trends, results, experiences and concerns in the several perspectives of Information Systems and Technologies.

We are pleased to invite you to submit your papers to WorldCist'17 (http://www.worldcist.org/). All submissions will be reviewed on the basis of relevance, originality, importance and clarity.


THEMES

Submitted papers should be related with one or more of the main themes proposed for the Conference:

A) Information and Knowledge Management (IKM);
B) Organizational Models and Information Systems (OMIS);
C) Software and Systems Modeling (SSM);
D) Software Systems, Architectures, Applications and Tools (SSAAT);
E) Multimedia Systems and Applications (MSA);
F) Computer Networks, Mobility and Pervasive Systems (CNMPS);
G) Intelligent and Decision Support Systems (IDSS);
H) Big Data Analytics and Applications (BDAA);
I) Human-Computer Interaction (HCI);
J) Ethics, Computers and Security (ECS)
K) Health Informatics (HIS);
L) Information Technologies in Education (ITE);
M) Information Technologies in Radiocommunications (ITR).


TYPES of SUBMISSIONS AND DECISIONS

Four types of papers can be submitted:

- Full paper: Finished or consolidated R&D works, to be included in one of the Conference themes. These papers are assigned a 10-page limit.

- Short paper: Ongoing works with relevant preliminary results, open to discussion. These papers are assigned a 7-page limit.

- Poster paper: Initial work with relevant ideas, open to discussion. These papers are assigned a 4-page limit.

- Company paper: Companies' papers that show practical experience, R&D, tools, etc., focused on some topics of the conference. These papers are assigned a 4-page limit.

Submitted papers must comply with the format of Advances in Intelligent Systems and Computing Series (see Instructions for Authors at Springer Website or download a DOC example) be written in English, must not have been published before, not be under review for any other conference or publication and not include any information leading to the authors’ identification. Therefore, the authors’ names, affiliations and bibliographic references should not be included in the version for evaluation by the Program Committee. This information should only be included in the camera-ready version, saved in Word or Latex format and also in PDF format. These files must be accompanied by the Consent to Publication form filled out, in a ZIP file, and uploaded at the conference management system.

All papers will be subjected to a “double-blind review” by at least two members of the Program Committee.

Based on the Program Committee evaluation, a paper can be rejected or accepted by the Conference Chairs. In the latter case, it can be accepted as the type originally submitted or as another type. Thus, full papers can be accepted as short papers or poster papers only. Similarly, short papers can be accepted as poster papers only. In these cases, the authors will be allowed to maintain the original number of pages in the camera-ready version.

The authors of accepted poster papers must also build and print a poster to be exhibited during the Conference. This poster must follow an A1 or A2 vertical format. The Conference may include Work Sessions where these posters are presented and orally discussed, with a 5-minute limit per poster.

The authors of accepted full papers will have 15 minutes to present their work in a Conference Work Session; approximately 5 minutes of discussion will follow each presentation. The authors of accepted short papers and company papers will have 11 minutes to present their work in a Conference Work Session; approximately 4 minutes of discussion will follow each presentation.


PUBLICATION & INDEXING

To ensure that a full paper, short paper, poster paper or company paper is published in the Proceedings, at least one of the authors must be fully registered by the 8th of January 2017, and the paper must comply with the suggested layout and page-limit. Additionally, all recommended changes must be addressed by the authors before they submit the camera-ready version.

No more than one paper per registration will be published in the Conference Proceedings. An extra fee must be paid for publication of additional papers, with a maximum of one additional paper per registration. One registration permits only the participation of one author in the conference.

Full and short papers will be published in Proceedings by Springer, in Advances in Intelligent Systems and Computing Series. Poster and company papers will be published by AISTI.

Published full and short papers will be submitted for indexation by ISI, EI-Compendex, SCOPUS, DBLP and Google Scholar, among others, and will be available in the SpringerLink Digital Library.

The authors of the best selected papers will be invited to extend them for publication in international journals indexed by ISI/SCI, SCOPUS and DBLP, among others, such as:

- International Journal of Neural Systems (IF: 6.085 / Q1)
- Integrated Computer-Aided Engineering (IF: 4.981 / Q1)
- International Journal of Information Management (IF: 2.692 / Q1)
- Telematics & Informatics (IF: 2.261 / Q1)
- Electronic Commerce Research and Applications (IF: 2.139 / Q1)
- Computers, Environment and Urban Systems (IF: 2.092 / Q1)
- Data Mining and Knowledge Discovery (IF: 1.759 / Q1)
- Journal of Medical Systems (IF: 2.213 / Q2)
- Journal of Business Research (IF: 2.129 / Q2)
- Pervasive and Mobile Computing (IF: 1.719 / Q2)
- Knowledge and Information Systems (IF: 1.702 / Q2)
- Journal of Grid Computing (IF: 1.561 / Q2) - Special Issue on "Big Data"
- Cluster Computing (IF:1.514 / Q2) - Special Issue on "Advanced Machine Learning in Parallel and Distributed Knowledge Discovery"
- International Journal of Critical Infrastructure Protection (IF: 1.351 / Q2)
- Expert Systems - Journal of Knowledge Engineering (IF: 0.947 / Q3)
- Concurrency and Computation: Practice and Experience (IF: 0.942 / Q3)
- Science of Computer Programming (IF: 0.828 / Q3)
- Ethics and Information Technology (IF: 0.739 / Q3)
- Annals of Telecommunications (IF: 0.722 / Q3)
- Engineering Computations (IF: 0.691 / Q3)
- Advances in Complex Systems (IF: 0.461 / Q3)
- Computing and Informatics (IF: 0.504 / Q4)
- AI Communications (IF: 0.364 / Q4)
- Journal of Hospitality and Tourism Technology (SR: 0.672 / Q2)
- Transforming Government: People, Process and Policy (SR: 0.642 / Q2)
- TEM Journal - Technology, Education, Management, Informatics (ISI - Emerging Sources Citation Index)
- Computer Methods in Biomechanics and Biomedical Engineering - Imaging & Visualization (ISI - Emerging Sources Citation Index)
- Journal of Information Systems Engineering & Management


IMPORTANT DATES

Paper Submission: November 20, 2016

Notification of Acceptance: December 25, 2016

Payment of Registration, to ensure the inclusion of an accepted paper in the conference proceedings: January 8, 2017.

Camera-ready Submission: January 8, 2017


-

Website of WorldCIST'17
http://www.worldcist.org/


-

Best regards,


ML








^ permalink raw reply

* Re: [PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()
From: Davidlohr Bueso @ 2016-10-29 16:52 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <1477642287-24104-3-git-send-email-xinhui.pan@linux.vnet.ibm.com>

On Fri, 28 Oct 2016, Pan Xinhui wrote:
> 		/*
> 		 * If we need to reschedule bail... so we can block.
>+		 * Use vcpu_is_preempted to detech lock holder preemption issue
                                            ^^ detect
>+		 * and break. 

Could you please remove the rest of this comment? Its just noise to point out
that vcpu_is_preempted is a macro defined by arch/false. This is standard protocol
in the kernel.

Same goes for all locks you change with this.

Thanks,
Davidlohr

>                * vcpu_is_preempted is a macro defined by false if
>+		 * arch does not support vcpu preempted check,
> 		 */
>-		if (need_resched())
>+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
> 			goto unqueue;
>
> 		cpu_relax_lowlatency();
>-- 
>2.4.11
>
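For readers following the review above, the bail-out condition in the quoted hunk can be sketched as a small user-space C model. This is only an illustration under stated assumptions: every `sketch_*` name, the `SKETCH_NR_CPUS` constant, the `need_resched` flag passed as a parameter, and the `max_spins` bound are hypothetical and do not appear in the patch; the real code calls need_resched() and vcpu_is_preempted(node_cpu(node->prev)) inside the osq_lock() spin loop.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Hypothetical stand-in for the state a hypervisor would expose. */
static bool sketch_preempted[SKETCH_NR_CPUS];

/*
 * Stand-in for vcpu_is_preempted(cpu). On an architecture without a
 * preemption check the real interface simply evaluates to false.
 */
static bool sketch_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return sketch_preempted[cpu];
}

enum sketch_result { SKETCH_LOCKED, SKETCH_UNQUEUE, SKETCH_GAVE_UP };

struct sketch_node {
	volatile int locked;	/* set by the predecessor when it hands over */
	int prev_cpu;		/* CPU that the predecessor node belongs to */
};

/*
 * Spin until the node is handed the lock, but stop spinning as soon as
 * we should reschedule or the predecessor's vCPU is not running.
 * max_spins only exists so that the sketch always terminates.
 */
static enum sketch_result sketch_spin(struct sketch_node *node,
				      bool need_resched, long max_spins)
{
	for (long i = 0; i < max_spins; i++) {
		if (node->locked)
			return SKETCH_LOCKED;
		if (need_resched || sketch_vcpu_is_preempted(node->prev_cpu))
			return SKETCH_UNQUEUE;
	}
	return SKETCH_GAVE_UP;
}
```

The point of the extra condition is that spinning behind a predecessor whose vCPU has been descheduled cannot make progress, so the waiter unqueues and lets the scheduler run something else.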

^ permalink raw reply

* Re: [PATCH v2 net-next] virtio-net: Update the mtu code to match virtio spec
From: David Miller @ 2016-10-29 16:01 UTC (permalink / raw)
  To: aconole; +Cc: jarod, netdev, mst, linux-kernel, virtualization
In-Reply-To: <1477426332-31093-1-git-send-email-aconole@redhat.com>

From: Aaron Conole <aconole@redhat.com>
Date: Tue, 25 Oct 2016 16:12:12 -0400

> The virtio committee recently ratified a change, VIRTIO-152, which
> defines the mtu field to be 'max' MTU, not simply desired MTU.
> 
> This commit brings the virtio-net device in compliance with VIRTIO-152.
> 
> Additionally, drop the max_mtu branch - it cannot be taken since the u16
> returned by virtio_cread16 will never exceed the initial value of
> max_mtu.
> 
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
> Acked-by: Jarod Wilson <jarod@redhat.com>
> ---
> Nothing code-wise has changed, but I've included the ACKs and fixed up the
> subject line.

Applied, thanks.

^ permalink raw reply

* Re: [Xen-devel] [PATCH v6 00/11] implement vcpu preempted check
From: Pan Xinhui @ 2016-10-29  4:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <20161028193817.GD2879@char.us.oracle.com>

[-- Attachment #1: Type: text/plain, Size: 5899 bytes --]



On 2016/10/29 03:38, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 28, 2016 at 04:11:16AM -0400, Pan Xinhui wrote:
>> change from v5:
>> 	split x86/kvm patch into guest/host parts.
>> 	introduce kvm_write_guest_offset_cached.
>> 	fix some typos.
>> 	rebase patch onto 4.9.2
>> change from v4:
>> 	split x86 kvm vcpu preempted check into two patches.
>> 	add documentation patch.
>> 	add x86 vcpu preempted check patch under xen
>> 	add s390 vcpu preempted check patch
>> change from v3:
>> 	add x86 vcpu preempted check patch
>> change from v2:
>> 	no code change, fix typos, update some comments
>> change from v1:
>> 	a simpler definition of default vcpu_is_preempted
>> 	skip machine type check on ppc, and add config. remove dedicated macro.
>> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to Boqun and Peter for their suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>
> Do you have a git tree with these patches?
>
Currently no, sorry :(

I made a tar file of this patchset. It may be a little easier to apply :)

thanks
xinhui

>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin
>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>
>> We have also observed some performance improvements in UnixBench tests.
>>
>> PPC test result:
>> 1 copy - 0.94%
>> 2 copy - 7.17%
>> 4 copy - 11.9%
>> 8 copy -  3.04%
>> 16 copy - 15.11%
>>
>> details below:
>> Without patch:
>>
>> 1 copy - File Write 4096 bufsize 8000 maxblocks      2188223.0 KBps  (30.0 s, 1 samples)
>> 2 copy - File Write 4096 bufsize 8000 maxblocks      1804433.0 KBps  (30.0 s, 1 samples)
>> 4 copy - File Write 4096 bufsize 8000 maxblocks      1237257.0 KBps  (30.0 s, 1 samples)
>> 8 copy - File Write 4096 bufsize 8000 maxblocks      1032658.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks       768000.0 KBps  (30.1 s, 1 samples)
>>
>> With patch:
>>
>> 1 copy - File Write 4096 bufsize 8000 maxblocks      2209189.0 KBps  (30.0 s, 1 samples)
>> 2 copy - File Write 4096 bufsize 8000 maxblocks      1943816.0 KBps  (30.0 s, 1 samples)
>> 4 copy - File Write 4096 bufsize 8000 maxblocks      1405591.0 KBps  (30.0 s, 1 samples)
>> 8 copy - File Write 4096 bufsize 8000 maxblocks      1065080.0 KBps  (30.0 s, 1 samples)
>> 16 copy - File Write 4096 bufsize 8000 maxblocks       904762.0 KBps  (30.0 s, 1 samples)
>>
>> X86 test result:
>> test-case                              | after-patch     | before-patch
>> Execl Throughput                       |    18307.9 lps  |    11701.6 lps
>> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
>> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
>> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
>> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
>> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
>> Process Creation                       |    29881.2 lps  |    28572.8 lps
>> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
>> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
>> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps
>>
>> Christian Borntraeger (1):
>>   s390/spinlock: Provide vcpu_is_preempted
>>
>> Juergen Gross (1):
>>   x86, xen: support vcpu preempted check
>>
>> Pan Xinhui (9):
>>   kernel/sched: introduce vcpu preempted check interface
>>   locking/osq: Drop the overload of osq_lock()
>>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>>   powerpc/spinlock: support vcpu preempted check
>>   x86, paravirt: Add interface to support kvm/xen vcpu preempted check
>>   KVM: Introduce kvm_write_guest_offset_cached
>>   x86, kvm/x86.c: support vcpu preempted check
>>   x86, kernel/kvm.c: support vcpu preempted check
>>   Documentation: virtual: kvm: Support vcpu preempted check
>>
>>  Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
>>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>>  arch/s390/include/asm/spinlock.h      |  8 ++++++++
>>  arch/s390/kernel/smp.c                |  9 +++++++--
>>  arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
>>  arch/x86/include/asm/paravirt_types.h |  2 ++
>>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>>  arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
>>  arch/x86/kernel/kvm.c                 | 12 ++++++++++++
>>  arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
>>  arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
>>  arch/x86/xen/spinlock.c               |  3 ++-
>>  include/linux/kvm_host.h              |  2 ++
>>  include/linux/sched.h                 | 12 ++++++++++++
>>  kernel/locking/mutex.c                | 15 +++++++++++++--
>>  kernel/locking/osq_lock.c             | 10 +++++++++-
>>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>>  virt/kvm/kvm_main.c                   | 20 ++++++++++++++------
>>  18 files changed, 151 insertions(+), 34 deletions(-)
>>
>> --
>> 2.4.11
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> https://lists.xen.org/xen-devel
>

[-- Attachment #2: vcpu.tar --]
[-- Type: application/x-tar, Size: 81920 bytes --]


^ permalink raw reply

* Re: [Xen-devel] [PATCH v6 10/11] x86, xen: support vcpu preempted check
From: Pan Xinhui @ 2016-10-29  4:26 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <20161028194325.GE2879@char.us.oracle.com>



On 2016/10/29 03:43, Konrad Rzeszutek Wilk wrote:
> On Fri, Oct 28, 2016 at 04:11:26AM -0400, Pan Xinhui wrote:
>> From: Juergen Gross <jgross@suse.com>
>>
>> Support the vcpu_is_preempted() functionality under Xen. This will
>> enhance lock performance on overcommitted hosts (more runnable vcpus
>> than physical cpus in the system) as doing busy waits for preempted
>> vcpus will hurt system performance far worse than early yielding.
>>
>> A quick test (4 vcpus on 1 physical cpu doing a parallel build job
>> with "make -j 8") reduced system time by about 5% with this patch.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
>> ---
>>  arch/x86/xen/spinlock.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
>> index 3d6e006..74756bb 100644
>> --- a/arch/x86/xen/spinlock.c
>> +++ b/arch/x86/xen/spinlock.c
>> @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
>>  	per_cpu(irq_name, cpu) = NULL;
>>  }
>>
>> -
>
> Spurious change.
Well, it just removes one unnecessary blank line while at it.

>>  /*
>>   * Our init of PV spinlocks is split in two init functions due to us
>>   * using paravirt patching and jump labels patching and having to do
>> @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
>>  	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
>>  	pv_lock_ops.wait = xen_qlock_wait;
>>  	pv_lock_ops.kick = xen_qlock_kick;
>> +
>> +	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
>>  }
>>
>>  /*
>> --
>> 2.4.11
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> https://lists.xen.org/xen-devel
>


^ permalink raw reply

* Re: [Xen-devel] [PATCH v6 10/11] x86, xen: support vcpu preempted check
From: Konrad Rzeszutek Wilk @ 2016-10-28 19:43 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <1477642287-24104-11-git-send-email-xinhui.pan@linux.vnet.ibm.com>

On Fri, Oct 28, 2016 at 04:11:26AM -0400, Pan Xinhui wrote:
> From: Juergen Gross <jgross@suse.com>
> 
> Support the vcpu_is_preempted() functionality under Xen. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
> 
> A quick test (4 vcpus on 1 physical cpu doing a parallel build job
> with "make -j 8") reduced system time by about 5% with this patch.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> ---
>  arch/x86/xen/spinlock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 3d6e006..74756bb 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
>  	per_cpu(irq_name, cpu) = NULL;
>  }
>  
> -

Spurious change.
>  /*
>   * Our init of PV spinlocks is split in two init functions due to us
>   * using paravirt patching and jump labels patching and having to do
> @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
>  	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
>  	pv_lock_ops.wait = xen_qlock_wait;
>  	pv_lock_ops.kick = xen_qlock_kick;
> +
> +	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
>  }
>  
>  /*
> -- 
> 2.4.11
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
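The second hunk quoted above installs a backend-specific callback into an ops table (pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen). A minimal user-space sketch of that dispatch pattern follows. It is an illustration only: the `sketch_*` names, the `SKETCH_NR_VCPUS` constant and the `sketch_stolen[]` array are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_VCPUS 4	/* hypothetical vCPU count for this sketch */

/* Hypothetical ops table with a single hook, modelled on the quoted hunk. */
struct sketch_lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Default hook: without a hypervisor nothing is ever reported preempted. */
static bool sketch_native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

static struct sketch_lock_ops sketch_ops = {
	.vcpu_is_preempted = sketch_native_vcpu_is_preempted,
};

/* Hypothetical per-vCPU state that a hypervisor backend would maintain. */
static bool sketch_stolen[SKETCH_NR_VCPUS];

static bool sketch_backend_vcpu_stolen(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_VCPUS);
	return sketch_stolen[cpu];
}

/* Backend init: replace the default hook, as the quoted hunk does. */
static void sketch_backend_init_spinlocks(void)
{
	sketch_ops.vcpu_is_preempted = sketch_backend_vcpu_stolen;
}

/* What generic locking code would call. */
static bool sketch_vcpu_is_preempted(int cpu)
{
	return sketch_ops.vcpu_is_preempted(cpu);
}
```

With this shape the generic lock code never needs to know which hypervisor is present; it only calls through the table.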

^ permalink raw reply

* Re: [Xen-devel] [PATCH v6 00/11] implement vcpu preempted check
From: Konrad Rzeszutek Wilk @ 2016-10-28 19:38 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: kvm, rkrcmar, peterz, benh, will.deacon, virtualization, paulus,
	kernellwp, linux-s390, xen-devel-request, x86, mingo, xen-devel,
	paulmck, boqun.feng, jgross, linux-kernel, David.Laight, mpe,
	pbonzini, linuxppc-dev
In-Reply-To: <1477642287-24104-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

On Fri, Oct 28, 2016 at 04:11:16AM -0400, Pan Xinhui wrote:
> change from v5:
> 	split x86/kvm patch into guest/host parts.
> 	introduce kvm_write_guest_offset_cached.
> 	fix some typos.
> 	rebase patch onto 4.9.2
> change from v4:
> 	split x86 kvm vcpu preempted check into two patches.
> 	add documentation patch.
> 	add x86 vcpu preempted check patch under xen
> 	add s390 vcpu preempted check patch
> change from v3:
> 	add x86 vcpu preempted check patch
> change from v2:
> 	no code change, fix typos, update some comments
> change from v1:
> 	a simpler definition of default vcpu_is_preempted
> 	skip machine type check on ppc, and add config. remove dedicated macro.
> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
> 	add more comments
> 	thanks to Boqun and Peter for their suggestions.
> 
> This patch set aims to fix lock holder preemption issues.

Do you have a git tree with these patches?

> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
> 
> We have also observed some performance improvements in UnixBench tests.
> 
> PPC test result:
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy -  3.04%
> 16 copy - 15.11%
> 
> details below:
> Without patch:
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks      2188223.0 KBps  (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks      1804433.0 KBps  (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks      1237257.0 KBps  (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks      1032658.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks       768000.0 KBps  (30.1 s, 1 samples)
> 
> With patch: 
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks      2209189.0 KBps  (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks      1943816.0 KBps  (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks      1405591.0 KBps  (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks      1065080.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks       904762.0 KBps  (30.0 s, 1 samples)
> 
> X86 test result:
> test-case                              | after-patch     | before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps 
> Process Creation                       |    29881.2 lps  |    28572.8 lps 
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm 
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm 
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps 
> 
> Christian Borntraeger (1):
>   s390/spinlock: Provide vcpu_is_preempted
> 
> Juergen Gross (1):
>   x86, xen: support vcpu preempted check
> 
> Pan Xinhui (9):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, paravirt: Add interface to support kvm/xen vcpu preempted check
>   KVM: Introduce kvm_write_guest_offset_cached
>   x86, kvm/x86.c: support vcpu preempted check
>   x86, kernel/kvm.c: support vcpu preempted check
>   Documentation: virtual: kvm: Support vcpu preempted check
> 
>  Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>  arch/s390/include/asm/spinlock.h      |  8 ++++++++
>  arch/s390/kernel/smp.c                |  9 +++++++--
>  arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
>  arch/x86/include/asm/paravirt_types.h |  2 ++
>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>  arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
>  arch/x86/kernel/kvm.c                 | 12 ++++++++++++
>  arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
>  arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
>  arch/x86/xen/spinlock.c               |  3 ++-
>  include/linux/kvm_host.h              |  2 ++
>  include/linux/sched.h                 | 12 ++++++++++++
>  kernel/locking/mutex.c                | 15 +++++++++++++--
>  kernel/locking/osq_lock.c             | 10 +++++++++-
>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>  virt/kvm/kvm_main.c                   | 20 ++++++++++++++------
>  18 files changed, 151 insertions(+), 34 deletions(-)
> 
> -- 
> 2.4.11
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
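A note on the PPC percentages quoted above: they appear to be computed relative to the with-patch throughput rather than the unpatched baseline. For the 2 copy row, (1943816.0 - 1804433.0) / 1943816.0 is about 7.17%, which matches the quoted figure, whereas dividing by the baseline 1804433.0 gives about 7.72%. The small C sketch below shows both conventions; the function names are hypothetical and only serve this illustration.

```c
#include <assert.h>

/* Improvement in percent, measured against the patched (after) value. */
static double sketch_gain_vs_after(double before, double after)
{
	assert(after > 0.0);
	return (after - before) / after * 100.0;
}

/* Improvement in percent, measured against the unpatched (before) value. */
static double sketch_gain_vs_before(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling sketch_gain_vs_after(1804433.0, 1943816.0) reproduces the 2 copy figure from the table; the other rows can be checked the same way against the "Without patch" and "With patch" KBps values.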

^ permalink raw reply

* Re: [PATCH v6 00/11] implement vcpu preempted check
From: Paolo Bonzini @ 2016-10-28  9:57 UTC (permalink / raw)
  To: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, David.Laight, rkrcmar, peterz, benh,
	will.deacon, mingo, paulus, mpe, paulmck, boqun.feng
In-Reply-To: <1477642287-24104-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>



On 28/10/2016 10:11, Pan Xinhui wrote:
> change from v5:
> 	split x86/kvm patch into guest/host parts.
> 	introduce kvm_write_guest_offset_cached.
> 	fix some typos.
> 	rebase patch onto 4.9.2

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Thanks,

Paolo

> change from v4:
> 	split x86 kvm vcpu preempted check into two patches.
> 	add documentation patch.
> 	add x86 vcpu preempted check patch under xen
> 	add s390 vcpu preempted check patch
> change from v3:
> 	add x86 vcpu preempted check patch
> change from v2:
> 	no code change, fix typos, update some comments
> change from v1:
> 	a simpler definition of default vcpu_is_preempted
> 	skip machine type check on ppc, and add config. remove dedicated macro.
> 	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
> 	add more comments
> 	thanks to Boqun and Peter for their suggestions.
> 
> This patch set aims to fix lock holder preemption issues.
> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
> 
> We have also observed some performance improvements in UnixBench tests.
> 
> PPC test result:
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy -  3.04%
> 16 copy - 15.11%
> 
> details below:
> Without patch:
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks      2188223.0 KBps  (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks      1804433.0 KBps  (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks      1237257.0 KBps  (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks      1032658.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks       768000.0 KBps  (30.1 s, 1 samples)
> 
> With patch: 
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks      2209189.0 KBps  (30.0 s, 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks      1943816.0 KBps  (30.0 s, 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks      1405591.0 KBps  (30.0 s, 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks      1065080.0 KBps  (30.0 s, 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks       904762.0 KBps  (30.0 s, 1 samples)
> 
> X86 test result:
> test-case                              | after-patch     | before-patch
> Execl Throughput                       |    18307.9 lps  |    11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps 
> Process Creation                       |    29881.2 lps  |    28572.8 lps 
> Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm 
> Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm 
> System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps 
> 
> Christian Borntraeger (1):
>   s390/spinlock: Provide vcpu_is_preempted
> 
> Juergen Gross (1):
>   x86, xen: support vcpu preempted check
> 
> Pan Xinhui (9):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, paravirt: Add interface to support kvm/xen vcpu preempted check
>   KVM: Introduce kvm_write_guest_offset_cached
>   x86, kvm/x86.c: support vcpu preempted check
>   x86, kernel/kvm.c: support vcpu preempted check
>   Documentation: virtual: kvm: Support vcpu preempted check
> 
>  Documentation/virtual/kvm/msr.txt     |  9 ++++++++-
>  arch/powerpc/include/asm/spinlock.h   |  8 ++++++++
>  arch/s390/include/asm/spinlock.h      |  8 ++++++++
>  arch/s390/kernel/smp.c                |  9 +++++++--
>  arch/s390/lib/spinlock.c              | 25 ++++++++-----------------
>  arch/x86/include/asm/paravirt_types.h |  2 ++
>  arch/x86/include/asm/spinlock.h       |  8 ++++++++
>  arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
>  arch/x86/kernel/kvm.c                 | 12 ++++++++++++
>  arch/x86/kernel/paravirt-spinlocks.c  |  6 ++++++
>  arch/x86/kvm/x86.c                    | 16 ++++++++++++++++
>  arch/x86/xen/spinlock.c               |  3 ++-
>  include/linux/kvm_host.h              |  2 ++
>  include/linux/sched.h                 | 12 ++++++++++++
>  kernel/locking/mutex.c                | 15 +++++++++++++--
>  kernel/locking/osq_lock.c             | 10 +++++++++-
>  kernel/locking/rwsem-xadd.c           | 16 +++++++++++++---
>  virt/kvm/kvm_main.c                   | 20 ++++++++++++++------
>  18 files changed, 151 insertions(+), 34 deletions(-)
> 


* [PATCH v6 11/11] Documentation: virtual: kvm: Support vcpu preempted check
From: Pan Xinhui @ 2016-10-28  8:11 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, David.Laight, rkrcmar, peterz, benh,
	will.deacon, Pan Xinhui, mingo, paulus, mpe, pbonzini, paulmck,
	boqun.feng
In-Reply-To: <1477642287-24104-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

Commit ("x86, kvm: support vcpu preempted check") adds one field, "__u8
preempted", to struct kvm_steal_time. This field tells whether a vcpu is
running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu is
not preempted. Any other value means the vcpu has been preempted.
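
As an illustration only (not part of the patch), the layout change can be
modelled in user space. The sketch below assumes fixed-width stand-ins for
__u8/__u32/__u64; the struct names steal_time_old and steal_time_new and the
helper steal_time_vcpu_preempted() are hypothetical. It shows that carving
"preempted" out of the first pad word leaves the structure size unchanged,
which is why a host that never writes the field leaves it reading as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch, as documented in msr.txt. */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the patch: one pad word becomes preempted + 3 pad bytes. */
struct steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The new field reuses padding, so the overall size must not change. */
_Static_assert(sizeof(struct steal_time_new) == sizeof(struct steal_time_old),
	       "steal time structure size changed");

/*
 * Hypothetical helper: non-zero means preempted; zero means either
 * "not preempted" or "the hypervisor does not fill in this field".
 */
static int steal_time_vcpu_preempted(const struct steal_time_new *st)
{
	return st->preempted != 0;
}
```

Note that a guest cannot tell "running" apart from "unsupported" using this
field alone; both read as zero.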

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8  preempted;
+		__u8  u8_pad[3];
+		__u32 pad[11];
 	}
 
 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
 
+		preempted: indicates whether the VCPU that owns this struct is
+		running or not. Non-zero values mean the VCPU has been
+		preempted. Zero means the VCPU is not preempted. NOTE, it is
+		always zero if the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-- 
2.4.11



* [PATCH v6 10/11] x86, xen: support vcpu preempted check
From: Pan Xinhui @ 2016-10-28  8:11 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm, xen-devel, x86
  Cc: kernellwp, jgross, David.Laight, rkrcmar, peterz, benh,
	will.deacon, Pan Xinhui, mingo, paulus, mpe, pbonzini, paulmck,
	boqun.feng
In-Reply-To: <1477642287-24104-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

From: Juergen Gross <jgross@suse.com>

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.
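
For readers unfamiliar with the mechanism, here is a minimal user-space
sketch of the idea, not the kernel code itself. It assumes a replaceable hook
in an ops table (loosely modelled on pv_lock_ops) and a bounded spin loop.
The names lock_ops, model_vcpu_is_preempted(), vcpu_preempted_state, struct
owner and spin_on_owner() with its spin budget are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VCPUS 4

/* Hypothetical model of hypervisor state: true means the vCPU is descheduled. */
static bool vcpu_preempted_state[NR_VCPUS];

/* Ops table with a replaceable hook, loosely modelled on pv_lock_ops. */
struct lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Default when no hypervisor support exists: never report preemption. */
static bool native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

/* Stand-in for a hypervisor-specific implementation of the hook. */
static bool model_vcpu_is_preempted(int cpu)
{
	return vcpu_preempted_state[cpu];
}

static struct lock_ops lock_ops = {
	.vcpu_is_preempted = native_vcpu_is_preempted,
};

struct owner {
	int cpu;
	volatile bool holds_lock;
};

enum spin_result { SPIN_RELEASED, SPIN_OWNER_PREEMPTED, SPIN_BUDGET_EXHAUSTED };

/*
 * Busy-wait while the owner holds the lock, but give up as soon as the
 * owner's vCPU is reported preempted: it cannot release the lock until
 * it runs again. The budget only keeps this model from spinning forever.
 */
static enum spin_result spin_on_owner(const struct owner *o, unsigned long budget)
{
	while (o->holds_lock) {
		if (lock_ops.vcpu_is_preempted(o->cpu))
			return SPIN_OWNER_PREEMPTED;
		if (budget == 0)
			return SPIN_BUDGET_EXHAUSTED;
		budget--;
	}
	return SPIN_RELEASED;
}
```

With the default hook the waiter burns its whole budget; once a
hypervisor-specific hook is installed, it bails out on the first iteration
and can yield the physical cpu instead.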

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11


