Linux virtualization list
 help / color / mirror / Atom feed
* [PATCH 1/7] pstore: Split pstore fragile flags
From: Namhyung Kim @ 2016-07-27 15:08 UTC (permalink / raw)
  To: kvm, qemu-devel, virtualization
  Cc: Tony Luck, Kees Cook, Matt Fleming, Anton Vorontsov,
	Rafael J. Wysocki, LKML, linux-acpi, linux-efi, Colin Cross,
	Len Brown
In-Reply-To: <1469632111-23260-1-git-send-email-namhyung@kernel.org>

This patch adds new PSTORE_FLAGS for each pstore type so that they can
be enabled separately.  This is a preparation for ongoing virtio-pstore
work to support those types flexibly.

The PSTORE_FLAGS_FRAGILE is changed to PSTORE_FLAGS_DMESG to preserve the
original behavior.

Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: linux-acpi@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 drivers/acpi/apei/erst.c          |  2 +-
 drivers/firmware/efi/efi-pstore.c |  2 +-
 fs/pstore/platform.c              | 17 ++++++++++-------
 fs/pstore/ram.c                   |  2 ++
 include/linux/pstore.h            |  7 ++++++-
 5 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index f096ab3cb54d..ec4f507b524f 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -938,7 +938,7 @@ static int erst_clearer(enum pstore_type_id type, u64 id, int count,
 static struct pstore_info erst_info = {
 	.owner		= THIS_MODULE,
 	.name		= "erst",
-	.flags		= PSTORE_FLAGS_FRAGILE,
+	.flags		= PSTORE_FLAGS_DMESG,
 	.open		= erst_open_pstore,
 	.close		= erst_close_pstore,
 	.read		= erst_reader,
diff --git a/drivers/firmware/efi/efi-pstore.c b/drivers/firmware/efi/efi-pstore.c
index 30a24d09ea6c..4daa5acd9117 100644
--- a/drivers/firmware/efi/efi-pstore.c
+++ b/drivers/firmware/efi/efi-pstore.c
@@ -362,7 +362,7 @@ static int efi_pstore_erase(enum pstore_type_id type, u64 id, int count,
 static struct pstore_info efi_pstore_info = {
 	.owner		= THIS_MODULE,
 	.name		= "efi",
-	.flags		= PSTORE_FLAGS_FRAGILE,
+	.flags		= PSTORE_FLAGS_DMESG,
 	.open		= efi_pstore_open,
 	.close		= efi_pstore_close,
 	.read		= efi_pstore_read,
diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 16ecca5b72d8..76dd604a0f2c 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -659,13 +659,14 @@ int pstore_register(struct pstore_info *psi)
 	if (pstore_is_mounted())
 		pstore_get_records(0);
 
-	pstore_register_kmsg();
-
-	if ((psi->flags & PSTORE_FLAGS_FRAGILE) == 0) {
+	if (psi->flags & PSTORE_FLAGS_DMESG)
+		pstore_register_kmsg();
+	if (psi->flags & PSTORE_FLAGS_CONSOLE)
 		pstore_register_console();
+	if (psi->flags & PSTORE_FLAGS_FTRACE)
 		pstore_register_ftrace();
+	if (psi->flags & PSTORE_FLAGS_PMSG)
 		pstore_register_pmsg();
-	}
 
 	if (pstore_update_ms >= 0) {
 		pstore_timer.expires = jiffies +
@@ -689,12 +690,14 @@ EXPORT_SYMBOL_GPL(pstore_register);
 
 void pstore_unregister(struct pstore_info *psi)
 {
-	if ((psi->flags & PSTORE_FLAGS_FRAGILE) == 0) {
+	if (psi->flags & PSTORE_FLAGS_PMSG)
 		pstore_unregister_pmsg();
+	if (psi->flags & PSTORE_FLAGS_FTRACE)
 		pstore_unregister_ftrace();
+	if (psi->flags & PSTORE_FLAGS_CONSOLE)
 		pstore_unregister_console();
-	}
-	pstore_unregister_kmsg();
+	if (psi->flags & PSTORE_FLAGS_DMESG)
+		pstore_unregister_kmsg();
 
 	free_buf_for_compression();
 
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 47516a794011..ba19a74e95bc 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -624,6 +624,8 @@ static int ramoops_probe(struct platform_device *pdev)
 		goto fail_clear;
 	}
 
+	cxt->pstore.flags = PSTORE_FLAGS_ALL;
+
 	err = pstore_register(&cxt->pstore);
 	if (err) {
 		pr_err("registering with pstore failed\n");
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
index 899e95e84400..069b96faf478 100644
--- a/include/linux/pstore.h
+++ b/include/linux/pstore.h
@@ -74,7 +74,12 @@ struct pstore_info {
 	void		*data;
 };
 
-#define	PSTORE_FLAGS_FRAGILE	1
+#define PSTORE_FLAGS_DMESG	(1 << 0)
+#define PSTORE_FLAGS_CONSOLE	(1 << 1)
+#define PSTORE_FLAGS_FTRACE	(1 << 2)
+#define PSTORE_FLAGS_PMSG	(1 << 3)
+
+#define PSTORE_FLAGS_ALL	((1 << 4) - 1)
 
 extern int pstore_register(struct pstore_info *);
 extern void pstore_unregister(struct pstore_info *);
-- 
2.8.0

^ permalink raw reply related

* [RFC/PATCHSET 0/7] virtio: Implement virtio pstore device (v2)
From: Namhyung Kim @ 2016-07-27 15:08 UTC (permalink / raw)
  To: kvm, qemu-devel, virtualization
  Cc: Tony Luck, Radim Krčmář, Kees Cook,
	Michael S. Tsirkin, Anton Vorontsov, Will Deacon, LKML,
	Steven Rostedt, Minchan Kim, Anthony Liguori, Colin Cross,
	Paolo Bonzini, Ingo Molnar

Hello,

This is v2 of the virtio-pstore work.  In this patchset I addressed
most of feedbacks from previous version.  Limiting disk size is not
implemented yet.

 * changes in v2)
  - update VIRTIO_ID_PSTORE to 22  (Cornelia, Stefan)
  - make buffer size configurable  (Cornelia)
  - support PSTORE_TYPE_CONSOLE  (Kees)
  - use separate virtqueues for read and write
  - support concurrent async write
  - manage pstore (file) id in device side
  - fix various mistakes in qemu device  (Stefan)

It started from the fact that dumping ftrace buffer at kernel
oops/panic takes too much time.  Although there's a way to reduce the
size of the original data, sometimes I want to have the information as
many as possible.  Maybe kexec/kdump can solve this problem but it
consumes some portion of guest memory so I'd like to avoid it.  And I
know the qemu + crashtool can dump and analyze the whole guest memory
including the ftrace buffer without wasting guest memory, but it adds
one more layer and has some limitation as an out-of-tree tool like not
being in sync with the kernel changes.

So I think it'd be great using the pstore interface to dump guest
kernel data on the host.  One can read the data on the host directly
or on the guest (at the next boot) using pstore filesystem as usual.
While this patchset only implements dumping kernel log buffer, it can
be extended to have ftrace buffer and probably some more..

The patch 0001-0003 are preparation for pstore to support virtio
device which requires async write.  The patch 0004 implements virtio
pstore driver.  It has two virt queue for (sync) read and (async)
write, pstore buffer and io request and response structure.  The
virtio_pstore_req struct is to give information about the current
pstore operation.  The result will be written to the virtio_pstore_res
struct.  For read operation it also uses virtio_pstore_fileinfo struct.

The patch 0005 adds support for PSTORE_TYPE_CONSOLE which was
requested by Kees.  The console data is appended to a single file for
now.

The patch 0006 and 0007 implement virtio-pstore legacy PCI device on
qemu-kvm and kvmtool respectively.  I referenced virtio-baloon and
virtio-rng implementations and I don't know whether kvmtool supports
modern virtio 1.0+ spec.  Other transports might be supported later.

For example, using virtio-pstore on qemu looks like below:

  $ qemu-system-x86_64 -enable-kvm -device virtio-pstore,directory=xxx

When guest kernel gets panic the log messages will be saved under the
xxx directory.

  $ ls xxx
  dmesg-1.enc.z  dmesg-2.enc.z

As you can see the pstore subsystem compresses the log data using zlib
(now supports lzo and lz4 too).  The data can be extracted with the
following command:

  $ cat xxx/dmesg-1.enc.z | \
  > python -c 'import sys, zlib; print(zlib.decompress(sys.stdin.read()))'
  Oops#1 Part1
  <5>[    0.000000] Linux version 4.6.0kvm+ (namhyung@danjae) (gcc version 5.3.0 (GCC) ) #145 SMP Mon Jul 18 10:22:45 KST 2016
  <6>[    0.000000] Command line: root=/dev/vda console=ttyS0
  <6>[    0.000000] x86/fpu: Legacy x87 FPU detected.
  <6>[    0.000000] x86/fpu: Using 'eager' FPU context switches.
  <6>[    0.000000] e820: BIOS-provided physical RAM map:
  <6>[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
  <6>[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
  <6>[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
  <6>[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007fddfff] usable
  <6>[    0.000000] BIOS-e820: [mem 0x0000000007fde000-0x0000000007ffffff] reserved
  <6>[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
  <6>[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
  <6>[    0.000000] NX (Execute Disable) protection: active
  <6>[    0.000000] SMBIOS 2.8 present.
  <7>[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  ...

To enable PSTORE_TYPE_CONSOLE, add 'console=true' to virtio-pstore
device option.  Also 'bufsize' option can set different size for
pstore buffer (default is 16K).  Maybe we can add a config option to
control the compression later.

Currently the kvmtool doesn't support any options except the directory
the pstore saves the logs.


Namhyung Kim (7):
  pstore: Split pstore fragile flags
  pstore/ram: Set pstore flags dynamically
  pstore: Manage buffer position for async write
  virtio: Basic implementation of virtio pstore driver
  virtio-pstore: Support PSTORE_TYPE_CONSOLE
  qemu: Implement virtio-pstore device
  kvmtool: Implement virtio-pstore device

 drivers/acpi/apei/erst.c             |   2 +-
 drivers/firmware/efi/efi-pstore.c    |   4 +-
 drivers/virtio/Kconfig               |  10 +
 drivers/virtio/Makefile              |   1 +
 drivers/virtio/virtio_pstore.c       | 421 +++++++++++++++++++++
 fs/pstore/platform.c                 |  65 +++-
 fs/pstore/ram.c                      |   8 +
 include/linux/pstore.h               |   9 +-
 include/uapi/linux/Kbuild            |   1 +
 include/uapi/linux/virtio_ids.h      |   1 +
 include/uapi/linux/virtio_pstore.h   |  78 +++-
 11 files changed, 580 insertions(+), 20 deletions(-)
 create mode 100644 drivers/virtio/virtio_pstore.c
 create mode 100644 include/uapi/linux/virtio_pstore.h


Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: kvm@vger.kernel.org
Cc: qemu-devel@nongnu.org
Cc: virtualization@lists.linux-foundation.org

Thanks,
Namhyung


-- 
2.8.0

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: ext4 error when testing virtio-scsi & vhost-scsi
From: Zhangfei Gao @ 2016-07-27  7:58 UTC (permalink / raw)
  To: Theodore Ts'o, Michael S. Tsirkin, Nicholas Bellinger
  Cc: linux-ext4, target-devel, qemu-devel, kvm, virtualization
In-Reply-To: <CAMj5BkicJhB1Bk6q_ZT_nMKSPWf1XEGchtnGnMPFy=O_0x91hA@mail.gmail.com>

Hi, Michael

I have met ext4 error when using vhost_scsi on arm64 platform, and
suspect it is vhost_scsi issue.

Ext4 error when testing virtio_scsi & vhost_scsi


No issue:
1. virtio_scsi, ext4
2. vhost_scsi & virtio_scsi, ext2
3.  Instead of vhost, also tried loopback and no problem.
Using loopback, host can use the new block device, while vhost is used
by guest (qemu).
http://www.linux-iscsi.org/wiki/Tcm_loop
Test directly in host, not find ext4 error.



Have issue:
1. vhost_scsi & virtio_scsi, ext4
a. iblock
b, fileio, file located in /tmp (ram), no device based.

2, Have tried 4.7-r2 and 4.5-rc1 on D02 board, both have issue.
Since I need kvm specific patch for D02, so it may not freely to switch
to older version.

3. Also test with ext4, disabling journal
mkfs.ext4 -O ^has_journal /dev/sda


Do you have any suggestion?

Thanks

On Tue, Jul 19, 2016 at 4:21 PM, Zhangfei Gao <zhangfei.gao@gmail.com> wrote:
> On Tue, Jul 19, 2016 at 3:56 PM, Zhangfei Gao <zhangfei.gao@gmail.com> wrote:
>> Dear Ted
>>
>> On Wed, Jul 13, 2016 at 12:43 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>>> On Tue, Jul 12, 2016 at 03:14:38PM +0800, Zhangfei Gao wrote:
>>>> Some update:
>>>>
>>>> If test with ext2, no problem in iblock.
>>>> If test with ext4, ext4_mb_generate_buddy reported error in the
>>>> removing files after reboot.
>>>>
>>>>
>>>> root@(none)$ rm test
>>>> [   21.006549] EXT4-fs error (device sda): ext4_mb_generate_buddy:758: group 18
>>>> , block bitmap and bg descriptor inconsistent: 26464 vs 25600 free clusters
>>>> [   21.008249] JBD2: Spotted dirty metadata buffer (dev = sda, blocknr = 0). Th
>>>> ere's a risk of filesystem corruption in case of system crash.
>>>>
>>>> Any special notes of using ext4 in qemu?
>>>
>>> Ext4 has more runtime consistency checking than ext2.  So just because
>>> ext4 complains doesn't mean that there isn't a problem with the file
>>> system; it just means that ext4 is more likely to notice before you
>>> lose user data.
>>>
>>> So if you test with ext2, try running e2fsck afterwards, to make sure
>>> the file system is consistent.
>>>
>>> Given that I'm reguarly testing ext4 using kvm, and I haven't seen
>>> anything like this in a very long time, I suspect the problemb is with
>>> your SCSI code, and not with ext4.
>>>
>>
>> Do you know what's the possible reason of this error.
>>
>> Have tried 4.7-rc2, same issue exist.
>> It can be reproduced by fileio and iblock as backstore.
>> It is easier to happen in qemu like this process:
>> qemu-> mount-> dd xx -> umout -> mount -> rm xx, then the error may
>> happen, no need to reboot.
>>
>> ramdisk can not cause error just because it just malloc and memcpy,
>> while not going to blk layer.
>>
>> Also tried creating one file in /tmp, used as fileio, also can reproduce.
>> So no real device is based.
>>
>> like:
>> cd /tmp
>> dd if=/dev/zero of=test bs=1M count=1024; sync;
>> targetcli
>> #targetcli
>> (targetcli) /> cd backstores/fileio
>> (targetcli) /> create name=file_backend file_or_dev=/tmp/test size=1G
>> (targetcli) /> cd /vhost
>> (targetcli) /> create wwn=naa.60014052cc816bf4
>> (targetcli) /> cd naa.60014052cc816bf4/tpgt1/luns
>> (targetcli) /> create /backstores/fileio/file_backend
>> (targetcli) /> cd /
>> (targetcli) /> saveconfig
>> (targetcli) /> exit
>>
>> /work/qemu.git/aarch64-softmmu/qemu-system-aarch64 \
>>     -enable-kvm -nographic -kernel Image \
>>     -device vhost-scsi-pci,wwpn=naa.60014052cc816bf4 \
>>     -m 512 -M virt -cpu host \
>>     -append "earlyprintk console=ttyAMA0 mem=512M"
>>
>> in qemu:
>> mkfs.ext4 /dev/sda
>> mount /dev/sda /mnt/
>> sync; date; dd if=/dev/zero of=/mnt/test bs=1M count=100; sync; date;
>>
>> using dd test, then some error happen.
>> log like:
>> oot@(none)$ sync; date; dd if=/dev/zero of=test bs=1M count=100; sync;; date;
>> [ 1789.917963] sbc_parse_cdb cdb[0]=0x35
>> [ 1789.922000] fd_execute_sync_cache immed=0
>> Tue Jul 19 07:26:12 UTC 2016
>> [  200.712879] EXT4-fs error (device sda) [ 1790.191770] sbc_parse_cdb
>> cdb[0]=0x2a
>> in ext4_reserve_inode_write:5362[ 1790.198382]  fd_execute_rw
>> : Corrupt filesystem
>> [  200.729001] EXT4-fs error (device sda) [ 1790.207843] sbc_parse_cdb
>> cdb[0]=0x2a
>> in ext4_reserve_inode_write:5362[ 1790.214495]  fd_execute_rw
>> : Corrupt filesystem
>>
>> Looks like the error usually happens after SYCHRONIZE CACHE, but not
>> for sure it is always happen after sync cache.
>>
> It is not always happen after SYCHRONIZE CACHE
>
> Just tried in qemu: mount-> dd xx -> umount -> mount -> rm xx
> ram based, (/tmp/test), no reboot.
>
> root@(none)$ cd /mnt
> root@(none)$ ls
> [  301.444966]  sbc_parse_cdb cdb[0]=0x28
> [  301.449003]  fd_execute_rw
> lost+found  test
> root@(none)$ rm test
> [  304.281920]  sbc_parse_cdb cdb[0]=0x28
> [  304.285955]  fd_execute_rw
> [  118.002338] EXT4-fs error (device sda):[  304.290685] gzf sbc_parse_cdb cdb[0
> ]=0x28
>  ext4_mb_generate_buddy:758: gro[  304.296737] gzf fd_execute_rw
> up 3, block bitmap and bg descri[  304.304099]  sbc_parse_cdb cdb[0]=0x28
> ptor inconsistent: 21504 vs 2143[  304.309322]  fd_execute_rw
> 9 free clusters
> [  118.015903] JBD2: Spotted dirty metadata buffer (dev = sda, blocknr = 0). The
> re's a risk of filesystem corruption in case of system crash.
> root@(none)$
>
> Thanks

^ permalink raw reply

* RE: [virtio-dev] Re: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Li, Liang Z @ 2016-07-27  1:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	dgilbert@redhat.com, linux-mm@kvack.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20160726215256-mutt-send-email-mst@kernel.org>

> So I'm fine with this patchset, but I noticed it was not yet reviewed by MM
> people. And that is not surprising since you did not copy memory
> management mailing list on it.
> 
> I added linux-mm@kvack.org Cc on this mail but this might not be enough.
> 
> Please repost (e.g. [PATCH v2 repost]) copying the relevant mailing list so we
> can get some reviews.
> 

I will repost. Thanks!

Liang

^ permalink raw reply

* [PATCH v2 repost 7/7] virtio-balloon: tell host vm's free page info
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, Andrew Morton,
	virtualization, Mel Gorman, Vlastimil Babka
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

Support the request for vm's free page information, response with
a page bitmap. QEMU can make use of this free page bitmap to speed
up live migration process by skipping process the free pages.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 104 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 98 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 2d18ff6..5ca4ad3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -62,10 +62,13 @@ module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
 extern unsigned long get_max_pfn(void);
+extern int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
+		unsigned long *bitmap, unsigned long len);
+
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *misc_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -89,6 +92,8 @@ struct virtio_balloon {
 	unsigned long pfn_limit;
 	/* Used to record the processed pfn range */
 	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+	/* Request header */
+	struct balloon_req_hdr req_hdr;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -373,6 +378,49 @@ static void update_balloon_stats(struct virtio_balloon *vb)
 				pages_to_bytes(available));
 }
 
+static void update_free_pages_stats(struct virtio_balloon *vb,
+				unsigned long req_id)
+{
+	struct scatterlist sg_in, sg_out;
+	unsigned long pfn = 0, bmap_len, max_pfn;
+	struct virtqueue *vq = vb->misc_vq;
+	struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+	int ret = 1;
+
+	max_pfn = get_max_pfn();
+	mutex_lock(&vb->balloon_lock);
+	while (pfn < max_pfn) {
+		memset(vb->page_bitmap, 0, vb->bmap_len);
+		ret = get_free_pages(pfn, pfn + vb->pfn_limit,
+			vb->page_bitmap, vb->bmap_len * BITS_PER_BYTE);
+		hdr->cmd = cpu_to_virtio16(vb->vdev, BALLOON_GET_FREE_PAGES);
+		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+		hdr->req_id = cpu_to_virtio64(vb->vdev, req_id);
+		hdr->start_pfn = cpu_to_virtio64(vb->vdev, pfn);
+		bmap_len = vb->pfn_limit / BITS_PER_BYTE;
+		if (!ret) {
+			hdr->flag = cpu_to_virtio16(vb->vdev,
+							BALLOON_FLAG_DONE);
+			if (pfn + vb->pfn_limit > max_pfn)
+				bmap_len = (max_pfn - pfn) / BITS_PER_BYTE;
+		} else
+			hdr->flag = cpu_to_virtio16(vb->vdev,
+							BALLOON_FLAG_CONT);
+		hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+		sg_init_one(&sg_out, hdr,
+			 sizeof(struct balloon_bmap_hdr) + bmap_len);
+
+		virtqueue_add_outbuf(vq, &sg_out, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+		pfn += vb->pfn_limit;
+	}
+
+	sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+	virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+	virtqueue_kick(vq);
+	mutex_unlock(&vb->balloon_lock);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -511,18 +559,49 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+	struct balloon_req_hdr *ptr_hdr;
+	unsigned int len;
+
+	ptr_hdr = virtqueue_get_buf(vb->misc_vq, &len);
+	if (!ptr_hdr || len != sizeof(vb->req_hdr))
+		return;
+
+	switch (ptr_hdr->cmd) {
+	case BALLOON_GET_FREE_PAGES:
+		update_free_pages_stats(vb, ptr_hdr->param);
+		break;
+	default:
+		break;
+	}
+}
+
+static void misc_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	misc_handle_rq(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
+	struct virtqueue *vqs[4];
+	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
+					 stats_request, misc_request };
+	static const char * const names[] = { "inflate", "deflate", "stats",
+						 "misc" };
 	int err, nvqs;
 
 	/*
 	 * We expect two virtqueues: inflate and deflate, and
 	 * optionally stat.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		nvqs = 4;
+	else
+		nvqs = virtio_has_feature(vb->vdev,
+					  VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
 	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
 	if (err)
 		return err;
@@ -543,6 +622,16 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
+		struct scatterlist sg_in;
+
+		vb->misc_vq = vqs[3];
+		sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+		if (virtqueue_add_inbuf(vb->misc_vq, &sg_in, 1,
+		    &vb->req_hdr, GFP_KERNEL) < 0)
+			BUG();
+		virtqueue_kick(vb->misc_vq);
+	}
 	return 0;
 }
 
@@ -639,8 +728,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
 
 	/* Clear the feature bit if memory allocation fails */
-	if (!vb->bmap_hdr)
+	if (!vb->bmap_hdr) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_MISC_VQ);
+	}
 	else
 		vb->page_bitmap = vb->bmap_hdr + hdr_len;
 	mutex_init(&vb->balloon_lock);
@@ -743,6 +834,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_PAGE_BITMAP,
+	VIRTIO_BALLOON_F_MISC_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 6/7] mm: add the related functions to get free page info
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, Andrew Morton,
	virtualization, Mel Gorman, Vlastimil Babka
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

Save the free page info into a page bitmap, will be used in virtio
balloon device driver.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7da61ad..3ad8b10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
+	unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
+{
+	unsigned long pfn, flags, page_num;
+	unsigned int order, t;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+	end_pfn = min(start_pfn + len, end_pfn);
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			if (pfn >= start_pfn && pfn <= end_pfn) {
+				page_num = 1UL << order;
+				if (pfn + page_num > end_pfn)
+					page_num = end_pfn - pfn;
+				bitmap_set(bitmap, pfn - start_pfn, page_num);
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
+		unsigned long *bitmap, unsigned long len)
+{
+	struct zone *zone;
+	int ret = 0;
+
+	if (bitmap == NULL || start_pfn > end_pfn || start_pfn >= max_pfn)
+		return 0;
+	if (end_pfn < max_pfn)
+		ret = 1;
+	if (end_pfn >= max_pfn)
+		ret = 0;
+
+	for_each_populated_zone(zone)
+		mark_free_pages_bitmap(zone, start_pfn, end_pfn, bitmap, len);
+	return ret;
+}
+EXPORT_SYMBOL(get_free_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 5/7] virtio-balloon: define feature bit and head for misc virt queue
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, virtualization
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

Define a new feature bit which supports a new virtual queue. This
new virtual qeuque is for information exchange between hypervisor
and guest. The VMM hypervisor can make use of this virtual queue
to request the guest do some operations, e.g. drop page cache,
synchronize file system, etc. And the VMM hypervisor can get some
of guest's runtime information through this virtual queue, e.g. the
guest's free page information, which can be used for live migration
optimization.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index d3b182a..be4880f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Misc info virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct balloon_bmap_hdr {
 	__virtio64 bmap_len;
 };
 
+enum balloon_req_id {
+	/* Get free pages information */
+	BALLOON_GET_FREE_PAGES,
+};
+
+enum balloon_flag {
+	/* Have more data for a request */
+	BALLOON_FLAG_CONT,
+	/* No more data for a request */
+	BALLOON_FLAG_DONE,
+};
+
+struct balloon_req_hdr {
+	/* Used to distinguish different request */
+	__virtio16 cmd;
+	/* Reserved */
+	__virtio16 reserved[3];
+	/* Request parameter */
+	__virtio64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 4/7] virtio-balloon: speed up inflate/deflate process
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, Andrew Morton,
	virtualization, Mel Gorman, Vlastimil Babka
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

The implementation of the current virtio-balloon is not very
efficient, the time spends on different stages of inflating
the balloon to 7GB of a 8GB idle guest:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottle neck are the stage b and stage d.

If using a bitmap to send the page info instead of the PFNs, we
can reduce the overhead in stage b quite a lot. Furthermore, we
can do the address translation and call madvise() with a bulk of
RAM pages, instead of the current page per page way, the overhead
of stage c and stage d can also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of a 8GB idle guest only takes 590ms, the
performance improvement is about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 184 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 162 insertions(+), 22 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8d649a2..2d18ff6 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -41,10 +41,28 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/*
+ * VIRTIO_BALLOON_PFNS_LIMIT is used to limit the size of page bitmap
+ * to prevent a very large page bitmap, there are two reasons for this:
+ * 1) to save memory.
+ * 2) allocate a large bitmap may fail.
+ *
+ * The actual limit of pfn is determined by:
+ * pfn_limit = min(max_pfn, VIRTIO_BALLOON_PFNS_LIMIT);
+ *
+ * If system has more pages than VIRTIO_BALLOON_PFNS_LIMIT, we will scan
+ * the page list and send the PFNs with several times. To reduce the
+ * overhead of scanning the page list. VIRTIO_BALLOON_PFNS_LIMIT should
+ * be set with a value which can cover most cases.
+ */
+#define VIRTIO_BALLOON_PFNS_LIMIT ((32 * (1ULL << 30)) >> PAGE_SHIFT) /* 32GB */
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+extern unsigned long get_max_pfn(void);
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -62,6 +80,15 @@ struct virtio_balloon {
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+	/* Pointer of the bitmap header. */
+	void *bmap_hdr;
+	/* Bitmap and length used to tell the host the pages */
+	unsigned long *page_bitmap;
+	unsigned long bmap_len;
+	/* Pfn limit */
+	unsigned long pfn_limit;
+	/* Used to record the processed pfn range */
+	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -105,12 +132,45 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+static inline void init_pfn_range(struct virtio_balloon *vb)
+{
+	vb->min_pfn = ULONG_MAX;
+	vb->max_pfn = 0;
+}
+
+static inline void update_pfn_range(struct virtio_balloon *vb,
+				 struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	if (balloon_pfn < vb->min_pfn)
+		vb->min_pfn = balloon_pfn;
+	if (balloon_pfn > vb->max_pfn)
+		vb->max_pfn = balloon_pfn;
+}
+
 static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 {
 	struct scatterlist sg;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+		struct balloon_bmap_hdr *hdr = vb->bmap_hdr;
+		unsigned long bmap_len;
+
+		/* cmd and req_id are not used here, set them to 0 */
+		hdr->cmd = cpu_to_virtio16(vb->vdev, 0);
+		hdr->page_shift = cpu_to_virtio16(vb->vdev, PAGE_SHIFT);
+		hdr->reserved = cpu_to_virtio16(vb->vdev, 0);
+		hdr->req_id = cpu_to_virtio64(vb->vdev, 0);
+		hdr->start_pfn = cpu_to_virtio64(vb->vdev, vb->start_pfn);
+		bmap_len = min(vb->bmap_len,
+			(vb->end_pfn - vb->start_pfn) / BITS_PER_BYTE);
+		hdr->bmap_len = cpu_to_virtio64(vb->vdev, bmap_len);
+		sg_init_one(&sg, hdr,
+			 sizeof(struct balloon_bmap_hdr) + bmap_len);
+	} else
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
 	/* We should always be able to add one buffer to an empty queue. */
 	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
@@ -118,7 +178,6 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
 
 	/* When host has read buffer, this completes via balloon_ack */
 	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
-
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -133,13 +192,53 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
-static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
+static void set_page_bitmap(struct virtio_balloon *vb,
+			 struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn;
+	struct page *page;
+	bool found;
+
+	vb->min_pfn = rounddown(vb->min_pfn, BITS_PER_LONG);
+	vb->max_pfn = roundup(vb->max_pfn, BITS_PER_LONG);
+	for (pfn = vb->min_pfn; pfn < vb->max_pfn;
+			pfn += vb->pfn_limit) {
+		vb->start_pfn = pfn + vb->pfn_limit;
+		vb->end_pfn = pfn;
+		memset(vb->page_bitmap, 0, vb->bmap_len);
+		found = false;
+		list_for_each_entry(page, pages, lru) {
+			unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+			if (balloon_pfn < pfn ||
+				 balloon_pfn >= pfn + vb->pfn_limit)
+				continue;
+			set_bit(balloon_pfn - pfn, vb->page_bitmap);
+			if (balloon_pfn > vb->end_pfn)
+				vb->end_pfn = balloon_pfn;
+			if (balloon_pfn < vb->start_pfn)
+				vb->start_pfn = balloon_pfn;
+			found = true;
+		}
+		if (found) {
+			vb->start_pfn = rounddown(vb->start_pfn, BITS_PER_LONG);
+			vb->end_pfn = roundup(vb->end_pfn, BITS_PER_LONG);
+			tell_host(vb, vq);
+		}
+	}
+}
+
+static unsigned int fill_balloon(struct virtio_balloon *vb, size_t num,
+				 bool use_bmap)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
-	unsigned num_allocated_pages;
+	unsigned int num_allocated_pages;
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (use_bmap)
+		init_pfn_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -154,7 +253,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -163,8 +265,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -184,15 +291,19 @@ static void release_pages_balloon(struct virtio_balloon *vb,
 	}
 }
 
-static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num,
+				bool use_bmap)
 {
-	unsigned num_freed_pages;
+	unsigned int num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (use_bmap)
+		init_pfn_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -200,7 +311,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -211,9 +325,14 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb, &pages);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+
+		release_pages_balloon(vb, &pages);
+	}
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
@@ -347,13 +466,15 @@ static int virtballoon_oom_notify(struct notifier_block *self,
 	struct virtio_balloon *vb;
 	unsigned long *freed;
 	unsigned num_freed_pages;
+	bool use_bmap;
 
 	vb = container_of(self, struct virtio_balloon, nb);
 	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		return NOTIFY_OK;
 
 	freed = parm;
-	num_freed_pages = leak_balloon(vb, oom_pages);
+	use_bmap = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+	num_freed_pages = leak_balloon(vb, oom_pages, use_bmap);
 	update_balloon_size(vb);
 	*freed += num_freed_pages;
 
@@ -373,15 +494,17 @@ static void update_balloon_size_func(struct work_struct *work)
 {
 	struct virtio_balloon *vb;
 	s64 diff;
+	bool use_bmap;
 
 	vb = container_of(work, struct virtio_balloon,
 			  update_balloon_size_work);
+	use_bmap = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 	diff = towards_target(vb);
 
 	if (diff > 0)
-		diff -= fill_balloon(vb, diff);
+		diff -= fill_balloon(vb, diff, use_bmap);
 	else if (diff < 0)
-		diff += leak_balloon(vb, -diff);
+		diff += leak_balloon(vb, -diff, use_bmap);
 	update_balloon_size(vb);
 
 	if (diff)
@@ -489,7 +612,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
-	int err;
+	int err, hdr_len;
 
 	if (!vdev->config->get) {
 		dev_err(&vdev->dev, "%s failure: config access disabled\n",
@@ -508,6 +631,18 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+	vb->pfn_limit = VIRTIO_BALLOON_PFNS_LIMIT;
+	vb->pfn_limit = min(vb->pfn_limit, get_max_pfn());
+	vb->bmap_len = ALIGN(vb->pfn_limit, BITS_PER_LONG) /
+		 BITS_PER_BYTE + 2 * sizeof(unsigned long);
+	hdr_len = sizeof(struct balloon_bmap_hdr);
+	vb->bmap_hdr = kzalloc(hdr_len + vb->bmap_len, GFP_KERNEL);
+
+	/* Clear the feature bit if memory allocation fails */
+	if (!vb->bmap_hdr)
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+	else
+		vb->page_bitmap = vb->bmap_hdr + hdr_len;
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -541,9 +676,12 @@ out:
 
 static void remove_common(struct virtio_balloon *vb)
 {
+	bool use_bmap;
+
+	use_bmap = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 	/* There might be pages left in the balloon: free them. */
 	while (vb->num_pages)
-		leak_balloon(vb, vb->num_pages);
+		leak_balloon(vb, vb->num_pages, use_bmap);
 	update_balloon_size(vb);
 
 	/* Now we reset the device so we can clean up the queues. */
@@ -565,6 +703,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	kfree(vb->page_bitmap);
 	kfree(vb);
 }
 
@@ -603,6 +742,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_BITMAP,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 3/7] mm: add a function to get the max pfn
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, Andrew Morton,
	virtualization, Mel Gorman, Vlastimil Babka
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

Expose the function to get the max pfn, so it can be used in the
virtio-balloon device driver.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b3e134..7da61ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4517,6 +4517,12 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
+unsigned long get_max_pfn(void)
+{
+	return max_pfn;
+}
+EXPORT_SYMBOL(get_max_pfn);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 2/7] virtio-balloon: define new feature bit and page bitmap head
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, virtualization
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

Add a new feature which supports sending the page information with
a bitmap. The current implementation uses PFNs array, which is not
very efficient. Using bitmap can improve the performance of
inflating/deflating significantly

The page bitmap header will used to tell the host some information
about the page bitmap. e.g. the page size, page bitmap length and
start pfn.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..d3b182a 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Page bitmap header structure */
+struct balloon_bmap_hdr {
+	/* Used to distinguish different request */
+	__virtio16 cmd;
+	/* Shift width of page in the bitmap */
+	__virtio16 page_shift;
+	/* flag used to identify different status */
+	__virtio16 flag;
+	/* Reserved */
+	__virtio16 reserved;
+	/* ID of the request */
+	__virtio64 req_id;
+	/* The pfn of 0 bit in the bitmap */
+	__virtio64 start_pfn;
+	/* The length of the bitmap, in bytes */
+	__virtio64 bmap_len;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 1/7] virtio-balloon: rework deflate to add page to a list
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Amit Shah, Liang Li, qemu-devel, dgilbert,
	linux-mm, Michael S. Tsirkin, Paolo Bonzini, virtualization
In-Reply-To: <1469582616-5729-1-git-send-email-liang.z.li@intel.com>

will allow faster notifications using a bitmap down the road.
balloon_pfn_to_page() can be removed because it's useless.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3..8d649a2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -98,12 +98,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -176,18 +170,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -197,6 +189,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +201,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -219,7 +213,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
1.9.1

^ permalink raw reply related

* [PATCH v2 repost 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Liang Li @ 2016-07-27  1:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: virtio-dev, kvm, Liang Li, qemu-devel, dgilbert, linux-mm,
	virtualization

This patchset is for kernel and contains two parts of change to the
virtio-balloon. 

One is the change for speeding up the inflating & deflating process,
the main idea of this optimization is to use bitmap to send the page
information to host instead of the PFNs, to reduce the overhead of
virtio data transmission, address translation and madvise(). This can
help to improve the performance by about 85%.

Another change is for speeding up live migration. By skipping process
guest's free pages in the first round of data copy, to reduce needless
data processing, this can help to save quite a lot of CPU cycles and
network bandwidth. We put guest's free page information in bitmap and
send it to host with the virt queue of virtio-balloon. For an idle 8GB
guest, this can help to shorten the total live migration time from 2Sec
to about 500ms in the 10Gbps network environment.  


Changes from v1 to v2:
    * Abandon the patch for dropping page cache.
    * Put some structures to uapi head file.
    * Use a new way to determine the page bitmap size.
    * Use a unified way to send the free page information with the bitmap 
    * Address the issues referred in MST's comments

Liang Li (7):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and page bitmap head
  mm: add a function to get the max pfn
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define feature bit and head for misc virt queue
  mm: add the related functions to get free page info
  virtio-balloon: tell host vm's free page info

 drivers/virtio/virtio_balloon.c     | 306 +++++++++++++++++++++++++++++++-----
 include/uapi/linux/virtio_balloon.h |  41 +++++
 mm/page_alloc.c                     |  52 ++++++
 3 files changed, 359 insertions(+), 40 deletions(-)

-- 
1.9.1

^ permalink raw reply

* Re: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration
From: Michael S. Tsirkin @ 2016-07-26 18:55 UTC (permalink / raw)
  To: Liang Li
  Cc: virtio-dev, kvm, linux-kernel, qemu-devel, dgilbert, linux-mm,
	virtualization
In-Reply-To: <1467196340-22079-1-git-send-email-liang.z.li@intel.com>

On Wed, Jun 29, 2016 at 06:32:13PM +0800, Liang Li wrote:
> This patch set contains two parts of changes to the virtio-balloon. 
> 
> One is the change for speeding up the inflating & deflating process,
> the main idea of this optimization is to use bitmap to send the page
> information to host instead of the PFNs, to reduce the overhead of
> virtio data transmission, address translation and madvise(). This can
> help to improve the performance by about 85%.
> 
> Another change is for speeding up live migration. By skipping process
> guest's free pages in the first round of data copy, to reduce needless
> data processing, this can help to save quite a lot of CPU cycles and
> network bandwidth. We put guest's free page information in bitmap and
> send it to host with the virt queue of virtio-balloon. For an idle 8GB
> guest, this can help to shorten the total live migration time from 2Sec
> to about 500ms in the 10Gbps network environment.  

So I'm fine with this patchset, but I noticed it was not
yet reviewed by MM people. And that is not surprising since
you did not copy memory management mailing list on it.

I added linux-mm@kvack.org Cc on this mail but this might not be enough.

Please repost (e.g. [PATCH v2 repost]) copying the relevant mailing list
so we can get some reviews.


> 
> Changes from v1 to v2:
>     * Abandon the patch for dropping page cache.
>     * Put some structures to uapi head file.
>     * Use a new way to determine the page bitmap size.
>     * Use a unified way to send the free page information with the bitmap 
>     * Address the issues referred in MST's comments
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define feature bit and head for misc virt queue
>   mm: add the related functions to get free page info
>   virtio-balloon: tell host vm's free page info
> 
>  drivers/virtio/virtio_balloon.c     | 306 +++++++++++++++++++++++++++++++-----
>  include/uapi/linux/virtio_balloon.h |  41 +++++
>  mm/page_alloc.c                     |  52 ++++++
>  3 files changed, 359 insertions(+), 40 deletions(-)
> 
> -- 
> 1.8.3.1

^ permalink raw reply

* Re: [PATCH v3] virtio: new feature to detect IOMMU device quirk
From: Michael S. Tsirkin @ 2016-07-26 18:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, virtualization
In-Reply-To: <20160725075009.GA25638@infradead.org>

On Mon, Jul 25, 2016 at 12:50:09AM -0700, Christoph Hellwig wrote:
> On Tue, Jul 19, 2016 at 05:38:23AM +0300, Michael S. Tsirkin wrote:
> > 
> > On other systems, including SPARC and PPC64, virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> > 
> > Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM.
> > 
> > Any device with this feature bit set to 0 needs a quirk and has to be
> > passed physical addresses (as opposed to bus addresses) even though
> > the device is behind an IOMMU.
> 
> This is the wrong way around.
> 
> > Note: it has to be a per-device quirk because for example, there could
> > be a mix of passed-through and virtual virtio devices. As another
> > example, some devices could be implemented by an out of process
> > hypervisor backend (in case of qemu vhost, or vhost-user) and so support
> > for an IOMMU needs to be coded up separately.
> > 
> > It would be cleanest to handle this in IOMMU core code, but that needs
> > per-device DMA ops. While we are waiting for that to be implemented, use
> > a work-around in virtio core.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > wanted to use per-device dma ops but that does not
> > seem to be ready. So let's put this in virtio
> > code for now, and move when it becomes possible.
> 
> So work on making it ready.  We're close to there, and given that
> virtio needs it, finish it off.  We now have everyone using the
> operation vectors for DMA, so the only thing you need is a dma_ops
> pointer in struct device initialized to what get_dma_ops returns.

Given the timing, I suspect this will mean missing 4.8 though and I'd
much rather have something working finally and clean it up later.

And from experience, people seem to find it easier to re-factor
and clean up working code.

This patch is all of 10 lines so I'm comfortable including it right away -
I'll add a big TODO in v4 so we don't forget.

-- 
MST

^ permalink raw reply

* Re: [PATCH v3] virtio: new feature to detect IOMMU device quirk
From: Michael S. Tsirkin @ 2016-07-25 19:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, virtualization
In-Reply-To: <20160725075009.GA25638@infradead.org>

On Mon, Jul 25, 2016 at 12:50:09AM -0700, Christoph Hellwig wrote:
> On Tue, Jul 19, 2016 at 05:38:23AM +0300, Michael S. Tsirkin wrote:
> > 
> > On other systems, including SPARC and PPC64, virtio-pci devices are
> > enumerated as though they are behind an IOMMU, but the virtio host
> > ignores the IOMMU, so we must either pretend that the IOMMU isn't
> > there or somehow map everything as the identity.
> > 
> > Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM.
> > 
> > Any device with this feature bit set to 0 needs a quirk and has to be
> > passed physical addresses (as opposed to bus addresses) even though
> > the device is behind an IOMMU.
> 
> This is the wrong way around.

Well given there are tons of quirky devices out there
and they have this bit set to 0, I don't think we can change this.
New devices will have it set to 1.

> > Note: it has to be a per-device quirk because for example, there could
> > be a mix of passed-through and virtual virtio devices. As another
> > example, some devices could be implemented by an out of process
> > hypervisor backend (in case of qemu vhost, or vhost-user) and so support
> > for an IOMMU needs to be coded up separately.
> > 
> > It would be cleanest to handle this in IOMMU core code, but that needs
> > per-device DMA ops. While we are waiting for that to be implemented, use
> > a work-around in virtio core.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > wanted to use per-device dma ops but that does not
> > seem to be ready. So let's put this in virtio
> > code for now, and move when it becomes possible.
> 
> So work on making it ready.  We're close to there, and given that
> virtio needs it, finish it off.  We now have everyone using the
> operation vectors for DMA, so the only thing you need is a dma_ops
> pointer in struct device initialized to what get_dma_ops returns.

And then pci_dma_configure would have a quirk to
set that for legacy virtio devices?

-- 
MST

^ permalink raw reply

* Re: [PATCH v3] virtio: new feature to detect IOMMU device quirk
From: Christoph Hellwig @ 2016-07-25  7:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, virtualization
In-Reply-To: <20160719053823-mutt-send-email-mst@redhat.com>

On Tue, Jul 19, 2016 at 05:38:23AM +0300, Michael S. Tsirkin wrote:
> 
> On other systems, including SPARC and PPC64, virtio-pci devices are
> enumerated as though they are behind an IOMMU, but the virtio host
> ignores the IOMMU, so we must either pretend that the IOMMU isn't
> there or somehow map everything as the identity.
> 
> Add a feature bit to detect that quirk: VIRTIO_F_IOMMU_PLATFORM.
> 
> Any device with this feature bit set to 0 needs a quirk and has to be
> passed physical addresses (as opposed to bus addresses) even though
> the device is behind an IOMMU.

This is the wrong way around.

> Note: it has to be a per-device quirk because for example, there could
> be a mix of passed-through and virtual virtio devices. As another
> example, some devices could be implemented by an out of process
> hypervisor backend (in case of qemu vhost, or vhost-user) and so support
> for an IOMMU needs to be coded up separately.
> 
> It would be cleanest to handle this in IOMMU core code, but that needs
> per-device DMA ops. While we are waiting for that to be implemented, use
> a work-around in virtio core.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> wanted to use per-device dma ops but that does not
> seem to be ready. So let's put this in virtio
> code for now, and move when it becomes possible.

So work on making it ready.  We're close to there, and given that
virtio needs it, finish it off.  We now have everyone using the
operation vectors for DMA, so the only thing you need is a dma_ops
pointer in struct device initialized to what get_dma_ops returns.

^ permalink raw reply

* Re: [PATCH v3] virtio: new feature to detect IOMMU device quirk
From: kbuild test robot @ 2016-07-24 22:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtualization, kbuild-all, linux-kernel
In-Reply-To: <20160719053823-mutt-send-email-mst@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]

Hi,

[auto build test ERROR on stable/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/virtio-new-feature-to-detect-IOMMU-device-quirk/20160724-030032
base:   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git master
config: x86_64-lkp (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_ring.c: In function 'vring_transport_features':
>> drivers/virtio/virtio_ring.c:1109:8: error: 'VIRTIO_F_IOMMU_PASSTHROUGH' undeclared (first use in this function)
      case VIRTIO_F_IOMMU_PASSTHROUGH:
           ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/virtio/virtio_ring.c:1109:8: note: each undeclared identifier is reported only once for each function it appears in

vim +/VIRTIO_F_IOMMU_PASSTHROUGH +1109 drivers/virtio/virtio_ring.c

  1103			case VIRTIO_RING_F_INDIRECT_DESC:
  1104				break;
  1105			case VIRTIO_RING_F_EVENT_IDX:
  1106				break;
  1107			case VIRTIO_F_VERSION_1:
  1108				break;
> 1109			case VIRTIO_F_IOMMU_PASSTHROUGH:
  1110				break;
  1111			case VIRTIO_F_IOMMU_PLATFORM:
  1112				/* Ignore passthrough hint for now, obey kernel config. */

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 22900 bytes --]

[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3] virtio: new feature to detect IOMMU device quirk
From: kbuild test robot @ 2016-07-23 21:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtualization, kbuild-all, linux-kernel
In-Reply-To: <20160719053823-mutt-send-email-mst@redhat.com>

Hi,

[auto build test WARNING on stable/master]
[also build test WARNING on v4.7-rc7 next-20160722]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michael-S-Tsirkin/virtio-new-feature-to-detect-IOMMU-device-quirk/20160724-030032
base:   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git master
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': unknown attribute
   drivers/virtio/virtio_ring.c:1109:22: sparse: undefined identifier 'VIRTIO_F_IOMMU_PASSTHROUGH'
   drivers/virtio/virtio_ring.c:1113:50: sparse: undefined identifier 'VIRTIO_F_IOMMU_PASSTHROUGH'
>> drivers/virtio/virtio_ring.c:1109:22: sparse: incompatible types for 'case' statement
   drivers/virtio/virtio_ring.c:1109:22: sparse: Expected constant expression in case statement
   drivers/virtio/virtio_ring.c: In function 'vring_transport_features':
   drivers/virtio/virtio_ring.c:1109:8: error: 'VIRTIO_F_IOMMU_PASSTHROUGH' undeclared (first use in this function)
      case VIRTIO_F_IOMMU_PASSTHROUGH:
           ^~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/virtio/virtio_ring.c:1109:8: note: each undeclared identifier is reported only once for each function it appears in

vim +/case +1109 drivers/virtio/virtio_ring.c

  1093	}
  1094	EXPORT_SYMBOL_GPL(vring_del_virtqueue);
  1095	
  1096	/* Manipulates transport-specific feature bits. */
  1097	void vring_transport_features(struct virtio_device *vdev)
  1098	{
  1099		unsigned int i;
  1100	
  1101		for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
  1102			switch (i) {
  1103			case VIRTIO_RING_F_INDIRECT_DESC:
  1104				break;
  1105			case VIRTIO_RING_F_EVENT_IDX:
  1106				break;
  1107			case VIRTIO_F_VERSION_1:
  1108				break;
> 1109			case VIRTIO_F_IOMMU_PASSTHROUGH:
  1110				break;
  1111			case VIRTIO_F_IOMMU_PLATFORM:
  1112				/* Ignore passthrough hint for now, obey kernel config. */
  1113				__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PASSTHROUGH);
  1114				break;
  1115			default:
  1116				/* We don't understand this bit. */
  1117				__virtio_clear_bit(vdev, i);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [PATCH v3] virtio_blk: Fix a slient kernel panic
From: Minfei Huang @ 2016-07-23  1:58 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Minfei Huang, virtualization, linux-kernel, fanc.fnst, mst
In-Reply-To: <20160719142243.1c62ac03.cornelia.huck@de.ibm.com>


[-- Attachment #1.1: Type: text/plain, Size: 1195 bytes --]


Ping, Any comment is appreciate.

Thanks
Minfei

> On Jul 19, 2016, at 20:22, Cornelia Huck <cornelia.huck@de.ibm.com> wrote:
> 
> On Tue, 19 Jul 2016 12:32:42 +0800
> Minfei Huang <mnfhuang@gmail.com> wrote:
> 
>> From: Minfei Huang <mnghuan@gmail.com>
>> 
>> We do a lot of memory allocation in function init_vq, and don't handle
>> the allocation failure properly. Then this function will return 0,
>> although initialization fails due to lacking memory. At that moment,
>> kernel will panic in guest machine, if virtio is used to drive disk.
>> 
>> To fix this bug, we should take care of allocation failure, and return
>> correct value to let caller know what happen.
>> 
>> Tested-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
>> Signed-off-by: Minfei Huang <minfei.hmf@alibaba-inc.com>
>> Signed-off-by: Minfei Huang <mnghuan@gmail.com>
>> ---
>> v2:
>> - Remove useless initialisation to NULL
>> v1:
>> - Refactor the patch to make code more readable
>> ---
>> drivers/block/virtio_blk.c | 26 ++++++++------------------
>> 1 file changed, 8 insertions(+), 18 deletions(-)
> 
> Your changes certainly make the function more compact.
> 
> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2353 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: linux-next: Tree for Jul 21 (gpu/virtio)
From: Randy Dunlap @ 2016-07-21 16:21 UTC (permalink / raw)
  To: Stephen Rothwell, linux-next
  Cc: David Airlie, virtualization, linux-kernel, dri-devel
In-Reply-To: <20160721165631.6508756a@canb.auug.org.au>

On 07/20/16 23:56, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20160720:
> 

on x86_64, when CONFIG_FB is not enabled:

ERROR: "remove_conflicting_framebuffers" [drivers/gpu/drm/virtio/virtio-gpu.ko] undefined!


-- 
~Randy

^ permalink raw reply

* [PATCH v3 4/4] kernel/locking: Drop the overhead of {mutex, rwsem}_spin_on_owner
From: Pan Xinhui @ 2016-07-21 11:45 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm
  Cc: kernellwp, jgross, Pan Xinhui, peterz, benh, will.deacon, mingo,
	paulus, mpe, pbonzini, paulmck
In-Reply-To: <1469101514-49475-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

An over-committed guest with more vCPUs than pCPUs has a heavy overload in
the two spin_on_owner. This blames on the lock holder preemption issue.

Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is
currently running or not. So break the spin loops on true condition.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/mutex.c      | 15 +++++++++++++--
 kernel/locking/rwsem-xadd.c | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 79d2d76..4f00b0c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 09e30c6..99eb8fd 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -319,7 +319,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -340,8 +344,14 @@ bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted. vcpu_is_preempted is a macro
+		 * defined by false if arch does not support vcpu preempted
+		 * check
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11

^ permalink raw reply related

* [PATCH v3 3/4] locking/osq: Drop the overhead of osq_lock()
From: Pan Xinhui @ 2016-07-21 11:45 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm
  Cc: kernellwp, jgross, Pan Xinhui, peterz, benh, will.deacon, mingo,
	paulus, mpe, pbonzini, paulmck
In-Reply-To: <1469101514-49475-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

An over-committed guest with more vCPUs than pCPUs has a heavy overhead in
osq_lock().

This is because vCPU A hold the osq lock and yield out, vCPU B wait per_cpu
node->locked to be set. IOW, vCPU B wait vCPU A to run and unlock the osq
lock.

Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is
currently running or not. So break the spin loops on true condition.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..858a0ed 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break the loop. vcpu_is_preempted is a macro defined by
+		 * false if arch does not support vcpu preempted check,
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11

^ permalink raw reply related

* [PATCH v3 2/4] powerpc/spinlock: support vcpu preempted check
From: Pan Xinhui @ 2016-07-21 11:45 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm
  Cc: kernellwp, jgross, Pan Xinhui, peterz, benh, will.deacon, mingo,
	paulus, mpe, pbonzini, paulmck
In-Reply-To: <1469101514-49475-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

This is to fix some lock holder preemption issues. Some other locks
implementation do a spin loop before acquiring the lock itself.
Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It
takes the cpu as parameter and return true if the cpu is preempted. Then
kernel can break the spin loops upon the retval of vcpu_is_preempted().

As kernel has used this interface, So lets support it.

Only pSeries need support it. And the fact is powerNV are built into same
kernel image with pSeries. So we need return false if we are runnig as
powerNV. The another fact is that lppaca->yield_count keeps zero on
powerNV. So we can just skip the machine type check.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 523673d..3ac9fcb 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,24 @@
 #define SYNC_IO
 #endif
 
+/*
+ * This support kernel to check if one cpu is preempted or not.
+ * Then we can fix some lock holder preemption issue.
+ */
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	/*
+	 * pSeries and powerNV can be built into same kernel image. In
+	 * principle we need return false directly if we are running as
+	 * powerNV. However the yield_count is always zero on powerNV, So
+	 * skip such machine type check
+	 */
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
-- 
2.4.11

^ permalink raw reply related

* [PATCH v3 1/4] kernel/sched: introduce vcpu preempted check interface
From: Pan Xinhui @ 2016-07-21 11:45 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm
  Cc: kernellwp, jgross, Pan Xinhui, peterz, benh, will.deacon, mingo,
	paulus, mpe, pbonzini, paulmck
In-Reply-To: <1469101514-49475-1-git-send-email-xinhui.pan@linux.vnet.ibm.com>

This patch supports to fix lock holder preemption issue.

For kernel users, we could use bool vcpu_is_preempted(int cpu) to detech if
one vcpu is preempted or not.

The default implementation is a macro defined by false. So compiler can
wrap it out if arch dose not support such vcpu pteempted check.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..cbe0574 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3293,6 +3293,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11

^ permalink raw reply related

* [PATCH v3 0/4] implement vcpu preempted check
From: Pan Xinhui @ 2016-07-21 11:45 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	xen-devel-request, kvm
  Cc: kernellwp, jgross, Pan Xinhui, peterz, benh, will.deacon, mingo,
	paulus, mpe, pbonzini, paulmck

change from v2:
	no code change, fix typos, update some comments

change from v1:
	a simplier definition of default vcpu_is_preempted
	skip mahcine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner. 
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_onwer variant also cause rcu stall before we apply this patch set

Pan Xinhui (4):
  kernel/sched: introduce vcpu preempted check interface
  powerpc/spinlock: support vcpu preempted check
  locking/osq: Drop the overhead of osq_lock()
  kernel/locking: Drop the overhead of {mutex,rwsem}_spin_on_owner

 arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
 include/linux/sched.h               | 12 ++++++++++++
 kernel/locking/mutex.c              | 15 +++++++++++++--
 kernel/locking/osq_lock.c           | 10 +++++++++-
 kernel/locking/rwsem-xadd.c         | 16 +++++++++++++---
 5 files changed, 65 insertions(+), 6 deletions(-)

-- 
2.4.11

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox