* Re: [PATCH] vhost_net: stop polling socket during rx processing
From: Jason Wang @ 2016-04-28 6:19 UTC
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20160427141317-mutt-send-email-mst@redhat.com>
On 04/27/2016 07:28 PM, Michael S. Tsirkin wrote:
> On Tue, Apr 26, 2016 at 03:35:53AM -0400, Jason Wang wrote:
>> We don't stop polling the socket during rx processing, which leads to
>> unnecessary wakeups from the underlying net devices (e.g.
>> sock_def_readable() from tun). Rx is slowed down as a
>> result. This patch avoids this by stopping socket polling during rx
>> processing. A small drawback is that this introduces some overhead in
>> the light load case because of the extra start/stop polling, but a single
>> netperf TCP_RR does not notice any change. In a super heavy load case,
>> e.g. using pktgen to inject packets into the guest, we get about ~17%
>> improvement on pps:
>>
>> before: ~1370000 pkt/s
>> after: ~1500000 pkt/s
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> There is one other possible enhancement: we actually have the wait queue
> lock taken in _wake_up, but we give it up only to take it again in the
> handler.
>
> It would be nicer to just remove the entry when we wake
> the vhost thread. Re-add it if required.
> I think that something like the below would give you the necessary API.
> Pls feel free to use it if you are going to implement a patch on top
> doing this - that's not a reason not to include this simple patch
> though.
Thanks, this looks useful, will give it a try.
>
> --->
>
> wait: add API to drop a wait_queue_t entry from wake up handler
>
> A wake up handler might want to remove its own wait queue entry to avoid
> future wakeups. In particular, vhost has such a need. As wait queue
> lock is already taken, all we need is an API to remove the entry without
> wait_queue_head_t which isn't currently accessible to wake up handlers.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> ---
>
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27d7a0a..9c6604b 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -191,11 +191,17 @@ __add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
> }
>
> static inline void
> -__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
> +__remove_wait_queue_entry(wait_queue_t *old)
> {
> list_del(&old->task_list);
> }
>
> +static inline void
> +__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
> +{
> + __remove_wait_queue_entry(old);
> +}
> +
> typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
> void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
> void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
* Re: [RFC PATCH V2 1/2] vhost: convert pre sorted vhost memory array to interval tree
From: Jason Wang @ 2016-04-28 6:20 UTC
To: Michael S. Tsirkin
Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
pbonzini
In-Reply-To: <20160427142948-mutt-send-email-mst@redhat.com>
On 04/27/2016 07:30 PM, Michael S. Tsirkin wrote:
> On Fri, Mar 25, 2016 at 10:34:33AM +0800, Jason Wang wrote:
>> > Current pre-sorted memory region array has some limitations for future
>> > device IOTLB conversion:
>> >
>> > 1) need extra work for adding and removing a single region, and it's
>> > expected to be slow because of sorting or memory re-allocation.
>> > 2) need extra work of removing a large range which may intersect
>> > several regions with different size.
>> > 3) need trick for a replacement policy like LRU
>> >
>> > To overcome the above shortcomings, this patch converts it to an interval
>> > tree, which can easily address the above issues with almost no extra
>> > work.
>> >
>> > The patch could be used to:
>> >
>> > - Extend the current API and let userspace send only diffs of the
>> > memory table.
>> > - Simplify the device IOTLB implementation.
> Does this affect performance at all?
>
In the pktgen test, there is no difference.
Thanks
* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Jason Wang @ 2016-04-28 6:37 UTC
To: Michael S. Tsirkin
Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
pbonzini
In-Reply-To: <20160427143057-mutt-send-email-mst@redhat.com>
On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>> This patch tries to implement a device IOTLB for vhost. This could be
>> used in co-operation with a userspace (qemu) implementation of DMA
>> remapping.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation. This is done
>> through:
>>
>> - Fill the translation request in a preset userspace address (This
>> address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>> - Notify userspace through eventfd (This eventfd was set through ioctl
>> VHOST_SET_IOTLB_FD).
> Why use an eventfd for this?
The aim is to implement the API all through ioctls.
> We use them for interrupts because
> that happens to be what kvm wants, but here - why don't we
> just add a generic support for reading out events
> on the vhost fd itself?
I've considered this approach, but what are its advantages? I mean, it
looks like all the other ioctls could be done through reading/writing
the vhost fd too.
>
>> - The device IOTLB is started and stopped through the VHOST_RUN_IOTLB ioctl
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through the VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>> snooping the IOTLB invalidations of the IOMMU IOTLB and using
>> VHOST_UPDATE_IOTLB to invalidate the corresponding entry in vhost.
> There's one problem here, and that is that VQs still do not undergo
> translation. In theory VQ could be mapped in such a way
> that it's not contiguous in userspace memory.
I'm not sure I get the issue. The current vhost API supports setting
desc_user_addr, used_user_addr and avail_user_addr independently, so it
looks OK? If not, it doesn't look like a problem with the device IOTLB API itself.
>
>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> What limits amount of entries that kernel keeps around?
It depends on guest working set I think. Looking at
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:
- For 2MB page size in guest, it suggests hugepages=1024
- For 1GB page size, it suggests hugepages=4
So I choose 2048 to make sure it can cover this.
> Do we want at least a mod parameter for this?
Maybe.
>
>> ---
>> drivers/vhost/net.c | 6 +-
>> drivers/vhost/vhost.c | 301 +++++++++++++++++++++++++++++++++++++++------
>> drivers/vhost/vhost.h | 17 ++-
>> fs/eventfd.c | 3 +-
>> include/uapi/linux/vhost.h | 35 ++++++
>> 5 files changed, 320 insertions(+), 42 deletions(-)
>>
[...]
>> +struct vhost_iotlb_entry {
>> + __u64 iova;
>> + __u64 size;
>> + __u64 userspace_addr;
> Alignment requirements?
The API does not require any alignment. Will add a comment for this.
>
>> + struct {
>> +#define VHOST_ACCESS_RO 0x1
>> +#define VHOST_ACCESS_WO 0x2
>> +#define VHOST_ACCESS_RW 0x3
>> + __u8 perm;
>> +#define VHOST_IOTLB_MISS 1
>> +#define VHOST_IOTLB_UPDATE 2
>> +#define VHOST_IOTLB_INVALIDATE 3
>> + __u8 type;
>> +#define VHOST_IOTLB_INVALID 0x1
>> +#define VHOST_IOTLB_VALID 0x2
>> + __u8 valid;
> why do we need this flag?
Useless, will remove.
>
>> + __u8 u8_padding;
>> + __u32 padding;
>> + } flags;
>> +};
>> +
>> +struct vhost_vring_iotlb_entry {
>> + unsigned int index;
>> + __u64 userspace_addr;
>> +};
>> +
>> struct vhost_memory_region {
>> __u64 guest_phys_addr;
>> __u64 memory_size; /* bytes */
>> @@ -127,6 +153,15 @@ struct vhost_memory {
>> /* Set eventfd to signal an error */
>> #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>
>> +/* IOTLB */
>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
> typo
Will fix it.
>
>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct \
>> + vhost_vring_file)
>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct \
>> + vhost_vring_iotlb_entry)
>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>> +
> Is the assumption that userspace must dedicate a thread to running the iotlb?
> I dislike this one.
> Please support asynchronous APIs at least optionally to make
> userspace make its own threading decisions.
Nope, my qemu patches do not use a dedicated thread. This API is used
to start or stop DMAR according to, e.g., whether the guest enables DMAR for
the Intel IOMMU.
>
>> /* VHOST_NET specific defines */
>>
>> /* Attach virtio net ring to a raw socket, or tap device.
> Don't we need a feature bit for this?
Yes, we need it. The feature bit is not considered in this patch, and it
looks like it is still under discussion. After we finalize it, I will add it.
> Are we short on feature bits? If yes maybe it's time to add
> something like PROTOCOL_FEATURES that we have in vhost-user.
>
I believe it can just work like VERSION_1, or is there anything I missed?
* [PATCH] powerpc: enable qspinlock and its virtualization support
From: Pan Xinhui @ 2016-04-28 10:39 UTC
To: linuxppc-dev, linux-kernel, virtualization
Cc: Jeremy Fitzhardinge, Marc Zyngier, Benjamin Herrenschmidt,
Boqun Feng, Gavin Shan, Waiman Long, Adam Buchbinder,
Chris Wright, "Peter Zijlstra (Intel)", Paul Mackerras,
Michael Ellerman, Alok Kataria, Paul E. McKenney, tglx
From: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
This patch aims to enable qspinlock on PPC. On the pseries platform, it also supports
paravirt qspinlock.
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/qspinlock.h | 37 +++++++++++++++
arch/powerpc/include/asm/qspinlock_paravirt.h | 36 +++++++++++++++
.../powerpc/include/asm/qspinlock_paravirt_types.h | 13 ++++++
arch/powerpc/include/asm/spinlock.h | 31 ++++++++-----
arch/powerpc/include/asm/spinlock_types.h | 4 ++
arch/powerpc/kernel/paravirt.c | 52 ++++++++++++++++++++++
arch/powerpc/lib/locks.c | 32 +++++++++++++
arch/powerpc/platforms/pseries/setup.c | 5 +++
8 files changed, 198 insertions(+), 12 deletions(-)
create mode 100644 arch/powerpc/include/asm/qspinlock.h
create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
create mode 100644 arch/powerpc/kernel/paravirt.c
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index 0000000..d4e4dc3
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,37 @@
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define SPIN_THRESHOLD (1 << 15)
+#define queued_spin_unlock queued_spin_unlock
+
+static inline void native_queued_spin_unlock(struct qspinlock *lock)
+{
+ /* no load/store can be across the unlock()*/
+ smp_store_release((u8 *)lock, 0);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+#include <asm/qspinlock_paravirt.h>
+
+static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ pv_queued_spin_lock(lock, val);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ pv_queued_spin_unlock(lock);
+}
+#else
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ native_queued_spin_unlock(lock);
+}
+#endif
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index 0000000..86e81c3
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,36 @@
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#error "do not include this file"
+#endif
+
+#ifndef _ASM_QSPINLOCK_PARAVIRT_H
+#define _ASM_QSPINLOCK_PARAVIRT_H
+
+#include <asm/qspinlock_paravirt_types.h>
+
+extern void pv_lock_init(void);
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+
+static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
+{
+ pv_lock_op.lock(lock, val);
+}
+
+static inline void pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ pv_lock_op.unlock(lock);
+}
+
+static inline void pv_wait(u8 *ptr, u8 val)
+{
+ pv_lock_op.wait(ptr, val, -1);
+}
+
+static inline void pv_kick(int cpu)
+{
+ pv_lock_op.kick(cpu);
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt_types.h b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
new file mode 100644
index 0000000..e1fdeb0
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+#define _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+
+struct pv_lock_ops {
+ void (*lock)(struct qspinlock *lock, u32 val);
+ void (*unlock)(struct qspinlock *lock);
+ void (*wait)(u8 *ptr, u8 val, int cpu);
+ void (*kick)(int cpu);
+};
+
+extern struct pv_lock_ops pv_lock_op;
+
+#endif
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 523673d..3b65372 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,24 @@
#define SYNC_IO
#endif
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
+extern void __spin_yield(arch_spinlock_t *lock);
+extern void __spin_yield_cpu(int cpu);
+extern void __spin_wake_cpu(int cpu);
+extern void __rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+#define __spin_yield(x) barrier()
+#define __spin_yield_cpu(x) barrier()
+#define __spin_wake_cpu(x) barrier()
+#define __rw_yield(x) barrier()
+#define SHARED_PROCESSOR 0
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#else
static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
return lock.slock == 0;
@@ -106,18 +124,6 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
* held. Conveniently, we have a word in the paca that holds this
* value.
*/
-
-#if defined(CONFIG_PPC_SPLPAR)
-/* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
-#else /* SPLPAR */
-#define __spin_yield(x) barrier()
-#define __rw_yield(x) barrier()
-#define SHARED_PROCESSOR 0
-#endif
-
static inline void arch_spin_lock(arch_spinlock_t *lock)
{
CLEAR_IO_SYNC;
@@ -169,6 +175,7 @@ extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
#endif
+#endif /* !CONFIG_QUEUED_SPINLOCKS */
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index 2351adc..bd7144e 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -5,11 +5,15 @@
# error "please don't include this file directly"
#endif
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else
typedef struct {
volatile unsigned int slock;
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { 0 }
+#endif
typedef struct {
volatile signed int lock;
diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c
new file mode 100644
index 0000000..355c9fb
--- /dev/null
+++ b/arch/powerpc/kernel/paravirt.c
@@ -0,0 +1,52 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/spinlock.h>
+
+static void __native_queued_spin_unlock(struct qspinlock *lock)
+{
+ native_queued_spin_unlock(lock);
+}
+
+static void __native_wait(u8 *ptr, u8 val, int cpu)
+{
+}
+
+static void __native_kick(int cpu)
+{
+}
+
+static void __pv_wait(u8 *ptr, u8 val, int cpu)
+{
+ HMT_low();
+ __spin_yield_cpu(cpu);
+ HMT_medium();
+}
+
+static void __pv_kick(int cpu)
+{
+ __spin_wake_cpu(cpu);
+}
+
+struct pv_lock_ops pv_lock_op = {
+ .lock = native_queued_spin_lock_slowpath,
+ .unlock = __native_queued_spin_unlock,
+ .wait = __native_wait,
+ .kick = __native_kick,
+};
+EXPORT_SYMBOL(pv_lock_op);
+
+void __init pv_lock_init(void)
+{
+ if (SHARED_PROCESSOR) {
+ __pv_init_lock_hash();
+ pv_lock_op.lock = __pv_queued_spin_lock_slowpath;
+ pv_lock_op.unlock = __pv_queued_spin_unlock;
+ pv_lock_op.wait = __pv_wait;
+ pv_lock_op.kick = __pv_kick;
+ }
+}
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index f7deebd..6e9d3bb 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -23,6 +23,35 @@
#include <asm/hvcall.h>
#include <asm/smp.h>
+void __spin_yield_cpu(int cpu)
+{
+ unsigned int holder_cpu = cpu, yield_count;
+
+ if (cpu == -1) {
+ plpar_hcall_norets(H_CEDE);
+ return;
+ }
+ BUG_ON(holder_cpu >= nr_cpu_ids);
+ yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+ if ((yield_count & 1) == 0)
+ return; /* virtual cpu is currently running */
+ rmb();
+ plpar_hcall_norets(H_CONFER,
+ get_hard_smp_processor_id(holder_cpu), yield_count);
+}
+EXPORT_SYMBOL_GPL(__spin_yield_cpu);
+
+void __spin_wake_cpu(int cpu)
+{
+ unsigned int holder_cpu = cpu;
+
+ BUG_ON(holder_cpu >= nr_cpu_ids);
+ plpar_hcall_norets(H_PROD,
+ get_hard_smp_processor_id(holder_cpu));
+}
+EXPORT_SYMBOL_GPL(__spin_wake_cpu);
+
+#ifndef CONFIG_QUEUED_SPINLOCKS
void __spin_yield(arch_spinlock_t *lock)
{
unsigned int lock_value, holder_cpu, yield_count;
@@ -42,6 +71,7 @@ void __spin_yield(arch_spinlock_t *lock)
get_hard_smp_processor_id(holder_cpu), yield_count);
}
EXPORT_SYMBOL_GPL(__spin_yield);
+#endif
/*
* Waiting for a read lock or a write lock on a rwlock...
@@ -69,6 +99,7 @@ void __rw_yield(arch_rwlock_t *rw)
}
#endif
+#ifndef CONFIG_QUEUED_SPINLOCKS
void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
smp_mb();
@@ -84,3 +115,4 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
}
EXPORT_SYMBOL(arch_spin_unlock_wait);
+#endif
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 6e944fc..c9f056e 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -547,6 +547,11 @@ static void __init pSeries_setup_arch(void)
"%ld\n", rc);
}
}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ pv_lock_init();
+#endif
+
}
static int __init pSeries_init_panel(void)
--
2.4.3
* [PATCH resend] powerpc: enable qspinlock and its virtualization support
From: Pan Xinhui @ 2016-04-28 10:55 UTC
To: linux-kernel, virtualization, linuxppc-dev
Cc: jeremy, tglx, "Peter Zijlstra (Intel)", Michael Ellerman,
Boqun Feng, gwshan, Waiman Long, chrisw, marc.zyngier,
Paul Mackerras, akataria, Paul E. McKenney, adam.buchbinder
From: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
This patch aims to enable qspinlock on PPC. On the pseries platform, it also supports
paravirt qspinlock.
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/qspinlock.h | 37 +++++++++++++++
arch/powerpc/include/asm/qspinlock_paravirt.h | 36 +++++++++++++++
.../powerpc/include/asm/qspinlock_paravirt_types.h | 13 ++++++
arch/powerpc/include/asm/spinlock.h | 31 ++++++++-----
arch/powerpc/include/asm/spinlock_types.h | 4 ++
arch/powerpc/kernel/paravirt.c | 52 ++++++++++++++++++++++
arch/powerpc/lib/locks.c | 32 +++++++++++++
arch/powerpc/platforms/pseries/setup.c | 5 +++
8 files changed, 198 insertions(+), 12 deletions(-)
create mode 100644 arch/powerpc/include/asm/qspinlock.h
create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
create mode 100644 arch/powerpc/kernel/paravirt.c
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
new file mode 100644
index 0000000..d4e4dc3
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -0,0 +1,37 @@
+#ifndef _ASM_POWERPC_QSPINLOCK_H
+#define _ASM_POWERPC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define SPIN_THRESHOLD (1 << 15)
+#define queued_spin_unlock queued_spin_unlock
+
+static inline void native_queued_spin_unlock(struct qspinlock *lock)
+{
+ /* no load/store can be across the unlock()*/
+ smp_store_release((u8 *)lock, 0);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+#include <asm/qspinlock_paravirt.h>
+
+static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ pv_queued_spin_lock(lock, val);
+}
+
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ pv_queued_spin_unlock(lock);
+}
+#else
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ native_queued_spin_unlock(lock);
+}
+#endif
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_POWERPC_QSPINLOCK_H */
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index 0000000..86e81c3
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,36 @@
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#error "do not include this file"
+#endif
+
+#ifndef _ASM_QSPINLOCK_PARAVIRT_H
+#define _ASM_QSPINLOCK_PARAVIRT_H
+
+#include <asm/qspinlock_paravirt_types.h>
+
+extern void pv_lock_init(void);
+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+
+static inline void pv_queued_spin_lock(struct qspinlock *lock, u32 val)
+{
+ pv_lock_op.lock(lock, val);
+}
+
+static inline void pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ pv_lock_op.unlock(lock);
+}
+
+static inline void pv_wait(u8 *ptr, u8 val)
+{
+ pv_lock_op.wait(ptr, val, -1);
+}
+
+static inline void pv_kick(int cpu)
+{
+ pv_lock_op.kick(cpu);
+}
+
+#endif
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt_types.h b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
new file mode 100644
index 0000000..e1fdeb0
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt_types.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+#define _ASM_QSPINLOCK_PARAVIRT_TYPES_H
+
+struct pv_lock_ops {
+ void (*lock)(struct qspinlock *lock, u32 val);
+ void (*unlock)(struct qspinlock *lock);
+ void (*wait)(u8 *ptr, u8 val, int cpu);
+ void (*kick)(int cpu);
+};
+
+extern struct pv_lock_ops pv_lock_op;
+
+#endif
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 523673d..3b65372 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,24 @@
#define SYNC_IO
#endif
+#if defined(CONFIG_PPC_SPLPAR)
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
+extern void __spin_yield(arch_spinlock_t *lock);
+extern void __spin_yield_cpu(int cpu);
+extern void __spin_wake_cpu(int cpu);
+extern void __rw_yield(arch_rwlock_t *lock);
+#else /* SPLPAR */
+#define __spin_yield(x) barrier()
+#define __spin_yield_cpu(x) barrier()
+#define __spin_wake_cpu(x) barrier()
+#define __rw_yield(x) barrier()
+#define SHARED_PROCESSOR 0
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#else
static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
return lock.slock == 0;
@@ -106,18 +124,6 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
* held. Conveniently, we have a word in the paca that holds this
* value.
*/
-
-#if defined(CONFIG_PPC_SPLPAR)
-/* We only yield to the hypervisor if we are in shared processor mode */
-#define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-extern void __spin_yield(arch_spinlock_t *lock);
-extern void __rw_yield(arch_rwlock_t *lock);
-#else /* SPLPAR */
-#define __spin_yield(x) barrier()
-#define __rw_yield(x) barrier()
-#define SHARED_PROCESSOR 0
-#endif
-
static inline void arch_spin_lock(arch_spinlock_t *lock)
{
CLEAR_IO_SYNC;
@@ -169,6 +175,7 @@ extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
#endif
+#endif /* !CONFIG_QUEUED_SPINLOCKS */
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.
diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index 2351adc..bd7144e 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -5,11 +5,15 @@
# error "please don't include this file directly"
#endif
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else
typedef struct {
volatile unsigned int slock;
} arch_spinlock_t;
#define __ARCH_SPIN_LOCK_UNLOCKED { 0 }
+#endif
typedef struct {
volatile signed int lock;
diff --git a/arch/powerpc/kernel/paravirt.c b/arch/powerpc/kernel/paravirt.c
new file mode 100644
index 0000000..355c9fb
--- /dev/null
+++ b/arch/powerpc/kernel/paravirt.c
@@ -0,0 +1,52 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/spinlock.h>
+
+static void __native_queued_spin_unlock(struct qspinlock *lock)
+{
+ native_queued_spin_unlock(lock);
+}
+
+static void __native_wait(u8 *ptr, u8 val, int cpu)
+{
+}
+
+static void __native_kick(int cpu)
+{
+}
+
+static void __pv_wait(u8 *ptr, u8 val, int cpu)
+{
+ HMT_low();
+ __spin_yield_cpu(cpu);
+ HMT_medium();
+}
+
+static void __pv_kick(int cpu)
+{
+ __spin_wake_cpu(cpu);
+}
+
+struct pv_lock_ops pv_lock_op = {
+ .lock = native_queued_spin_lock_slowpath,
+ .unlock = __native_queued_spin_unlock,
+ .wait = __native_wait,
+ .kick = __native_kick,
+};
+EXPORT_SYMBOL(pv_lock_op);
+
+void __init pv_lock_init(void)
+{
+ if (SHARED_PROCESSOR) {
+ __pv_init_lock_hash();
+ pv_lock_op.lock = __pv_queued_spin_lock_slowpath;
+ pv_lock_op.unlock = __pv_queued_spin_unlock;
+ pv_lock_op.wait = __pv_wait;
+ pv_lock_op.kick = __pv_kick;
+ }
+}
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index f7deebd..6e9d3bb 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -23,6 +23,35 @@
#include <asm/hvcall.h>
#include <asm/smp.h>
+void __spin_yield_cpu(int cpu)
+{
+ unsigned int holder_cpu = cpu, yield_count;
+
+ if (cpu == -1) {
+ plpar_hcall_norets(H_CEDE);
+ return;
+ }
+ BUG_ON(holder_cpu >= nr_cpu_ids);
+ yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count);
+ if ((yield_count & 1) == 0)
+ return; /* virtual cpu is currently running */
+ rmb();
+ plpar_hcall_norets(H_CONFER,
+ get_hard_smp_processor_id(holder_cpu), yield_count);
+}
+EXPORT_SYMBOL_GPL(__spin_yield_cpu);
+
+void __spin_wake_cpu(int cpu)
+{
+ unsigned int holder_cpu = cpu;
+
+ BUG_ON(holder_cpu >= nr_cpu_ids);
+ plpar_hcall_norets(H_PROD,
+ get_hard_smp_processor_id(holder_cpu));
+}
+EXPORT_SYMBOL_GPL(__spin_wake_cpu);
+
+#ifndef CONFIG_QUEUED_SPINLOCKS
void __spin_yield(arch_spinlock_t *lock)
{
unsigned int lock_value, holder_cpu, yield_count;
@@ -42,6 +71,7 @@ void __spin_yield(arch_spinlock_t *lock)
get_hard_smp_processor_id(holder_cpu), yield_count);
}
EXPORT_SYMBOL_GPL(__spin_yield);
+#endif
/*
* Waiting for a read lock or a write lock on a rwlock...
@@ -69,6 +99,7 @@ void __rw_yield(arch_rwlock_t *rw)
}
#endif
+#ifndef CONFIG_QUEUED_SPINLOCKS
void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
smp_mb();
@@ -84,3 +115,4 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock)
}
EXPORT_SYMBOL(arch_spin_unlock_wait);
+#endif
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 6e944fc..c9f056e 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -547,6 +547,11 @@ static void __init pSeries_setup_arch(void)
"%ld\n", rc);
}
}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ pv_lock_init();
+#endif
+
}
static int __init pSeries_init_panel(void)
-- 2.4.3
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-28 14:34 UTC
To: David Woodhouse
Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
Joerg Roedel, qemu-devel, peterx, linux-kernel,
Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
pbonzini, virtualization, Anthony PERARD
In-Reply-To: <1461784617.118304.181.camel@infradead.org>
On Wed, Apr 27, 2016 at 08:16:57PM +0100, David Woodhouse wrote:
> On Wed, 2016-04-27 at 21:17 +0300, Michael S. Tsirkin wrote:
> >
> > > Because it's a dirty hack in the *wrong* place.
> >
> > No one came up with a better one so far :(
>
> Seriously?
>
> Take a look at drivers/iommu/intel-iommu.c. It has quirks for all kinds
> of shitty devices that have to be put in passthrough mode or otherwise
> excluded.
I see work-arounds for broken IOMMUs but not for
individual devices. Could you point me to a more specific
example?
> We don't actually *need* it for the Intel IOMMU; all we need is for
> QEMU to stop lying in its DMAR tables.
We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
lying about virtio, so we'll need it for a while.
I think it's easy for QEMU to stop lying about assigned devices,
so we don't need it for non-virtio devices.
> We *do* want the same kind of quirks in the relevant POWER and ARM
> IOMMU code in the kernel. Do that (hell, a simple quirk for all virtio
> devices will suffice, but NOT in the virtio driver
Sure - that works. It does not have to be in the driver.
>) at the same moment
> you fix the virtio devices to use the DMA API. Job done.
>
> Some time *later* we can work on *refining* that quirk, and a way for
> QEMU to tell the guest (via something generic like fwcfg, maybe) that
> some devices are and aren't translated.
I don't see how fwcfg can work here. It's a static thing, while
devices can come and go with hotplug.
> Actually, I'm about to look at moving dma_ops into struct device and
> cleaning up the way we detect which IOMMU is attached, at device
> instantiation time. Perhaps I can shove the virtio-exception quirk in
> there while I'm at it...
Sounds good.
> --
> dwmw2
>
* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Michael S. Tsirkin @ 2016-04-28 14:43 UTC
To: Jason Wang
Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
pbonzini
In-Reply-To: <5721AF9C.9030209@redhat.com>
On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>
>
> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
> > On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
> >> This patch tries to implement a device IOTLB for vhost. This could be
> >> used in co-operation with a userspace (qemu) implementation of DMA
> >> remapping.
> >>
> >> The idea is simple. When vhost meets an IOTLB miss, it will request
> >> the assistance of userspace to do the translation. This is done
> >> through:
> >>
> >> - Fill the translation request in a preset userspace address (This
> >> address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
> >> - Notify userspace through eventfd (This eventfd was set through ioctl
> >> VHOST_SET_IOTLB_FD).
> > Why use an eventfd for this?
>
> The aim is to implement the API all through ioctls.
>
> > We use them for interrupts because
> > that happens to be what kvm wants, but here - why don't we
> > just add a generic support for reading out events
> > on the vhost fd itself?
>
> I've considered this approach, but what's the advantages of this? I mean
> looks like all other ioctls could be done through vhost fd
> reading/writing too.
read/write have a non-blocking flag.
It's not useful for other ioctls but it's useful here.
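As a rough illustration of the non-blocking point, here is a userspace sketch only. It assumes some descriptor that delivers fixed-size event records; try_read_event() and set_nonblocking() are hypothetical helpers, not an existing vhost interface. With O_NONBLOCK set, a caller can check for a pending request without sleeping:

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to read one fixed-size event record without sleeping.
 * Returns 1 if a full record was read, 0 if nothing is pending,
 * and -1 on any other error or a short read.
 */
static int try_read_event(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n == (ssize_t)len)
        return 1;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}
```

A pipe can stand in for the real descriptor when trying this out: an empty non-blocking pipe makes try_read_event() return 0 instead of blocking, which an ioctl-based interface has no standard way to express.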
> >
> >> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
> >>
> >> When userspace finishes the translation, it will update the vhost
> >> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
> >> snooping the IOTLB invalidation of IOMMU IOTLB and use
> >> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
> > There's one problem here, and that is that VQs still do not undergo
> > translation. In theory VQ could be mapped in such a way
> > that it's not contiguous in userspace memory.
>
> I'm not sure I get the issue, current vhost API support setting
> desc_user_addr, used_user_addr and avail_user_addr independently. So
> looks ok? If not, looks not a problem to device IOTLB API itself.
The problem is that addresses are all HVA.
Without an iommu, we ask for them to be contiguous and
since bus address == GPA, this means contiguous GPA =>
contiguous HVA. With an IOMMU you can map contiguous
bus address but non contiguous GPA and non contiguous HVA.
Another concern: what if guest changes the GPA while keeping bus address
constant? Normal devices will work because they only use
bus addresses, but virtio will break.
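To make the contiguity concern concrete, here is a minimal userspace sketch. It is not vhost code: struct iotlb_map and both helpers are hypothetical, and entries are assumed to be 4K-aligned. It shows a bus range that is contiguous as seen by the device but lands on two unrelated HVA regions, so a ring accessed through a single HVA pointer would read the wrong memory:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE 4096ULL /* assumed mapping granularity */

/* Hypothetical IOTLB entry: maps [iova, iova + size) to a host virtual address. */
struct iotlb_map {
    uint64_t iova;
    uint64_t size;
    uint64_t hva;
};

/* Translate one bus address; returns false on an IOTLB miss. */
static bool iotlb_translate(const struct iotlb_map *maps, size_t n,
                            uint64_t iova, uint64_t *hva)
{
    for (size_t i = 0; i < n; i++) {
        if (iova >= maps[i].iova && iova - maps[i].iova < maps[i].size) {
            *hva = maps[i].hva + (iova - maps[i].iova);
            return true;
        }
    }
    return false;
}

/*
 * Check whether the bus range [iova, iova + len) is also contiguous in HVA.
 * Walks the range one page at a time, which is sufficient when entries are
 * page-aligned (an assumption of this sketch).
 */
static bool range_is_hva_contiguous(const struct iotlb_map *maps, size_t n,
                                    uint64_t iova, uint64_t len)
{
    uint64_t first, cur;

    if (len == 0 || !iotlb_translate(maps, n, iova, &first))
        return false;
    for (uint64_t off = 0; off < len; off += SKETCH_PAGE) {
        if (!iotlb_translate(maps, n, iova + off, &cur) || cur != first + off)
            return false;
    }
    return true;
}
```

With two entries mapping bus pages 0x10000 and 0x11000 to HVAs that are 4 MB apart, the first page alone passes the check while the two-page range fails it, even though the device sees one contiguous 8K region.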
> >
> >
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> > What limits amount of entries that kernel keeps around?
>
> It depends on guest working set I think. Looking at
> http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:
>
> - For 2MB page size in guest, it suggests hugepages=1024
> - For 1GB page size, it suggests a hugepages=4
>
> So I choose 2048 to make sure it can cover this.
4K page size is rather common, too.
> > Do we want at least a mod parameter for this?
>
> Maybe.
>
> >
> >> ---
> >> drivers/vhost/net.c | 6 +-
> >> drivers/vhost/vhost.c | 301 +++++++++++++++++++++++++++++++++++++++------
> >> drivers/vhost/vhost.h | 17 ++-
> >> fs/eventfd.c | 3 +-
> >> include/uapi/linux/vhost.h | 35 ++++++
> >> 5 files changed, 320 insertions(+), 42 deletions(-)
> >>
>
> [...]
>
> >> +struct vhost_iotlb_entry {
> >> + __u64 iova;
> >> + __u64 size;
> >> + __u64 userspace_addr;
> > Alignment requirements?
>
> The API does not require any alignment. Will add a comment for this.
>
> >
> >> + struct {
> >> +#define VHOST_ACCESS_RO 0x1
> >> +#define VHOST_ACCESS_WO 0x2
> >> +#define VHOST_ACCESS_RW 0x3
> >> + __u8 perm;
> >> +#define VHOST_IOTLB_MISS 1
> >> +#define VHOST_IOTLB_UPDATE 2
> >> +#define VHOST_IOTLB_INVALIDATE 3
> >> + __u8 type;
> >> +#define VHOST_IOTLB_INVALID 0x1
> >> +#define VHOST_IOTLB_VALID 0x2
> >> + __u8 valid;
> > why do we need this flag?
>
> Useless, will remove.
>
> >
> >> + __u8 u8_padding;
> >> + __u32 padding;
> >> + } flags;
> >> +};
> >> +
> >> +struct vhost_vring_iotlb_entry {
> >> + unsigned int index;
> >> + __u64 userspace_addr;
> >> +};
> >> +
> >> struct vhost_memory_region {
> >> __u64 guest_phys_addr;
> >> __u64 memory_size; /* bytes */
> >> @@ -127,6 +153,15 @@ struct vhost_memory {
> >> /* Set eventfd to signal an error */
> >> #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> >>
> >> +/* IOTLB */
> >> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
> > typo
>
> Will fix it.
>
> >
> >> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct \
> >> + vhost_vring_file)
> >> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct \
> >> + vhost_vring_iotlb_entry)
> >> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
> >> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
> >> +
> > Is the assumption that userspace must dedicate a thread to running the iotlb?
> > I dislike this one.
> > Please support asynchronous APIs at least optionally to make
> > userspace make its own threading decisions.
>
> Nope, my qemu patches do not use a dedicated thread. This API is used
> to start or stop DMAR according to e.g. whether the guest enables DMAR for
> intel IOMMU.
I see. Seems rather confusing - do we need to start/stop it
while device is running?
> >
> >> /* VHOST_NET specific defines */
> >>
> >> /* Attach virtio net ring to a raw socket, or tap device.
> > Don't we need a feature bit for this?
>
> Yes we need it. The feature bit is not considered in this patch and
> looks like it was still under discussion. After we finalize it, I will add.
>
> > Are we short on feature bits? If yes maybe it's time to add
> > something like PROTOCOL_FEATURES that we have in vhost-user.
> >
>
> I believe it can just work like VERSION_1, or is there anything I missed?
VERSION_1 is a virtio feature though. This one would be backend specific
...
--
MST
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-28 15:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
Joerg Roedel, qemu-devel, peterx, linux-kernel,
Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
pbonzini, virtualization, Anthony PERARD
In-Reply-To: <20160428172039-mutt-send-email-mst@redhat.com>
On Thu, 2016-04-28 at 17:34 +0300, Michael S. Tsirkin wrote:
> I see work-arounds for broken IOMMUs but not for
> individual devices. Could you point me to a more specific
> example?
I think the closest example is probably quirk_ioat_snb_local_iommu().
If we see this particular device, we *know* what the topology actually
looks like. We check the hardware setup, and if we're *not* being told
the truth, then we stick it in bypass mode because we know it *isn't*
actually being translated.
Actually, that's almost *identical* to what we want, isn't it?
Except instead of checking undocumented chipset registers, it wants to
be checking "am I on a version of qemu known to lie about virtio being
translated?"
> > We don't actually *need* it for the Intel IOMMU; all we need is for
> > QEMU to stop lying in its DMAR tables.
> We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
> lying about virtio, so we'll need it for a while.
> I think it's easy for QEMU to stop lying about assigned devices,
> so we don't need it for non-virtio devices.
Why is it easier for QEMU to tell the truth about assigned devices,
than it is for virtio? Assuming they both remain actually untranslated
for now, why's it easier to fix the DMAR table for one and not the
other?
(Implementing translation of assigned devices is on my list, but it's a
long way off).
> I don't see how fwcfg can work here. It's a static thing,
> devices can come and go with hotplug.
This touches on something you said elsewhere, that it's
painful/impossible to hot-unplug a translated device and hot-plug an
untranslated device in the same slot (and vice versa).
So let's assume for now that a given slot is indeed static, and either
translated or untranslated. Like the DMAR table, the fwcfg can just
give a list of slots which are (or aren't) translated.
And then you can *only* add a translated device to a translated slot,
or an untranslated device to an untranslated slot.
All the internally-emulated devices *can* be either translated or
untranslated. That's just a matter of software. Surely, you currently
*can't* have translated assigned devices (until someone implements the
whole VT-d page table shadowing or whatever), so you'll be barred from
assigning a device to a slot which *previously* had an untranslated
device. But so what? Put it in a different slot instead.
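The slot rule proposed above can be sketched as a small userspace model. All names here are hypothetical; the per-slot mode would come from something like the DMAR table or an fw_cfg list, neither of which is modelled:

```c
#include <stdbool.h>

enum dma_mode { DMA_UNTRANSLATED, DMA_TRANSLATED };

/* Hypothetical per-slot record: the mode is fixed for the slot's lifetime. */
struct slot_info {
    enum dma_mode mode;
    bool occupied;
};

/* A device may be plugged only into a slot advertising the same mode. */
static bool hotplug_allowed(const struct slot_info *slot, enum dma_mode dev_mode)
{
    return slot->mode == dev_mode;
}

/* Return the index of the first free slot with a matching mode, or -1. */
static int find_compatible_slot(const struct slot_info *slots, int n,
                                enum dma_mode dev_mode)
{
    for (int i = 0; i < n; i++) {
        if (!slots[i].occupied && hotplug_allowed(&slots[i], dev_mode))
            return i;
    }
    return -1;
}
```

The "put it in a different slot" step corresponds to find_compatible_slot(): the management layer searches for a free slot of the right kind rather than changing what an existing slot advertises.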
--
dwmw2
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-04-28 15:37 UTC (permalink / raw)
To: David Woodhouse
Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
Joerg Roedel, qemu-devel, peterx, linux-kernel,
Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
pbonzini, virtualization, Anthony PERARD
In-Reply-To: <1461856314.33870.98.camel@infradead.org>
On Thu, Apr 28, 2016 at 04:11:54PM +0100, David Woodhouse wrote:
> On Thu, 2016-04-28 at 17:34 +0300, Michael S. Tsirkin wrote:
> > I see work-arounds for broken IOMMUs but not for
> > individual devices. Could you point me to a more specific
> > example?
>
> I think the closest example is probably quirk_ioat_snb_local_iommu().
OK, so for intel, it seems that it's enough to set
pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
for the device.
Do I have to poke at each iommu implementation to find
a way to do this, or is there some way to do it
portably?
> If we see this particular device, we *know* what the topology actually
> looks like. We check the hardware setup, and if we're *not* being told
> the truth, then we stick it in bypass mode because we know it *isn't*
> actually being translated.
>
> Actually, that's almost *identical* to what we want, isn't it?
>
> Except instead of checking undocumented chipset registers, it wants to
> be checking "am I on a version of qemu known to lie about virtio being
> translated?"
Not exactly - I think that future versions of qemu might lie
about some devices but not others.
> > > We don't actually *need* it for the Intel IOMMU; all we need is for
> > > QEMU to stop lying in its DMAR tables.
> > We need it for legacy QEMU anyway, and it's not easy for QEMU to stop
> > lying about virtio, so we'll need it for a while.
> > I think it's easy for QEMU to stop lying about assigned devices,
> > so we don't need it for non-virtio devices.
>
> Why is it easier for QEMU to tell the truth about assigned devices,
> than it is for virtio? Assuming they both remain actually untranslated
> for now, why's it easier to fix the DMAR table for one and not the
> other?
>
> (Implementing translation of assigned devices is on my list, but it's a
> long way off).
DMAR is unfortunately not a good match for what people do with QEMU.
There is a patchset on list fixing translation of assigned
devices. So the fix for these will simply be to do translation for all
assigned devices. It's harder for virtio as it isn't always
processed in QEMU - there's vhost in kernel and an out of process
vhost-user plugin. So we can end up e.g. with modern QEMU which
does translate in-process virtio but not out of process one.
> > I don't see how fwcfg can work here. It's a static thing,
> > devices can come and go with hotplug.
>
> This touches on something you said elsewhere, that it's
> painful/impossible to hot-unplug a translated device and hot-plug an
> untranslated device in the same slot (and vice versa).
>
> So let's assume for now that a given slot is indeed static, and either
> translated or untranslated. Like the DMAR table, the fwcfg can just
> give a list of slots which are (or aren't) translated.
>
> And then you can *only* add a translated device to a translated slot,
> or an untranslated device to an untranslated slot.
>
> All the internally-emulated devices *can* be either translated or
> untranslated. That's just a matter of software. Surely, you currently
> *can't* have translated assigned devices (until someone implements the
> whole VT-d page table shadowing or whatever), so you'll be barred from
> assigning a device to a slot which *previously* had an untranslated
> device. But so what? Put it in a different slot instead.
Unfortunately people got used to being able to put any device
in any slot, and built external tools around that ability.
It's rather painful to break this assumption.
> --
> dwmw2
>
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: David Woodhouse @ 2016-04-28 15:48 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
Joerg Roedel, qemu-devel, peterx, linux-kernel,
Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
pbonzini, virtualization, Anthony PERARD
In-Reply-To: <20160428182341-mutt-send-email-mst@redhat.com>
On Thu, 2016-04-28 at 18:37 +0300, Michael S. Tsirkin wrote:
> OK, so for intel, it seems that it's enough to set
> pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
> for the device.
Yes, currently. Although that's vile. In fact what we *want* to happen
is for the intel-iommu code simply to decline to provide DMA ops for
this device, and let it fall back to the swiotlb or no-op DMA ops, as
appropriate.
As it is, we have the intel-iommu DMA ops *unconditionally*, and they
have a hack to manually fall back to calling swiotlb. It's all just
horrid, which is why I want to clean it up with nice per-device DMA ops
and discovery thereof :)
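A userspace model of the per-device DMA ops idea, combined with the "simple quirk for all virtio devices" suggested earlier in the thread, might look like the sketch below. Everything in it is hypothetical except the vendor ID 0x1af4, which is the PCI vendor ID used by virtio devices; the real kernel structures and the eventual per-device ops API are not shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a DMA ops table; only the identity matters in this model. */
struct dma_ops_model {
    const char *name;
};

static const struct dma_ops_model iommu_ops  = { "iommu" };
static const struct dma_ops_model direct_ops = { "direct" };

/* Stand-in for a PCI device with a per-device ops pointer. */
struct device_model {
    uint16_t vendor;
    const struct dma_ops_model *dma_ops;
};

#define PCI_VENDOR_ID_REDHAT_QUMRANET 0x1af4 /* virtio PCI devices */

/* The quirk: treat every virtio device as untranslated. */
static int device_bypasses_iommu(const struct device_model *dev)
{
    return dev->vendor == PCI_VENDOR_ID_REDHAT_QUMRANET;
}

/*
 * Chosen once at device instantiation: the IOMMU code declines the device
 * and it falls back to direct ops, instead of the IOMMU ops being installed
 * unconditionally and bypassing internally.
 */
static void assign_dma_ops(struct device_model *dev)
{
    dev->dma_ops = device_bypasses_iommu(dev) ? &direct_ops : &iommu_ops;
}
```

The point of the design is that the decision lives in one place in the IOMMU/platform code and not in the virtio driver, so refining the quirk later only changes device_bypasses_iommu().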
> Do I have to poke at each iommu implementation to find
> a way to do this, or is there some way to do it
> portably?
There *will* be.... Christoph has already done some of the cleanup in
this space, and I need to take stock of what he's already done, and
finish off the parts I want to build on top of it.
> Not exactly - I think that future versions of qemu might lie
> about some devices but not others.
Can we keep this simple?
QEMU currently lies about some devices. Let's implement a heuristic for
the guest OS to know about that, and react accordingly.
Then let's fix QEMU to tell the truth. All the time, unconditionally.
Even on POWER/ARM where there's no obvious *way* for it to tell the
truth (because you don't have the flexibility that DMAR tables do), and
we need to devise a way to put it in the device-tree or fwcfg or
something else.
And only once QEMU consistently tells the *truth*, then we can start to
do new stuff and let it actually change its behaviour.
> DMAR is unfortunately not a good match for what people do with QEMU.
>
> There is a patchset on list fixing translation of assigned
> devices. So the fix for these will simply be to do translation for
> all assigned devices. It's harder for virtio as it isn't always
> processed in QEMU - there's vhost in kernel and an out of process
> vhost-user plugin. So we can end up e.g. with modern QEMU which
> does translate in-process virtio but not out of process one.
Right... just stop. Fix QEMU to tell the truth first, and *then* once
we can trust it, we can start to change its behaviour. :)
> Unfortunately people got used to being able to put any device
> in any slot, and built external tools around that ability.
> It's rather painful to break this assumption.
Well, if you just said you have a patch set which allows translation of
assigned devices then you are most of the way there, aren't you? We
just need to fix the out-of-process virtio case, and everything can be
either translated or untranslated?
--
dwmw2
^ permalink raw reply
* Re: [PATCH resend] powerpc: enable qspinlock and its virtualization support
From: Waiman Long @ 2016-04-28 21:07 UTC (permalink / raw)
To: Pan Xinhui
Cc: jeremy, tglx, peterz, Michael Ellerman, Boqun Feng, linux-kernel,
virtualization, chrisw, marc.zyngier, Paul Mackerras, akataria,
Paul E. McKenney, linuxppc-dev, adam.buchbinder, gwshan
In-Reply-To: <5721EC0E.8040506@linux.vnet.ibm.com>
On 04/28/2016 06:55 AM, Pan Xinhui wrote:
> From: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>
> This patch aims to enable qspinlock on PPC. And on pseries platform, it also support
> paravirt qspinlock.
>
> Signed-off-by: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
> ---
> arch/powerpc/include/asm/qspinlock.h | 37 +++++++++++++++
> arch/powerpc/include/asm/qspinlock_paravirt.h | 36 +++++++++++++++
> .../powerpc/include/asm/qspinlock_paravirt_types.h | 13 ++++++
> arch/powerpc/include/asm/spinlock.h | 31 ++++++++-----
> arch/powerpc/include/asm/spinlock_types.h | 4 ++
> arch/powerpc/kernel/paravirt.c | 52 ++++++++++++++++++++++
> arch/powerpc/lib/locks.c | 32 +++++++++++++
> arch/powerpc/platforms/pseries/setup.c | 5 +++
> 8 files changed, 198 insertions(+), 12 deletions(-)
> create mode 100644 arch/powerpc/include/asm/qspinlock.h
> create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
> create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
> create mode 100644 arch/powerpc/kernel/paravirt.c
>
>
This is just an enablement patch. You will also need a patch to activate
qspinlock for, at least, some PPC configs. Right?
It has a dependency on the pv_wait() patch that I sent out to extend the
parameter list. Some performance data on how a PPC system performs
with and without qspinlock would also be helpful data points.
Cheers,
Longman
^ permalink raw reply
* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Jason Wang @ 2016-04-29 1:12 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
pbonzini
In-Reply-To: <20160428173543-mutt-send-email-mst@redhat.com>
On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>
>> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
>>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>>>> This patch tries to implement an device IOTLB for vhost. This could be
>>>> used with for co-operation with userspace(qemu) implementation of DMA
>>>> remapping.
>>>>
>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>> the assistance of userspace to do the translation, this is done
>>>> through:
>>>>
>>>> - Fill the translation request in a preset userspace address (This
>>>> address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>>>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>>> VHOST_SET_IOTLB_FD).
>>> Why use an eventfd for this?
>> The aim is to implement the API all through ioctls.
>>
>>> We use them for interrupts because
>>> that happens to be what kvm wants, but here - why don't we
>>> just add a generic support for reading out events
>>> on the vhost fd itself?
>> I've considered this approach, but what's the advantages of this? I mean
>> looks like all other ioctls could be done through vhost fd
>> reading/writing too.
> read/write have a non-blocking flag.
>
> It's not useful for other ioctls but it's useful here.
>
Ok, this looks better.
>>>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>>>
>>>> When userspace finishes the translation, it will update the vhost
>>>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>>>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>>>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
>>> There's one problem here, and that is that VQs still do not undergo
>>> translation. In theory VQ could be mapped in such a way
>>> that it's not contiguous in userspace memory.
>> I'm not sure I get the issue, current vhost API support setting
>> desc_user_addr, used_user_addr and avail_user_addr independently. So
>> looks ok? If not, looks not a problem to device IOTLB API itself.
> The problem is that addresses are all HVA.
>
> Without an iommu, we ask for them to be contiguous and
> since bus address == GPA, this means contiguous GPA =>
> contiguous HVA. With an IOMMU you can map contiguous
> bus address but non contiguous GPA and non contiguous HVA.
Yes, so the issue is that we should not reuse VHOST_SET_VRING_ADDR but
instead invent a new ioctl to set the bus address (guest iova), then
access the VQ through the device IOTLB too.
>
> Another concern: what if guest changes the GPA while keeping bus address
> constant? Normal devices will work because they only use
> bus addresses, but virtio will break.
If we access VQ through device IOTLB too, this could be solved.
>
>
>
>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> What limits amount of entries that kernel keeps around?
>> It depends on guest working set I think. Looking at
>> http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:
>>
>> - For 2MB page size in guest, it suggests hugepages=1024
>> - For 1GB page size, it suggests a hugepages=4
>>
>> So I choose 2048 to make sure it can cover this.
> 4K page size is rather common, too.
I assume hugepages are widely used, and there's a note in the above link:
"For 64-bit applications, it is recommended to use 1 GB hugepages if the
platform supports them."
For the 4K case, the TLB hit rate will be very low for a large working set
even in a physical environment. I'm not sure we should care; if we do, we
can probably cache more translations in userspace's device IOTLB
implementation.
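The arithmetic behind the 2048-entry figure can be checked with a short sketch. It assumes one cached entry per guest page with no coalescing of adjacent pages, which the thread does not state; the helper names are hypothetical:

```c
#include <stdint.h>

/* Bytes of guest memory covered if every cached entry maps exactly one page. */
static uint64_t iotlb_coverage_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/* Entries needed to cover a working set of the given size, rounding up. */
static uint64_t iotlb_entries_needed(uint64_t working_set, uint64_t page_size)
{
    return (working_set + page_size - 1) / page_size;
}
```

Under that assumption, 2048 entries of 2 MB pages cover 4 GiB, comfortably more than the 1024 hugepages (2 GiB) suggested by the DPDK guide, and 4 entries suffice for 4 x 1 GB pages. With 4K pages the same 2048 entries cover only 8 MiB, and a 2 GiB working set would need 524288 entries, which is why the 4K case is the awkward one.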
>
>>> Do we want at least a mod parameter for this?
>> Maybe.
>>
>>>> ---
>>>> drivers/vhost/net.c | 6 +-
>>>> drivers/vhost/vhost.c | 301 +++++++++++++++++++++++++++++++++++++++------
>>>> drivers/vhost/vhost.h | 17 ++-
>>>> fs/eventfd.c | 3 +-
>>>> include/uapi/linux/vhost.h | 35 ++++++
>>>> 5 files changed, 320 insertions(+), 42 deletions(-)
>>>>
>> [...]
>>
>>>> +struct vhost_iotlb_entry {
>>>> + __u64 iova;
>>>> + __u64 size;
>>>> + __u64 userspace_addr;
>>> Alignment requirements?
>> The API does not require any alignment. Will add a comment for this.
>>
>>>> + struct {
>>>> +#define VHOST_ACCESS_RO 0x1
>>>> +#define VHOST_ACCESS_WO 0x2
>>>> +#define VHOST_ACCESS_RW 0x3
>>>> + __u8 perm;
>>>> +#define VHOST_IOTLB_MISS 1
>>>> +#define VHOST_IOTLB_UPDATE 2
>>>> +#define VHOST_IOTLB_INVALIDATE 3
>>>> + __u8 type;
>>>> +#define VHOST_IOTLB_INVALID 0x1
>>>> +#define VHOST_IOTLB_VALID 0x2
>>>> + __u8 valid;
>>> why do we need this flag?
>> Useless, will remove.
>>
>>>> + __u8 u8_padding;
>>>> + __u32 padding;
>>>> + } flags;
>>>> +};
>>>> +
>>>> +struct vhost_vring_iotlb_entry {
>>>> + unsigned int index;
>>>> + __u64 userspace_addr;
>>>> +};
>>>> +
>>>> struct vhost_memory_region {
>>>> __u64 guest_phys_addr;
>>>> __u64 memory_size; /* bytes */
>>>> @@ -127,6 +153,15 @@ struct vhost_memory {
>>>> /* Set eventfd to signal an error */
>>>> #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>>>
>>>> +/* IOTLB */
>>>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
>>> typo
>> Will fix it.
>>
>>>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct \
>>>> + vhost_vring_file)
>>>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct \
>>>> + vhost_vring_iotlb_entry)
>>>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>>>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>>>> +
>>> Is the assumption that userspace must dedicate a thread to running the iotlb?
>>> I dislike this one.
>>> Please support asynchronous APIs at least optionally to make
>>> userspace make its own threading decisions.
>> Nope, my qemu patches do not use a dedicated thread. This API is used
>> to start or stop DMAR according to e.g. whether the guest enables DMAR for
>> intel IOMMU.
> I see. Seems rather confusing - do we need to start/stop it
> while device is running?
Technically, the guest driver (e.g. intel iommu) can stop DMAR at any time.
>
>>>> /* VHOST_NET specific defines */
>>>>
>>>> /* Attach virtio net ring to a raw socket, or tap device.
>>> Don't we need a feature bit for this?
>> Yes we need it. The feature bit is not considered in this patch and
>> looks like it was still under discussion. After we finalize it, I will add.
>>
>>> Are we short on feature bits? If yes maybe it's time to add
>>> something like PROTOCOL_FEATURES that we have in vhost-user.
>>>
>> I believe it can just work like VERSION_1, or is there anything I missed?
> VERSION_1 is a virtio feature though. This one would be backend specific
> ...
Any differences? Considering we want the feature to be something like
VIRTIO_F_HOST_IOMMU, couldn't vhost just add this to VHOST_FEATURES?
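For reference, the "work like VERSION_1" approach amounts to plain bitmask negotiation, sketched below. VIRTIO_F_VERSION_1 is bit 32 in the virtio specification; the bit number given to the proposed IOMMU feature is a placeholder for illustration only, since the thread has not assigned one:

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1   32 /* from the virtio specification */
#define SKETCH_F_HOST_IOMMU  33 /* hypothetical bit number, illustration only */

/* Features usable on a connection are those both sides offer. */
static uint64_t negotiate_features(uint64_t backend_features,
                                   uint64_t driver_features)
{
    return backend_features & driver_features;
}

/* Test a single feature bit in a 64-bit feature word. */
static bool has_feature(uint64_t features, unsigned int bit)
{
    return (features & (1ULL << bit)) != 0;
}
```

This handles a feature the guest driver can see and acknowledge. A backend-specific capability that only QEMU and vhost need to agree on does not pass through the guest at all, which is the distinction raised with the PROTOCOL_FEATURES suggestion.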
^ permalink raw reply
* Re: [PATCH resend] powerpc: enable qspinlock and its virtualization support
From: Pan Xinhui @ 2016-04-29 2:34 UTC (permalink / raw)
To: Waiman Long
Cc: jeremy, tglx, peterz, Michael Ellerman, Boqun Feng, linux-kernel,
virtualization, chrisw, marc.zyngier, Paul Mackerras, akataria,
Paul E. McKenney, linuxppc-dev, adam.buchbinder, gwshan
In-Reply-To: <57227BAA.7040807@hpe.com>
On 2016-04-29 05:07, Waiman Long wrote:
> On 04/28/2016 06:55 AM, Pan Xinhui wrote:
>> From: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>>
>> This patch aims to enable qspinlock on PPC. And on pseries platform, it also support
>> paravirt qspinlock.
>>
>> Signed-off-by: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>> ---
>> arch/powerpc/include/asm/qspinlock.h | 37 +++++++++++++++
>> arch/powerpc/include/asm/qspinlock_paravirt.h | 36 +++++++++++++++
>> .../powerpc/include/asm/qspinlock_paravirt_types.h | 13 ++++++
>> arch/powerpc/include/asm/spinlock.h | 31 ++++++++-----
>> arch/powerpc/include/asm/spinlock_types.h | 4 ++
>> arch/powerpc/kernel/paravirt.c | 52 ++++++++++++++++++++++
>> arch/powerpc/lib/locks.c | 32 +++++++++++++
>> arch/powerpc/platforms/pseries/setup.c | 5 +++
>> 8 files changed, 198 insertions(+), 12 deletions(-)
>> create mode 100644 arch/powerpc/include/asm/qspinlock.h
>> create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
>> create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt_types.h
>> create mode 100644 arch/powerpc/kernel/paravirt.c
>>
>>
>
> This is just an enablement patch. You will also need a patch to activate qspinlock for, at least, some PPC configs. Right?
>
Yep, I want to enable these in the config and Makefile last.
It would look something like:
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 2da380f..ae7c2f1 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_PPC_970_NAP) += idle_power4.o
obj-$(CONFIG_PPC_P7_NAP) += idle_power7.o
procfs-y := proc_powerpc.o
obj-$(CONFIG_PROC_FS) += $(procfs-y)
+obj-$(CONFIG_PARAVIRT_SPINLOCKS) += paravirt.o
rtaspci-$(CONFIG_PPC64)-$(CONFIG_PCI) := rtas_pci.o
obj-$(CONFIG_PPC_RTAS) += rtas.o rtas-rtc.o $(rtaspci-y-y)
obj-$(CONFIG_PPC_RTAS_DAEMON) += rtasd.o
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb..46632e4 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU if SMP
select ARCH_RANDOM
select PPC_DOORBELL
+ select ARCH_USE_QUEUED_SPINLOCKS
default y
config PPC_SPLPAR
@@ -127,3 +128,11 @@ config HV_PERF_CTRS
systems. 24x7 is available on Power 8 systems.
If unsure, select Y.
+
+config PARAVIRT_SPINLOCKS
+ bool "Paravirtualization support for qspinlock"
+ depends on PPC_SPLPAR && QUEUED_SPINLOCKS
+ default y
+ help
+ If the platform supports virtualization, for example PowerVM, this option
+ can give the guest better performance.
--
2.4.3
> It has a dependency on the pv_wait() patch that I sent out to extend the parameter list. Some performance data on how a PPC system performs with and without qspinlock would also be helpful data points.
>
For now, pv_wait as defined in ppc is static inline void pv_wait(u8 *ptr, u8 val).
My plan is to wait for your patch to go into the kernel tree first, then send out another patch to extend the parameter list.
Yes, I need to add some performance data to my patch's comments.
thanks
xinhui
> Cheers,
> Longman
>
^ permalink raw reply related
* Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
From: Jason Wang @ 2016-04-29 4:44 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, qemu-devel, netdev, linux-kernel, peterx, virtualization,
pbonzini
In-Reply-To: <5722B511.6060401@redhat.com>
On 04/29/2016 09:12 AM, Jason Wang wrote:
> On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
>> On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>>
>>> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>>>>> This patch tries to implement an device IOTLB for vhost. This could be
>>>>> used with for co-operation with userspace(qemu) implementation of DMA
>>>>> remapping.
>>>>>
>>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>>> the assistance of userspace to do the translation, this is done
>>>>> through:
>>>>>
>>>>> - Fill the translation request in a preset userspace address (This
>>>>> address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>>>>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>>>> VHOST_SET_IOTLB_FD).
>>>> Why use an eventfd for this?
>>> The aim is to implement the API all through ioctls.
>>>
>>>> We use them for interrupts because
>>>> that happens to be what kvm wants, but here - why don't we
>>>> just add a generic support for reading out events
>>>> on the vhost fd itself?
>>> I've considered this approach, but what's the advantages of this? I mean
>>> looks like all other ioctls could be done through vhost fd
>>> reading/writing too.
>> read/write have a non-blocking flag.
>>
>> It's not useful for other ioctls but it's useful here.
>>
> Ok, this looks better.
>
>>>>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>>>>
>>>>> When userspace finishes the translation, it will update the vhost
>>>>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>>>>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>>>>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
>>>> There's one problem here, and that is that VQs still do not undergo
>>>> translation. In theory VQ could be mapped in such a way
>>>> that it's not contiguous in userspace memory.
>>> I'm not sure I get the issue, current vhost API support setting
>>> desc_user_addr, used_user_addr and avail_user_addr independently. So
>>> looks ok? If not, looks not a problem to device IOTLB API itself.
>> The problem is that addresses are all HVA.
>>
>> Without an iommu, we ask for them to be contiguous and
>> since bus address == GPA, this means contiguous GPA =>
>> contiguous HVA. With an IOMMU you can map contiguous
>> bus address but non contiguous GPA and non contiguous HVA.
> Yes, so the issue is that we should not reuse VHOST_SET_VRING_ADDR but
> instead invent a new ioctl to set the bus address (guest iova), then
> access the VQ through the device IOTLB too.
Note that userspace already checks for this and falls back to userspace if it
detects non-contiguous GPA. Considering this happens rarely, I'm not sure we
should handle it.
>
>>
>> Another concern: what if guest changes the GPA while keeping bus address
>> constant? Normal devices will work because they only use
>> bus addresses, but virtio will break.
> If we access VQ through device IOTLB too, this could be solved.
>
I don't see a reason why the guest would want to change the GPA during DMA.
Even if it can, it needs lots of other synchronization.
^ permalink raw reply
* Re: [PATCH V2 RFC] fixup! virtio: convert to use DMA api
From: Michael S. Tsirkin @ 2016-05-01 10:37 UTC (permalink / raw)
To: David Woodhouse
Cc: Wei Liu, Stefan Hajnoczi, qemu-block, Stefano Stabellini,
Joerg Roedel, qemu-devel, peterx, linux-kernel,
Christian Borntraeger, iommu, Andy Lutomirski, kvm, Amit Shah,
pbonzini, virtualization, Anthony PERARD
In-Reply-To: <1461858505.33870.108.camel@infradead.org>
On Thu, Apr 28, 2016 at 04:48:25PM +0100, David Woodhouse wrote:
> On Thu, 2016-04-28 at 18:37 +0300, Michael S. Tsirkin wrote:
> > OK, so for intel, it seems that it's enough to set
> > pdev->dev.archdata.iommu = DUMMY_DEVICE_DOMAIN_INFO;
> > for the device.
>
> Yes, currently. Although that's vile. In fact what we *want* to happen
> is for the intel-iommu code simply to decline to provide DMA ops for
> this device, and let it fall back to the swiotlb or no-op DMA ops, as
> appropriate.
>
> As it is, we have the intel-iommu DMA ops *unconditionally*, and they
> have a hack to manually fall back to calling swiotlb. It's all just
> horrid, which is why I want to clean it up with nice per-device DMA ops
> and discovery thereof :)
>
> > Do I have to poke at each iommu implementation to find
> > a way to do this, or is there some way to do it
> > portably?
>
> There *will* be.... Christoph has already done some of the cleanup in
> this space, and I need to take stock of what he's already done, and
> finish off the parts I want to build on top of it.
>
> > Not exactly - I think that future versions of qemu might lie
> > about some devices but not others.
>
> Can we keep this simple?
>
> QEMU currently lies about some devices. Let's implement a heuristic for
> the guest OS to know about that, and react accordingly.
>
> Then let's fix QEMU to tell the truth. All the time, unconditionally.
> Even on POWER/ARM where there's no obvious *way* for it to tell the
> truth (because you don't have the flexibility that DMAR tables do), and
> we need to devise a way to put it in the device-tree or fwcfg or
> something else.
Right. Unfortunately all these aren't easy to implement at all.
So I'm inclined to go the "something else" route.
It has the added benefit of giving us a heuristic for free.
> And only once QEMU consistently tells the *truth*, then we can start to
> do new stuff and let it actually change its behaviour.
>
> > DMAR is unfortunately not a good match for what people do with QEMU.
> >
> > There is a patchset on list fixing translation of assigned
> > devices. So the fix for these will simply be to do translation for
> > all assigned devices. It's harder for virtio as it isn't always
> > processed in QEMU - there's vhost in kernel and an out of process
> > vhost-user plugin. So we can end up e.g. with modern QEMU which
> > does translate in-process virtio but not out of process one.
>
> Right... just stop. Fix QEMU to tell the truth first, and *then* once
> we can trust it, we can start to change its behaviour. :)
>
> > Unfortunately people got used to being able to put any device
> > in any slot, and built external tools around that ability.
> > It's rather painful to break this assumption.
>
> Well, if you just said you have a patch set which allows translation of
> assigned devices then you are most of the way there, aren't you? We
> just need to fix the out-of-process virtio case, and everything can be
> either translated or untranslated?
Absolutely. But that "just" will take a while. With out of process
there's always a chance that remote doesn't implement translation. E.g.
new QEMU running on an old host kernel.
> --
> dwmw2
>
^ permalink raw reply
* Re: [PATCH] vhost_net: stop polling socket during rx processing
From: Jason Wang @ 2016-05-03 2:52 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <5721AB7B.8070806@redhat.com>
On 04/28/2016 02:19 PM, Jason Wang wrote:
> On 04/27/2016 07:28 PM, Michael S. Tsirkin wrote:
>> > On Tue, Apr 26, 2016 at 03:35:53AM -0400, Jason Wang wrote:
>>> >> We don't stop polling socket during rx processing, this will lead
>>> >> unnecessary wakeups from under layer net devices (E.g
>>> >> sock_def_readable() form tun). Rx will be slowed down in this
>>> >> way. This patch avoids this by stop polling socket during rx
>>> >> processing. A small drawback is that this introduces some overheads in
>>> >> light load case because of the extra start/stop polling, but single
>>> >> netperf TCP_RR does not notice any change. In a super heavy load case,
>>> >> e.g using pktgen to inject packet to guest, we get about ~17%
>>> >> improvement on pps:
>>> >>
>>> >> before: ~1370000 pkt/s
>>> >> after: ~1500000 pkt/s
>>> >>
>>> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
>> >
>> > There is one other possible enhancement: we actually have the wait queue
>> > lock taken in _wake_up, but we give it up only to take it again in the
>> > handler.
>> >
>> > It would be nicer to just remove the entry when we wake
>> > the vhost thread. Re-add it if required.
>> > I think that something like the below would give you the necessary API.
>> > Pls feel free to use it if you are going to implement a patch on top
>> > doing this - that's not a reason not to include this simple patch
>> > though.
> Thanks, this looks useful, will give it a try.
I wanted to try this, but it looks like it would result in a strange API:
- the poll entry is removed automatically during wakeup, so the handler
does not need to care about this
- but the handler still needs to re-add the poll entry explicitly in the code
?
^ permalink raw reply
* [PATCH] tools/virtio/ringtest: add usage example to README
From: Mike Rapoport @ 2016-05-04 6:12 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Mike Rapoport, kvm, virtualization
Having a typical usage example in the README file is more convenient than
having it only in the git history...
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
tools/virtio/ringtest/README | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/virtio/ringtest/README b/tools/virtio/ringtest/README
index 34e94c4..d83707a 100644
--- a/tools/virtio/ringtest/README
+++ b/tools/virtio/ringtest/README
@@ -1,2 +1,6 @@
Partial implementation of various ring layouts, useful to tune virtio design.
Uses shared memory heavily.
+
+Typical use:
+
+# sh run-on-all.sh perf stat -r 10 --log-fd 1 -- ./ring
--
1.9.1
^ permalink raw reply related
* [PATCH] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Mike Rapoport @ 2016-05-04 6:12 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Mike Rapoport, kvm, virtualization
In-Reply-To: <1462342376-16065-1-git-send-email-rppt@linux.vnet.ibm.com>
/dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
Using /proc/cpuinfo to get processors count is more portable.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
tools/virtio/ringtest/run-on-all.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
index 52b0f71..38ccfa3 100755
--- a/tools/virtio/ringtest/run-on-all.sh
+++ b/tools/virtio/ringtest/run-on-all.sh
@@ -3,10 +3,10 @@
#use last CPU for host. Why not the first?
#many devices tend to use cpu0 by default so
#it tends to be busier
-HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
+HOST_AFFINITY=$(($(grep -c processor /proc/cpuinfo) - 1))
#run command on all cpus
-for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
+for cpu in $(seq 0 $HOST_AFFINITY)
do
#Don't run guest and host on same CPU
#It actually works ok if using signalling
--
1.9.1
^ permalink raw reply related
* Re: [PATCH] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Cornelia Huck @ 2016-05-04 6:40 UTC (permalink / raw)
To: Mike Rapoport; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1462342376-16065-2-git-send-email-rppt@linux.vnet.ibm.com>
On Wed, 4 May 2016 09:12:56 +0300
Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> /dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
> Using /proc/cpuinfo to get processors count is more portable.
>
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> ---
> tools/virtio/ringtest/run-on-all.sh | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
> index 52b0f71..38ccfa3 100755
> --- a/tools/virtio/ringtest/run-on-all.sh
> +++ b/tools/virtio/ringtest/run-on-all.sh
> @@ -3,10 +3,10 @@
> #use last CPU for host. Why not the first?
> #many devices tend to use cpu0 by default so
> #it tends to be busier
> -HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
> +HOST_AFFINITY=$(($(grep -c processor /proc/cpuinfo) - 1))
>
> #run command on all cpus
> -for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
> +for cpu in $(seq 0 $HOST_AFFINITY)
> do
> #Don't run guest and host on same CPU
> #It actually works ok if using signalling
As you're touching this anyway: Is there any way to avoid the
architecture-specific /proc/cpuinfo and do whatever lscpu does?
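
One architecture-neutral source for this is the CPU list the kernel exports in sysfs. The sketch below is only an illustration, not what lscpu itself is confirmed to do: `last_cpu_in_list` is a hypothetical helper name, and it assumes sysfs is mounted at /sys and that the input uses the usual kernel CPU list format (e.g. "0-3,8-11").

```shell
#!/bin/sh
# Hypothetical helper: print the highest CPU number in a kernel CPU list
# string such as "0-3,8-11" (the format of /sys/devices/system/cpu/online).
last_cpu_in_list() {
	# Split on commas, keep only the upper bound of each range,
	# then pick the numerically largest value.
	printf '%s\n' "$1" | tr ',' '\n' | sed 's/^.*-//' | sort -n | tail -n 1
}

# Usage on a live system (assumes sysfs is mounted at /sys):
#   HOST_AFFINITY=$(last_cpu_in_list "$(cat /sys/devices/system/cpu/online)")
last_cpu_in_list "0-3,8-11"
```

This needs no extra tool beyond what the script already uses, at the cost of parsing the range list by hand.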
^ permalink raw reply
* [PATCH v2] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Mike Rapoport @ 2016-05-04 7:59 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Mike Rapoport, kvm, virtualization
In-Reply-To: <20160504084021.18cf98f0.cornelia.huck@de.ibm.com>
/dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
Using lscpu to get processors count is more portable.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
v2: use lscpu instead of /proc/cpuinfo as per Cornelia's suggestion
tools/virtio/ringtest/run-on-all.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
index 52b0f71..0177d50 100755
--- a/tools/virtio/ringtest/run-on-all.sh
+++ b/tools/virtio/ringtest/run-on-all.sh
@@ -3,10 +3,10 @@
#use last CPU for host. Why not the first?
#many devices tend to use cpu0 by default so
#it tends to be busier
-HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
+HOST_AFFINITY=$(lscpu -p | tail -n 1 | cut -d',' -f1)
#run command on all cpus
-for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
+for cpu in $(seq 0 $HOST_AFFINITY)
do
#Don't run guest and host on same CPU
#It actually works ok if using signalling
--
1.9.1
^ permalink raw reply related
* Re: [PATCH v2] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Cornelia Huck @ 2016-05-04 8:14 UTC (permalink / raw)
To: Mike Rapoport; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1462348793-6944-1-git-send-email-rppt@linux.vnet.ibm.com>
On Wed, 4 May 2016 10:59:53 +0300
Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> /dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
> Using lscpu to get processors count is more portable.
>
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> ---
> v2: use lscpu instead of /proc/cpuinfo as per Cornelia's suggestion
>
> tools/virtio/ringtest/run-on-all.sh | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
> index 52b0f71..0177d50 100755
> --- a/tools/virtio/ringtest/run-on-all.sh
> +++ b/tools/virtio/ringtest/run-on-all.sh
> @@ -3,10 +3,10 @@
> #use last CPU for host. Why not the first?
> #many devices tend to use cpu0 by default so
> #it tends to be busier
> -HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
> +HOST_AFFINITY=$(lscpu -p | tail -n 1 | cut -d',' -f1)
Shorter:
lscpu -p=cpu | tail -1
Should be good enough for a test tool :)
>
> #run command on all cpus
> -for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
> +for cpu in $(seq 0 $HOST_AFFINITY)
> do
> #Don't run guest and host on same CPU
> #It actually works ok if using signalling
^ permalink raw reply
* [PATCH v3] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Mike Rapoport @ 2016-05-04 10:15 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Mike Rapoport, kvm, virtualization
In-Reply-To: <20160504101454.2b62e974.cornelia.huck@de.ibm.com>
/dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
Using lscpu to get processors count is more portable.
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
v3: simplify by using lscpu -p=cpu
v2: use lscpu instead of /proc/cpuinfo as per Cornelia's suggestion
tools/virtio/ringtest/run-on-all.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
index 52b0f71..2e69ca8 100755
--- a/tools/virtio/ringtest/run-on-all.sh
+++ b/tools/virtio/ringtest/run-on-all.sh
@@ -3,10 +3,10 @@
#use last CPU for host. Why not the first?
#many devices tend to use cpu0 by default so
#it tends to be busier
-HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
+HOST_AFFINITY=$(lscpu -p=cpu | tail -1)
#run command on all cpus
-for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
+for cpu in $(seq 0 $HOST_AFFINITY)
do
#Don't run guest and host on same CPU
#It actually works ok if using signalling
--
1.9.1
^ permalink raw reply related
* Re: [PATCH v3] tools/virtio/ringtest: fix run-on-all.sh to work without /dev/cpu
From: Cornelia Huck @ 2016-05-04 10:24 UTC (permalink / raw)
To: Mike Rapoport; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <1462356950-3278-1-git-send-email-rppt@linux.vnet.ibm.com>
On Wed, 4 May 2016 13:15:50 +0300
Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> /dev/cpu is only available on x86 with certain modules (e.g. msr) enabled.
> Using lscpu to get processors count is more portable.
>
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> ---
> v3: simplify by using lscpu -p=cpu
> v2: use lscpu instead of /proc/cpuinfo as per Cornelia's suggestion
>
> tools/virtio/ringtest/run-on-all.sh | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/virtio/ringtest/run-on-all.sh b/tools/virtio/ringtest/run-on-all.sh
> index 52b0f71..2e69ca8 100755
> --- a/tools/virtio/ringtest/run-on-all.sh
> +++ b/tools/virtio/ringtest/run-on-all.sh
> @@ -3,10 +3,10 @@
> #use last CPU for host. Why not the first?
> #many devices tend to use cpu0 by default so
> #it tends to be busier
> -HOST_AFFINITY=$(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n|tail -1)
> +HOST_AFFINITY=$(lscpu -p=cpu | tail -1)
>
> #run command on all cpus
> -for cpu in $(cd /dev/cpu; ls|grep -v '[a-z]'|sort -n);
> +for cpu in $(seq 0 $HOST_AFFINITY)
> do
> #Don't run guest and host on same CPU
> #It actually works ok if using signalling
I think it's fine to depend on lscpu for such a test tool.
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
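
For reference, the `lscpu -p=cpu | tail -1` form relies on the last output line being the highest CPU number, and the `seq 0 $HOST_AFFINITY` loop assumes CPU numbers have no gaps. A slightly more defensive variant could look like the sketch below. It is only a sketch: `list_cpus` and `last_cpu` are hypothetical helper names, not part of run-on-all.sh, and the canned input stands in for real lscpu output.

```shell
#!/bin/sh
# Hypothetical helper: read `lscpu -p=cpu` output on stdin and print the
# CPU numbers it lists, one per line, in ascending numeric order.
# Lines starting with '#' are the header comments lscpu emits in -p mode.
list_cpus() {
	grep -v '^#' | sort -n
}

# Hypothetical helper: print the highest CPU number from the same input.
last_cpu() {
	list_cpus | tail -n 1
}

# Canned input so the sketch runs even where lscpu is not installed.
sample='# header comment
0
1
2
3'
printf '%s\n' "$sample" | last_cpu
```

On a real system the helpers would be fed with `lscpu -p=cpu | last_cpu`, and the loop could iterate over `$(lscpu -p=cpu | list_cpus)` instead of `seq` if gaps in CPU numbering matter. For a test tool the one-liner in the patch is probably sufficient.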
^ permalink raw reply
* [PULL] virtio/qemu: fixes for 4.6
From: Michael S. Tsirkin @ 2016-05-05 13:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: kvm, mst, netdev, linux-kernel, virtualization, dan.carpenter
The following changes since commit c3b46c73264b03000d1e18b22f5caf63332547c9:
Linux 4.6-rc4 (2016-04-17 19:13:32 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to e00f7bd221292b318d4d09c3f0c2c8af9b1e5edf:
virtio: Silence uninitialized variable warning (2016-05-01 15:50:08 +0300)
----------------------------------------------------------------
virtio/qemu: fixes for 4.6
A couple of fixes for virtio and for the new QEMU fw cfg driver.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Dan Carpenter (2):
firmware: qemu_fw_cfg.c: potential unintialized variable
virtio: Silence uninitialized variable warning
drivers/firmware/qemu_fw_cfg.c | 2 +-
drivers/virtio/virtio_ring.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
^ permalink raw reply
* [PATCH v5 00/13] Support non-lru page migration
From: Minchan Kim @ 2016-05-09 2:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, Sergey Senozhatsky, Rafael Aquini, Minchan Kim,
Chan Gyun Jeong, Jonathan Corbet, Hugh Dickins, linux-kernel,
dri-devel, virtualization, John Einar Reitan, linux-mm,
Chulmin Kim, Gioh Kim, Konstantin Khlebnikov, Sangseok Lee,
Joonsoo Kim, Kyeongdon Kim, Naoya Horiguchi, Vlastimil Babka,
Mel Gorman
Recently, I got many reports about performance degradation in embedded
systems (Android mobile phones, webOS TVs and so on) and fork failing easily.
The problem was mainly fragmentation caused by zram and the GPU driver.
Under memory pressure, their pages were spread across all pageblocks and
could not be migrated by the current compaction algorithm, which supports
only LRU pages. In the end, compaction could not work well, so the reclaimer
shrank all of the working set pages. That made the system very slow and even
made fork, which requires order-[2 or 3] allocations, fail easily.
Another pain point is that they cannot use CMA memory space, so when an OOM
kill happens, I can see many free pages in the CMA area, which is not
memory efficient. In our product, which has a big CMA area, it reclaims
zones too excessively to allocate GPU and zram pages although there is
lots of free space in CMA, so the system becomes very slow easily.
To solve these problems, this patchset adds a facility to migrate
non-lru pages by introducing new functions and page flags to help
migration.
struct address_space_operations {
..
..
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
..
}
new page flags
PG_movable
PG_isolated
For details, please read description in "mm: migrate: support non-lru
movable page migration".
Originally, Gioh Kim tried to support this feature, but he moved on, so
I took over the work. I took a lot of code from his work and changed it a
little. Konstantin Khlebnikov helped Gioh a lot, so he deserves much
credit, too.
I should also mention Chulmin, who has tested this patchset heavily,
so I was able to find many bugs thanks to him. :)
Thanks, Gioh, Konstantin and Chulmin!
This patchset consists of five parts.
1. clean up migration
mm: use put_page to free page instead of putback_lru_page
2. add non-lru page migration feature
mm: migrate: support non-lru movable page migration
3. rework KVM memory-ballooning
mm: balloon: use general non-lru movable page feature
4. zsmalloc refactoring for preparing page migration
zsmalloc: keep max_object in size_class
zsmalloc: use bit_spin_lock
zsmalloc: use accessor
zsmalloc: factor page chain functionality out
zsmalloc: introduce zspage structure
zsmalloc: separate free_zspage from putback_zspage
zsmalloc: use freeobj for index
5. zsmalloc page migration
zsmalloc: page migration support
zram: use __GFP_MOVABLE for memory allocation
* From v4
* rebase on mmotm-2016-05-05-17-19
* fix huge object migration - Chulmin
* !CONFIG_COMPACTION support for zsmalloc
* From v3
* rebase on mmotm-2016-04-06-20-40
* fix swap_info deadlock - Chulmin
* race without page_lock - Vlastimil
* no use page._mapcount for potential user-mapped page driver - Vlastimil
* fix and enhance doc/description - Vlastimil
* use page->mapping lower bits to represent PG_movable
* make driver side's rule simple.
* From v2
* rebase on mmotm-2016-03-29-15-54-16
* check PageMovable before lock_page - Joonsoo
* check PageMovable before PageIsolated checking - Joonsoo
* add more description about rule
* From v1
* rebase on v4.5-mmotm-2016-03-17-15-04
* reordering patches to merge clean-up patches first
* add Acked-by/Reviewed-by from Vlastimil and Sergey
* use each own mount model instead of reusing anon_inode_fs - Al Viro
* small changes - YiPing, Gioh
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: dri-devel@lists.freedesktop.org
Cc: Hugh Dickins <hughd@google.com>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Chan Gyun Jeong <chan.jeong@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: Chulmin Kim <cmlaika.kim@samsung.com>
Minchan Kim (12):
mm: use put_page to free page instead of putback_lru_page
mm: migrate: support non-lru movable page migration
mm: balloon: use general non-lru movable page feature
zsmalloc: keep max_object in size_class
zsmalloc: use bit_spin_lock
zsmalloc: use accessor
zsmalloc: factor page chain functionality out
zsmalloc: introduce zspage structure
zsmalloc: separate free_zspage from putback_zspage
zsmalloc: use freeobj for index
zsmalloc: page migration support
zram: use __GFP_MOVABLE for memory allocation
Documentation/filesystems/Locking | 4 +
Documentation/filesystems/vfs.txt | 11 +
Documentation/vm/page_migration | 107 ++-
drivers/block/zram/zram_drv.c | 6 +-
drivers/virtio/virtio_balloon.c | 52 +-
include/linux/balloon_compaction.h | 51 +-
include/linux/fs.h | 2 +
include/linux/ksm.h | 3 +-
include/linux/migrate.h | 5 +
include/linux/mm.h | 1 +
include/linux/page-flags.h | 29 +-
include/uapi/linux/magic.h | 2 +
mm/balloon_compaction.c | 94 +--
mm/compaction.c | 40 +-
mm/ksm.c | 4 +-
mm/migrate.c | 287 ++++++--
mm/page_alloc.c | 2 +-
mm/util.c | 6 +-
mm/vmscan.c | 2 +-
mm/zsmalloc.c | 1351 +++++++++++++++++++++++++-----------
20 files changed, 1453 insertions(+), 606 deletions(-)
--
1.9.1
^ permalink raw reply