* (no subject)
@ 2022-01-14 10:54 Li RongQing
2022-01-14 10:55 ` Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Li RongQing @ 2022-01-14 10:54 UTC (permalink / raw)
To: pbonzini, seanjc, vkuznets, wanpengli, jmattson, tglx, bp, x86,
kvm, joro, peterz
After paravirtualized TLB shootdown support was added, steal_time.preempted
holds not only KVM_VCPU_PREEMPTED but also KVM_VCPU_FLUSH_TLB,
so kvm_vcpu_is_preempted() should test only the KVM_VCPU_PREEMPTED bit.
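For reference, the two flag bits involved (as defined in arch/x86/include/uapi/asm/kvm_para.h):

#define KVM_VCPU_PREEMPTED          (1 << 0)
#define KVM_VCPU_FLUSH_TLB          (1 << 1)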
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
diff with v1:
clear the rest of rax, suggested by Sean and peter
remove Fixes tag, since no issue in practice
arch/x86/kernel/kvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b061d17..45c9ce8d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1025,8 +1025,8 @@ asm(
".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
"__raw_callee_save___kvm_vcpu_is_preempted:"
"movq __per_cpu_offset(,%rdi,8), %rax;"
-"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
-"setne %al;"
+"movb " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax), %al;"
+"and $" __stringify(KVM_VCPU_PREEMPTED) ", %rax;"
"ret;"
".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
".popsection");
--
2.9.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re:
2022-01-14 10:54 Li RongQing
@ 2022-01-14 10:55 ` Paolo Bonzini
2022-01-14 17:13 ` Re: Sean Christopherson
0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2022-01-14 10:55 UTC (permalink / raw)
To: Li RongQing, seanjc, vkuznets, wanpengli, jmattson, tglx, bp, x86,
kvm, joro, peterz
On 1/14/22 11:54, Li RongQing wrote:
> After paravirtualized TLB shootdown support was added, steal_time.preempted
> holds not only KVM_VCPU_PREEMPTED but also KVM_VCPU_FLUSH_TLB,
>
> so kvm_vcpu_is_preempted() should test only the KVM_VCPU_PREEMPTED bit.
>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> diff with v1:
> clear the rest of rax, suggested by Sean and peter
> remove Fixes tag, since no issue in practice
>
> arch/x86/kernel/kvm.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index b061d17..45c9ce8d 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -1025,8 +1025,8 @@ asm(
> ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> "__raw_callee_save___kvm_vcpu_is_preempted:"
> "movq __per_cpu_offset(,%rdi,8), %rax;"
> -"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
> -"setne %al;"
> +"movb " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax), %al;"
> +"and $" __stringify(KVM_VCPU_PREEMPTED) ", %rax;"
This assumes that KVM_VCPU_PREEMPTED is 1. It could also be %eax
(slightly cheaper). Overall, I prefer to leave the code as is using setne.
Paolo
> "ret;"
> ".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
> ".popsection");
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2022-01-14 10:55 ` Paolo Bonzini
@ 2022-01-14 17:13 ` Sean Christopherson
2022-01-14 17:17 ` Re: Paolo Bonzini
0 siblings, 1 reply; 35+ messages in thread
From: Sean Christopherson @ 2022-01-14 17:13 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Li RongQing, vkuznets, wanpengli, jmattson, tglx, bp, x86, kvm,
joro, peterz
On Fri, Jan 14, 2022, Paolo Bonzini wrote:
> On 1/14/22 11:54, Li RongQing wrote:
> > After paravirtualized TLB shootdown support was added, steal_time.preempted
> > holds not only KVM_VCPU_PREEMPTED but also KVM_VCPU_FLUSH_TLB,
> >
> > so kvm_vcpu_is_preempted() should test only the KVM_VCPU_PREEMPTED bit.
> >
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > ---
> > diff with v1:
> > clear the rest of rax, suggested by Sean and peter
> > remove Fixes tag, since no issue in practice
> >
> > arch/x86/kernel/kvm.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index b061d17..45c9ce8d 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -1025,8 +1025,8 @@ asm(
> > ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> > "__raw_callee_save___kvm_vcpu_is_preempted:"
> > "movq __per_cpu_offset(,%rdi,8), %rax;"
> > -"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
> > -"setne %al;"
> > +"movb " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax), %al;"
> > +"and $" __stringify(KVM_VCPU_PREEMPTED) ", %rax;"
>
> This assumes that KVM_VCPU_PREEMPTED is 1.
Ah, right, because technically the compiler is only required to be able to store
'1' and '0' in the boolean. That said, KVM_VCPU_PREEMPTED is ABI and isn't going
to change, so this could be "solved" with a comment.
> It could also be %eax (slightly cheaper).
Ya.
> Overall, I prefer to leave the code as is using setne.
But that also makes dangerous assumptions: (a) that the return type is bool,
and (b) that the compiler uses a single byte for bools.
If the assumption about KVM_VCPU_PREEMPTED being '1' is a sticking point, what
about combining the two to make everyone happy?
andl $" __stringify(KVM_VCPU_PREEMPTED) ", %eax
setnz %al
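For context, a rough sketch of how the full callee-saved helper would then read, assuming the surrounding directives stay as in the current arch/x86/kernel/kvm.c and KVM_VCPU_PREEMPTED remains bit 0:

asm(
".pushsection .text;"
".global __raw_callee_save___kvm_vcpu_is_preempted;"
".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
"__raw_callee_save___kvm_vcpu_is_preempted:"
/* rax = per-CPU base for the target vCPU */
"movq __per_cpu_offset(,%rdi,8), %rax;"
/* load steal_time.preempted, keep only the PREEMPTED bit */
"movb " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax), %al;"
"andl $" __stringify(KVM_VCPU_PREEMPTED) ", %eax;"
/* return 0/1 in al; the andl already cleared the rest of rax */
"setnz %al;"
"ret;"
".size __raw_callee_save___kvm_vcpu_is_preempted, .-__raw_callee_save___kvm_vcpu_is_preempted;"
".popsection");

The andl both tests the bit and, by writing %eax, zero-extends into the full register, so the function still returns a clean 0/1 without relying on the compiler's bool representation.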
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2022-01-14 17:13 ` Re: Sean Christopherson
@ 2022-01-14 17:17 ` Paolo Bonzini
0 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2022-01-14 17:17 UTC (permalink / raw)
To: Sean Christopherson
Cc: Li RongQing, vkuznets, wanpengli, jmattson, tglx, bp, x86, kvm,
joro, peterz
On 1/14/22 18:13, Sean Christopherson wrote:
> If the assumption about KVM_VCPU_PREEMPTED being '1' is a sticking point, what
> about combining the two to make everyone happy?
>
> andl $" __stringify(KVM_VCPU_PREEMPTED) ", %eax
> setnz %al
Sure, that's indeed a nice solution (I appreciate the attention to
detail in setne->setnz, too :)).
Paolo
^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <E1hUrZM-0007qA-Q8@sslproxy01.your-server.de>]
* Re:
[not found] <E1hUrZM-0007qA-Q8@sslproxy01.your-server.de>
@ 2019-05-29 19:54 ` Alex Williamson
0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2019-05-29 19:54 UTC (permalink / raw)
To: Thomas Meyer; +Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Sun, 26 May 2019 13:44:04 +0200
"Thomas Meyer" <thomas@m3y3r.de> wrote:
> From thomas@m3y3r.de Sun May 26 00:13:26 2019
> Subject: [PATCH] vfio-pci/nvlink2: Use vma_pages function instead of explicit
> computation
> To: alex.williamson@redhat.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> Content-Type: text/plain; charset="UTF-8"
> Mime-Version: 1.0
> Content-Transfer-Encoding: 8bit
> X-Patch: Cocci
> X-Mailer: DiffSplit
> Message-ID: <1558822461341-1674464153-1-diffsplit-thomas@m3y3r.de>
> References: <1558822461331-726613767-0-diffsplit-thomas@m3y3r.de>
> In-Reply-To: <1558822461331-726613767-0-diffsplit-thomas@m3y3r.de>
> X-Serial-No: 1
>
> Use vma_pages function on vma object instead of explicit computation.
>
> Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
> ---
>
> diff -u -p a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -161,7 +161,7 @@ static int vfio_pci_nvgpu_mmap(struct vf
>
> atomic_inc(&data->mm->mm_count);
> ret = (int) mm_iommu_newdev(data->mm, data->useraddr,
> - (vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> + vma_pages(vma),
> data->gpu_hpa, &data->mem);
>
> trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr,
Besides the formatting of this patch, there's already a pending patch
with this same change:
https://lkml.org/lkml/2019/5/16/658
I think the original must have bounced from lkml due to the encoding, but
I'll use that one since it came first, is slightly cleaner in wrapping
the line following the change, and already has Alexey's R-b. Thanks,
Alex
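For reference, vma_pages() (defined in include/linux/mm.h) is just a wrapper around the computation the patch replaces:

static inline unsigned long vma_pages(struct vm_area_struct *vma)
{
	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}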
^ permalink raw reply [flat|nested] 35+ messages in thread
* (unknown)
@ 2019-03-19 14:41 Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
0 siblings, 1 reply; 35+ messages in thread
From: Maxim Levitsky @ 2019-03-19 14:41 UTC (permalink / raw)
To: linux-nvme
Cc: Maxim Levitsky, linux-kernel, kvm, Jens Axboe, Alex Williamson,
Keith Busch, Christoph Hellwig, Sagi Grimberg, Kirti Wankhede,
David S . Miller, Mauro Carvalho Chehab, Greg Kroah-Hartman,
Wolfram Sang, Nicolas Ferre, Paul E . McKenney , Paolo Bonzini,
Liang Cunming, Liu Changpeng, Fam Zheng, Amnon Ilan, John
Date: Tue, 19 Mar 2019 14:45:45 +0200
Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
Hi everyone!
In this patch series, I would like to introduce my take on the problem of
virtualizing storage as fast as possible, with an emphasis on low latency.
In this patch series I implemented a kernel, VFIO-based mediated device that
allows the user to pass a partition and/or a whole namespace through to a guest.
The idea behind this driver is based on the paper you can find at
https://www.usenix.org/conference/atc18/presentation/peng,
although note that I started the development independently, prior to reading
this paper.
In addition, the implementation is not based on the code used in the paper,
as I was not able to obtain that source at the time.
***Key points about the implementation:***
* A polling kernel thread is used. Polling is stopped after a
predefined timeout (1/2 sec by default).
Support for a fully interrupt-driven mode is planned, and it already shows promising results.
* The guest sees a standard NVMe device - this allows running guests with
unmodified drivers, for example Windows guests.
* The NVMe device is shared between host and guest.
That means that even a single namespace can be split between host
and guest based on different partitions.
* Simple configuration
*** Performance ***
Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2,
and both latency and throughput are very similar to SPDK.
Soon I will test this on a better server and NVMe device and provide
more formal performance numbers.
Latency numbers:
~80ms - spdk with fio plugin on the host.
~84ms - nvme driver on the host
~87ms - mdev-nvme + nvme driver in the guest
Throughput was following similar pattern as well.
* Configuration example
$ modprobe nvme mdev_queues=4
$ modprobe nvme-mdev
$ UUID=$(uuidgen)
$ DEVICE='device pci address'
$ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
$ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace #attach host namespace 1 partition 3
$ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu #pin the io thread to cpu 11
Afterward boot qemu with
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
Zero configuration on the guest.
*** FAQ ***
* Why do this in the kernel? Why is this better than SPDK?
-> Reuse the existing nvme kernel driver in the host. No new drivers in the guest.
-> Share the NVMe device between host and guest.
Even in fully virtualized configurations,
some partitions of the NVMe device could be used by guests as block devices
while others are passed through with nvme-mdev, to achieve a balance between
all the features of full IO stack emulation and performance.
-> NVME-MDEV is a bit faster due to the fact that the in-kernel driver
can send interrupts to the guest directly, without a context
switch that can be expensive due to meltdown mitigation.
-> It is able to utilize interrupts to get reasonable performance.
This is implemented only
as a proof of concept and not included in the patches,
but the interrupt-driven mode shows reasonable performance.
-> This is a framework that later can be used to support NVMe devices
with more of the IO virtualization built-in
(IOMMU with PASID support coupled with device that supports it)
* Why attach directly to the nvme-pci driver and not use block-layer IO?
-> The direct attachment allows for better performance, but I will
check the possibility of using block IO, especially for fabrics drivers.
*** Implementation notes ***
* All guest memory is mapped into the physical NVMe device,
but not 1:1 as vfio-pci would do it.
This allows very efficient DMA.
To support this, patch 2 adds the ability for an mdev device to listen for
the guest's memory map events.
Any such memory is immediately pinned and then DMA mapped.
(Support for fabric drivers where this is not possible exists too,
in which case the fabric driver will do its own DMA mapping.)
* nvme core driver is modified to announce the appearance
and disappearance of nvme controllers and namespaces,
to which the nvme-mdev driver is subscribed.
* The nvme-pci driver is modified to expose a raw interface for attaching to
and sending/polling the IO queues.
This allows the mdev driver to submit/poll for IO very efficiently.
By default one host queue is used per mediated device.
(Support for other fabric-based host drivers is planned.)
* nvme-mdev doesn't assume the presence of KVM, thus any VFIO user, including
SPDK, a QEMU running with TCG, ... can use this virtual device.
*** Testing ***
The device was tested with stock QEMU 3.0 on the host,
with the host using a 5.0 kernel with nvme-mdev added, and the following hardware:
* QEMU nvme virtual device (with nested guest)
* Intel DC P3700 on Xeon E5-2620 v2 server
* Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
* Lenovo NVME device found in my laptop
The guest was tested with kernels 4.16, 4.18, 4.20 and
the same custom-compiled 5.0 kernel.
Windows 10 guest was tested too with both Microsoft's inbox driver and
open source community NVME driver
(https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)
Testing was mostly done on x86_64, but 32 bit host/guest combination
was lightly tested too.
In addition to that, the virtual device was tested with a nested guest,
by passing the virtual device to it
using PCI passthrough, the QEMU userspace NVMe driver, and SPDK.
PS: I used to contribute to the kernel as a hobby using the
maximlevitsky@gmail.com address
Maxim Levitsky (9):
vfio/mdev: add .request callback
nvme/core: add some more values from the spec
nvme/core: add NVME_CTRL_SUSPENDED controller state
nvme/pci: use the NVME_CTRL_SUSPENDED state
nvme/pci: add known admin effects to augument admin effects log page
nvme/pci: init shadow doorbell after each reset
nvme/core: add mdev interfaces
nvme/core: add nvme-mdev core driver
nvme/pci: implement the mdev external queue allocation interface
MAINTAINERS | 5 +
drivers/nvme/Kconfig | 1 +
drivers/nvme/Makefile | 1 +
drivers/nvme/host/core.c | 149 +++++-
drivers/nvme/host/nvme.h | 55 ++-
drivers/nvme/host/pci.c | 385 ++++++++++++++-
drivers/nvme/mdev/Kconfig | 16 +
drivers/nvme/mdev/Makefile | 5 +
drivers/nvme/mdev/adm.c | 873 ++++++++++++++++++++++++++++++++++
drivers/nvme/mdev/events.c | 142 ++++++
drivers/nvme/mdev/host.c | 491 +++++++++++++++++++
drivers/nvme/mdev/instance.c | 802 +++++++++++++++++++++++++++++++
drivers/nvme/mdev/io.c | 563 ++++++++++++++++++++++
drivers/nvme/mdev/irq.c | 264 ++++++++++
drivers/nvme/mdev/mdev.h | 56 +++
drivers/nvme/mdev/mmio.c | 591 +++++++++++++++++++++++
drivers/nvme/mdev/pci.c | 247 ++++++++++
drivers/nvme/mdev/priv.h | 700 +++++++++++++++++++++++++++
drivers/nvme/mdev/udata.c | 390 +++++++++++++++
drivers/nvme/mdev/vcq.c | 207 ++++++++
drivers/nvme/mdev/vctrl.c | 514 ++++++++++++++++++++
drivers/nvme/mdev/viommu.c | 322 +++++++++++++
drivers/nvme/mdev/vns.c | 356 ++++++++++++++
drivers/nvme/mdev/vsq.c | 178 +++++++
drivers/vfio/mdev/vfio_mdev.c | 11 +
include/linux/mdev.h | 4 +
include/linux/nvme.h | 88 +++-
27 files changed, 7375 insertions(+), 41 deletions(-)
create mode 100644 drivers/nvme/mdev/Kconfig
create mode 100644 drivers/nvme/mdev/Makefile
create mode 100644 drivers/nvme/mdev/adm.c
create mode 100644 drivers/nvme/mdev/events.c
create mode 100644 drivers/nvme/mdev/host.c
create mode 100644 drivers/nvme/mdev/instance.c
create mode 100644 drivers/nvme/mdev/io.c
create mode 100644 drivers/nvme/mdev/irq.c
create mode 100644 drivers/nvme/mdev/mdev.h
create mode 100644 drivers/nvme/mdev/mmio.c
create mode 100644 drivers/nvme/mdev/pci.c
create mode 100644 drivers/nvme/mdev/priv.h
create mode 100644 drivers/nvme/mdev/udata.c
create mode 100644 drivers/nvme/mdev/vcq.c
create mode 100644 drivers/nvme/mdev/vctrl.c
create mode 100644 drivers/nvme/mdev/viommu.c
create mode 100644 drivers/nvme/mdev/vns.c
create mode 100644 drivers/nvme/mdev/vsq.c
--
2.17.2
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-19 14:41 (unknown) Maxim Levitsky
@ 2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08 ` Re: Maxim Levitsky
0 siblings, 1 reply; 35+ messages in thread
From: Felipe Franciosi @ 2019-03-20 11:03 UTC (permalink / raw)
To: Maxim Levitsky
Cc: linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, Jens Axboe, Alex Williamson, Keith Busch,
Christoph Hellwig, Sagi Grimberg, Kirti Wankhede,
David S . Miller, Mauro Carvalho Chehab, Greg Kroah-Hartman,
Wolfram Sang, Nicolas Ferre, Paul E . McKenney, Paolo Bonzini,
Liang Cunming, Liu Changpeng <changpeng.
> On Mar 19, 2019, at 2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
>
> Hi everyone!
>
> In this patch series, I would like to introduce my take on the problem of doing
> as fast as possible virtualization of storage with emphasis on low latency.
>
> In this patch series I implemented a kernel vfio based, mediated device that
> allows the user to pass through a partition and/or whole namespace to a guest.
Hey Maxim!
I'm really excited to see this series, as it aligns to some extent with what we discussed in last year's KVM Forum VFIO BoF.
There's no arguing that we need a better story to efficiently virtualise NVMe devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best attempt at that. However, I seem to recall there was some pushback from qemu-devel in the sense that they would rather see investment in virtio-blk. I'm not sure what's the latest on that work and what are the next steps.
The pushback drove the discussion towards pursuing an mdev approach, which is why I'm excited to see your patches.
What I'm thinking is that passing through namespaces or partitions is very restrictive. It leaves no room to implement more elaborate virtualisation stacks like replicating data across multiple devices (local or remote), storage migration, software-managed thin provisioning, encryption, deduplication, compression, etc. In summary, anything that requires software intervention in the datapath. (Worth noting: vhost-user-nvme allows all of that to be easily done in SPDK's bdev layer.)
These complicated stacks should probably not be implemented in the kernel, though. So I'm wondering whether we could talk about mechanisms to allow efficient and performant userspace datapath intervention in your approach or pursue a mechanism to completely offload the device emulation to userspace (and align with what SPDK has to offer).
Thoughts welcome!
Felipe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-20 11:03 ` Felipe Franciosi
@ 2019-03-20 19:08 ` Maxim Levitsky
2019-03-21 16:12 ` Re: Stefan Hajnoczi
0 siblings, 1 reply; 35+ messages in thread
From: Maxim Levitsky @ 2019-03-20 19:08 UTC (permalink / raw)
To: Felipe Franciosi
Cc: Fam Zheng, kvm@vger.kernel.org, Wolfram Sang,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
Keith Busch, Kirti Wankhede, Mauro Carvalho Chehab,
Paul E . McKenney, Christoph Hellwig, Sagi Grimberg,
Harris, James R, Liang Cunming, Jens Axboe, Alex Williamson,
Stefan Hajnoczi, Thanos Makatos, John Ferlan
On Wed, 2019-03-20 at 11:03 +0000, Felipe Franciosi wrote:
> > On Mar 19, 2019, at 2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> >
> > Hi everyone!
> >
> > In this patch series, I would like to introduce my take on the problem of
> > doing
> > as fast as possible virtualization of storage with emphasis on low latency.
> >
> > In this patch series I implemented a kernel vfio based, mediated device
> > that
> > allows the user to pass through a partition and/or whole namespace to a
> > guest.
>
> Hey Maxim!
>
> I'm really excited to see this series, as it aligns to some extent with what
> we discussed in last year's KVM Forum VFIO BoF.
>
> There's no arguing that we need a better story to efficiently virtualise NVMe
> devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best
> attempt at that. However, I seem to recall there was some pushback from qemu-
> devel in the sense that they would rather see investment in virtio-blk. I'm
> not sure what's the latest on that work and what are the next steps.
I agree with that. All my benchmarks were against his vhost-user-nvme driver, and
I am able to get pretty much the same throughput and latency.
The SSD I tested on died just recently (Murphy's law), not due to a bug in my driver
but to some internal fault (even though most of my tests were reads, plus
occasional 'nvme format's).
We are in the process of buying a replacement.
>
> The pushback drove the discussion towards pursuing an mdev approach, which is
> why I'm excited to see your patches.
>
> What I'm thinking is that passing through namespaces or partitions is very
> restrictive. It leaves no room to implement more elaborate virtualisation
> stacks like replicating data across multiple devices (local or remote),
> storage migration, software-managed thin provisioning, encryption,
> deduplication, compression, etc. In summary, anything that requires software
> intervention in the datapath. (Worth noting: vhost-user-nvme allows all of
> that to be easily done in SPDK's bdev layer.)
Hi Felipe!
I guess that my driver is not geared toward more complicated use cases like you
mentioned, but instead is focused on getting the best possible performance for
the common case.
One thing that I can do, which would solve several of the above problems, is to
accept a map between virtual and real logical blocks, pretty much in exactly
the same way as EPT does it.
Then userspace can map any portions of the device anywhere, while still keeping
the dataplane in the kernel and having minimal overhead.
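A minimal sketch of what such a guest-to-host block map could look like; this is purely illustrative and not part of the posted patches (the names, structures and linear lookup are made up here; a real implementation would use a proper range tree and locking):

#include <linux/types.h>

/* One contiguous extent of guest LBAs mapped onto host LBAs. */
struct lba_extent {
	u64 guest_start;   /* first guest LBA covered by this extent */
	u64 host_start;    /* corresponding host LBA */
	u64 nr_blocks;     /* length of the extent in logical blocks */
};

/* Userspace would install a table of extents; the kernel datapath
 * translates each submitted command before sending it to the device. */
struct lba_map {
	struct lba_extent *extents;   /* sorted by guest_start */
	unsigned int nr_extents;
};

/* Translate one guest LBA; returns true and fills *host_lba on success. */
static bool lba_map_translate(const struct lba_map *map, u64 guest_lba,
			      u64 *host_lba)
{
	unsigned int i;

	for (i = 0; i < map->nr_extents; i++) {
		const struct lba_extent *e = &map->extents[i];

		if (guest_lba >= e->guest_start &&
		    guest_lba < e->guest_start + e->nr_blocks) {
			*host_lba = e->host_start + (guest_lba - e->guest_start);
			return true;
		}
	}
	return false;   /* unmapped: fail the IO back to the guest */
}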
On top of that, note that the direction of IO virtualization is to do the dataplane
in hardware, which will probably give you even worse partition granularity /
features but will be the fastest option available,
like for instance SR-IOV, which already exists and just allows splitting by
namespaces without any more fine-grained control.
Think of nvme-mdev as a very low-level driver, which currently uses polling, but
eventually will use a PASID-based IOMMU to provide the guest with a raw PCI device.
Userspace / QEMU can build on top of that with various software layers.
On top of that, I am thinking of solving the problem of migration in QEMU by
creating a 'vfio-nvme' driver which would bind VFIO to the device exposed by
the kernel, and would pass all the doorbells and queues through to the guest
while intercepting the admin queue. Such a driver, I think, can be made to support
migration while being able to run on top of an SR-IOV device, on top of my
nvme-mdev (albeit with double admin queue emulation, which is a bit ugly but won't
affect performance at all), and even on top of a regular NVMe device assigned
to the guest with VFIO.
Best regards,
Maxim Levitsky
>
> These complicated stacks should probably not be implemented in the kernel,
> though. So I'm wondering whether we could talk about mechanisms to allow
> efficient and performant userspace datapath intervention in your approach or
> pursue a mechanism to completely offload the device emulation to userspace
> (and align with what SPDK has to offer).
>
> Thoughts welcome!
> Felipe
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-20 19:08 ` Re: Maxim Levitsky
@ 2019-03-21 16:12 ` Stefan Hajnoczi
2019-03-21 16:21 ` Re: Keith Busch
0 siblings, 1 reply; 35+ messages in thread
From: Stefan Hajnoczi @ 2019-03-21 16:12 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Felipe Franciosi, Fam Zheng, kvm@vger.kernel.org, Wolfram Sang,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
Keith Busch, Kirti Wankhede, Mauro Carvalho Chehab,
Paul E . McKenney, Christoph Hellwig, Sagi Grimberg,
Harris, James R, Liang Cunming, Jens Axboe, Alex Williamson,
Thanos Makatos, John Ferlan, Liu
[-- Attachment #1: Type: text/plain, Size: 4404 bytes --]
On Wed, Mar 20, 2019 at 09:08:37PM +0200, Maxim Levitsky wrote:
> On Wed, 2019-03-20 at 11:03 +0000, Felipe Franciosi wrote:
> > > On Mar 19, 2019, at 2:41 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > >
> > > Date: Tue, 19 Mar 2019 14:45:45 +0200
> > > Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> > >
> > > Hi everyone!
> > >
> > > In this patch series, I would like to introduce my take on the problem of
> > > doing
> > > as fast as possible virtualization of storage with emphasis on low latency.
> > >
> > > In this patch series I implemented a kernel vfio based, mediated device
> > > that
> > > allows the user to pass through a partition and/or whole namespace to a
> > > guest.
> >
> > Hey Maxim!
> >
> > I'm really excited to see this series, as it aligns to some extent with what
> > we discussed in last year's KVM Forum VFIO BoF.
> >
> > There's no arguing that we need a better story to efficiently virtualise NVMe
> > devices. So far, for Qemu-based VMs, Changpeng's vhost-user-nvme is the best
> > attempt at that. However, I seem to recall there was some pushback from qemu-
> > devel in the sense that they would rather see investment in virtio-blk. I'm
> > not sure what's the latest on that work and what are the next steps.
> I agree with that. All my benchmarks were agains his vhost-user-nvme driver, and
> I am able to get pretty much the same througput and latency.
>
> The ssd I tested on died just recently (Murphy law), not due to bug in my driver
> but some internal fault (even though most of my tests were reads, plus
> occassional 'nvme format's.
> We are in process of buying an replacement.
>
> >
> > The pushback drove the discussion towards pursuing an mdev approach, which is
> > why I'm excited to see your patches.
> >
> > What I'm thinking is that passing through namespaces or partitions is very
> > restrictive. It leaves no room to implement more elaborate virtualisation
> > stacks like replicating data across multiple devices (local or remote),
> > storage migration, software-managed thin provisioning, encryption,
> > deduplication, compression, etc. In summary, anything that requires software
> > intervention in the datapath. (Worth noting: vhost-user-nvme allows all of
> > that to be easily done in SPDK's bdev layer.)
>
> Hi Felipe!
>
> I guess that my driver is not geared toward more complicated use cases like you
> mentioned, but instead it is focused to get as fast as possible performance for
> the common case.
>
> One thing that I can do which would solve several of the above problems is to
> accept an map betwent virtual and real logical blocks, pretty much in exactly
> the same way as EPT does it.
> Then userspace can map any portions of the device anywhere, while still keeping
> the dataplane in the kernel, and having minimal overhead.
>
> On top of that, note that the direction of IO virtualization is to do dataplane
> in hardware, which will probably give you even worse partition granuality /
> features but will be the fastest option aviable,
> like for instance SR-IOV which alrady exists and just allows to split by
> namespaces without any more fine grained control.
>
> Think of nvme-mdev as a very low level driver, which currntly uses polling, but
> eventually will use PASID based IOMMU to provide the guest with raw PCI device.
> The userspace / qemu can build on top of that with varios software layers.
>
> On top of that I am thinking to solve the problem of migration in Qemu, by
> creating a 'vfio-nvme' driver which would bind vfio to bind to device exposed by
> the kernel, and would pass through all the doorbells and queues to the guest,
> while intercepting the admin queue. Such driver I think can be made to support
> migration while beeing able to run on top both SR-IOV device, my vfio-nvme abit
> with double admin queue emulation (its a bit ugly but won't affect performance
> at all) and on top of even regular NVME device vfio assigned to guest.
mdev-nvme seems like a duplication of SPDK. The performance is not
better and the features are more limited, so why focus on this approach?
One argument might be that the kernel NVMe subsystem wants to offer this
functionality and loading the kernel module is more convenient than
managing SPDK for some users.
Thoughts?
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-21 16:12 ` Re: Stefan Hajnoczi
@ 2019-03-21 16:21 ` Keith Busch
2019-03-21 16:41 ` Re: Felipe Franciosi
0 siblings, 1 reply; 35+ messages in thread
From: Keith Busch @ 2019-03-21 16:21 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Maxim Levitsky, Fam Zheng, kvm@vger.kernel.org, Wolfram Sang,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
Keith Busch, Kirti Wankhede, Mauro Carvalho Chehab,
Paul E . McKenney, Christoph Hellwig, Sagi Grimberg,
Harris, James R, Felipe Franciosi, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos, Jo
On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
> mdev-nvme seems like a duplication of SPDK. The performance is not
> better and the features are more limited, so why focus on this approach?
>
> One argument might be that the kernel NVMe subsystem wants to offer this
> functionality and loading the kernel module is more convenient than
> managing SPDK to some users.
>
> Thoughts?
Doesn't SPDK bind a controller to a single process? mdev binds to
namespaces (or their partitions), so you could have many mdev's assigned
to many VMs accessing a single controller.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-21 16:21 ` Re: Keith Busch
@ 2019-03-21 16:41 ` Felipe Franciosi
2019-03-21 17:04 ` Re: Maxim Levitsky
0 siblings, 1 reply; 35+ messages in thread
From: Felipe Franciosi @ 2019-03-21 16:41 UTC (permalink / raw)
To: Keith Busch
Cc: Stefan Hajnoczi, Maxim Levitsky, Fam Zheng, kvm@vger.kernel.org,
Wolfram Sang, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, Keith Busch, Kirti Wankhede,
Mauro Carvalho Chehab, Paul E . McKenney, Christoph Hellwig,
Sagi Grimberg, Harris, James R, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos
> On Mar 21, 2019, at 4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
>> mdev-nvme seems like a duplication of SPDK. The performance is not
>> better and the features are more limited, so why focus on this approach?
>>
>> One argument might be that the kernel NVMe subsystem wants to offer this
>> functionality and loading the kernel module is more convenient than
>> managing SPDK to some users.
>>
>> Thoughts?
>
> Doesn't SPDK bind a controller to a single process? mdev binds to
> namespaces (or their partitions), so you could have many mdev's assigned
> to many VMs accessing a single controller.
Yes, it binds to a single process which can drive the datapath of multiple virtual controllers for multiple VMs (similar to what you described for mdev). You can therefore efficiently poll multiple VM submission queues (and multiple device completion queues) from a single physical CPU.
The same could be done in the kernel, but the code gets complicated as you add more functionality to it. As this is a direct interface with an untrusted front-end (the guest), it's also arguably safer to do in userspace.
Worth noting: you can eventually have a single physical core polling all sorts of virtual devices (eg. virtual storage or network controllers) very efficiently. And this is quite configurable, too. In the interest of fairness, performance or efficiency, you can choose to dynamically add or remove queues to the poll thread or spawn more threads and redistribute the work.
F.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-21 16:41 ` Re: Felipe Franciosi
@ 2019-03-21 17:04 ` Maxim Levitsky
2019-03-22 7:54 ` Re: Felipe Franciosi
0 siblings, 1 reply; 35+ messages in thread
From: Maxim Levitsky @ 2019-03-21 17:04 UTC (permalink / raw)
To: Felipe Franciosi, Keith Busch
Cc: Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org, Wolfram Sang,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
Keith Busch, Kirti Wankhede, Mauro Carvalho Chehab,
Paul E . McKenney, Christoph Hellwig, Sagi Grimberg,
Harris, James R, Liang Cunming, Jens Axboe, Alex Williamson,
Thanos Makatos, John Ferlan, Liu
On Thu, 2019-03-21 at 16:41 +0000, Felipe Franciosi wrote:
> > On Mar 21, 2019, at 4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
> > > mdev-nvme seems like a duplication of SPDK. The performance is not
> > > better and the features are more limited, so why focus on this approach?
> > >
> > > One argument might be that the kernel NVMe subsystem wants to offer this
> > > functionality and loading the kernel module is more convenient than
> > > managing SPDK to some users.
> > >
> > > Thoughts?
> >
> > Doesn't SPDK bind a controller to a single process? mdev binds to
> > namespaces (or their partitions), so you could have many mdev's assigned
> > to many VMs accessing a single controller.
>
> Yes, it binds to a single process which can drive the datapath of multiple
> virtual controllers for multiple VMs (similar to what you described for mdev).
> You can therefore efficiently poll multiple VM submission queues (and multiple
> device completion queues) from a single physical CPU.
>
> The same could be done in the kernel, but the code gets complicated as you add
> more functionality to it. As this is a direct interface with an untrusted
> front-end (the guest), it's also arguably safer to do in userspace.
>
> Worth noting: you can eventually have a single physical core polling all sorts
> of virtual devices (eg. virtual storage or network controllers) very
> efficiently. And this is quite configurable, too. In the interest of fairness,
> performance or efficiency, you can choose to dynamically add or remove queues
> to the poll thread or spawn more threads and redistribute the work.
>
> F.
Note though that SPDK doesn't support sharing the device between the host and the
guests; it takes over the NVMe device, thus making the kernel nvme driver
unbind from it.
My driver creates a polling thread per guest, but it is trivial to add an option to
use the same polling thread for many guests if there is a need for that.
Best regards,
Maxim Levitsky
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-21 17:04 ` Re: Maxim Levitsky
@ 2019-03-22 7:54 ` Felipe Franciosi
2019-03-22 10:32 ` Re: Maxim Levitsky
2019-03-22 15:30 ` Re: Keith Busch
0 siblings, 2 replies; 35+ messages in thread
From: Felipe Franciosi @ 2019-03-22 7:54 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Keith Busch, Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org,
Wolfram Sang, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, Keith Busch, Kirti Wankhede,
Mauro Carvalho Chehab, Paul E . McKenney, Christoph Hellwig,
Sagi Grimberg, Harris, James R, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos
> On Mar 21, 2019, at 5:04 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Thu, 2019-03-21 at 16:41 +0000, Felipe Franciosi wrote:
>>> On Mar 21, 2019, at 4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
>>>
>>> On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
>>>> mdev-nvme seems like a duplication of SPDK. The performance is not
>>>> better and the features are more limited, so why focus on this approach?
>>>>
>>>> One argument might be that the kernel NVMe subsystem wants to offer this
>>>> functionality and loading the kernel module is more convenient than
>>>> managing SPDK to some users.
>>>>
>>>> Thoughts?
>>>
>>> Doesn't SPDK bind a controller to a single process? mdev binds to
>>> namespaces (or their partitions), so you could have many mdev's assigned
>>> to many VMs accessing a single controller.
>>
>> Yes, it binds to a single process which can drive the datapath of multiple
>> virtual controllers for multiple VMs (similar to what you described for mdev).
>> You can therefore efficiently poll multiple VM submission queues (and multiple
>> device completion queues) from a single physical CPU.
>>
>> The same could be done in the kernel, but the code gets complicated as you add
>> more functionality to it. As this is a direct interface with an untrusted
>> front-end (the guest), it's also arguably safer to do in userspace.
>>
>> Worth noting: you can eventually have a single physical core polling all sorts
>> of virtual devices (eg. virtual storage or network controllers) very
>> efficiently. And this is quite configurable, too. In the interest of fairness,
>> performance or efficiency, you can choose to dynamically add or remove queues
>> to the poll thread or spawn more threads and redistribute the work.
>>
>> F.
>
> Note though that SPDK doesn't support sharing the device between host and the
> guests, it takes over the nvme device, thus it makes the kernel nvme driver
> unbind from it.
That is absolutely true. However, I find it not to be a problem in practice.
Hypervisor products, especially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
For scenarios where the device must be shared and such fine-grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
Cheers,
Felipe
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-22 7:54 ` Re: Felipe Franciosi
@ 2019-03-22 10:32 ` Maxim Levitsky
2019-03-22 15:30 ` Re: Keith Busch
1 sibling, 0 replies; 35+ messages in thread
From: Maxim Levitsky @ 2019-03-22 10:32 UTC (permalink / raw)
To: Felipe Franciosi
Cc: Keith Busch, Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org,
Wolfram Sang, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, Keith Busch, Kirti Wankhede,
Mauro Carvalho Chehab, Paul E . McKenney, Christoph Hellwig,
Sagi Grimberg, Harris, James R, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos
On Fri, 2019-03-22 at 07:54 +0000, Felipe Franciosi wrote:
> > On Mar 21, 2019, at 5:04 PM, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > On Thu, 2019-03-21 at 16:41 +0000, Felipe Franciosi wrote:
> > > > On Mar 21, 2019, at 4:21 PM, Keith Busch <kbusch@kernel.org> wrote:
> > > >
> > > > On Thu, Mar 21, 2019 at 04:12:39PM +0000, Stefan Hajnoczi wrote:
> > > > > mdev-nvme seems like a duplication of SPDK. The performance is not
> > > > > better and the features are more limited, so why focus on this
> > > > > approach?
> > > > >
> > > > > One argument might be that the kernel NVMe subsystem wants to offer
> > > > > this
> > > > > functionality and loading the kernel module is more convenient than
> > > > > managing SPDK to some users.
> > > > >
> > > > > Thoughts?
> > > >
> > > > Doesn't SPDK bind a controller to a single process? mdev binds to
> > > > namespaces (or their partitions), so you could have many mdev's assigned
> > > > to many VMs accessing a single controller.
> > >
> > > Yes, it binds to a single process which can drive the datapath of multiple
> > > virtual controllers for multiple VMs (similar to what you described for
> > > mdev).
> > > You can therefore efficiently poll multiple VM submission queues (and
> > > multiple
> > > device completion queues) from a single physical CPU.
> > >
> > > The same could be done in the kernel, but the code gets complicated as you
> > > add
> > > more functionality to it. As this is a direct interface with an untrusted
> > > front-end (the guest), it's also arguably safer to do in userspace.
> > >
> > > Worth noting: you can eventually have a single physical core polling all
> > > sorts
> > > of virtual devices (eg. virtual storage or network controllers) very
> > > efficiently. And this is quite configurable, too. In the interest of
> > > fairness,
> > > performance or efficiency, you can choose to dynamically add or remove
> > > queues
> > > to the poll thread or spawn more threads and redistribute the work.
> > >
> > > F.
> >
> > Note though that SPDK doesn't support sharing the device between host and
> > the
> > guests, it takes over the nvme device, thus it makes the kernel nvme driver
> > unbind from it.
>
> That is absolutely true. However, I find it not to be a problem in practice.
>
> Hypervisor products, specially those caring about performance, efficiency and
> fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk
> storage, cache, metadata) and will not share these devices for other use
> cases. That's because these products want to deterministically control the
> performance aspects of the device, which you just cannot do if you are sharing
> the device with a subsystem you do not control.
>
> For scenarios where the device must be shared and such fine grained control is
> not required, it looks like using the kernel driver with io_uring offers very
> good performance with flexibility
I see the host/guest partition in the following way:
The guest-assigned partitions are for guests that need the lowest possible latency,
and between these guests it is possible to guarantee a good enough level of
fairness in my driver.
For example, in the current implementation of my driver, each guest gets its own
host submission queue.
On the other hand, the host-assigned partitions are for significantly higher
latency IO, with no guarantees, and/or for guests that need all the more
advanced features of full IO virtualization, for instance snapshots, thin
provisioning, replication/backup over the network, etc.
io_uring can be used here to speed things up, but it won't reach the nvme-mdev
levels of latency.
Furthermore, on NVMe drives that support WRRU, it is possible to let the queues of
guest-assigned partitions belong to the high-priority class and let the
host queues use the regular medium/low-priority class.
For drives that don't support WRRU, the IO throttling can be done in software on
the host queues.
Host-assigned partitions also don't need polling, thus allowing polling to be
used only for guests that actually need low-latency IO.
This reduces the number of cores that would otherwise be lost to polling,
because the less work the polling core does, the less it contributes to
overall latency; thus with fewer users, you can use fewer cores to achieve the
same levels of latency.
For Stefan's argument, we can look at it in a slightly different way too:
while nvme-mdev can be seen as a duplication of SPDK, SPDK can also be
seen as a duplication of existing kernel functionality, which nvme-mdev can
reuse for free.
Best regards,
Maxim Levitsky
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-22 7:54 ` Re: Felipe Franciosi
2019-03-22 10:32 ` Re: Maxim Levitsky
@ 2019-03-22 15:30 ` Keith Busch
2019-03-25 15:44 ` Re: Felipe Franciosi
1 sibling, 1 reply; 35+ messages in thread
From: Keith Busch @ 2019-03-22 15:30 UTC (permalink / raw)
To: Felipe Franciosi
Cc: Maxim Levitsky, Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org,
Wolfram Sang, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, Keith Busch, Kirti Wankhede,
Mauro Carvalho Chehab, Paul E . McKenney, Christoph Hellwig,
Sagi Grimberg, Harris, James R, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos
On Fri, Mar 22, 2019 at 07:54:50AM +0000, Felipe Franciosi wrote:
> >
> > Note though that SPDK doesn't support sharing the device between host and the
> > guests, it takes over the nvme device, thus it makes the kernel nvme driver
> > unbind from it.
>
> That is absolutely true. However, I find it not to be a problem in practice.
>
> Hypervisor products, specially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
I don't know, it sounds like you've traded kernel syscalls for IPC,
and I don't think one performs better than the other.
> For scenarios where the device must be shared and such fine grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
NVMe's IO Determinism features provide fine grained control for shared
devices. It's still uncommon to find hardware supporting that, though.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2019-03-22 15:30 ` Re: Keith Busch
@ 2019-03-25 15:44 ` Felipe Franciosi
0 siblings, 0 replies; 35+ messages in thread
From: Felipe Franciosi @ 2019-03-25 15:44 UTC (permalink / raw)
To: Keith Busch
Cc: Maxim Levitsky, Stefan Hajnoczi, Fam Zheng, kvm@vger.kernel.org,
Wolfram Sang, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, Keith Busch, Kirti Wankhede,
Mauro Carvalho Chehab, Paul E . McKenney, Christoph Hellwig,
Sagi Grimberg, Harris, James R, Liang Cunming, Jens Axboe,
Alex Williamson, Thanos Makatos
Hi Keith,
> On Mar 22, 2019, at 3:30 PM, Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Mar 22, 2019 at 07:54:50AM +0000, Felipe Franciosi wrote:
>>>
>>> Note though that SPDK doesn't support sharing the device between host and the
>>> guests, it takes over the nvme device, thus it makes the kernel nvme driver
>>> unbind from it.
>>
>> That is absolutely true. However, I find it not to be a problem in practice.
>>
>> Hypervisor products, specially those caring about performance, efficiency and fairness, will dedicate NVMe devices for a particular purpose (eg. vDisk storage, cache, metadata) and will not share these devices for other use cases. That's because these products want to deterministically control the performance aspects of the device, which you just cannot do if you are sharing the device with a subsystem you do not control.
>
> I don't know, it sounds like you've traded kernel syscalls for IPC,
> and I don't think one performs better than the other.
Sorry, I'm not sure I understand. My point is that if you are packaging a distro to be a hypervisor and you want to use a storage device for VM data, you _most likely_ won't be using that device for anything else. To that end, driving the device directly from your application definitely gives you more deterministic control.
>
>> For scenarios where the device must be shared and such fine grained control is not required, it looks like using the kernel driver with io_uring offers very good performance with flexibility.
>
> NVMe's IO Determinism features provide fine grained control for shared
> devices. It's still uncommon to find hardware supporting that, though.
Sure, but then your hypervisor needs to certify devices that support that. This will limit your HCL. Moreover, unless the feature is solid, well-established and works reliably on all devices you support, it's arguably preferable to have an architecture which gives you that control in software.
Cheers,
Felipe
^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <CAMj-D2DO_CfvD77izsGfggoKP45HSC9aD6auUPAYC9Yeq_aX7w@mail.gmail.com>]
* Re:
[not found] <CAMj-D2DO_CfvD77izsGfggoKP45HSC9aD6auUPAYC9Yeq_aX7w@mail.gmail.com>
@ 2017-05-04 16:44 ` gengdongjiu
0 siblings, 0 replies; 35+ messages in thread
From: gengdongjiu @ 2017-05-04 16:44 UTC (permalink / raw)
To: mtsirkin, kvm, Tyler Baicar, qemu-devel, Xiongfeng Wang, ben,
linux, kvmarm, huangshaoyu, lersek, songwenjun, wuquanming,
Marc Zyngier, qemu-arm, imammedo, linux-arm-kernel,
Ard Biesheuvel, pbonzini, James Morse
Dear James,
Thanks a lot for your review and comments. I am very sorry for the
late response.
2017-05-04 23:42 GMT+08:00 gengdongjiu <gengdj.1984@gmail.com>:
> Hi Dongjiu Geng,
>
> On 30/04/17 06:37, Dongjiu Geng wrote:
>> when happen SEA, deliver signal bus and handle the ioctl that
>> inject SEA abort to guest, so that guest can handle the SEA error.
>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 105b6ab..a96594f 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -20,8 +20,10 @@
>> @@ -1238,6 +1240,36 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>> __coherent_cache_guest_page(vcpu, pfn, size);
>> }
>>
>> +static void kvm_send_signal(unsigned long address, bool hugetlb, bool hwpoison)
>> +{
>> + siginfo_t info;
>> +
>> + info.si_signo = SIGBUS;
>> + info.si_errno = 0;
>> + if (hwpoison)
>> + info.si_code = BUS_MCEERR_AR;
>> + else
>> + info.si_code = 0;
>> +
>> + info.si_addr = (void __user *)address;
>> + if (hugetlb)
>> + info.si_addr_lsb = PMD_SHIFT;
>> + else
>> + info.si_addr_lsb = PAGE_SHIFT;
>> +
>> + send_sig_info(SIGBUS, &info, current);
>> +}
>> +
>
> Punit reviewed the other version of this patch, this PMD_SHIFT is not the right
> thing to do, it needs a more accurate set of calls and shifts as there may be
> hugetlbfs pages other than PMD_SIZE.
>
> https://www.spinics.net/lists/arm-kernel/msg568919.html
>
> I haven't posted a new version of that patch because I was still hunting a bug
> in the hugepage/hwpoison code, even with Punit's fixes series I see -EFAULT
> returned to userspace instead of this hwpoison code being invoked.
Ok, got it, thanks for your information.
>
> Please avoid duplicating functionality between patches, it wastes reviewers
> time, especially when we know there are problems with this approach.
>
>
>> +static void kvm_handle_bad_page(unsigned long address,
>> + bool hugetlb, bool hwpoison)
>> +{
>> + /* handle both hwpoison and other synchronous external Abort */
>> + if (hwpoison)
>> + kvm_send_signal(address, hugetlb, true);
>> + else
>> + kvm_send_signal(address, hugetlb, false);
>> +}
>
> Why the extra level of indirection? We only want to signal userspace like this
> from KVM for hwpoison. Signals for RAS related reasons should come from the bits
> of the kernel that decoded the error.
For SEA, there are mainly two types:
0b010000 Synchronous External Abort on memory access.
0b0101xx Synchronous External Abort on page table walk. DFSC[1:0]
encode the level.
hwpoison should belong to "Synchronous External Abort on memory access".
If the SEA type is not hwpoison, such as a page table walk, do you mean
KVM should not deliver the SIGBUS?
If so, how should KVM handle SEA types other than hwpoison?
>
> (hwpoison for KVM is a corner case as Qemu's memory effectively has two users,
> Qemu and KVM. This isn't the example of how user-space gets signalled.)
>
>
>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>> index b37446a..780e3c4 100644
>> --- a/arch/arm64/kvm/guest.c
>> +++ b/arch/arm64/kvm/guest.c
>> @@ -277,6 +277,13 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>> return -EINVAL;
>> }
>>
>> +int kvm_vcpu_ioctl_sea(struct kvm_vcpu *vcpu)
>> +{
>> + kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
>> +
>> + return 0;
>> +}
>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index bb02909..1d2e2e7 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1306,6 +1306,7 @@ struct kvm_s390_ucas_mapping {
>> #define KVM_S390_GET_IRQ_STATE _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
>> /* Available with KVM_CAP_X86_SMM */
>> #define KVM_SMI _IO(KVMIO, 0xb7)
>> +#define KVM_ARM_SEA _IO(KVMIO, 0xb8)
>>
>> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
>> #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
>>
>
> Why do we need a userspace API for SEA? It can also be done by using
> KVM_{G,S}ET_ONE_REG to change the vcpu registers. The advantage of doing it this
> way is you can choose which ESR value to use.
>
> Adding a new API call to do something you could do with an old one doesn't look
> right.
James, I considered your earlier suggestion to use
KVM_{G,S}ET_ONE_REG to change the vcpu registers, but I found it makes
no difference compared to using the already existing KVM API, so
changing the vcpu registers in QEMU may duplicate what the KVM APIs do.
Injecting a SEA is no more than setting some registers: elr_el1, PC,
PSTATE, SPSR_el1, far_el1, esr_el1.
I have seen this KVM API do the same thing as QEMU. Did you find that calling
this API has an issue and that it is necessary to choose another ESR value?
I pasted the already existing KVM API code:
static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr)
{
        unsigned long cpsr = *vcpu_cpsr(vcpu);
        bool is_aarch32 = vcpu_mode_is_32bit(vcpu);
        u32 esr = 0;

        *vcpu_elr_el1(vcpu) = *vcpu_pc(vcpu);
        *vcpu_pc(vcpu) = get_except_vector(vcpu, except_type_sync);

        *vcpu_cpsr(vcpu) = PSTATE_FAULT_BITS_64;
        *vcpu_spsr(vcpu) = cpsr;

        vcpu_sys_reg(vcpu, FAR_EL1) = addr;

        /*
         * Build an {i,d}abort, depending on the level and the
         * instruction set. Report an external synchronous abort.
         */
        if (kvm_vcpu_trap_il_is32bit(vcpu))
                esr |= ESR_ELx_IL;

        /*
         * Here, the guest runs in AArch64 mode when in EL1. If we get
         * an AArch32 fault, it means we managed to trap an EL0 fault.
         */
        if (is_aarch32 || (cpsr & PSR_MODE_MASK) == PSR_MODE_EL0t)
                esr |= (ESR_ELx_EC_IABT_LOW << ESR_ELx_EC_SHIFT);
        else
                esr |= (ESR_ELx_EC_IABT_CUR << ESR_ELx_EC_SHIFT);

        if (!is_iabt)
                esr |= ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT;

        vcpu_sys_reg(vcpu, ESR_EL1) = esr | ESR_ELx_FSC_EXTABT;
}

static void inject_abt32(struct kvm_vcpu *vcpu, bool is_pabt, unsigned long addr)
{
        u32 vect_offset;
        u32 *far, *fsr;
        bool is_lpae;

        if (is_pabt) {
                vect_offset = 12;
                far = &vcpu_cp15(vcpu, c6_IFAR);
                fsr = &vcpu_cp15(vcpu, c5_IFSR);
        } else { /* !iabt */
                vect_offset = 16;
                far = &vcpu_cp15(vcpu, c6_DFAR);
                fsr = &vcpu_cp15(vcpu, c5_DFSR);
        }

        prepare_fault32(vcpu, COMPAT_PSR_MODE_ABT | COMPAT_PSR_A_BIT, vect_offset);

        *far = addr;

        /* Give the guest an IMPLEMENTATION DEFINED exception */
        is_lpae = (vcpu_cp15(vcpu, c2_TTBCR) >> 31);
        if (is_lpae)
                *fsr = 1 << 9 | 0x34;
        else
                *fsr = 0x14;
}

/**
 * kvm_inject_dabt - inject a data abort into the guest
 * @vcpu: The VCPU to receive the undefined exception
 * @addr: The address to report in the DFAR
 *
 * It is assumed that this code is called from the VCPU thread and that the
 * VCPU therefore is not currently executing guest code.
 */
void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr)
{
        if (!(vcpu->arch.hcr_el2 & HCR_RW))
                inject_abt32(vcpu, false, addr);
        else
                inject_abt64(vcpu, false, addr);
}
>
>
> Thanks,
>
> James
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <D0613EBE33E8FD439137DAA95CCF59555B7A5A4D@MGCCCMAIL2010-5.mgccc.cc.ms.us>]
* RE:
[not found] <D0613EBE33E8FD439137DAA95CCF59555B7A5A4D@MGCCCMAIL2010-5.mgccc.cc.ms.us>
@ 2015-11-24 13:21 ` Amis, Ryann
0 siblings, 0 replies; 35+ messages in thread
From: Amis, Ryann @ 2015-11-24 13:21 UTC (permalink / raw)
To: MGCCC Helpdesk
Our new web mail has been improved with a new messaging system from Owa/outlook which also include faster usage on email, shared calendar, web-documents and the new 2015 anti-spam version. Please use the link below to complete your update for our new Owa/outlook improved web mail. CLICK HERE<https://formcrafts.com/a/15851> to update or Copy and pest the Link to your Browser: http://bit.ly/1Xo5Vd4
Thanks,
ITC Administrator.
-----------------------------------------
The information contained in this e-mail message is intended only for the personal and confidential use of the recipient(s) named above. This message may be an attorney-client communication and/or work product and as such is privileged and confidential. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you are hereby notified that you have received this document in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail, and delete the original message.
^ permalink raw reply [flat|nested] 35+ messages in thread
* (unknown),
@ 2014-04-13 21:01 Marcus White
2014-04-15 0:59 ` Marcus White
0 siblings, 1 reply; 35+ messages in thread
From: Marcus White @ 2014-04-13 21:01 UTC (permalink / raw)
To: kvm
Hello,
I had some basic questions regarding KVM, and would appreciate any help:)
I have been reading about the KVM architecture, and as I understand
it, the guest shows up as a regular process in the host itself..
I had some questions around that..
1. Are the guest processes implemented as a control group within the
overall VM process itself? Is the VM a kernel process or a user
process?
2. Is there a way for me to force some specific CPU/s to a guest, and
those CPUs to be not used for any work on the host itself? Pinning is
just making sure the vCPU runs on the same physical CPU always, I am
looking for something more than that..
3. If the host is compiled as a non pre-emptible kernel, kernel
process run to completion until they give up the CPU themselves. In
the context of a guest, I am trying to understand what that would mean
in the context of KVM and guest VMs. If the VM is a user process, it
means nothing, I wasnt sure as per (1).
Cheers!
M
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2014-04-13 21:01 (unknown), Marcus White
@ 2014-04-15 0:59 ` Marcus White
2014-04-16 21:17 ` Re: Marcelo Tosatti
0 siblings, 1 reply; 35+ messages in thread
From: Marcus White @ 2014-04-15 0:59 UTC (permalink / raw)
To: kvm
Hello,
A friendly bump to see if anyone has any ideas:-)
Cheers!
On Sun, Apr 13, 2014 at 2:01 PM, Marcus White
<roastedseaweed.k@gmail.com> wrote:
> Hello,
> I had some basic questions regarding KVM, and would appreciate any help:)
>
> I have been reading about the KVM architecture, and as I understand
> it, the guest shows up as a regular process in the host itself..
>
> I had some questions around that..
>
> 1. Are the guest processes implemented as a control group within the
> overall VM process itself? Is the VM a kernel process or a user
> process?
>
> 2. Is there a way for me to force some specific CPU/s to a guest, and
> those CPUs to be not used for any work on the host itself? Pinning is
> just making sure the vCPU runs on the same physical CPU always, I am
> looking for something more than that..
>
> 3. If the host is compiled as a non pre-emptible kernel, kernel
> process run to completion until they give up the CPU themselves. In
> the context of a guest, I am trying to understand what that would mean
> in the context of KVM and guest VMs. If the VM is a user process, it
> means nothing, I wasnt sure as per (1).
>
> Cheers!
> M
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2014-04-15 0:59 ` Marcus White
@ 2014-04-16 21:17 ` Marcelo Tosatti
2014-04-17 21:33 ` Re: Marcus White
0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2014-04-16 21:17 UTC (permalink / raw)
To: Marcus White; +Cc: kvm
On Mon, Apr 14, 2014 at 05:59:05PM -0700, Marcus White wrote:
> Hello,
> A friendly bump to see if anyone has any ideas:-)
>
> Cheers!
>
> On Sun, Apr 13, 2014 at 2:01 PM, Marcus White
> <roastedseaweed.k@gmail.com> wrote:
> > Hello,
> > I had some basic questions regarding KVM, and would appreciate any help:)
> >
> > I have been reading about the KVM architecture, and as I understand
> > it, the guest shows up as a regular process in the host itself..
> >
> > I had some questions around that..
> >
> > 1. Are the guest processes implemented as a control group within the
> > overall VM process itself? Is the VM a kernel process or a user
> > process?
User process.
> > 2. Is there a way for me to force some specific CPU/s to a guest, and
> > those CPUs to be not used for any work on the host itself? Pinning is
> > just making sure the vCPU runs on the same physical CPU always, I am
> > looking for something more than that..
Control groups.
> > 3. If the host is compiled as a non pre-emptible kernel, kernel
> > process run to completion until they give up the CPU themselves. In
> > the context of a guest, I am trying to understand what that would mean
> > in the context of KVM and guest VMs. If the VM is a user process, it
> > means nothing, I wasnt sure as per (1).
What problem are you trying to solve?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2014-04-16 21:17 ` Re: Marcelo Tosatti
@ 2014-04-17 21:33 ` Marcus White
2014-04-21 21:49 ` Re: Marcelo Tosatti
0 siblings, 1 reply; 35+ messages in thread
From: Marcus White @ 2014-04-17 21:33 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm
>> > Hello,
>> > I had some basic questions regarding KVM, and would appreciate any help:)
>> >
>> > I have been reading about the KVM architecture, and as I understand
>> > it, the guest shows up as a regular process in the host itself..
>> >
>> > I had some questions around that..
>> >
>> > 1. Are the guest processes implemented as a control group within the
>> > overall VM process itself? Is the VM a kernel process or a user
>> > process?
>
> User process.
>
>> > 2. Is there a way for me to force some specific CPU/s to a guest, and
>> > those CPUs to be not used for any work on the host itself? Pinning is
>> > just making sure the vCPU runs on the same physical CPU always, I am
>> > looking for something more than that..
>
> Control groups.
Do control groups prevent the host from using those CPUs? I want only
the VM to use the CPUs, and dont want any host user or kernel threads
to run on that physical CPU. I looked up control groups, maybe I
missed something there. I will go back and take a look. If you can
clarify, I would appreciate it:)
>
>> > 3. If the host is compiled as a non pre-emptible kernel, kernel
>> > process run to completion until they give up the CPU themselves. In
>> > the context of a guest, I am trying to understand what that would mean
>> > in the context of KVM and guest VMs. If the VM is a user process, it
>> > means nothing, I wasnt sure as per (1).
>
> What problem are you trying to solve?
Its more of an investigation at this point to understand what can happen..
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
2014-04-17 21:33 ` Re: Marcus White
@ 2014-04-21 21:49 ` Marcelo Tosatti
0 siblings, 0 replies; 35+ messages in thread
From: Marcelo Tosatti @ 2014-04-21 21:49 UTC (permalink / raw)
To: Marcus White; +Cc: kvm
On Thu, Apr 17, 2014 at 02:33:41PM -0700, Marcus White wrote:
> >> > Hello,
> >> > I had some basic questions regarding KVM, and would appreciate any help:)
> >> >
> >> > I have been reading about the KVM architecture, and as I understand
> >> > it, the guest shows up as a regular process in the host itself..
> >> >
> >> > I had some questions around that..
> >> >
> >> > 1. Are the guest processes implemented as a control group within the
> >> > overall VM process itself? Is the VM a kernel process or a user
> >> > process?
> >
> > User process.
> >
> >> > 2. Is there a way for me to force some specific CPU/s to a guest, and
> >> > those CPUs to be not used for any work on the host itself? Pinning is
> >> > just making sure the vCPU runs on the same physical CPU always, I am
> >> > looking for something more than that..
> >
> > Control groups.
> Do control groups prevent the host from using those CPUs? I want only
> the VM to use the CPUs, and dont want any host user or kernel threads
> to run on that physical CPU. I looked up control groups, maybe I
> missed something there. I will go back and take a look. If you can
> clarify, I would appreciate it:)
Per-CPU kernel threads usually perform necessary work on behalf of
the CPU.
For other user threads, yes you can have them execute exclusively on other
processors.
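For what it's worth, a minimal sketch of the control-group (cgroup-v1
cpuset) side is below. The mount point, CPU numbers and pid are illustrative
assumptions, not something taken from this thread: a cpuset "vm" is given
CPUs 2-3 exclusively and the VM's process is moved into it, while a second
cpuset holding the remaining CPUs would keep ordinary host tasks off 2-3.
The per-CPU kernel threads mentioned above still run on those CPUs.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return;
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* assumes the cpuset controller is mounted here */
        mkdir("/sys/fs/cgroup/cpuset/vm", 0755);
        write_str("/sys/fs/cgroup/cpuset/vm/cpuset.cpus", "2-3");
        write_str("/sys/fs/cgroup/cpuset/vm/cpuset.mems", "0");
        write_str("/sys/fs/cgroup/cpuset/vm/cpuset.cpu_exclusive", "1");
        /* 12345 stands in for the qemu-kvm pid */
        write_str("/sys/fs/cgroup/cpuset/vm/tasks", "12345");
        return 0;
}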
> >> > 3. If the host is compiled as a non pre-emptible kernel, kernel
> >> > process run to completion until they give up the CPU themselves. In
> >> > the context of a guest, I am trying to understand what that would mean
> >> > in the context of KVM and guest VMs. If the VM is a user process, it
> >> > means nothing, I wasnt sure as per (1).
> >
> > What problem are you trying to solve?
> Its more of an investigation at this point to understand what can happen..
http://www.phoronix.com/scan.php?page=news_item&px=MTI1NTg
http://www.linuxplumbersconf.org/2013/ocw/sessions/1143
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
@ 2013-06-28 10:14 emirates
0 siblings, 0 replies; 35+ messages in thread
From: emirates @ 2013-06-28 10:14 UTC (permalink / raw)
To: info
Did You Receive Our Last Notification?(Reply Via fly.emiratesairline@5d6d.cn)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
@ 2013-06-28 10:12 emirates
0 siblings, 0 replies; 35+ messages in thread
From: emirates @ 2013-06-28 10:12 UTC (permalink / raw)
To: info
Did You Receive Our Last Notification?(Reply Via fly.emiratesairline@5d6d.cn)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
@ 2013-06-27 21:21 emirates
0 siblings, 0 replies; 35+ messages in thread
From: emirates @ 2013-06-27 21:21 UTC (permalink / raw)
To: info
Did You Recieve Our Last Notification!!
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:.
@ 2011-10-29 21:27 Young Chang
0 siblings, 0 replies; 35+ messages in thread
From: Young Chang @ 2011-10-29 21:27 UTC (permalink / raw)
May I ask if you would be eligible to pursue a Business Proposal of
$19.7m with me if you don't mind? Let me know if you are interested?
----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re:
@ 2010-12-14 3:03 Irish Online News
0 siblings, 0 replies; 35+ messages in thread
From: Irish Online News @ 2010-12-14 3:03 UTC (permalink / raw)
You've earned 750,000 GBP. Send Necessary Information:Name,Age,Country
^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <20090427104117.GB29082@redhat.com>]
* Re:
[not found] <20090427104117.GB29082@redhat.com>
@ 2009-04-27 13:16 ` Sheng Yang
0 siblings, 0 replies; 35+ messages in thread
From: Sheng Yang @ 2009-04-27 13:16 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Monday 27 April 2009 18:41:17 Michael S. Tsirkin wrote:
> Sheng, Marcelo,
> I've been reading code in qemu/hw/device-assignment.c, and
> I have a couple of questions about msi-x implementation:
Hi Michael
> 1. What is the reason that msix_table_page is allocated
> with mmap and not with e.g. malloc?
msix_table_page is a page, and mmap() allocates memory on a page boundary,
so I use it.
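(For illustration only - roughly what such a page-aligned anonymous
allocation looks like; this is not the actual qemu call site:)

#include <sys/mman.h>
#include <unistd.h>

/* page-aligned, zeroed allocation; malloc() gives no alignment
 * guarantee, which is why mmap() is used for a table that must sit
 * on its own page */
static void *alloc_table_page(void)
{
        void *p = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        return p == MAP_FAILED ? NULL : p;
}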
> 2. msix_table_page has the guest view of the msix table for the device.
> However, even this memory isn't mapped into guest directly, instead
> msix_mmio_read/msix_mmio_write perform the write in qemu.
> Won't it be possible to map this page directly into
> guest memory, reducing the overhead for table writes?
First, Linux configures the real MSI-X table in the device, which is out of
our scope. KVM accepts the interrupt from Linux and then injects it into the
guest according to the guest's MSI-X table settings, so KVM has to know about
modifications to the page. For example, the MSI-X table has a mask bit that
the guest can write at any time (this bit hasn't been implemented yet, but
should be soon); when the guest writes it, we should mask the corresponding
vector in the real MSI-X table. The guest may also modify the MSI
address/data, which likewise has to be intercepted by KVM and used to update
our knowledge of the guest. So we can't pass the modifications through.
If the guest could write to the real device's MSI-X table directly, it would
cause chaos in interrupt delivery, because what the guest sees would be
totally different from what the host sees...
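To make the interception concrete, here is a rough, hypothetical sketch of
the scheme described above (names, layout and helpers are illustrative only,
not the actual qemu/KVM code): guest writes land in a shadow copy of the
table, and only the effects the host cares about - the per-vector mask bit
and the address/data pair - are propagated:

#include <stdint.h>

#define MSIX_ENTRY_SIZE     16   /* addr_lo, addr_hi, data, vector ctrl */
#define MSIX_VECTOR_CTRL    12
#define MSIX_VECTOR_MASKBIT 1u

/* hypothetical hooks into the host/KVM side, stubbed here */
static void host_msix_set_masked(uint32_t entry, int masked)
{ (void)entry; (void)masked; }
static void host_msix_update_route(uint32_t entry)
{ (void)entry; }

struct msix_shadow {
        uint8_t table[4096];     /* the guest's view of the table page */
};

static void msix_shadow_write(struct msix_shadow *s, uint32_t offset,
                              uint32_t val)
{
        uint32_t entry = offset / MSIX_ENTRY_SIZE;
        uint32_t field = offset % MSIX_ENTRY_SIZE;

        /* always keep the guest-visible copy up to date */
        *(uint32_t *)(s->table + offset) = val;

        if (field == MSIX_VECTOR_CTRL)
                /* guest toggled the mask bit: mirror it to the real table */
                host_msix_set_masked(entry, val & MSIX_VECTOR_MASKBIT);
        else
                /* guest changed address/data: refresh the injection route */
                host_msix_update_route(entry);
}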
--
regards
Yang, Sheng
>
> Could you shed light on this for me please?
> Thanks,
^ permalink raw reply [flat|nested] 35+ messages in thread
* "-vga std" causes guest OS to crash (on disk io?) on a 1440x900 latpop.
@ 2009-03-08 22:05 Dylan Reid
2009-03-14 5:09 ` Dylan
0 siblings, 1 reply; 35+ messages in thread
From: Dylan Reid @ 2009-03-08 22:05 UTC (permalink / raw)
To: kvm
When I run the guest without the -vga option, everything is fine; the
resolution isn't what I would like, but it works remarkably well.
If I add "-vga std" to the command I use to run the guest, then it
will crash when it attempts to start graphics. This will also happen
if I attempt to boot from an .iso with the "-vga std" option. If I
attempt to install ubuntu 8.10 for x86 I get errors about not being
able to read startup files in the /etc directory.
This also happens if I use the -no-kvm switch.
reidd@heman:/usr/appliances$ uname -a
Linux heman 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009 i686 GNU/Linux
This is running an AMD Turion X2 on a Toshiba Satellite P305D.
I am using kvm-84, but also tested on 83 with the same results.
Any ideas on why this would happen?
Thanks,
Dylan
^ permalink raw reply [flat|nested] 35+ messages in thread
* (unknown),
@ 2009-02-25 0:50 Josh Borke
2009-02-25 0:58 ` Atsushi SAKAI
0 siblings, 1 reply; 35+ messages in thread
From: Josh Borke @ 2009-02-25 0:50 UTC (permalink / raw)
To: kvm
subscribe kvm
^ permalink raw reply [flat|nested] 35+ messages in thread
* (unknown)
@ 2009-01-10 21:53 Ekin Meroğlu
2009-11-07 15:59 ` Bulent Abali
0 siblings, 1 reply; 35+ messages in thread
From: Ekin Meroğlu @ 2009-01-10 21:53 UTC (permalink / raw)
To: kvm
subscribe kvm
^ permalink raw reply [flat|nested] 35+ messages in thread
* (unknown)
@ 2008-07-28 21:27 Mohammed Gamal
2008-07-28 21:29 ` Mohammed Gamal
0 siblings, 1 reply; 35+ messages in thread
From: Mohammed Gamal @ 2008-07-28 21:27 UTC (permalink / raw)
To: kvm; +Cc: avi, guillaume.thouvenin
laurent.vivier@bull.net, riel@surriel.com
Bcc:
Subject: [RFC][PATCH] VMX: Add and enhance VMentry failure detection
mechanism
Reply-To:
This patch is *not* meant to be merged. It fixes the random crashes with
gfxboot, which no longer crashes at random instructions.
It mainly does two things:
1- It handles all possible exit reasons before bailing out on VMX failures
2- It handles vmentry failures while avoiding external interrupts
However, while this patch allows booting FreeDOS with HIMEM without
problems, it still occasionally crashes with gfxboot at RIP 6e29. Looking
at the gfxboot code, the instruction causing the crash is the following:
00006e10 <switch_to_pm_20>:
6e10: 66 b8 20 00 mov $0x20,%ax
6e14: 8e d8 mov %eax,%ds
6e16: 8c d0 mov %ss,%eax
6e18: 81 e4 ff ff 00 00 and $0xffff,%esp
6e1e: c1 e0 04 shl $0x4,%eax
6e21: 01 c4 add %eax,%esp
6e23: 66 b8 08 00 mov $0x8,%ax
6e27: 8e d0 mov %eax,%ss
6e29: 8e c0 mov %eax,%es
6e2b: 8e e0 mov %eax,%fs
6e2d: 8e e8 mov %eax,%gs
6e2f: 58 pop %eax
6e30: 66 9d popfw
6e32: 66 c3 retw
So apparently, to fix the problem we need to add further guest state checks
(namely for ES, FS and GS) to invalid_guest_state(); a rough sketch of what
that could look like follows.
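(A rough, untested sketch of the shape those extra checks could take,
mirroring the SS/CS comparison already in invalid_guest_state(); the
handle_rpl_mismatch() helper is purely illustrative and not part of this
patch:)

        u16 cs = vmcs_read16(GUEST_CS_SELECTOR);
        u16 es = vmcs_read16(GUEST_ES_SELECTOR);
        u16 fs = vmcs_read16(GUEST_FS_SELECTOR);
        u16 gs = vmcs_read16(GUEST_GS_SELECTOR);

        /* emulate until the data segment RPLs are consistent with CS,
         * as is already done when SS and CS disagree */
        if ((es & 0x03) != (cs & 0x03) ||
            (fs & 0x03) != (cs & 0x03) ||
            (gs & 0x03) != (cs & 0x03))
                return handle_rpl_mismatch(vcpu, kvm_run);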
Now enough talk, here is the patch
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
---
arch/x86/kvm/vmx.c | 116 +++++++++++++++++++++++++++++++++++++++++---
arch/x86/kvm/vmx.h | 3 +
include/asm-x86/kvm_host.h | 1 +
3 files changed, 112 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c4510fe..b438f94 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1316,7 +1316,8 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
fix_pmode_dataseg(VCPU_SREG_GS, &vcpu->arch.rmode.gs);
fix_pmode_dataseg(VCPU_SREG_FS, &vcpu->arch.rmode.fs);
- vmcs_write16(GUEST_SS_SELECTOR, 0);
+ if (vcpu->arch.rmode_failed)
+ vmcs_write16(GUEST_SS_SELECTOR, 0);
vmcs_write32(GUEST_SS_AR_BYTES, 0x93);
vmcs_write16(GUEST_CS_SELECTOR,
@@ -2708,6 +2709,93 @@ static int handle_nmi_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
return 1;
}
+static int invalid_guest_state(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run, u32 failure_reason)
+{
+ u16 ss, cs;
+ u8 opcodes[4];
+ unsigned long rip = kvm_rip_read(vcpu);
+ unsigned long rip_linear;
+
+ ss = vmcs_read16(GUEST_SS_SELECTOR);
+ cs = vmcs_read16(GUEST_CS_SELECTOR);
+
+ if ((ss & 0x03) != (cs & 0x03)) {
+ int err;
+ rip_linear = rip + vmx_get_segment_base(vcpu, VCPU_SREG_CS);
+ emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu);
+ err = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
+ switch (err) {
+ case EMULATE_DONE:
+ return 1;
+ case EMULATE_DO_MMIO:
+ printk(KERN_INFO "mmio?\n");
+ return 0;
+ default:
+ /* HACK: If we can not emulate the instruction
+ * we write a sane value on SS to pass sanity
+ * checks. The good thing to do is to emulate the
+ * instruction */
+ kvm_report_emulation_failure(vcpu, "vmentry failure");
+ printk(KERN_INFO " => Quit real mode emulation\n");
+ vcpu->arch.rmode_failed = 1;
+ vmcs_write16(GUEST_SS_SELECTOR, 0);
+ return 1;
+ }
+ }
+
+ kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
+ kvm_run->hw.hardware_exit_reason = failure_reason;
+ printk(KERN_INFO "Failed to handle invalid guest state\n");
+ return 0;
+}
+
+/*
+ * Should be replaced with exit handlers for each individual case
+ */
+static int handle_vmentry_failure(struct kvm_vcpu *vcpu,
+ struct kvm_run *kvm_run,
+ u32 failure_reason)
+{
+ unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+ switch (failure_reason) {
+ case EXIT_REASON_INVALID_GUEST_STATE:
+ return invalid_guest_state(vcpu, kvm_run, failure_reason);
+ case EXIT_REASON_MSR_LOADING:
+ printk("VMentry failure caused by MSR entry %ld loading.\n",
+ exit_qualification);
+ printk(" ... Not handled\n");
+ break;
+ case EXIT_REASON_MACHINE_CHECK:
+ printk("VMentry failure caused by machine check.\n");
+ printk(" ... Not handled\n");
+ break;
+ default:
+ printk("reason not known yet!\n");
+ break;
+ }
+ return 0;
+}
+
+static int handle_invalid_guest_state(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+ int rc;
+ u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+ /*
+ * Disable interrupts to avoid occasional vmexits while
+ * handling vmentry failures
+ */
+ spin_lock_irq(&vmx_vpid_lock);
+ if(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)
+ exit_reason &= ~VMX_EXIT_REASONS_FAILED_VMENTRY;
+
+ rc = invalid_guest_state(vcpu, kvm_run, exit_reason);
+ spin_unlock_irq(&vmx_vpid_lock);
+
+ return rc;
+}
+
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
@@ -2733,6 +2821,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu,
[EXIT_REASON_WBINVD] = handle_wbinvd,
[EXIT_REASON_TASK_SWITCH] = handle_task_switch,
[EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
+ [EXIT_REASON_INVALID_GUEST_STATE] = handle_invalid_guest_state,
};
static const int kvm_vmx_max_exit_handlers =
@@ -2758,21 +2847,32 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
ept_load_pdptrs(vcpu);
}
- if (unlikely(vmx->fail)) {
- kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
- kvm_run->fail_entry.hardware_entry_failure_reason
- = vmcs_read32(VM_INSTRUCTION_ERROR);
- return 0;
- }
-
if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
(exit_reason != EXIT_REASON_EXCEPTION_NMI &&
exit_reason != EXIT_REASON_EPT_VIOLATION))
printk(KERN_WARNING "%s: unexpected, valid vectoring info and "
"exit reason is 0x%x\n", __func__, exit_reason);
+
+ /*
+ * Instead of using handle_vmentry_failure(), just clear
+ * the vmentry failure bit and leave it to the exit handlers
+ * to deal with the specific exit reason.
+ * The exit handlers other than invalid guest state handler
+ * will be added later.
+ */
+ if ((exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+ exit_reason &= ~VMX_EXIT_REASONS_FAILED_VMENTRY;
+
+
+ /* Handle all possible exits first, handle failure later. */
if (exit_reason < kvm_vmx_max_exit_handlers
&& kvm_vmx_exit_handlers[exit_reason])
return kvm_vmx_exit_handlers[exit_reason](vcpu, kvm_run);
+ else if(unlikely(vmx->fail)) {
+ kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+ kvm_run->fail_entry.hardware_entry_failure_reason
+ = vmcs_read32(VM_INSTRUCTION_ERROR);
+ }
else {
kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
kvm_run->hw.hardware_exit_reason = exit_reason;
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 0c22e5f..cf8b771 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -239,7 +239,10 @@ enum vmcs_field {
#define EXIT_REASON_IO_INSTRUCTION 30
#define EXIT_REASON_MSR_READ 31
#define EXIT_REASON_MSR_WRITE 32
+#define EXIT_REASON_INVALID_GUEST_STATE 33
+#define EXIT_REASON_MSR_LOADING 34
#define EXIT_REASON_MWAIT_INSTRUCTION 36
+#define EXIT_REASON_MACHINE_CHECK 41
#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
#define EXIT_REASON_APIC_ACCESS 44
#define EXIT_REASON_EPT_VIOLATION 48
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 0b6b996..422d7c2 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -294,6 +294,7 @@ struct kvm_vcpu_arch {
} tr, es, ds, fs, gs;
} rmode;
int halt_request; /* real mode on Intel only */
+ int rmode_failed;
int cpuid_nent;
struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
^ permalink raw reply related [flat|nested] 35+ messages in thread
end of thread, other threads:[~2022-01-14 17:17 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-14 10:54 Li RongQing
2022-01-14 10:55 ` Paolo Bonzini
2022-01-14 17:13 ` Re: Sean Christopherson
2022-01-14 17:17 ` Re: Paolo Bonzini
[not found] <E1hUrZM-0007qA-Q8@sslproxy01.your-server.de>
2019-05-29 19:54 ` Re: Alex Williamson
-- strict thread matches above, loose matches on Subject: below --
2019-03-19 14:41 (unknown) Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08 ` Re: Maxim Levitsky
2019-03-21 16:12 ` Re: Stefan Hajnoczi
2019-03-21 16:21 ` Re: Keith Busch
2019-03-21 16:41 ` Re: Felipe Franciosi
2019-03-21 17:04 ` Re: Maxim Levitsky
2019-03-22 7:54 ` Re: Felipe Franciosi
2019-03-22 10:32 ` Re: Maxim Levitsky
2019-03-22 15:30 ` Re: Keith Busch
2019-03-25 15:44 ` Re: Felipe Franciosi
2018-02-05 5:28 Re: Fahama Vaserman
2017-11-13 14:44 Re: Amos Kalonzo
[not found] <CAMj-D2DO_CfvD77izsGfggoKP45HSC9aD6auUPAYC9Yeq_aX7w@mail.gmail.com>
2017-05-04 16:44 ` Re: gengdongjiu
2017-02-23 15:09 Qin's Yanjun
[not found] <D0613EBE33E8FD439137DAA95CCF59555B7A5A4D@MGCCCMAIL2010-5.mgccc.cc.ms.us>
2015-11-24 13:21 ` RE: Amis, Ryann
2014-04-13 21:01 (unknown), Marcus White
2014-04-15 0:59 ` Marcus White
2014-04-16 21:17 ` Re: Marcelo Tosatti
2014-04-17 21:33 ` Re: Marcus White
2014-04-21 21:49 ` Re: Marcelo Tosatti
2013-06-28 10:14 Re: emirates
2013-06-28 10:12 Re: emirates
2013-06-27 21:21 Re: emirates
2011-10-29 21:27 Re: Young Chang
2010-12-14 3:03 Re: Irish Online News
[not found] <20090427104117.GB29082@redhat.com>
2009-04-27 13:16 ` Re: Sheng Yang
2009-03-08 22:05 "-vga std" causes guest OS to crash (on disk io?) on a 1440x900 laptop Dylan Reid
2009-03-14 5:09 ` Dylan
2009-02-25 0:50 (unknown), Josh Borke
2009-02-25 0:58 ` Atsushi SAKAI
2009-01-10 21:53 (unknown) Ekin Meroğlu
2009-11-07 15:59 ` Bulent Abali
2009-11-07 16:36 ` Neil Aggarwal
2008-07-28 21:27 (unknown) Mohammed Gamal
2008-07-28 21:29 ` Mohammed Gamal
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).