From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58606)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <agraf@suse.de>) id 1YKdME-0002pf-T4
	for qemu-devel@nongnu.org; Sun, 08 Feb 2015 20:41:52 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <agraf@suse.de>) id 1YKdMB-0002UL-Jc
	for qemu-devel@nongnu.org; Sun, 08 Feb 2015 20:41:50 -0500
Message-ID: <54D81056.1090108@suse.de>
Date: Mon, 09 Feb 2015 02:41:42 +0100
From: Alexander Graf <agraf@suse.de>
MIME-Version: 1.0
References: <1422943851-25836-1-git-send-email-david@gibson.dropbear.id.au>	<20150203211906.GA13992@iris.ozlabs.ibm.com>	<20150204013211.GU28703@voom.fritz.box>	<54D23872.90007@suse.de>	<20150205004812.GD25675@voom.fritz.box>	<54D2BF4F.1030609@suse.de>	<20150205025556.GH25675@voom.fritz.box>	<6EFB0F0E-BB1D-4DAE-8BA4-367B05E88553@suse.de>	<20150205113007.GT25675@voom.fritz.box>	<54D35A41.6020907@suse.de>	<20150206025415.GZ25675@voom.fritz.box>	<54D473B0.3020201@suse.de>
	<20150209113718.5974a1a9@voom.fritz.box>
In-Reply-To: <20150209113718.5974a1a9@voom.fritz.box>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC] pseries: Enable in-kernel H_LOGICAL_CI_{LOAD,
	STORE} implementations
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: David Gibson <dgibson@redhat.com>
Cc: "mdroth@us.ibm.com" <mdroth@us.ibm.com>, "aik@ozlabs.ru" <aik@ozlabs.ru>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "qemu-ppc@nongnu.org" <qemu-ppc@nongnu.org>, Stefan Hajnoczi <stefanha@redhat.com>, Paul Mackerras <paulus@samba.org>


On 09.02.15 01:37, David Gibson wrote:
> On Fri, 06 Feb 2015 08:56:32 +0100
> Alexander Graf <agraf@suse.de> wrote:
> 
>>
>>
>> On 06.02.15 03:54, David Gibson wrote:
>>> On Thu, Feb 05, 2015 at 12:55:45PM +0100, Alexander Graf wrote:
>>>>
>>>>
>>>> On 05.02.15 12:30, David Gibson wrote:
>>>>> On Thu, Feb 05, 2015 at 11:22:13AM +0100, Alexander Graf wrote:
>>> [snip]
>>>>>>>>>>>> [snip]
>>>>>>>>>>>>
>>>>>>>>>>>>> +    ret1 = kvmppc_enable_hcall(kvm_state, H_LOGICAL_CI_LOAD);
>>>>>>>>>>>>> +    if (ret1 != 0) {
>>>>>>>>>>>>> +        fprintf(stderr, "Warning: error enabling H_LOGICAL_CI_LOAD in KVM:"
>>>>>>>>>>>>> +                " %s\n", strerror(errno));
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    ret2 = kvmppc_enable_hcall(kvm_state, H_LOGICAL_CI_STORE);
>>>>>>>>>>>>> +    if (ret2 != 0) {
>>>>>>>>>>>>> +        fprintf(stderr, "Warning: error enabling H_LOGICAL_CI_STORE in KVM:"
>>>>>>>>>>>>> +                " %s\n", strerror(errno));
>>>>>>>>>>>>> +     }
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    if ((ret1 != 0) || (ret2 != 0)) {
>>>>>>>>>>>>> +        fprintf(stderr, "Warning: Couldn't enable H_LOGICAL_CI_* in KVM, SLOF"
>>>>>>>>>>>>> +                " may be unable to operate devices with in-kernel emulation\n");
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>> You'll always get these warnings if you're running on an old (meaning
>>>>>>>>>>>> current upstream) kernel, which could be annoying.
>>>>>>>>>>>
>>>>>>>>>>> True.
>>>>>>>>>>>
>>>>>>>>>>>> Is there any way
>>>>>>>>>>>> to tell whether you have configured any devices which need the
>>>>>>>>>>>> in-kernel MMIO emulation and only warn if you have?
>>>>>>>>>>>
>>>>>>>>>>> In theory, I guess so.  In practice I can't see how you'd enumerate
>>>>>>>>>>> all devices that might require kernel intervention without something
>>>>>>>>>>> horribly invasive.
>>>>>>>>>>
>>>>>>>>>> We could WARN_ONCE in QEMU if we emulate such a hypercall, but its
>>>>>>>>>> handler is io_mem_unassigned (or we add another minimum priority huge
>>>>>>>>>> memory region on all 64bits of address space that reports the breakage).
>>>>>>>>>
>>>>>>>>> Would that work for the virtio+iothread case?  I had the impression
>>>>>>>>> the kernel handled notification region was layered over the qemu
>>>>>>>>> emulated region in that case.
>>>>>>>>
>>>>>>>> IIRC we don't have a way to call back into kvm saying "please write to
>>>>>>>> this in-kernel device". But we could at least defer the warning to a
>>>>>>>> point where we know that we actually hit it.
>>>>>>>
>>>>>>> Right, but I'm saying we might miss the warning in cases where we want
>>>>>>> it, because the KVM device is shadowed by a qemu device, so qemu won't
>>>>>>> see the IO as unassigned or unhandled.
>>>>>>>
>>>>>>> In particular, I think that will happen in the case of virtio-blk with
>>>>>>> iothread, which is the simplest case in which to observe the problem.
>>>>>>> The virtio-blk device exists in qemu and is functional, but we rely on
>>>>>>> KVM catching the queue notification MMIO before it reaches the qemu
>>>>>>> implementation of the rest of the device's IO space.
>>>>>>
>>>>>> But in that case the VM stays functional and will merely see a
>>>>>> performance hit when using virtio in SLOF, no? I don't think that's
>>>>>> a problem worth worrying users about.
>>>>>
>>>>> Alas, no.  The iothread stuff *relies* on the in-kernel notification,
>>>>> so it will not work if the IO gets punted to qemu.  This is the whole
>>>>> reason for the in-kernel hcall implementation.
>>>>
>>>> So at least with vhost-net the in-kernel trapping is optional. If we
>>>> happen to get MMIO into QEMU, we'll just handle it there.
>>>>
>>>> Enlighten me why the iothread stuff can't handle it that way too.
>>>
>>> So, as I understand it, it could, but it doesn't.  Working out how to
>>> fix it properly requires better understanding of the dataplane code
>>> than I currently possess,
>>>
>>> So, using virtio-blk as the example case.  Normally the queue notify
>>> mmio will get routed by the general virtio code to
>>> virtio_blk_handle_output().
>>>
>>> In the case of dataplane, that just calls
>>> virtio_blk_data_plane_start().  So the first time we get a vq notify,
>>> the dataplane is started.  That sets up the host notifier
>>> (VirtioBusClass::set_host_notifier -> virtio_pci_set_host_notifier ->
>>> virtio_pci_set_host_notifier_internal -> memory_region_add_eventfd()
>>> -> memory_region_transaction_commit() ->
>>> address_space_update_ioeventfds - >address_space_add_del_ioeventfds ->
>>> kvm_mem_ioeventfd_add -> kvm_set_ioeventfd_mmio -> KVM_IOEVENTFD
>>> ioctl)
>>>
>>> From this point on further calls to virtio_blk_handle_output() are
>>> IIUC a "can't happen", because vq notifies should go to the eventfd
>>> instead, where they will kick the iothread.
>>>
>>> So, with SLOF, the first request is ok - it hits
>>> virtio_blk_handle_output() which starts the iothread which goes on to
>>> process the request.
>>>
>>> On the second request, however, we get back into
>>> virtio_blk_data_plane_start() which sees the iothread is already
>>> running and aborts.  I think it is assuming that this must be the
>>> result of a race with another vcpu starting the dataplane, and so
>>> assumes the racing thread will have woken the dataplane which will
>>> then handle this vcpu's request as well.
>>>
>>> In our case, however, the IO hcalls go through to
>>> virtio_blk_handle_output() when the dataplane already going, and
>>> become no-ops without waking it up again to handle the new request.
>>>
>>> Enlightened enough yet?
>>
>> So reading this, it sounds like we could just add logic in the virtio
>> dataplane code that allows for a graceful fallback to QEMU based MMIO by
>> triggering the eventfd itself in the MMIO handler. When going via this
>> slow path, we should of course emit a warning (once) to the user ;).
>>
>> Stefan, what do you think?
> 
> So, as I understand it this should be possible.  I did even have a
> draft which did this.  However, I don't know the dataplane well enough
> to know what gotchas there might be in terms of races, and therefore
> how to do this quite right.

I'm sure Stefan knows :).

> Note that this doesn't remove the need for the in-kernel H_LOGICAL_CI_*
> hcalls, because those will still be necessary if we get real in-kernel
> emulated devices in future.

Oh, I definitely agree on that part. I just want to make sure we
gracefully run on older host kernels with newer QEMU versions.


Alex