From: Steven Sistare <steven.sistare@oracle.com>
To: Jason Zeng <jason.zeng@intel.com>,
Alex Williamson <alex.williamson@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
"Juan Quintela" <quintela@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
qemu-devel@nongnu.org,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Jason Zeng" <jason.zeng@linux.intel.com>,
"Philippe Mathieu-Daudé" <philmd@redhat.com>,
"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [PATCH V1 30/32] vfio-pci: save and restore
Date: Wed, 7 Oct 2020 17:25:51 -0400
Message-ID: <bdbb51c3-9e6e-3b4f-2e5d-79dd1ba64d70@oracle.com>
In-Reply-To: <20200820103333.GA30987@x48>
On 8/20/2020 6:33 AM, Jason Zeng wrote:
> On Wed, Aug 19, 2020 at 05:15:11PM -0400, Steven Sistare wrote:
>> On 8/9/2020 11:50 PM, Jason Zeng wrote:
>>> On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
>>>> On 8/6/2020 6:22 AM, Jason Zeng wrote:
>>>>> Hi Steve,
>>>>>
>>>>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>>>>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>> };
>>>>>>
>>>>>> +static int vfio_pci_post_load(void *opaque, int version_id)
>>>>>> +{
>>>>>> + int vector;
>>>>>> + MSIMessage msg;
>>>>>> + Error *err = 0;
>>>>>> + VFIOPCIDevice *vdev = opaque;
>>>>>> + PCIDevice *pdev = &vdev->pdev;
>>>>>> +
>>>>>> + if (msix_enabled(pdev)) {
>>>>>> + vfio_msix_enable(vdev);
>>>>>> + pdev->msix_function_masked = false;
>>>>>> +
>>>>>> + for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>>>>>> + if (!msix_is_masked(pdev, vector)) {
>>>>>> + msg = msix_get_message(pdev, vector);
>>>>>> + vfio_msix_vector_use(pdev, vector, msg);
>>>>>> + }
>>>>>> + }
>>>>>
>>>>> It looks to me like the MSIX re-init here may lose device IRQs and
>>>>> impact the device's hardware state.
>>>>>
>>>>> The re-init will cause the kernel vfio driver to connect the device's
>>>>> MSIX vectors to new eventfds and a new KVM instance. But before that
>>>>> happens, device IRQs will be routed to the previous eventfds, so it
>>>>> looks like those IRQs will be lost.
>>>>
>>>> Thanks Jason, that sounds like a problem. I could try reading and saving an
>>>> event from eventfd before shutdown, and injecting it into the eventfd after
>>>> restart, but that would be racy unless I disable interrupts. Or, unconditionally
>>>> inject a spurious interrupt after restart to kick it, in case an interrupt
>>>> was lost.
>>>>
>>>> Do you have any other ideas?
>>>
>>> Maybe we can consider also handing over the eventfd file descriptor,
>>
>> I believe preserving this descriptor in isolation is not sufficient. We would
>> also need to preserve the KVM instance which it is linked to.
>>
>>> or even the KVM fds to the new Qemu?
>>>
>>> If the KVM fds can be preserved, we will just need to restore Qemu KVM
>>> side states. But not sure how complicated the implementation would be.
>>
>> That should work, but I fear it would require many code changes in QEMU
>> to re-use descriptors at object creation time and suppress the initial
>> configuration ioctl's, so it's not my first choice for a solution.
>>
>>> If we only preserve the eventfd fd, we can attach the old eventfd to
>>> the vfio devices. But it may turn out that we always have to inject an
>>> interrupt unconditionally, because the kernel's KVM irqfd handling of
>>> the eventfd is a bit different from a normal userland eventfd
>>> read/write: it does not decrement the counter in the eventfd context.
>>> So if we read the eventfd from the new QEMU, it looks like it will
>>> always have a non-zero counter, which requires an interrupt injection.
>>
>> Good to know, thanks.
>>
>> I will try creating a new eventfd and injecting an interrupt unconditionally.
>> I need a test case to demonstrate losing an interrupt, and fixing it with
>> injection. Any advice? My stress tests with a virtual function nic and a
>> directly assigned nvme block device have never failed across live update.
>>
>
> I am not familiar with nvme devices. For a nic device, to my understanding,
> a nic stress test will not generate many IRQs, because the nic driver
> usually enables NAPI, which takes only the first interrupt, then disables
> interrupts and starts polling. It re-enables interrupts only after some
> packet quota is reached or the traffic quiesces for a while. But if the
> test runs long enough, the number of IRQs should still be large, so I am
> not sure why it doesn't trigger any issue. Maybe we can study the IRQ
> pattern of the test and see how to design a test case, or see whether
> our assumption is wrong?
>
>
>>>>> And the re-init will make the device go through the procedure of
>>>>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
>>>>> So if the device is active, its hardware state will be impacted?
>>>>
>>>> Again thanks. vfio_msix_enable() does indeed call vfio_disable_interrupts().
>>>> For a quick experiment, I deleted that call in the post_load code path,
>>>> and it seems to work fine, but I need to study it more.
>>>
>>> vfio_msix_vector_use() will also trigger this procedure in the kernel.
>>
>> Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
>> Can you point to what it triggers in the kernel?
>
>
> In vfio_msix_vector_use(), I see that vfio_disable_irqindex() will be
> invoked if "vdev->nr_vectors < nr + 1" is true. Since the 'vdev' is
> re-inited, this condition should be true, and vfio_disable_irqindex()
> will trigger VFIO_DEVICE_SET_IRQS with VFIO_IRQ_SET_DATA_NONE, which
> causes the kernel to disable MSIX.
>
>>
>>> It looks like we shouldn't trigger any kernel vfio actions here, because
>>> we preserve the vfio fds, so the kernel state shouldn't be touched. Here
>>> we may only need to restore the QEMU-side state. Re-connecting to the
>>> KVM instance should happen automatically when we set up the KVM irqfds
>>> with the same eventfds.
>>>
>>> BTW, if I remember correctly, it is not enough to save only the MSIX
>>> state in the snapshot. We should also save the QEMU-side pci config
>>> space cache in the snapshot, because QEMU's copy is not exactly the
>>> same as the kernel's copy. I encountered this before, but I don't
>>> remember which field it was.
>>
>> FYI all, Jason told me offline that qemu may emulate some pci capabilities and
>> hence keeps state in the shadow config that is never written to the kernel.
>> I need to study that.
>>
>
> Sorry, I read the code again and see that QEMU does write all config
> space writes to the kernel in vfio_pci_write_config(). Now I am also
> confused about what I was seeing previously :(. But it seems we still
> need to look at the kernel code to see whether a mismatch is possible
> between the QEMU and kernel copies of the config space.
>
> FYI. Some discussion about the VFIO PCI config space saving/restoring in
> live migration scenario:
> https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06964.html
>
I have coded a solution for much of the "lost interrupts" issue.
cprsave preserves the vfio err, req, and msi irq eventfds across exec:
vdev->err_notifier
vdev->req_notifier
vdev->msi_vectors[i].interrupt
vdev->msi_vectors[i].kvm_interrupt
The KVM instance is destroyed and recreated as before.
The eventfd descriptors are found and reused during vfio_realize using
event_notifier_init_fd. No calls to VFIO_DEVICE_SET_IRQS are made before or
after the exec. The descriptors are attached to the new KVM instance via the
usual ioctl's on the existing code paths.
It works. I issue cprsave, send an interrupt, wait a few seconds, then issue cprload.
The interrupt fires immediately after cprload. I tested interrupt delivery to the
kvm_irqchip and to qemu.
It does not support Posted Interrupts, as that involves state attached to the
VMCS, which is destroyed along with the KVM instance. That needs more study
and likely a kernel enhancement.
I will post the full code as part of the V2 patch series.
- Steve