From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: mcrossley@nvidia.com, cjia@nvidia.com,
	Cornelia Huck <cohuck@redhat.com>,
	qemu-devel@nongnu.org, dnigam@nvidia.com, philmd@redhat.com
Subject: Re: [PATCH v1] docs/devel: Add VFIO device migration documentation
Date: Fri, 6 Nov 2020 00:29:36 +0530
Message-ID: <6abf200c-972a-cbdb-8106-d197dccb780d@nvidia.com>
In-Reply-To: <20201104054527.22bbace7@x1.home>



On 11/4/2020 6:15 PM, Alex Williamson wrote:
> On Wed, 4 Nov 2020 13:25:40 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/4/2020 1:57 AM, Alex Williamson wrote:
>>> On Wed, 4 Nov 2020 01:18:12 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 10/30/2020 12:35 AM, Alex Williamson wrote:
>>>>> On Thu, 29 Oct 2020 23:11:16 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>
>>>> <snip>
>>>>   
>>>>>>>> +System memory dirty pages tracking
>>>>>>>> +----------------------------------
>>>>>>>> +
>>>>>>>> +A ``log_sync`` memory listener callback is added to mark system memory pages
>>>>>>>
>>>>>>> s/is added to mark/marks those/
>>>>>>>          
>>>>>>>> +as dirty which are used for DMA by VFIO device. Dirty pages bitmap is queried
>>>>>>>
>>>>>>> s/by/by the/
>>>>>>> s/Dirty/The dirty/
>>>>>>>          
>>>>>>>> +per container. All pages pinned by vendor driver through vfio_pin_pages()
>>>>>>>
>>>>>>> s/by/by the/
>>>>>>>          
>>>>>>>> +external API have to be marked as dirty during migration. When there are CPU
>>>>>>>> +writes, CPU dirty page tracking can identify dirtied pages, but any page pinned
>>>>>>>> +by vendor driver can also be written by device. There is currently no device
>>>>>>>
>>>>>>> s/by/by the/ (x2)
>>>>>>>          
>>>>>>>> +which has hardware support for dirty page tracking. So all pages which are
>>>>>>>> +pinned by vendor driver are considered as dirty.
>>>>>>>> +Dirty pages are tracked when device is in stop-and-copy phase because if pages
>>>>>>>> +are marked dirty during pre-copy phase and content is transfered from source to
>>>>>>>> +destination, there is no way to know newly dirtied pages from the point they
>>>>>>>> +were copied earlier until device stops. To avoid repeated copy of same content,
>>>>>>>> +pinned pages are marked dirty only during stop-and-copy phase.
>>>>>>
>>>>>>      
>>>>>>> Let me take a quick stab at rewriting this paragraph (not sure if I
>>>>>>> understood it correctly):
>>>>>>>
>>>>>>> "Dirty pages are tracked when the device is in the stop-and-copy phase.
>>>>>>> During the pre-copy phase, it is not possible to distinguish a dirty
>>>>>>> page that has been transferred from the source to the destination from
>>>>>>> newly dirtied pages, which would lead to repeated copying of the same
>>>>>>> content. Therefore, pinned pages are only marked dirty during the
>>>>>>> stop-and-copy phase." ?
>>>>>>>          
>>>>>>
>>>>>> I think the above rephrasing only covers repeated copying in the
>>>>>> pre-copy phase. I used "copied earlier until device stops" to cover
>>>>>> both the pre-copy and stop-and-copy phases, up to the point the
>>>>>> device stops.
>>>>>
>>>>>
>>>>> Now I'm confused, I thought we had abandoned the idea that we can only
>>>>> report pinned pages during stop-and-copy.  Doesn't the device need to
>>>>> expose its dirty memory footprint during the iterative phase, regardless
>>>>> of whether that causes repeat copies?  If QEMU iterates and sees that
>>>>> all memory is still dirty, it may have transferred more data, but it
>>>>> can actually predict whether it can achieve its downtime tolerances.
>>>>> Which is more important, less data transfer or predictability?  Thanks,
>>>>>       
>>>>
>>>> Even if QEMU copies and transfers the content of all system memory
>>>> pages during pre-copy (the worst case with an IOMMU-backed mdev device
>>>> whose vendor driver is not smart enough to pin pages explicitly, so all
>>>> system memory pages are marked dirty), its prediction of the downtime
>>>> will still not be correct, because during stop-and-copy all pages need
>>>> to be copied again, as the device can write to any of those pinned
>>>> pages.
>>>
>>> I think you're only reiterating my point.  If QEMU copies all of guest
>>> memory during the iterative phase and each time it sees that all memory
>>> is dirty, such as if CPUs or devices (including assigned devices) are
>>> dirtying pages as fast as it copies them (or continuously mark them
>>> dirty), then QEMU can predict that downtime will require copying all
>>> pages.
>>
>> But as of now there is no way to know whether the device has dirtied
>> pages during the iterative phase.
> 
> 
> This claim doesn't make any sense: pinned pages are considered
> persistently dirtied, both during the iterative phase and while stopped.
> 
>   
>>> If instead devices don't mark dirty pages until the VM is
>>> stopped, then QEMU might iterate through the memory copy and predict a
>>> short downtime because not much memory is dirty, only to be surprised
>>> that all of memory is suddenly dirty.  At that point it's too late: the
>>> VM is already stopped, and the predicted short downtime takes far longer
>>> than expected.  This is exactly why we made the kernel interface mark
>>> pinned pages persistently dirty when it was proposed that we only report
>>> pinned pages once.  Thanks,
>>>    
>>
>> Since there is no way to know whether the device dirtied pages during
>> the iterative phase, QEMU should query pinned pages in the stop-and-copy
>> phase.
> 
> 
> As above, I don't believe this is true.
> 
> 
>> Whenever there is hardware support, or some software mechanism to report
>> pages dirtied by the device, we can add a capability bit to the
>> migration capabilities; based on that bit, QEMU or the user space app
>> can decide whether to query dirty pages in the iterative phase.
> 
> 
> Yes, we could advertise support for fine-grained dirty page
> tracking, but I completely disagree that we should consider pinned
> pages clean until suddenly exposing them as dirty once the VM is
> stopped.  Thanks,
> 
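To make the trade-off concrete, below is a self-contained sketch
(illustrative numbers and helper names, not actual QEMU code) of how the
downtime estimate behaves under the two reporting policies being
discussed:

/*
 * Illustrative sketch only, not QEMU code.  Models how the migration
 * loop's downtime estimate differs depending on when pinned pages are
 * reported dirty.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Downtime the migration loop would predict from the dirty bytes it
 * can currently see. */
static uint64_t predicted_downtime_ms(uint64_t visible_dirty_bytes,
                                      uint64_t bw_bytes_per_ms)
{
    return visible_dirty_bytes / bw_bytes_per_ms;
}

int main(void)
{
    const uint64_t pinned    = 4ULL << 30;   /* 4 GiB pinned by the vendor driver */
    const uint64_t cpu_dirty = 256ULL << 20; /* 256 MiB dirtied by CPUs per pass */
    const uint64_t bw        = 10ULL << 20;  /* ~10 MiB transferred per ms */

    /* The real stop-and-copy work always includes the pinned pages,
     * since the device may have written any of them. */
    uint64_t actual_ms = (pinned + cpu_dirty) / bw;

    /* Policy 1: pinned pages reported persistently dirty during the
     * iterative phase -- the prediction matches reality. */
    printf("persistently dirty: predicted %" PRIu64 " ms, actual %" PRIu64 " ms\n",
           predicted_downtime_ms(pinned + cpu_dirty, bw), actual_ms);

    /* Policy 2: pinned pages reported only once the VM stops -- the
     * iterative-phase prediction sees only CPU-dirtied pages and is
     * far too optimistic. */
    printf("stop-only:          predicted %" PRIu64 " ms, actual %" PRIu64 " ms\n",
           predicted_downtime_ms(cpu_dirty, bw), actual_ms);
    return 0;
}

Under both policies the stop-and-copy phase has to move the same pinned
pages; the difference is whether QEMU can see that cost coming during
the iterative phase.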

Should QEMU copy dirtied pages twice, once during the iterative phase and
then again when the VM is stopped?
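For reference, here is a hedged sketch of the user-space side of the
per-container dirty bitmap query that the log_sync callback builds on.
It assumes a <linux/vfio.h> providing the VFIO_IOMMU_DIRTY_PAGES UAPI,
and error handling is omitted for brevity:

/*
 * Hedged sketch, not the actual QEMU implementation: query the dirty
 * bitmap for one IOVA range from an open VFIO container fd.  Without
 * hardware dirty tracking, every page the vendor driver has pinned
 * through vfio_pin_pages() comes back marked dirty on each call.
 */
#include <linux/vfio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

uint64_t *get_dirty_bitmap(int container_fd, uint64_t iova,
                           uint64_t size, uint64_t pgsize)
{
    size_t pages = size / pgsize;
    size_t bitmap_bytes = (pages + 63) / 64 * sizeof(uint64_t);
    uint64_t *bitmap = calloc(1, bitmap_bytes);

    struct vfio_iommu_type1_dirty_bitmap *db =
        calloc(1, sizeof(*db) +
                  sizeof(struct vfio_iommu_type1_dirty_bitmap_get));
    struct vfio_iommu_type1_dirty_bitmap_get *range = (void *)db->data;

    db->argsz = sizeof(*db) + sizeof(*range);
    db->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
    range->iova = iova;                /* start of the IOVA range */
    range->size = size;                /* length of the range in bytes */
    range->bitmap.pgsize = pgsize;     /* page size the bitmap is based on */
    range->bitmap.size = bitmap_bytes; /* bitmap buffer size in bytes */
    range->bitmap.data = (__u64 *)bitmap;

    ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, db);
    free(db);
    return bitmap;  /* caller frees; one set bit per dirty page */
}

In QEMU this query would be driven per container from the log_sync
memory listener callback, as the documentation text above describes.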

Thanks,
Kirti

