From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: <cjia@nvidia.com>, <kevin.tian@intel.com>, <ziye.yang@intel.com>,
	<changpeng.liu@intel.com>, <yi.l.liu@intel.com>,
	<mlevitsk@redhat.com>, <eskultet@redhat.com>, <cohuck@redhat.com>,
	<dgilbert@redhat.com>, <jonathan.davies@nutanix.com>,
	<eauger@redhat.com>, <aik@ozlabs.ru>, <pasic@linux.ibm.com>,
	<felipe@nutanix.com>, <Zhengxiao.zx@Alibaba-inc.com>,
	<shuangtai.tst@alibaba-inc.com>, <Ken.Xue@amd.com>,
	<zhi.a.wang@intel.com>, <yan.y.zhao@intel.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>
Subject: Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Wed, 19 Feb 2020 09:51:32 +0530
Message-ID: <96d9990f-b27f-759e-1395-9b2eff218bfa@nvidia.com>
In-Reply-To: <20200218144105.4076b7eb@w520.home>



On 2/19/2020 3:11 AM, Alex Williamson wrote:
> On Tue, 18 Feb 2020 11:28:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> <snip>
>>
>>>>>>>     As I understand the above algorithm, we find a vfio_dma
>>>>>>> overlapping the request and populate the bitmap for that range.  Then
>>>>>>> we go back and put_user() for each byte that we touched.  We could
>>>>>>> instead simply work on a one byte buffer as we enumerate the requested
>>>>>>> range and do a put_user() every time we reach the end of it and have bits
>>>>>>> set. That would greatly simplify the above example.  But I would expect
>>>>>>> that we're a) more likely to get asked for ranges covering a single
>>>>>>> vfio_dma
>>>>>>
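A minimal sketch of that byte-at-a-time approach, assuming a hypothetical
iova_is_dirty() helper for the per-page dirty test and a 'u8 __user *bitmap'
argument:

	u8 byte = 0;
	unsigned long pgshift = __ffs(pgsize);
	dma_addr_t iova;

	for (iova = start; iova < end; iova += pgsize) {
		unsigned long bit = (iova - start) >> pgshift;

		if (iova_is_dirty(iommu, iova))	/* hypothetical helper */
			byte |= 1 << (bit & 0x7);

		/* flush at each byte boundary, and at the end of the range */
		if ((bit & 0x7) == 0x7 || iova + pgsize >= end) {
			if (byte && put_user(byte, bitmap + (bit >> 3)))
				return -EFAULT;
			byte = 0;
		}
	}
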
>>>>>> QEMU asks for a single vfio_dma during each iteration.
>>>>>>
>>>>>> If we restrict this ABI to cover a single vfio_dma only, then it
>>>>>> simplifies the logic here. That was my original suggestion. Should we
>>>>>> think about that again?
>>>>>
>>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
>>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
>>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
>>>>> also reluctant to design an ABI around what happens to be the current
>>>>> QEMU implementation.
>>>>>
>>>>> If we take your example above, ranges {0x0000,0xa000} and
>>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
>>>>> following two bitmaps in this implementation:
>>>>>
>>>>> 00000011 11111111b
>>>>> 00111111b
>>>>>
>>>>> And we need to combine those into:
>>>>>
>>>>> 11111111 11111111b
>>>>>
>>>>> Right?
>>>>>
>>>>> But it seems like that would be easier if the second bitmap was instead:
>>>>>
>>>>> 11111100b
>>>>>
>>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
>>>>> the bit offset within the byte, which limits our fixes to the boundary
>>>>> byte and allows us to use copy_to_user() directly for the bulk of the
>>>>> copy.  So how do we get there?
>>>>>
>>>>> I think we start with allocating the vfio_dma bitmap to account for
>>>>> this initial offset, so we calculate bitmap_base_iova as:
>>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
>>>>> We then use bitmap_base_iova in calculating which bits to set.
>>>>>
>>>>> The user needs to follow the same rules, and maybe this adds some value
>>>>> to the user providing the bitmap size rather than the kernel
>>>>> calculating it.  For example, if the user wanted the dirty bitmap for
>>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
>>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
>>>>>
>>>>> Effectively the user can ask for any iova range, but the buffer will be
>>>>> filled relative to the zeroth bit of the bitmap following the above
>>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
>>>>> requested pgsize).  I'm tempted to make this explicit in the user
>>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
>>>>> user is able to map and unmap single pages and we need to support
>>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
>>>>>       
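Working through Alex's numbers with 4K pages (so PAGE_SIZE << 3 == 0x8000):

	bitmap_base_iova = 0xa000 & ~((0x1000 << 3) - 1);	/* = 0x8000 */
	bit = (0xa000 - bitmap_base_iova) >> 12;		/* = 2      */

i.e. a dirty page at iova 0xa000 lands on bit #2 of the user's bitmap, as
described above.
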
>>>>
>>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
>>>> this.
>>>
>>> How does KVM do this?  My intent was that if all of our bitmaps share
>>> the same alignment then we can merge the intersection and continue to
>>> use copy_to_user() on either side.  However, if QEMU doesn't do the
>>> same, it doesn't really help us.  Is QEMU stuck with an implementation
>>> of only retrieving dirty bits per MemoryRegionSection exactly because
>>> of this issue and therefore we can rely on it in our implementation as
>>> well?  Thanks,
>>>    
>>
>> QEMU syncs the dirty bitmap per MemoryRegionSection. Within a
>> MemoryRegionSection there can be multiple KVMSlots. QEMU queries the
>> dirty bitmap per KVMSlot and marks pages dirty for each KVMSlot.
>> On the kernel side, the KVM_GET_DIRTY_LOG ioctl calls
>> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy the
>> bitmap of that memslot.
>> vfio_dma is per MemoryRegionSection. We can rely on MemoryRegionSection
>> in our implementation. But to get the bitmap during unmap, we have to
>> take care of concatenating bitmaps.
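
For reference, the copy in kvm_get_dirty_log_protect() looks roughly like
this (paraphrased from virt/kvm/kvm_main.c; surrounding logic elided and
details vary by kernel version):

	n = kvm_dirty_bitmap_bytes(memslot);
	...
	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
		return -EFAULT;
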
> 
> So KVM does not worry about bitmap alignment because the interface is
> based on slots; a dirty bitmap can only be retrieved for a single,
> entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> spanning multiple vfio_dmas, but maybe we have some leeway that we
> don't need to support both multiple vfio_dmas and dirty bitmap at the
> same time.  It seems like it would be a massive simplification if we
> required an unmap with dirty bitmap to span exactly one vfio_dma,
> right? 

Yes.

> I don't see that we'd break any existing users with that; it's
> unfortunate that we can't have the flexibility of the existing calling
> convention, but I think there's good reason for it here.  Our separate
> dirty bitmap log reporting would follow the same semantics.  I think
> this all aligns with how the MemoryListener works in QEMU right now,
> correct?  For example, we wouldn't need any extra per-MAP_DMA tracking
> in QEMU like KVM has for its slots.
> 

That's right.
Should we go ahead with an implementation that returns the dirty bitmap
for one vfio_dma at a time, both for the GET_DIRTY ioctl and for unmap
with the dirty-bitmap flag? Accordingly, we can add sanity checks to
these ioctls, along the lines of the sketch below.
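
A minimal sketch of such a check in the unmap path, reusing the existing
vfio_find_dma() lookup (the flag name here is hypothetical, not the final
ABI):

	if (unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
		struct vfio_dma *dma = vfio_find_dma(iommu, unmap->iova, 1);

		/* the range must cover exactly one vfio_dma */
		if (!dma || dma->iova != unmap->iova ||
		    dma->size != unmap->size)
			return -EINVAL;
	}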

Thanks,
Kirti

>> In QEMU, in kvm_physical_sync_dirty_bitmap() there is a comment where
>> the bitmap size is calculated and the bitmap is defined as 'void __user
>> *dirty_bitmap'. This is the same concern you raised, and it could be
>> handled similarly, as below:
>>
>>           /* XXX bad kernel interface alert
>>            * For dirty bitmap, kernel allocates array of size aligned to
>>            * bits-per-long.  But for case when the kernel is 64bits and
>>            * the userspace is 32bits, userspace can't align to the same
>>            * bits-per-long, since sizeof(long) is different between kernel
>>            * and user space.  This way, userspace will provide buffer which
>>            * may be 4 bytes less than the kernel will use, resulting in
>>            * userspace memory corruption (which is not detectable by valgrind
>>            * too, in most cases).
>>            * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
>>            * a hope that sizeof(long) won't become >8 any time soon.
>>            */
>>           if (!mem->dirty_bmap) {
>>               hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
>>                                          /*HOST_LONG_BITS*/ 64) / 8;
>>               /* Allocate on the first log_sync, once and for all */
>>               mem->dirty_bmap = g_malloc0(bitmap_size);
>>           }
> 
> Sort of; the KVM ioctl just passes a slot number and a user dirty
> bitmap pointer, so the size of the bitmap is inferred from the size of
> the slot, but if the kernel and userspace each round up to a multiple
> of their own longs, they might come up with different lengths.  QEMU
> therefore decides to always round up the size assuming an LP64 long.
> Since you've
> specified bitmap_size in our ioctl, the size agreement is explicit.
> 
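With the size passed explicitly, both sides simply agree on the obvious
byte-rounded computation, e.g.:

	bitmap_size = ((range_size >> pgshift) + 7) / 8;  /* one bit per page */
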
> The concern I had looks like it's addressed in KVM by placing the void*
> __user pointer in a union with a u64:
> 
> struct kvm_dirty_log {
>          __u32 slot;
>          __u32 padding1;
>          union {
>                  void __user *dirty_bitmap; /* one bit per page */
>                  __u64 padding2;
>          };
> };
> 
> The kvm_vm_compat_ioctl() handler deals with this using its own private
> structure:
> 
> struct compat_kvm_dirty_log {
>          __u32 slot;
>          __u32 padding1;
>          union {
>                  compat_uptr_t dirty_bitmap; /* one bit per page */
>                  __u64 padding2;
>          };
> };
> 
> Which gets extracted via:
> 
> 	log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
> 
> However, compat_ptr() has:
> 
> /*
>  * A pointer passed in from user mode. This should not
>  * be used for syscall parameters, just declare them
>  * as pointers because the syscall entry code will have
>  * appropriately converted them already.
>  */
> #ifndef compat_ptr
> static inline void __user *compat_ptr(compat_uptr_t uptr)
> {
>          return (void __user *)(unsigned long)uptr;
> }
> #endif
> 
> So maybe we don't need to do anything special?  I'm tempted to think
> the KVM handling is using a legacy mechanism, or that the padding in the union
> was assumed not to be for that purpose.  Thanks,
> 
> Alex
> 
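One way to sidestep the compat handling entirely in our new ioctl would be
to carry the pointer as a __u64 and recover it with u64_to_user_ptr() in
the kernel. A sketch only; this struct layout is illustrative, not the
current patch:

	struct vfio_iommu_type1_dirty_bitmap_sketch {	/* hypothetical */
		__u32	argsz;
		__u32	flags;
		__u64	pgsize;       /* page size the bitmap is based on */
		__u64	bitmap_size;  /* explicit size, as discussed above */
		__u64	bitmap;       /* user pointer, same width on 32/64-bit */
	};

	/* in the ioctl handler, no compat variant needed: */
	void __user *bitmap = u64_to_user_ptr(args.bitmap);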
