Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <cjia@nvidia.com>, <kevin.tian@intel.com>, <ziye.yang@intel.com>,
	<changpeng.liu@intel.com>, <yi.l.liu@intel.com>,
	<mlevitsk@redhat.com>, <eskultet@redhat.com>, <cohuck@redhat.com>,
	<dgilbert@redhat.com>, <jonathan.davies@nutanix.com>,
	<eauger@redhat.com>, <aik@ozlabs.ru>, <pasic@linux.ibm.com>,
	<felipe@nutanix.com>, <Zhengxiao.zx@Alibaba-inc.com>,
	<shuangtai.tst@alibaba-inc.com>, <Ken.Xue@amd.com>,
	<zhi.a.wang@intel.com>, <yan.y.zhao@intel.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>
Subject: Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
Date: Tue, 18 Feb 2020 21:53:30 -0700	[thread overview]
Message-ID: <20200218215330.5bc8fc6a@w520.home> (raw)
In-Reply-To: <96d9990f-b27f-759e-1395-9b2eff218bfa@nvidia.com>

On Wed, 19 Feb 2020 09:51:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/19/2020 3:11 AM, Alex Williamson wrote:
> > On Tue, 18 Feb 2020 11:28:53 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> <snip>
> >>  
> >>>>>>>     As I understand the above algorithm, we find a vfio_dma
> >>>>>>> overlapping the request and populate the bitmap for that range.  Then
> >>>>>>> we go back and put_user() for each byte that we touched.  We could
> >>>>>>> instead simply work on a one byte buffer as we enumerate the requested
> >>>>>>> range and do a put_user() ever time we reach the end of it and have bits
> >>>>>>> set. That would greatly simplify the above example.  But I would expect
> >>>>>>> that we're a) more likely to get asked for ranges covering a single
> >>>>>>> vfio_dma  
> >>>>>>
> >>>>>> QEMU ask for single vfio_dma during each iteration.
> >>>>>>
> >>>>>> If we restrict this ABI to cover single vfio_dma only, then it
> >>>>>> simplifies the logic here. That was my original suggestion. Should we
> >>>>>> think about that again?  
> >>>>>
> >>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
> >>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
> >>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
> >>>>> also reluctant to design an ABI around what happens to be the current
> >>>>> QEMU implementation.
> >>>>>
> >>>>> If we take your example above, ranges {0x0000,0xa000} and
> >>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>>>> following two bitmaps in this implementation:
> >>>>>
> >>>>> 00000011 11111111b
> >>>>> 00111111b
> >>>>>
> >>>>> And we need to combine those into:
> >>>>>
> >>>>> 11111111 11111111b
> >>>>>
> >>>>> Right?
> >>>>>
> >>>>> But it seems like that would be easier if the second bitmap was instead:
> >>>>>
> >>>>> 11111100b
> >>>>>
> >>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
> >>>>> the bit offset within the byte, which limits our fixes to the boundary
> >>>>> byte and allows us to use copy_to_user() directly for the bulk of the
> >>>>> copy.  So how do we get there?
> >>>>>
> >>>>> I think we start with allocating the vfio_dma bitmap to account for
> >>>>> this initial offset, so we calculate bitmap_base_iova as:
> >>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
> >>>>> We then use bitmap_base_iova in calculating which bits to set.
> >>>>>
> >>>>> The user needs to follow the same rules, and maybe this adds some value
> >>>>> to the user providing the bitmap size rather than the kernel
> >>>>> calculating it.  For example, if the user wanted the dirty bitmap for
> >>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> >>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> >>>>>
> >>>>> Effectively the user can ask for any iova range, but the buffer will be
> >>>>> filled relative to the zeroth bit of the bitmap following the above
> >>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> >>>>> requested pgsize).  I'm tempted to make this explicit in the user
> >>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
> >>>>> user is able to map and unmap single pages and we need to support
> >>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
> >>>>>         
> >>>>
> >>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
> >>>> this.  
> >>>
> >>> How does KVM do this?  My intent was that if all of our bitmaps share
> >>> the same alignment then we can merge the intersection and continue to
> >>> use copy_to_user() on either side.  However, if QEMU doesn't do the
> >>> same, it doesn't really help us.  Is QEMU stuck with an implementation
> >>> of only retrieving dirty bits per MemoryRegionSection exactly because
> >>> of this issue and therefore we can rely on it in our implementation as
> >>> well?  Thanks,
> >>>      
> >>
> >> QEMU sync dirty_bitmap per MemoryRegionSection. Within
> >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> >> dirty_bitmap per KVMSlot and mark dirty for each KVMSlot.
> >> On kernel side, KVM_GET_DIRTY_LOG ioctl calls
> >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap
> >> of that memSlot.
> >> vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection
> >> in our implementation. But to get bitmap during unmap, we have to take
> >> care of concatenating bitmaps.  
> > 
> > So KVM does not worry about bitmap alignment because the interface is
> > based on slots, a dirty bitmap can only be retrieved for a single,
> > entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> > spanning multiple vfio_dmas, but maybe we have some leeway that we
> > don't need to support both multiple vfio_dmas and dirty bitmap at the
> > same time.  It seems like it would be a massive simplification if we
> > required an unmap with dirty bitmap to span exactly one vfio_dma,
> > right?   
> 
> Yes.
> 
> > I don't see that we'd break any existing users with that, it's
> > unfortunate that we can't have the flexibility of the existing calling
> > convention, but I think there's good reason for it here.  Our separate
> > dirty bitmap log reporting would follow the same semantics.  I think
> > this all aligns with how the MemoryListener works in QEMU right now,
> > correct?  For example we wouldn't need any extra per MAP_DMA tracking
> > in QEMU like KVM has for its slots.
> >   
> 
> That right.
> Should we go ahead with the implementation to get dirty bitmap for one 
> vfio_dma for GET_DIRTY ioctl and unmap with dirty ioctl? Accordingly we 
> can have sanity checks in these ioctls.

Yes, I'm convinced that bitmap alignment is sufficiently too difficult
and unnecessary to restrict the calling convention of UNMAP_DMA, when
using the dirty bitmap extension, to exactly unmap a single vfio_dma.
Thanks,

Alex

WARNING: multiple messages have this Message-ID (diff)

From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Zhengxiao.zx@Alibaba-inc.com, kevin.tian@intel.com,
	yi.l.liu@intel.com, cjia@nvidia.com, kvm@vger.kernel.org,
	eskultet@redhat.com, ziye.yang@intel.com, qemu-devel@nongnu.org,
	cohuck@redhat.com, shuangtai.tst@alibaba-inc.com,
	dgilbert@redhat.com, zhi.a.wang@intel.com, mlevitsk@redhat.com,
	pasic@linux.ibm.com, aik@ozlabs.ru, eauger@redhat.com,
	felipe@nutanix.com, jonathan.davies@nutanix.com,
	yan.y.zhao@intel.com, changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
Date: Tue, 18 Feb 2020 21:53:30 -0700	[thread overview]
Message-ID: <20200218215330.5bc8fc6a@w520.home> (raw)
In-Reply-To: <96d9990f-b27f-759e-1395-9b2eff218bfa@nvidia.com>

On Wed, 19 Feb 2020 09:51:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/19/2020 3:11 AM, Alex Williamson wrote:
> > On Tue, 18 Feb 2020 11:28:53 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> <snip>
> >>  
> >>>>>>>     As I understand the above algorithm, we find a vfio_dma
> >>>>>>> overlapping the request and populate the bitmap for that range.  Then
> >>>>>>> we go back and put_user() for each byte that we touched.  We could
> >>>>>>> instead simply work on a one byte buffer as we enumerate the requested
> >>>>>>> range and do a put_user() ever time we reach the end of it and have bits
> >>>>>>> set. That would greatly simplify the above example.  But I would expect
> >>>>>>> that we're a) more likely to get asked for ranges covering a single
> >>>>>>> vfio_dma  
> >>>>>>
> >>>>>> QEMU ask for single vfio_dma during each iteration.
> >>>>>>
> >>>>>> If we restrict this ABI to cover single vfio_dma only, then it
> >>>>>> simplifies the logic here. That was my original suggestion. Should we
> >>>>>> think about that again?  
> >>>>>
> >>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
> >>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
> >>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
> >>>>> also reluctant to design an ABI around what happens to be the current
> >>>>> QEMU implementation.
> >>>>>
> >>>>> If we take your example above, ranges {0x0000,0xa000} and
> >>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>>>> following two bitmaps in this implementation:
> >>>>>
> >>>>> 00000011 11111111b
> >>>>> 00111111b
> >>>>>
> >>>>> And we need to combine those into:
> >>>>>
> >>>>> 11111111 11111111b
> >>>>>
> >>>>> Right?
> >>>>>
> >>>>> But it seems like that would be easier if the second bitmap was instead:
> >>>>>
> >>>>> 11111100b
> >>>>>
> >>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
> >>>>> the bit offset within the byte, which limits our fixes to the boundary
> >>>>> byte and allows us to use copy_to_user() directly for the bulk of the
> >>>>> copy.  So how do we get there?
> >>>>>
> >>>>> I think we start with allocating the vfio_dma bitmap to account for
> >>>>> this initial offset, so we calculate bitmap_base_iova as:
> >>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
> >>>>> We then use bitmap_base_iova in calculating which bits to set.
> >>>>>
> >>>>> The user needs to follow the same rules, and maybe this adds some value
> >>>>> to the user providing the bitmap size rather than the kernel
> >>>>> calculating it.  For example, if the user wanted the dirty bitmap for
> >>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> >>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> >>>>>
> >>>>> Effectively the user can ask for any iova range, but the buffer will be
> >>>>> filled relative to the zeroth bit of the bitmap following the above
> >>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> >>>>> requested pgsize).  I'm tempted to make this explicit in the user
> >>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
> >>>>> user is able to map and unmap single pages and we need to support
> >>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
> >>>>>         
> >>>>
> >>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
> >>>> this.  
> >>>
> >>> How does KVM do this?  My intent was that if all of our bitmaps share
> >>> the same alignment then we can merge the intersection and continue to
> >>> use copy_to_user() on either side.  However, if QEMU doesn't do the
> >>> same, it doesn't really help us.  Is QEMU stuck with an implementation
> >>> of only retrieving dirty bits per MemoryRegionSection exactly because
> >>> of this issue and therefore we can rely on it in our implementation as
> >>> well?  Thanks,
> >>>      
> >>
> >> QEMU sync dirty_bitmap per MemoryRegionSection. Within
> >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> >> dirty_bitmap per KVMSlot and mark dirty for each KVMSlot.
> >> On kernel side, KVM_GET_DIRTY_LOG ioctl calls
> >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap
> >> of that memSlot.
> >> vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection
> >> in our implementation. But to get bitmap during unmap, we have to take
> >> care of concatenating bitmaps.  
> > 
> > So KVM does not worry about bitmap alignment because the interface is
> > based on slots, a dirty bitmap can only be retrieved for a single,
> > entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> > spanning multiple vfio_dmas, but maybe we have some leeway that we
> > don't need to support both multiple vfio_dmas and dirty bitmap at the
> > same time.  It seems like it would be a massive simplification if we
> > required an unmap with dirty bitmap to span exactly one vfio_dma,
> > right?   
> 
> Yes.
> 
> > I don't see that we'd break any existing users with that, it's
> > unfortunate that we can't have the flexibility of the existing calling
> > convention, but I think there's good reason for it here.  Our separate
> > dirty bitmap log reporting would follow the same semantics.  I think
> > this all aligns with how the MemoryListener works in QEMU right now,
> > correct?  For example we wouldn't need any extra per MAP_DMA tracking
> > in QEMU like KVM has for its slots.
> >   
> 
> That right.
> Should we go ahead with the implementation to get dirty bitmap for one 
> vfio_dma for GET_DIRTY ioctl and unmap with dirty ioctl? Accordingly we 
> can have sanity checks in these ioctls.

Yes, I'm convinced that bitmap alignment is sufficiently too difficult
and unnecessary to restrict the calling convention of UNMAP_DMA, when
using the dirty bitmap extension, to exactly unmap a single vfio_dma.
Thanks,

Alex

next prev parent reply	other threads:[~2020-02-19  4:53 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-07 19:42 [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-02-07 19:42 ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 17:25   ` Alex Williamson
2020-02-10 17:25     ` Alex Williamson
2020-02-12 20:56     ` Kirti Wankhede
2020-02-12 20:56       ` Kirti Wankhede
2020-02-14 10:21   ` Cornelia Huck
2020-02-14 10:21     ` Cornelia Huck
2020-02-27  8:58   ` Yan Zhao
2020-02-27  8:58     ` Yan Zhao
2020-02-07 19:42 ` [PATCH v12 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to " Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10  9:49   ` Yan Zhao
2020-02-10  9:49     ` Yan Zhao
2020-02-10 19:44     ` Alex Williamson
2020-02-10 19:44       ` Alex Williamson
2020-02-11  2:52       ` Yan Zhao
2020-02-11  2:52         ` Yan Zhao
2020-02-11  3:45         ` Alex Williamson
2020-02-11  3:45           ` Alex Williamson
2020-02-11  4:11           ` Yan Zhao
2020-02-11  4:11             ` Yan Zhao
2020-02-10 17:25   ` Alex Williamson
2020-02-10 17:25     ` Alex Williamson
2020-02-12 20:56     ` Kirti Wankhede
2020-02-12 20:56       ` Kirti Wankhede
2020-02-12 23:13       ` Alex Williamson
2020-02-12 23:13         ` Alex Williamson
2020-02-13 20:11         ` Kirti Wankhede
2020-02-13 20:11           ` Kirti Wankhede
2020-02-13 23:20           ` Alex Williamson
2020-02-13 23:20             ` Alex Williamson
2020-02-17 19:13             ` Kirti Wankhede
2020-02-17 19:13               ` Kirti Wankhede
2020-02-17 20:55               ` Alex Williamson
2020-02-17 20:55                 ` Alex Williamson
2020-02-18  5:58                 ` Kirti Wankhede
2020-02-18  5:58                   ` Kirti Wankhede
2020-02-18 21:41                   ` Alex Williamson
2020-02-18 21:41                     ` Alex Williamson
2020-02-19  4:21                     ` Kirti Wankhede
2020-02-19  4:21                       ` Kirti Wankhede
2020-02-19  4:53                       ` Alex Williamson [this message]
2020-02-19  4:53                         ` Alex Williamson
2020-02-07 19:42 ` [PATCH v12 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 17:48   ` Alex Williamson
2020-02-10 17:48     ` Alex Williamson
2020-02-07 19:42 ` [PATCH v12 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-07 19:42 ` [PATCH v12 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
2020-02-07 19:42   ` Kirti Wankhede
2020-02-10 18:14   ` Alex Williamson
2020-02-10 18:14     ` Alex Williamson
2020-03-09  7:46 ` [PATCH v12 Kernel 0/7] KABIs to support migration for VFIO devices Zengtao (B)
2020-03-09  7:46   ` Zengtao (B)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200218215330.5bc8fc6a@w520.home \
    --to=alex.williamson@redhat.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@Alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.