From: Peter Xu <peterx@redhat.com>
To: Leonardo Bras Soares Passos <lsoaresp@redhat.com>
Cc: "Emanuele Giuseppe Esposito" <eesposit@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Cornelia Huck" <cohuck@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Philippe Mathieu-Daudé" <f4bug@amsat.org>,
	"Maxim Levitsky" <mlevitsk@redhat.com>,
	kvm@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
Date: Mon, 22 Aug 2022 10:10:56 -0400
Message-ID: <YwOOcC72KKABKgU+@xz-m1.local>
In-Reply-To: <CAJ6HWG6maoPjbP8T5qo=iXCbNeHu4dq3wHLKtRLahYKuJmMY-g@mail.gmail.com>

On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > > +                                       struct kvm_userspace_memory_region *mem)
> > > +{
> > > +    MemoryRegionNode *node;
> > > +
> > > +    node = g_malloc(sizeof(MemoryRegionNode));
> > > +    *node = (MemoryRegionNode) {
> > > +        .mem = mem,
> > > +    };
> >
> > Nit: direct assignment of the struct looks okay, but maybe pointer
> > assignment is clearer (with g_malloc0?  Or iirc we're advised to always
> > use g_new0):
> >
> >   node = g_new0(MemoryRegionNode, 1);
> >   node->mem = mem;
> >
> > [...]
> >
> > > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > > +struct kvm_userspace_memory_region_list {
> > > +     __u32 nent;
> > > +     __u32 flags;
> > > +     struct kvm_userspace_memory_region entries[0];
> > > +};
> > > +
> > >  /*
> > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > >   * other bits are reserved for kvm internal use which are defined in
> > > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> > >                                       struct kvm_userspace_memory_region)
> > >  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> > >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > > +                                     struct kvm_userspace_memory_region_list)
> >
> > I think this is probably good enough, but let me provide one other small
> > (though maybe not important) piece of the puzzle here.  I wanted to think
> > it through to understand it better, but I never did..
> >
> > For a quick look, please read the comment in kvm_set_phys_mem().
> >
> >                 /*
> >                  * NOTE: We should be aware of the fact that here we're only
> >                  * doing a best effort to sync dirty bits.  No matter whether
> >                  * we're using dirty log or dirty ring, we ignored two facts:
> >                  *
> >                  * (1) dirty bits can reside in hardware buffers (PML)
> >                  *
> >                  * (2) after we collected dirty bits here, pages can be dirtied
> >                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
> >                  * remove the slot.
> >                  *
> >                  * Not easy.  Let's cross the fingers until it's fixed.
> >                  */
> >
> > One example is if we have 16G mem, we enable dirty tracking and we punch a
> > hole of 1G at offset 1G, it'll change from this:
> >
> >                      (a)
> >   |----------------- 16G -------------------|
> >
> > To this:
> >
> >      (b)    (c)              (d)
> >   |--1G--|XXXXXX|------------14G------------|
> >
> > Here (c) will be a 1G hole.
> >
> > With the current code, the hole punching will delete region (a) and add
> > back regions (b) and (d).  With the new _LIST ioctl it'll be atomic and
> > nicer.
> >
> > The question here is that with dirty tracking enabled, each region has a
> > dirty bitmap.  Currently we make a best effort with the sequence below:
> >
> >   (1) fetching dirty bmap of (a)
> >   (2) delete region (a)
> >   (3) add region (b) (d)
> >
> > Here (a)'s dirty bmap is mostly preserved on a best-effort basis, but we'll
> > still lose dirty pages written between steps (1) and (2) (and actually, if
> > the write comes between (2) and (3), I think it'll crash qemu, and iiuc
> > that's what we're going to fix..).
> >
> > So ideally the atomic op can be:
> >
> >   "atomically fetch dirty bmap for removed regions, remove regions, and add
> >    new regions"
> >
> > Rather than only:
> >
> >   "atomically remove regions, and add new regions"
> >
> > which is what the new _LIST ioctl does.
> >
> > But... maybe that's not a real problem; at least I don't know of any report
> > showing an issue with the current code caused by losing dirty bits between
> > steps (1) and (2).  Nor do I know how to trigger an issue with it.
> >
> > I'm still providing this information so that you're aware of the problem
> > too.  Meanwhile, when proposing the new ioctl change for qemu we should
> > also keep in mind that we shouldn't easily lose the dirty bmap of (a)
> > here, and I think this patch does the right thing in that respect.
> >
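To make the hole-punch example above concrete, here is a rough sketch of how
the three entries could be laid out for the proposed _LIST ioctl.  It only
builds the list in memory and issues no ioctl.  The struct names `region` and
`region_list` and the helper `build_hole_punch_list()` are hypothetical local
mirrors, not part of the patch; the slot numbers are arbitrary; and it assumes
the list variant keeps the existing KVM_SET_USER_MEMORY_REGION convention that
memory_size == 0 deletes a slot.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GiB (1ULL << 30)

/* Hypothetical local mirror of struct kvm_userspace_memory_region. */
struct region {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;       /* assumed: 0 deletes the slot */
    uint64_t userspace_addr;
};

/* Hypothetical local mirror of the proposed list structure. */
struct region_list {
    uint32_t nent;
    uint32_t flags;
    struct region entries[];
};

/*
 * Build the list for punching a 1G hole at offset 1G into a 16G region:
 * delete (a), then add (b) = [0, 1G) and (d) = [2G, 16G).
 */
static struct region_list *build_hole_punch_list(uint64_t hva_base)
{
    const uint32_t nent = 3;
    struct region_list *list;

    list = calloc(1, sizeof(*list) + nent * sizeof(list->entries[0]));
    assert(list);
    list->nent = nent;

    /* Delete (a): slot 0 with a zero size. */
    list->entries[0] = (struct region) {
        .slot = 0,
        .guest_phys_addr = 0,
        .memory_size = 0,
    };
    /* Add (b): the first 1G. */
    list->entries[1] = (struct region) {
        .slot = 1,
        .guest_phys_addr = 0,
        .memory_size = 1 * GiB,
        .userspace_addr = hva_base,
    };
    /* Add (d): the remaining 14G after the hole (c). */
    list->entries[2] = (struct region) {
        .slot = 2,
        .guest_phys_addr = 2 * GiB,
        .memory_size = 14 * GiB,
        .userspace_addr = hva_base + 2 * GiB,
    };

    return list;
}
```

The caller owns the returned list and releases it with free().  Note that
nothing here carries (a)'s dirty bitmap over, which is exactly the gap
described above.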
> 
> Thanks for bringing up these details, Peter!
> 
> What do you think of adding the following step?
> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
> dirty bitmaps.

Sounds good to me, but it may not cover the dirty ring?  Maybe we could move
on with the simple but clean scheme first, and think about a comprehensive
option only if it turns out to be really necessary.  The worst case is that
we need one more kvm cap, but we should still have enough.

Thanks,

-- 
Peter Xu

