From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>,
qemu-s390x <qemu-s390x@nongnu.org>,
qemu-devel <qemu-devel@nongnu.org>,
Cornelia Huck <cohuck@redhat.com>,
David Hildenbrand <david@redhat.com>,
Halil Pasic <pasic@linux.vnet.ibm.com>,
Janosch Frank <frankja@linux.vnet.ibm.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] s390/kvm: implement clearing part of IPL clear
Date: Thu, 1 Mar 2018 12:28:55 +0000 [thread overview]
Message-ID: <20180301122854.GD2994@work-vm> (raw)
In-Reply-To: <69654fb2-f5ba-c23b-f6f5-1b559692cf37@de.ibm.com>
* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>
>
> On 03/01/2018 12:45 PM, Dr. David Alan Gilbert wrote:
> > * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> >>
> >>
> >> On 03/01/2018 10:24 AM, Dr. David Alan Gilbert wrote:
> >>> * Thomas Huth (thuth@redhat.com) wrote:
> >>>> On 28.02.2018 20:53, Christian Borntraeger wrote:
> >>>>> When a guests reboots with diagnose 308 subcode 3 it requests the memory
> >>>>> to be cleared. We did not do it so far. This does not only violate the
> >>>>> architecture, it also misses the chance to free up that memory on
> >>>>> reboot, which would help on host memory over commitment. By using
> >>>>> ram_block_discard_range we can cover both cases.
> >>>>
> >>>> Sounds like a good idea. I wonder whether that release_all_ram()
> >>>> function should maybe rather reside in exec.c, so that other machines
> >>>> that want to clear all RAM at reset time can use it, too?
> >>>>
> >>>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >>>>> ---
> >>>>> target/s390x/kvm.c | 19 +++++++++++++++++++
> >>>>> 1 file changed, 19 insertions(+)
> >>>>>
> >>>>> diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
> >>>>> index 8f3a422288..2e145ad5c3 100644
> >>>>> --- a/target/s390x/kvm.c
> >>>>> +++ b/target/s390x/kvm.c
> >>>>> @@ -34,6 +34,8 @@
> >>>>> #include "qapi/error.h"
> >>>>> #include "qemu/error-report.h"
> >>>>> #include "qemu/timer.h"
> >>>>> +#include "qemu/rcu_queue.h"
> >>>>> +#include "sysemu/cpus.h"
> >>>>> #include "sysemu/sysemu.h"
> >>>>> #include "sysemu/hw_accel.h"
> >>>>> #include "hw/boards.h"
> >>>>> @@ -41,6 +43,7 @@
> >>>>> #include "sysemu/device_tree.h"
> >>>>> #include "exec/gdbstub.h"
> >>>>> #include "exec/address-spaces.h"
> >>>>> +#include "exec/ram_addr.h"
> >>>>> #include "trace.h"
> >>>>> #include "qapi-event.h"
> >>>>> #include "hw/s390x/s390-pci-inst.h"
> >>>>> @@ -1841,6 +1844,14 @@ static int kvm_arch_handle_debug_exit(S390CPU *cpu)
> >>>>> return ret;
> >>>>> }
> >>>>>
> >>>>> +static void release_all_rams(void)
> >>>>
> >>>> s/rams/ram/ maybe?
> >>>>
> >>>>> +{
> >>>>> + struct RAMBlock *rb;
> >>>>> +
> >>>>> + QLIST_FOREACH_RCU(rb, &ram_list.blocks, next)
> >>>>> + ram_block_discard_range(rb, 0, rb->used_length);
> >>>>
> >>>> From a coding style point of view, I think there should be curly braces
> >>>> around ram_block_discard_range() ?
> >>>
> >>> I think this might break if it happens during a postcopy migrate.
> >>> The destination CPU is running, so it can do a reboot at just the wrong
> >>> time; and then the pages (that are protected by userfaultfd) would get
> >>> deallocated and trigger userfaultfd requests if accessed.
> >>
> >> Yes, userfaultd/postcopy is really fragile and relies on things that are not
> >> necessarily true (e.g. virito-balloon can also invalidate pages).
> >
> > That's why we use qemu_balloon_inhibit around postcopy to stop
> > ballooning; I'm not aware of anything else that does the same.
>
> we also have at least the pte_unused thing in mm/rmap.c that clearly
> predates userfaultfd. We might need to look into this as well....
I've not come across that; what does that do?
> >
> >> The right thing here would be to actually terminate the postcopy migrate but
> >> return it as "successful" (since we are going to clear that RAM anyway). Do
> >> you see a good way to achieve that?
> >
> > There's no current mechanism to do it; I think it would have to involve
> > some interaction with the source as well though to tell it that you
> > didn't need that area of RAM anyway.
> >
> > However, there are more problems:
> > a) Even forgetting the userfault problem, this is racy since during
> > postcopy you're still receiving blocks from the source at the same time;
> > so some of the area that you've discarded might get overwritten by data
> > from the source.
>
> So how do you handle the case when the target system writes to memory
> that is still in flight? Can we build on that mechanism?
Once we've entered postcopy, a page is basically in one of two states:
a) Not yet received - i.e. marked absent with MADV_DONTNEED; if the
guest tries to write to it then it'll block with userfault and ask the
source for the page; so the write wont happen until the page arrives.
b) Received - we've already got the page from the source; the source
never resends a page (once in postcopy) so now the destination can just
write to the page.
Once in postcopy, a page is received at most once (i.e. if it's not
been received during precopy).
I can imagine two ways of curing it:
a) Simple but slow; just read all the pages before doing the
discard, this forces it to wait for the pages to be received.
b) More complex but fast; Add a message on the return path to the
source telling it that you're going to discard a range; the source then
marks it's notes as cleared for those pages and then sends some form of
ack, and at that point you drop it.
A 3rd; incomplete way; would be just to drop the userfaultfd on the
destination for the RAMBlocks that are being cleared; but this does
leave the source state in a bit of a mess.
> > b) Your release_all_rams seems to do all RAM Blocks - won't that nuke
> > any ROMs as well? Or maybe even flash?
>
> ROMs loaded with load_elf (like our s390-ccw.img) are reloaded on every reset.
> See rom_reset in /hw/core/loader.c
Ah, so this is happening after your reset code you've added?
> Is this different with the x86 bios?
Not sure; I know x86 keeps some mirrored copies of ROMs across
reboots but I don't fully understand the mechanisms we use.
But the other case I was thinking of was stuff like pflash on x86 which
are the flash images holding variable data.
(Also watch out for the way ram_block_discard_range deals with file
backed memory; discarding is actually quite hard in some cases).
> > c) In a normal precopy migration, I think you may also get old data;
> > Paolo said that an MADV_DONTNEED won't cause the dirty flags to be set,
> > so if the migrate has already sent the data for a page, and then this
> > happens, before the CPUs are stopped during the migration, when you
> > restart on the destination you'll have the old data.
>
> Yes, looks like we might get non-cleared data. Could we maybe combine fixing
> and optimizing: we can stop tranmitting the memory and do a clean
> startup on the target side. In other words could we actually use the
> reset clear trigger to speed up migration?
They're separate problems because they happen on opposite sides; on
the source you've got a chance of doing that type of hack, but it would
be a bit invasive.
Dave
>
>
>
> >
> > Dave
> >
> >>
> >>>
> >>> Dave
> >>>
> >>>>> +}
> >>>>> +
> >>>>> int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
> >>>>> {
> >>>>> S390CPU *cpu = S390_CPU(cs);
> >>>>> @@ -1853,6 +1864,14 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
> >>>>> ret = handle_intercept(cpu);
> >>>>> break;
> >>>>> case KVM_EXIT_S390_RESET:
> >>>>> + if (run->s390_reset_flags & KVM_S390_RESET_CLEAR) {
> >>>>> + /*
> >>>>> + * We will stop other CPUs anyway, avoid spurious crashes and
> >>>>> + * get all CPUs out. The reset will take care of the resume.
> >>>>> + */
> >>>>> + pause_all_vcpus();
> >>>>> + release_all_rams();
> >>>>> + }
> >>>>> s390_reipl_request();
> >>>>> break;
> >>>>> case KVM_EXIT_S390_TSCH:
> >>>>>
> >>>>
> >>>> Apart from the cosmetic nits, patch looks good to me.
> >>>>
> >>>> Thomas
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next prev parent reply other threads:[~2018-03-01 12:29 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-28 19:53 [Qemu-devel] [PATCH 1/1] s390/kvm: implement clearing part of IPL clear Christian Borntraeger
2018-03-01 3:58 ` Thomas Huth
2018-03-01 7:37 ` Christian Borntraeger
2018-03-01 8:44 ` Paolo Bonzini
2018-03-01 9:24 ` Dr. David Alan Gilbert
2018-03-01 11:00 ` Christian Borntraeger
2018-03-01 11:45 ` Dr. David Alan Gilbert
2018-03-01 12:08 ` Christian Borntraeger
2018-03-01 12:28 ` Dr. David Alan Gilbert [this message]
2018-03-01 12:35 ` Christian Borntraeger
2018-03-01 12:39 ` Christian Borntraeger
2018-03-01 12:58 ` Dr. David Alan Gilbert
2018-03-01 12:49 ` Dr. David Alan Gilbert
2018-03-01 9:21 ` David Hildenbrand
2018-03-05 12:54 ` Cornelia Huck
2018-03-05 13:04 ` Christian Borntraeger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180301122854.GD2994@work-vm \
--to=dgilbert@redhat.com \
--cc=borntraeger@de.ibm.com \
--cc=cohuck@redhat.com \
--cc=david@redhat.com \
--cc=frankja@linux.vnet.ibm.com \
--cc=pasic@linux.vnet.ibm.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-s390x@nongnu.org \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).