From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>, dhildenb@redhat.com
Cc: QEMU Trivial <qemu-trivial@nongnu.org>,
Paolo Bonzini <pbonzini@redhat.com>,
David CARLIER <devnexen@gmail.com>,
qemu-devel <qemu-devel@nongnu.org>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH 2/3] exec: posix_madvise usage on SunOS.
Date: Mon, 20 Jul 2020 20:13:18 +0100 [thread overview]
Message-ID: <20200720191318.GM2642@work-vm> (raw)
In-Reply-To: <CAFEAcA96mh_4EkKz31HgzfPOEQvhta8VTcvMV=An8Us0+x=NfQ@mail.gmail.com>
(Copies in Dave Hildenbrand)
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
> >
> > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> > From: David Carlier <devnexen@gmail.com>
> > Date: Sat, 18 Jul 2020 13:29:44 +0100
> > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
> >
> > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> > is not accessible thus using posix_madvise here.
> >
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> > exec.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/exec.c b/exec.c
> > index 6f381f98e2..0466a75b89 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> > uint64_t start, size_t length)
> > * fallocate'd away).
> > */
> > #if defined(CONFIG_MADVISE)
> > +#if !defined(CONFIG_SOLARIS)
> > ret = madvise(host_startaddr, length, MADV_DONTNEED);
> > +#else
> > + /*
> > + * mmap and its caddr_t based api is not accessible
> > + * with _XOPEN_SOURCE set on illumos
> > + */
> > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> > +#endif
>
> Hi. I'm not sure this patch will do the right thing, because
> I don't think that Solaris's POSIX_MADV_DONTNEED provides
> the semantics that this QEMU function says it needs. The
> comment at the top of the function says:
>
> * Unmap pages of memory from start to start+length such that
> * they a) read as 0, b) Trigger whatever fault mechanism
> * the OS provides for postcopy.
> * The pages must be unmapped by the end of the function.
This code has moved around a bit over it's life; joining the case
needed by balloon and the case needed by postcopy.
> (Aside: the use of 'unmap' in this comment is a bit confusing,
> because it clearly doesn't mean 'unmap' if it wants read-as-0.
> And the reference to faults on postcopy is incomprehensible
> to me: if memory is read-as-0 it isn't going to fault.)
I think because internally to Linux the behaviour is the same;
this causes the mapping to disappear from the TLB so it faults;
normally when reading the kernel resolves the fault and puts
a read-as-zero page there, except if userfault was enabled
for postcopy, in which case it gives us a kick and we service it.
> Linux's madvise(MADV_DONTNEED) does guarantee us this
> read-as-zero behaviour. (It's a silly API choice that Linux
> put this behaviour behind madvise, which is supposed to be
> merely advisory, but that's how it is.)
Yes, I don't think there's any equivalent to madvise
that guarantees anything.
> The Solaris
> posix_madvise() manpage says it is merely advisory and
> doesn't affect the behaviour of accesses to the memory.
>
> If posix_madvise() behaviour was OK in this function, the
> right way to fix this would be to use qemu_madvise()
> instead, which already provides this "if host has
> madvise(), use it, otherwise use posix_madvise()" logic.
> But I suspect that the direct madvise() here is deliberate.
Yes, but I can't remember the semantics fully - I think it was because
we needed the guarantee at this point (and even Linux's
posix madvise did something different??)
I've got a note saying we didn't want to use
qemu_madvise because we wanted to be sure we didn't get
posix_madvise.
> Side note: not sure the current code is correct for the
> BSDs either -- they have madvise() but don't provide
> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
> So we should probably be doing something else there, and
> whatever that something-else is is probably also what
> Solaris wants.
>
> We use ram_block_discard_range() only in migration and
> in virtio-balloon and virtio-mem; I've cc'd some people
> who hopefully understand what the requirements on this
> function are and might have a view on what the not-Linux
> implementation should look like. (David Gilbert: git
> blame says you wrote this code :-))
Dave
>
> thanks
> -- PMM
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next prev parent reply other threads:[~2020-07-20 19:14 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-18 13:20 [PATCH 2/3] exec: posix_madvise usage on SunOS David CARLIER
2020-07-18 14:15 ` Peter Maydell
2020-07-18 15:23 ` David CARLIER
2020-07-20 19:13 ` Dr. David Alan Gilbert [this message]
2020-07-21 8:22 ` David Hildenbrand
2020-07-21 9:31 ` Peter Maydell
2020-07-21 9:48 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200720191318.GM2642@work-vm \
--to=dgilbert@redhat.com \
--cc=devnexen@gmail.com \
--cc=dhildenb@redhat.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-trivial@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).