From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>, dhildenb@redhat.com
Cc: David CARLIER <devnexen@gmail.com>,
qemu-devel <qemu-devel@nongnu.org>,
QEMU Trivial <qemu-trivial@nongnu.org>,
Paolo Bonzini <pbonzini@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH 2/3] exec: posix_madvise usage on SunOS.
Date: Mon, 20 Jul 2020 20:13:18 +0100 [thread overview]
Message-ID: <20200720191318.GM2642@work-vm> (raw)
In-Reply-To: <CAFEAcA96mh_4EkKz31HgzfPOEQvhta8VTcvMV=An8Us0+x=NfQ@mail.gmail.com>
(Copies in Dave Hildenbrand)
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
> >
> > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> > From: David Carlier <devnexen@gmail.com>
> > Date: Sat, 18 Jul 2020 13:29:44 +0100
> > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
> >
> > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> > is not accessible thus using posix_madvise here.
> >
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> > exec.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/exec.c b/exec.c
> > index 6f381f98e2..0466a75b89 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> > uint64_t start, size_t length)
> > * fallocate'd away).
> > */
> > #if defined(CONFIG_MADVISE)
> > +#if !defined(CONFIG_SOLARIS)
> > ret = madvise(host_startaddr, length, MADV_DONTNEED);
> > +#else
> > + /*
> > + * mmap and its caddr_t based api is not accessible
> > + * with _XOPEN_SOURCE set on illumos
> > + */
> > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> > +#endif
>
> Hi. I'm not sure this patch will do the right thing, because
> I don't think that Solaris's POSIX_MADV_DONTNEED provides
> the semantics that this QEMU function says it needs. The
> comment at the top of the function says:
>
> * Unmap pages of memory from start to start+length such that
> * they a) read as 0, b) Trigger whatever fault mechanism
> * the OS provides for postcopy.
> * The pages must be unmapped by the end of the function.
This code has moved around a bit over it's life; joining the case
needed by balloon and the case needed by postcopy.
> (Aside: the use of 'unmap' in this comment is a bit confusing,
> because it clearly doesn't mean 'unmap' if it wants read-as-0.
> And the reference to faults on postcopy is incomprehensible
> to me: if memory is read-as-0 it isn't going to fault.)
I think because internally to Linux the behaviour is the same;
this causes the mapping to disappear from the TLB so it faults;
normally when reading the kernel resolves the fault and puts
a read-as-zero page there, except if userfault was enabled
for postcopy, in which case it gives us a kick and we service it.
> Linux's madvise(MADV_DONTNEED) does guarantee us this
> read-as-zero behaviour. (It's a silly API choice that Linux
> put this behaviour behind madvise, which is supposed to be
> merely advisory, but that's how it is.)
Yes, I don't think there's any equivalent to madvise
that guarantees anything.
> The Solaris
> posix_madvise() manpage says it is merely advisory and
> doesn't affect the behaviour of accesses to the memory.
>
> If posix_madvise() behaviour was OK in this function, the
> right way to fix this would be to use qemu_madvise()
> instead, which already provides this "if host has
> madvise(), use it, otherwise use posix_madvise()" logic.
> But I suspect that the direct madvise() here is deliberate.
Yes, but I can't remember the semantics fully - I think it was because
we needed the guarantee at this point (and even Linux's
posix madvise did something different??)
I've got a note saying we didn't want to use
qemu_madvise because we wanted to be sure we didn't get
posix_madvise.
> Side note: not sure the current code is correct for the
> BSDs either -- they have madvise() but don't provide
> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
> So we should probably be doing something else there, and
> whatever that something-else is is probably also what
> Solaris wants.
>
> We use ram_block_discard_range() only in migration and
> in virtio-balloon and virtio-mem; I've cc'd some people
> who hopefully understand what the requirements on this
> function are and might have a view on what the not-Linux
> implementation should look like. (David Gilbert: git
> blame says you wrote this code :-))
Dave
>
> thanks
> -- PMM
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
WARNING: multiple messages have this Message-ID (diff)
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>, dhildenb@redhat.com
Cc: QEMU Trivial <qemu-trivial@nongnu.org>,
Paolo Bonzini <pbonzini@redhat.com>,
David CARLIER <devnexen@gmail.com>,
qemu-devel <qemu-devel@nongnu.org>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH 2/3] exec: posix_madvise usage on SunOS.
Date: Mon, 20 Jul 2020 20:13:18 +0100 [thread overview]
Message-ID: <20200720191318.GM2642@work-vm> (raw)
In-Reply-To: <CAFEAcA96mh_4EkKz31HgzfPOEQvhta8VTcvMV=An8Us0+x=NfQ@mail.gmail.com>
(Copies in Dave Hildenbrand)
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
> >
> > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> > From: David Carlier <devnexen@gmail.com>
> > Date: Sat, 18 Jul 2020 13:29:44 +0100
> > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
> >
> > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> > is not accessible thus using posix_madvise here.
> >
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> > exec.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/exec.c b/exec.c
> > index 6f381f98e2..0466a75b89 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> > uint64_t start, size_t length)
> > * fallocate'd away).
> > */
> > #if defined(CONFIG_MADVISE)
> > +#if !defined(CONFIG_SOLARIS)
> > ret = madvise(host_startaddr, length, MADV_DONTNEED);
> > +#else
> > + /*
> > + * mmap and its caddr_t based api is not accessible
> > + * with _XOPEN_SOURCE set on illumos
> > + */
> > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> > +#endif
>
> Hi. I'm not sure this patch will do the right thing, because
> I don't think that Solaris's POSIX_MADV_DONTNEED provides
> the semantics that this QEMU function says it needs. The
> comment at the top of the function says:
>
> * Unmap pages of memory from start to start+length such that
> * they a) read as 0, b) Trigger whatever fault mechanism
> * the OS provides for postcopy.
> * The pages must be unmapped by the end of the function.
This code has moved around a bit over it's life; joining the case
needed by balloon and the case needed by postcopy.
> (Aside: the use of 'unmap' in this comment is a bit confusing,
> because it clearly doesn't mean 'unmap' if it wants read-as-0.
> And the reference to faults on postcopy is incomprehensible
> to me: if memory is read-as-0 it isn't going to fault.)
I think because internally to Linux the behaviour is the same;
this causes the mapping to disappear from the TLB so it faults;
normally when reading the kernel resolves the fault and puts
a read-as-zero page there, except if userfault was enabled
for postcopy, in which case it gives us a kick and we service it.
> Linux's madvise(MADV_DONTNEED) does guarantee us this
> read-as-zero behaviour. (It's a silly API choice that Linux
> put this behaviour behind madvise, which is supposed to be
> merely advisory, but that's how it is.)
Yes, I don't think there's any equivalent to madvise
that guarantees anything.
> The Solaris
> posix_madvise() manpage says it is merely advisory and
> doesn't affect the behaviour of accesses to the memory.
>
> If posix_madvise() behaviour was OK in this function, the
> right way to fix this would be to use qemu_madvise()
> instead, which already provides this "if host has
> madvise(), use it, otherwise use posix_madvise()" logic.
> But I suspect that the direct madvise() here is deliberate.
Yes, but I can't remember the semantics fully - I think it was because
we needed the guarantee at this point (and even Linux's
posix madvise did something different??)
I've got a note saying we didn't want to use
qemu_madvise because we wanted to be sure we didn't get
posix_madvise.
> Side note: not sure the current code is correct for the
> BSDs either -- they have madvise() but don't provide
> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
> So we should probably be doing something else there, and
> whatever that something-else is is probably also what
> Solaris wants.
>
> We use ram_block_discard_range() only in migration and
> in virtio-balloon and virtio-mem; I've cc'd some people
> who hopefully understand what the requirements on this
> function are and might have a view on what the not-Linux
> implementation should look like. (David Gilbert: git
> blame says you wrote this code :-))
Dave
>
> thanks
> -- PMM
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next prev parent reply other threads:[~2020-07-20 19:13 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-07-18 13:20 [PATCH 2/3] exec: posix_madvise usage on SunOS David CARLIER
2020-07-18 14:15 ` Peter Maydell
2020-07-18 14:15 ` Peter Maydell
2020-07-18 15:23 ` David CARLIER
2020-07-18 15:23 ` David CARLIER
2020-07-20 19:13 ` Dr. David Alan Gilbert [this message]
2020-07-20 19:13 ` Dr. David Alan Gilbert
2020-07-21 8:22 ` David Hildenbrand
2020-07-21 8:22 ` David Hildenbrand
2020-07-21 9:31 ` Peter Maydell
2020-07-21 9:31 ` Peter Maydell
2020-07-21 9:48 ` David Hildenbrand
2020-07-21 9:48 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200720191318.GM2642@work-vm \
--to=dgilbert@redhat.com \
--cc=devnexen@gmail.com \
--cc=dhildenb@redhat.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-trivial@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.