qemu-devel.nongnu.org archive mirror
From: "Daniel P. Berrange" <berrange@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel@nongnu.org, Jitendra Kolhe <jitendra.kolhe@hpe.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Michal Privoznik <mprivozn@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] os: don't corrupt pre-existing memory-backend data with prealloc
Date: Mon, 27 Feb 2017 13:58:35 +0000	[thread overview]
Message-ID: <20170227135835.GN18219@redhat.com>
In-Reply-To: <1488203170.4736.24.camel@redhat.com>

On Mon, Feb 27, 2017 at 08:46:10AM -0500, Rik van Riel wrote:
> On Mon, 2017-02-27 at 11:10 +0000, Stefan Hajnoczi wrote:
> > On Thu, Feb 23, 2017 at 10:59:22AM +0000, Daniel P. Berrange wrote:
> > > When using a memory-backend object with prealloc turned on, QEMU
> > > will memset() the first byte in every memory page to zero. While
> > > this might have been acceptable for memory backends associated
> > > with RAM, this corrupts application data for NVDIMMs.
> > > 
> > > Instead of setting every page to zero, read the current byte
> > > value and then just write that same value back, so we are not
> > > corrupting the original data.
> > > 
> > > Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> > > ---
> > > 
> > > I'm unclear if this is actually still safe in practice ? Is the
> > > compiler permitted to optimize away the read+write since it doesn't
> > > change the memory value. I'd hope not, but I've been surprised
> > > before...
> > > 
> > > IMHO this is another factor in favour of requesting an API from
> > > the kernel to provide the prealloc behaviour we want.
> > > 
> > >  util/oslib-posix.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > > index 35012b9..8f5b656 100644
> > > --- a/util/oslib-posix.c
> > > +++ b/util/oslib-posix.c
> > > @@ -355,7 +355,8 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
> > >  
> > >          /* MAP_POPULATE silently ignores failures */
> > >          for (i = 0; i < numpages; i++) {
> > > -            memset(area + (hpagesize * i), 0, 1);
> > > +            char val = *(area + (hpagesize * i));
> > > +            memset(area + (hpagesize * i), val, 1);
> > 
> > Please include a comment in the final patch explaining why we want to
> > preserve memory contents.
> > 
> > In the case of NVDIMM I'm not sure if the memset is needed at all.
> > The memory already exists - no new pages need to be allocated by the
> > kernel. We just want the page table entries to be populated for the
> > NVDIMM when -mem-prealloc is used.
> > 
> > Perhaps Andrea or Rik have ideas on improving the kernel interface and
> > whether mmap(MAP_POPULATE) should be used with NVDIMM instead of this
> > userspace "touch every page" workaround?
> 
> Why do we need the page table entries to be populated
> in advance at all?

Preallocation is a choice apps using QEMU make - they can tell QEMU
whether they want prealloc or not. If they decide they do want it,
then we should not be corrupting the data.

> The high cost of the page fault for regular memory
> is zeroing out the memory pages before we give them
> to userspace.

NVDIMM in the guest might be backed by regular memory in the host - QEMU
doesn't require use of NVDIMM in the host.

> Simply faulting in the NVDIMM memory as it is touched
> may make more sense than treating it like DRAM,
> especially given that with DAX, NVDIMM areas may be
> orders of magnitude larger than RAM, and we really
> do not want to set up all the page tables for every
> part of the guest DAX "disk".



Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|


Thread overview: 16+ messages
2017-02-23 10:59 [Qemu-devel] [PATCH] os: don't corrupt pre-existing memory-backend data with prealloc Daniel P. Berrange
2017-02-23 12:05 ` Michal Privoznik
2017-02-23 12:07   ` Daniel P. Berrange
2017-02-24  9:05     ` Michal Privoznik
2017-02-24  9:24       ` Daniel P. Berrange
2017-02-24 12:12         ` Dr. David Alan Gilbert
2017-02-24 12:18           ` Paolo Bonzini
2017-02-27 11:10 ` Stefan Hajnoczi
2017-02-27 13:46   ` Rik van Riel
2017-02-27 13:58     ` Daniel P. Berrange [this message]
  -- strict thread matches above, loose matches on Subject: below --
2017-02-24 17:27 Daniel P. Berrange
2017-02-24 17:33 ` no-reply
2017-02-27  9:25   ` Daniel P. Berrange
2017-02-24 19:04 ` Eric Blake
2017-02-27 13:28 ` Stefan Hajnoczi
2017-02-27 15:53 ` Andrea Arcangeli
