From: Kevin Wolf <kwolf@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>, Pankaj Gupta <pagupta@redhat.com>,
qemu block <qemu-block@nongnu.org>,
qemu-devel@nongnu.org, He Junyan <junyan.he@intel.com>
Subject: Re: [Qemu-devel] [Qemu-block] Some question about savevm/qcow2 incremental snapshot
Date: Fri, 11 May 2018 19:25:31 +0200
Message-ID: <20180511172531.GB5016@localhost.localdomain>
In-Reply-To: <20180510082659.GC1296@stefanha-x1.localdomain>
On 10.05.2018 at 10:26, Stefan Hajnoczi wrote:
> On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > >> On 08.05.2018 at 16:41, Eric Blake wrote:
> > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > >> by a non-raw disk image (such as a qcow2 file representing the
> > >> content of the nvdimm) that supports snapshots.
> > >>
> > >> This part is hard because it requires some completely new
> > >> infrastructure such as mapping clusters of the image file to guest
> > >> pages, and doing cluster allocation (including the copy on write
> > >> logic) by handling guest page faults.
> > >>
> > >> I think it makes sense to invest some effort into such interfaces, but
> > >> be prepared for a long journey.
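
To make the "handling guest page faults" part above a bit more concrete,
here is a minimal, self-contained sketch of one mechanism Linux offers
for this: register the guest-visible region with userfaultfd and, on a
missing-page fault, read the corresponding cluster and populate the page
with UFFDIO_COPY. The file name, the cluster size, the linear offset
mapping and the use of pread() instead of the real block layer are all
assumptions for illustration only; this is not existing QEMU code.

/*
 * Sketch: resolve a missing-page fault by reading one cluster from the
 * image file and copying it into the guest-visible mapping.
 * (May need the vm.unprivileged_userfaultfd sysctl on newer kernels.)
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)        /* assumed qcow2 cluster size */

static void serve_one_fault(int uffd, uint8_t *guest_base, int image_fd)
{
    struct uffd_msg msg;
    static uint8_t buf[CLUSTER_SIZE];

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT) {
        return;
    }

    /* Round the faulting address down to a cluster boundary */
    uintptr_t fault = (uintptr_t)msg.arg.pagefault.address;
    uint64_t cluster_off = (fault - (uintptr_t)guest_base)
                           & ~(uint64_t)(CLUSTER_SIZE - 1);

    /*
     * Stand-in for the block layer: assume a linear mapping between
     * guest offset and image offset. qcow2 would translate through its
     * L1/L2 tables (and allocate on a write fault) instead.
     */
    ssize_t n = pread(image_fd, buf, CLUSTER_SIZE, cluster_off);
    if (n < 0) {
        n = 0;
    }
    memset(buf + n, 0, CLUSTER_SIZE - n);   /* unallocated part reads 0 */

    struct uffdio_copy copy = {
        .dst = (uintptr_t)guest_base + cluster_off,
        .src = (uintptr_t)buf,
        .len = CLUSTER_SIZE,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);
}

int main(void)
{
    int image_fd = open("disk.img", O_RDWR);        /* assumed name */
    size_t size = 16 * 1024 * 1024;
    uint8_t *guest = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)guest, .len = size },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /*
     * The first access to any page in 'guest' now ends up here; in QEMU
     * the faulting accesses would come from the vCPU threads.
     */
    for (;;) {
        serve_one_fault(uffd, guest, image_fd);
    }
}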
> > >
> > > I like the suggestion but it needs to be followed up with a concrete
> > > design that is feasible and fair for Junyan and others to implement.
> > > Otherwise the "long journey" is really just a way of rejecting this
> > > feature.
> > >
> > > Let's discuss the details of using the block layer for NVDIMM and try to
> > > come up with a plan.
> > >
> > > The biggest issue with using the block layer is that persistent memory
> > > applications use load/store instructions to directly access data. This
> > > is fundamentally different from the block layer, which transfers blocks
> > > of data to and from the device.
> > >
> > > Because of block DMA, QEMU is able to perform processing at each block
> > > driver graph node. This doesn't exist for persistent memory because
> > > software does not trap I/O. Therefore the concept of filter nodes
> > > doesn't make sense for persistent memory - we certainly do not want to
> > > trap every I/O because performance would be terrible.
> > >
> > > Another difference is that persistent memory I/O is synchronous.
> > > Load/store instructions execute quickly. Perhaps we could use KVM async
> > > page faults in cases where QEMU needs to perform processing, but again
> > > the performance would be bad.
> >
> > Let me first say that I have no idea how the interface to NVDIMM looks.
> > I just assume it works pretty much like normal RAM (so the interface is
> > just that it’s a part of the physical address space).
> >
> > Also, it sounds a bit like you are already discarding my idea, but here
> > goes anyway.
> >
> > Would it be possible to introduce a buffering block driver that presents
> > an area of RAM/NVDIMM to the guest through an NVDIMM interface (so I
> > suppose as part of the guest address space)? For writing, we’d keep a
> > dirty bitmap on it, and then we’d asynchronously move the dirty areas
> > through the block layer, so basically like mirror. On flushing, we’d
> > block until everything is clean.
> >
> > For reading, we’d follow a COR/stream model, basically, where everything
> > is unpopulated in the beginning and everything is loaded through the
> > block layer both asynchronously all the time and on-demand whenever the
> > guest needs something that has not been loaded yet.
> >
> > Now I notice that that looks pretty much like a backing file model where
> > we constantly run both a stream and a commit job at the same time.
> >
> > The user could decide how much memory to use for the buffer, so it could
> > either hold everything or be partially unallocated.
> >
> > You’d probably want to back the buffer by NVDIMM normally, so that
> > nothing is lost on crashes (though this would imply that for partial
> > allocation the buffering block driver would need to know the mapping
> > between the area in real NVDIMM and its virtual representation of it).
> >
> > Just my two cents while scanning through qemu-block to find emails that
> > don’t actually concern me...
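
For what it's worth, the write side of such a buffering driver could be
fairly small. A toy sketch (single-threaded plain C, with pwrite()
standing in for the block layer; the names and the cluster size are made
up, and the read/COR side is left out entirely):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE   65536
#define NUM_CLUSTERS   1024

typedef struct {
    uint8_t *buf;                       /* RAM/NVDIMM-backed buffer, caller-allocated */
    unsigned long dirty[NUM_CLUSTERS / (8 * sizeof(unsigned long))];
    int image_fd;                       /* stands in for the block layer */
} BufferedRegion;

static void set_dirty(BufferedRegion *r, size_t cluster)
{
    r->dirty[cluster / (8 * sizeof(unsigned long))] |=
        1UL << (cluster % (8 * sizeof(unsigned long)));
}

/* Guest write path: update the buffer and mark the clusters dirty */
static void region_write(BufferedRegion *r, uint64_t off,
                         const void *data, size_t len)
{
    if (!len) {
        return;
    }
    memcpy(r->buf + off, data, len);
    for (size_t c = off / CLUSTER_SIZE;
         c <= (off + len - 1) / CLUSTER_SIZE; c++) {
        set_dirty(r, c);
    }
}

/* Background write-back, mirror-style: write out one dirty cluster */
static bool writeback_one(BufferedRegion *r)
{
    for (size_t c = 0; c < NUM_CLUSTERS; c++) {
        unsigned long *word = &r->dirty[c / (8 * sizeof(unsigned long))];
        unsigned long bit = 1UL << (c % (8 * sizeof(unsigned long)));
        if (*word & bit) {
            *word &= ~bit;      /* clear first; a new guest write re-dirties it */
            pwrite(r->image_fd, r->buf + c * CLUSTER_SIZE,
                   CLUSTER_SIZE, c * CLUSTER_SIZE);
            return true;        /* more work may remain */
        }
    }
    return false;               /* everything is clean */
}

/* Guest flush: block until the dirty bitmap is empty */
static void region_flush(BufferedRegion *r)
{
    while (writeback_one(r)) {
        /* a real driver would wait for in-flight requests here */
    }
    fsync(r->image_fd);
}

The interesting complications are exactly the ones Max mentions: partial
allocation of the buffer and keeping the buffer itself crash-safe.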
>
> The guest kernel already implements this - it's the page cache and the
> block layer!
>
> Doing it in QEMU with dirty memory logging enabled is less efficient
> than doing it in the guest.
>
> That's why I said it's better to just use block devices than to
> implement buffering.
>
> I'm saying that persistent memory emulation on top of the iscsi:// block
> driver (for example) does not make sense. It could be implemented but
> the performance wouldn't be better than block I/O and the
> complexity/code size in QEMU isn't justified IMO.
I think it could make sense if you put everything together.

The primary motivation for using this would of course be that you can
map the guest clusters of a qcow2 file directly into the guest. We'd
potentially take a fault on the first access, but once a cluster is
mapped, you get raw speed. You're right about flushing, and I was indeed
thinking of Pankaj's work there; maybe I should have been more explicit
about that.
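
Mechanically, "directly map" means something like the following: once
qcow2 has allocated a cluster, its data is a contiguous range of the
image file, so that range can be mmap()ed straight over the
corresponding part of the guest-visible region. This is only a sketch
with made-up names; host_offset would really come from qcow2's L2 table
lookup:

#include <stdint.h>
#include <sys/mman.h>

#define CLUSTER_SIZE 65536

/*
 * Map one allocated cluster of the image file over the corresponding
 * part of the guest-visible memory region. Afterwards guest loads and
 * stores hit the mapping directly, with no trap in QEMU.
 */
static int map_cluster(void *guest_base, uint64_t guest_offset,
                       int image_fd, uint64_t host_offset)
{
    void *dst = (uint8_t *)guest_base + guest_offset;
    void *p = mmap(dst, CLUSTER_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, image_fd, host_offset);
    return p == MAP_FAILED ? -1 : 0;
}

With MAP_SHARED the guest's stores go straight to the mapping; making
them persistent is exactly the flushing problem mentioned above.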

Now buffering in QEMU might come in useful when you want to run a block
job on the device. Block jobs are usually just temporary, and temporarily
lower performance may be a very acceptable trade-off when the
alternative is not being able to run block jobs at all.
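
One conceivable way to get that temporary buffering is to write-protect
the mapped region while a block job runs and track which pages the guest
dirties, writing them back when the job completes. Purely as an
in-process illustration of the idea (for a KVM guest the trap would
really have to come through KVM dirty logging or userfaultfd write
protection rather than SIGSEGV, and all names here are invented):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

static uint8_t *region_base;            /* set up by the caller */
static size_t region_size;
static unsigned long *dirty_bitmap;     /* one bit per page */

static void wp_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;

    if (addr < (uintptr_t)region_base ||
        addr >= (uintptr_t)region_base + region_size) {
        abort();                        /* a real crash, not our trap */
    }

    /* Record the page as dirty, then make it writable again */
    size_t page = (addr - (uintptr_t)region_base) / PAGE_SIZE;
    dirty_bitmap[page / (8 * sizeof(unsigned long))] |=
        1UL << (page % (8 * sizeof(unsigned long)));
    mprotect((void *)(addr & ~(uintptr_t)(PAGE_SIZE - 1)), PAGE_SIZE,
             PROT_READ | PROT_WRITE);
}

/* Call when a block job starts: the first write to each page now traps */
static void start_write_tracking(void)
{
    struct sigaction sa = {
        .sa_sigaction = wp_handler,
        .sa_flags = SA_SIGINFO,
    };
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(region_base, region_size, PROT_READ);
}

When the job finishes, the dirty pages are written back and the region
returns to the plain mapped mode, which is the "temporarily lower
performance" trade-off above.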

If we want to offer something nvdimm-like not only for the extreme
"performance only, no features" case, but as a viable option for the
average user, we need to be fast in the normal case and allow any block
layer feature to be used without restarting the VM with a different
storage device, even if at a performance penalty.

With iscsi, you still don't gain anything compared to just using a
block device, but support for it might simply fall out as a side effect
of implementing the interesting features.

Kevin