qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Question about vNVDIMM file format
@ 2016-05-16 15:04 Richard W.M. Jones
  2016-05-16 16:53 ` Stefan Hajnoczi
  2016-05-16 17:39 ` Xiao Guangrong
  0 siblings, 2 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2016-05-16 15:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Xiao Guangrong

I'm playing with ext4 and DAX.

I'm using:

  -object memory-backend-file,id=mem1,share,mem-path=/var/tmp/pmem,size=4G \
  -device nvdimm,memdev=mem1,id=nv1

where /var/tmp/pmem is a 4 GB ext4 filesystem image (no partition
table).  I can mount this in the guest using:

  mount -o dax /dev/pmem0 /mnt

and everything appears to work.
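
For reference, the setup described above could be reproduced roughly as
follows. This is only a sketch: the helper names are made up, and the
"-machine pc,nvdimm=on" and "-m 1G,slots=2,maxmem=8G" options are
assumptions based on QEMU's NVDIMM documentation, not something stated in
this message. The RAM, slot and maxmem values are placeholders.

```shell
#!/bin/sh
# Sketch only: function names and the RAM/slot sizes are illustrative assumptions.

# Create a sparse image and format it as ext4 with no partition table.
make_pmem_image() {
    # $1 = image path, $2 = size (e.g. 4G)
    truncate -s "$2" "$1" &&
    mkfs.ext4 -q -F "$1"
}

# Print the QEMU options for one file-backed vNVDIMM, one option group per line.
nvdimm_args() {
    # $1 = image path, $2 = size of the memory backend
    printf '%s\n' \
        "-machine pc,nvdimm=on" \
        "-m 1G,slots=2,maxmem=8G" \
        "-object memory-backend-file,id=mem1,share=on,mem-path=$1,size=$2" \
        "-device nvdimm,memdev=mem1,id=nv1"
}
```

A possible use is "make_pmem_image /var/tmp/pmem 4G" followed by passing
the unquoted output of "nvdimm_args /var/tmp/pmem 4G" to qemu-system-x86_64
along with the usual kernel and console options.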

I read in the mailing list that the pmem file has some internal
structure for storing config data, stored in the last 128 KB of the
file.  Is that still the case?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-16 15:04 [Qemu-devel] Question about vNVDIMM file format Richard W.M. Jones
@ 2016-05-16 16:53 ` Stefan Hajnoczi
  2016-05-16 18:25   ` Richard W.M. Jones
  2016-05-16 17:39 ` Xiao Guangrong
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2016-05-16 16:53 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: qemu-devel, Xiao Guangrong


On Mon, May 16, 2016 at 04:04:01PM +0100, Richard W.M. Jones wrote:
> I'm playing with ext4 and DAX.
> 
> I'm using:
> 
>   -object memory-backend-file,id=mem1,share,mem-path=/var/tmp/pmem,size=4G \
>   -device nvdimm,memdev=mem1,id=nv1
> 
> where /var/tmp/pmem is a 4 GB ext4 filesystem image (no partition
> table).  I can mount this in the guest using:
> 
>   mount -o dax /dev/pmem0 /mnt
> 
> and everything appears to work.
> 
> I read in the mailing list that the pmem file has some internal
> structure for storing config data, stored in the last 128 KB of the
> file.  Is that still the case?

AFAICT qemu.git/master does not support the ACPI _DSM for namespace
configuration.  That means the entire /var/tmp/pmem should be visible.

Stefan



* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-16 15:04 [Qemu-devel] Question about vNVDIMM file format Richard W.M. Jones
  2016-05-16 16:53 ` Stefan Hajnoczi
@ 2016-05-16 17:39 ` Xiao Guangrong
  1 sibling, 0 replies; 8+ messages in thread
From: Xiao Guangrong @ 2016-05-16 17:39 UTC (permalink / raw)
  To: Richard W.M. Jones, qemu-devel



On 05/16/2016 11:04 PM, Richard W.M. Jones wrote:
> I'm playing with ext4 and DAX.

Thanks for trying it out, Rich! :)

>
> I'm using:
>
>    -object memory-backend-file,id=mem1,share,mem-path=/var/tmp/pmem,size=4G \
>    -device nvdimm,memdev=mem1,id=nv1
>
> where /var/tmp/pmem is a 4 GB ext4 filesystem image (no partition
> table).  I can mount this in the guest using:
>
>    mount -o dax /dev/pmem0 /mnt
>
> and everything appears to work.
>
> I read in the mailing list that the pmem file has some internal
> structure for storing config data, stored in the last 128 KB of the
> file.  Is that still the case?

The patchset supporting label data has not been merged yet, so currently there
is no label data for the vNVDIMM device.


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-16 16:53 ` Stefan Hajnoczi
@ 2016-05-16 18:25   ` Richard W.M. Jones
  2016-05-18  7:04     ` Xiao Guangrong
  0 siblings, 1 reply; 8+ messages in thread
From: Richard W.M. Jones @ 2016-05-16 18:25 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, Xiao Guangrong

On Mon, May 16, 2016 at 09:53:36AM -0700, Stefan Hajnoczi wrote:
> On Mon, May 16, 2016 at 04:04:01PM +0100, Richard W.M. Jones wrote:
> > I'm playing with ext4 and DAX.
> > 
> > I'm using:
> > 
> >   -object memory-backend-file,id=mem1,share,mem-path=/var/tmp/pmem,size=4G \
> >   -device nvdimm,memdev=mem1,id=nv1
> > 
> > where /var/tmp/pmem is a 4 GB ext4 filesystem image (no partition
> > table).  I can mount this in the guest using:
> > 
> >   mount -o dax /dev/pmem0 /mnt
> > 
> > and everything appears to work.
> > 
> > I read in the mailing list that the pmem file has some internal
> > structure for storing config data, stored in the last 128 KB of the
> > file.  Is that still the case?
> 
> AFAICT qemu.git/master does not support the ACPI _DSM for namespace
> configuration.  That means the entire /var/tmp/pmem should be visible.

That's great, thanks both for your answers.

FWIW I was able to add support to libguestfs -- at least for the
"direct" backend where we run qemu directly.  Unfortunately libvirt
does not support the vNVDIMM device yet.

I have posted the two patches needed on our mailing list.  There seems
to be some delay in our mail server, so they aren't in the archives
yet:

  https://www.redhat.com/archives/libguestfs/2016-May/thread.html

There are a few possible problems / questions I have:

(a) How necessary is the ACPI dependency?  We disable ACPI because it
is quite slow, adding something like 150-200ms to the boot process
(every millisecond counts for us!).  Because I previously never needed
ACPI, I never really looked into why this is, and it could be
something quite simple, so I'm going to look at this issue next.  I
understand that NVDIMMs are not regular (eg) PCI devices, so ordinary
device probing isn't going to work, and that probably answers the
question why you need to use ACPI.

(b) Could you describe what the 3 modules (nd_btt, nd_pmem, nfit) do?
Are all 3 modules necessary in the guest kernel?

(c) I've got the root filesystem (which is actually ext2, but using
the ext4.ko driver) mounted with -o dax.  What benefits / differences
should I observe?  Just general reduced memory / page cache usage?

(d) If, in future, you add the namespace metadata, what tools will be
available on the host to create a packed filesystem + metadata?
Assuming that we won't be able to export just a filesystem as I am
doing now.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-16 18:25   ` Richard W.M. Jones
@ 2016-05-18  7:04     ` Xiao Guangrong
  2016-05-18  8:11       ` Zhang, Haozhong
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Xiao Guangrong @ 2016-05-18  7:04 UTC (permalink / raw)
  To: Richard W.M. Jones, Stefan Hajnoczi; +Cc: qemu-devel, Zhang, Haozhong


Hi Rich,


On 05/17/2016 02:25 AM, Richard W.M. Jones wrote:
> On Mon, May 16, 2016 at 09:53:36AM -0700, Stefan Hajnoczi wrote:
>> On Mon, May 16, 2016 at 04:04:01PM +0100, Richard W.M. Jones wrote:
>>> I'm playing with ext4 and DAX.
>>>
>>> I'm using:
>>>
>>>    -object memory-backend-file,id=mem1,share,mem-path=/var/tmp/pmem,size=4G \
>>>    -device nvdimm,memdev=mem1,id=nv1
>>>
>>> where /var/tmp/pmem is a 4 GB ext4 filesystem image (no partition
>>> table).  I can mount this in the guest using:
>>>
>>>    mount -o dax /dev/pmem0 /mnt
>>>
>>> and everything appears to work.
>>>
>>> I read in the mailing list that the pmem file has some internal
>>> structure for storing config data, stored in the last 128 KB of the
>>> file.  Is that still the case?
>>
>> AFAICT qemu.git/master does not support the ACPI _DSM for namespace
>> configuration.  That means the entire /var/tmp/pmem should be visible.
>
> That's great, thanks both for your answers.
>
> FWIW I was able to add support to libguestfs -- at least for the
> "direct" backend where we run qemu directly.  Unfortunately libvirt
> does not support the vNVDIMM device yet.
>
> I have posted the two patches needed on our mailing list.  There seems
> to be some delay in our mail server, so they aren't in the archives
> yet:
>
>    https://www.redhat.com/archives/libguestfs/2016-May/thread.html
>
> There are a few possible problems / questions I have:
>
> (a) How necessary is the ACPI dependency?  We disable ACPI because it
> is quite slow, adding something like 150-200ms to the boot process
> (every millisecond counts for us!).  Because I previously never needed
> ACPI, I never really looked into why this is, and it could be
> something quite simple, so I'm going to look at this issue next.  I
> understand that NVDIMMs are not regular (eg) PCI devices, so ordinary
> device probing isn't going to work, and that probably answers the
> question why you need to use ACPI.

Yes, ACPI is necessary to export NVDIMM devices. The good news is that
Intel is working on ‘lite QEMU’ which only has basic/simplest ACPI
support. Haozhong, who has been CCed, is working on it.

>
> (b) Could you describe what the 3 modules (nd_btt, nd_pmem, nfit) do?
> Are all 3 modules necessary in the guest kernel?

I think the best answer comes from the kernel's Kconfig :):
ACPI_NFIT: Infrastructure to probe ACPI 6 compliant platforms for
            NVDIMMs (NFIT) and register a libnvdimm device tree

BTT:
           The Block Translation Table (BTT) provides atomic sector
           update semantics for persistent memory devices, so that
           applications that rely on sector writes not being torn (a
           guarantee that typical disks provide) can continue to do so.

PMEM:
           Memory ranges for PMEM are described by either an NFIT
           (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
           non-standard OEM-specific E820 memory type (type-12, see
           CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
           'memmap=nn[KMG]!ss[KMG]' kernel command line (see
           Documentation/kernel-parameters.txt).  This driver converts
           these persistent memory ranges into block devices that are
           capable of DAX (direct-access) file system mappings

Currently vNVDIMM is a pure PMEM device without labels, so BTT is unnecessary
and you can say N to BTT when configuring the Linux kernel for the VM.
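
As a rough illustration of the advice above, a guest kernel .config could
be checked like this. It is a sketch under assumptions: check_guest_config
is a hypothetical helper, and it only looks at the three options discussed
in this thread (CONFIG_ACPI_NFIT, CONFIG_BLK_DEV_PMEM, CONFIG_BTT), not at
anything else DAX may need.

```shell
#!/bin/sh
# Sketch: report whether a kernel .config matches the advice above.
# check_guest_config is a hypothetical helper, not an existing tool.
check_guest_config() {
    # $1 = path to a kernel .config file
    cfg=$1
    status=0
    # nfit and nd_pmem may be built in (=y) or modular (=m).
    for opt in CONFIG_ACPI_NFIT CONFIG_BLK_DEV_PMEM; do
        if ! grep -Eq "^${opt}=[ym]\$" "$cfg"; then
            echo "missing: $opt"
            status=1
        fi
    done
    # BTT is optional for a label-less pure PMEM device.
    if grep -Eq '^CONFIG_BTT=[ym]$' "$cfg"; then
        echo "note: CONFIG_BTT is enabled but not required"
    fi
    return $status
}
```

The function returns non-zero when either required option is absent, so it
can be used directly in an "if" or with "||".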

>
> (c) I've got the root filesystem (which is actually ext2, but using
> the ext4.ko driver) mounted with -o dax.  What benefits / differences
> should I observe?  Just general reduced memory / page cache usage?
>

And better performance, as the slow IO path is no longer needed. :)

However, there is a potential issue: if it is not backed by real NVDIMM
hardware, the data is not persistent. We are going to resolve this by
emulating PCOMMIT and doing msync properly.

> (d) If, in future, you add the namespace metadata, what tools will be
> available on the host to create a packed filesystem + metadata?
> Assuming that we won't be able to export just a filesystem as I am
> doing now.

Yes, this kind of tool would be useful. We plan to do this, but it is low
priority on our TODO list. :(


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-18  7:04     ` Xiao Guangrong
@ 2016-05-18  8:11       ` Zhang, Haozhong
  2016-05-18 10:50       ` Richard W.M. Jones
  2016-05-18 17:36       ` Richard W.M. Jones
  2 siblings, 0 replies; 8+ messages in thread
From: Zhang, Haozhong @ 2016-05-18  8:11 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: Xiao Guangrong, Stefan Hajnoczi, qemu-devel

Hi Richard,

On 05/18/16 15:04, Xiao Guangrong wrote:
[..]
> >There are a few possible problems / questions I have:
> >
> >(a) How necessary is the ACPI dependency?  We disable ACPI because it
> >is quite slow, adding something like 150-200ms to the boot process
> >(every millisecond counts for us!).  Because I previously never needed
> >ACPI, I never really looked into why this is, and it could be
> >something quite simple, so I'm going to look at this issue next.  I
> >understand that NVDIMMs are not regular (eg) PCI devices, so ordinary
> >device probing isn't going to work, and that probably answers the
> >question why you need to use ACPI.
> 
> Yes, ACPI is necessary to export NVDIMM devices. The good news is that
> Intel is working on ‘lite QEMU’ which only has basic/simplest ACPI
> support. Haozhong, who has been CCed, is working on it.
>

The approach we used for non-ACPI VMs is really a dirty hack: we
modified qemu and seabios to create a type-12 e820 entry for each pmem
region, so that the guest Linux could use the legacy pmem driver.

According to the ACPI 6 spec, type-7 e820 entries should be used
here. However, the pmem driver in Linux still requires ACPI for pmem
presented via type-7 e820 entries. (That is why we turned to type-12
and why I call it a hack.)

We are now considering adding ACPI back to our project (for
requirements other than NVDIMM) and consequently will use the standard
NVDIMM ACPI in QEMU, so the above hack will become obsolete in the
future. No optimization has been applied to NVDIMM ACPI so far.

If a customized Linux kernel is allowed, you could try removing
most ACPI support and leaving only NFIT (and maybe other parts that
cannot be removed), which may save some boot time.

[..]
> 
> >(d) If, in future, you add the namespace metadata, what tools will be
> >available on the host to create a packed filesystem + metadata?
> >Assuming that we won't be able to export just a filesystem as I am
> >doing now.
> 
> Yes, this kind of tool would be useful. We plan to do this, but it is low
> priority on our TODO list. :(
>

I'm not clear on the exact requirement here. Maybe you could have a look
at ndctl (https://github.com/pmem/ndctl), which supports several kinds of
operations on namespaces.

Thanks,
Haozhong


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-18  7:04     ` Xiao Guangrong
  2016-05-18  8:11       ` Zhang, Haozhong
@ 2016-05-18 10:50       ` Richard W.M. Jones
  2016-05-18 17:36       ` Richard W.M. Jones
  2 siblings, 0 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2016-05-18 10:50 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Stefan Hajnoczi, qemu-devel, Zhang, Haozhong

On Wed, May 18, 2016 at 03:04:52PM +0800, Xiao Guangrong wrote:
> On 05/17/2016 02:25 AM, Richard W.M. Jones wrote:
> >(a) How necessary is the ACPI dependency?  We disable ACPI because it
> >is quite slow, adding something like 150-200ms to the boot process
> >(every millisecond counts for us!).  Because I previously never needed
> >ACPI, I never really looked into why this is, and it could be
> >something quite simple, so I'm going to look at this issue next.  I
> >understand that NVDIMMs are not regular (eg) PCI devices, so ordinary
> >device probing isn't going to work, and that probably answers the
> >question why you need to use ACPI.
> 
> Yes, ACPI is necessary to export NVDIMM devices. The good news is that
> Intel is working on ‘lite QEMU’ which only has basic/simplest ACPI
> support. Haozhong, who has been CCed, is working on it.

I remeasured the ACPI overhead with the latest upstream kernel & qemu,
it has dropped to under 20ms, so now I've just unconditionally enabled
ACPI.

> >(c) I've got the root filesystem (which is actually ext2, but using
> >the ext4.ko driver) mounted with -o dax.  What benefits / differences
> >should I observe?  Just general reduced memory / page cache usage?
> >
> 
> And better performance, as the slow IO path is no longer needed. :)
> 
> However, there is a potential issue: if it is not backed by real NVDIMM
> hardware, the data is not persistent. We are going to resolve this by
> emulating PCOMMIT and doing msync properly.

I'm using share=off (i.e. MAP_PRIVATE), because for this appliance
model I don't want writes to go to the backing disk.
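
One way to sanity-check that behaviour on the host is to compare a
checksum of the image before and after a guest run. The sketch below
assumes that with share=off the file is mapped privately, so guest writes
should not reach it. image_sum and image_unchanged are hypothetical helper
names.

```shell
#!/bin/sh
# Sketch: confirm that a guest run left the backing image untouched.
# image_sum and image_unchanged are hypothetical helper names.

# Print a checksum of the file (POSIX cksum; reading stdin keeps the
# filename out of the output).
image_sum() {
    cksum < "$1"
}

# Succeed if the file's current checksum equals the one recorded earlier.
image_unchanged() {
    # $1 = image path, $2 = checksum recorded before the run
    [ "$(image_sum "$1")" = "$2" ]
}
```

Typical use would be to record "before=$(image_sum /var/tmp/pmem)", run
the guest, then call "image_unchanged /var/tmp/pmem "$before"". Reading a
4 GB image twice is slow, so this is a one-off check rather than something
to do on every boot.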

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


* Re: [Qemu-devel] Question about vNVDIMM file format
  2016-05-18  7:04     ` Xiao Guangrong
  2016-05-18  8:11       ` Zhang, Haozhong
  2016-05-18 10:50       ` Richard W.M. Jones
@ 2016-05-18 17:36       ` Richard W.M. Jones
  2 siblings, 0 replies; 8+ messages in thread
From: Richard W.M. Jones @ 2016-05-18 17:36 UTC (permalink / raw)
  To: Xiao Guangrong; +Cc: Stefan Hajnoczi, qemu-devel, Zhang, Haozhong

I thought you might be interested in some performance results, "hot
off the presses".

With DAX enabled, I see the following messages in the guest kernel
logs, which I assume means it is working:

[    0.469364] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[    0.469932] EXT4-fs (pmem0): mounting ext2 file system using the ext4 subsystem
[    0.470682] EXT4-fs (pmem0): mounted filesystem without journal. Opts: dax
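
Besides the kernel log, the mount options themselves can be inspected from
inside the guest. The following is a minimal sketch. has_dax is a
hypothetical helper, and it assumes the option appears literally as "dax"
in the fourth field of /proc/mounts, which may differ between kernel
versions.

```shell
#!/bin/sh
# Sketch: check whether a mount point carries the "dax" option.
# has_dax is a hypothetical helper; the mounts file defaults to /proc/mounts.
has_dax() {
    # $1 = mount point, $2 = mounts file (optional)
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "dax") found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For the root filesystem in this setup the call would be "has_dax /", which
succeeds only if the option is present.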

Enabling vNVDIMM + ext4.ko + DAX for the libguestfs appliance[1] root
disk improves our boot+shutdown performance measure[2] by between 20
and 30 milliseconds, which is about 5% faster at the moment.

I also wanted to know if memory usage is reduced.  I ran `free -m'
inside a freshly booted appliance.

Without DAX:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         451           1          30         465
Swap:             0           0           0

With DAX:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:            485           3         469           1          12         467
Swap:             0           0           0
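
To make the comparison explicit, the relevant column can be extracted from
each output. This is a small sketch. mem_field is a hypothetical helper,
and the column numbers follow the header shown above.

```shell
#!/bin/sh
# Sketch: pull one column out of the "Mem:" row of `free -m` output.
# mem_field is a hypothetical helper; column numbers follow the header above
# (2=total 3=used 4=free 5=shared 6=buff/cache 7=available).
mem_field() {
    # $1 = column number; reads `free -m` output on stdin
    awk -v col="$1" '$1 == "Mem:" { print $col }'
}
```

Applied to the two outputs above ("free -m | mem_field 6"), this gives 30
and 12 for buff/cache, an 18 MB drop with DAX, which matches the 18 MB
rise in the free column (451 to 469).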

I also wanted to know if qemu's memory usage is reduced.  I captured
the rusage.ru_maxrss of the qemu subprocess with and without DAX.  The
difference is only about 5 MB, which doesn't seem like very much to
me.  Perhaps I'm measuring this wrong.

If you have better suggestions for measuring memory usage, please let
me know.
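
One possible alternative to ru_maxrss, offered only as a sketch and not
something proposed in this thread, is to total the per-mapping figures in
/proc/<pid>/smaps for the qemu process. smaps_total_kb is a hypothetical
helper name.

```shell
#!/bin/sh
# Sketch: total one field (Rss, Pss, ...) across all mappings in an smaps file.
# smaps_total_kb is a hypothetical helper name.
smaps_total_kb() {
    # $1 = field name without the colon, $2 = smaps file, e.g. /proc/<pid>/smaps
    awk -v key="$1:" '$1 == key { sum += $2 } END { print sum + 0 }' "$2"
}
```

Comparing "smaps_total_kb Rss" and "smaps_total_kb Pss" for the same
process with and without DAX may show whether the file-backed mapping is
being counted differently in the two cases.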

Rich.

[1] http://libguestfs.org/guestfs-internals.1.html#architecture
[2] http://libguestfs.org/guestfs-performance.1.html#baseline:-starting-the-appliance

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

