Re: [Qemu-devel] [PATCH v9 0/5] implement vNVDIMM

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
To: pbonzini@redhat.com, imammedo@redhat.com
Cc: ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	gleb@kernel.org, mtosatti@redhat.com, qemu-devel@nongnu.org,
	stefanha@redhat.com, dan.j.williams@intel.com, rth@twiddle.net
Subject: Re: [Qemu-devel] [PATCH v9 0/5] implement vNVDIMM
Date: Thu, 10 Dec 2015 11:11:59 +0800	[thread overview]
Message-ID: <5668ED7F.2020007@linux.intel.com> (raw)
In-Reply-To: <1449040860-19040-1-git-send-email-guangrong.xiao@linux.intel.com>


New version, new week, and unfortunate new ping... :(


On 12/02/2015 03:20 PM, Xiao Guangrong wrote:
> This patchset can be found at:
>        https://github.com/xiaogr/qemu.git nvdimm-v9
>
> It is based on pci branch on Michael's tree and the top commit is:
> commit 0c73277af7 (vhost-user-test: fix crash with glib < 2.36).
>
> Changelog in v9:
> - the changes address Michael's comments:
>    1) move the control parameter to -machine and it is off on default, then
>       it can be enabled by, for example, -machine pc,nvdimm
>    2) introduce a macro to define "NCAL"
>    3) abstract the function, nvdimm_build_device_dsm(), to clean up the
>       code
>    4) adjust the code style of dsm method
>    5) add spec reference in the code comment
>
> other:
>    pick up Stefan's Reviewed-by
>
> Changelog in v8:
> We split the long patch series into the small parts, as you see now, this
> is the first part which enables NVDIMM without label data support.
>
> The command line has been changed because some patches simplifying the
> things have not been included into this series, you should specify the
> file size exactly using the parameters as follows:
>     memory-backend-file,id=mem1,share,mem-path=/tmp/nvdimm1,size=10G \
>     -device nvdimm,memdev=mem1,id=nv1
>
> Changelog in v7:
> - changes from Vladimir Sementsov-Ogievskiy's comments:
>    1) let gethugepagesize() realize if fstat is failed instead of get
>       normal page size
>    2) rename  open_file_path to open_ram_file_path
>    3) better log the error message by using error_setg_errno
>    4) update commit in the commit log to explain hugepage detection on
>       Windows
>
> - changes from Eduardo Habkost's comments:
>    1) use 'Error**' to collect error message for qemu_file_get_page_size()
>    2) move gethugepagesize() replacement to the same patch to make it
>       better for review
>    3) introduce qemu_get_file_size to unity the code with raw_getlength()
>
> - changes from Stefan's comments:
>    1) check the memory region is large enough to contain DSM output
>       buffer
>
> - changes from Eric Blake's comments:
>    1) update the shell command in the commit log to generate the patch
>       which drops 'pc-dimm' prefix
>
> - others:
>    pick up Reviewed-by from Stefan, Vladimir Sementsov-Ogievskiy, and
>    Eric Blake.
>
> Changelog in v6:
> - changes from Stefan's comments:
>    1) fix code style of struct naming by CamelCase way
>    2) fix offset + length overflow when read/write label data
>    3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
>       be used to replace getpagesize()
>
> Changelog in v5:
> - changes from Michael's comments:
>    1) prefix nvdimm_ to everything in NVDIMM source files
>    2) make parsing _DSM Arg3 more clear
>    3) comment style fix
>    5) drop single used definition
>    6) fix dirty dsm buffer lost due to memory write happened on host
>    7) check dsm buffer if it is big enough to contain input data
>    8) use build_append_int_noprefix to store single value to GArray
>
> - changes from Michael's and Igor's comments:
>    1) introduce 'nvdimm-support' parameter to control nvdimm
>       enablement and it is disabled for 2.4 and its earlier versions
>       to make live migration compatible
>    2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
>       virtualization
>
> - changes from Stefan's comments:
>    1) do endian adjustment for the buffer length
>
> - changes from Bharata B Rao's comments:
>    1) fix compile on ppc
>
> - others:
>    1) the buffer length is directly got from IO read rather than got
>       from dsm memory
>    2) fix dirty label data lost due to memory write happened on host
>
> Changelog in v4:
> - changes from Michael's comments:
>    1) show the message, "Memory is not allocated from HugeTlbfs", if file
>       based memory is not allocated from hugetlbfs.
>    2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
>       from Machine.
>    3) statically define UUID and make its operation more clear
>    4) use GArray to build device structures to avoid potential buffer
>       overflow
>    4) improve comments in the code
>    5) improve code style
>
> - changes from Igor's comments:
>    1) add NVDIMM ACPI spec document
>    2) use serialized method to avoid Mutex
>    3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
>    4) introduce a common ASL method used by _DSM for all devices to reduce
>       ACPI size
>    5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
>       it's better to upgrade QEMU to support Rev2 in the future
>
> - changes from Stefan's comments:
>    1) copy input data from DSM memory to local buffer to avoid potential
>       issues as DSM memory is visible to guest. Output data is handled
>       in a similar way
>
> - changes from Dan's comments:
>    1) drop static namespace as Linux has already supported label-less
>       nvdimm devices
>
> - changes from Vladimir's comments:
>    1) print better message, "failed to get file size for %s, can't create
>       backend on it", if any file operation filed to obtain file size
>
> - others:
>    create a git repo on github.com for better review/test
>
> Also, thanks for Eric Blake's review on QAPI's side.
>
> Thank all of you to review this patchset.
>
> Changelog in v3:
> There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> Michael for their valuable comments, the patchset finally gets better shape.
> - changes from Igor's comments:
>    1) abstract dimm device type from pc-dimm and create nvdimm device based on
>       dimm, then it uses memory backend device as nvdimm's memory and NUMA has
>       easily been implemented.
>    2) let file-backend device support any kind of filesystem not only for
>       hugetlbfs and let it work on file not only for directory which is
>       achieved by extending 'mem-path' - if it's a directory then it works as
>       current behavior, otherwise if it's file then directly allocates memory
>       from it.
>    3) we figure out a unused memory hole below 4G that is 0xFF00000 ~
>       0xFFF00000, this range is large enough for NVDIMM ACPI as build 64-bit
>       ACPI SSDT/DSDT table will break windows XP.
>       BTW, only make SSDT.rev = 2 can not work since the width is only depended
>       on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
>       in ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
> | the global integer width for all integers, including integers in SSDTs.
>    4) use the lowest ACPI spec version to document AML terms.
>    5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
>
> - changes from Stefan's comments:
>    1) do not do endian adjustment in-place since _DSM memory is visible to guest
>    2) use target platform's target page size instead of fixed PAGE_SIZE
>       definition
>    3) lots of code style improvement and typo fixes.
>    4) live migration fix
> - changes from Paolo's comments:
>    1) improve the name of memory region
>
> - other changes:
>    1) return exact buffer size for _DSM method instead of the page size.
>    2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
>       devices.
>    3) NUMA support
>    4) implement _FIT method
>    5) rename "configdata" to "reserve-label-data"
>    6) simplify _DSM arg3 determination
>    7) main changelog update to let it reflect v3.
>
> Changlog in v2:
> - Use litten endian for DSM method, thanks for Stefan's suggestion
>
> - introduce a new parameter, @configdata, if it's false, Qemu will
>    build a static and readonly namespace in memory and use it serveing
>    for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>    reserved region is needed at the end of the @file, it is good for
>    the user who want to pass whole nvdimm device and make its data
>    completely be visible to guest
>
> - divide the source code into separated files and add maintain info
>
> BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will
> be posted on next week
>
> ====== Background ======
> NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. They are discovered via ACPI and configured by _DSM
> method of NVDIMM device in ACPI. There has some supporting documents which
> can be found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
> Currently, the NVDIMM driver has been merged into upstream Linux Kernel and
> this patchset tries to enable it in virtualization field
>
> ====== Design ======
> NVDIMM supports two mode accesses, one is PMEM which maps NVDIMM into CPU's
> address space then CPU can directly access it as normal memory, another is
> BLK which is used as block device to reduce the occupying of CPU address
> space
>
> BLK mode accesses NVDIMM via Command Register window and Data Register window.
> BLK virtualization has high workload since each sector access will cause at
> least two VM-EXIT. So we currently only imperilment vPMEM in this patchset
>
> --- vPMEM design ---
> We introduce a new device named "nvdimm", it uses memory backend device as
> NVDIMM memory. The file in file-backend device can be a regular file and block
> device. We can use any file when we do test or emulation, however,
> in the real word, the files passed to guest are:
> - the regular file in the filesystem with DAX enabled created on NVDIMM device
>    on host
> - the raw PMEM device on host, e,g /dev/pmem0
> Memory access on the address created by mmap on these kinds of files can
> directly reach NVDIMM device on host.
>
> --- vConfigure data area design ---
> Each NVDIMM device has a configure data area which is used to store label
> namespace data. In order to emulating this area, we divide the file into two
> parts:
> - first parts is (0, size - 128K], which is used as PMEM
> - 128K at the end of the file, which is used as Label Data Area
> So that the label namespace data can be persistent during power lose or system
> failure.
>
> We also support passing the whole file to guest without reserve any region for
> label data area which is achieved by "reserve-label-data" parameter - if it's
> false then QEMU will build static and readonly namespace in memory and that
> namespace contains the whole file size. The parameter is false on default.
>
> --- _DSM method design ---
> _DSM in ACPI is used to configure NVDIMM, currently we only allow access of
> label namespace data, i.e, Get Namespace Label Size (Function Index 4),
> Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
> (Function Index 6)
>
> _DSM uses two pages to transfer data between ACPI and Qemu, the first page
> is RAM-based used to save the input info of _DSM method and Qemu reuse it
> store output info and another page is MMIO-based, ACPI write data to this
> page to transfer the control to Qemu
>
> ====== Test ======
> In host
> 1) create memory backed file, e.g # dd if=zero of=/tmp/nvdimm bs=1G count=10
> 2) append "-object memory-backend-file,share,id=mem1,
>     mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
>     id=nv1" in QEMU command line
>
> In guest, download the latest upsteam kernel (4.2 merge window) and enable
> ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
> 1) insmod drivers/nvdimm/libnvdimm.ko
> 2) insmod drivers/acpi/nfit.ko
> 3) insmod drivers/nvdimm/nd_btt.ko
> 4) insmod drivers/nvdimm/nd_pmem.ko
> You can see the whole nvdimm device used as a single namespace and /dev/pmem0
> appears. You can do whatever on /dev/pmem0 including DAX access.
>
> Currently Linux NVDIMM driver does not support namespace operation on this
> kind of PMEM, apply below changes to support dynamical namespace:
>
> @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
>                          continue;
>                  }
>
> -               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> +               if (nfit_mem->memdev_pmem)
>                          flags |= NDD_ALIASING;
>
> You can append another NVDIMM device in guest and do:
> # cd /sys/bus/nd/devices/
> # cd namespace1.0/
> # echo `uuidgen` > uuid
> # echo `expr 1024 \* 1024 \* 128` > size
> then reload nd.pmem.ko
>
> You can see /dev/pmem1 appears
>
> Xiao Guangrong (5):
>    nvdimm: implement NVDIMM device abstract
>    acpi: support specified oem table id for build_header
>    nvdimm acpi: build ACPI NFIT table
>    nvdimm acpi: build ACPI nvdimm devices
>    nvdimm: add maintain info
>
>   MAINTAINERS                        |   7 +
>   default-configs/i386-softmmu.mak   |   2 +
>   default-configs/x86_64-softmmu.mak |   2 +
>   hw/acpi/Makefile.objs              |   1 +
>   hw/acpi/aml-build.c                |  15 +-
>   hw/acpi/memory_hotplug.c           |   5 +
>   hw/acpi/nvdimm.c                   | 488 +++++++++++++++++++++++++++++++++++++
>   hw/arm/virt-acpi-build.c           |  13 +-
>   hw/i386/acpi-build.c               |  32 ++-
>   hw/i386/pc.c                       |  19 ++
>   hw/mem/Makefile.objs               |   1 +
>   hw/mem/nvdimm.c                    |  46 ++++
>   include/hw/acpi/aml-build.h        |   3 +-
>   include/hw/i386/pc.h               |   2 +
>   include/hw/mem/nvdimm.h            |  32 +++
>   qemu-options.hx                    |   5 +-
>   16 files changed, 651 insertions(+), 22 deletions(-)
>   create mode 100644 hw/acpi/nvdimm.c
>   create mode 100644 hw/mem/nvdimm.c
>   create mode 100644 include/hw/mem/nvdimm.h
>

next prev parent reply	other threads:[~2015-12-10  3:18 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-02  7:20 [Qemu-devel] [PATCH v9 0/5] implement vNVDIMM Xiao Guangrong
2015-12-02  7:20 ` [Qemu-devel] [PATCH v9 1/5] nvdimm: implement NVDIMM device abstract Xiao Guangrong
2015-12-02  7:20 ` [Qemu-devel] [PATCH v9 2/5] acpi: support specified oem table id for build_header Xiao Guangrong
2015-12-02  7:20 ` [Qemu-devel] [PATCH v9 3/5] nvdimm acpi: build ACPI NFIT table Xiao Guangrong
2015-12-02  7:20 ` [Qemu-devel] [PATCH v9 4/5] nvdimm acpi: build ACPI nvdimm devices Xiao Guangrong
2015-12-02  7:21 ` [Qemu-devel] [PATCH v9 5/5] nvdimm: add maintain info Xiao Guangrong
2015-12-10  3:11 ` Xiao Guangrong [this message]
2015-12-21 14:13   ` [Qemu-devel] [PATCH v9 0/5] implement vNVDIMM Xiao Guangrong
2015-12-28  2:39 ` [Qemu-devel] How to reserve guest physical region for ACPI Xiao Guangrong
2015-12-28 12:50   ` Michael S. Tsirkin
2015-12-30 15:55     ` Igor Mammedov
2015-12-30 19:52       ` Michael S. Tsirkin
2016-01-04 20:17         ` Laszlo Ersek
2016-01-05 17:08           ` Igor Mammedov
2016-01-05 17:22             ` Laszlo Ersek
2016-01-06 13:39               ` Igor Mammedov
2016-01-06 14:43                 ` Laszlo Ersek
2016-01-07 13:51           ` Igor Mammedov
2016-01-07 17:33             ` Laszlo Ersek
2016-01-05 16:30         ` Igor Mammedov
2016-01-05 16:43           ` Michael S. Tsirkin
2016-01-05 17:07             ` Laszlo Ersek
2016-01-05 17:07             ` Xiao Guangrong
2016-01-07  9:21               ` Igor Mammedov
2016-01-08  4:21                 ` Xiao Guangrong
2016-01-08  9:42                   ` Laszlo Ersek
2016-01-08 15:59                   ` Igor Mammedov
2016-01-07 10:30             ` Igor Mammedov
2016-01-07 10:54               ` Michael S. Tsirkin
2016-01-07 13:42                 ` Igor Mammedov
2016-01-07 17:11                   ` Laszlo Ersek
2016-01-07 17:08                 ` Laszlo Ersek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5668ED7F.2020007@linux.intel.com \
    --to=guangrong.xiao@linux.intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=ehabkost@redhat.com \
    --cc=gleb@kernel.org \
    --cc=imammedo@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rth@twiddle.net \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).