* Re: [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Zhang, Yi
@ 2019-01-29 6:55 ` Pankaj Gupta
2019-01-30 11:15 ` Yi Zhang
2019-01-29 13:50 ` Michael S. Tsirkin
1 sibling, 1 reply; 12+ messages in thread
From: Pankaj Gupta @ 2019-01-29 6:55 UTC (permalink / raw)
To: Yi Zhang
Cc: xiaoguangrong eric, stefanha, pbonzini, yu c zhang, richardw yang,
mst, ehabkost, imammedo, dan j williams, qemu-devel
>
> From: Zhang Yi <yi.z.zhang@linux.intel.com>
>
> When a file supporting DAX is used as vNVDIMM backend, mmap it with
> MAP_SYNC flag in addition which can ensure file system metadata
> synced in each guest writes to the backend file, without other QEMU
> actions (e.g., periodic fsync() by QEMU).
>
> Current, We have below different possible use cases:
>
> 1. pmem=on is set, shared=on is set, MAP_SYNC supported:
> a: backend is a dax supporting file.
> - MAP_SYNC will active.
> b: backend is not a dax supporting file.
> - mmap will trigger a warning. then MAP_SYNC flag will be ignored
>
> 2. The rest of cases:
> - we will never pass the MAP_SYNC to mmap2
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> ---
> include/qemu/osdep.h | 21 +++++++++++++++++++++
> util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 457d24e..96209bb 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> # define QEMU_VMALLOC_ALIGN getpagesize()
> #endif
>
> +/*
> + * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
> + * 4.15, so they may not be defined when compiling on older kernels.
> + */
> +#ifdef CONFIG_LINUX
> +
> +#include <linux/mman.h>
> +
> +#ifndef MAP_SYNC
> +#define MAP_SYNC 0x80000
> +#endif
> +
> +#ifndef MAP_SHARED_VALIDATE
> +#define MAP_SHARED_VALIDATE 0x03
> +#endif
> +
> +#else /* !CONFIG_LINUX */
> +#define MAP_SYNC 0x0
> +#define MAP_SHARED_VALIDATE 0x0
> +#endif /* CONFIG_LINUX */
> +
> #ifdef CONFIG_POSIX
> struct qemu_signalfd_siginfo {
> uint32_t ssi_signo; /* Signal number */
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 97bbeed..2c86ad2 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
> #else
> void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1,
> 0);
> #endif
> + int mmap_xflags = 0;
> size_t offset;
> void *ptr1;
>
> @@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
> assert(is_power_of_2(align));
> /* Always align to host page size */
> assert(align >= getpagesize());
> + if (shared && is_pmem) {
> + mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
> + }
>
> offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
> +retry_mmap:
> ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
> MAP_FIXED |
> (fd == -1 ? MAP_ANONYMOUS : 0) |
> - (shared ? MAP_SHARED : MAP_PRIVATE),
> + (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
> fd, 0);
> +
> + /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> + * we try with MAP_SHARED_VALIDATE without MAP_SYNC
> + */
> + if (ptr1 == MAP_FAILED &&
> + mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
> + if (errno == ENOTSUP) {
> + perror("failed to validate with mapping flags");
> + }
> + mmap_xflags = MAP_SHARED_VALIDATE;
> + goto retry_mmap;
> + }
> + /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
> + * Test only with MAP_SHARED_VALIDATE flag for compatibility.
> + * Then ignore the MAP_SHARED_VALIDATE flag and retry again
> + */
> + if (mmap_xflags == MAP_SHARED_VALIDATE &&
> + ptr1 == MAP_FAILED) {
> + mmap_xflags &= ~MAP_SHARED_VALIDATE;
> + goto retry_mmap;
> + }
I am not sure if we need this multiple validation. If MAP_SYNC with
MAP_SHARED_VALIDATE is not supported or failed, just fallback to
mmap without MAP_SYNC & MAP_SHARED_VALIDATE?
I saw a'lot of discussion in previous version of this patch series.
I am not sure if its suggested this way or I am missing anything
important here.
Thanks,
Pankaj
> if (ptr1 == MAP_FAILED) {
> munmap(ptr, total);
> return MAP_FAILED;
> --
> 2.7.4
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap Zhang, Yi
@ 2019-01-29 6:58 ` Pankaj Gupta
0 siblings, 0 replies; 12+ messages in thread
From: Pankaj Gupta @ 2019-01-29 6:58 UTC (permalink / raw)
To: Yi Zhang
Cc: xiaoguangrong eric, stefanha, pbonzini, yu c zhang, richardw yang,
mst, ehabkost, imammedo, dan j williams, qemu-devel
>
> From: Zhang Yi <yi.z.zhang@linux.intel.com>
>
> besides the existing 'shared' flags, we are going to add
> 'is_pmem' to qemu_ram_mmap(), which indicated the memory backend
> file is a persist memory.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> ---
> exec.c | 2 +-
> include/qemu/mmap-alloc.h | 21 ++++++++++++++++++++-
> util/mmap-alloc.c | 6 +++++-
> util/oslib-posix.c | 2 +-
> 4 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index bb6170d..27cea52 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1860,7 +1860,7 @@ static void *file_ram_alloc(RAMBlock *block,
> }
>
> area = qemu_ram_mmap(fd, memory, block->mr->align,
> - block->flags & RAM_SHARED);
> + block->flags & RAM_SHARED, block->flags &
> RAM_PMEM);
> if (area == MAP_FAILED) {
> error_setg_errno(errp, errno,
> "unable to map backing store for guest RAM");
> diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
> index 50385e3..190688a 100644
> --- a/include/qemu/mmap-alloc.h
> +++ b/include/qemu/mmap-alloc.h
> @@ -7,7 +7,26 @@ size_t qemu_fd_getpagesize(int fd);
>
> size_t qemu_mempath_getpagesize(const char *mem_path);
>
> -void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
> +/**
> + * qemu_ram_mmap: mmap the specified file or device.
> + *
> + * Parameters:
> + * @fd: the file or the device to mmap
> + * @size: the number of bytes to be mmaped
> + * @align: if not zero, specify the alignment of the starting mapping
> address;
> + * otherwise, the alignment in use will be determined by QEMU.
> + * @shared: map has RAM_SHARED flag.
> + * @is_pmem: map has RAM_PMEM flag.
> + *
> + * Return:
> + * On success, return a pointer to the mapped area.
> + * On failure, return MAP_FAILED.
> + */
> +void *qemu_ram_mmap(int fd,
> + size_t size,
> + size_t align,
> + bool shared,
> + bool is_pmem);
>
> void qemu_ram_munmap(void *ptr, size_t size);
>
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index fd329ec..97bbeed 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -75,7 +75,11 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
> return getpagesize();
> }
>
> -void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
> +void *qemu_ram_mmap(int fd,
> + size_t size,
> + size_t align,
> + bool shared,
> + bool is_pmem)
> {
> /*
> * Note: this always allocates at least one extra page of virtual
> address
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index fbd0dc8..040937f 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -203,7 +203,7 @@ void *qemu_memalign(size_t alignment, size_t size)
> void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
> {
> size_t align = QEMU_VMALLOC_ALIGN;
> - void *ptr = qemu_ram_mmap(-1, size, align, shared);
> + void *ptr = qemu_ram_mmap(-1, size, align, shared, false);
>
> if (ptr == MAP_FAILED) {
> return NULL;
> --
> 2.7.4
Looks good to me.
Reviewed-by: pagupta@redhat.com
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Zhang, Yi
2019-01-29 6:55 ` Pankaj Gupta
@ 2019-01-29 13:50 ` Michael S. Tsirkin
2019-01-30 10:36 ` Yi Zhang
1 sibling, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-29 13:50 UTC (permalink / raw)
To: Zhang, Yi
Cc: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, ehabkost, qemu-devel, imammedo, dan.j.williams
On Tue, Jan 29, 2019 at 10:49:09PM +0800, Zhang, Yi wrote:
> From: Zhang Yi <yi.z.zhang@linux.intel.com>
>
> When a file supporting DAX is used as vNVDIMM backend, mmap it with
> MAP_SYNC flag in addition which can ensure file system metadata
> synced in each guest writes to the backend file, without other QEMU
> actions (e.g., periodic fsync() by QEMU).
>
> Current, We have below different possible use cases:
>
> 1. pmem=on is set, shared=on is set, MAP_SYNC supported:
> a: backend is a dax supporting file.
> - MAP_SYNC will active.
> b: backend is not a dax supporting file.
> - mmap will trigger a warning. then MAP_SYNC flag will be ignored
>
> 2. The rest of cases:
> - we will never pass the MAP_SYNC to mmap2
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> ---
> include/qemu/osdep.h | 21 +++++++++++++++++++++
> util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 457d24e..96209bb 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> # define QEMU_VMALLOC_ALIGN getpagesize()
> #endif
>
> +/*
> + * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
> + * 4.15, so they may not be defined when compiling on older kernels.
> + */
> +#ifdef CONFIG_LINUX
> +
> +#include <linux/mman.h>
> +
> +#ifndef MAP_SYNC
> +#define MAP_SYNC 0x80000
> +#endif
> +
> +#ifndef MAP_SHARED_VALIDATE
> +#define MAP_SHARED_VALIDATE 0x03
> +#endif
> +
I commented on this part in v7. That's a wrong way to handle
compatibility.
We had this discussion several times in the past. commit log doesn't
mention any reasons to ignore this. All for setting a single bit in a
single system call. This is getting discouraging.
> +#else /* !CONFIG_LINUX */
> +#define MAP_SYNC 0x0
> +#define MAP_SHARED_VALIDATE 0x0
> +#endif /* CONFIG_LINUX */
> +
> #ifdef CONFIG_POSIX
> struct qemu_signalfd_siginfo {
> uint32_t ssi_signo; /* Signal number */
> diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> index 97bbeed..2c86ad2 100644
> --- a/util/mmap-alloc.c
> +++ b/util/mmap-alloc.c
> @@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
> #else
> void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> #endif
> + int mmap_xflags = 0;
That's not a good variable name. It's extra for you since
you are writing the patch but it makes no sense
in the context of the function. Just mmap_flags
will do - and I would put all flags there, not just
the "extra" that this patch is adding.
> size_t offset;
> void *ptr1;
>
> @@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
> assert(is_power_of_2(align));
> /* Always align to host page size */
> assert(align >= getpagesize());
> + if (shared && is_pmem) {
> + mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
> + }
>
> offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
> +retry_mmap:
> ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
> MAP_FIXED |
> (fd == -1 ? MAP_ANONYMOUS : 0) |
> - (shared ? MAP_SHARED : MAP_PRIVATE),
> + (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
> fd, 0);
> +
> + /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> + * we try with MAP_SHARED_VALIDATE without MAP_SYNC
> + */
> + if (ptr1 == MAP_FAILED &&
> + mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
> + if (errno == ENOTSUP) {
> + perror("failed to validate with mapping flags");
> + }
> + mmap_xflags = MAP_SHARED_VALIDATE;
> + goto retry_mmap;
Have you read
https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf
Please just call the function twice. You don't need goto
to repeat a single line.
> + }
> + /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
> + * Test only with MAP_SHARED_VALIDATE flag for compatibility.
> + * Then ignore the MAP_SHARED_VALIDATE flag and retry again
> + */
> + if (mmap_xflags == MAP_SHARED_VALIDATE &&
> + ptr1 == MAP_FAILED) {
> + mmap_xflags &= ~MAP_SHARED_VALIDATE;
> + goto retry_mmap;
> + }
> if (ptr1 == MAP_FAILED) {
> munmap(ptr, total);
> return MAP_FAILED;
> --
> 2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation Zhang, Yi
@ 2019-01-29 14:09 ` Michael S. Tsirkin
2019-01-30 11:20 ` Yi Zhang
0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-29 14:09 UTC (permalink / raw)
To: Zhang, Yi
Cc: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, ehabkost, qemu-devel, imammedo, dan.j.williams
On Tue, Jan 29, 2019 at 10:49:18PM +0800, Zhang, Yi wrote:
> From: Zhang Yi <yi.z.zhang@linux.intel.com>
>
> Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> ---
> docs/nvdimm.txt | 29 ++++++++++++++++++++++++++++-
> qemu-options.hx | 4 ++++
> 2 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
> index 5f158a6..9da96aa 100644
> --- a/docs/nvdimm.txt
> +++ b/docs/nvdimm.txt
> @@ -142,11 +142,38 @@ backend of vNVDIMM:
> Guest Data Persistence
> ----------------------
>
> +vNVDIMM is designed and implemented to guarantee the guest data
> +persistence on the backends in case of host crash or a power failures.
> +However, there are still some requirements and limitations
> +as explained below.
> +
I'd just drop the above paragraph.
> Though QEMU supports multiple types of vNVDIMM backends on Linux,
> -currently the only one that can guarantee the guest write persistence
> +if MAP_SYNC is not supported by the host kernel and the backends,
> +the only backend that can guarantee the guest write persistence
> is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
> which all guest access do not involve any host-side kernel cache.
>
> +mmap(2) flag MAP_SYNC is added since Linux kernel 4.15. On such
> +systems, QEMU can mmap(2) the dax backend files with MAP_SYNC, which
> +ensures filesystem metadata consistency in case of a host crash or a power
> +failure. Enabling MAP_SYNC in QEMU requires below conditions
> +
> + - 'pmem' option of memory-backend-file is 'on':
> + The backend is a file supporting DAX, e.g., a file on an ext4 or
> + xfs file system mounted with '-o dax'. if your pmem=on ,but the backend is
> + not a file supporting DAX, mapping with this flag results in an EOPNOTSUPP
> + warning. then MAP_SYNC will be ignored
> +
> + - 'share' option of memory-backend-file is 'on':
> + MAP_SYNC flag available only with the MAP_SHARED_VALIDATE mapping type.
> +
> + - 'MAP_SYNC' is supported on linux kernel.(default opened since Linux 4.15)
> +
> +Otherwise, We will ignore the MAP_SYNC flag.
> +
> +For more details, please reference mmap(2) man page:
> +http://man7.org/linux/man-pages/man2/mmap.2.html.
> +
OK above is too low level so it doesn't really help anyone. Instead
it describes code internals and will quickly get out of sync
(pun intended). Let's look at the manpage:
Shared file mappings with this flag provide the guarantee that
while some memory is writably mapped in the address space of
the process, it will be visible in the same file at the same
offset even after the system crashes or is rebooted. In con‐
junction with the use of appropriate CPU instructions, this
provides users of such mappings with a more efficient way of
making data modifications persistent.
OK this is more readable. We already have:
Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backend that can guarantee the guest write persistence
is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
which all guest access do not involve any host-side kernel cache.
Let's add:
When using a file supporting DAX (direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel
has support for the MAP_SYNC flag in the mmap system call
(available since Linux 4.15 and on certain distro kernels)
and additionally both 'pmem' and 'share' flags are set to 'on'
on the backend.
If these conditions are not satisfied i.e. if either 'pmem' or 'share'
are not set, if the backend file does not support DAX
or if MAP_SYNC is not supported by the host kernel, write
persistence is not guaranteed after a system crash.
For compatibility reasons, these conditions are silently ignored if not
satisfied. Currently, no way is provided to test for them.
> When using other types of backends, it's suggested to set 'unarmed'
> option of '-device nvdimm' to 'on', which sets the unarmed flag of the
> guest NVDIMM region mapping structure. This unarmed flag indicates
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 08f8516..0cd41f4 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -4002,6 +4002,10 @@ using the SNIA NVM programming model (e.g. Intel NVDIMM).
> If @option{pmem} is set to 'on', QEMU will take necessary operations to
> guarantee the persistence of its own writes to @option{mem-path}
> (e.g. in vNVDIMM label emulation and live migration).
> +Also, we will map the backend-file with MAP_SYNC flag, which can ensure
> +the file metadata is in sync to @option{mem-path} in case of host crash
> +or a power failure. MAP_SYNC requires support from both the host kernel
> +(since Linux kernel 4.15) and @option{mem-path} (only files supporting DAX).
>
> @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
>
> --
> 2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v11 0/3] support MAP_SYNC for memory-backend-file
@ 2019-01-29 14:48 Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap Zhang, Yi
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Zhang, Yi @ 2019-01-29 14:48 UTC (permalink / raw)
To: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, mst, ehabkost
Cc: qemu-devel, imammedo, dan.j.williams, Zhang, Yi
Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
files on ext4/xfs file system mounted with '-o dax').
A description of MAP_SYNC and MAP_SHARED_VALIDATE can be found at
https://patchwork.kernel.org/patch/10028151/
In order to make sure that the file metadata is in sync after a fault
while we are writing a shared DAX supporting backend files, this
patch-set enables QEMU to use MAP_SYNC flag for memory-backend-dax-file.
As the DAX vs DMA truncated issue was solved, we refined the code and
send out this feature for the v5 version.
We will pass MAP_SYNC to mmap(2); if MAP_SYNC is supported and
'share=on' & 'pmem=on'.
Or QEMU will not pass this flag to mmap(2)
Test with below cases:
1. pmem=on is set, shared=on is set, MAP_SYNC supported:
a: backend is a dax supporting file.
1) start VM1 with options:
-object memory-backend-file,id=nv_be4,share,mem-path=${DAX_FILE_1},size=${DAX_FILE_SIZE_1},align=128M,pmem=on,share=on
-device nvdimm,id=nv4,memdev=nv_be4,label-size=2M.
2) start VM2 with options:
-object memory-backend-file,id=nv_be4,share,mem-path=${DAX_FILE_2,size=${DAX_FILE_SIZE_2},align=128M,pmem=on,share=on
-device nvdimm,id=nv4,memdev=nv_be4,label-size=2M.
3) live migrate from VM1 to VM2.
4) Suddly let Host crash or power failure.
5) check DAX_FILE_1 and DAX_FILE_2, no corrupt.
b: backend is a regular file.
1) start with options
-object memory-backend-file,id=nv_be4,share,mem-path=${REG_FILE},size=${REG_FILE_SIZE},align=128M,pmem=on,share=on
-device nvdimm,id=nv4,memdev=nv_be4,label-size=2M.
will warning "failed to validate with mapping flags: Operation not supported"
FILE_1 and FILE_2 random corrupt.
2. Other cases:
FILE_1 and FILE_2 random corrupt.
Changes in V11:
* 1/3: Micheal: Change to just add a bool is_pmem in qemu_ram_mmap.
* 2/3: Micheal: Fix the compatibility for old kernel.
* 2/3&3/3: Micheal&Eduardo :Update the behavior below:
Waning at no-dax and continue without MAP_SYNC.
Test if fails again for compatibility, then remove the MAP_VALIDATE and
silently proceed.
Changes in V10:
* 4/4: refine the document.
* 3/4: Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
* 2/4: refine the commit message, Added MAP_SHARED_VALIDATE.
* 2/4: Fix the wrong include header
Changes in V9:
* 1/6: Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
* 2/6: New Added: Micheal: use sparse feature define RAM_FLAG.
since I don't have much knowledge about the sparse feature, @Micheal Could you
add some documentation/commit message on this patch? Thank you very much.
* 3/6: from 2/5: Eduardo: updated the commit message.
* 4/6: from 3/5: Micheal: don't ignore MAP_SYNC failures silently.
* 5/6: from 4/5: Eduardo: updated the commit message.
* 6/6: from 5/5: Micheal: Drop the sync option, document the MAP_SYNC.
Changes in v8:
* Micheal: 3/5, remove the duplicated define in the os_dep.h
* Micheal: 2/5, make type define safety.
* Micheal: 2/5, fixed the incorrect define MAP_SHARE on qemu_anon_ram_alloc.
* 4/6 removed, we remove the on/off/auto define of sync, as by now,
MAP_SYNC only worked with pmem=on.
* @Micheal, I still reuse the RAM_SYNC flag, it is much straightforward to parse
all the flags in one parameter.
Changes in v7:
* Micheal: [3,4,6]/6 limited the "sync" flag only on a nvdimm backend.(pmem=on)
Changes in v6:
* Pankaj: 3/7 are squashed with 2/7
* Pankaj: 7/7 update comments to "consistent filesystem metadata".
* Pankaj, Igor: 1/7 Added Reviewed-by in patch-1/7
* Stefan, 4/7 move the include header from "/linux/mman.h" to "osdep.h"
* Stefan, 5/7 Add missing "munmap"
* Stefan, 2/7 refine the shared/flag.
Changes in v5:
* Add patch 1 to fix a memory leak issue.
* Refine the patch 4-6
* Remove the patch 3 as we already change the parameter from "shared" to
"flags"
Changes in v4:
* Add patch 1-3 to switch some functions to a single 'flags'
parameters. (Michael S. Tsirkin)
* v3 patch 1-3 become v4 patch 4-6.
* Patch 4: move definitions of MAP_SYNC and MAP_SHARED_VALIDATE to a
new header file under include/standard-headers/linux/. (Michael S. Tsirkin)
* Patch 6: refine the description of the 'sync' option. (Michael S. Tsirkin)
Changes in v3:
* Patch 1: add MAP_SHARED_VALIDATE in both sync=on and sync=auto
cases, and add back the retry mechanism. MAP_SYNC will be ignored
by Linux kernel 4.15 if MAP_SHARED_VALIDATE is missed.
* Patch 1: define MAP_SYNC and MAP_SHARED_VALIDATE as 0 on non-Linux
platforms in order to make qemu_ram_mmap() compile on those platforms.
* Patch 2&3: include more information in error messages of
memory-backend in hope to help user to identify the error.
(Dr. David Alan Gilbert)
* Patch 3: fix typo in the commit message. (Dr. David Alan Gilbert)
Changes in v2:
* Add 'sync' option to control the use of MAP_SYNC. (Eduardo Habkost)
* Remove the unnecessary set of MAP_SHARED_VALIDATE in some cases and
the retry mechanism in qemu_ram_mmap(). (Michael S. Tsirkin)
* Move OS dependent definitions of MAP_SYNC and MAP_SHARED_VALIDATE
to osdep.h. (Michael S. Tsirkin)
Zhang Yi (3):
util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap
util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
docs: Added MAP_SYNC documentation
docs/nvdimm.txt | 29 ++++++++++++++++++++++++++++-
exec.c | 2 +-
include/qemu/mmap-alloc.h | 21 ++++++++++++++++++++-
include/qemu/osdep.h | 21 +++++++++++++++++++++
qemu-options.hx | 4 ++++
util/mmap-alloc.c | 34 ++++++++++++++++++++++++++++++++--
util/oslib-posix.c | 2 +-
7 files changed, 107 insertions(+), 6 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap
2019-01-29 14:48 [Qemu-devel] [PATCH v11 0/3] support MAP_SYNC for memory-backend-file Zhang, Yi
@ 2019-01-29 14:49 ` Zhang, Yi
2019-01-29 6:58 ` Pankaj Gupta
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation Zhang, Yi
2 siblings, 1 reply; 12+ messages in thread
From: Zhang, Yi @ 2019-01-29 14:49 UTC (permalink / raw)
To: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, mst, ehabkost
Cc: qemu-devel, imammedo, dan.j.williams, Zhang Yi
From: Zhang Yi <yi.z.zhang@linux.intel.com>
besides the existing 'shared' flags, we are going to add
'is_pmem' to qemu_ram_mmap(), which indicated the memory backend
file is a persist memory.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
---
exec.c | 2 +-
include/qemu/mmap-alloc.h | 21 ++++++++++++++++++++-
util/mmap-alloc.c | 6 +++++-
util/oslib-posix.c | 2 +-
4 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/exec.c b/exec.c
index bb6170d..27cea52 100644
--- a/exec.c
+++ b/exec.c
@@ -1860,7 +1860,7 @@ static void *file_ram_alloc(RAMBlock *block,
}
area = qemu_ram_mmap(fd, memory, block->mr->align,
- block->flags & RAM_SHARED);
+ block->flags & RAM_SHARED, block->flags & RAM_PMEM);
if (area == MAP_FAILED) {
error_setg_errno(errp, errno,
"unable to map backing store for guest RAM");
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 50385e3..190688a 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,7 +7,26 @@ size_t qemu_fd_getpagesize(int fd);
size_t qemu_mempath_getpagesize(const char *mem_path);
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
+/**
+ * qemu_ram_mmap: mmap the specified file or device.
+ *
+ * Parameters:
+ * @fd: the file or the device to mmap
+ * @size: the number of bytes to be mmaped
+ * @align: if not zero, specify the alignment of the starting mapping address;
+ * otherwise, the alignment in use will be determined by QEMU.
+ * @shared: map has RAM_SHARED flag.
+ * @is_pmem: map has RAM_PMEM flag.
+ *
+ * Return:
+ * On success, return a pointer to the mapped area.
+ * On failure, return MAP_FAILED.
+ */
+void *qemu_ram_mmap(int fd,
+ size_t size,
+ size_t align,
+ bool shared,
+ bool is_pmem);
void qemu_ram_munmap(void *ptr, size_t size);
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index fd329ec..97bbeed 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -75,7 +75,11 @@ size_t qemu_mempath_getpagesize(const char *mem_path)
return getpagesize();
}
-void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
+void *qemu_ram_mmap(int fd,
+ size_t size,
+ size_t align,
+ bool shared,
+ bool is_pmem)
{
/*
* Note: this always allocates at least one extra page of virtual address
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index fbd0dc8..040937f 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -203,7 +203,7 @@ void *qemu_memalign(size_t alignment, size_t size)
void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment, bool shared)
{
size_t align = QEMU_VMALLOC_ALIGN;
- void *ptr = qemu_ram_mmap(-1, size, align, shared);
+ void *ptr = qemu_ram_mmap(-1, size, align, shared, false);
if (ptr == MAP_FAILED) {
return NULL;
--
2.7.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-29 14:48 [Qemu-devel] [PATCH v11 0/3] support MAP_SYNC for memory-backend-file Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap Zhang, Yi
@ 2019-01-29 14:49 ` Zhang, Yi
2019-01-29 6:55 ` Pankaj Gupta
2019-01-29 13:50 ` Michael S. Tsirkin
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation Zhang, Yi
2 siblings, 2 replies; 12+ messages in thread
From: Zhang, Yi @ 2019-01-29 14:49 UTC (permalink / raw)
To: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, mst, ehabkost
Cc: qemu-devel, imammedo, dan.j.williams, Zhang Yi
From: Zhang Yi <yi.z.zhang@linux.intel.com>
When a file supporting DAX is used as vNVDIMM backend, mmap it with
MAP_SYNC flag in addition which can ensure file system metadata
synced in each guest writes to the backend file, without other QEMU
actions (e.g., periodic fsync() by QEMU).
Current, We have below different possible use cases:
1. pmem=on is set, shared=on is set, MAP_SYNC supported:
a: backend is a dax supporting file.
- MAP_SYNC will active.
b: backend is not a dax supporting file.
- mmap will trigger a warning. then MAP_SYNC flag will be ignored
2. The rest of cases:
- we will never pass the MAP_SYNC to mmap2
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
---
include/qemu/osdep.h | 21 +++++++++++++++++++++
util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 457d24e..96209bb 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
# define QEMU_VMALLOC_ALIGN getpagesize()
#endif
+/*
+ * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
+ * 4.15, so they may not be defined when compiling on older kernels.
+ */
+#ifdef CONFIG_LINUX
+
+#include <linux/mman.h>
+
+#ifndef MAP_SYNC
+#define MAP_SYNC 0x80000
+#endif
+
+#ifndef MAP_SHARED_VALIDATE
+#define MAP_SHARED_VALIDATE 0x03
+#endif
+
+#else /* !CONFIG_LINUX */
+#define MAP_SYNC 0x0
+#define MAP_SHARED_VALIDATE 0x0
+#endif /* CONFIG_LINUX */
+
#ifdef CONFIG_POSIX
struct qemu_signalfd_siginfo {
uint32_t ssi_signo; /* Signal number */
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 97bbeed..2c86ad2 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
#else
void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
#endif
+ int mmap_xflags = 0;
size_t offset;
void *ptr1;
@@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
assert(is_power_of_2(align));
/* Always align to host page size */
assert(align >= getpagesize());
+ if (shared && is_pmem) {
+ mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
+ }
offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
+retry_mmap:
ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
MAP_FIXED |
(fd == -1 ? MAP_ANONYMOUS : 0) |
- (shared ? MAP_SHARED : MAP_PRIVATE),
+ (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
fd, 0);
+
+ /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
+ * we try with MAP_SHARED_VALIDATE without MAP_SYNC
+ */
+ if (ptr1 == MAP_FAILED &&
+ mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
+ if (errno == ENOTSUP) {
+ perror("failed to validate with mapping flags");
+ }
+ mmap_xflags = MAP_SHARED_VALIDATE;
+ goto retry_mmap;
+ }
+ /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
+ * Test only with MAP_SHARED_VALIDATE flag for compatibility.
+ * Then ignore the MAP_SHARED_VALIDATE flag and retry again
+ */
+ if (mmap_xflags == MAP_SHARED_VALIDATE &&
+ ptr1 == MAP_FAILED) {
+ mmap_xflags &= ~MAP_SHARED_VALIDATE;
+ goto retry_mmap;
+ }
if (ptr1 == MAP_FAILED) {
munmap(ptr, total);
return MAP_FAILED;
--
2.7.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation
2019-01-29 14:48 [Qemu-devel] [PATCH v11 0/3] support MAP_SYNC for memory-backend-file Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Zhang, Yi
@ 2019-01-29 14:49 ` Zhang, Yi
2019-01-29 14:09 ` Michael S. Tsirkin
2 siblings, 1 reply; 12+ messages in thread
From: Zhang, Yi @ 2019-01-29 14:49 UTC (permalink / raw)
To: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, mst, ehabkost
Cc: qemu-devel, imammedo, dan.j.williams, Zhang Yi
From: Zhang Yi <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
---
docs/nvdimm.txt | 29 ++++++++++++++++++++++++++++-
qemu-options.hx | 4 ++++
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index 5f158a6..9da96aa 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -142,11 +142,38 @@ backend of vNVDIMM:
Guest Data Persistence
----------------------
+vNVDIMM is designed and implemented to guarantee the guest data
+persistence on the backends in case of host crash or a power failures.
+However, there are still some requirements and limitations
+as explained below.
+
Though QEMU supports multiple types of vNVDIMM backends on Linux,
-currently the only one that can guarantee the guest write persistence
+if MAP_SYNC is not supported by the host kernel and the backends,
+the only backend that can guarantee the guest write persistence
is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
which all guest access do not involve any host-side kernel cache.
+mmap(2) flag MAP_SYNC is added since Linux kernel 4.15. On such
+systems, QEMU can mmap(2) the dax backend files with MAP_SYNC, which
+ensures filesystem metadata consistency in case of a host crash or a power
+failure. Enabling MAP_SYNC in QEMU requires below conditions
+
+ - 'pmem' option of memory-backend-file is 'on':
+ The backend is a file supporting DAX, e.g., a file on an ext4 or
+ xfs file system mounted with '-o dax'. if your pmem=on ,but the backend is
+ not a file supporting DAX, mapping with this flag results in an EOPNOTSUPP
+ warning. then MAP_SYNC will be ignored
+
+ - 'share' option of memory-backend-file is 'on':
+ MAP_SYNC flag available only with the MAP_SHARED_VALIDATE mapping type.
+
+ - 'MAP_SYNC' is supported on linux kernel.(default opened since Linux 4.15)
+
+Otherwise, We will ignore the MAP_SYNC flag.
+
+For more details, please reference mmap(2) man page:
+http://man7.org/linux/man-pages/man2/mmap.2.html.
+
When using other types of backends, it's suggested to set 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates
diff --git a/qemu-options.hx b/qemu-options.hx
index 08f8516..0cd41f4 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4002,6 +4002,10 @@ using the SNIA NVM programming model (e.g. Intel NVDIMM).
If @option{pmem} is set to 'on', QEMU will take necessary operations to
guarantee the persistence of its own writes to @option{mem-path}
(e.g. in vNVDIMM label emulation and live migration).
+Also, we will map the backend-file with MAP_SYNC flag, which can ensure
+the file metadata is in sync to @option{mem-path} in case of host crash
+or a power failure. MAP_SYNC requires support from both the host kernel
+(since Linux kernel 4.15) and @option{mem-path} (only files supporting DAX).
@item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
--
2.7.4
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-30 10:36 ` Yi Zhang
@ 2019-01-30 2:28 ` Michael S. Tsirkin
0 siblings, 0 replies; 12+ messages in thread
From: Michael S. Tsirkin @ 2019-01-30 2:28 UTC (permalink / raw)
To: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, ehabkost, qemu-devel, imammedo, dan.j.williams
On Wed, Jan 30, 2019 at 06:36:46PM +0800, Yi Zhang wrote:
> On 2019-01-29 at 08:50:46 -0500, Michael S. Tsirkin wrote:
> > On Tue, Jan 29, 2019 at 10:49:09PM +0800, Zhang, Yi wrote:
> > > From: Zhang Yi <yi.z.zhang@linux.intel.com>
> > >
> > > When a file supporting DAX is used as vNVDIMM backend, mmap it with
> > > MAP_SYNC flag in addition which can ensure file system metadata
> > > synced in each guest writes to the backend file, without other QEMU
> > > actions (e.g., periodic fsync() by QEMU).
> > >
> > > Current, We have below different possible use cases:
> > >
> > > 1. pmem=on is set, shared=on is set, MAP_SYNC supported:
> > > a: backend is a dax supporting file.
> > > - MAP_SYNC will active.
> > > b: backend is not a dax supporting file.
> > > - mmap will trigger a warning. then MAP_SYNC flag will be ignored
> > >
> > > 2. The rest of cases:
> > > - we will never pass the MAP_SYNC to mmap2
> > >
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> > > ---
> > > include/qemu/osdep.h | 21 +++++++++++++++++++++
> > > util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
> > > 2 files changed, 48 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > > index 457d24e..96209bb 100644
> > > --- a/include/qemu/osdep.h
> > > +++ b/include/qemu/osdep.h
> > > @@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> > > # define QEMU_VMALLOC_ALIGN getpagesize()
> > > #endif
> > >
> > > +/*
> > > + * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
> > > + * 4.15, so they may not be defined when compiling on older kernels.
> > > + */
> > > +#ifdef CONFIG_LINUX
> > > +
> > > +#include <linux/mman.h>
> > > +
> > > +#ifndef MAP_SYNC
> > > +#define MAP_SYNC 0x80000
> > > +#endif
> > > +
> > > +#ifndef MAP_SHARED_VALIDATE
> > > +#define MAP_SHARED_VALIDATE 0x03
> > > +#endif
> > > +
> >
> > I commented on this part in v7. That's a wrong way to handle
> > compatibility.
> MAP_SYNC and MAP_SHARED_VALIDATE have not defined at pre 4.15 kernel.
> to handle the compatibility we should defined it, Right?, That is Why I
> changed the value to run time handle the compatibility.
>
> in previous version, you comments that we shouldn't direct define that
> in our own headers to avoid duplicated code.
>
> So we add #include <linux/mman.h> Right?
>
> Then shouldn't consider build qemu on pre 4.15 kernel header? won't
> failed?
>
> #ifndef MAP_SYNC
> #define MAP_SYNC 0x80000
> #endif
>
> #ifndef MAP_SHARED_VALIDATE
> #define MAP_SHARED_VALIDATE 0x03
> #endif
>
> forgive my poor understanding, I'm getting more confused.
Look at how we handle other such defines such as e.g. kvm ioctl values.
>
> >
> > We had this discussion several times in the past. commit log doesn't
> > mention any reasons to ignore this. All for setting a single bit in a
> > single system call. This is getting discouraging.
> >
> >
> > > +#else /* !CONFIG_LINUX */
> > > +#define MAP_SYNC 0x0
> > > +#define MAP_SHARED_VALIDATE 0x0
> > > +#endif /* CONFIG_LINUX */
> > > +
> > > #ifdef CONFIG_POSIX
> > > struct qemu_signalfd_siginfo {
> > > uint32_t ssi_signo; /* Signal number */
> > > diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> > > index 97bbeed..2c86ad2 100644
> > > --- a/util/mmap-alloc.c
> > > +++ b/util/mmap-alloc.c
> > > @@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
> > > #else
> > > void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > > #endif
> > > + int mmap_xflags = 0;
> >
> > That's not a good variable name. It's extra for you since
> > you are writing the patch but it makes no sense
> > in the context of the function. Just mmap_flags
> > will do - and I would put all flags there, not just
> > the "extra" that this patch is adding.
> >
> >
> > > size_t offset;
> > > void *ptr1;
> > >
> > > @@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
> > > assert(is_power_of_2(align));
> > > /* Always align to host page size */
> > > assert(align >= getpagesize());
> > > + if (shared && is_pmem) {
> > > + mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
> > > + }
> > >
> > > offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
> > > +retry_mmap:
> > > ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
> > > MAP_FIXED |
> > > (fd == -1 ? MAP_ANONYMOUS : 0) |
> > > - (shared ? MAP_SHARED : MAP_PRIVATE),
> > > + (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
> > > fd, 0);
> > > +
> > > + /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> > > + * we try with MAP_SHARED_VALIDATE without MAP_SYNC
> > > + */
> > > + if (ptr1 == MAP_FAILED &&
> > > + mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
> > > + if (errno == ENOTSUP) {
> > > + perror("failed to validate with mapping flags");
> > > + }
> > > + mmap_xflags = MAP_SHARED_VALIDATE;
> > > + goto retry_mmap;
> >
> > Have you read
> > https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf
> >
> > Please just call the function twice. You don't need goto
> > to repeat a single line.
> >
> >
> > > + }
> > > + /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
> > > + * Test only with MAP_SHARED_VALIDATE flag for compatibility.
> > > + * Then ignore the MAP_SHARED_VALIDATE flag and retry again
> > > + */
> > > + if (mmap_xflags == MAP_SHARED_VALIDATE &&
> > > + ptr1 == MAP_FAILED) {
> > > + mmap_xflags &= ~MAP_SHARED_VALIDATE;
> > > + goto retry_mmap;
> > > + }
> > > if (ptr1 == MAP_FAILED) {
> > > munmap(ptr, total);
> > > return MAP_FAILED;
> > > --
> > > 2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-29 13:50 ` Michael S. Tsirkin
@ 2019-01-30 10:36 ` Yi Zhang
2019-01-30 2:28 ` Michael S. Tsirkin
0 siblings, 1 reply; 12+ messages in thread
From: Yi Zhang @ 2019-01-30 10:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, ehabkost, qemu-devel, imammedo, dan.j.williams
On 2019-01-29 at 08:50:46 -0500, Michael S. Tsirkin wrote:
> On Tue, Jan 29, 2019 at 10:49:09PM +0800, Zhang, Yi wrote:
> > From: Zhang Yi <yi.z.zhang@linux.intel.com>
> >
> > When a file supporting DAX is used as vNVDIMM backend, mmap it with
> > MAP_SYNC flag in addition which can ensure file system metadata
> > synced in each guest writes to the backend file, without other QEMU
> > actions (e.g., periodic fsync() by QEMU).
> >
> > Current, We have below different possible use cases:
> >
> > 1. pmem=on is set, shared=on is set, MAP_SYNC supported:
> > a: backend is a dax supporting file.
> > - MAP_SYNC will active.
> > b: backend is not a dax supporting file.
> > - mmap will trigger a warning. then MAP_SYNC flag will be ignored
> >
> > 2. The rest of cases:
> > - we will never pass the MAP_SYNC to mmap2
> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> > ---
> > include/qemu/osdep.h | 21 +++++++++++++++++++++
> > util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
> > 2 files changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 457d24e..96209bb 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> > # define QEMU_VMALLOC_ALIGN getpagesize()
> > #endif
> >
> > +/*
> > + * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
> > + * 4.15, so they may not be defined when compiling on older kernels.
> > + */
> > +#ifdef CONFIG_LINUX
> > +
> > +#include <linux/mman.h>
> > +
> > +#ifndef MAP_SYNC
> > +#define MAP_SYNC 0x80000
> > +#endif
> > +
> > +#ifndef MAP_SHARED_VALIDATE
> > +#define MAP_SHARED_VALIDATE 0x03
> > +#endif
> > +
>
> I commented on this part in v7. That's a wrong way to handle
> compatibility.
MAP_SYNC and MAP_SHARED_VALIDATE have not defined at pre 4.15 kernel.
to handle the compatibility we should defined it, Right?, That is Why I
changed the value to run time handle the compatibility.
in previous version, you comments that we shouldn't direct define that
in our own headers to avoid duplicated code.
So we add #include <linux/mman.h> Right?
Then shouldn't consider build qemu on pre 4.15 kernel header? won't
failed?
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
forgive my poor understanding, I'm getting more confused.
>
> We had this discussion several times in the past. commit log doesn't
> mention any reasons to ignore this. All for setting a single bit in a
> single system call. This is getting discouraging.
>
>
> > +#else /* !CONFIG_LINUX */
> > +#define MAP_SYNC 0x0
> > +#define MAP_SHARED_VALIDATE 0x0
> > +#endif /* CONFIG_LINUX */
> > +
> > #ifdef CONFIG_POSIX
> > struct qemu_signalfd_siginfo {
> > uint32_t ssi_signo; /* Signal number */
> > diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> > index 97bbeed..2c86ad2 100644
> > --- a/util/mmap-alloc.c
> > +++ b/util/mmap-alloc.c
> > @@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
> > #else
> > void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> > #endif
> > + int mmap_xflags = 0;
>
> That's not a good variable name. It's extra for you since
> you are writing the patch but it makes no sense
> in the context of the function. Just mmap_flags
> will do - and I would put all flags there, not just
> the "extra" that this patch is adding.
>
>
> > size_t offset;
> > void *ptr1;
> >
> > @@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
> > assert(is_power_of_2(align));
> > /* Always align to host page size */
> > assert(align >= getpagesize());
> > + if (shared && is_pmem) {
> > + mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
> > + }
> >
> > offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
> > +retry_mmap:
> > ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
> > MAP_FIXED |
> > (fd == -1 ? MAP_ANONYMOUS : 0) |
> > - (shared ? MAP_SHARED : MAP_PRIVATE),
> > + (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
> > fd, 0);
> > +
> > + /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> > + * we try with MAP_SHARED_VALIDATE without MAP_SYNC
> > + */
> > + if (ptr1 == MAP_FAILED &&
> > + mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
> > + if (errno == ENOTSUP) {
> > + perror("failed to validate with mapping flags");
> > + }
> > + mmap_xflags = MAP_SHARED_VALIDATE;
> > + goto retry_mmap;
>
> Have you read
> https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf
>
> Please just call the function twice. You don't need goto
> to repeat a single line.
>
>
> > + }
> > + /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
> > + * Test only with MAP_SHARED_VALIDATE flag for compatibility.
> > + * Then ignore the MAP_SHARED_VALIDATE flag and retry again
> > + */
> > + if (mmap_xflags == MAP_SHARED_VALIDATE &&
> > + ptr1 == MAP_FAILED) {
> > + mmap_xflags &= ~MAP_SHARED_VALIDATE;
> > + goto retry_mmap;
> > + }
> > if (ptr1 == MAP_FAILED) {
> > munmap(ptr, total);
> > return MAP_FAILED;
> > --
> > 2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
2019-01-29 6:55 ` Pankaj Gupta
@ 2019-01-30 11:15 ` Yi Zhang
0 siblings, 0 replies; 12+ messages in thread
From: Yi Zhang @ 2019-01-30 11:15 UTC (permalink / raw)
To: Pankaj Gupta
Cc: xiaoguangrong eric, stefanha, pbonzini, yu c zhang, richardw yang,
mst, ehabkost, imammedo, dan j williams, qemu-devel
On 2019-01-29 at 01:55:06 -0500, Pankaj Gupta wrote:
>
> >
> > From: Zhang Yi <yi.z.zhang@linux.intel.com>
> >
> > When a file supporting DAX is used as vNVDIMM backend, mmap it with
> > MAP_SYNC flag in addition which can ensure file system metadata
> > synced in each guest writes to the backend file, without other QEMU
> > actions (e.g., periodic fsync() by QEMU).
> >
> > Current, We have below different possible use cases:
> >
> > 1. pmem=on is set, shared=on is set, MAP_SYNC supported:
> > a: backend is a dax supporting file.
> > - MAP_SYNC will active.
> > b: backend is not a dax supporting file.
> > - mmap will trigger a warning. then MAP_SYNC flag will be ignored
> >
> > 2. The rest of cases:
> > - we will never pass the MAP_SYNC to mmap2
> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> > ---
> > include/qemu/osdep.h | 21 +++++++++++++++++++++
> > util/mmap-alloc.c | 28 +++++++++++++++++++++++++++-
> > 2 files changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 457d24e..96209bb 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -419,6 +419,27 @@ void qemu_anon_ram_free(void *ptr, size_t size);
> > # define QEMU_VMALLOC_ALIGN getpagesize()
> > #endif
> >
> > +/*
> > + * MAP_SHARED_VALIDATE and MAP_SYNC are introduced in Linux kernel
> > + * 4.15, so they may not be defined when compiling on older kernels.
> > + */
> > +#ifdef CONFIG_LINUX
> > +
> > +#include <linux/mman.h>
> > +
> > +#ifndef MAP_SYNC
> > +#define MAP_SYNC 0x80000
> > +#endif
> > +
> > +#ifndef MAP_SHARED_VALIDATE
> > +#define MAP_SHARED_VALIDATE 0x03
> > +#endif
> > +
> > +#else /* !CONFIG_LINUX */
> > +#define MAP_SYNC 0x0
> > +#define MAP_SHARED_VALIDATE 0x0
> > +#endif /* CONFIG_LINUX */
> > +
> > #ifdef CONFIG_POSIX
> > struct qemu_signalfd_siginfo {
> > uint32_t ssi_signo; /* Signal number */
> > diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
> > index 97bbeed..2c86ad2 100644
> > --- a/util/mmap-alloc.c
> > +++ b/util/mmap-alloc.c
> > @@ -101,6 +101,7 @@ void *qemu_ram_mmap(int fd,
> > #else
> > void *ptr = mmap(0, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1,
> > 0);
> > #endif
> > + int mmap_xflags = 0;
> > size_t offset;
> > void *ptr1;
> >
> > @@ -111,13 +112,38 @@ void *qemu_ram_mmap(int fd,
> > assert(is_power_of_2(align));
> > /* Always align to host page size */
> > assert(align >= getpagesize());
> > + if (shared && is_pmem) {
> > + mmap_xflags = MAP_SYNC | MAP_SHARED_VALIDATE;
> > + }
> >
> > offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
> > +retry_mmap:
> > ptr1 = mmap(ptr + offset, size, PROT_READ | PROT_WRITE,
> > MAP_FIXED |
> > (fd == -1 ? MAP_ANONYMOUS : 0) |
> > - (shared ? MAP_SHARED : MAP_PRIVATE),
> > + (shared ? MAP_SHARED : MAP_PRIVATE) | mmap_xflags,
> > fd, 0);
> > +
> > + /* if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
> > + * we try with MAP_SHARED_VALIDATE without MAP_SYNC
> > + */
> > + if (ptr1 == MAP_FAILED &&
> > + mmap_xflags == (MAP_SYNC | MAP_SHARED_VALIDATE)) {
> > + if (errno == ENOTSUP) {
> > + perror("failed to validate with mapping flags");
> > + }
> > + mmap_xflags = MAP_SHARED_VALIDATE;
> > + goto retry_mmap;
> > + }
> > + /* MAP_SHARED_VALIDATE flag is available since Linux 4.15
> > + * Test only with MAP_SHARED_VALIDATE flag for compatibility.
> > + * Then ignore the MAP_SHARED_VALIDATE flag and retry again
> > + */
> > + if (mmap_xflags == MAP_SHARED_VALIDATE &&
> > + ptr1 == MAP_FAILED) {
> > + mmap_xflags &= ~MAP_SHARED_VALIDATE;
> > + goto retry_mmap;
> > + }
>
> I am not sure if we need this multiple validation. If MAP_SYNC with
> MAP_SHARED_VALIDATE is not supported or failed, just fallback to
> mmap without MAP_SYNC & MAP_SHARED_VALIDATE?
Right, that is, I will improve that. Thanks Pankaj.
>
> I saw a'lot of discussion in previous version of this patch series.
> I am not sure if its suggested this way or I am missing anything
> important here.
>
> Thanks,
> Pankaj
>
>
> > if (ptr1 == MAP_FAILED) {
> > munmap(ptr, total);
> > return MAP_FAILED;
> > --
> > 2.7.4
> >
> >
> >
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation
2019-01-29 14:09 ` Michael S. Tsirkin
@ 2019-01-30 11:20 ` Yi Zhang
0 siblings, 0 replies; 12+ messages in thread
From: Yi Zhang @ 2019-01-30 11:20 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: xiaoguangrong.eric, stefanha, pbonzini, pagupta, yu.c.zhang,
richardw.yang, ehabkost, qemu-devel, imammedo, dan.j.williams
On 2019-01-29 at 09:09:54 -0500, Michael S. Tsirkin wrote:
> On Tue, Jan 29, 2019 at 10:49:18PM +0800, Zhang, Yi wrote:
> > From: Zhang Yi <yi.z.zhang@linux.intel.com>
> >
> > Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
> > ---
> > docs/nvdimm.txt | 29 ++++++++++++++++++++++++++++-
> > qemu-options.hx | 4 ++++
> > 2 files changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
> > index 5f158a6..9da96aa 100644
> > --- a/docs/nvdimm.txt
> > +++ b/docs/nvdimm.txt
> > @@ -142,11 +142,38 @@ backend of vNVDIMM:
> > Guest Data Persistence
> > ----------------------
> >
> > +vNVDIMM is designed and implemented to guarantee the guest data
> > +persistence on the backends in case of host crash or a power failures.
> > +However, there are still some requirements and limitations
> > +as explained below.
> > +
>
> I'd just drop the above paragraph.
>
> > Though QEMU supports multiple types of vNVDIMM backends on Linux,
> > -currently the only one that can guarantee the guest write persistence
> > +if MAP_SYNC is not supported by the host kernel and the backends,
> > +the only backend that can guarantee the guest write persistence
> > is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
> > which all guest access do not involve any host-side kernel cache.
> >
> > +mmap(2) flag MAP_SYNC is added since Linux kernel 4.15. On such
> > +systems, QEMU can mmap(2) the dax backend files with MAP_SYNC, which
> > +ensures filesystem metadata consistency in case of a host crash or a power
> > +failure. Enabling MAP_SYNC in QEMU requires below conditions
> > +
> > + - 'pmem' option of memory-backend-file is 'on':
> > + The backend is a file supporting DAX, e.g., a file on an ext4 or
> > + xfs file system mounted with '-o dax'. if your pmem=on ,but the backend is
> > + not a file supporting DAX, mapping with this flag results in an EOPNOTSUPP
> > + warning. then MAP_SYNC will be ignored
> > +
> > + - 'share' option of memory-backend-file is 'on':
> > + MAP_SYNC flag available only with the MAP_SHARED_VALIDATE mapping type.
> > +
> > + - 'MAP_SYNC' is supported on linux kernel.(default opened since Linux 4.15)
> > +
> > +Otherwise, We will ignore the MAP_SYNC flag.
> > +
> > +For more details, please reference mmap(2) man page:
> > +http://man7.org/linux/man-pages/man2/mmap.2.html.
> > +
>
>
> OK above is too low level so it doesn't really help anyone. Instead
> it describes code internals and will quickly get out of sync
> (pun intended). Let's look at the manpage:
>
> Shared file mappings with this flag provide the guarantee that
> while some memory is writably mapped in the address space of
> the process, it will be visible in the same file at the same
> offset even after the system crashes or is rebooted. In con‐
> junction with the use of appropriate CPU instructions, this
> provides users of such mappings with a more efficient way of
> making data modifications persistent.
>
> OK this is more readable. We already have:
>
> Though QEMU supports multiple types of vNVDIMM backends on Linux,
> the only backend that can guarantee the guest write persistence
> is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
> which all guest access do not involve any host-side kernel cache.
>
> Let's add:
>
> When using a file supporting DAX (direct mapping of persistent memory)
> as a backend, write persistence is guaranteed if the host kernel
> has support for the MAP_SYNC flag in the mmap system call
> (available since Linux 4.15 and on certain distro kernels)
> and additionally both 'pmem' and 'share' flags are set to 'on'
> on the backend.
>
> If these conditions are not satisfied i.e. if either 'pmem' or 'share'
> are not set, if the backend file does not support DAX
> or if MAP_SYNC is not supported by the host kernel, write
> persistence is not guaranteed after a system crash.
> For compatibility reasons, these conditions are silently ignored if not
> satisfied. Currently, no way is provided to test for them.
Much better than me, thank you very much, Michael. I will add that.
>
>
> > When using other types of backends, it's suggested to set 'unarmed'
> > option of '-device nvdimm' to 'on', which sets the unarmed flag of the
> > guest NVDIMM region mapping structure. This unarmed flag indicates
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 08f8516..0cd41f4 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -4002,6 +4002,10 @@ using the SNIA NVM programming model (e.g. Intel NVDIMM).
> > If @option{pmem} is set to 'on', QEMU will take necessary operations to
> > guarantee the persistence of its own writes to @option{mem-path}
> > (e.g. in vNVDIMM label emulation and live migration).
> > +Also, we will map the backend-file with MAP_SYNC flag, which can ensure
> > +the file metadata is in sync to @option{mem-path} in case of host crash
> > +or a power failure. MAP_SYNC requires support from both the host kernel
> > +(since Linux kernel 4.15) and @option{mem-path} (only files supporting DAX).
> >
> > @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
> >
> > --
> > 2.7.4
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2019-01-30 2:50 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-01-29 14:48 [Qemu-devel] [PATCH v11 0/3] support MAP_SYNC for memory-backend-file Zhang, Yi
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 1/3] util/mmap-alloc: Add a 'is_pmem' parameter to qemu_ram_mmap Zhang, Yi
2019-01-29 6:58 ` Pankaj Gupta
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 2/3] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap() Zhang, Yi
2019-01-29 6:55 ` Pankaj Gupta
2019-01-30 11:15 ` Yi Zhang
2019-01-29 13:50 ` Michael S. Tsirkin
2019-01-30 10:36 ` Yi Zhang
2019-01-30 2:28 ` Michael S. Tsirkin
2019-01-29 14:49 ` [Qemu-devel] [PATCH v11 3/3] docs: Added MAP_SYNC documentation Zhang, Yi
2019-01-29 14:09 ` Michael S. Tsirkin
2019-01-30 11:20 ` Yi Zhang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).