* [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases
2024-05-28 16:15 [PATCH 0/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
@ 2024-05-28 16:15 ` Michal Privoznik
2024-05-28 16:51 ` David Hildenbrand
2024-05-28 16:15 ` [PATCH 2/3] backends/hostmem: Warn on qemu_madvise() failures Michal Privoznik
2024-05-28 16:15 ` [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
2 siblings, 1 reply; 9+ messages in thread
From: Michal Privoznik @ 2024-05-28 16:15 UTC (permalink / raw)
To: qemu-devel; +Cc: david, imammedo
The unspoken premise of qemu_madvise() is that errno is set on
error. And it is mostly the case except for posix_madvise() which
is documented to return either zero (on success) or a positive
error number. This means, we must set errno ourselves. And while
at it, make the function return a negative value on error, just
like other error paths do.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
util/osdep.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..e42f4e8121 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -57,7 +57,12 @@ int qemu_madvise(void *addr, size_t len, int advice)
#if defined(CONFIG_MADVISE)
return madvise(addr, len, advice);
#elif defined(CONFIG_POSIX_MADVISE)
- return posix_madvise(addr, len, advice);
+ int rc = posix_madvise(addr, len, advice);
+ if (rc) {
+ errno = rc;
+ return -1;
+ }
+ return 0;
#else
errno = EINVAL;
return -1;
--
2.44.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases
2024-05-28 16:15 ` [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases Michal Privoznik
@ 2024-05-28 16:51 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2024-05-28 16:51 UTC (permalink / raw)
To: Michal Privoznik, qemu-devel; +Cc: imammedo
Am 28.05.24 um 18:15 schrieb Michal Privoznik:
> The unspoken premise of qemu_madvise() is that errno is set on
> error. And it is mostly the case except for posix_madvise() which
> is documented to return either zero (on success) or a positive
> error number. This means, we must set errno ourselves. And while
> at it, make the function return a negative value on error, just
> like other error paths do.
>
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
> util/osdep.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/util/osdep.c b/util/osdep.c
> index e996c4744a..e42f4e8121 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -57,7 +57,12 @@ int qemu_madvise(void *addr, size_t len, int advice)
> #if defined(CONFIG_MADVISE)
> return madvise(addr, len, advice);
> #elif defined(CONFIG_POSIX_MADVISE)
> - return posix_madvise(addr, len, advice);
> + int rc = posix_madvise(addr, len, advice);
> + if (rc) {
> + errno = rc;
> + return -1;
> + }
> + return 0;
> #else
> errno = EINVAL;
> return -1;
Interesting, seems to be correct
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/3] backends/hostmem: Warn on qemu_madvise() failures
2024-05-28 16:15 [PATCH 0/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
2024-05-28 16:15 ` [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases Michal Privoznik
@ 2024-05-28 16:15 ` Michal Privoznik
2024-05-28 16:15 ` [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
2 siblings, 0 replies; 9+ messages in thread
From: Michal Privoznik @ 2024-05-28 16:15 UTC (permalink / raw)
To: qemu-devel; +Cc: david, imammedo
If user sets .merge or .dump attributes qemu_madvise() is called
with corresponding advice. But it is never checked for failure
which may mislead users into thinking the attribute is set
correctly.I believe at this point it's too late to report errors
in that case but let's report a warning at least.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
backends/hostmem.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index eb9682b4a8..1a6fd1c714 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -14,6 +14,7 @@
#include "sysemu/hostmem.h"
#include "hw/boards.h"
#include "qapi/error.h"
+#include "qemu/error-report.h"
#include "qapi/qapi-builtin-visit.h"
#include "qapi/visitor.h"
#include "qemu/config-file.h"
@@ -178,8 +179,11 @@ static void host_memory_backend_set_merge(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
- qemu_madvise(ptr, sz,
- value ? QEMU_MADV_MERGEABLE : QEMU_MADV_UNMERGEABLE);
+ if (qemu_madvise(ptr, sz,
+ value ? QEMU_MADV_MERGEABLE : QEMU_MADV_UNMERGEABLE)) {
+ warn_report("Couldn't change property 'merge' on '%s': %s",
+ object_get_typename(obj), strerror(errno));
+ }
backend->merge = value;
}
}
@@ -204,8 +208,11 @@ static void host_memory_backend_set_dump(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
- qemu_madvise(ptr, sz,
- value ? QEMU_MADV_DODUMP : QEMU_MADV_DONTDUMP);
+ if (qemu_madvise(ptr, sz,
+ value ? QEMU_MADV_DODUMP : QEMU_MADV_DONTDUMP)) {
+ warn_report("Couldn't change property 'dump' on '%s': %s",
+ object_get_typename(obj), strerror(errno));
+ }
backend->dump = value;
}
}
@@ -337,11 +344,15 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
ptr = memory_region_get_ram_ptr(&backend->mr);
sz = memory_region_size(&backend->mr);
- if (backend->merge) {
- qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
+ if (backend->merge &&
+ qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE)) {
+ warn_report("Couldn't set property 'merge' on '%s': %s",
+ object_get_typename(OBJECT(uc)), strerror(errno));
}
- if (!backend->dump) {
- qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
+ if (!backend->dump &&
+ qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP)) {
+ warn_report("Couldn't set property 'dump' on '%s': %s",
+ object_get_typename(OBJECT(uc)), strerror(errno));
}
#ifdef CONFIG_NUMA
unsigned long lastbit = find_last_bit(backend->host_nodes, MAX_NODES);
--
2.44.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
2024-05-28 16:15 [PATCH 0/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
2024-05-28 16:15 ` [PATCH 1/3] osdep: Make qemu_madvise() to set errno in all cases Michal Privoznik
2024-05-28 16:15 ` [PATCH 2/3] backends/hostmem: Warn on qemu_madvise() failures Michal Privoznik
@ 2024-05-28 16:15 ` Michal Privoznik
2024-05-28 16:47 ` David Hildenbrand
2024-05-28 17:22 ` Richard Henderson
2 siblings, 2 replies; 9+ messages in thread
From: Michal Privoznik @ 2024-05-28 16:15 UTC (permalink / raw)
To: qemu-devel; +Cc: david, imammedo
Simple reproducer:
qemu.git $ ./build/qemu-system-x86_64 \
-m size=8389632k,slots=16,maxmem=25600000k \
-object
'{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,cpus=0,memdev=ram-node0
With current master I get:
qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument
The problem is that memory size (8193MiB) is not an integer
multiple of underlying pagesize (2MiB) which triggers a check
inside of mbind(), since we can't really set policy just to a
fraction of a page. As qemu_madvise() has the same expectation,
round size passed to underlying pagesize.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
backends/hostmem.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 1a6fd1c714..9b727699f6 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -179,6 +179,8 @@ static void host_memory_backend_set_merge(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (qemu_madvise(ptr, sz,
value ? QEMU_MADV_MERGEABLE : QEMU_MADV_UNMERGEABLE)) {
warn_report("Couldn't change property 'merge' on '%s': %s",
@@ -208,6 +210,8 @@ static void host_memory_backend_set_dump(Object *obj, bool value, Error **errp)
void *ptr = memory_region_get_ram_ptr(&backend->mr);
uint64_t sz = memory_region_size(&backend->mr);
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (qemu_madvise(ptr, sz,
value ? QEMU_MADV_DODUMP : QEMU_MADV_DONTDUMP)) {
warn_report("Couldn't change property 'dump' on '%s': %s",
@@ -344,6 +348,13 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
ptr = memory_region_get_ram_ptr(&backend->mr);
sz = memory_region_size(&backend->mr);
+ /*
+ * Round up size to be an integer multiple of pagesize, because
+ * both madvise() and mbind() does not really like setting
+ * advice/policy to just a fraction of a page.
+ */
+ sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
if (backend->merge &&
qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE)) {
warn_report("Couldn't set property 'merge' on '%s': %s",
--
2.44.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
2024-05-28 16:15 ` [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
@ 2024-05-28 16:47 ` David Hildenbrand
2024-05-29 6:48 ` Michal Prívozník
2024-05-28 17:22 ` Richard Henderson
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2024-05-28 16:47 UTC (permalink / raw)
To: Michal Privoznik, qemu-devel; +Cc: imammedo
Am 28.05.24 um 18:15 schrieb Michal Privoznik:
> ./build/qemu-system-x86_64 \ -m size=8389632k,slots=16,maxmem=25600000k \
> -object
> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0
For DIMMs and friends we now (again) enforce that the size must be aligned to
the page size:
commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
Author: David Hildenbrand <david@redhat.com>
Date: Wed Jan 17 14:55:54 2024 +0100
memory-device: reintroduce memory region size check
We used to check that the memory region size is multiples of the overall
requested address alignment for the device memory address.
We removed that check, because there are cases (i.e., hv-balloon) where
devices unconditionally request an address alignment that has a very large
alignment (i.e., 32 GiB), but the actual memory device size might not be
multiples of that alignment.
However, this change:
(a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
(b) allows for DIMMs that partially cover hugetlb pages, previously
reported in [1].
...
Partial hugetlb pages do not particularly make sense; wasting memory. Do we
expect people to actually ave working setup that we might break when disallowing
such configurations? Or would we actually help them identify that something
non-sensical is happening?
When using memory-backend-memfd, we already do get a proper error:
./build/qemu-system-x86_64 -m 2047m -object
memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off -numa
node,nodeid=0,cpus=0,memdev=ram-node0 -S
qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
So this only applies to memory-backend-file, where we maybe should fail in a
similar way?
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
2024-05-28 16:47 ` David Hildenbrand
@ 2024-05-29 6:48 ` Michal Prívozník
2024-05-29 12:30 ` David Hildenbrand
0 siblings, 1 reply; 9+ messages in thread
From: Michal Prívozník @ 2024-05-29 6:48 UTC (permalink / raw)
To: David Hildenbrand, qemu-devel; +Cc: imammedo
On 5/28/24 18:47, David Hildenbrand wrote:
> Am 28.05.24 um 18:15 schrieb Michal Privoznik:
>> ./build/qemu-system-x86_64 \ -m
>> size=8389632k,slots=16,maxmem=25600000k \ -object
>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0
>
> For DIMMs and friends we now (again) enforce that the size must be
> aligned to the page size:
>
> commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
> Author: David Hildenbrand <david@redhat.com>
> Date: Wed Jan 17 14:55:54 2024 +0100
>
> memory-device: reintroduce memory region size check
>
> We used to check that the memory region size is multiples of the
> overall
> requested address alignment for the device memory address.
>
> We removed that check, because there are cases (i.e., hv-balloon) where
> devices unconditionally request an address alignment that has a very
> large
> alignment (i.e., 32 GiB), but the actual memory device size might
> not be
> multiples of that alignment.
>
> However, this change:
>
> (a) allows for some practically impossible DIMM sizes, like "1GB+1
> byte".
> (b) allows for DIMMs that partially cover hugetlb pages, previously
> reported in [1].
> ...
>
> Partial hugetlb pages do not particularly make sense; wasting memory. Do
> we expect people to actually ave working setup that we might break when
> disallowing such configurations? Or would we actually help them identify
> that something non-sensical is happening?
>
> When using memory-backend-memfd, we already do get a proper error:
>
> ./build/qemu-system-x86_64 -m 2047m -object
> memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off
> -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
> qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
>
> So this only applies to memory-backend-file, where we maybe should fail
> in a similar way?
>
Yeah, let's fail in that case. But non-aligned length is just one of
many reasons madvise()/mbind() can fail. What about the others? Should
we make them report an error or just a warning?
Michal
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
2024-05-29 6:48 ` Michal Prívozník
@ 2024-05-29 12:30 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2024-05-29 12:30 UTC (permalink / raw)
To: Michal Prívozník, qemu-devel; +Cc: imammedo
On 29.05.24 08:48, Michal Prívozník wrote:
> On 5/28/24 18:47, David Hildenbrand wrote:
>> Am 28.05.24 um 18:15 schrieb Michal Privoznik:
>>> ./build/qemu-system-x86_64 \ -m
>>> size=8389632k,slots=16,maxmem=25600000k \ -object
>>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>
>> For DIMMs and friends we now (again) enforce that the size must be
>> aligned to the page size:
>>
>> commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
>> Author: David Hildenbrand <david@redhat.com>
>> Date: Wed Jan 17 14:55:54 2024 +0100
>>
>> memory-device: reintroduce memory region size check
>>
>> We used to check that the memory region size is multiples of the
>> overall
>> requested address alignment for the device memory address.
>>
>> We removed that check, because there are cases (i.e., hv-balloon) where
>> devices unconditionally request an address alignment that has a very
>> large
>> alignment (i.e., 32 GiB), but the actual memory device size might
>> not be
>> multiples of that alignment.
>>
>> However, this change:
>>
>> (a) allows for some practically impossible DIMM sizes, like "1GB+1
>> byte".
>> (b) allows for DIMMs that partially cover hugetlb pages, previously
>> reported in [1].
>> ...
>>
>> Partial hugetlb pages do not particularly make sense; wasting memory. Do
>> we expect people to actually ave working setup that we might break when
>> disallowing such configurations? Or would we actually help them identify
>> that something non-sensical is happening?
>>
>> When using memory-backend-memfd, we already do get a proper error:
>>
>> ./build/qemu-system-x86_64 -m 2047m -object
>> memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off
>> -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
>> qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
>>
>> So this only applies to memory-backend-file, where we maybe should fail
>> in a similar way?
>>
>
> Yeah, let's fail in that case. But non-aligned length is just one of
> many reasons madvise()/mbind() can fail. What about the others? Should
> we make them report an error or just a warning?
Regarding madvise(), we should report at least a warning.
In qemu_ram_setup_dump() we print an error if QEMU_MADV_DONTDUMP failed.
But we swallow any errors from memory_try_enable_merging() ... in
general, we likely have to distinguish the "not supported by the OS"
case from " actually supported but failed" case.
In the second patch, maybe we should really fail if something unexpected
happens, instead of fake-changing the properties.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()
2024-05-28 16:15 ` [PATCH 3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind() Michal Privoznik
2024-05-28 16:47 ` David Hildenbrand
@ 2024-05-28 17:22 ` Richard Henderson
1 sibling, 0 replies; 9+ messages in thread
From: Richard Henderson @ 2024-05-28 17:22 UTC (permalink / raw)
To: Michal Privoznik, qemu-devel; +Cc: david, imammedo
On 5/28/24 09:15, Michal Privoznik wrote:
> + sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
Second argument is evaluated twice.
You probably don't want that to be a function call.
r~
^ permalink raw reply [flat|nested] 9+ messages in thread