From: David Hildenbrand <david@redhat.com>
To: "Michal Prívozník" <mprivozn@redhat.com>, qemu-devel@nongnu.org
Cc: imammedo@redhat.com
Subject: Re: [PATCH] hostmem: Round up memory size for qemu_madvise() in host_memory_backend_memory_complete()
Date: Mon, 4 Dec 2023 10:21:23 +0100 [thread overview]
Message-ID: <0b6dacd2-a7c0-469f-830a-9162dfae75bf@redhat.com> (raw)
In-Reply-To: <de569c7a-2e21-4a98-a4a9-98132b43c621@redhat.com>
On 01.12.23 10:07, Michal Prívozník wrote:
> On 11/27/23 14:55, David Hildenbrand wrote:
>> On 27.11.23 14:37, David Hildenbrand wrote:
>>> On 27.11.23 13:32, Michal Privoznik wrote:
>>>> Simple reproducer:
>>>> qemu.git $ ./build/qemu-system-x86_64 \
>>>> -m size=8389632k,slots=16,maxmem=25600000k \
>>>> -object
>>>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>>>> -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>>>
>>>> With current master I get:
>>>>
>>>> qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid
>>>> argument
>>>>
>>>> The problem is that memory size (8193MiB) is not an integer
>>>> multiple of underlying pagesize (2MiB) which triggers a check
>>>> inside of madvise(), since we can't really set a madvise() policy
>>>> just to a fraction of a page.
>>>
>>> I thought we would just always fail create something that doesn't really
>>> make any sense.
>>>
>>> Why would we want to support that case?
>>>
>>> Let me dig, I thought we would have had some check there at some point
>>> that would make that fail (especially: RAM block not aligned to the
>>> pagesize).
>>
>>
>> At least memory-backend-memfd properly fails for that case:
>>
>> $ ./build/qemu-system-x86_64 -object
>> memory-backend-memfd,hugetlb=on,size=3m,id=tmp
>> qemu-system-x86_64: failed to resize memfd to 3145728: Invalid argument
>>
>> memory-backend-file ends up creating a new file:
>>
>> $ ./build/qemu-system-x86_64 -object
>> memory-backend-file,share=on,mem-path=/dev/hugepages/tmp,size=3m,id=tmp
>>
>> $ stat /dev/hugepages/tmp
>> File: /dev/hugepages/tmp
>> Size: 4194304 Blocks: 0 IO Block: 2097152 regular file
>>
>> ... and ends up sizing it properly aligned to the huge page size.
>>
>>
>> Seems to be due to:
>>
>> if (memory < block->page_size) {
>> error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>> "or larger than page size 0x%zx",
>> memory, block->page_size);
>> return NULL;
>> }
>>
>> memory = ROUND_UP(memory, block->page_size);
>>
>> /*
>> * ftruncate is not supported by hugetlbfs in older
>> * hosts, so don't bother bailing out on errors.
>> * If anything goes wrong with it under other filesystems,
>> * mmap will fail.
>> *
>> * Do not truncate the non-empty backend file to avoid corrupting
>> * the existing data in the file. Disabling shrinking is not
>> * enough. For example, the current vNVDIMM implementation stores
>> * the guest NVDIMM labels at the end of the backend file. If the
>> * backend file is later extended, QEMU will not be able to find
>> * those labels. Therefore, extending the non-empty backend file
>> * is disabled as well.
>> */
>> if (truncate && ftruncate(fd, offset + memory)) {
>> perror("ftruncate");
>> }
>>
>> So we create a bigger file and map the bigger file and also have a
>> RAMBlock that is bigger. So we'll also consume more memory.
>>
>> ... but the memory region is smaller and we tell the VM that it has
>> less memory. Lot of work with no obvious benefit, and only some
>> memory waste :)
>>
>>
>> We better should have just rejected such memory backends right from
>> the start. But now it's likely too late.
>>
>> I suspect other things like
>> * qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
>> * qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
>>
>> fail, but we don't care for hugetlb at least regarding merging
>> and don't even log an error.
>>
>> But QEMU_MADV_DONTDUMP might also be broken, because that
>> qemu_madvise() call will just fail.
>>
>> Your fix would be correct. But I do wonder if we want to just let that
>> case fail and warn users that they are doing something that doesn't
>> make too much sense.
>>
>
> Yeah, what's suspicious is: if the size is smaller than page size we
> error out, but if it's larger (but still not aligned) we accept that.
> I'm failing to see reasoning there. Looks like the ROUND_UP() was added
> in v0.13.0-rc0~1201^2~4 (though it's done with some bit magic) and the
> check itself was added in v2.8.0-rc0~30^2~23. So it's a bit late, yes.
Yeah.
>
> OTOH - if users want to waste resources, should we stop them? For
It's all inconsistent, including memfd handling or what you noted above.
For example, Having a 1025 MiB guest on gigantic pages, consuming 2 GiB
really is just absolutely stupid.
Likely the user wants to know about such mistakes instead of making QEMU
silence all side effects of that. :)
> instance, when user requests more vCPUs than physical CPUs a warning is
> printed:
>
> $ ./build/qemu-system-x86_64 -accel kvm -smp cpus=128
> qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
> (128) exceeds the recommended cpus supported by KVM (8)
But that case is still reasonable for testing guest behavior with many
vCPUs, or migrating from a machine with more vCPUs.
Here, the guest will actually see all vCPUs. In comparison, the memory
waste here will never ever be consumable by the VM.
>
> but that's about it. So maybe the error can be demoted to just a warning?
The question is what we want to do, for example, with the
qemu_madvise(QEMU_MADV_DONTDUMP). It will similarly simply fail.
I'm curious, are there real customers running into that?
We could fix it all that but always warn when something like that is
being done.
--
Cheers,
David / dhildenb
next prev parent reply other threads:[~2023-12-04 9:21 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-27 12:32 [PATCH] hostmem: Round up memory size for qemu_madvise() in host_memory_backend_memory_complete() Michal Privoznik
2023-11-27 13:37 ` David Hildenbrand
2023-11-27 13:55 ` David Hildenbrand
2023-12-01 9:07 ` Michal Prívozník
2023-12-04 9:21 ` David Hildenbrand [this message]
2023-12-04 14:12 ` Michal Prívozník
2023-12-04 18:38 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0b6dacd2-a7c0-469f-830a-9162dfae75bf@redhat.com \
--to=david@redhat.com \
--cc=imammedo@redhat.com \
--cc=mprivozn@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).