From: David Hildenbrand <david@redhat.com>
To: "Michal Prívozník" <mprivozn@redhat.com>, qemu-devel@nongnu.org
Cc: imammedo@redhat.com
Subject: Re: [PATCH] hostmem: Round up memory size for qemu_madvise() in host_memory_backend_memory_complete()
Date: Mon, 4 Dec 2023 10:21:23 +0100	[thread overview]
Message-ID: <0b6dacd2-a7c0-469f-830a-9162dfae75bf@redhat.com> (raw)
In-Reply-To: <de569c7a-2e21-4a98-a4a9-98132b43c621@redhat.com>

On 01.12.23 10:07, Michal Prívozník wrote:
> On 11/27/23 14:55, David Hildenbrand wrote:
>> On 27.11.23 14:37, David Hildenbrand wrote:
>>> On 27.11.23 13:32, Michal Privoznik wrote:
>>>> Simple reproducer:
>>>> qemu.git $ ./build/qemu-system-x86_64 \
>>>> -m size=8389632k,slots=16,maxmem=25600000k \
>>>> -object
>>>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>>>> -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>>>
>>>> With current master I get:
>>>>
>>>> qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid
>>>> argument
>>>>
>>>> The problem is that the memory size (8193 MiB) is not an integer
>>>> multiple of the underlying pagesize (2 MiB), which triggers a check
>>>> inside madvise(), since we can't really set a madvise() policy
>>>> on just a fraction of a page.
>>>
>>> I thought we would just always fail to create something that doesn't
>>> really make any sense.
>>>
>>> Why would we want to support that case?
>>>
>>> Let me dig; I thought we had a check there at some point
>>> that would make this fail (especially: a RAM block not aligned to the
>>> pagesize).
>>
>>
>> At least memory-backend-memfd properly fails for that case:
>>
>> $ ./build/qemu-system-x86_64 -object
>> memory-backend-memfd,hugetlb=on,size=3m,id=tmp
>> qemu-system-x86_64: failed to resize memfd to 3145728: Invalid argument
>>
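The kernel enforces this on the hugetlbfs side: a hugetlb memfd cannot
be ftruncate()d to a length that is not a multiple of the huge page
size. A minimal standalone demonstration (a sketch, assuming a Linux
host whose default huge page size is 2 MiB):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous hugetlb-backed memfd, using the default huge page size. */
        int fd = memfd_create("tmp", MFD_HUGETLB);
        if (fd < 0) {
            perror("memfd_create");
            return 1;
        }
        /* 3 MiB is not a multiple of 2 MiB, so hugetlbfs rejects the
         * resize with EINVAL, i.e. the "Invalid argument" seen above. */
        if (ftruncate(fd, 3 * 1024 * 1024) < 0) {
            perror("ftruncate");
        }
        close(fd);
        return 0;
    }
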
>> memory-backend-file ends up creating a new file:
>>
>>   $ ./build/qemu-system-x86_64 -object
>> memory-backend-file,share=on,mem-path=/dev/hugepages/tmp,size=3m,id=tmp
>>
>> $ stat /dev/hugepages/tmp
>>    File: /dev/hugepages/tmp
>>    Size: 4194304         Blocks: 0          IO Block: 2097152 regular file
>>
>> ... and ends up sizing it properly aligned to the huge page size.
>>
>>
>> Seems to be due to:
>>
>>      if (memory < block->page_size) {
>>          error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>>                     "or larger than page size 0x%zx",
>>                     memory, block->page_size);
>>          return NULL;
>>      }
>>
>>      memory = ROUND_UP(memory, block->page_size);
>>
>>      /*
>>       * ftruncate is not supported by hugetlbfs in older
>>       * hosts, so don't bother bailing out on errors.
>>       * If anything goes wrong with it under other filesystems,
>>       * mmap will fail.
>>       *
>>       * Do not truncate the non-empty backend file to avoid corrupting
>>       * the existing data in the file. Disabling shrinking is not
>>       * enough. For example, the current vNVDIMM implementation stores
>>       * the guest NVDIMM labels at the end of the backend file. If the
>>       * backend file is later extended, QEMU will not be able to find
>>       * those labels. Therefore, extending the non-empty backend file
>>       * is disabled as well.
>>       */
>>      if (truncate && ftruncate(fd, offset + memory)) {
>>          perror("ftruncate");
>>      }
>>
>> So we create a bigger file, map that bigger file, and end up with a
>> bigger RAMBlock, so we also consume more memory.
>>
>> ... but the memory region is smaller and we tell the VM that it has
>> less memory. A lot of work with no obvious benefit, and only some
>> memory waste :)
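Concretely, with the reproducer from the patch: 8590983168 bytes is
8193 MiB, which ROUND_UP() takes to 8194 MiB on 2 MiB pages. That
extra 1 MiB gets allocated and mapped, but the guest never sees it.
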
>>
>>
>> We should have just rejected such memory backends right from the
>> start. But now it's likely too late.
>>
>> I suspect other things like
>>   * qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
>>   * qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
>>
>> fail as well, but for hugetlb we don't care about merging at least,
>> and we don't even log an error.
>>
>> But QEMU_MADV_DONTDUMP might also be broken, because that
>> qemu_madvise() call will just fail.
>>
>> Your fix would be correct. But I do wonder if we want to just let that
>> case fail and warn users that they are doing something that doesn't
>> make too much sense.
>>
> 
> Yeah, what's suspicious is: if the size is smaller than the page size we
> error out, but if it's larger (though still not aligned) we accept it.
> I fail to see the reasoning there. Looks like the ROUND_UP() was added
> in v0.13.0-rc0~1201^2~4 (though it's done with some bit magic) and the
> check itself was added in v2.8.0-rc0~30^2~23. So it's a bit late, yes.

Yeah.

> 
> OTOH - if users want to waste resources, should we stop them? For

It's all inconsistent, including the memfd handling and what you noted above.

For example, having a 1025 MiB guest on gigantic pages, thereby
consuming 2 GiB, really is just absolutely stupid.
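(1025 MiB on 1 GiB pages needs two gigantic pages, so 2048 MiB get
reserved; the 1023 MiB beyond what the guest sees are pure waste.)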

Likely the user wants to know about such mistakes instead of having QEMU
silence all side effects of them. :)


> instance, when user requests more vCPUs than physical CPUs a warning is
> printed:
> 
> $ ./build/qemu-system-x86_64 -accel kvm -smp cpus=128
> qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
> (128) exceeds the recommended cpus supported by KVM (8)

But that case is still reasonable for testing guest behavior with many 
vCPUs, or migrating from a machine with more vCPUs.

There, the guest will actually see all the vCPUs. In contrast, the
memory wasted here will never be consumable by the VM.

> 
> but that's about it. So maybe the error can be demoted to just a warning?

The question is what we want to do, for example, with the
qemu_madvise(QEMU_MADV_DONTDUMP) call; it will simply fail in the same way.
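
For reference, the fix as posted boils down to rounding the length up
before those calls in host_memory_backend_memory_complete(). Roughly (a
sketch assuming the usual names from backends/hostmem.c, not
necessarily the exact patch):

    void *ptr = memory_region_get_ram_ptr(&backend->mr);
    uint64_t sz = memory_region_size(&backend->mr);

    /* Round up to the backing page size so the kernel does not
     * reject the unaligned tail with EINVAL. */
    sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));

    if (backend->merge) {
        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
    }
    if (!backend->dump) {
        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
    }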

I'm curious, are there real customers running into that?


We could fix all of that, but always warn when something like this is
being done.

-- 
Cheers,

David / dhildenb


