linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Shivank Garg <shivankg@amd.com>
Cc: x86@kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-api@vger.kernel.org,
	linux-arch@vger.kernel.org, kvm@vger.kernel.org,
	chao.gao@intel.com, pgonda@google.com, thomas.lendacky@amd.com,
	seanjc@google.com, luto@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	arnd@arndb.de, kees@kernel.org, bharata@amd.com, nikunj@amd.com,
	michael.day@amd.com, Neeraj.Upadhyay@amd.com,
	linux-coco@lists.linux.dev
Subject: Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
Date: Mon, 11 Nov 2024 23:14:19 +0100	[thread overview]
Message-ID: <6fbef654-36e2-4be5-906e-2a648a845278@redhat.com> (raw)
In-Reply-To: <ff3174f8-c8b5-4fae-a9d9-87546d37c162@suse.cz>

On 11.11.24 12:02, Vlastimil Babka wrote:
> On 11/8/24 18:31, Paolo Bonzini wrote:
>> On 11/7/24 16:10, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>>> The folio allocation path from guest_memfd typically looks like this...
>>>>
>>>> kvm_gmem_get_folio
>>>>     filemap_grab_folio
>>>>       __filemap_get_folio
>>>>         filemap_alloc_folio
>>>>           __folio_alloc_node_noprof
>>>>             -> goes to the buddy allocator
>>>>
>>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>>>
>>> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
>>> real problem that you're trying to solve that cpusets are being used
>>> incorrectly?
>>
>> If it's false it's not very different, it goes to alloc_pages_noprof().
>> Then it respects the process's policy, but the policy is not
>> customizable without mucking with state that is global to the process.
>>
>> Taking a step back: the problem is that a VM can be configured to have
>> multiple guest-side NUMA nodes, each of which will pick memory from the
>> right NUMA node in the host.  Without a per-file operation it's not
>> possible to do this on guest_memfd.  The discussion was whether to use
>> ioctl() or a new system call.  The discussion ended with the idea of
>> posting a *proposal* asking for *comments* as to whether the system call
>> would be useful in general beyond KVM.
>>
>> Commenting on the system call itself I am not sure I like the
>> file_operations entry, though I understand that it's the simplest way to
>> implement this in an RFC series.  It's a bit surprising that fbind() is
>> a total no-op for everything except KVM's guest_memfd.
>>
>> Maybe whatever you pass to fbind() could be stored in the struct file *,
>> and used as the default when creating VMAs; as if every mmap() was
>> followed by an mbind(), except that it also does the right thing with
>> MAP_POPULATE for example.  Or maybe that's a horrible idea?
> 
> mbind() manpage has this:
> 
>         The  specified  policy  will  be  ignored  for  any MAP_SHARED
> mappings in the specified memory range.  Rather the pages will be allocated
> according to the memory policy of the thread that caused the page to be
> allocated. Again, this may not be the thread that called mbind().

I recall discussing that a couple of times in the context of QEMU. I 
have some faint recollection that the manpage is a bit imprecise:

IIRC, hugetlb also ends up using the VMA policy for MAP_SHARED mappings 
during faults (huge_node()->get_vma_policy()) -- but in contrast to 
shmem, it doesn't end up becoming the "shared" policy for the file, used 
when accessed through other VMAs.

> 
> So that seems like we're not very keen on having one user of a file set a
> policy that would affect other users of the file?

For VMs in QEMU we really want to configure the policy once in the main 
process and have all other processes (e.g., vhost-user) not worry about 
that when they mmap() guest memory.

With shmem this works by "shared policy" design (below). For hugetlb, we 
rely on the fact that mbind()+MADV_POPULATE_WRITE allows us to 
preallocate NUMA-aware. So with hugetlb we really preallocate all guest 
RAM to guarantee the NUMA placement.

It would not be the worst idea to have a clean interface to configure 
file-range policies instead of having this weird shmem mbind() behavior 
and the hugetlb hack.

Having that said, other filesystem are rarely used for backing VMs, at 
least in combination with NUMA. So nobody really cared that much for now.

Maybe fbind() would primarily only be useful for in-memory filesystems 
(shmem/hugetlb/guest_memfd).

> 
> Now the next paragraph of the manpage says that shmem is different, and
> guest_memfd is more like shmem than a regular file.
> 
> My conclusion from that is that fbind() might be too broad and we don't want
> this for actual filesystem-backed files? And if it's limited to guest_memfd,
> it shouldn't be an fbind()?

I was just once again diving into how mbind() on shmem is handled. And 
in fact, mbind() will call vma->set_policy() to update the per 
file-range policy. I wonder why we didn't do the same for hugetlb ... 
but of course, hugetlb must be special in any possible way.

Not saying it's the best idea, but as we are talking about mmap support 
of guest_memfd (only allowing to fault in shared/faultable pages), one 
*could* look into implementing mbind()+vma->set_policy() for guest_memfd 
similar to how shmem handles it.

It would require a (temporary) dummy VMA in the worst case (all private 
memory).

It sounds a bit weird, though, to require a VMA to configure this, 
though. But at least it's similar to what shmem does ...

-- 
Cheers,

David / dhildenb


      reply	other threads:[~2024-11-11 22:14 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-05 16:45 [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Shivank Garg
2024-11-05 16:45 ` [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer Shivank Garg
2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
2024-11-05 16:55   ` [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs Shivank Garg
2024-11-05 16:55   ` [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available Shivank Garg
2024-11-05 18:55 ` [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Matthew Wilcox
2024-11-07  8:54   ` Shivank Garg
2024-11-07 15:10     ` Matthew Wilcox
2024-11-08  9:21       ` Shivank Garg
2024-11-08 17:31       ` Paolo Bonzini
2024-11-11 11:02         ` Vlastimil Babka
2024-11-11 22:14           ` David Hildenbrand [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6fbef654-36e2-4be5-906e-2a648a845278@redhat.com \
    --to=david@redhat.com \
    --cc=Neeraj.Upadhyay@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=bharata@amd.com \
    --cc=bp@alien8.de \
    --cc=brauner@kernel.org \
    --cc=chao.gao@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=jack@suse.cz \
    --cc=kees@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-coco@lists.linux.dev \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=michael.day@amd.com \
    --cc=mingo@redhat.com \
    --cc=nikunj@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=pgonda@google.com \
    --cc=seanjc@google.com \
    --cc=shivankg@amd.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).