From: Michal Hocko <mhocko@suse.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: kvm@vger.kernel.org, linux-api@vger.kernel.org,
linux-arch@vger.kernel.org, linux-doc@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, qemu-devel@nongnu.org, aarcange@redhat.com,
ak@linux.intel.com, akpm@linux-foundation.org, arnd@arndb.de,
bfields@fieldses.org, bp@alien8.de, chao.p.peng@linux.intel.com,
corbet@lwn.net, dave.hansen@intel.com, david@redhat.com,
ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com,
hughd@google.com, jlayton@kernel.org, jmattson@google.com,
joro@8bytes.org, jun.nakajima@intel.com,
kirill.shutemov@linux.intel.com, linmiaohe@huawei.com,
luto@kernel.org, mail@maciej.szmigiero.name,
michael.roth@amd.com, mingo@redhat.com, naoya.horiguchi@nec.com,
pbonzini@redhat.com, qperret@google.com, rppt@kernel.org,
seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
tabba@google.com, tglx@linutronix.de, vannapurve@google.com,
vbabka@suse.cz, vkuznets@redhat.com, wanpengli@tencent.com,
wei.w.wang@intel.com, x86@kernel.org, yu.c.zhang@linux.intel.com,
muchun.song@linux.dev, feng.tang@intel.com, brgerst@gmail.com,
rdunlap@infradead.org, masahiroy@kernel.org,
mailhol.vincent@wanadoo.fr
Subject: Re: [RFC PATCH 0/6] Setting memory policy for restrictedmem file
Date: Fri, 14 Apr 2023 08:33:08 +0200
Message-ID: <ZDjzpKL9Omcox991@dhcp22.suse.cz>
In-Reply-To: <cover.1681430907.git.ackerleytng@google.com>
On Fri 14-04-23 00:11:49, Ackerley Tng wrote:
> Hello,
>
> This patchset builds upon the memfd_restricted() system call that was
> discussed in the 'KVM: mm: fd-based approach for supporting KVM' patch
> series [1].
>
> The tree can be found at:
> https://github.com/googleprodkernel/linux-cc/tree/restrictedmem-set-memory-policy
>
> In this patchset, a new syscall is introduced, which allows userspace
> to set the memory policy (e.g. NUMA bindings) for a restrictedmem
> file, to the granularity of offsets within the file.
>
> The offset/length tuple is termed a file_range which is passed to the
> kernel via a pointer to get around the limit of 6 arguments for a
> syscall.
>
> The following other approaches were also considered:
>
> 1. Pre-configuring a mount with a memory policy and providing that
> mount to memfd_restricted() as proposed at [2].
> + Pro: It allows choice of a specific backing mount with custom
> memory policy configurations
> + Con: Will need to create an entire new mount just to set memory
> policy for a restrictedmem file; files on the same mount cannot
> have different memory policies.
Could you expand on this some more please? How many restricted
files/mounts do we expect? My understanding was that this would be
essentially a backing store for guest memory so it would scale with the
number of guests.
> 2. Passing memory policy to the memfd_restricted() syscall at creation time.
> + Pro: Only need to make a single syscall to create a file with a
> given memory policy
> + Con: At creation time, the kernel doesn’t know the size of the
> restrictedmem file. Given that memory policy is stored in the
> inode based on ranges (start, end), it is awkward for the kernel
> to store the memory policy and then add hooks to set the memory
> policy when allocation is done.
>
> 3. A more generic fbind(): it seems like this new functionality is
> really only needed for restrictedmem files, hence a separate,
> specific syscall was proposed to avoid complexities with handling
> conflicting policies that may be specified via other syscalls like
> mbind()
I do not think it is a good idea to make the syscall restrictedmem
specific. History shows that users are much more creative when it comes
to use cases than we are. I do understand that the nature of restricted
memory is that it is not mappable, but memory policies without a mapping
are a reasonable concept in general. After all, this just tells where
the memory should be allocated from. Do we need to implement that for
any other fs? No, you can safely return EINVAL for anything but a
memfd_restricted fd for now, but you shouldn't limit use cases upfront.
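To make the suggestion concrete, here is a minimal userspace sketch (not
kernel code) of a generic entry point that accepts any fd but only
implements the policy for one kind of backing file. The names
fbind_sketch, struct fd_sketch and enum fd_kind are hypothetical and
exist only for illustration; only the offset/length file_range tuple
comes from the cover letter, and its field names are assumed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Offset/length tuple as described in the cover letter (field names assumed). */
struct file_range {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical stand-in for "what kind of file backs this fd". */
enum fd_kind {
	FD_KIND_RESTRICTEDMEM,
	FD_KIND_SHMEM,
	FD_KIND_OTHER,
};

/* Hypothetical stand-in for an open file. */
struct fd_sketch {
	enum fd_kind kind;
	int policy_set;		/* 1 once a policy has been recorded */
};

/*
 * Generic interface: nothing in the signature is restrictedmem specific.
 * Backings that do not implement it (yet) simply get -EINVAL.
 */
static int fbind_sketch(struct fd_sketch *f, const struct file_range *range)
{
	if (f == NULL || range == NULL)
		return -EINVAL;

	switch (f->kind) {
	case FD_KIND_RESTRICTEDMEM:
		f->policy_set = 1;	/* placeholder for storing the policy */
		return 0;
	default:
		return -EINVAL;		/* not supported for this fs for now */
	}
}
```

Supporting another filesystem later would only mean adding a case to the
switch; the interface seen by userspace would not change.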
>
> TODOs
How do you query the policy for a specific fd? Are there plans to add a
syscall for that as well, and you are just waiting for direction on the
set method first?
> + Return -EINVAL if file_range is not within the size of the file and
> tests for this
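The range check in this TODO item could look roughly like the following
userspace sketch. The helper name check_file_range and the field names
of struct file_range are assumptions for illustration, not taken from
the patches. Accepting zero-length ranges is also an assumption of this
sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Offset/length tuple as described in the cover letter (field names assumed). */
struct file_range {
	uint64_t offset;
	uint64_t len;
};

/*
 * Return 0 if [offset, offset + len) lies within a file of file_size
 * bytes, -EINVAL otherwise. The comparison is arranged so that
 * offset + len is never computed and therefore cannot overflow.
 */
static int check_file_range(const struct file_range *range, uint64_t file_size)
{
	if (range->offset > file_size)
		return -EINVAL;
	if (range->len > file_size - range->offset)
		return -EINVAL;
	return 0;
}
```

A selftest for this would want to cover at least a range ending exactly
at the file size, a range running past it, and an offset/length pair
whose sum wraps around.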
>
> Dependencies:
>
> + Chao’s work on UPM [3]
>
> [1] https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/
> [2] https://lore.kernel.org/lkml/cover.1681176340.git.ackerleytng@google.com/T/
> [3] https://github.com/chao-p/linux/commits/privmem-v11.5
>
> ---
>
> Ackerley Tng (6):
> mm: shmem: Refactor out shmem_shared_policy() function
> mm: mempolicy: Refactor out mpol_init_from_nodemask
> mm: mempolicy: Refactor out __mpol_set_shared_policy()
> mm: mempolicy: Add and expose mpol_create
> mm: restrictedmem: Add memfd_restricted_bind() syscall
> selftests: mm: Add selftest for memfd_restricted_bind()
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> include/linux/mempolicy.h | 4 +
> include/linux/shmem_fs.h | 7 +
> include/linux/syscalls.h | 5 +
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/mempolicy.h | 7 +-
> kernel/sys_ni.c | 1 +
> mm/mempolicy.c | 100 ++++++++++---
> mm/restrictedmem.c | 75 ++++++++++
> mm/shmem.c | 10 +-
> scripts/checksyscalls.sh | 1 +
> tools/testing/selftests/mm/.gitignore | 1 +
> tools/testing/selftests/mm/Makefile | 8 +
> .../selftests/mm/memfd_restricted_bind.c | 139 ++++++++++++++++++
> .../mm/restrictedmem_testmod/Makefile | 21 +++
> .../restrictedmem_testmod.c | 89 +++++++++++
> tools/testing/selftests/mm/run_vmtests.sh | 6 +
> 18 files changed, 454 insertions(+), 27 deletions(-)
> create mode 100644 tools/testing/selftests/mm/memfd_restricted_bind.c
> create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/Makefile
> create mode 100644 tools/testing/selftests/mm/restrictedmem_testmod/restrictedmem_testmod.c
>
> --
> 2.40.0.634.g4ca3ef3211-goog
--
Michal Hocko
SUSE Labs