From: "Arnd Bergmann" <arnd@arndb.de>
To: "Gregory Price" <gregory.price@memverge.com>
Cc: "Gregory Price" <gourry.memverge@gmail.com>,
linux-mm@vger.kernel.org, linux-kernel@vger.kernel.org,
Linux-Arch <linux-arch@vger.kernel.org>,
linux-api@vger.kernel.org, linux-cxl@vger.kernel.org,
"Andy Lutomirski" <luto@kernel.org>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Ingo Molnar" <mingo@redhat.com>,
"Borislav Petkov" <bp@alien8.de>,
"Dave Hansen" <dave.hansen@linux.intel.com>,
"H. Peter Anvin" <hpa@zytor.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
x86@kernel.org
Subject: Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall
Date: Sat, 09 Sep 2023 17:18:13 +0200 [thread overview]
Message-ID: <2fe03345-01a2-4cfe-9648-ae088493d1af@app.fastmail.com> (raw)
In-Reply-To: <ZPrRcJCjRBvJ9c3N@memverge.com>
On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
>> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> > index 22bc6bc147f8..6860675a942f 100644
>> > --- a/include/linux/syscalls.h
>> > +++ b/include/linux/syscalls.h
>> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned
>> > long nr_pages,
>> > const int __user *nodes,
>> > int __user *status,
>> > int flags);
>> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
>> > + const void __user * __user *pages,
>> > + const int __user *nodes,
>> > + int __user *status,
>> > + int flags);
>>
>> The prototype looks good from a portability point of view,
>> i.e. I made sure this is portable across CONFIG_COMPAT
>> architectures etc.
>>
>> What I'm not sure about is whether the representation of physical
>> memory pages as a 'void *' is a good choice, I can see potential
>> problems with this:
>>
>> - it's not really a pointer, but instead a shifted PFN number
>> in the implementation
>>
>> - physical addresses may be wider than pointers on 32-bit
>> architectures with CONFIG_PHYS_ADDR_T_64BIT
>>
>
> Hm, good points.
>
> I tried to keep the code close to move_pages for the sake of
> familiarity and ease of review, but the physical address length
> is not something i'd considered, and I'm not sure how exactly
> we would handle CONFIG_PHYS_ADDR_T_64BIT. If you have an idea,
> I'm happy to run with it.
I think a pointer to '__u64' is the most appropriate here,
that is compatible between 32-bit and 64-bit architectures
and covers all addresses until we get architectures with
128-bit addressing.
Thinking about it more, I noticed an existing bug in
both sys_move_pages and your current version of
sys_move_phys_pages: the 'pages' array is in fact not
suitable for compat tasks and needs an is_compat_task
check to load a 32-bit address from compat userspace on
the "if (get_user(p, pages + i))" line.
> on address vs pfn:
>
> Would using PFNs cause issues with portability of userland code? User
> code that gets access to a physical address would have to convert
> that to a PFN, which would be kernel-defined. That could be easy
> enough if the kernel exposed the shift size somewhere.
I don't think we currently use PFN anywhere in the syscall
ABI, so this introduces a new basic type into uapi headers that
is currently kernel internal.
A 32-bit PFN is always sufficient to represent all physical
addresses on native 32-bit kernel, but not necessarily for
compat user space on 64-bit kernels with an address space wider
than 44 bits (16 terabyte address range).
For the interface it would also require the same quirk
for compat tasks that I pointed out above that is missing
in sys_move_pages today.
A minor quirk of PFN values is that they require knowing
the page size to convert addresses, but I suppose you need
that anyway.
>> I'm also not sure where the user space caller gets the
>> physical addresses from, are those not intentionally kept
>> hidden from userspace?
>>
>
> There are presently 4 places (that I know of), and 1 that is being
> proposed here in the near future
>
> 1) Generally: /proc/pid/pagemap can be used to do page table
> translations. I think this is only really useful for testing,
> since if you have the virtual address and pid you would use
> move_pages, but it's certainly useful for testing this.
>
> 2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical
> address information instead of virtual address information. In fact
> you can configure it to give you both the virtual and physical
> address for a process.
Ah right. I see these already require CAP_SYS_ADMIN permissions
because of the risk of rowhammer or speculative execution attacks,
so I suppose users of move_phys_pages() would need additional
permissions compared to move_pages() to actually use that information.
> 3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones
>
> 4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle.
> While itself seemingly not useful, if the goal is to migrate
> less-used pages to "lower tiers" of memory, then you can query
> the bitmap, directly recover idle PFNs, and attempt to migrate
> them (which may fail, for a variety of reasons).
>
> https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
>
>
> 5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
> hot/cold information about its memory. If a heatmap is provided,
> for example, it would have to be a device address (0-based) or a
> physical address (some base defined by the kernel and programmed
> into device decoders such that it can convert them to 0-based).
>
> This is presently being proposed but has not been agreed upon yet.
These do not seem to be problematic from the ASLR perspective, so
I guess it may still be useful without CAP_SYS_ADMIN.
Arnd
next prev parent reply other threads:[~2023-09-09 16:25 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-07 7:54 [RFC 0/3] sys_move_phy_pages system call Gregory Price
2023-09-07 7:54 ` [RFC PATCH 1/3] mm/migrate: remove unused mm argument from do_move_pages_to_node Gregory Price
2023-09-07 7:54 ` [RFC PATCH 2/3] mm/migrate: refactor add_page_for_migration for code re-use Gregory Price
2023-09-07 7:54 ` [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall Gregory Price
2023-09-09 8:03 ` Arnd Bergmann
2023-09-08 7:46 ` Gregory Price
2023-09-09 15:18 ` Arnd Bergmann [this message]
2023-09-10 12:52 ` Gregory Price
2023-09-11 17:26 ` Arnd Bergmann
2023-09-10 15:01 ` Gregory Price
2023-09-10 20:36 ` Jonathan Corbet
2023-09-10 11:49 ` Gregory Price
2023-09-19 3:34 ` Andy Lutomirski
2023-09-19 16:31 ` Gregory Price
2023-09-19 17:59 ` Andy Lutomirski
2023-09-19 18:20 ` Gregory Price
2023-09-19 18:52 ` Andy Lutomirski
2023-09-19 19:59 ` Gregory Price
2023-09-19 0:17 ` Thomas Gleixner
2023-09-19 15:57 ` Gregory Price
2023-09-19 16:37 ` Gregory Price
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2fe03345-01a2-4cfe-9648-ae088493d1af@app.fastmail.com \
--to=arnd@arndb.de \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=gourry.memverge@gmail.com \
--cc=gregory.price@memverge.com \
--cc=hpa@zytor.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@vger.kernel.org \
--cc=luto@kernel.org \
--cc=mingo@redhat.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).