From: Gregory Price <gregory.price@memverge.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Gregory Price <gourry.memverge@gmail.com>,
linux-mm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, linux-api@vger.kernel.org,
linux-cxl@vger.kernel.org, luto@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
arnd@arndb.de, akpm@linux-foundation.org, x86@kernel.org
Subject: Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall
Date: Tue, 19 Sep 2023 11:57:34 -0400 [thread overview]
Message-ID: <ZQnE7i7tYOPnrL3w@memverge.com> (raw)
In-Reply-To: <877conxbhw.ffs@tglx>
On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > Similar to the move_pages system call, instead of taking a pid and
> > list of virtual addresses, this system call takes a list of physical
> > addresses.
>
> Silly question. Where are these physical addresses coming from?
>
> In my naive understanding user space deals with virtual addresses for a
> reason.
>
> Exposing access to physical addresses is definitely helpful to write
> more powerful exploits, so what are the restriction applied to this?
>
There are a variety of sources from which userspace can acquire physical
addresses; most require CAP_SYS_ADMIN.
I should probably add this list explicitly to the RFC for clarification:
1) /proc/pid/pagemap: can be used to do page table translation.
This is only really useful for testing, and it requires
CAP_SYS_ADMIN.
2) X86: IBS (AMD) and PEBS (Intel) can be configured to return
   physical, virtual, or both physical and virtual addresses.
   This requires CAP_SYS_ADMIN.
3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones. Not largely
   useful in this context, but notably it exposes PFNs without
   requiring privilege.
4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle.
One tiering strategy can be to use idle information to move
less-used pages to "lower tiers" of memory.
   With page_idle, you can query whether a page is idle based on its
   PFN alone, without needing to know which process it belongs to.
   Obviously migrations must still respect cpuset restrictions, which
   is why iterating through each mapping VMA is required.
https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
hot/cold information about its memory. If a heatmap is provided,
for example, it would have to be a device address (0-based) or a
physical address (some base defined by the kernel and programmed
into device decoders such that it can convert them to 0-based).
This is presently being proposed but has not been agreed upon yet.
> > + if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> > + return -EINVAL;
> > +
> > + if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> > + return -EPERM;
>
> According to this logic here MPOL_MF_MOVE is unrestricted, right?
>
> But how is an unpriviledged process knowing which physical address the
> pages have? Confused....
>
Someone else pointed this out as well; I should have a CAP_SYS_ADMIN
check here that I neglected to add. I have a v2 of the RFC with a test
and fixes, and I am working on a sample man page.
This entire syscall should require CAP_SYS_ADMIN, though there may be an
argument for CAP_SYS_NICE. I personally am of the opinion that ADMIN is
the right choice, because you're right that operations on physical
addresses are a major security issue.
> > + /* All tasks mapping each page is checked in phys_page_migratable */
> > + nodes_setall(target_nodes);
>
> How is the comment related to nodes_setall() and why is nodes_setall()
> unconditional when target_nodes is only used in the @nodes != NULL case?
>
Yeah this comment is confusing. This is a bit of a wart that comes from
trying to re-use the existing do_pages_move to ensure correctness, and
I really should have added this information directly in the comments.
The relevant code is here:
static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
			 unsigned long nr_pages,
			 const void __user * __user *pages,
			 const int __user *nodes,
			 int __user *status, int flags)
{
	...
	err = -EACCES;
	if (!node_isset(node, task_nodes))
		goto out_flush;
	...
	if (mm)
		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
						  flags & MPOL_MF_MOVE_ALL);
	else
		err = add_phys_page_for_migration(p, current_node, &pagelist,
						  flags & MPOL_MF_MOVE_ALL);
do_pages_move was originally written to check that the task's cpuset
contains the node as a valid destination for the entire request. This
nodemask is acquired in find_mm_struct via a call to
cpuset_mems_allowed(task).
e.g. if the cpuset allows nodes 0-1 and you attempt move_pages(task, 2),
then the entire syscall will error out with -EACCES.
The first chunk above checks this condition.
The second chunk (add_virt/phys...): these calls check that individual
pages are actually migratable and, if so, remove them from the LRU and
place them onto the list of pages to migrate in bulk at a later time.
In the physical addressing path, there is no pid/task provided by the
user, because physical addresses are not directly associated with a
task. We must look up each VMA mapping the page, and each owner of
that VMA, and validate that the intersection of the allowed nodes
defined by that set contains the requested node.
This is implemented here:
static int add_phys_page_for_migration(const void __user *p, int node,
				       struct list_head *pagelist,
				       bool migrate_all)
{
	...
	pfn = ((unsigned long)p) >> PAGE_SHIFT;
	page = pfn_to_online_page(pfn);
	if (!page || PageTail(page))
		return -ENOENT;

	folio = phys_migrate_get_folio(page);
	if (folio) {
		rmap_walk(folio, &rwc);		/* <---- per-VMA walk */
		folio_put(folio);
	}
static bool phys_page_migratable(struct folio *folio,
				 struct vm_area_struct *vma,
				 unsigned long address,
				 void *arg)
{
	struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
	struct task_struct *owner = vma->vm_mm->owner;
	nodemask_t task_nodes = cpuset_mems_allowed(owner);

	/* ... the actual node check here ... */
	ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);
I will add some better comments that lay this out.
~Gregory