All of lore.kernel.org
 help / color / mirror / Atom feed
From: Gregory Price <gregory.price@memverge.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de,
	luto@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	tj@kernel.org, ying.huang@intel.com
Subject: Re: [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current
Date: Tue, 28 Nov 2023 13:08:54 -0500	[thread overview]
Message-ID: <ZWYsth2CtC4Ilvoz@memverge.com> (raw)
In-Reply-To: <ZWX0-hEjqkmnR1Nq@tiehlicka>

On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote:
> On Wed 22-11-23 16:11:55, Gregory Price wrote:
> [...]
> > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock
> > + * while calling this.
> > + */
> > +static struct mempolicy *get_task_vma_policy(struct task_struct *task,
> > +					     struct vm_area_struct *vma,
> > +					     unsigned long addr, int order,
> > +					     pgoff_t *ilx)
> [...]
> 
> You should add lockdep annotation for alloc_lock/task_lock here for clarity and 
> also...  
> > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> >  struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  				 unsigned long addr, int order, pgoff_t *ilx)
> >  {
> > -	struct mempolicy *pol;
> > -
> > -	pol = __get_vma_policy(vma, addr, ilx);
> > -	if (!pol)
> > -		pol = get_task_policy(current);
> > -	if (pol->mode == MPOL_INTERLEAVE) {
> > -		*ilx += vma->vm_pgoff >> order;
> > -		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
> > -	}
> > -	return pol;
> > +	return get_task_vma_policy(current, vma, addr, order, ilx);
> 
> I do not think that all get_vma_policy take task_lock (just random check
> dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS)
> 
> Also I do not see policy_nodemask to be handled anywhere. That one is
> used along with get_vma_policy (sometimes hidden like in
> alloc_pages_mpol). It has a dependency on
> cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a
> remote task would be constrained by current task cpuset when allocating
> migration targets for the target task. I am wondering how many other
> dependencies like that are lurking there.

So after further investigation, I'm going to have to back out the
changes that make home_node and mbind modifiable by an external task
and revisit it at a later time.

Right now, there's a very nasty rats nest of entanglement between
mempolicy and vma/shmem that hides a bunch of accesses to current.

It only becomes apparently when you start chasing all the callers of
mpol_dup, which had another silent access to current->cpusets.

mpol_dup calls the following:
	current_cpuset_is_being_rebound
	cpuset_mems_allowed(current)

So we would need to do the following
1) create mpol_dup_task and make current explicit, not implicit
2) chase down all callers to mpol_dup and make sure it isn't generated
   from any of the task interfaces
3) if it is generated from the task interfaces, plumb a reference to
   current down through... somehow... if possible...

Here's a ~1 hour chase that lead me to the conclusion that this will
take considerably more work, and is not to be taken lightly:

do_mbind
	mbind_range
		vma_modify_policy
			split_vma
				__split_vma
					vma_dup_policy
						mpol_dup
		vma_replace_policy
			mpol_dup
			vma->vm_ops->set_policy - see below

__set_mempolicy_home_node
	mbind_range
		... same as above ...

digging into vma->vm_ops->set_policy we end up in mm/shmem.c

shmem_set_policy
	mpol_set_shared_policy
		sp_alloc
			mpol_dup
				current_cpuset_is_being_rebound()
				cpuset_mems_allowed(current)

Who knows what else is burried in the vma stack, but making vma
mempolicies externally modifiable looks to be a much more monumental
task than just simply making the task policy modifiable.

For now i'm going to submit a V2 with home_node and mbind removed from
the proposal.  Those will take far more investigation.

This also means that process_set_mempolicy should not be extended to
allow for vma policy replacements.

~Gregory

  parent reply	other threads:[~2023-11-28 18:09 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-22 21:11 [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs Gregory Price
2023-11-22 21:11 ` [RFC PATCH 01/11] mm/mempolicy: refactor do_set_mempolicy for code re-use Gregory Price
2023-11-22 21:11 ` [RFC PATCH 02/11] mm/mempolicy: swap cond reference counting logic in do_get_mempolicy Gregory Price
2023-11-28 14:07   ` Michal Hocko
     [not found]     ` <ZWX0ytAwmOdooHdZ@memverge.com>
2023-11-28 14:28       ` Michal Hocko
2023-11-22 21:11 ` [RFC PATCH 03/11] mm/mempolicy: refactor set_mempolicy stack to take a task argument Gregory Price
2023-11-22 21:11 ` [RFC PATCH 04/11] mm/mempolicy: modify get_mempolicy call " Gregory Price
2023-11-28 14:07   ` Michal Hocko
     [not found]     ` <ZWX1U1gCTXC+lFXn@memverge.com>
2023-11-28 14:49       ` Michal Hocko
2023-11-22 21:11 ` [RFC PATCH 05/11] mm/mempolicy: modify set_mempolicy_home_node " Gregory Price
2023-11-28 14:07   ` Michal Hocko
2023-11-28 14:14     ` Gregory Price
2023-11-22 21:11 ` [RFC PATCH 06/11] mm/mempolicy: modify do_mbind to operate on task argument instead of current Gregory Price
2023-11-28 14:11   ` Michal Hocko
2023-11-28 14:51     ` Gregory Price
2023-11-28 18:08     ` Gregory Price [this message]
2023-11-22 21:11 ` [RFC PATCH 07/11] mm/mempolicy: add task mempolicy syscall variants Gregory Price
2023-11-24  7:56   ` kernel test robot
2023-11-24  7:56   ` kernel test robot
2023-11-24  7:57   ` kernel test robot
2023-11-22 21:11 ` [RFC PATCH 08/11] mm/mempolicy: export replace_mempolicy for use by procfs Gregory Price
2023-11-22 21:11 ` [RFC PATCH 09/11] mm/mempolicy: build mpol_parse_str unconditionally Gregory Price
2023-11-22 21:11 ` [RFC PATCH 10/11] mm/mempolicy: mpol_parse_str should ignore trailing characters in nodelist Gregory Price
2023-11-22 21:12 ` [RFC PATCH 11/11] fs/proc: Add mempolicy attribute to allow read/write of task mempolicy Gregory Price
2023-11-22 21:33 ` [RFC PATCH 00/11] mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs Andrew Morton
2023-11-22 21:35   ` Andrew Morton
2023-11-22 22:24   ` Gregory Price
2023-11-27 15:29     ` Michal Hocko
2023-11-27 16:14       ` Gregory Price
2023-11-28  9:45         ` Michal Hocko
2023-11-28 13:15           ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZWYsth2CtC4Ilvoz@memverge.com \
    --to=gregory.price@memverge.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=gourry.memverge@gmail.com \
    --cc=hpa@zytor.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=x86@kernel.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.