Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, corbet@lwn.net,
	akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
	hyeongtak.ji@sk.com, mhocko@kernel.org, vtavarespetr@micron.com,
	jgroves@micron.com, ravis.opensrc@micron.com,
	sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com, hannes@cmpxchg.org,
	dan.j.williams@intel.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Wed, 24 Jan 2024 13:01:37 -0500	[thread overview]
Message-ID: <ZbFQgSFfqDF+UvSX@memverge.com> (raw)
In-Reply-To: <87jznzts6f.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Jan 24, 2024 at 09:51:20AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
>  		current->il_prev = MAX_NUMNODES-1;
>  	task_unlock(current);
>  	mpol_put(old);
> 
> I don't think we need to change this.
>

Ah you're right it's set to MAX_NUMNODES-1 here, but NUMA_NO_NODE can be
passed in as an argument to alloc_pages_bulk_array_mempolicy, like here:

vm_area_alloc_pages()
	if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
		nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
			nr_pages_request,
			pages + nr_allocated);

> > (cur_weight = 0) can happen in two scenarios:
> >   - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0)
> >   - weighted_interleave_nodes decrements it down to 0
> >
> > Now that i'm looking at it - the second condition should not exist, and
> > we can eliminate it. The logic in weighted_interleave_nodes is actually
> > annoyingly unclear at the moment, so I'm going to re-factor it a bit to
> > be more explicit.
> 
> I am OK with either way.  Just a reminder, the first condition may be
> true in alloc_pages_bulk_array_weighted_interleave() and perhaps some
> other places.
> 

Yeah, the bulk allocator handles it correctly, it's just a matter of
clarity for weighted_interleave_nodes.

What isn't necessarily handled correctly is the rebind code. Rebind due
to a cgroup/mems_allowed change can cause a stale weight to be carried.

Basically cur_weight is not cleared, but the node it applied to may no
longer be the next node when next_node_in() is called.

The race condition is 1) exceedingly rare, and 2) not necessarily harmful,
just inaccurate. The worst case scenario is that a node receives up to 255
additional allocations once after a rebind (but more likely 10-20).

I was considering forcing the interleave forward like this:

@@ -356,6 +361,10 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
                tmp = *nodes;

        pol->nodes = tmp;
+
+       /* Weighted interleave policies are forced forward to the next node */
+       if (pol->mode & MPOL_WEIGHTED_INTERLEAVE)
+               pol->wil.cur_weight = 0;
 }

But this creates 2 race conditions when we read cur_weight and nodemask
in the allocator path.

Example 1:
1) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight)
2) rebind changes nodemask and { cur_weight = 0; }
3) bulk allocator sets pol->wil.cur_weight

In this scenario, resume_weight is stale coming out of bulk allocations
if the resume_node has been removed from the node mask.

Example 2:
1) rebind changes nodemask
2) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight)
3) rebind sets { cur_weight = 0; }

In this scenario, cur_weight is stale going into bulk allocations.

Neither of these can force a violation of mems_allowed, just a
mis-application of a weight.

I'll need to think on this a bit.  We can either leave this as-is,
meaning the first allocation after a rebind may apply the wrong weight
to a node, or we can try to track the current-interleave-node and
validate next_node_in(mask) == current-interleave-node before leaving
the allocator path (this may also be just as racey).

turns out concurrent counting is still hard :]

~Gregory

next prev parent reply	other threads:[~2024-01-24 18:01 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-19 17:57 [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price
2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2024-01-22  8:03   ` Huang, Ying
2024-01-22 16:58     ` Gregory Price
2024-01-19 17:57 ` [PATCH v2 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2024-01-23  3:02   ` Huang, Ying
2024-01-23  4:54     ` Gregory Price
2024-01-23  5:16       ` Gregory Price
2024-01-23  8:35         ` Huang, Ying
2024-01-23 21:27           ` Gregory Price
2024-01-24  1:51             ` Huang, Ying
2024-01-24 18:01               ` Gregory Price [this message]
2024-01-23  8:13       ` Huang, Ying
2024-01-23  8:40   ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZbFQgSFfqDF+UvSX@memverge.com \
    --to=gregory.price@memverge.com \
    --cc=Hasan.Maruf@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=emirakhur@micron.com \
    --cc=gourry.memverge@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=honggyu.kim@sk.com \
    --cc=hyeongtak.ji@sk.com \
    --cc=jgroves@micron.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=ravis.opensrc@micron.com \
    --cc=seungjun.ha@samsung.com \
    --cc=sthanneeru.opensrc@micron.com \
    --cc=sthanneeru@micron.com \
    --cc=vtavarespetr@micron.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.