From: Gregory Price <gregory.price@memverge.com>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
linux-mm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, linux-api@vger.kernel.org,
linux-cxl@vger.kernel.org, luto@kernel.org, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
hpa@zytor.com, arnd@arndb.de, akpm@linux-foundation.org,
x86@kernel.org
Subject: Re: [RFC PATCH 3/3] mm/mempolicy: implement a partial-interleave mempolicy
Date: Mon, 2 Oct 2023 12:10:01 -0400
Message-ID: <ZRrrWTTjPR62qR/Q@memverge.com>
In-Reply-To: <20231002144035.00000b36@Huawei.com>
On Mon, Oct 02, 2023 at 02:40:35PM +0100, Jonathan Cameron wrote:
> On Thu, 14 Sep 2023 19:54:57 -0400
> Gregory Price <gourry.memverge@gmail.com> wrote:
>
> > The partial-interleave mempolicy implements interleave on an
>
> I'm not sure 'partial' really conveys what is going on here.
> Weighted, or uneven-interleave maybe?
>
> > local_node% : interval/((nr_nodes-1)+interval-1)
> > other_node% : (1-local_node%)/(nr_nodes-1)
>
> I'd like to see more discussion here of why you would do this...
>
TL;DR: "Partial" in the sense that it's a simplified version of
weighted interleave. I honestly struggled with the name, but I'm not
tied to it if there's something better.

I also considered "Preferred Interleave", where the local node is
preferred for some weight, and the remaining nodes are interleaved
evenly. Maybe that's a more intuitive name. For now I'll start calling
it "preferred interleave" instead.
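
To make that concrete, here's a minimal sketch of the allocation-time
node selection (not the patch's actual code: the pol->il_interval and
pol->il_counter fields are invented for illustration, and the unlocked
counter update hand-waves the races discussed below):

	static int preferred_interleave_node(struct mempolicy *pol)
	{
		unsigned int nr = nodes_weight(pol->nodes);
		/* one cycle = interval local allocations + 1 per other node */
		unsigned int cycle = pol->il_interval + (nr - 1);
		unsigned int off = pol->il_counter++ % cycle;
		int local = numa_node_id();
		int nid;

		/* local node not in the mask: fall back to first-node */
		if (!node_isset(local, pol->nodes))
			local = first_node(pol->nodes);

		if (off < pol->il_interval)
			return local;	/* the preferred node's share */

		/* round-robin the remainder over the non-local nodes */
		off -= pol->il_interval;
		for_each_node_mask(nid, pol->nodes) {
			if (nid == local)
				continue;
			if (off-- == 0)
				return nid;
		}
		return local;	/* not normally reached */
	}

With interval=3 over nodes 0-3 (task on node 0) this yields
0,0,0,1,2,3,0,0,0,1,2,3,... i.e. 50% local and ~16.7% on each other
node.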
More generally:
This was a first pass at weighted interleave without adding the full
weights[MAX_NUMNODES] field to the mempolicy structure.
I've since added full weighted interleave and that'll be in v2 of the
RFC (hopefully pushing up today after addressing your notes).
I'll keep these notes around for discussion in the RFC v2.
---
I can see advantages of both full-weighted and preferred-interleave.
Something to consider: task migration and cpuset/memcg.
With "full-weighted" interleave, consider the scenario where the user
initially runs on Node 0/Socket 0, and sets the following weights
[0:10,1:3,2:5,3:1]
Where the nodes are as follows:
0 - socket 0 DRAM
1 - socket 1 DRAM
2 - socket 0 CXL
3 - socket 1 CXL
If that task gets migrated to socket 1... that's not going to be a good
weighting plan. This is the same reason a single set of weighted tiers
that abstract nodes is not a good idea - Nodes 2 and 3 in this scenario
have "similar attributes", but only relative to their local sockets
(the 0-2 pair and the 1-3 pair).
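
Concretely (the "ideal" post-migration weights here are hypothetical,
just to illustrate the mismatch):

	weights set while on socket 0:    [0:10, 1:3, 2:5, 3:1]
	still applied after migration:    [0:10, 1:3, 2:5, 3:1]
	what socket 1 would likely want:  [0:3, 1:10, 2:1, 3:5]

i.e. the heavy DRAM weight is now pointed at the remote socket, and the
heavy CXL weight at a device that is no longer local.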
Worse - if Nodes 2 and 3 *don't* have similar attributes, then an
"auto-rebalance" mechanism would have to make a lot of assumptions, and
it would have to run every time a migration between sockets is
detected.
Even worse - I attempted to expose the weights per-task via procfs, and
realized the entire mempolicy subsystem is very unfriendly to outside
tasks twiddling bits (i.e. mempolicy is very 'current'-centric).
There are *tons* of race conditions that have to be handled, and it's
really rather nasty in my opinion.
consider this code (from offset_il_node() in mempolicy.c):

	static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
	{
		... snip ...
		nodemask = pol->nodes;

		/*
		 * The barrier will stabilize the nodemask in a register or on
		 * the stack so that it will stop changing under the code.
		 *
		 * Between first_node() and next_node(), pol->nodes could be changed
		 * by other threads. So we put pol->nodes in a local stack.
		 */
		barrier();
Big oof - you wouldn't be able to depend on this for weights, so you'd
need an algorithm that tolerates some slop while the weights are being
replaced (see the sketch below).
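
For illustration, extending the same local-snapshot trick to weights
might look something like this (purely a sketch: pol->weights[] is an
assumed field from my v2 work, and a real version wouldn't put
MAX_NUMNODES bytes on the stack):

	static unsigned int weighted_il_node(struct mempolicy *pol, unsigned long n)
	{
		nodemask_t nodemask = pol->nodes;
		unsigned char weights[MAX_NUMNODES];	/* sketch only */
		unsigned int total = 0;
		int nid;

		barrier();	/* stabilize the local nodemask, as above */

		/*
		 * Snapshot each weight once.  A concurrent update can still
		 * hand us a mix of old and new weights - that's the "slop"
		 * the algorithm has to tolerate.
		 */
		for_each_node_mask(nid, nodemask) {
			weights[nid] = READ_ONCE(pol->weights[nid]);
			total += weights[nid];
		}
		if (!total)
			return first_node(nodemask);

		n %= total;
		for_each_node_mask(nid, nodemask) {
			if (n < weights[nid])
				return nid;
			n -= weights[nid];
		}
		return first_node(nodemask);	/* not reached */
	}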
So unless we rewrite mempolicy.c to be more robust in this sense, I
would argue a fully-weighted scenario is most useful if you are very
confident that your task is not going to be migrated. Otherwise there
will be very high costs associated with recalculating weights.
With preferred-interleave, if a task migrates, the rebalance happens
automatically based on the nodemask: the new local node becomes the
heavily weighted node, and the rest interleave evenly.

(If the local node is for some reason not in the nodemask, use
first-node, but this could possibly be changed to use a manually
defined node.)
So basically, if you expect your task to be migratable, something like
"preferred interleave" gets you post-migration behavior more closely
aligned with what you originally intended. Similarly, if your
interleave ratios are simple, this strategy is the simplest way to get
to the desired outcome.

Is it the *best* strategy? TBD. The behavior is more predictable,
though.
I will have a weighted interleave patch added to my next RFC. I need to
test it first.
Thanks
Gregory