All of lore.kernel.org
 help / color / mirror / Atom feed
From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, corbet@lwn.net,
	akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
	hyeongtak.ji@sk.com, mhocko@kernel.org, vtavarespetr@micron.com,
	jgroves@micron.com, ravis.opensrc@micron.com,
	sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com, hannes@cmpxchg.org,
	dan.j.williams@intel.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Wed, 31 Jan 2024 02:43:16 -0500	[thread overview]
Message-ID: <Zbn6FG3346jhrQga@memverge.com> (raw)
In-Reply-To: <877cjqgfzz.fsf@yhuang6-desk2.ccr.corp.intel.com>


On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> >  
> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> > +{
> > +	unsigned int node = current->il_prev;
> > +
> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > +		node = next_node_in(node, policy->nodes);
> > +		/* can only happen if nodemask is being rebound */
> > +		if (node == MAX_NUMNODES)
> > +			return node;
> 
> I feel a little unsafe to read policy->nodes at same time of writing in
> rebound.  Is it better to use a seqlock to guarantee its consistency?
> It's unnecessary to be a part of this series though.
> 

I think this is handled already? It is definitely an explicit race
condition that is documented elsewhere:

/*
 * mpol_rebind_policy - Migrate a policy to a different set of nodes
 *
 * Per-vma policies are protected by mmap_lock. Allocations using per-task
 * policies are protected by task->mems_allowed_seq to prevent a premature
 * OOM/allocation failure due to parallel nodemask modification.
 */

example from slub:

do {
	cpuset_mems_cookie = read_mems_allowed_begin();
	zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
	...
} while (read_mems_allowed_retry(cpuset_mems_cookie));

quick perusal through other allocators, show similar checks.

page_alloc.c  -  check_retry_cpusetset()
filemap.c     -  filemap_alloc_folio()

If we ever want mempolicy to be swappable from outside the current task
context, this will have to change most likely - but that's another
feature for another day.

> > +	while (target) {
> > +		/* detect system default usage */
> > +		weight = table ? table[nid] : 1;
> > +		weight = weight ? weight : 1;
> 
> I found duplicated pattern as above in this patch.  Can we define a
> function like below to remove the duplication?
> 
> u8 __get_il_weight(u8 *table, int nid)
> {
>         u8 weight;
> 
>         weight = table ? table[nid] : 1;
>         return weight ? : 1;
> }
> 

When we implement the system-default array, this will change to:

weight = sysfs_table ? sysfs_table[nid] : default_table[nid];

This cleanup will get picked up in that patch set since this code is
going to change anyway.

> > +			if (delta == weight) {
> > +				/* boundary: resume from next node/weight */
> > +				resume_node = next_node_in(node, nodes);
> > +				resume_weight = weights[resume_node];
> > +			} else {
> > +				/* remainder: resume this node w/ remainder */
> > +				resume_node = node;
> > +				resume_weight = weight - delta;
> > +			}
> 
> If we are comfortable to leave resume_weight == 0, then the above
> branch can be simplified to.
> 
>         resume_node = node;
>         resume_weight = weight - delta;
> 
> But, this is a style issue again.  I will leave it to you to decide.

Good point, and in fact there's a similar branch in the first half of
the function that can be simplified.  Will follow up with a style patch.

 mm/mempolicy.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

My favorite style of patch :D


Andrew if you happen to be monitoring, this is the patch (not tested
yet, but it's pretty obvious, otherwise i'll submit individually
tomorrow).


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                page_array += nr_allocated;
                total_allocated += nr_allocated;
                /* if that's all the pages, no need to interleave */
-               if (rem_pages < weight) {
-                       /* stay on current node, adjust il_weight */
+               if (rem_pages <= weight) {
                        me->il_weight -= rem_pages;
                        return total_allocated;
-               } else if (rem_pages == weight) {
-                       /* move to next node / weight */
-                       me->il_prev = next_node_in(node, nodes);
-                       me->il_weight = get_il_weight(me->il_prev);
-                       return total_allocated;
                }
                /* Otherwise we adjust remaining pages, continue from there */
                rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                        node_pages += weight;
                        delta -= weight;
                } else if (delta) {
+                       /* when delta is deleted, resume from that node */
                        node_pages += delta;
-                       /* delta may deplete on a boundary or w/ a remainder */
-                       if (delta == weight) {
-                               /* boundary: resume from next node/weight */
-                               resume_node = next_node_in(node, nodes);
-                               resume_weight = weights[resume_node];
-                       } else {
-                               /* remainder: resume this node w/ remainder */
-                               resume_node = node;
-                               resume_weight = weight - delta;
-                       }
+                       resume_node = node;
+                       resume_weight = weight - delta;
                        delta = 0;
                }
                /* node_pages can be 0 if an allocation fails and rounds == 0 */


> 
> So, except the issue you pointed out already.  All series looks good to
> me!  Thanks!  Feel free to add
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> 
> to the whole series.
> 

Thank you so much for your patience with me! I appreciate all the help.

I am looking forward to this feature very much!

~Gregory

  reply	other threads:[~2024-01-31  7:43 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-30 18:20 [PATCH v4 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price
2024-01-30 18:20 ` [PATCH v4 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2024-01-30 18:20 ` [PATCH v4 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
2024-01-30 18:20 ` [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2024-01-31  5:12   ` Gregory Price
2024-01-31  6:43   ` Huang, Ying
2024-01-31  7:43     ` Gregory Price [this message]
2024-01-31  9:19       ` Huang, Ying
2024-01-31 16:35         ` Gregory Price
2024-01-31 17:29         ` Gregory Price
2024-02-01  1:55           ` Huang, Ying
2024-02-01  2:01             ` Gregory Price
2024-02-01  2:18             ` Gregory Price
2024-02-01  3:02               ` Huang, Ying
2024-02-01  3:10                 ` Gregory Price

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zbn6FG3346jhrQga@memverge.com \
    --to=gregory.price@memverge.com \
    --cc=Hasan.Maruf@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=emirakhur@micron.com \
    --cc=gourry.memverge@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=honggyu.kim@sk.com \
    --cc=hyeongtak.ji@sk.com \
    --cc=jgroves@micron.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=ravis.opensrc@micron.com \
    --cc=seungjun.ha@samsung.com \
    --cc=sthanneeru.opensrc@micron.com \
    --cc=sthanneeru@micron.com \
    --cc=vtavarespetr@micron.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.