Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gregory Price <gregory.price@memverge.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Gregory Price <gourry.memverge@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, corbet@lwn.net,
	akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
	hyeongtak.ji@sk.com, mhocko@kernel.org, vtavarespetr@micron.com,
	jgroves@micron.com, ravis.opensrc@micron.com,
	sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com, hannes@cmpxchg.org,
	dan.j.williams@intel.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Wed, 31 Jan 2024 12:29:37 -0500	[thread overview]
Message-ID: <ZbqDgaOsAeXnqRP2@memverge.com> (raw)
In-Reply-To: <87y1c5g8qw.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> >
> > I think this is handled already? It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
> >  *
> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
> >  * policies are protected by task->mems_allowed_seq to prevent a premature
> >  * OOM/allocation failure due to parallel nodemask modification.
> >  */
> 
> Thanks for pointing this out!
> 
> If we use task->mems_allowed_seq reader side in
> weighted_interleave_nodes() we can guarantee the consistency of
> policy->nodes.  That may be not deserved, because it's not a big deal to
> allocate 1 page in a wrong node.
> 
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
> be allocated there.
> 

To save the versioning if there are issues, here are the 3 diffs that
I have left. If you are good with these changes, I'll squash the first
2 into the third commit, keep the last one as a separate commit (it
changes the interleave_nodes() logic too), and submit v5 w/ your
reviewed tag on all of them.


Fix one (pedantic?) warning from syzbot:
----------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b1437396c357..dfd097009606 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        unsigned long nr_allocated = 0;
        unsigned long rounds;
        unsigned long node_pages, delta;
-       u8 __rcu *table, *weights, weight;
+       u8 __rcu *table, __rcu *weights, weight;
        unsigned int weight_total = 0;
        unsigned long rem_pages = nr_pages;
        nodemask_t nodes;



Simplifying resume_node/weight logic:
-------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                page_array += nr_allocated;
                total_allocated += nr_allocated;
                /* if that's all the pages, no need to interleave */
-               if (rem_pages < weight) {
-                       /* stay on current node, adjust il_weight */
+               if (rem_pages <= weight) {
                        me->il_weight -= rem_pages;
                        return total_allocated;
-               } else if (rem_pages == weight) {
-                       /* move to next node / weight */
-                       me->il_prev = next_node_in(node, nodes);
-                       me->il_weight = get_il_weight(me->il_prev);
-                       return total_allocated;
                }
                /* Otherwise we adjust remaining pages, continue from there */
                rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                        node_pages += weight;
                        delta -= weight;
                } else if (delta) {
+                       /* when delta is deleted, resume from that node */
                        node_pages += delta;
-                       /* delta may deplete on a boundary or w/ a remainder */
-                       if (delta == weight) {
-                               /* boundary: resume from next node/weight */
-                               resume_node = next_node_in(node, nodes);
-                               resume_weight = weights[resume_node];
-                       } else {
-                               /* remainder: resume this node w/ remainder */
-                               resume_node = node;
-                               resume_weight = weight - delta;
-                       }
+                       resume_node = node;
+                       resume_weight = weight - delta;
                        delta = 0;
                }
                /* node_pages can be 0 if an allocation fails and rounds == 0 */





task->mems_allowed_seq protection (added as 4th patch)
------------------------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b0ca9bcdd64c..b1437396c357 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
        unsigned int node = current->il_prev;
+       unsigned int cpuset_mems_cookie;

+retry:
+       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+       cpuset_mems_cookie = read_mems_allowed_begin();
        if (!current->il_weight || !node_isset(node, policy->nodes)) {
                node = next_node_in(node, policy->nodes);
-               /* can only happen if nodemask is being rebound */
+               if (read_mems_allowed_retry(cpuset_mems_cookie))
+                       goto retry;
                if (node == MAX_NUMNODES)
                        return node;
                current->il_prev = node;
@@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
        unsigned int nid;
+       unsigned int cpuset_mems_cookie;
+
+       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nid = next_node_in(current->il_prev, policy->nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));

-       nid = next_node_in(current->il_prev, policy->nodes);
        if (nid < MAX_NUMNODES)
                current->il_prev = nid;
+
        return nid;
 }

@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                struct page **page_array)
 {
        struct task_struct *me = current;
+       unsigned int cpuset_mems_cookie;
        unsigned long total_allocated = 0;
        unsigned long nr_allocated = 0;
        unsigned long rounds;
@@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        int prev_node;
        int i;

+
        if (!nr_pages)
                return 0;

-       nnodes = read_once_policy_nodemask(pol, &nodes);
+       /* read the nodes onto the stack, retry if done during rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nnodes = read_once_policy_nodemask(pol, &nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+       /* if the nodemask has become invalid, we cannot do anything */
        if (!nnodes)
                return 0;

next prev parent reply	other threads:[~2024-01-31 17:29 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-30 18:20 [PATCH v4 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price
2024-01-30 18:20 ` [PATCH v4 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2024-01-30 18:20 ` [PATCH v4 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price
2024-01-30 18:20 ` [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2024-01-31  5:12   ` Gregory Price
2024-01-31  6:43   ` Huang, Ying
2024-01-31  7:43     ` Gregory Price
2024-01-31  9:19       ` Huang, Ying
2024-01-31 16:35         ` Gregory Price
2024-01-31 17:29         ` Gregory Price [this message]
2024-02-01  1:55           ` Huang, Ying
2024-02-01  2:01             ` Gregory Price
2024-02-01  2:18             ` Gregory Price
2024-02-01  3:02               ` Huang, Ying
2024-02-01  3:10                 ` Gregory Price

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:b1437396c35 dfblob:dfd09700960 dfblob:2c1aef8eab7
dfblob:b0ca9bcdd64 dfblob:b0ca9bcdd64 dfblob:b1437396c35 )
 OR (
bs:"Re: [PATCH v4 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZbqDgaOsAeXnqRP2@memverge.com \
    --to=gregory.price@memverge.com \
    --cc=Hasan.Maruf@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=emirakhur@micron.com \
    --cc=gourry.memverge@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=honggyu.kim@sk.com \
    --cc=hyeongtak.ji@sk.com \
    --cc=jgroves@micron.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=rakie.kim@sk.com \
    --cc=ravis.opensrc@micron.com \
    --cc=seungjun.ha@samsung.com \
    --cc=sthanneeru.opensrc@micron.com \
    --cc=sthanneeru@micron.com \
    --cc=vtavarespetr@micron.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.