Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: "Gregory Price" <gourry@gourry.net>,
	"Hyeonggon Yoo" <hyeonggon.yoo@sk.com>,
	kernel_team@skhynix.com, 42.hyeyoo@gmail.com,
	"rafael@kernel.org" <rafael@kernel.org>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"김홍규(KIM HONGGYU) System SW" <honggyu.kim@sk.com>,
	"김락기(KIM RAKIE) System SW" <rakie.kim@sk.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"Jonathan.Cameron@huawei.com" <Jonathan.Cameron@huawei.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"horen.chuang@linux.dev" <horen.chuang@linux.dev>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kernel-team@meta.com" <kernel-team@meta.com>
Subject: Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
Date: Tue, 21 Jan 2025 19:01:51 +0800	[thread overview]
Message-ID: <87msfkh1ls.fsf@DESKTOP-5N7EMDA> (raw)
In-Reply-To: <20250109191102.3772288-1-joshua.hahnjy@gmail.com> (Joshua Hahn's message of "Thu, 9 Jan 2025 11:10:34 -0800")

Hi, Joshua,

Sorry for late reply.

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Thu, 9 Jan 2025 09:18:18 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
>> On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price <gourry@gourry.net> wrote:
>> 
>> > On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
>> > > Hi, hope you all had a nice year-end holiday :)
>> > > 
>> > ... snip ...
>> > > Please let me know if there's any point we discussed that I am missing.
>> > > 
>> > > Additionally I would like to mention that within an internal discussion
>> > > my colleague Honggyu suggested introducing 'mode' parameter which can be
>> > > either 'manual' or 'auto' instead of 'use_defaults' to be provide more
>> > > intuitive interface.
>> > > 
>> > > With Honggyu's suggestion and the points we've discussed,
>> > > I think the interface could be:
>> > > 
>> > > # At booting, the mode is 'auto' where the kernel can automatically
>> > > # update any weights.
>> > > 
>> > > mode             auto         # User hasn't specified any weight yet.
>> > > effective        [2, 1, -, -] # Using system defaults for node 0-1,
>> > >                               # and node 2-3 not populated yet.
>> > > 
>> > > # When a new NUMA node is added (e.g. via hotplug) in the 'auto' mode,
>> > > # all weights are re-calculated based on ACPI HMAT table, including the
>> > > # weight of the new node.
>> > > 
>> > > mode             auto         # User hasn't specified weights yet.
>> > > effective        [2, 1, 1, -] # Using system defaults for node 0-2,
>> > >                               # and node 3 not populated yet.
>> > > 
>> > > # When user set at least one weight value, change the mode to 'manual'
>> > > # where the kernel does not update any weights automatically without
>> > > # user's consent.
>> > > 
>> > > mode             manual       # User changed the weight of node 0 to 4,
>> > >                               # changing the mode to manual config mode.
>> > > effective        [4, 1, 1, -]
>> > > 
>> > > 
>> > > # When a new NUMA node is added (e.g. via hotplug) in the manual mode,
>> > > # the new node's weight is zero because it's in manual mode and user
>> > > # did not specify the weight for the new node yet.
>> > > 
>> > > mode             manual
>> > > effective        [4, 1, 1, 0]
>> > > 
>> > 
>> > 0's cannot show up in the effective list - the allocators can never
>> > percieve a 0 as there are (race) conditions where that may cause a div0.
>> > 
>> > The actual content of the list may be 0, but the allocator will see '1'.
>> > 
>> > IIRC this was due to lock/sleep limitations in the allocator paths and
>> > accessing this RCU protected memory. If someone wants to take another
>> > look at the allocator paths and characterize the risk more explicitly,
>> > this would be helpful.
>> 
>> Hi Gregory and Hyeonggon,
>> 
>> Based on a quick look, I see that there can be a problematic scenario
>> in alloc_pages_bulk_array_weighted_interleave where we sum up all
>> the weights from iw_table and divide by this sum. This _can_ be problematic
>> for two reasons, one of them being the div0 mentioned.
>> 
>> Currently, you can access the weights in one of two ways:
>> The first way is to call get_il_weight, which will retrieve a specified
>> node's weight under an rcu read lock. Within this function, it first
>> checks if the value at iw_table[nid] is 0, and if it is, returns 1.
>> Although this prevents a div0 scenario by ensuring that all weights are
>> nonzero, there is a coherency problem, since each instance of get_il_weight
>> creates a new rcu read lock. Therefore, retrieving node weights within a
>> loop creates a race condition in which the state of iw_table may change
>> in between iterations of the loop.
>> 
>> The second way is to directly dereference iw_table under a rcu lock,
>> copy its contents locally, then free the lock. This is how
>> alloc_pages_bulk_array_weighted_interleave currently calculates the sum.
>> The problem here is that even though we solve the coherency issue, there
>> is no check to ensure that this sum is zero. Thus, while having an array of
>> weights [0,0,0,0] gets translated into [1,1,1,1] when inspecting each
>> node individually using get_il_weight, it is still stored internally as 0
>> and can lead to a div0 here.
>> 
>> There are a few workarounds:
>> - Check that weight_total != 0 before performing the division.
>> - During the weight sum iteration, add by weights[node] ? weights[node] : 1
>>   like it is calculated within get_il_weight
>> - Prevent users from ever storing 0 into a node.
>> 
>> Of course, we can implement all three of these changes to make sure that
>> there are no unforunate div0s. However, there are realistic scenarios
>> where we may want the node to actually have a weight of 0, so perhaps
>> it makes sense to just do the first to checks. I can write up a quick
>> patch to perform these checks, if it looks good to everyone.
>> 
>> Please let me know if I missed anything as well.
>
> On second thought, the second bullet point doesn't make much sense, if
> we expect nodes to have 0 as a valid value. Here is something that could
> work for the first bullet point, though. I can send this as a separate
> patch since this is not explicitly related to this thread.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index cb355bdcdd12..afb0f2a7bd4f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2552,10 +2552,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>          * if (rounds > 0) and (delta == 0), resume_node will always be
>          * the node following prev_node and its weight.
>          */
> -       rounds = rem_pages / weight_total;
> -       delta = rem_pages % weight_total;
>         resume_node = next_node_in(prev_node, nodes);
>         resume_weight = weights[resume_node];
> +       if (weight_total == 0)
> +               goto out;
> +
> +       rounds = rem_pages / weight_total;
> +       delta = rem_pages % weight_total;
>         for (i = 0; i < nnodes; i++) {
>                 node = next_node_in(prev_node, nodes);
>                 weight = weights[node];
> @@ -2582,6 +2585,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>                         break;
>                 prev_node = node;
>         }
> +
> +out:
>         me->il_prev = resume_node;
>         me->il_weight = resume_weight;
>         kfree(weights);
>
> Of course, the only way this can happen is if a user purposefully
> sets all of the node weights to 0, so I don't think this is something
> that should ever happen naturally. Even with the new reduce_interleave_weights
> function, it manually checks and makes sure that the lowest possible value
> is 1.
>
> Again, please let me know if I am missing anything!

I don't think that "0" is a valid weight value.  If you don't want to
allocate pages on some nodes, just don't specify them in the node mask
parameter when you call mbind() or set_mempolicy().

---
Best Regards,
Huang, Ying

     prev parent reply	other threads:[~2025-01-21 11:02 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-10 21:54 [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning Joshua Hahn
2024-12-11 11:39 ` kernel test robot
2024-12-13  6:19 ` [External Mail] " Hyeonggon Yoo
2024-12-13 16:28   ` Gregory Price
2024-12-13 19:57   ` Joshua Hahn
2024-12-16  7:53     ` Hyeonggon Yoo
2024-12-16 15:46       ` Joshua Hahn
2024-12-21  5:57     ` Huang, Ying
2024-12-21 14:58       ` Gregory Price
2024-12-22  8:29         ` Huang, Ying
2024-12-22 16:54           ` Gregory Price
2024-12-25  0:25             ` Huang, Ying
2024-12-25  9:30               ` Joshua Hahn
2024-12-26  1:35                 ` Huang, Ying
2024-12-26 18:13                   ` Gregory Price
2024-12-27  1:59                     ` Huang, Ying
2024-12-27 15:35                       ` Gregory Price
2024-12-30  6:48                         ` Huang, Ying
2025-01-08  1:19                           ` [External Mail] " Hyeonggon Yoo
2025-01-08 16:56                             ` Joshua Hahn
2025-01-09 15:56                             ` Gregory Price
2025-01-09 17:18                               ` Joshua Hahn
2025-01-09 19:10                                 ` Joshua Hahn
2025-01-21 11:01                                   ` Huang, Ying [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87msfkh1ls.fsf@DESKTOP-5N7EMDA \
    --to=ying.huang@linux.alibaba.com \
    --cc=42.hyeyoo@gmail.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=gourry@gourry.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=honggyu.kim@sk.com \
    --cc=horen.chuang@linux.dev \
    --cc=hyeonggon.yoo@sk.com \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=kernel_team@skhynix.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rafael@kernel.org \
    --cc=rakie.kim@sk.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.