From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: Gregory Price <gourry.memverge@gmail.com>, <linux-mm@kvack.org>,
<linux-doc@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-api@vger.kernel.org>,
<x86@kernel.org>, <akpm@linux-foundation.org>, <arnd@arndb.de>,
<tglx@linutronix.de>, <luto@kernel.org>, <mingo@redhat.com>,
<bp@alien8.de>, <dave.hansen@linux.intel.com>, <hpa@zytor.com>,
<mhocko@kernel.org>, <tj@kernel.org>, <corbet@lwn.net>,
<rakie.kim@sk.com>, <hyeongtak.ji@sk.com>, <honggyu.kim@sk.com>,
<vtavarespetr@micron.com>, <peterz@infradead.org>,
<jgroves@micron.com>, <ravis.opensrc@micron.com>,
<sthanneeru@micron.com>, <emirakhur@micron.com>,
<Hasan.Maruf@amd.com>, <seungjun.ha@samsung.com>,
Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
Date: Wed, 03 Jan 2024 13:46:56 +0800
Message-ID: <87mstnc6jz.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <ZZRybDPSoLme8Ldh@memverge.com> (Gregory Price's message of "Tue, 2 Jan 2024 15:30:36 -0500")
Gregory Price <gregory.price@memverge.com> writes:
> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>>
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry.memverge@gmail.com> writes:
>> >>
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >>
>> >> MAX_NUMNODES could be as large as 1024. 1KB of stack space may be too
>> >> large?
>> >>
>> >
>> > I've been struggling with a good solution to this. We need a local copy
>> > of weights to prevent weights from changing out from under us during
>> > allocation (which may take quite some time), but it seemed unwise to
>> > allocate 1KB on the heap in this particular path.
>> >
>> > Is my concern unfounded? If so, I can go ahead and add the allocation
>> > code.
>>
>> Please take a look at NODEMASK_ALLOC().
>>
>
> That doesn't answer my question. NODEMASK_ALLOC still calls kmalloc/kfree.
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation, that's no big deal.
>
> I'm specifically concerned about:
> weighted_interleave_nid
> alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> contexts. If kmalloc/kfree is safe there, then fine; this problem is trivial.
>
> If not, there is no good solution to this without pre-allocating a
> scratch area per-task.
You need to audit whether it's safe for all callers. I guess that you
need to allocate pages after calling this, so you can use the same GFP
flags for the scratch allocation here.
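Something like below (just a sketch; it assumes a gfp_t is threaded down
into weighted_interleave_nid(), which the current patch does not do),

	unsigned char *weights;

	weights = kmalloc_array(nr_node_ids, sizeof(*weights), gfp);
	if (!weights)
		return first_node(pol->nodes);	/* simple fallback, illustrative */
	...
	kfree(weights);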
>> >> I don't think barrier() is needed to wait for memory operations on the
>> >> stack. It's usually used for cross-processor memory ordering.
>> >>
>> >
>> > This is present in the old interleave code. To the best of my
>> > understanding, the concern is for mempolicy->nodemask rebinding that can
>> > occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > So we can't iterate over mempolicy->nodemask directly; we have to take a
>> > local copy.
>> >
>> > My *best* understanding of the barrier here is to prevent the compiler
>> > from reordering operations such that it attempts to optimize out the
>> > local copy (or do lazy-fetch).
>> >
>> > It is present in the original interleave code, so I pulled it forward
>> > here, but I have not tested whether it is overly paranoid.
>> >
>> > from `interleave_nid`:
>> >
>> > 	/*
>> > 	 * The barrier will stabilize the nodemask in a register or on
>> > 	 * the stack so that it will stop changing under the code.
>> > 	 *
>> > 	 * Between first_node() and next_node(), pol->nodes could be changed
>> > 	 * by other threads. So we put pol->nodes in a local stack.
>> > 	 */
>> > 	barrier();
>>
>> Got it. This is kind of a READ_ONCE() for the nodemask. To avoid adding
>> comments all over the place, can we implement a wrapper for it? For
>> example, memcpy_once(). __read_once_size() in
>> tools/include/linux/compiler.h can be used as a reference.
>>
>> Because node_weights[] may be changed simultaneously too, we may need
>> to consider a similar issue for it as well. But RCU seems more
>> appropriate for node_weights[].
>>
>
> Weights are collected individually onto the stack because we have to sum
> them up before we actually apply the weights.
>
> A stale weight is not offensive. RCU is not needed and doesn't help.
When you copy weights from iw_table[] to the stack, the compiler is free
to cache its contents in registers, or to merge or split the memory
operations. At the same time, iw_table[] may be changed concurrently via
the sysfs interface. So we need a mechanism to guarantee that we read
the latest contents consistently.
> The reason the barrier is needed is not weights, it's the nodemask.
Yes. That is why I said we need something similar for the weights.
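For example, if the weight loops read iw_table[] directly (an
illustrative sketch, not the exact code in the patch),

	for_each_node_mask(nid, nodemask)
		weight_total += iw_table[nid];
	...
	for_each_node_mask(nid, nodemask) {
		if (target < iw_table[nid])
			break;
		target -= iw_table[nid];
	}

the compiler is free to cache, re-read, merge, or split those loads, and
a sysfs write can land in between them, so weight_total and the per-node
weights used afterwards may come from different versions of the table.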
> So you basically just want to replace barrier() with this and drop the
> copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
> {
> 	/*
> 	 * The barrier will stabilize the nodemask in a register or on
> 	 * the stack so that it will stop changing under the code.
> 	 *
> 	 * Between first_node() and next_node(), pol->nodes could be changed
> 	 * by other threads. So we put pol->nodes in a local stack.
> 	 */
> 	barrier();
> 	__builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
> 	barrier();
> }
>
> - nodemask_t nodemask = pol->nodes;
> - barrier();
> + nodemask_t nodemask;
> + read_once_policy_nodemask(pol, &nodemask);
>
> Is that right?
Yes. Something like that. Or even something more general (may need to be
optimized?),

static inline void memcpy_once(void *dst, const void *src, size_t n)
{
	barrier();
	memcpy(dst, src, n);
	barrier();
}

memcpy_once(&nodemask, &pol->nodes, sizeof(nodemask));
The comments can be based on those of READ_ONCE().
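And the weight copy at the interleave call sites could then go through
the same helper, for example (assuming iw_table[] stays a flat array of
unsigned char),

	memcpy_once(weights, iw_table, sizeof(weights));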
--
Best Regards,
Huang, Ying